mgdicesare/lecture-notes-generator

Automated pipeline to transcribe lecture audio with Whisper, generate structured Markdown notes using LLMs, and create multiple-choice quizzes. Model-agnostic, with automatic token budgeting for long recordings.
Notes and Quiz Pipeline (Local)

Local, model-agnostic tooling to turn lecture audio or transcripts into structured study notes and multiple-choice quizzes.

What this does

  • (Optional) Transcribe audio with Whisper.
  • Clean transcripts for prompting.
  • Generate structured Markdown notes with a local LLM.
  • Generate multiple-choice quizzes from notes or transcripts.
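As a rough illustration of the cleaning step, a cleaner might strip timestamps and collapse whitespace. The helper name and regex below are assumptions for illustration; the repository's actual logic lives in src/utils.py and may differ.

```python
import re

def clean_transcript(text: str) -> str:
    """Illustrative cleaner: drop [hh:mm:ss] timestamps and collapse whitespace.

    The repository's real cleaning logic (src/utils.py) may differ.
    """
    # Remove bracketed timestamps such as [00:12:34] or [12:34]
    text = re.sub(r"\[\d{1,2}:\d{2}(?::\d{2})?\]", " ", text)
    # Collapse runs of whitespace and newlines into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```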

Pipeline flow

Audio (optional)
  -> Whisper transcription
  -> Clean transcript
  -> Notes generation (Markdown)
  -> Quiz generation (Markdown)
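The flow above can be sketched as a simple function chain. The stage names below are hypothetical placeholders, not the actual API of main.py:

```python
from typing import Callable, Optional

def run_pipeline(
    source: str,
    transcribe: Optional[Callable[[str], str]],  # audio path -> raw transcript (None to skip)
    clean: Callable[[str], str],                 # raw -> cleaned transcript
    make_notes: Callable[[str], str],            # transcript -> Markdown notes
    make_quiz: Callable[[str], str],             # notes -> Markdown quiz
) -> tuple:
    """Chain the pipeline stages; transcription is optional."""
    transcript = transcribe(source) if transcribe else source
    cleaned = clean(transcript)
    notes = make_notes(cleaned)
    quiz = make_quiz(notes)
    return notes, quiz
```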

Model-agnostic LLM support

This pipeline is model-agnostic. Any local Hugging Face causal LLM can be used as long as it supports:

  • Text generation
  • Configurable max_new_tokens
  • A compatible tokenizer

Defaults are provided, but you can override them with --model. Multiple LLMs have been tested successfully (e.g., Qwen2.5-3B-Instruct and Mistral-7B-Instruct). Other instruction-tuned LLMs, such as Llama-family models, should work as long as they meet the requirements above.

Repository structure

.
|-- main.py
|-- requirements.txt
|-- README.md
|-- .gitignore
|-- src/
|   |-- notes.py
|   |-- questions.py
|   |-- transcribe.py
|   `-- utils.py
`-- data/
    |-- raw_audio/      (ignored; keep your recordings here)
    |-- transcripts/    (ignored; generated transcripts)
    |-- notes/          (ignored; generated notes)
    `-- quizzes/        (ignored; generated quizzes)

Installation

Python 3.10+ is recommended.

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

The activation command above is for Windows PowerShell; on macOS/Linux, use source .venv/bin/activate instead.

Whisper requires FFmpeg for audio decoding. If FFmpeg is already on your PATH, no extra steps are needed.
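A quick way to check whether FFmpeg is discoverable before transcribing, using only the standard library:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None
```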

Models are downloaded from Hugging Face on first use. If a model requires authentication, use huggingface-cli login or set HUGGINGFACE_HUB_TOKEN.

How to run

Generate notes from a cleaned transcript

python main.py .\data\transcripts\lecture_clean.txt --task notes --output-dir .\data

Clean a raw transcript, then generate notes

python main.py .\data\transcripts\lecture_raw.txt --task notes --clean --output-dir .\data

Transcribe audio and generate notes in one run

python main.py .\data\raw_audio\lecture.mp3 --task notes --transcribe --whisper-model medium --output-dir .\data

Generate a quiz from notes

python main.py .\data\notes\lecture_notes.md --task questions --num-questions 12 --output-dir .\data

Use a specific LLM

python main.py .\data\transcripts\lecture_clean.txt --task notes --model Qwen/Qwen2.5-3B-Instruct --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --model mistralai/Mistral-7B-Instruct-v0.3 --num-questions 12 --output-dir .\data

Force CPU or GPU

python main.py .\data\transcripts\lecture_clean.txt --task notes --device cpu --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --device cuda --num-questions 8 --output-dir .\data

Automatic token budgeting for notes

Notes generation automatically computes max_new_tokens using the tokenizer associated with the selected LLM. The empirical rule is:

max_new_tokens = int(transcript_tokens * 0.8)

Additional details:

  • A minimum safe lower bound is enforced via --notes-min-tokens.
  • The value is capped to fit within the model context window when possible.
  • Empirical observation from real usage: recordings of ~40 minutes required ~1000 tokens or more.
  • Users do not manually set max_new_tokens; it is computed by the pipeline.
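Put together, the budgeting rule can be sketched as below. The function name, default floor, and context-window handling are assumptions based on the description above, not the pipeline's exact code:

```python
def compute_max_new_tokens(
    transcript_tokens: int,
    min_tokens: int = 256,        # safe lower bound, i.e. --notes-min-tokens (default assumed)
    context_window: int = 4096,   # model context size (varies per LLM)
    prompt_tokens: int = 0,       # tokens already consumed by the prompt
) -> int:
    """Apply the 0.8 rule, enforce a floor, and cap to the context window."""
    budget = int(transcript_tokens * 0.8)
    budget = max(budget, min_tokens)            # enforce the minimum
    headroom = max(context_window - prompt_tokens, min_tokens)
    return min(budget, headroom)                # cap to fit the context window
```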

Minimal validation

A lightweight, CPU-only test verifies the token estimation logic:

python -m unittest tests\test_token_budget.py

Outputs

Artifacts are written under data/ and are ignored by git:

  • data/transcripts/ for raw and cleaned transcripts
  • data/notes/ for Markdown notes
  • data/quizzes/ for Markdown quizzes
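As an illustration of how inputs map to these folders, a path helper might look like the following. The naming scheme (subfolders, suffixes, extensions) is assumed from the examples in this README, not necessarily what main.py produces:

```python
from pathlib import Path

# Hypothetical mapping from task name to (subfolder, filename suffix)
TASK_LAYOUT = {
    "transcript": ("transcripts", "_clean.txt"),
    "notes": ("notes", "_notes.md"),
    "questions": ("quizzes", "_quiz.md"),
}

def output_path(input_file: str, task: str, output_dir: str = "data") -> Path:
    """Place the artifact for `task` under the matching data/ subfolder."""
    subdir, suffix = TASK_LAYOUT[task]
    stem = Path(input_file).stem
    return Path(output_dir) / subdir / f"{stem}{suffix}"
```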

License

MIT. See LICENSE.
