Local, model-agnostic tooling to turn lecture audio or transcripts into structured study notes and multiple-choice quizzes.
- (Optional) Transcribe audio with Whisper.
- Clean transcripts for prompting.
- Generate structured Markdown notes with a local LLM.
- Generate multiple-choice quizzes from notes or transcripts.
```
Audio (optional)
  -> Whisper transcription
  -> Clean transcript
  -> Notes generation (Markdown)
  -> Quiz generation (Markdown)
```
This pipeline is model-agnostic. Any local Hugging Face causal LLM can be used as long as it supports:
- Text generation
- Configurable `max_new_tokens`
- A compatible tokenizer
Defaults are provided, but you can override them with `--model`. Multiple LLMs have been tested successfully
(e.g., Qwen2.5-3B-Instruct and Mistral-7B-Instruct). Other instruction-tuned LLMs, such as Llama-family
models, should work as long as they meet the requirements above.
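For reference, this is roughly what "plugging in" a model looks like with the standard `transformers` API (a minimal sketch, not the pipeline's actual code; the model name and prompt are placeholders, and the pipeline computes the token budget for you):

```python
# Minimal sketch: loading any Hugging Face causal LLM and generating text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # any model accepted by --model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize the key points of this lecture transcript: ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)  # configurable budget

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```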
```
.
|-- main.py
|-- requirements.txt
|-- README.md
|-- .gitignore
|-- src/
|   |-- notes.py
|   |-- questions.py
|   |-- transcribe.py
|   `-- utils.py
`-- data/
    |-- raw_audio/     (ignored; keep your recordings here)
    |-- transcripts/   (ignored; generated transcripts)
    |-- notes/         (ignored; generated notes)
    `-- quizzes/       (ignored; generated quizzes)
```
Python 3.10+ is recommended.
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Whisper requires FFmpeg for audio decoding. If FFmpeg is already on your PATH, no extra steps are needed.
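If you're unsure whether FFmpeg is visible, a quick check (not part of the repo, just a convenience):

```python
# Verify FFmpeg is discoverable on PATH before attempting transcription.
import shutil

if shutil.which("ffmpeg") is None:
    raise SystemExit("FFmpeg not found on PATH; install it or add it to PATH first.")
print("FFmpeg found:", shutil.which("ffmpeg"))
```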
Models are downloaded from Hugging Face on first use. If a model requires authentication, use
`huggingface-cli login` or set `HUGGINGFACE_HUB_TOKEN`.
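Programmatic login is also possible via `huggingface_hub` (a sketch; it assumes the token is already exported in your environment):

```python
# Authenticate to Hugging Face from Python instead of the CLI.
import os
from huggingface_hub import login

login(token=os.environ["HUGGINGFACE_HUB_TOKEN"])  # token from your HF account settings
```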
```powershell
python main.py .\data\transcripts\lecture_clean.txt --task notes --output-dir .\data
python main.py .\data\transcripts\lecture_raw.txt --task notes --clean --output-dir .\data
python main.py .\data\raw_audio\lecture.mp3 --task notes --transcribe --whisper-model medium --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --num-questions 12 --output-dir .\data
python main.py .\data\transcripts\lecture_clean.txt --task notes --model Qwen/Qwen2.5-3B-Instruct --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --model mistralai/Mistral-7B-Instruct-v0.3 --num-questions 12 --output-dir .\data
python main.py .\data\transcripts\lecture_clean.txt --task notes --device cpu --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --device cuda --num-questions 8 --output-dir .\data
```

Notes generation automatically computes `max_new_tokens` using the tokenizer associated with the selected LLM.
The empirical rule is:

```
max_new_tokens = int(transcript_tokens * 0.8)
```
Additional details:
- A minimum safe lower bound is enforced via `--notes-min-tokens`.
- The value is capped to fit within the model context window when possible.
- Empirical observation from real usage: recordings of ~40 minutes required ~1,000 tokens or more.
- Users do not manually set `max_new_tokens`; it is computed by the pipeline (a sketch of the rule follows this list).
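Put together, the rule looks roughly like this (illustrative only; the default floor and context window below are assumptions, and the real logic lives inside the pipeline):

```python
# Illustrative token-budget computation; defaults here are assumptions.
def compute_max_new_tokens(transcript_tokens: int,
                           notes_min_tokens: int = 256,
                           context_window: int = 4096) -> int:
    budget = int(transcript_tokens * 0.8)    # empirical 0.8 rule
    budget = max(budget, notes_min_tokens)   # floor from --notes-min-tokens
    # Cap so prompt + generation fit within the context window when possible.
    headroom = context_window - transcript_tokens
    return min(budget, headroom) if headroom > notes_min_tokens else notes_min_tokens
```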
A lightweight, CPU-only test verifies the token estimation logic:

```powershell
python -m unittest tests\test_token_budget.py
```
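The test's contents aren't shown here, but a CPU-only check of the rule could look like this (hypothetical; the actual assertions in `tests/test_token_budget.py` may differ):

```python
# Hypothetical shape of a token-budget test; it restates the 0.8 rule inline.
import unittest

def budget(transcript_tokens: int, floor: int = 256) -> int:
    return max(int(transcript_tokens * 0.8), floor)

class TestTokenBudget(unittest.TestCase):
    def test_eighty_percent_rule(self):
        self.assertEqual(budget(1000), 800)

    def test_minimum_floor(self):
        self.assertEqual(budget(100), 256)

if __name__ == "__main__":
    unittest.main()
```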
Artifacts are written under `data/` and are ignored by git:

- `data/transcripts/` for raw and cleaned transcripts
- `data/notes/` for Markdown notes
- `data/quizzes/` for Markdown quizzes
MIT. See LICENSE.