mgdicesare/lecture-notes-generator

Automated pipeline to transcribe lecture audio with Whisper, generate structured Markdown notes using LLMs, and create multiple-choice quizzes. Model-agnostic, with automatic token budgeting for long recordings.
Notes and Quiz Pipeline (Local)

Local, model-agnostic tooling to turn lecture audio or transcripts into structured study notes and multiple-choice quizzes.

What this does

  • (Optional) Transcribe audio with Whisper.
  • Clean transcripts for prompting.
  • Generate structured Markdown notes with a local LLM.
  • Generate multiple-choice quizzes from notes or transcripts.
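As a rough illustration of the cleaning step, a cleaner might strip timestamps and collapse whitespace. The helper name and regex below are assumptions for illustration; the repository's actual logic lives in src/utils.py and may differ.

```python
import re

def clean_transcript(text: str) -> str:
    """Illustrative cleaner: drop [hh:mm:ss] timestamps and collapse whitespace.

    The repository's real cleaning logic (src/utils.py) may differ.
    """
    # Remove bracketed timestamps such as [00:12:34] or [12:34]
    text = re.sub(r"\[\d{1,2}:\d{2}(?::\d{2})?\]", " ", text)
    # Collapse runs of whitespace and newlines into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```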

Pipeline flow

Audio (optional)
  -> Whisper transcription
  -> Clean transcript
  -> Notes generation (Markdown)
  -> Quiz generation (Markdown)
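The flow above can be sketched as a simple function chain. The stage names below are hypothetical placeholders, not the actual API of main.py:

```python
from typing import Callable, Optional

def run_pipeline(
    source: str,
    transcribe: Optional[Callable[[str], str]],  # audio path -> raw transcript (None to skip)
    clean: Callable[[str], str],                 # raw -> cleaned transcript
    make_notes: Callable[[str], str],            # transcript -> Markdown notes
    make_quiz: Callable[[str], str],             # notes -> Markdown quiz
) -> tuple:
    """Chain the pipeline stages; transcription is optional."""
    transcript = transcribe(source) if transcribe else source
    cleaned = clean(transcript)
    notes = make_notes(cleaned)
    quiz = make_quiz(notes)
    return notes, quiz
```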

Model-agnostic LLM support

This pipeline is model-agnostic. Any local Hugging Face causal LLM can be used as long as it supports:

  • Text generation
  • Configurable max_new_tokens
  • A compatible tokenizer

Defaults are provided, but you can override them with --model. Multiple LLMs have been tested successfully (e.g., Qwen2.5-3B-Instruct and Mistral-7B-Instruct). Other instruction-tuned LLMs, such as Llama-family models, should work as long as they meet the requirements above.

Repository structure

.
|-- main.py
|-- requirements.txt
|-- README.md
|-- .gitignore
|-- src/
|   |-- notes.py
|   |-- questions.py
|   |-- transcribe.py
|   `-- utils.py
`-- data/
    |-- raw_audio/      (ignored; keep your recordings here)
    |-- transcripts/    (ignored; generated transcripts)
    |-- notes/          (ignored; generated notes)
    `-- quizzes/        (ignored; generated quizzes)

Installation

Python 3.10+ is recommended.

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

The activation command above is for Windows PowerShell; on macOS/Linux, use source .venv/bin/activate instead.

Whisper requires FFmpeg for audio decoding. If FFmpeg is already on your PATH, no extra steps are needed.
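A quick way to check whether FFmpeg is discoverable before transcribing, using only the standard library:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None
```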

Models are downloaded from Hugging Face on first use. If a model requires authentication, use huggingface-cli login or set HUGGINGFACE_HUB_TOKEN.

How to run

Generate notes from a cleaned transcript

python main.py .\data\transcripts\lecture_clean.txt --task notes --output-dir .\data

Clean a raw transcript, then generate notes

python main.py .\data\transcripts\lecture_raw.txt --task notes --clean --output-dir .\data

Transcribe audio and generate notes in one run

python main.py .\data\raw_audio\lecture.mp3 --task notes --transcribe --whisper-model medium --output-dir .\data

Generate a quiz from notes

python main.py .\data\notes\lecture_notes.md --task questions --num-questions 12 --output-dir .\data

Use a specific LLM

python main.py .\data\transcripts\lecture_clean.txt --task notes --model Qwen/Qwen2.5-3B-Instruct --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --model mistralai/Mistral-7B-Instruct-v0.3 --num-questions 12 --output-dir .\data

Force CPU or GPU

python main.py .\data\transcripts\lecture_clean.txt --task notes --device cpu --output-dir .\data
python main.py .\data\notes\lecture_notes.md --task questions --device cuda --num-questions 8 --output-dir .\data

Automatic token budgeting for notes

Notes generation automatically computes max_new_tokens using the tokenizer associated with the selected LLM. The empirical rule is:

max_new_tokens = int(transcript_tokens * 0.8)

Additional details:

  • A minimum safe lower bound is enforced via --notes-min-tokens.
  • The value is capped to fit within the model context window when possible.
  • Empirical observation from real usage: recordings of ~40 minutes required ~1000 tokens or more.
  • Users do not manually set max_new_tokens; it is computed by the pipeline.
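Put together, the budgeting rule can be sketched as below. The function name, default floor, and context-window handling are assumptions based on the description above, not the pipeline's exact code:

```python
def compute_max_new_tokens(
    transcript_tokens: int,
    min_tokens: int = 256,        # safe lower bound, i.e. --notes-min-tokens (default assumed)
    context_window: int = 4096,   # model context size (varies per LLM)
    prompt_tokens: int = 0,       # tokens already consumed by the prompt
) -> int:
    """Apply the 0.8 rule, enforce a floor, and cap to the context window."""
    budget = int(transcript_tokens * 0.8)
    budget = max(budget, min_tokens)            # enforce the minimum
    headroom = max(context_window - prompt_tokens, min_tokens)
    return min(budget, headroom)                # cap to fit the context window
```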

Minimal validation

A lightweight, CPU-only test verifies the token estimation logic:

python -m unittest tests\test_token_budget.py

Outputs

Artifacts are written under data/ and are ignored by git:

  • data/transcripts/ for raw and cleaned transcripts
  • data/notes/ for Markdown notes
  • data/quizzes/ for Markdown quizzes
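As an illustration of how inputs map to these folders, a path helper might look like the following. The naming scheme (subfolders, suffixes, extensions) is assumed from the examples in this README, not necessarily what main.py produces:

```python
from pathlib import Path

# Hypothetical mapping from task name to (subfolder, filename suffix)
TASK_LAYOUT = {
    "transcript": ("transcripts", "_clean.txt"),
    "notes": ("notes", "_notes.md"),
    "questions": ("quizzes", "_quiz.md"),
}

def output_path(input_file: str, task: str, output_dir: str = "data") -> Path:
    """Place the artifact for `task` under the matching data/ subfolder."""
    subdir, suffix = TASK_LAYOUT[task]
    stem = Path(input_file).stem
    return Path(output_dir) / subdir / f"{stem}{suffix}"
```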

License

MIT. See LICENSE.
