YouTube Knowledge Pipeline — 32K Videos, Local ML Transcription

"My best teachers have often been YouTube tutorials, reviews and demos. I found myself spending more than five hours a day watching videos at 2–3× speed." — 2025-10 | YouTube Pipeline Saved Me 4 Hours a Day | markdown

That was the problem. More than 5 hours a day. Watching. Not searchable. Not queryable. Gone after watching.


The Insight

"do not get transcript from youtube at all i want to manually transcribe on my mac need mp3s of all the videos" — 2025-09 | YouTube Pipeline | claude-code

Local ML. No API costs. Full control. MacWhisper Pro + Parakeet-MLX on Apple Silicon.


The Numbers

| Metric | Value |
| --- | --- |
| Channels monitored | 6,095 |
| Videos processed | 32,579 |
| Transcripts generated | 15,162 (local ML) |
| Words searchable | 41.8M |
| Storage compression | 99.9% (MP3 → transcript) |
| Ongoing API costs | $0 |
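
For scale on the compression number: an hour of audio at 128 kbps is roughly 55–60 MB of MP3, while the same hour as plain text runs to a few tens of kilobytes, which is where the ~99.9% figure comes from.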

The Pipeline

yt-dlp Download → MacWhisper Pro → Parakeet-MLX → PostgreSQL + pgvector
      ↓               ↓                ↓                    ↓
  MP3 audio     File watcher    Local transcription   Semantic search

"please look into this much deeper in my repo in the db in the youtube table... YouTube Knowledge Pipeline... Absolutely massive scale. This is enterprise-level data engineering" — 2025-09 | Sparkii Command Center | claude-code


The Stack

  • yt-dlp — Channel monitoring, audio download
  • MacWhisper Pro — File watcher, auto-transcription
  • Parakeet-MLX — Local ML transcription (Apple Silicon)
  • PostgreSQL + pgvector — Vector embeddings for semantic search (query sketched after this list)
  • OpenRouter — AI processing (summaries, chapters)
  • Railway — Cloud deployment
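
On the search side, pgvector ranks rows by embedding distance in plain SQL. Here is a minimal sketch of a similarity query, assuming a transcripts table with an embedding vector column and a psycopg2 connection; the table, columns, and connection string are placeholders, not the repo's actual schema.

```python
# Illustrative pgvector similarity query (placeholder schema, not the repo's actual one).
# Assumes: CREATE TABLE transcripts (video_id text, content text, embedding vector(1536));
import psycopg2


def search(conn, query_embedding: list[float], limit: int = 5):
    """Return the transcripts closest to a query embedding (cosine distance)."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT video_id,
                   left(content, 200)       AS snippet,
                   embedding <=> %s::vector AS distance  -- <=> is pgvector's cosine distance
            FROM transcripts
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, limit),
        )
        return cur.fetchall()


if __name__ == "__main__":
    # Placeholder DSN and embedding; the real pipeline embeds the query text first.
    conn = psycopg2.connect("postgresql://user:pass@localhost:5432/youtube")
    for video_id, snippet, distance in search(conn, [0.0] * 1536):
        print(f"{distance:.3f}  {video_id}  {snippet!r}")
```

An IVFFlat or HNSW index on the embedding column can keep queries like this fast at the 15K-transcript scale.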

The Origin

"Data Gathering: Obtain transcripts of the YouTube videos you want to include. This could be automated through YouTube's API if the videos have captions. Data Indexing: Store these transcripts in Elasticsearch or a similar search engine, making them searchable by content." — 2023-10 | Educational Shiurim App Development | chatgpt

The idea started in October 2023. Build a searchable library of educational content. But YouTube captions weren't good enough. The solution: local ML transcription.


The Workflow

  1. Discovery — Monitors 6,095 channels every 2 hours
  2. Download — yt-dlp extracts audio-only MP3 (sketched below)
  3. Transcription — MacWhisper Pro auto-transcribes with Parakeet-MLX
  4. Processing — Generates embeddings and creates summaries
  5. Storage — Supabase PostgreSQL with vector search
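
Step 2 as code: yt-dlp exposes the same audio-only extraction through its Python API (ffmpeg is needed for the MP3 conversion). A minimal sketch; the output template and example URL are placeholders.

```python
# Sketch of step 2: audio-only MP3 extraction with yt-dlp's Python API.
# Requires `pip install yt-dlp` and ffmpeg on PATH; paths below are placeholders.
import yt_dlp

YDL_OPTS = {
    "format": "bestaudio/best",            # prefer an audio-only stream
    "outtmpl": "audio/%(id)s.%(ext)s",     # placeholder output template
    "postprocessors": [{
        "key": "FFmpegExtractAudio",       # convert the download to MP3 via ffmpeg
        "preferredcodec": "mp3",
        "preferredquality": "192",
    }],
    "quiet": True,
}


def download_audio(video_urls: list[str]) -> None:
    with yt_dlp.YoutubeDL(YDL_OPTS) as ydl:
        ydl.download(video_urls)


if __name__ == "__main__":
    download_audio(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL
```

The same library can back step 1 as well: calling extract_info on a channel URL with the extract_flat option lists new uploads without downloading anything.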

The Pattern

This is part of a larger knowledge acquisition ecosystem:

| System | Purpose | Link |
| --- | --- | --- |
| YouTube Pipeline | Video transcription (this repo) | You're here |
| Sparkii RAG | Conversation search (287K messages) | View → |
| Brain MCP | 90+ tools over 359K messages | View → |
| JewTube | Audio-only YouTube for shiurim | View → |

The Result

5 hours a day watching → searchable in 256ms.

41.8 million words. Local ML. Zero ongoing costs.


Built in Beit Shemesh, Israel

32K videos. 15K transcripts. Knowledge preserved.
