YouTube Knowledge Pipeline — 32K Videos, Local ML Transcription

"My best teachers have often been YouTube tutorials, reviews and demos. I found myself spending more than five hours a day watching videos at 2–3× speed." — 2025-10 | YouTube Pipeline Saved Me 4 Hours a Day | markdown

That was the problem. More than 5 hours a day. Watching. Not searchable. Not queryable. Gone after watching.


The Insight

"do not get transcript from youtube at all i want to manually transcribe on my mac need mp3s of all the videos" — 2025-09 | YouTube Pipeline | claude-code

Local ML. No API costs. Full control. MacWhisper Pro + Parakeet-MLX on Apple Silicon.


The Numbers

| Metric | Value |
| --- | --- |
| Channels monitored | 6,095 |
| Videos processed | 32,579 |
| Transcripts generated | 15,162 (local ML) |
| Words searchable | 41.8M |
| Storage compression | 99.9% (MP3 → transcript) |
| Ongoing API costs | $0 |
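
For scale on the compression number: an hour of audio at 128 kbps is roughly 55–60 MB of MP3, while the same hour as plain text runs to a few tens of kilobytes, which is where the ~99.9% figure comes from.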

The Pipeline

yt-dlp Download → MacWhisper Pro → Parakeet-MLX → PostgreSQL + pgvector
      ↓               ↓                ↓                    ↓
  MP3 audio     File watcher    Local transcription   Semantic search

"please look into this much deeper in my repo in the db in the youtube table... YouTube Knowledge Pipeline... Absolutely massive scale. This is enterprise-level data engineering" — 2025-09 | Sparkii Command Center | claude-code


The Stack

  • yt-dlp — Channel monitoring, audio download
  • MacWhisper Pro — File watcher, auto-transcription
  • Parakeet-MLX — Local ML transcription (Apple Silicon)
  • PostgreSQL + pgvector — Vector embeddings for semantic search (query sketched after this list)
  • OpenRouter — AI processing (summaries, chapters)
  • Railway — Cloud deployment
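
On the search side, pgvector ranks rows by embedding distance in plain SQL. Here is a minimal sketch of a similarity query, assuming a transcripts table with an embedding vector column and a psycopg2 connection; the table, columns, and connection string are placeholders, not the repo's actual schema.

```python
# Illustrative pgvector similarity query (placeholder schema, not the repo's actual one).
# Assumes: CREATE TABLE transcripts (video_id text, content text, embedding vector(1536));
import psycopg2


def search(conn, query_embedding: list[float], limit: int = 5):
    """Return the transcripts closest to a query embedding (cosine distance)."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT video_id,
                   left(content, 200)       AS snippet,
                   embedding <=> %s::vector AS distance  -- <=> is pgvector's cosine distance
            FROM transcripts
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, limit),
        )
        return cur.fetchall()


if __name__ == "__main__":
    # Placeholder DSN and embedding; the real pipeline embeds the query text first.
    conn = psycopg2.connect("postgresql://user:pass@localhost:5432/youtube")
    for video_id, snippet, distance in search(conn, [0.0] * 1536):
        print(f"{distance:.3f}  {video_id}  {snippet!r}")
```

An IVFFlat or HNSW index on the embedding column can keep queries like this fast at the 15K-transcript scale.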

The Origin

"Data Gathering: Obtain transcripts of the YouTube videos you want to include. This could be automated through YouTube's API if the videos have captions. Data Indexing: Store these transcripts in Elasticsearch or a similar search engine, making them searchable by content." — 2023-10 | Educational Shiurim App Development | chatgpt

The idea started in October 2023. Build a searchable library of educational content. But YouTube captions weren't good enough. The solution: local ML transcription.


The Workflow

  1. Discovery — Monitors 6,095 channels every 2 hours
  2. Download — yt-dlp extracts audio-only MP3 (sketched below)
  3. Transcription — MacWhisper Pro auto-transcribes with Parakeet-MLX
  4. Processing — Generates embeddings and creates summaries
  5. Storage — Supabase PostgreSQL with vector search
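
Step 2 as code: yt-dlp exposes the same audio-only extraction through its Python API (ffmpeg is needed for the MP3 conversion). A minimal sketch; the output template and example URL are placeholders.

```python
# Sketch of step 2: audio-only MP3 extraction with yt-dlp's Python API.
# Requires `pip install yt-dlp` and ffmpeg on PATH; paths below are placeholders.
import yt_dlp

YDL_OPTS = {
    "format": "bestaudio/best",            # prefer an audio-only stream
    "outtmpl": "audio/%(id)s.%(ext)s",     # placeholder output template
    "postprocessors": [{
        "key": "FFmpegExtractAudio",       # convert the download to MP3 via ffmpeg
        "preferredcodec": "mp3",
        "preferredquality": "192",
    }],
    "quiet": True,
}


def download_audio(video_urls: list[str]) -> None:
    with yt_dlp.YoutubeDL(YDL_OPTS) as ydl:
        ydl.download(video_urls)


if __name__ == "__main__":
    download_audio(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL
```

The same library can back step 1 as well: calling extract_info on a channel URL with the extract_flat option lists new uploads without downloading anything.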

The Pattern

This is part of a larger knowledge acquisition ecosystem:

| System | Purpose | Link |
| --- | --- | --- |
| YouTube Pipeline | Video transcription (this repo) | You're here |
| Sparkii RAG | Conversation search (287K messages) | View → |
| Brain MCP | 90+ tools over 359K messages | View → |
| JewTube | Audio-only YouTube for shiurim | View → |

The Result

5 hours a day watching → searchable in 256ms.

41.8 million words. Local ML. Zero ongoing costs.


Built in Beit Shemesh, Israel

32K videos. 15K transcripts. Knowledge preserved.
