"My best teachers have often been YouTube tutorials, reviews and demos. I found myself spending more than five hours a day watching videos at 2–3× speed." — 2025-10 | YouTube Pipeline Saved Me 4 Hours a Day | markdown
That was the problem. More than 5 hours a day. Watching. Not searchable. Not queryable. Gone after watching.
"do not get transcript from youtube at all i want to manually transcribe on my mac need mp3s of all the videos" — 2025-09 | YouTube Pipeline | claude-code
Local ML. No API costs. Full control. MacWhisper Pro + Parakeet-MLX on Apple Silicon.
| Metric | Value |
|---|---|
| Channels monitored | 6,095 |
| Videos processed | 32,579 |
| Transcripts generated | 15,162 (local ML) |
| Words searchable | 41.8M |
| Storage compression | 99.9% (MP3 → transcript) |
| Ongoing API costs | $0 |
yt-dlp Download (MP3 audio) → MacWhisper Pro (file watcher) → Parakeet-MLX (local transcription) → PostgreSQL + pgvector (semantic search)
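
The first stage of this flow, audio-only download, can be sketched with yt-dlp's Python API. This is a minimal illustration, not the repo's actual code: the helper names, the `downloaded.txt` archive file, the ID-based filename template, and the 192 kbps quality setting are all assumptions. MP3 conversion relies on ffmpeg being installed.

```python
from pathlib import Path


def build_audio_options(output_dir: str, archive_file: str = "downloaded.txt") -> dict:
    """Return yt-dlp options for an audio-only MP3 download (illustrative values)."""
    out = Path(output_dir)
    return {
        # Prefer the best audio-only stream; fall back to the best combined one.
        "format": "bestaudio/best",
        # Name files by video ID so they are stable and easy to match later.
        "outtmpl": str(out / "%(id)s.%(ext)s"),
        # Record finished IDs so repeat runs skip videos already fetched.
        "download_archive": str(out / archive_file),
        # Convert the downloaded stream to MP3 (requires ffmpeg on PATH).
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": "192",
            }
        ],
        # Keep going when a single video in a batch fails.
        "ignoreerrors": True,
    }


def download_audio(urls: list, output_dir: str) -> None:
    """Download each URL as MP3 into output_dir."""
    # Imported lazily so the options helper works without yt-dlp installed.
    from yt_dlp import YoutubeDL

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with YoutubeDL(build_audio_options(output_dir)) as ydl:
        ydl.download(urls)
```

Pointing `output_dir` at the folder the transcription app watches would hand each finished MP3 to the next stage without extra glue code.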
"please look into this much deeper in my repo in the db in the youtube table... YouTube Knowledge Pipeline... Absolutely massive scale. This is enterprise-level data engineering" — 2025-09 | Sparkii Command Center | claude-code
- yt-dlp — Channel monitoring, audio download
- MacWhisper Pro — File watcher, auto-transcription
- Parakeet-MLX — Local ML transcription (Apple Silicon)
- PostgreSQL + pgvector — Vector embeddings for semantic search
- OpenRouter — AI processing (summaries, chapters)
- Railway — Cloud deployment
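
To make the pgvector entry concrete, here is a hedged sketch of what a semantic search query might look like. The `transcripts` table and its columns are hypothetical, not the repo's schema. `<=>` is pgvector's cosine-distance operator, and the pure-Python `rank` function reproduces the same ordering in memory so the idea can be checked without a database.

```python
import math

# Hypothetical schema: transcripts(video_id text, chunk text, embedding vector).
# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
# Placeholders use the psycopg named-parameter style; nothing is executed here.
SEARCH_SQL = """
SELECT video_id, chunk, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM transcripts
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""


def to_vector_literal(values) -> str:
    """Format numbers as the text literal pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity, matching what `<=>` computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank(query, rows, limit=5):
    """Order (video_id, embedding) pairs by cosine distance to the query."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:limit]
```

In a real deployment the query embedding would come from the same model used to embed the transcripts, and an approximate index on the `embedding` column would keep lookups fast at this scale.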
"Data Gathering: Obtain transcripts of the YouTube videos you want to include. This could be automated through YouTube's API if the videos have captions. Data Indexing: Store these transcripts in Elasticsearch or a similar search engine, making them searchable by content." — 2023-10 | Educational Shiurim App Development | chatgpt
The idea started in October 2023: build a searchable library of educational content. YouTube's captions weren't good enough, so the solution became local ML transcription.
- Discovery — Monitors 6,095 channels every 2 hours
- Download — yt-dlp extracts audio-only MP3
- Transcription — MacWhisper Pro auto-transcribes with Parakeet-MLX
- Processing — Generate embeddings, create summaries
- Storage — Supabase PostgreSQL with vector search
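
The discovery step above implies two pieces of bookkeeping: deciding which channels are due for a check, and filtering out videos that were already processed. The sketch below shows one way to do both. The function names and the in-memory state are assumptions for illustration; the real pipeline presumably keeps this state in PostgreSQL.

```python
from datetime import datetime, timedelta, timezone

# The 2-hour cadence comes from the step list; everything else is illustrative.
CHECK_INTERVAL = timedelta(hours=2)


def channels_due(last_checked: dict, now: datetime, interval: timedelta = CHECK_INTERVAL) -> list:
    """Return channel IDs whose last check is missing or at least `interval` old.

    Never-checked channels come first, then the longest-waiting ones.
    """
    due = [
        channel_id
        for channel_id, checked_at in last_checked.items()
        if checked_at is None or now - checked_at >= interval
    ]
    return sorted(
        due,
        key=lambda c: (last_checked[c] is not None, last_checked[c] or now),
    )


def new_video_ids(fetched_ids, known_ids) -> list:
    """Keep fetch order, dropping duplicates and anything already processed."""
    seen = set(known_ids)
    fresh = []
    for video_id in fetched_ids:
        if video_id not in seen:
            seen.add(video_id)
            fresh.append(video_id)
    return fresh


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    state = {"channel_a": None, "channel_b": now - timedelta(hours=3)}
    print(channels_due(state, now))
    print(new_video_ids(["v1", "v2", "v1"], {"v2"}))
```

Only the IDs returned by `new_video_ids` would need to be passed on to the download step, which keeps repeat checks cheap even across thousands of channels.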
This is part of a larger knowledge acquisition ecosystem:
| System | Purpose | Link |
|---|---|---|
| YouTube Pipeline | Video transcription (this repo) | You're here |
| Sparkii RAG | Conversation search (287K messages) | View → |
| Brain MCP | 90+ tools over 359K messages | View → |
| JewTube | Audio-only YouTube for shiurim | View → |
5 hours a day watching → searchable in 256 ms.
41.8 million words. Local ML. Zero ongoing costs.
Built in Beit Shemesh, Israel
32K videos. 15K transcripts. Knowledge preserved.