Summary
Build a script that monitors Aurora C# YouTube channels for new content, extracts transcripts, and matches them against open verification issues and manual claims. Complements the forum monitor (#1288) by targeting a different content type — video demonstrations that show mechanics in action.
Why YouTube
Forum posts describe how mechanics work. Videos show them. If a creator records themselves testing "what happens without an HQ unit," that's empirical evidence — weaker than database verification but stronger than forum speculation. YouTube content is tier 5 (community knowledge) in the verification hierarchy, but demonstrations of in-game behavior approach "live testing" in value.
What to Monitor
- Aurora C# gameplay channels (identify active creators producing tutorial/mechanics content)
- New uploads detected via RSS: `https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID` (see the fetch sketch after this list)
- Focus on tutorial, mechanics explanation, and "testing X" style videos over Let's Play narrative
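As an illustration, pulling new uploads from those feeds could look like this minimal sketch (feedparser maps the feed's `yt:videoId` element to `yt_videoid`; the helper name and return shape are assumptions):

```python
import feedparser  # pip install feedparser

FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

def fetch_uploads(channel_id: str) -> list[dict]:
    """Return id/title/url/published for each entry in a channel's upload feed."""
    feed = feedparser.parse(FEED_URL.format(channel_id=channel_id))
    return [
        {
            "video_id": entry.yt_videoid,  # feedparser's name for yt:videoId
            "title": entry.title,
            "url": entry.link,
            "published": entry.published,
        }
        for entry in feed.entries
    ]
```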
Execution Environment
Runs as a GitHub Actions scheduled workflow on the public repository.
```yaml
# .github/workflows/youtube-monitor.yml
on:
  schedule:
    - cron: '0 9 * * *'  # Daily at 9am UTC (staggered from other monitors)
  workflow_dispatch:
    inputs:
      mode:
        description: 'Run mode'
        required: true
        default: 'steady-state'
        type: choice
        options:
          - steady-state
          - backfill
```
Why GitHub Actions:
- Free for public repos (the manual repo is public)
- Native `GITHUB_TOKEN` for commenting on issues and posting digests
- Built-in secrets management for any API keys
- No infrastructure to provision or maintain
- `yt-dlp` can be installed in the workflow via `pip install yt-dlp`
State persistence: state.json committed to the aurora-monitor/ folder on the main branch to track seen video IDs across runs.
Note: Transcript fetching via yt-dlp may take longer than the other monitors. GitHub Actions allows up to 6 hours per job, which is more than sufficient for daily new uploads from a handful of channels.
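A minimal sketch of that state handling, assuming state.json stores a simple list of seen video IDs (the exact file layout is not specified here):

```python
import json
from pathlib import Path

STATE_PATH = Path("aurora-monitor/state.json")  # committed back to main after each run

def load_seen_ids() -> set[str]:
    """Return the set of already-processed video IDs; empty on first run."""
    if STATE_PATH.exists():
        return set(json.loads(STATE_PATH.read_text())["seen_video_ids"])
    return set()

def save_seen_ids(seen: set[str]) -> None:
    """Persist seen IDs; the workflow commits this file to the repo."""
    STATE_PATH.write_text(json.dumps({"seen_video_ids": sorted(seen)}, indent=2))
```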
Channel and Video Selection
How Channels Get on the Watch List
Initial seeding:
- Search YouTube for "Aurora 4X" / "Aurora C#" and identify channels with multiple relevant uploads
- Import YouTube links discovered by the Reddit monitor (#1290 cross-reference output)
- Manual additions by maintainers via `channels.yaml`
Channel qualification criteria:
- At least 3 Aurora C# videos in the last 12 months
- Content is primarily tutorial/mechanics-focused, not purely narrative Let's Play
- English language or has English auto-captions available
- Channel is still active (uploaded within the last 6 months)
Ongoing discovery:
- Reddit monitor (#1290) automatically flags YouTube links in Aurora posts — new channels surface organically
- Periodic YouTube search (monthly) for new "Aurora 4X" / "Aurora C#" channels
- Community suggestions via GitHub issues or Discord (#1291)
Channel list maintenance:
- `channels.yaml` stores channel ID, name, URL, date added, and content category
- Channels with no new Aurora content in 12 months are flagged for review
- Removed channels stay in the file (marked `active: false`) for an audit trail
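A sketch of loading that list and keeping only active entries (the top-level `channels` key and field names are assumptions mirroring the fields listed above):

```python
import yaml  # pip install pyyaml

def load_channels(path: str = "aurora-monitor/channels.yaml") -> list[dict]:
    """Load tracked channels, keeping only entries still marked active.

    Assumed entry shape: {id, name, url, date_added, category, active}.
    Inactive entries stay in the file for the audit trail but are skipped here.
    """
    with open(path) as f:
        channels = yaml.safe_load(f)["channels"]
    return [c for c in channels if c.get("active", True)]
```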
How Individual Videos Are Filtered
Not every video on a watched channel is relevant. Videos pass through a two-stage filter before transcript processing:
Stage 1: Metadata filter (cheap — no transcript download)
| Filter | Rule | Why |
|---|---|---|
| Title keywords | Must contain at least one Aurora term (see keyword list below) | Skip non-Aurora content on mixed channels |
| Description keywords | Fallback if title is generic (e.g., "Episode 47") | Catch videos with vague titles but relevant descriptions |
| Duration | Skip < 2 minutes (intros/trailers) and > 3 hours (unedited streams) | Low signal-to-noise at extremes |
| Language | English title/description or English auto-captions available | Transcript processing requires English |
| Age (steady-state) | Skip videos older than 7 days | Already in state.json from prior runs |
Stage 2: Transcript keyword density (after download)
After downloading the transcript, check keyword density before full matching:
- If fewer than 3 Aurora-specific terms appear in the transcript, skip full matching
- This catches videos that passed Stage 1 on a generic title match but aren't actually about Aurora mechanics
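As a sketch, both stages reduce to a pair of cheap predicates (this reads "3 Aurora-specific terms" as three distinct terms; `AURORA_TERMS` is a placeholder excerpt, not the real keyword list):

```python
# Excerpt only; stands in for the full Aurora keyword list referenced below.
AURORA_TERMS = {"aurora", "trans-newtonian", "duranium", "sorium", "ciws"}

def passes_stage1(title: str, description: str, duration_s: int) -> bool:
    """Stage 1: metadata-only checks, no transcript download."""
    text = f"{title} {description}".lower()
    has_keyword = any(term in text for term in AURORA_TERMS)
    return has_keyword and 120 <= duration_s <= 3 * 3600  # skip <2 min and >3 h

def passes_stage2(transcript: str) -> bool:
    """Stage 2: require at least 3 distinct Aurora terms in the transcript."""
    text = transcript.lower()
    return sum(term in text for term in AURORA_TERMS) >= 3
```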
Aurora keyword list for filtering:
Decision Flow
First Run: Historical Backfill
On first run, state.json doesn't exist. Without special handling, every existing video on tracked channels would be treated as "new" — triggering transcript downloads for potentially hundreds of videos, mass GitHub issue comments, and a massive digest.
The monitor operates in two modes:
Backfill Mode (first run, manual trigger)
Triggered manually via `workflow_dispatch` with `mode: backfill`:
- PAGINATE — Walk tracked channels' full upload history via YouTube API/RSS
- TRANSCRIPT — Fetch transcripts for relevant-looking videos (filter by title keywords first to avoid downloading hundreds of irrelevant transcripts)
- CLEAN — Run transcripts through Aurora-specific dictionary
- MATCH — Run all cleaned transcripts through matcher.py
- REPORT — Generate a one-time backfill report (NOT auto-comment on issues). Split into Matched, Triage, and Statistics sections.
- SEED — Write all seen video IDs to state.json
- PUBLISH — Post backfill report as a GitHub Discussion for human review
Key difference from steady-state: Backfill mode never auto-comments on GitHub issues. All matches go into the report for human review. This prevents noise from old videos, outdated content, and false positives.
Optimization: Backfill can pre-filter videos by title before downloading transcripts. Only videos with Aurora-relevant titles (containing keywords like "Aurora", "4X", game mechanic terms) need transcript processing. This avoids downloading transcripts for hundreds of irrelevant uploads.
Safety: Backfill Must Run Before Steady-State
The steady-state cron checks for state.json (or a backfill_complete flag). If missing, the workflow exits with a warning rather than accidentally treating all channel history as new.
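A minimal sketch of that guard (the path comes from the state-persistence decision below; whether to exit cleanly or fail the job is a judgment call):

```python
import sys
from pathlib import Path

def require_backfill(state_path: str = "aurora-monitor/state.json") -> None:
    """Refuse to run steady-state until backfill has seeded the state file."""
    if not Path(state_path).exists():
        print("WARNING: no state.json found; run backfill mode first", file=sys.stderr)
        sys.exit(0)  # clean exit: a warning, not a hard failure
```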
Steady-State Mode (daily cron)
Runs automatically after backfill is complete. Only processes videos uploaded since the last run.
What Happens When a New Video Is Detected
End-to-End Flow
1. FETCH — Daily cron checks tracked channel RSS feeds for new uploads
2. DEDUP — Check video ID against state.json, skip if already seen
3. TRANSCRIPT — Fetch via: `yt-dlp --write-auto-sub --sub-lang en --skip-download`
4. CLEAN — Run transcript through Aurora-specific dictionary (see below)
5. CAPTURE — Store attribution metadata (creator, channel URL, video title, video URL, upload date)
6. MATCH — Run cleaned transcript through matcher.py against two corpora:
a. Open GitHub issues labeled "unverified" (title + body keywords)
b. Manual section terminology and numeric values
7. ROUTE — Based on match result:
→ Issue match found: comment on GitHub issue + add to digest
→ Relevant, no match: add to digest Triage section for human review
→ Not relevant: record as seen, no action
8. STATE — Mark video ID as seen in state.json
9. DIGEST — Weekly: compile all matches and triage items into markdown summary
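For step 3, a sketch of the yt-dlp invocation from Python (the flags are the ones given above; the `-o` output template and the resulting `.en.vtt` filename are assumptions about yt-dlp's subtitle naming):

```python
import subprocess
from pathlib import Path

def fetch_auto_captions(video_id: str, out_dir: str = "transcripts") -> Path | None:
    """Download only the English auto-captions for one video."""
    subprocess.run(
        ["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
         "-o", f"{out_dir}/%(id)s.%(ext)s",
         f"https://www.youtube.com/watch?v={video_id}"],
        check=True,
    )
    path = Path(out_dir) / f"{video_id}.en.vtt"  # yt-dlp appends language + ext
    return path if path.exists() else None
```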
On Issue Match
When a transcript matches an open unverified issue:
- Post a comment on the matching GitHub issue with:
- Video URL, creator name, and channel URL
- Relevant transcript segment with timestamp
- Match confidence indicator
- Add to the Matched section of the weekly digest with full attribution
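Posting that comment is a single REST call; a sketch with requests (the endpoint is GitHub's standard issue-comments API, and the repo slug comes from the Prerequisites section below):

```python
import os
import requests

def comment_on_issue(issue_number: int, body: str,
                     repo: str = "ErikEvenson/aurora-manual") -> None:
    """Post a match comment using the workflow-provided GITHUB_TOKEN."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```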
On Relevant Content Without an Issue Match
Not all valuable content maps to an existing unverified claim. Four scenarios:
| Category | Description | Example |
|---|---|---|
| Potential correction | Contradicts something the manual states as verified | Video demonstrates a mechanic working differently than documented |
| New coverage | Topic the manual doesn't address at all | Tutorial covering an undocumented mechanic |
| Version change | Demonstrates v2.8.0+ behavior differing from v2.7.1 baseline | Gameplay showing changed formula results |
| Expanded detail | Adds depth to an existing section without contradicting it | Testing that quantifies a vaguely described mechanic |
These go into the Triage section of the weekly digest. Each entry includes:
- Full attribution metadata (creator, video URL, timestamp)
- Suggested category (correction / new coverage / version change / expanded detail)
- The matched manual section (if any)
- Relevant transcript excerpt with timestamp
A human reviews the triage section and either:
- Creates a new issue if the content is actionable
- Dismisses if it's noise or already known
- Routes to an existing issue that the matcher missed
No automatic issue creation for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.
On No Relevance
Video is recorded in state.json as seen. No action taken, no noise generated.
Output Destinations
| Output | Destination | Frequency |
|---|---|---|
| Issue match comments | Posted directly on the matching GitHub issue | Each run (real-time) |
| Weekly digest | GitHub Discussion in a "Monitor Digests" category | Weekly (Sunday) |
| Backfill report | GitHub Discussion in "Monitor Digests" category | One-time (first run) |
| State file | aurora-monitor/state.json on the main branch | Each run |
| Run logs | GitHub Actions log (visible in Actions tab) | Each run |
Weekly Digest Format
Posted as a GitHub Discussion with the title `[YouTube Monitor] Week of YYYY-MM-DD`:
```markdown
## Matched (N items)
Videos with transcripts matching open unverified issues. Auto-commented on the relevant issues.

| Issue | Creator | Video | Timestamp | Quote | Confidence |
|-------|---------|-------|-----------|-------|------------|
| #NNN | name | title | MM:SS | "..." | High/Medium |

## Triage (N items)
Relevant videos not matching any open issue. Requires human review.

| Category | Creator | Video | Timestamp | Manual Section | Quote |
|----------|---------|-------|-----------|----------------|-------|
| Correction | name | title | MM:SS | 13.1 | "..." |

## Statistics
- Videos checked: N
- Transcripts processed: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N
```
GitHub Discussions Category
All four monitors post to a shared "Monitor Digests" category (see Prerequisites below), prefixed by source name (e.g., `[YouTube Monitor]`).
Aurora-Specific Transcript Dictionary
Auto-generated captions mangle game-specific terms. A custom dictionary would map common misrecognitions:
| Auto-caption | Actual Term |
|---|---|
| uranium, durainium | Duranium |
| thorium, sorium | Sorium |
| see-wis, see wiz | CIWS |
| boron ide, born hide | Boronide |
| corbomite | Corbomite |
| trans newtonian | Trans-Newtonian |
| merc-assium | Mercassium |
This dictionary should be maintained alongside the glossary and expanded as new misrecognitions are discovered.
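A sketch of applying those corrections with case-insensitive whole-word substitution (loading the table from `dictionary.yaml`, as named in the architecture below, is assumed):

```python
import re

# Excerpt of the table above; in practice loaded from dictionary.yaml.
CORRECTIONS = {
    "durainium": "Duranium",
    "see-wis": "CIWS",
    "trans newtonian": "Trans-Newtonian",
}

def clean_transcript(text: str) -> str:
    """Replace known auto-caption misrecognitions with the actual game terms."""
    for wrong, right in CORRECTIONS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text
```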
Attribution (MANDATORY)
Every YouTube video used as a verification source or content reference must be properly attributed to its creator.
The project already has YouTube source attribution rules (CLAUDE.md). These must be enforced automatically:
When a Video Is Used as a Verification Source
- Identify the creator: use `yt-dlp --print channel --print channel_url <URL>` to get the official channel name and URL
- Add a manual reference in the relevant file's References section: `\hypertarget{ref-X.Y-N}{[N]}. [Creator Name] YouTube — "[Video Title]" — [specific detail verified]`
- Credit in issue comments — when closing a verification issue based on video content, include the video URL, creator name, channel URL, and timestamp of the relevant segment
- Digest attribution — every digest entry must include the creator name, video title, URL, and relevant timestamp
Monitor Must Capture Attribution Metadata
For every matched video, the monitor must store:
- Channel name and URL
- Video title and URL
- Upload date
- Relevant transcript segment with timestamp
- Match context (which issue or manual claim it relates to)
This metadata ensures that manual incorporation never loses provenance. Content creators deserve credit for their work — the monitor must make proper attribution the path of least resistance, not an afterthought.
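A sketch of that record as a dataclass (field names are illustrative; the fields themselves come from the list above):

```python
from dataclasses import dataclass

@dataclass
class Attribution:
    """Provenance record stored for every matched video."""
    channel_name: str
    channel_url: str
    video_title: str
    video_url: str
    upload_date: str    # ISO date
    segment: str        # relevant transcript excerpt
    timestamp: str      # MM:SS within the video
    match_context: str  # issue number or manual claim matched
```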
Attribution in the Manual
Per project rules, YouTube credits appear in the References section of the relevant manual file (not in README.md). Specific channel names should only appear where content was directly sourced.
Design Decisions
The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):
Matcher Algorithm: Keyword + Fuzzy
The shared matcher.py uses keyword matching with fuzzy string similarity (e.g., fuzzywuzzy / thefuzz). Simple, fast, transparent, and easy to debug. Each open unverified issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
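A sketch of that scoring with thefuzz (how keyword sets are built per issue is not specified here; `token_set_ratio` is one reasonable scorer choice, not necessarily the one matcher.py will use):

```python
from thefuzz import fuzz  # pip install thefuzz

def score_against(text: str, keywords: set[str]) -> tuple[int, int]:
    """Return (best fuzzy score 0-100, count of keywords literally present)."""
    lowered = text.lower()
    best = max(fuzz.token_set_ratio(kw.lower(), lowered) for kw in keywords)
    hits = sum(kw.lower() in lowered for kw in keywords)
    return best, hits
```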
Match Confidence Scoring: Multiple Signals
Match confidence is determined by combining multiple signals:
| Signal | Weight | Example |
|---|---|---|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |
Thresholds:
- High: Fuzzy score >= 80 AND (2+ keywords OR known author)
- Medium: Fuzzy score 60-79 OR single strong keyword match
- Low: Fuzzy score 40-59, included in digest but not auto-commented on issues
Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
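The threshold table translates directly into a classifier; a sketch (treating author reputation as an unconditional High, per the table's "auto-High", and reading "single strong keyword match" as one hit at score >= 80):

```python
def classify(score: int, keyword_hits: int, known_author: bool = False) -> str | None:
    """Map matcher signals to a confidence tier per the thresholds above."""
    if known_author or (score >= 80 and keyword_hits >= 2):
        return "High"
    if 60 <= score <= 79 or (score >= 80 and keyword_hits == 1):
        return "Medium"
    if 40 <= score <= 59:
        return "Low"   # digest Triage only, never auto-commented
    return None        # below threshold: not relevant
```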
Repo Structure: Same Repo (Manual)
The aurora-monitor/ folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.
State Persistence: Dedicated Folder on Main Branch
state.json lives in a dedicated aurora-monitor/ folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.
Weekly Digest Timing: Same Workflow, Sunday Check
The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
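The check is a one-liner; a sketch:

```python
from datetime import datetime, timezone

def is_digest_day() -> bool:
    """True on Sundays in UTC (Python's weekday(): Monday=0 ... Sunday=6)."""
    return datetime.now(timezone.utc).weekday() == 6
```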
Authentication: GITHUB_TOKEN with Elevated Permissions
Workflows use the default `GITHUB_TOKEN` with `discussions: write` granted in the workflow's `permissions:` block; note that once a `permissions:` block is declared, issue commenting and state commits also need `issues: write` and `contents: write`. No PAT or GitHub App required.
Error Handling: Retry with Backoff, Then Skip
On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
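A sketch of that policy (3 attempts, exponential backoff, then skip and log):

```python
import logging
import time

def with_retries(fetch, attempts: int = 3, base_delay: float = 2.0):
    """Call fetch(); retry with exponential backoff, then skip and log."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # e.g., HTTP 429, yt-dlp blocked, forum down
            if attempt == attempts - 1:
                logging.warning("skipping source after %d attempts: %s", attempts, exc)
                return None
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s
```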
Testing Strategy: Fixtures + Dry-Run
- Sample data fixtures: Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
- Dry-run mode: A `--dry-run` flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
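A sketch of the flag wiring (the entry point is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Aurora YouTube monitor")
parser.add_argument("--dry-run", action="store_true",
                    help="process live data, log results, post nothing")
args = parser.parse_args()

# Every side-effecting step checks the flag before posting:
if args.dry_run:
    print("[dry-run] would comment on issue #NNN")
```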
Prerequisites
Before the first run, create the Monitor Digests discussion category manually (GitHub does not support programmatic category creation):
- Go to github.com/ErikEvenson/aurora-manual/discussions
- Click the pencil icon next to "Categories" (or Settings > Discussions)
- Click New category
- Name: `Monitor Digests`
- Description: `Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.`
- Format: Announcement (only maintainers/workflows can post, others can comment)
- Record the category ID for use in workflow config
Created. Category ID: DIC_kwDORAJjec4C2k26 | Slug: monitor-digests
Proposed Architecture
```
aurora-monitor/
  sources/
    youtube.py       # RSS fetch for tracked channels
    transcripts.py   # yt-dlp transcript extraction and cleanup
    dictionary.yaml  # Aurora term corrections for auto-captions
  matcher.py         # Shared — match against issues and manual claims
  digest.py          # Shared — generate markdown digest (Matched + Triage sections)
  channels.yaml      # Tracked channels with IDs and content categories
  state.json         # Track last-seen video IDs
```
Differences from Forum Monitor (#1288)
| Aspect | Forum (#1288) | YouTube (this issue) |
|---|---|---|
| Authority level | #2 (dev posts) – #5 | #5 (community knowledge) |
| Signal-to-noise | Higher (text, searchable) | Lower (long videos, tangential content) |
| Transcript quality | N/A (already text) | ~90% accurate, needs term dictionary |
| Demonstration value | Describes mechanics | Shows mechanics in action |
| Volume | Moderate | Lower (fewer Aurora creators) |
Shared Infrastructure
All four monitors share matching and triage logic. Consider a shared aurora-monitor/ package with per-source adapters.