
Build YouTube monitor for Aurora content and transcript-based verification matching #1289


Summary

Build a script that monitors Aurora C# YouTube channels for new content, extracts transcripts, and matches them against open verification issues and manual claims. Complements the forum monitor (#1288) by targeting a different content type — video demonstrations that show mechanics in action.

Why YouTube

Forum posts describe how mechanics work. Videos show them. If a creator records themselves testing "what happens without an HQ unit," that's empirical evidence — weaker than database verification but stronger than forum speculation. YouTube content is tier 5 (community knowledge) in the verification hierarchy, but demonstrations of in-game behavior approach "live testing" in value.

What to Monitor

  • Aurora C# gameplay channels (identify active creators producing tutorial/mechanics content)
  • New uploads detected via RSS: https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID
  • Focus on tutorial, mechanics explanation, and "testing X" style videos over Let's Play narrative
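
A rough sketch of what the per-channel RSS poll could look like (feedparser and the channels.yaml key names here are illustrative, not decided):

```python
# Sketch of the FETCH/DEDUP steps, assuming feedparser and PyYAML are available.
# The channels.yaml keys (id, name, active) follow the description later in this issue.
import feedparser
import yaml

FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

def fetch_new_uploads(channels_path="aurora-monitor/channels.yaml", seen_ids=frozenset()):
    """Return (video_id, title, url, published, channel_name) tuples not yet seen."""
    with open(channels_path) as f:
        channels = yaml.safe_load(f)["channels"]

    new_videos = []
    for channel in channels:
        if not channel.get("active", True):
            continue  # removed channels stay in the file for audit, but are skipped
        feed = feedparser.parse(FEED_URL.format(channel_id=channel["id"]))
        for entry in feed.entries:
            video_id = entry.id.rsplit(":", 1)[-1]  # entry id has the form "yt:video:<id>"
            if video_id in seen_ids:
                continue
            new_videos.append((video_id, entry.title, entry.link,
                               entry.published, channel["name"]))
    return new_videos
```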

Execution Environment

Runs as a GitHub Actions scheduled workflow on the public repository.

# .github/workflows/youtube-monitor.yml
on:
  schedule:
    - cron: '0 9 * * *'  # Daily at 9am UTC (staggered from other monitors)
  workflow_dispatch:
    inputs:
      mode:
        description: 'Run mode'
        required: true
        default: 'steady-state'
        type: choice
        options:
          - steady-state
          - backfill

Why GitHub Actions:

  • Free for public repos (the manual repo is public)
  • Native GITHUB_TOKEN for commenting on issues and posting digests
  • Built-in secrets management for any API keys
  • No infrastructure to provision or maintain
  • yt-dlp can be installed in the workflow via pip install yt-dlp

State persistence: state.json committed to the aurora-monitor/ folder on the main branch to track seen video IDs across runs.

Note: Transcript fetching via yt-dlp may take longer than the other monitors. GitHub Actions allows up to 6 hours per job, which is more than sufficient for daily new uploads from a handful of channels.

Channel and Video Selection

How Channels Get on the Watch List

Initial seeding:

Channel qualification criteria:

  • At least 3 Aurora C# videos in the last 12 months
  • Content is primarily tutorial/mechanics-focused, not purely narrative Let's Play
  • English language or has English auto-captions available
  • Channel is still active (uploaded within the last 6 months)

Ongoing discovery:

Channel list maintenance:

  • channels.yaml stores channel ID, name, URL, date added, and content category
  • Channels with no new Aurora content in 12 months are flagged for review
  • Removed channels stay in the file (marked active: false) for audit trail

How Individual Videos Are Filtered

Not every video on a watched channel is relevant. Videos pass through a two-stage filter before transcript processing:

Stage 1: Metadata filter (cheap — no transcript download)

| Filter | Rule | Why |
|--------|------|-----|
| Title keywords | Must contain at least one Aurora term (see keyword list below) | Skip non-Aurora content on mixed channels |
| Description keywords | Fallback if title is generic (e.g., "Episode 47") | Catch videos with vague titles but relevant descriptions |
| Duration | Skip < 2 minutes (intros/trailers) and > 3 hours (unedited streams) | Low signal-to-noise at extremes |
| Language | English title/description or English auto-captions available | Transcript processing requires English |
| Age (steady-state) | Skip videos older than 7 days | Already in state.json from prior runs |

Stage 2: Transcript keyword density (after download)

After downloading the transcript, check keyword density before full matching:

  • If fewer than 3 Aurora-specific terms appear in the transcript, skip full matching
  • This catches videos that passed Stage 1 on a generic title match but aren't actually about Aurora mechanics
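
A minimal sketch of this Stage 2 gate, assuming the keyword list referenced below is loaded into a set (AURORA_TERMS here is a placeholder):

```python
# Stage 2 keyword-density gate: skip full matching if fewer than 3 distinct Aurora
# terms appear in the transcript. The threshold comes from the text above.
import re

AURORA_TERMS = {"aurora", "duranium", "sorium", "trans-newtonian", "ciws", "jump point"}  # placeholder

def passes_keyword_density(transcript: str, min_terms: int = 3) -> bool:
    text = transcript.lower()
    hits = {term for term in AURORA_TERMS
            if re.search(r"\b" + re.escape(term) + r"\b", text)}
    return len(hits) >= min_terms
```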

Aurora keyword list for filtering:

Decision Flow

First Run: Historical Backfill

On first run, state.json doesn't exist. Without special handling, every existing video on tracked channels would be treated as "new" — triggering transcript downloads for potentially hundreds of videos, mass GitHub issue comments, and a massive digest.

The monitor operates in two modes:

Backfill Mode (first run, manual trigger)

Triggered manually via workflow_dispatch with mode: backfill:

  1. PAGINATE — Walk tracked channels' full upload history via YouTube API/RSS
  2. TRANSCRIPT — Fetch transcripts for relevant-looking videos (filter by title keywords first to avoid downloading hundreds of irrelevant transcripts)
  3. CLEAN — Run transcripts through Aurora-specific dictionary
  4. MATCH — Run all cleaned transcripts through matcher.py
  5. REPORT — Generate a one-time backfill report (NOT auto-comment on issues). Split into Matched, Triage, and Statistics sections.
  6. SEED — Write all seen video IDs to state.json
  7. PUBLISH — Post backfill report as a GitHub Discussion for human review

Key difference from steady-state: Backfill mode never auto-comments on GitHub issues. All matches go into the report for human review. This prevents noise from old videos, outdated content, and false positives.

Optimization: Backfill can pre-filter videos by title before downloading transcripts. Only videos with Aurora-relevant titles (containing keywords like "Aurora", "4X", game mechanic terms) need transcript processing. This avoids downloading transcripts for hundreds of irrelevant uploads.

Safety: Backfill Must Run Before Steady-State

The steady-state cron checks for state.json (or a backfill_complete flag). If missing, the workflow exits with a warning rather than accidentally treating all channel history as new.
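
Something like the following guard at the top of the steady-state entry point would enforce this (paths follow the aurora-monitor/ layout proposed below):

```python
# Steady-state safety guard: refuse to run if backfill never seeded state.json.
import json
import sys
from pathlib import Path

STATE_PATH = Path("aurora-monitor/state.json")

def load_state_or_abort() -> dict:
    if not STATE_PATH.exists():
        # GitHub Actions workflow-command syntax surfaces this as a warning in the run log
        print("::warning::state.json not found - run backfill mode before enabling the daily cron")
        sys.exit(0)  # exit cleanly so the scheduled run never treats all channel history as new
    return json.loads(STATE_PATH.read_text())
```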

Steady-State Mode (daily cron)

Runs automatically after backfill is complete. Only processes videos uploaded since the last run.

What Happens When a New Video Is Detected

End-to-End Flow

1. FETCH     — Daily cron checks tracked channel RSS feeds for new uploads
2. DEDUP     — Check video ID against state.json, skip if already seen
3. TRANSCRIPT— Fetch via: yt-dlp --write-auto-sub --sub-lang en --skip-download
4. CLEAN     — Run transcript through Aurora-specific dictionary (see below)
5. CAPTURE   — Store attribution metadata (creator, channel URL, video title,
               video URL, upload date)
6. MATCH     — Run cleaned transcript through matcher.py against two corpora:
               a. Open GitHub issues labeled "unverified" (title + body keywords)
               b. Manual section terminology and numeric values
7. ROUTE     — Based on match result:
               → Issue match found:     comment on GitHub issue + add to digest
               → Relevant, no match:    add to digest Triage section for human review
               → Not relevant:          record as seen, no action
8. STATE     — Mark video ID as seen in state.json
9. DIGEST    — Weekly: compile all matches and triage items into markdown summary
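
The TRANSCRIPT step can shell out to yt-dlp with exactly the flags above; a sketch (output template and paths are illustrative):

```python
# Fetch auto-captions for one video and return the raw VTT text, or None if the
# video has no English captions. Uses the yt-dlp flags listed in step 3 above.
import subprocess
from pathlib import Path
from typing import Optional

def fetch_transcript(video_id: str, workdir: Path = Path("transcripts")) -> Optional[str]:
    workdir.mkdir(exist_ok=True)
    url = f"https://www.youtube.com/watch?v={video_id}"
    subprocess.run(
        ["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
         "-o", str(workdir / "%(id)s"), url],  # captions land in transcripts/<id>.en.vtt
        check=False,  # missing captions should not crash the whole run
    )
    vtt = workdir / f"{video_id}.en.vtt"
    return vtt.read_text(encoding="utf-8") if vtt.exists() else None
```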

On Issue Match

When a transcript matches an open unverified issue:

  1. Post a comment on the matching GitHub issue with:
    • Video URL, creator name, and channel URL
    • Relevant transcript segment with timestamp
    • Match confidence indicator
  2. Add to the Matched section of the weekly digest with full attribution
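
A sketch of the auto-comment call using the REST API with the workflow's GITHUB_TOKEN (the comment layout and repo constant are illustrative, not final):

```python
# Post a match comment on an open unverified issue. The body fields mirror the
# attribution requirements above (creator, channel URL, video URL, timestamp, excerpt).
import os
import requests

REPO = "ErikEvenson/aurora-manual"  # assumed repository slug

def comment_on_issue(issue_number: int, creator: str, channel_url: str,
                     video_title: str, video_url: str, timestamp: str,
                     excerpt: str, confidence: str) -> None:
    body = (
        f"**YouTube monitor match ({confidence} confidence)**\n\n"
        f"[{creator}]({channel_url}) - \"{video_title}\"\n"
        f"{video_url} (at {timestamp})\n\n"
        f"> {excerpt}"
    )
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues/{issue_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
                 "Accept": "application/vnd.github+json"},
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```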

On Relevant Content Without an Issue Match

Not all valuable content maps to an existing unverified claim. Four scenarios:

| Category | Description | Example |
|----------|-------------|---------|
| Potential correction | Contradicts something the manual states as verified | Video demonstrates a mechanic working differently than documented |
| New coverage | Topic the manual doesn't address at all | Tutorial covering an undocumented mechanic |
| Version change | Demonstrates v2.8.0+ behavior differing from v2.7.1 baseline | Gameplay showing changed formula results |
| Expanded detail | Adds depth to an existing section without contradicting it | Testing that quantifies a vaguely described mechanic |

These go into the Triage section of the weekly digest. Each entry includes:

  • Full attribution metadata (creator, video URL, timestamp)
  • Suggested category (correction / new coverage / version change / expanded detail)
  • The matched manual section (if any)
  • Relevant transcript excerpt with timestamp

A human reviews the triage section and either:

  • Creates a new issue if the content is actionable
  • Dismisses if it's noise or already known
  • Routes to an existing issue that the matcher missed

No automatic issue creation for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.

On No Relevance

Video is recorded in state.json as seen. No action taken, no noise generated.

Output Destinations

| Output | Destination | Frequency |
|--------|-------------|-----------|
| Issue match comments | Posted directly on the matching GitHub issue | Each run (real-time) |
| Weekly digest | GitHub Discussion in a "Monitor Digests" category | Weekly (Sunday) |
| Backfill report | GitHub Discussion in "Monitor Digests" category | One-time (first run) |
| State file | state.json in the aurora-monitor/ folder on the main branch | Each run |
| Run logs | GitHub Actions log (visible in Actions tab) | Each run |

Weekly Digest Format

Posted as a GitHub Discussion with the title [YouTube Monitor] Week of YYYY-MM-DD:

## Matched (N items)
Videos with transcripts matching open unverified issues. Auto-commented on the relevant issues.

| Issue | Creator | Video | Timestamp | Quote | Confidence |
|-------|---------|-------|-----------|-------|------------|
| #NNN  | name    | title | MM:SS     | "..." | High/Medium |

## Triage (N items)
Relevant videos not matching any open issue. Requires human review.

| Category | Creator | Video | Timestamp | Manual Section | Quote |
|----------|---------|-------|-----------|---------------|-------|
| Correction | name | title | MM:SS | 13.1 | "..." |

## Statistics
- Videos checked: N
- Transcripts processed: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N

GitHub Discussions Category

All four monitors post to a shared "Monitor Digests" category (see Prerequisites below), prefixed by source name (e.g., [YouTube Monitor]).

Aurora-Specific Transcript Dictionary

Auto-generated captions mangle game-specific terms. A custom dictionary would map common misrecognitions:

| Auto-caption | Actual Term |
|--------------|-------------|
| uranium, durainium | Duranium |
| thorium, sorium | Sorium |
| see-wis, see wiz | CIWS |
| boron ide, born hide | Boronide |
| corbomite | Corbomite |
| trans newtonian | Trans-Newtonian |
| merc-assium | Mercassium |

This dictionary should be maintained alongside the glossary and expanded as new misrecognitions are discovered.
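
A sketch of the CLEAN step, assuming dictionary.yaml is a flat misheard-to-actual mapping:

```python
# Apply dictionary.yaml corrections to a transcript with case-insensitive
# whole-phrase replacement. The YAML layout shown in the comment is an assumption.
import re
import yaml

def load_corrections(path="aurora-monitor/dictionary.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)  # e.g. {"see-wis": "CIWS", "trans newtonian": "Trans-Newtonian"}

def clean_transcript(text: str, corrections: dict) -> str:
    for misheard, actual in corrections.items():
        text = re.sub(re.escape(misheard), actual, text, flags=re.IGNORECASE)
    return text
```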

Attribution (MANDATORY)

Every YouTube video used as a verification source or content reference must be properly attributed to its creator.

The project already has YouTube source attribution rules (CLAUDE.md). These must be enforced automatically:

When a Video Is Used as a Verification Source

  1. Identify the creator: Use yt-dlp --print channel --print channel_url <URL> to get the official channel name and URL
  2. Add a manual reference in the relevant file's References section:
    \hypertarget{ref-X.Y-N}{[N]}. [Creator Name] YouTube — "[Video Title]" — [specific detail verified]
    
  3. Credit in issue comments — when closing a verification issue based on video content, include the video URL, creator name, channel URL, and timestamp of the relevant segment
  4. Digest attribution — every digest entry must include the creator name, video title, URL, and relevant timestamp

Monitor Must Capture Attribution Metadata

For every matched video, the monitor must store:

  • Channel name and URL
  • Video title and URL
  • Upload date
  • Relevant transcript segment with timestamp
  • Match context (which issue or manual claim it relates to)

This metadata ensures that manual incorporation never loses provenance. Content creators deserve credit for their work — the monitor must make proper attribution the path of least resistance, not an afterthought.
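
One way to implement the CAPTURE step is the yt_dlp Python API rather than a subprocess; the field names used (channel, channel_url, title, upload_date, webpage_url) are standard yt-dlp metadata keys, while the record shape is illustrative:

```python
# Capture attribution metadata for a matched video so provenance is never lost.
from dataclasses import dataclass
import yt_dlp

@dataclass
class AttributionRecord:
    creator: str
    channel_url: str
    video_title: str
    video_url: str
    upload_date: str

def capture_attribution(video_url: str) -> AttributionRecord:
    with yt_dlp.YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        info = ydl.extract_info(video_url, download=False)
    return AttributionRecord(
        creator=info["channel"],
        channel_url=info["channel_url"],
        video_title=info["title"],
        video_url=info["webpage_url"],
        upload_date=info["upload_date"],  # YYYYMMDD
    )
```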

Attribution in the Manual

Per project rules, YouTube credits appear in the References section of the relevant manual file (not in README.md). Specific channel names should only appear where content was directly sourced.

Design Decisions

The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):

Matcher Algorithm: Keyword + Fuzzy

The shared matcher.py uses keyword matching with fuzzy string similarity (e.g., fuzzywuzzy / thefuzz). Simple, fast, transparent, and easy to debug. Each open unverified issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
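
A sketch of the scoring primitive this implies (the per-issue keyword-set shape is an assumption, not the final matcher.py interface):

```python
# Score one transcript against the keyword set generated for one open issue,
# using thefuzz for fuzzy similarity alongside exact keyword hits.
from thefuzz import fuzz

def score_against_issue(transcript: str, issue_keywords: set) -> tuple:
    """Return (best fuzzy score, number of exact keyword hits) for one issue."""
    text = transcript.lower()
    best_score = 0
    hits = 0
    for keyword in issue_keywords:
        best_score = max(best_score, fuzz.partial_ratio(keyword.lower(), text))
        if keyword.lower() in text:
            hits += 1
    return best_score, hits
```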

Match Confidence Scoring: Multiple Signals

Match confidence is determined by combining multiple signals:

| Signal | Weight | Example |
|--------|--------|---------|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |

Thresholds:

  • High: Fuzzy score >= 80 AND (2+ keywords OR known author)
  • Medium: Fuzzy score 60-79 OR single strong keyword match
  • Low: Fuzzy score 40-59, included in digest but not auto-commented on issues

Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
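
The thresholds translate roughly to the following (the -10 penalty size for multi-issue matches and the "single strong keyword" interpretation are assumed, not specified):

```python
# Direct transcription of the threshold rules above. known_author covers the
# author-reputation bonus; multi_issue applies the generic-match penalty.
def classify_confidence(fuzzy_score: int, keyword_hits: int,
                        known_author: bool = False, multi_issue: bool = False) -> str:
    if multi_issue:
        fuzzy_score -= 10  # assumed penalty size for matching 3+ issues
    if fuzzy_score >= 80 and (keyword_hits >= 2 or known_author):
        return "High"
    if fuzzy_score >= 60 or keyword_hits >= 1:
        return "Medium"
    if fuzzy_score >= 40:
        return "Low"       # digest Triage only, never auto-commented
    return "None"
```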

Repo Structure: Same Repo (Manual)

The aurora-monitor/ folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.

State Persistence: Dedicated Folder on Main Branch

state.json lives in a dedicated aurora-monitor/ folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.

Weekly Digest Timing: Same Workflow, Sunday Check

The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
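
A minimal sketch of that check:

```python
# Run at the end of the daily scan; weekday() == 6 is Sunday in Python's calendar.
from datetime import datetime, timezone

def maybe_post_weekly_digest(post_digest) -> None:
    if datetime.now(timezone.utc).weekday() == 6:
        post_digest()  # compile the week's matches and triage items into the Discussion post
```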

Authentication: GITHUB_TOKEN with Elevated Permissions

Workflows use the default GITHUB_TOKEN with permissions: discussions: write set in the workflow YAML. No PAT or GitHub App required.

Error Handling: Retry with Backoff, Then Skip

On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
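
A sketch of the retry-then-skip wrapper:

```python
# Three attempts with exponential backoff, then log and move on so one dead
# source cannot block the whole run. Delay values are illustrative.
import logging
import time

def with_retries(fetch, attempts: int = 3, base_delay: float = 5.0):
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # e.g. HTTP 429, timeout, yt-dlp failure
            if attempt == attempts - 1:
                logging.warning("source failed after %d attempts: %s", attempts, exc)
                return None  # skip this source for this run
            time.sleep(base_delay * 2 ** attempt)
```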

Testing Strategy: Fixtures + Dry-Run

  • Sample data fixtures: Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
  • Dry-run mode: A --dry-run flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
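
Sketch of the flag plumbing (argparse; flag names follow this issue's wording):

```python
# Mode and dry-run handling for the monitor entry point.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Aurora YouTube monitor")
    parser.add_argument("--mode", choices=["steady-state", "backfill"], default="steady-state")
    parser.add_argument("--dry-run", action="store_true",
                        help="process live data but only log; never post comments or digests")
    return parser.parse_args()
```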

Prerequisites

Before the first run, create the Monitor Digests discussion category manually (GitHub does not support programmatic category creation):

  1. Go to github.com/ErikEvenson/aurora-manual/discussions
  2. Click the pencil icon next to "Categories" (or Settings > Discussions)
  3. Click New category
  4. Name: Monitor Digests
  5. Description: Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.
  6. Format: Announcement (only maintainers/workflows can post, others can comment)
  7. Record the category ID for use in workflow config

Created. Category ID: DIC_kwDORAJjec4C2k26 | Slug: monitor-digests

Proposed Architecture

aurora-monitor/
  sources/
    youtube.py        # RSS fetch for tracked channels
  transcripts.py      # yt-dlp transcript extraction and cleanup
  dictionary.yaml     # Aurora term corrections for auto-captions
  matcher.py          # Shared — match against issues and manual claims
  digest.py           # Shared — generate markdown digest (Matched + Triage sections)
  channels.yaml       # Tracked channels with IDs and content categories
  state.json          # Track last-seen video IDs

Differences from Forum Monitor (#1288)

| Aspect | Forum (#1288) | YouTube (this issue) |
|--------|---------------|----------------------|
| Authority level | #2 (dev posts) – #5 | #5 (community knowledge) |
| Signal-to-noise | Higher (text, searchable) | Lower (long videos, tangential content) |
| Transcript quality | N/A (already text) | ~90% accurate, needs term dictionary |
| Demonstration value | Describes mechanics | Shows mechanics in action |
| Volume | Moderate | Lower (fewer Aurora creators) |

Shared Infrastructure

All four monitors share matching and triage logic. Consider a shared aurora-monitor/ package with per-source adapters.

Related
