
Build YouTube monitor for Aurora content and transcript-based verification matching #1289


Summary

Build a script that monitors Aurora C# YouTube channels for new content, extracts transcripts, and matches them against open verification issues and manual claims. Complements the forum monitor (#1288) by targeting a different content type — video demonstrations that show mechanics in action.

Why YouTube

Forum posts describe how mechanics work. Videos show them. If a creator records themselves testing "what happens without an HQ unit," that's empirical evidence — weaker than database verification but stronger than forum speculation. YouTube content is tier 5 (community knowledge) in the verification hierarchy, but demonstrations of in-game behavior approach "live testing" in value.

What to Monitor

  • Aurora C# gameplay channels (identify active creators producing tutorial/mechanics content)
  • New uploads detected via RSS: https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID
  • Focus on tutorial, mechanics explanation, and "testing X" style videos over Let's Play narrative
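
A rough sketch of what the per-channel RSS poll could look like (feedparser and the channels.yaml key names here are illustrative, not decided):

```python
# Sketch of the FETCH/DEDUP steps, assuming feedparser and PyYAML are available.
# The channels.yaml keys (id, name, active) follow the description later in this issue.
import feedparser
import yaml

FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

def fetch_new_uploads(channels_path="aurora-monitor/channels.yaml", seen_ids=frozenset()):
    """Return (video_id, title, url, published, channel_name) tuples not yet seen."""
    with open(channels_path) as f:
        channels = yaml.safe_load(f)["channels"]

    new_videos = []
    for channel in channels:
        if not channel.get("active", True):
            continue  # removed channels stay in the file for audit, but are skipped
        feed = feedparser.parse(FEED_URL.format(channel_id=channel["id"]))
        for entry in feed.entries:
            video_id = entry.id.rsplit(":", 1)[-1]  # entry id has the form "yt:video:<id>"
            if video_id in seen_ids:
                continue
            new_videos.append((video_id, entry.title, entry.link,
                               entry.published, channel["name"]))
    return new_videos
```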

Execution Environment

Runs as a GitHub Actions scheduled workflow on the public repository.

# .github/workflows/youtube-monitor.yml
on:
  schedule:
    - cron: '0 9 * * *'  # Daily at 9am UTC (staggered from other monitors)
  workflow_dispatch:
    inputs:
      mode:
        description: 'Run mode'
        required: true
        default: 'steady-state'
        type: choice
        options:
          - steady-state
          - backfill

Why GitHub Actions:

  • Free for public repos (the manual repo is public)
  • Native GITHUB_TOKEN for commenting on issues and posting digests
  • Built-in secrets management for any API keys
  • No infrastructure to provision or maintain
  • yt-dlp can be installed in the workflow via pip install yt-dlp

State persistence: state.json committed to the aurora-monitor/ folder on the main branch to track seen video IDs across runs.

Note: Transcript fetching via yt-dlp may take longer than the other monitors. GitHub Actions allows up to 6 hours per job, which is more than sufficient for daily new uploads from a handful of channels.

Channel and Video Selection

How Channels Get on the Watch List

Initial seeding:

Channel qualification criteria:

  • At least 3 Aurora C# videos in the last 12 months
  • Content is primarily tutorial/mechanics-focused, not purely narrative Let's Play
  • English language or has English auto-captions available
  • Channel is still active (uploaded within the last 6 months)

Ongoing discovery:

Channel list maintenance:

  • channels.yaml stores channel ID, name, URL, date added, and content category
  • Channels with no new Aurora content in 12 months are flagged for review
  • Removed channels stay in the file (marked active: false) for audit trail

How Individual Videos Are Filtered

Not every video on a watched channel is relevant. Videos pass through a two-stage filter before transcript processing:

Stage 1: Metadata filter (cheap — no transcript download)

| Filter | Rule | Why |
|--------|------|-----|
| Title keywords | Must contain at least one Aurora term (see keyword list below) | Skip non-Aurora content on mixed channels |
| Description keywords | Fallback if title is generic (e.g., "Episode 47") | Catch videos with vague titles but relevant descriptions |
| Duration | Skip < 2 minutes (intros/trailers) and > 3 hours (unedited streams) | Low signal-to-noise at extremes |
| Language | English title/description or English auto-captions available | Transcript processing requires English |
| Age (steady-state) | Skip videos older than 7 days | Already in state.json from prior runs |

Stage 2: Transcript keyword density (after download)

After downloading the transcript, check keyword density before full matching:

  • If fewer than 3 Aurora-specific terms appear in the transcript, skip full matching
  • This catches videos that passed Stage 1 on a generic title match but aren't actually about Aurora mechanics
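
A minimal sketch of this Stage 2 gate, assuming the keyword list referenced below is loaded into a set (AURORA_TERMS here is a placeholder):

```python
# Stage 2 keyword-density gate: skip full matching if fewer than 3 distinct Aurora
# terms appear in the transcript. The threshold comes from the text above.
import re

AURORA_TERMS = {"aurora", "duranium", "sorium", "trans-newtonian", "ciws", "jump point"}  # placeholder

def passes_keyword_density(transcript: str, min_terms: int = 3) -> bool:
    text = transcript.lower()
    hits = {term for term in AURORA_TERMS
            if re.search(r"\b" + re.escape(term) + r"\b", text)}
    return len(hits) >= min_terms
```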

Aurora keyword list for filtering:

Decision Flow

First Run: Historical Backfill

On first run, state.json doesn't exist. Without special handling, every existing video on tracked channels would be treated as "new" — triggering transcript downloads for potentially hundreds of videos, mass GitHub issue comments, and a massive digest.

The monitor operates in two modes:

Backfill Mode (first run, manual trigger)

Triggered manually via workflow_dispatch with mode: backfill:

  1. PAGINATE — Walk tracked channels' full upload history via YouTube API/RSS
  2. TRANSCRIPT — Fetch transcripts for relevant-looking videos (filter by title keywords first to avoid downloading hundreds of irrelevant transcripts)
  3. CLEAN — Run transcripts through Aurora-specific dictionary
  4. MATCH — Run all cleaned transcripts through matcher.py
  5. REPORT — Generate a one-time backfill report (NOT auto-comment on issues). Split into Matched, Triage, and Statistics sections.
  6. SEED — Write all seen video IDs to state.json
  7. PUBLISH — Post backfill report as a GitHub Discussion for human review

Key difference from steady-state: Backfill mode never auto-comments on GitHub issues. All matches go into the report for human review. This prevents noise from old videos, outdated content, and false positives.

Optimization: Backfill can pre-filter videos by title before downloading transcripts. Only videos with Aurora-relevant titles (containing keywords like "Aurora", "4X", game mechanic terms) need transcript processing. This avoids downloading transcripts for hundreds of irrelevant uploads.

Safety: Backfill Must Run Before Steady-State

The steady-state cron checks for state.json (or a backfill_complete flag). If missing, the workflow exits with a warning rather than accidentally treating all channel history as new.
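
Something like the following guard at the top of the steady-state entry point would enforce this (paths follow the aurora-monitor/ layout proposed below):

```python
# Steady-state safety guard: refuse to run if backfill never seeded state.json.
import json
import sys
from pathlib import Path

STATE_PATH = Path("aurora-monitor/state.json")

def load_state_or_abort() -> dict:
    if not STATE_PATH.exists():
        # GitHub Actions workflow-command syntax surfaces this as a warning in the run log
        print("::warning::state.json not found - run backfill mode before enabling the daily cron")
        sys.exit(0)  # exit cleanly so the scheduled run never treats all channel history as new
    return json.loads(STATE_PATH.read_text())
```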

Steady-State Mode (daily cron)

Runs automatically after backfill is complete. Only processes videos uploaded since the last run.

What Happens When a New Video Is Detected

End-to-End Flow

1. FETCH     — Daily cron checks tracked channel RSS feeds for new uploads
2. DEDUP     — Check video ID against state.json, skip if already seen
3. TRANSCRIPT— Fetch via: yt-dlp --write-auto-sub --sub-lang en --skip-download
4. CLEAN     — Run transcript through Aurora-specific dictionary (see below)
5. CAPTURE   — Store attribution metadata (creator, channel URL, video title,
               video URL, upload date)
6. MATCH     — Run cleaned transcript through matcher.py against two corpora:
               a. Open GitHub issues labeled "unverified" (title + body keywords)
               b. Manual section terminology and numeric values
7. ROUTE     — Based on match result:
               → Issue match found:     comment on GitHub issue + add to digest
               → Relevant, no match:    add to digest Triage section for human review
               → Not relevant:          record as seen, no action
8. STATE     — Mark video ID as seen in state.json
9. DIGEST    — Weekly: compile all matches and triage items into markdown summary
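
The TRANSCRIPT step can shell out to yt-dlp with exactly the flags above; a sketch (output template and paths are illustrative):

```python
# Fetch auto-captions for one video and return the raw VTT text, or None if the
# video has no English captions. Uses the yt-dlp flags listed in step 3 above.
import subprocess
from pathlib import Path
from typing import Optional

def fetch_transcript(video_id: str, workdir: Path = Path("transcripts")) -> Optional[str]:
    workdir.mkdir(exist_ok=True)
    url = f"https://www.youtube.com/watch?v={video_id}"
    subprocess.run(
        ["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
         "-o", str(workdir / "%(id)s"), url],  # captions land in transcripts/<id>.en.vtt
        check=False,  # missing captions should not crash the whole run
    )
    vtt = workdir / f"{video_id}.en.vtt"
    return vtt.read_text(encoding="utf-8") if vtt.exists() else None
```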

On Issue Match

When a transcript matches an open unverified issue:

  1. Post a comment on the matching GitHub issue with:
    • Video URL, creator name, and channel URL
    • Relevant transcript segment with timestamp
    • Match confidence indicator
  2. Add to the Matched section of the weekly digest with full attribution
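
A sketch of the auto-comment call using the REST API with the workflow's GITHUB_TOKEN (the comment layout and repo constant are illustrative, not final):

```python
# Post a match comment on an open unverified issue. The body fields mirror the
# attribution requirements above (creator, channel URL, video URL, timestamp, excerpt).
import os
import requests

REPO = "ErikEvenson/aurora-manual"  # assumed repository slug

def comment_on_issue(issue_number: int, creator: str, channel_url: str,
                     video_title: str, video_url: str, timestamp: str,
                     excerpt: str, confidence: str) -> None:
    body = (
        f"**YouTube monitor match ({confidence} confidence)**\n\n"
        f"[{creator}]({channel_url}) - \"{video_title}\"\n"
        f"{video_url} (at {timestamp})\n\n"
        f"> {excerpt}"
    )
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues/{issue_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
                 "Accept": "application/vnd.github+json"},
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```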

On Relevant Content Without an Issue Match

Not all valuable content maps to an existing unverified claim. Four scenarios:

| Category | Description | Example |
|----------|-------------|---------|
| Potential correction | Contradicts something the manual states as verified | Video demonstrates a mechanic working differently than documented |
| New coverage | Topic the manual doesn't address at all | Tutorial covering an undocumented mechanic |
| Version change | Demonstrates v2.8.0+ behavior differing from v2.7.1 baseline | Gameplay showing changed formula results |
| Expanded detail | Adds depth to an existing section without contradicting it | Testing that quantifies a vaguely described mechanic |

These go into the Triage section of the weekly digest. Each entry includes:

  • Full attribution metadata (creator, video URL, timestamp)
  • Suggested category (correction / new coverage / version change / expanded detail)
  • The matched manual section (if any)
  • Relevant transcript excerpt with timestamp

A human reviews the triage section and either:

  • Creates a new issue if the content is actionable
  • Dismisses if it's noise or already known
  • Routes to an existing issue that the matcher missed

No automatic issue creation for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.

On No Relevance

Video is recorded in state.json as seen. No action taken, no noise generated.

Output Destinations

| Output | Destination | Frequency |
|--------|-------------|-----------|
| Issue match comments | Posted directly on the matching GitHub issue | Each run (real-time) |
| Weekly digest | GitHub Discussion in a "Monitor Digests" category | Weekly (Sunday) |
| Backfill report | GitHub Discussion in "Monitor Digests" category | One-time (first run) |
| State file | state.json in the aurora-monitor/ folder on the main branch | Each run |
| Run logs | GitHub Actions log (visible in Actions tab) | Each run |

Weekly Digest Format

Posted as a GitHub Discussion with the title [YouTube Monitor] Week of YYYY-MM-DD:

## Matched (N items)
Videos with transcripts matching open unverified issues. Auto-commented on the relevant issues.

| Issue | Creator | Video | Timestamp | Quote | Confidence |
|-------|---------|-------|-----------|-------|------------|
| #NNN  | name    | title | MM:SS     | "..." | High/Medium |

## Triage (N items)
Relevant videos not matching any open issue. Requires human review.

| Category | Creator | Video | Timestamp | Manual Section | Quote |
|----------|---------|-------|-----------|---------------|-------|
| Correction | name | title | MM:SS | 13.1 | "..." |

## Statistics
- Videos checked: N
- Transcripts processed: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N

GitHub Discussions Category

All four monitors post to a shared "Monitor Digests" category (see Prerequisites below), prefixed by source name (e.g., [YouTube Monitor]).

Aurora-Specific Transcript Dictionary

Auto-generated captions mangle game-specific terms. A custom dictionary would map common misrecognitions:

| Auto-caption | Actual Term |
|--------------|-------------|
| uranium, durainium | Duranium |
| thorium, sorium | Sorium |
| see-wis, see wiz | CIWS |
| boron ide, born hide | Boronide |
| corbomite | Corbomite |
| trans newtonian | Trans-Newtonian |
| merc-assium | Mercassium |

This dictionary should be maintained alongside the glossary and expanded as new misrecognitions are discovered.
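
A sketch of the CLEAN step, assuming dictionary.yaml is a flat misheard-to-actual mapping:

```python
# Apply dictionary.yaml corrections to a transcript with case-insensitive
# whole-phrase replacement. The YAML layout shown in the comment is an assumption.
import re
import yaml

def load_corrections(path="aurora-monitor/dictionary.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)  # e.g. {"see-wis": "CIWS", "trans newtonian": "Trans-Newtonian"}

def clean_transcript(text: str, corrections: dict) -> str:
    for misheard, actual in corrections.items():
        text = re.sub(re.escape(misheard), actual, text, flags=re.IGNORECASE)
    return text
```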

Attribution (MANDATORY)

Every YouTube video used as a verification source or content reference must be properly attributed to its creator.

The project already has YouTube source attribution rules (CLAUDE.md). These must be enforced automatically:

When a Video Is Used as a Verification Source

  1. Identify the creator: Use yt-dlp --print channel --print channel_url <URL> to get the official channel name and URL
  2. Add a manual reference in the relevant file's References section:
    \hypertarget{ref-X.Y-N}{[N]}. [Creator Name] YouTube — "[Video Title]" — [specific detail verified]
    
  3. Credit in issue comments — when closing a verification issue based on video content, include the video URL, creator name, channel URL, and timestamp of the relevant segment
  4. Digest attribution — every digest entry must include the creator name, video title, URL, and relevant timestamp

Monitor Must Capture Attribution Metadata

For every matched video, the monitor must store:

  • Channel name and URL
  • Video title and URL
  • Upload date
  • Relevant transcript segment with timestamp
  • Match context (which issue or manual claim it relates to)

This metadata ensures that manual incorporation never loses provenance. Content creators deserve credit for their work — the monitor must make proper attribution the path of least resistance, not an afterthought.
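
One way to implement the CAPTURE step is the yt_dlp Python API rather than a subprocess; the field names used (channel, channel_url, title, upload_date, webpage_url) are standard yt-dlp metadata keys, while the record shape is illustrative:

```python
# Capture attribution metadata for a matched video so provenance is never lost.
from dataclasses import dataclass
import yt_dlp

@dataclass
class AttributionRecord:
    creator: str
    channel_url: str
    video_title: str
    video_url: str
    upload_date: str

def capture_attribution(video_url: str) -> AttributionRecord:
    with yt_dlp.YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        info = ydl.extract_info(video_url, download=False)
    return AttributionRecord(
        creator=info["channel"],
        channel_url=info["channel_url"],
        video_title=info["title"],
        video_url=info["webpage_url"],
        upload_date=info["upload_date"],  # YYYYMMDD
    )
```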

Attribution in the Manual

Per project rules, YouTube credits appear in the References section of the relevant manual file (not in README.md). Specific channel names should only appear where content was directly sourced.

Design Decisions

The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):

Matcher Algorithm: Keyword + Fuzzy

The shared matcher.py uses keyword matching with fuzzy string similarity (e.g., fuzzywuzzy / thefuzz). Simple, fast, transparent, and easy to debug. Each open unverified issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
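
A sketch of the scoring primitive this implies (the per-issue keyword-set shape is an assumption, not the final matcher.py interface):

```python
# Score one transcript against the keyword set generated for one open issue,
# using thefuzz for fuzzy similarity alongside exact keyword hits.
from thefuzz import fuzz

def score_against_issue(transcript: str, issue_keywords: set) -> tuple:
    """Return (best fuzzy score, number of exact keyword hits) for one issue."""
    text = transcript.lower()
    best_score = 0
    hits = 0
    for keyword in issue_keywords:
        best_score = max(best_score, fuzz.partial_ratio(keyword.lower(), text))
        if keyword.lower() in text:
            hits += 1
    return best_score, hits
```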

Match Confidence Scoring: Multiple Signals

Match confidence is determined by combining multiple signals:

| Signal | Weight | Example |
|--------|--------|---------|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |

Thresholds:

  • High: Fuzzy score >= 80 AND (2+ keywords OR known author)
  • Medium: Fuzzy score 60-79 OR single strong keyword match
  • Low: Fuzzy score 40-59, included in digest but not auto-commented on issues

Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
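
The thresholds translate roughly to the following (the -10 penalty size for multi-issue matches and the "single strong keyword" interpretation are assumed, not specified):

```python
# Direct transcription of the threshold rules above. known_author covers the
# author-reputation bonus; multi_issue applies the generic-match penalty.
def classify_confidence(fuzzy_score: int, keyword_hits: int,
                        known_author: bool = False, multi_issue: bool = False) -> str:
    if multi_issue:
        fuzzy_score -= 10  # assumed penalty size for matching 3+ issues
    if fuzzy_score >= 80 and (keyword_hits >= 2 or known_author):
        return "High"
    if fuzzy_score >= 60 or keyword_hits >= 1:
        return "Medium"
    if fuzzy_score >= 40:
        return "Low"       # digest Triage only, never auto-commented
    return "None"
```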

Repo Structure: Same Repo (Manual)

The aurora-monitor/ folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.

State Persistence: Dedicated Folder on Main Branch

state.json lives in a dedicated aurora-monitor/ folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.

Weekly Digest Timing: Same Workflow, Sunday Check

The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
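
A minimal sketch of that check:

```python
# Run at the end of the daily scan; weekday() == 6 is Sunday in Python's calendar.
from datetime import datetime, timezone

def maybe_post_weekly_digest(post_digest) -> None:
    if datetime.now(timezone.utc).weekday() == 6:
        post_digest()  # compile the week's matches and triage items into the Discussion post
```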

Authentication: GITHUB_TOKEN with Elevated Permissions

Workflows use the default GITHUB_TOKEN with permissions: discussions: write set in the workflow YAML. No PAT or GitHub App required.

Error Handling: Retry with Backoff, Then Skip

On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
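
A sketch of the retry-then-skip wrapper:

```python
# Three attempts with exponential backoff, then log and move on so one dead
# source cannot block the whole run. Delay values are illustrative.
import logging
import time

def with_retries(fetch, attempts: int = 3, base_delay: float = 5.0):
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # e.g. HTTP 429, timeout, yt-dlp failure
            if attempt == attempts - 1:
                logging.warning("source failed after %d attempts: %s", attempts, exc)
                return None  # skip this source for this run
            time.sleep(base_delay * 2 ** attempt)
```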

Testing Strategy: Fixtures + Dry-Run

  • Sample data fixtures: Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
  • Dry-run mode: A --dry-run flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
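
Sketch of the flag plumbing (argparse; flag names follow this issue's wording):

```python
# Mode and dry-run handling for the monitor entry point.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Aurora YouTube monitor")
    parser.add_argument("--mode", choices=["steady-state", "backfill"], default="steady-state")
    parser.add_argument("--dry-run", action="store_true",
                        help="process live data but only log; never post comments or digests")
    return parser.parse_args()
```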

Prerequisites

Before the first run, create the Monitor Digests discussion category manually (GitHub does not support programmatic category creation):

  1. Go to github.com/ErikEvenson/aurora-manual/discussions
  2. Click the pencil icon next to "Categories" (or Settings > Discussions)
  3. Click New category
  4. Name: Monitor Digests
  5. Description: Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.
  6. Format: Announcement (only maintainers/workflows can post, others can comment)
  7. Record the category ID for use in workflow config

Created. Category ID: DIC_kwDORAJjec4C2k26 | Slug: monitor-digests

Proposed Architecture

aurora-monitor/
  sources/
    youtube.py        # RSS fetch for tracked channels
  transcripts.py      # yt-dlp transcript extraction and cleanup
  dictionary.yaml     # Aurora term corrections for auto-captions
  matcher.py          # Shared — match against issues and manual claims
  digest.py           # Shared — generate markdown digest (Matched + Triage sections)
  channels.yaml       # Tracked channels with IDs and content categories
  state.json          # Track last-seen video IDs

Differences from Forum Monitor (#1288)

| Aspect | Forum (#1288) | YouTube (this issue) |
|--------|---------------|----------------------|
| Authority level | #2 (dev posts) – #5 | #5 (community knowledge) |
| Signal-to-noise | Higher (text, searchable) | Lower (long videos, tangential content) |
| Transcript quality | N/A (already text) | ~90% accurate, needs term dictionary |
| Demonstration value | Describes mechanics | Shows mechanics in action |
| Volume | Moderate | Lower (fewer Aurora creators) |

Shared Infrastructure

All four monitors share matching and triage logic. Consider a shared aurora-monitor/ package with per-source adapters.

Related
