Build Aurora Forums monitor for automated verification and content updates

## Summary

Build a script that periodically scans the Aurora Forums (aurora2.pentarch.org) for new content relevant to the manual, matches it against open verification issues, and generates digests of actionable updates.

Steve Walmsley's forum posts are the #2 verification source after the game database. Currently we find relevant posts only when manually searching for a specific claim. Automated monitoring would surface new mechanics clarifications, changelog entries, and bug confirmations as they happen.

## What to Monitor

| Source | Why | Priority |
|--------|-----|----------|
| Steve Walmsley posts (all) | Developer statements are authoritative for game logic | Highest |
| v2.8.0 Changes List (topic 13884) | Active changelog, directly affects version-specific content | High |
| Bug report threads (confirmed bugs) | Clarify how mechanics actually work vs assumptions | Medium |
| Mechanics discussion threads | Community testing results that could verify/contradict unverified claims | Lower |

## How to Monitor

SMF (Simple Machines Forum) supports RSS natively:
- `aurora2.pentarch.org/index.php?action=.xml;type=rss` — recent posts feed
- `aurora2.pentarch.org/index.php?action=.xml;type=rss;board=N` — per-board feeds

RSS is the polite approach — no authentication, minimal server load, explicitly offered by the forum software.
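A minimal sketch of parsing such a feed with the standard library, assuming the feed is plain RSS 2.0 with `<item>` elements. The `https://` scheme, the field names in the returned dicts, and the sample XML are illustrative assumptions; the real feed structure still has to be confirmed once the forum is reachable.

```python
import xml.etree.ElementTree as ET

# Feed URL taken from the list above; the https scheme is an assumption.
RECENT_FEED = "https://aurora2.pentarch.org/index.php?action=.xml;type=rss"


def parse_rss(xml_text):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "guid": item.findtext("guid", default=""),
            "published": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


# Hand-written sample, not real forum output.
SAMPLE = """<rss version="2.0"><channel>
  <item>
    <title>Re: v2.8.0 Changes List</title>
    <link>https://example.invalid/topic=1.msg2#msg2</link>
    <guid>https://example.invalid/topic=1.msg2#msg2</guid>
    <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
    <description>Sample body text</description>
  </item>
</channel></rss>"""

posts = parse_rss(SAMPLE)
print(posts[0]["title"])
```

Keeping the parser separate from the network fetch means it can be exercised against saved fixtures while the forum is down.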

### Rate Limiting / Good Citizenship
- RSS only (no aggressive HTML scraping)
- Once-daily fetch at most
- Cache everything locally
- Respect `robots.txt`
- No authentication circumvention
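The `robots.txt` check can be sketched with the standard library's `urllib.robotparser`. The user-agent string and the sample rules below are hypothetical; the forum's actual `robots.txt` has not been inspected.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "aurora-manual-monitor"  # hypothetical user-agent string


def allowed_by_robots(robots_lines, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Illustrative rules only, not the forum's real robots.txt.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

feed_ok = allowed_by_robots(SAMPLE_ROBOTS, "https://example.invalid/index.php")
private_ok = allowed_by_robots(SAMPLE_ROBOTS, "https://example.invalid/private/x")
print(feed_ok, private_ok)
```

In a live run the lines would come from a cached copy of the forum's `robots.txt`, fetched no more often than the feed itself.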

## Execution Environment

Runs as a **GitHub Actions scheduled workflow** on the public repository.

```yaml
# .github/workflows/forum-monitor.yml
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8am UTC
```

**Why GitHub Actions:**
- Free for public repos (the manual repo is public)
- Native `GITHUB_TOKEN` for commenting on issues and posting digests
- Built-in secrets management for any API keys
- No infrastructure to provision or maintain
- Logs and run history visible in the Actions tab

**State persistence:** `state.json` is committed to the `aurora-monitor/` folder on the main branch after each run, so seen post IDs carry over between runs.
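One possible shape for the state file, as a sketch. The keys `backfill_complete` and `seen_ids` are assumptions for illustration; the issue does not define the schema.

```python
import json
import tempfile
from pathlib import Path


def load_state(path):
    """Load state.json, or return an empty state if the file is missing."""
    p = Path(path)
    if not p.exists():
        return {"backfill_complete": False, "seen_ids": []}
    return json.loads(p.read_text(encoding="utf-8"))


def save_state(path, state):
    """Write state with sorted, de-duplicated IDs so committed diffs stay small."""
    state = dict(state, seen_ids=sorted(set(state["seen_ids"])))
    Path(path).write_text(json.dumps(state, indent=2) + "\n", encoding="utf-8")


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "state.json"
    first = load_state(state_path)
    save_state(state_path, {"backfill_complete": True,
                            "seen_ids": ["msg3", "msg1", "msg3"]})
    reloaded = load_state(state_path)

print(reloaded)
```

Sorting the IDs before writing keeps each run's commit to a few added lines rather than a reshuffled file.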

## First Run: Historical Backfill

On first run, `state.json` doesn't exist. Without special handling, every reachable historical post would be treated as "new" — potentially spamming dozens of GitHub issue comments and generating a massive digest mixing years of content.

**The monitor operates in two modes:**

### Backfill Mode (first run, manual trigger)

Triggered manually via `workflow_dispatch` with `mode: backfill`:

1. **PAGINATE** — Walk available forum history via RSS pagination (depth depends on what SMF exposes)
2. **CAPTURE** — Store attribution metadata for every post
3. **MATCH** — Run all posts through matcher.py
4. **REPORT** — Generate a one-time backfill report (**NOT** auto-comment on issues). Split into Matched, Triage, and Statistics sections.
5. **SEED** — Write all seen post IDs to state.json
6. **PUBLISH** — Post backfill report as a GitHub Discussion for human review

**Key difference from steady-state:** Backfill mode **never auto-comments on GitHub issues**. All matches go into the report for human review. This prevents noise from old posts, outdated context, and false positives from historical content.

### Safety: Backfill Must Run Before Steady-State

The steady-state cron checks for `state.json` (or a `backfill_complete` flag). If missing, the workflow exits with a warning rather than accidentally treating all reachable history as new.
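The guard described above might look like the following sketch. It checks both conditions (file present and flag set), which is an assumption since the text allows either one; the `backfill_complete` key name is also assumed.

```python
import json
import tempfile
from pathlib import Path


def steady_state_allowed(state_path):
    """Return (ok, reason); ok stays False until a backfill has seeded state."""
    p = Path(state_path)
    if not p.exists():
        return False, "state.json missing: run backfill first"
    state = json.loads(p.read_text(encoding="utf-8"))
    if not state.get("backfill_complete", False):
        return False, "backfill_complete flag not set: run backfill first"
    return True, "ok"


# Demonstrate both outcomes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    guard_path = Path(tmp) / "state.json"
    before = steady_state_allowed(guard_path)
    guard_path.write_text(
        json.dumps({"backfill_complete": True, "seen_ids": []}),
        encoding="utf-8",
    )
    after = steady_state_allowed(guard_path)

print(before, after)
```

The cron entry point would print the reason as a warning and exit without processing any posts when `ok` is false.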

### Steady-State Mode (daily cron)

Runs automatically after backfill is complete. Only processes posts newer than the last run.

## What Happens When a New Post Is Detected

### End-to-End Flow

```
1. FETCH     — Daily cron pulls new posts from Aurora Forums via RSS
2. DEDUP     — Check post ID against state.json, skip if already seen
3. CAPTURE   — Store attribution metadata (author, date, topic URL,
               post anchor, topic title, board name)
4. MATCH     — Run post text through matcher.py against two corpora:
               a. Open GitHub issues labeled "unverified" (title + body keywords)
               b. Manual section terminology and numeric values
5. ROUTE     — Based on match result:
               → Issue match found:     comment on GitHub issue + add to digest
               → Relevant, no match:    add to digest Triage section for human review
               → Not relevant:          record as seen, no action
6. STATE     — Mark post ID as seen in state.json
7. DIGEST    — Weekly: compile all matches and triage items into markdown summary
```
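The DEDUP, MATCH, ROUTE and STATE steps above can be sketched as a single loop. The `match_fn` callable, the post dict keys, and the action names are hypothetical placeholders for whatever `matcher.py` ends up exposing.

```python
def route(issue_number, relevant):
    """Map a match result to one of the three ROUTE outcomes."""
    if issue_number is not None:
        return "comment_and_digest"
    if relevant:
        return "triage"
    return "ignore"


def process(posts, seen_ids, match_fn):
    """Run DEDUP, MATCH, ROUTE and STATE over posts; return the actions taken.

    match_fn is a hypothetical callable returning
    (issue_number_or_None, relevant_bool) for a post.
    """
    actions = []
    for post in posts:
        if post["id"] in seen_ids:                 # DEDUP
            continue
        issue_number, relevant = match_fn(post)    # MATCH
        actions.append((post["id"], route(issue_number, relevant)))  # ROUTE
        seen_ids.add(post["id"])                   # STATE
    return actions


def demo_match(post):
    """Toy matcher: 'sensor' maps to a made-up issue number 101."""
    if "sensor" in post["text"]:
        return 101, True
    return None, "Aurora" in post["text"]


seen = {"msg1"}
demo_posts = [
    {"id": "msg1", "text": "already seen"},
    {"id": "msg2", "text": "sensor range formula"},
    {"id": "msg3", "text": "Aurora install help"},
    {"id": "msg4", "text": "off topic"},
]
actions = process(demo_posts, seen, demo_match)
print(actions)
```

Posts routed to `ignore` are still added to `seen_ids`, matching the "record as seen, no action" outcome.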

### On Issue Match

When a post matches an open `unverified` issue:

1. Post a comment on the matching GitHub issue with:
   - Forum permalink and author attribution
   - Relevant quote from the post
   - Authority level (#2 for Steve Walmsley, #3 for changelogs, #5 for community)
   - Match confidence indicator
2. Add to the **Matched** section of the weekly digest with full attribution
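A sketch of how the comment body could be assembled from those four elements. The layout and the `source_kind` labels are assumptions; only the authority numbers (#2, #3, #5) come from the list above.

```python
# Authority levels as listed above; the dictionary keys are hypothetical labels.
AUTHORITY = {"developer": 2, "changelog": 3, "community": 5}


def format_issue_comment(author, permalink, date, quote, source_kind, confidence):
    """Build a markdown comment with attribution, quote, authority and confidence."""
    level = AUTHORITY[source_kind]
    quoted = "\n".join("> " + line for line in quote.splitlines())
    return (
        f"**Forum post by {author}** ({date}): {permalink}\n\n"
        f"{quoted}\n\n"
        f"Authority level: #{level} | Match confidence: {confidence}"
    )


comment = format_issue_comment(
    author="Steve Walmsley",
    permalink="https://example.invalid/topic=1.msg2#msg2",
    date="2024-01-01",
    quote="First line of the quote.\nSecond line.",
    source_kind="developer",
    confidence="High",
)
print(comment)
```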

### On Relevant Content Without an Issue Match

Not all valuable content maps to an existing unverified claim. Four scenarios:

| Category | Description | Example |
|----------|-------------|---------|
| **Potential correction** | Contradicts something the manual states as verified | Steve posts "actually, the formula is X not Y" |
| **New coverage** | Topic the manual doesn't address at all | Developer describes a mechanic with no manual section |
| **Version change** | Describes v2.8.0+ behavior differing from v2.7.1 baseline | Changelog entry changing a known formula |
| **Expanded detail** | Adds depth to an existing section without contradicting it | Developer clarifying edge case behavior |

These go into the **Triage** section of the weekly digest. Each entry includes:
- Full attribution metadata (author, permalink, date, authority level)
- Suggested category (correction / new coverage / version change / expanded detail)
- The matched manual section (if any)
- Relevant quote from the post

A human reviews the triage section and either:
- **Creates a new issue** if the content is actionable
- **Dismisses** if it's noise or already known
- **Routes to an existing issue** that the matcher missed

**No automatic issue creation** for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.

### On No Relevance

Post is recorded in `state.json` as seen. No action taken, no noise generated.

## Attribution (MANDATORY)

**Every forum post used as a verification source or content reference must be properly attributed.**

When forum content is incorporated into the manual:

1. **Identify the author and post** — capture username, post date, topic URL, and specific post anchor
2. **Add a manual reference** in the relevant file's References section:
   ```
   \hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — [topic URL] — [Author username], [post date] — [specific detail verified]
   ```
3. **Developer posts (Steve Walmsley)** use authority level #2:
   ```
   \hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — Steve Walmsley — [topic URL] — [specific mechanic confirmed]
   ```
4. **Community posts** use authority level #5 and should note the verification method (testing, observation, code analysis)
5. **Credit in issue comments** — when closing a verification issue based on forum content, include the source post URL and author in the closing comment
6. **Digest attribution** — every digest entry must include the post author, date, and permalink so the source can be traced

**The monitor should automatically capture and store attribution metadata** (author, date, URL, topic title) for every matched post so that manual incorporation never loses provenance.
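As an illustration, stored metadata could be turned into a reference line in the general format from step 2. The function name and parameters are hypothetical, and the developer variant from step 3 (author before URL) is not covered here.

```python
SEP = " \u2014 "  # the dash separator used in the reference template above


def format_reference(section, n, url, author, date, detail):
    """Render one References entry in the step 2 format."""
    target = f"\\hypertarget{{ref-{section}-{n}}}{{[{n}]}}."
    return target + " " + SEP.join(
        ["Aurora Forums", url, f"{author}, {date}", detail]
    )


ref = format_reference(
    section="13.1",
    n=4,
    url="https://example.invalid/topic",
    author="someuser",
    date="2024-01-01",
    detail="sample detail",
)
print(ref)
```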

## Output Destinations

| Output | Destination | Frequency |
|--------|------------|-----------|
| **Issue match comments** | Posted directly on the matching GitHub issue | Each daily run |
| **Weekly digest** | GitHub Discussion in a "Monitor Digests" category | Weekly (Sunday) |
| **Backfill report** | GitHub Discussion in "Monitor Digests" category | One-time (first run) |
| **State file** | `aurora-monitor/state.json` on the main branch | Each run |
| **Run logs** | GitHub Actions log (visible in Actions tab) | Each run |

### Weekly Digest Format

Posted as a GitHub Discussion with the title `[Forum Monitor] Week of YYYY-MM-DD`:

```markdown
## Matched (N items)
Posts matching open unverified issues. Auto-commented on the relevant issues.

| Issue | Post Author | Date | Quote | Confidence |
|-------|------------|------|-------|------------|
| #NNN  | username   | date | "..." | High/Medium |

## Triage (N items)
Relevant posts not matching any open issue. Requires human review.

| Category | Post Author | Date | Manual Section | Quote |
|----------|------------|------|---------------|-------|
| Correction | username | date | 13.1 | "..." |

## Statistics
- Posts scanned: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N
```
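A sketch of how `digest.py` might render this format. The dict keys on each entry are hypothetical, and the skipped count assumes every scanned post is either matched, triaged, or skipped.

```python
def _row(cells):
    """Format one markdown table row."""
    return "| " + " | ".join(str(c) for c in cells) + " |"


def render_digest(matched, triage, scanned):
    """Render the Matched, Triage and Statistics sections as markdown."""
    out = [
        f"## Matched ({len(matched)} items)",
        "Posts matching open unverified issues. Auto-commented on the relevant issues.",
        "",
        _row(["Issue", "Post Author", "Date", "Quote", "Confidence"]),
        _row(["---"] * 5),
    ]
    for m in matched:
        out.append(_row([f"#{m['issue']}", m["author"], m["date"],
                         '"' + m["quote"] + '"', m["confidence"]]))
    out += [
        "",
        f"## Triage ({len(triage)} items)",
        "Relevant posts not matching any open issue. Requires human review.",
        "",
        _row(["Category", "Post Author", "Date", "Manual Section", "Quote"]),
        _row(["---"] * 5),
    ]
    for t in triage:
        out.append(_row([t["category"], t["author"], t["date"],
                         t["section"], '"' + t["quote"] + '"']))
    # Assumption: every scanned post is matched, triaged, or skipped.
    skipped = scanned - len(matched) - len(triage)
    out += [
        "",
        "## Statistics",
        f"- Posts scanned: {scanned}",
        f"- Matches: {len(matched)}",
        f"- Triage items: {len(triage)}",
        f"- Skipped (not relevant): {skipped}",
    ]
    return "\n".join(out)


digest = render_digest(
    matched=[{"issue": 101, "author": "user_a", "date": "2024-01-01",
              "quote": "sample", "confidence": "High"}],
    triage=[{"category": "Correction", "author": "user_b", "date": "2024-01-02",
             "section": "13.1", "quote": "sample"}],
    scanned=10,
)
print(digest)
```

Quotes containing a `|` character would break the table, so a real implementation would need to escape or truncate them.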

### GitHub Discussions Category

Create a "Monitor Digests" category in the repository for all four monitors to post to. Each monitor prefixes its digest title with its source name (e.g., `[Forum Monitor]`, `[Reddit Monitor]`).

## Design Decisions

The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):

### Matcher Algorithm: Keyword + Fuzzy

The shared `matcher.py` uses keyword matching with fuzzy string similarity (e.g., `fuzzywuzzy` / `thefuzz`). Simple, fast, transparent, and easy to debug. Each open `unverified` issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
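A rough sketch of keyword scoring with fuzzy similarity. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `thefuzz`, so scores will not be identical to what `thefuzz` produces. The sliding-window approach and function names are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_score(keyword, text):
    """Best 0-100 similarity between keyword and any same-length word window."""
    words = text.lower().split()
    size = len(keyword.split())
    best = 0.0
    for i in range(max(1, len(words) - size + 1)):
        window = " ".join(words[i:i + size])
        ratio = SequenceMatcher(None, keyword.lower(), window).ratio()
        best = max(best, ratio)
    return round(best * 100)


def score_post(keywords, text, threshold=80):
    """Return (best score, keywords scoring at or above threshold)."""
    scores = {kw: fuzzy_score(kw, text) for kw in keywords}
    hits = [kw for kw, s in scores.items() if s >= threshold]
    return max(scores.values(), default=0), hits


best, hits = score_post(
    ["jump drive", "fuel consumption"],
    "Steve confirmed that jump drives use no fuel",
)
print(best, hits)
```

Here the plural "jump drives" still matches the keyword "jump drive", which is the kind of minor variation fuzzy matching is meant to absorb.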

### Match Confidence Scoring: Multiple Signals

Match confidence is determined by combining multiple signals:

| Signal | Weight | Example |
|--------|--------|---------|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |

**Thresholds:**
- **High:** Fuzzy score >= 80 AND (2+ keywords OR known author)
- **Medium:** Fuzzy score 60-79 OR single strong keyword match
- **Low:** Fuzzy score 40-59, included in digest but not auto-commented on issues

Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
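The thresholds can be sketched as a small classifier. Several details are assumptions because the text leaves them open: a score of 80+ that fails the keyword/author condition falls back to Medium, the multi-issue penalty lowers confidence by one level, and `strong_keyword` is a hypothetical flag for "single strong keyword match". The sketch follows the Thresholds list rather than the "auto-High" shorthand in the signal table.

```python
def classify(score, keyword_hits, known_author=False,
             strong_keyword=False, issues_matched=1):
    """Return 'High', 'Medium', 'Low', or None for a candidate match."""
    if score >= 80 and (keyword_hits >= 2 or known_author):
        level = "High"
    elif score >= 60 or strong_keyword:
        level = "Medium"
    elif score >= 40:
        level = "Low"
    else:
        level = None
    # Assumed penalty: matching 3+ issues drops confidence by one level.
    if issues_matched >= 3 and level in ("High", "Medium"):
        level = {"High": "Medium", "Medium": "Low"}[level]
    return level


def auto_comment(level):
    """Only High and Medium matches trigger an issue comment."""
    return level in ("High", "Medium")


print(classify(85, 2), classify(65, 0), classify(45, 0))
```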

### Repo Structure: Same Repo (Manual)

The `aurora-monitor/` folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.



### State Persistence: Dedicated Folder on Main Branch

`state.json` lives in a dedicated `aurora-monitor/` folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.

### Weekly Digest Timing: Same Workflow, Sunday Check

The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
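A minimal sketch of the Sunday check and the digest title. Using the Monday that starts the week as the date in the title is an assumption; the issue only specifies the `Week of YYYY-MM-DD` pattern.

```python
from datetime import date, timedelta


def is_digest_day(today):
    """True on Sundays (date.weekday() counts Monday as 0, Sunday as 6)."""
    return today.weekday() == 6


def digest_title(today, source="Forum Monitor"):
    """Title for the week containing today, dated by that week's Monday."""
    week_start = today - timedelta(days=today.weekday())
    return f"[{source}] Week of {week_start.isoformat()}"


sunday = date(2024, 1, 7)
print(is_digest_day(sunday), digest_title(sunday))
```

Since the cron fires at 08:00 UTC, the workflow would want to derive `today` from UTC rather than the runner's local time.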

### Authentication: GITHUB_TOKEN with Elevated Permissions

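Workflows use the default `GITHUB_TOKEN` with explicit `permissions` set in the workflow YAML: `discussions: write` for digests, plus `issues: write` for issue comments and `contents: write` for committing `state.json`. Declaring a `permissions` block sets every unlisted scope to none, so all three must be named. No PAT or GitHub App required.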
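For the issue-comment side, a sketch of building the REST request with the standard library. It only constructs the request and does not send it. The helper name is hypothetical and the token value is a placeholder; in a workflow the token would come from the environment.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_comment_request(repo, issue_number, body, token):
    """Build (but do not send) a POST that adds a comment to an issue."""
    url = f"{API_ROOT}/repos/{repo}/issues/{issue_number}/comments"
    data = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


req = build_comment_request(
    "ErikEvenson/aurora-manual", 1288, "test comment", "dummy-token"
)
print(req.get_method(), req.full_url)
```

Sending would be a `urllib.request.urlopen(req)` call, ideally skipped in dry-run mode.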

### Error Handling: Retry with Backoff, Then Skip

On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
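The retry policy might be sketched as below. It reads "retry up to 3 times" as one initial attempt plus three retries, which is an interpretation; the base delay and the injectable `sleep` parameter are assumptions added to make the function testable.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure retry with exponential backoff.

    Returns the result, or None if the source is skipped after all retries.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == retries:
                print(f"source skipped after {retries} retries: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults


# Demo: a source that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("forum down")
    return "feed"


def always_down():
    raise ConnectionError("forum down")


delays = []
result = fetch_with_retry(flaky, sleep=delays.append)
failed_delays = []
failed = fetch_with_retry(always_down, sleep=failed_delays.append)
print(result, delays, failed, failed_delays)
```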

### Testing Strategy: Fixtures + Dry-Run

- **Sample data fixtures:** Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
- **Dry-run mode:** A `--dry-run` flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
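A sketch of how the `--dry-run` flag could gate side effects. The `--mode` option mirrors the `mode: backfill` input mentioned earlier; the `steady` value, the program name, and the `send` callable are hypothetical.

```python
import argparse


def build_parser():
    """Command-line options for the monitor entry point."""
    parser = argparse.ArgumentParser(prog="forum-monitor")
    parser.add_argument("--mode", choices=["steady", "backfill"], default="steady")
    parser.add_argument("--dry-run", action="store_true",
                        help="process live data but only log what would be posted")
    return parser


def post_comment(issue_number, body, dry_run, send):
    """Send a comment unless dry_run is set; return True if it was sent."""
    if dry_run:
        print(f"[dry-run] would comment on #{issue_number}: {body}")
        return False
    send(issue_number, body)
    return True


args = build_parser().parse_args(["--dry-run"])
sent = []
posted = post_comment(101, "sample", args.dry_run,
                      lambda n, b: sent.append((n, b)))
print(args.mode, args.dry_run, posted, sent)
```

Routing every write through one gate like `post_comment` keeps dry-run behavior in a single place instead of scattered `if` checks.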


### Prerequisites

Before the first run, create the **Monitor Digests** discussion category manually (GitHub does not support programmatic category creation):

1. Go to **github.com/ErikEvenson/aurora-manual/discussions**
2. Click the pencil icon next to "Categories" (or Settings > Discussions)
3. Click **New category**
4. Name: `Monitor Digests`
5. Description: `Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.`
6. Format: **Announcement** (only maintainers/workflows can post, others can comment)
7. Record the category ID for use in workflow config

**Created.** Category ID: `DIC_kwDORAJjec4C2k26` | Slug: `monitor-digests`

## Proposed Architecture

```
aurora-monitor/
  sources/
    forum.py          # RSS fetch, rate-limited
  matcher.py          # Shared — match against issues and manual claims
  digest.py           # Shared — generate markdown digest (Matched + Triage sections)
  state.json          # Track last-seen post IDs (avoid duplicates)
  config.yaml         # Boards, users, keywords to watch
```

## Blocked By

- Aurora Forums (aurora2.pentarch.org) are currently down. RSS URL structure and board numbers need to be confirmed when the forum returns.

## Implementation Notes

- The script structure, matching logic, and issue integration can be built and tested against sample data before the forum comes back
- Matching logic should use the manual's existing `*(unverified — #NNN)*` markers and issue titles as the keyword corpus
- Issue comments on matches are posted through the GitHub API using the workflow's `GITHUB_TOKEN`

## Related

- #1289 — YouTube monitor
- #1290 — Reddit monitor
- #1291 — Discord integration







