Build Aurora Forums monitor for automated verification and content updates #1288

@ErikEvenson

Summary

Build a script that periodically scans the Aurora Forums (aurora2.pentarch.org) for new content relevant to the manual, matches it against open verification issues, and generates digests of actionable updates.

Steve Walmsley forum posts are the #2 verification source after the game database. Currently we only find relevant posts when manually searching for a specific claim. Automated monitoring would surface new mechanics clarifications, changelog entries, and bug confirmations as they happen.

What to Monitor

| Source | Why | Priority |
|--------|-----|----------|
| Steve Walmsley posts (all) | Developer statements are authoritative for game logic | Highest |
| v2.8.0 Changes List (topic 13884) | Active changelog, directly affects version-specific content | High |
| Bug report threads (confirmed bugs) | Clarify how mechanics actually work vs assumptions | Medium |
| Mechanics discussion threads | Community testing results that could verify/contradict unverified claims | Lower |

How to Monitor

SMF (Simple Machines Forum) supports RSS natively:

  • aurora2.pentarch.org/index.php?action=.xml;type=rss — recent posts feed
  • aurora2.pentarch.org/index.php?action=.xml;type=rss;board=N — per-board feeds

RSS is the polite approach — no authentication, minimal server load, explicitly offered by the forum software.
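
As a rough sketch of what consuming such a feed could look like, the snippet below parses an RSS 2.0 document with the standard library. The function name `parse_rss_items`, the sample feed, and the choice of fields are illustrative assumptions; the actual element layout of the forum's feed still has to be confirmed once the forum is reachable.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "guid": item.findtext("guid", default=""),
            "pub_date": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


# Hypothetical sample feed, NOT captured from the real forum.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Sample</title>
<item><title>Re: Example topic</title>
<link>https://example.invalid/index.php?topic=1.msg2#msg2</link>
<guid>https://example.invalid/index.php?topic=1.msg2#msg2</guid>
<pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
<description>Example body</description></item>
</channel></rss>"""

for entry in parse_rss_items(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```

Keeping the parser separate from the network fetch means it can be tested against saved fixtures while the forum is down.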

Rate Limiting / Good Citizenship

  • RSS only (no aggressive HTML scraping)
  • Once-daily fetch at most
  • Cache everything locally
  • Respect robots.txt
  • No authentication circumvention
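
One way the robots.txt rule could be enforced before each fetch is sketched below with the standard library's `urllib.robotparser`. The user agent string and the `is_allowed` helper are hypothetical names, and the robots.txt text is passed in as a string so the check can be exercised offline.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "aurora-manual-monitor"  # hypothetical user agent string


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; the forum's real robots.txt may differ.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(EXAMPLE_ROBOTS, "https://example.invalid/index.php?action=.xml;type=rss"))
print(is_allowed(EXAMPLE_ROBOTS, "https://example.invalid/private/page"))
```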

Execution Environment

Runs as a GitHub Actions scheduled workflow on the public repository.

# .github/workflows/forum-monitor.yml
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8am UTC

Why GitHub Actions:

  • Free for public repos (the manual repo is public)
  • Native GITHUB_TOKEN for commenting on issues and posting digests
  • Built-in secrets management for any API keys
  • No infrastructure to provision or maintain
  • Logs and run history visible in the Actions tab

State persistence: state.json committed to the aurora-monitor/ folder on the main branch to track seen post IDs across runs.

First Run: Historical Backfill

On first run, state.json doesn't exist. Without special handling, every reachable historical post would be treated as "new" — potentially spamming dozens of GitHub issue comments and generating a massive digest mixing years of content.

The monitor operates in two modes:

Backfill Mode (first run, manual trigger)

Triggered manually via workflow_dispatch with mode: backfill:

  1. PAGINATE — Walk available forum history via RSS pagination (depth depends on what SMF exposes)
  2. CAPTURE — Store attribution metadata for every post
  3. MATCH — Run all posts through matcher.py
  4. REPORT — Generate a one-time backfill report (NOT auto-comment on issues). Split into Matched, Triage, and Statistics sections.
  5. SEED — Write all seen post IDs to state.json
  6. PUBLISH — Post backfill report as a GitHub Discussion for human review

Key difference from steady-state: Backfill mode never auto-comments on GitHub issues. All matches go into the report for human review. This prevents noise from old posts, outdated context, and false positives from historical content.

Safety: Backfill Must Run Before Steady-State

The steady-state cron checks for state.json (or a backfill_complete flag). If missing, the workflow exits with a warning rather than accidentally treating all reachable history as new.
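
A minimal sketch of that guard follows. It assumes the `backfill_complete` flag is stored inside state.json itself and that both the file and the flag are required, which is one possible reading of the rule above; the function name and the state layout are illustrative.

```python
import json
import sys
from pathlib import Path


def backfill_done(state_path):
    """Return True only if state.json exists and marks the backfill as complete."""
    path = Path(state_path)
    if not path.exists():
        return False
    state = json.loads(path.read_text(encoding="utf-8"))
    return bool(state.get("backfill_complete", False))


def guard_steady_state(state_path):
    """Exit with a warning instead of treating all reachable history as new."""
    if not backfill_done(state_path):
        print("warning: backfill has not run; skipping steady-state scan")
        sys.exit(0)
```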

Steady-State Mode (daily cron)

Runs automatically after backfill is complete. Only processes posts newer than the last run.

What Happens When a New Post Is Detected

End-to-End Flow

1. FETCH     — Daily cron pulls new posts from Aurora Forums via RSS
2. DEDUP     — Check post ID against state.json, skip if already seen
3. CAPTURE   — Store attribution metadata (author, date, topic URL,
               post anchor, topic title, board name)
4. MATCH     — Run post text through matcher.py against two corpora:
               a. Open GitHub issues labeled "unverified" (title + body keywords)
               b. Manual section terminology and numeric values
5. ROUTE     — Based on match result:
               → Issue match found:     comment on GitHub issue + add to digest
               → Relevant, no match:    add to digest Triage section for human review
               → Not relevant:          record as seen, no action
6. STATE     — Mark post ID as seen in state.json
7. DIGEST    — Weekly: compile all matches and triage items into markdown summary
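
The DEDUP, ROUTE, and STATE steps above could be sketched roughly as follows. The `process_posts` function, the post dict shape, and the three outcome labels are assumptions made for illustration; `classify` stands in for whatever matcher.py ends up exposing.

```python
def process_posts(posts, seen_ids, classify):
    """Route unseen posts and record them as seen.

    classify(post) is expected to return "issue_match", "relevant",
    or "not_relevant" (hypothetical labels for the three routes).
    """
    actions = {"comment": [], "triage": [], "skipped": []}
    for post in posts:
        post_id = post["id"]
        if post_id in seen_ids:           # DEDUP: already handled in an earlier run
            continue
        outcome = classify(post)          # MATCH
        if outcome == "issue_match":      # ROUTE: comment on issue + digest
            actions["comment"].append(post_id)
        elif outcome == "relevant":       # ROUTE: digest Triage section
            actions["triage"].append(post_id)
        else:                             # ROUTE: no action
            actions["skipped"].append(post_id)
        seen_ids.add(post_id)             # STATE: mark as seen
    return actions
```

Marking a post as seen only after it has been routed means a run that crashes midway would reprocess the remaining posts rather than silently dropping them.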

On Issue Match

When a post matches an open unverified issue:

  1. Post a comment on the matching GitHub issue with:
  2. Add to the Matched section of the weekly digest with full attribution

On Relevant Content Without an Issue Match

Not all valuable content maps to an existing unverified claim. Four scenarios:

| Category | Description | Example |
|----------|-------------|---------|
| Potential correction | Contradicts something the manual states as verified | Steve posts "actually, the formula is X not Y" |
| New coverage | Topic the manual doesn't address at all | Developer describes a mechanic with no manual section |
| Version change | Describes v2.8.0+ behavior differing from v2.7.1 baseline | Changelog entry changing a known formula |
| Expanded detail | Adds depth to an existing section without contradicting it | Developer clarifying edge case behavior |

These go into the Triage section of the weekly digest. Each entry includes:

  • Full attribution metadata (author, permalink, date, authority level)
  • Suggested category (correction / new coverage / version change / expanded detail)
  • The matched manual section (if any)
  • Relevant quote from the post

A human reviews the triage section and either:

  • Creates a new issue if the content is actionable
  • Dismisses if it's noise or already known
  • Routes to an existing issue that the matcher missed

No automatic issue creation for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.

On No Relevance

Post is recorded in state.json as seen. No action taken, no noise generated.

Attribution (MANDATORY)

Every forum post used as a verification source or content reference must be properly attributed.

When forum content is incorporated into the manual:

  1. Identify the author and post — capture username, post date, topic URL, and specific post anchor
  2. Add a manual reference in the relevant file's References section:
    \hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — [topic URL] — [Author username], [post date] — [specific detail verified]
    
  3. Developer posts (Steve Walmsley) use authority level #2:
    \hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — Steve Walmsley — [topic URL] — [specific mechanic confirmed]
    
  4. Community posts use authority level #5 and should note the verification method (testing, observation, code analysis)
  5. Credit in issue comments — when closing a verification issue based on forum content, include the source post URL and author in the closing comment
  6. Digest attribution — every digest entry must include the post author, date, and permalink so the source can be traced

The monitor should automatically capture and store attribution metadata (author, date, URL, topic title) for every matched post so that manual incorporation never loses provenance.
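
A possible shape for the captured metadata, together with a formatter for the community-post reference line shown in step 2, is sketched below. The `Attribution` class, its field names, and `reference_line` are hypothetical; only the output format follows the template above.

```python
from dataclasses import dataclass

SEP = " \u2014 "  # separator used in the manual's reference format


@dataclass(frozen=True)
class Attribution:
    """Provenance captured for every matched post (field names are assumptions)."""
    author: str
    post_date: str
    topic_url: str
    post_anchor: str
    topic_title: str
    board_name: str


def reference_line(ref_id, number, attribution, detail):
    """Format a References entry following the community-post template."""
    head = f"\\hypertarget{{{ref_id}}}{{[{number}]}}. Aurora Forums"
    who = f"{attribution.author}, {attribution.post_date}"
    return SEP.join([head, attribution.topic_url, who, detail])


# Placeholder values for illustration only.
example = Attribution("username", "2024-01-01", "https://example.invalid/topic=1",
                      "#msg2", "Example topic", "Example board")
print(reference_line("ref-1.2-3", 3, example, "example detail"))
```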

Output Destinations

| Output | Destination | Frequency |
|--------|-------------|-----------|
| Issue match comments | Posted directly on the matching GitHub issue | Each daily run |
| Weekly digest | GitHub Discussion in the "Monitor Digests" category | Weekly (Sunday) |
| Backfill report | GitHub Discussion in the "Monitor Digests" category | One-time (first run) |
| State file | aurora-monitor/state.json on the main branch | Each run |
| Run logs | GitHub Actions log (visible in Actions tab) | Each run |

Weekly Digest Format

Posted as a GitHub Discussion with the title [Forum Monitor] Week of YYYY-MM-DD:

## Matched (N items)
Posts matching open unverified issues. Auto-commented on the relevant issues.

| Issue | Post Author | Date | Quote | Confidence |
|-------|------------|------|-------|------------|
| #NNN  | username   | date | "..." | High/Medium |

## Triage (N items)
Relevant posts not matching any open issue. Requires human review.

| Category | Post Author | Date | Manual Section | Quote |
|----------|------------|------|---------------|-------|
| Correction | username | date | 13.1 | "..." |

## Statistics
- Posts scanned: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N
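
A sketch of how digest.py might render this template is shown below. The function name and the dict keys (`issue`, `author`, `date`, `quote`, `confidence`, `category`, `section`) are assumptions chosen to mirror the table columns, not an agreed interface.

```python
def render_digest(matched, triage, scanned, skipped):
    """Render the weekly digest body as markdown from lists of entry dicts."""
    lines = [
        f"## Matched ({len(matched)} items)",
        "",
        "| Issue | Post Author | Date | Quote | Confidence |",
        "|-------|------------|------|-------|------------|",
    ]
    for m in matched:
        quote = '"' + m["quote"] + '"'
        lines.append(
            f"| #{m['issue']} | {m['author']} | {m['date']} | {quote} | {m['confidence']} |"
        )
    lines += [
        "",
        f"## Triage ({len(triage)} items)",
        "",
        "| Category | Post Author | Date | Manual Section | Quote |",
        "|----------|------------|------|---------------|-------|",
    ]
    for t in triage:
        quote = '"' + t["quote"] + '"'
        lines.append(
            f"| {t['category']} | {t['author']} | {t['date']} | {t['section']} | {quote} |"
        )
    lines += [
        "",
        "## Statistics",
        f"- Posts scanned: {scanned}",
        f"- Matches: {len(matched)}",
        f"- Triage items: {len(triage)}",
        f"- Skipped (not relevant): {skipped}",
    ]
    return "\n".join(lines)
```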

GitHub Discussions Category

Create a "Monitor Digests" category in the repository for all four monitors to post to. Each monitor prefixes its digest title with its source name (e.g., [Forum Monitor], [Reddit Monitor]).

Design Decisions

The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):

Matcher Algorithm: Keyword + Fuzzy

The shared matcher.py uses keyword matching with fuzzy string similarity (e.g., fuzzywuzzy / thefuzz). Simple, fast, transparent, and easy to debug. Each open unverified issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
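
To illustrate the idea without committing to a dependency, the sketch below scores a keyword against a post using the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy / thefuzz. Its scores will not be identical to those libraries, and the sliding-window approach and the `fuzzy_score` name are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_score(keyword, text):
    """Return a 0-100 score: best similarity between the keyword and any
    window of the same number of words in the text."""
    key_words = keyword.lower().split()
    words = text.lower().split()
    if not key_words or not words:
        return 0
    size = len(key_words)
    target = " ".join(key_words)
    best = 0.0
    # Slide a window of `size` words across the text and keep the best ratio.
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return round(best * 100)


print(fuzzy_score("jump point", "The jump points are stable"))
```

Comparing against same-sized windows keeps a short keyword from being diluted by a long post, which is roughly the role partial matching plays in the fuzzy-matching libraries named above.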

Match Confidence Scoring: Multiple Signals

Match confidence is determined by combining multiple signals:

| Signal | Weight | Example |
|--------|--------|---------|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |

Thresholds:

  • High: Fuzzy score >= 80 AND (2+ keywords OR known author)
  • Medium: Fuzzy score 60-79 OR single strong keyword match
  • Low: Fuzzy score 40-59, included in digest but not auto-commented on issues

Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
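
The thresholds above could be encoded roughly as follows. Two points are assumptions rather than decisions recorded here: a score of 80 or more that misses the High conditions is treated as the "single strong keyword match" case and rated Medium, and the multiple-issue penalty is modelled as a one-level downgrade.

```python
LEVELS = ["Low", "Medium", "High"]


def confidence(fuzzy, keyword_hits, known_author=False, issue_matches=1):
    """Map matcher signals to "High", "Medium", "Low", or None (no match)."""
    if fuzzy >= 80 and (keyword_hits >= 2 or known_author):
        level = "High"
    elif fuzzy >= 60:
        level = "Medium"   # 60-79, or a strong score without supporting signals
    elif fuzzy >= 40:
        level = "Low"
    else:
        return None
    # Assumed penalty: a post matching 3+ issues is likely generic.
    if issue_matches >= 3 and level != "Low":
        level = LEVELS[LEVELS.index(level) - 1]
    return level


def should_auto_comment(level):
    """Only High and Medium matches trigger issue comments."""
    return level in ("High", "Medium")
```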

Repo Structure: Same Repo (Manual)

The aurora-monitor/ folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.

State Persistence: Dedicated Folder on Main Branch

state.json lives in a dedicated aurora-monitor/ folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.

Weekly Digest Timing: Same Workflow, Sunday Check

The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
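
The Sunday check itself is small; a sketch, assuming the day is evaluated in UTC to match the cron schedule (the function name is illustrative):

```python
from datetime import date, datetime, timezone


def is_digest_day(today=None):
    """Return True on Sundays; defaults to the current UTC date."""
    today = today or datetime.now(timezone.utc).date()
    return today.weekday() == 6  # Monday is 0, Sunday is 6


print(is_digest_day(date(2024, 1, 7)))  # 2024-01-07 was a Sunday
```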

Authentication: GITHUB_TOKEN with Elevated Permissions

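Workflows use the default GITHUB_TOKEN with the permissions it needs declared in the workflow YAML: discussions: write for posting digests, issues: write for commenting on matched issues, and contents: write for committing state.json. No PAT or GitHub App required.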

Error Handling: Retry with Backoff, Then Skip

On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
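
A minimal sketch of this policy is shown below. It assumes "retry up to 3 times" means one initial attempt plus three retries, and the base delay, function name, and injectable `sleep` parameter are illustrative choices.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure retry with exponential backoff.

    Returns the fetch result, or None if every attempt failed (source skipped).
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == retries:
                # Log and skip; the workflow log is the only alerting channel.
                print(f"skipping source after {retries} retries: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
```

Passing `sleep` as a parameter lets the fixture-based tests run without real delays.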

Testing Strategy: Fixtures + Dry-Run

  • Sample data fixtures: Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
  • Dry-run mode: A --dry-run flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
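
The dry-run behaviour could be wired in along these lines. The `--mode` option, the `build_parser` and `publish` helpers, and the `post_comment` callback are hypothetical names; only the `--dry-run` flag comes from the description above.

```python
import argparse


def build_parser():
    """Command-line options for the monitor script (sketch)."""
    parser = argparse.ArgumentParser(description="Forum monitor (sketch)")
    parser.add_argument("--dry-run", action="store_true",
                        help="process live data but only log what would be posted")
    parser.add_argument("--mode", choices=["steady", "backfill"], default="steady")
    return parser


def publish(comment_body, dry_run, post_comment):
    """Post via post_comment unless dry_run is set; return True if posted."""
    if dry_run:
        print(f"[dry-run] would post: {comment_body}")
        return False
    post_comment(comment_body)
    return True
```

Routing every side effect through a single function like `publish` keeps the dry-run guarantee easy to audit.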

Prerequisites

Before the first run, create the Monitor Digests discussion category manually (GitHub does not support programmatic category creation):

  1. Go to github.com/ErikEvenson/aurora-manual/discussions
  2. Click the pencil icon next to "Categories" (or Settings > Discussions)
  3. Click New category
  4. Name: Monitor Digests
  5. Description: Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.
  6. Format: Announcement (only maintainers/workflows can post, others can comment)
  7. Record the category ID for use in workflow config

Created. Category ID: DIC_kwDORAJjec4C2k26 | Slug: monitor-digests
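
With the category ID recorded, a digest could be posted through GitHub's GraphQL `createDiscussion` mutation. The sketch below only builds the request body that would be sent to the GraphQL endpoint; the helper name is hypothetical, and the repository node ID is a placeholder that would have to be looked up separately.

```python
import json

CATEGORY_ID = "DIC_kwDORAJjec4C2k26"  # Monitor Digests category recorded above

CREATE_DISCUSSION = """
mutation($repositoryId: ID!, $categoryId: ID!, $title: String!, $body: String!) {
  createDiscussion(input: {repositoryId: $repositoryId, categoryId: $categoryId,
                           title: $title, body: $body}) {
    discussion { url }
  }
}
"""


def build_discussion_request(repository_id, title, body, category_id=CATEGORY_ID):
    """Return the JSON request body for a createDiscussion call."""
    return json.dumps({
        "query": CREATE_DISCUSSION,
        "variables": {
            "repositoryId": repository_id,
            "categoryId": category_id,
            "title": title,
            "body": body,
        },
    })


# "REPO_NODE_ID" is a placeholder, not a real repository node ID.
payload = build_discussion_request("REPO_NODE_ID", "[Forum Monitor] Week of 2024-01-07", "digest body")
```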

Proposed Architecture

aurora-monitor/
  sources/
    forum.py          # RSS fetch, rate-limited
  matcher.py          # Shared — match against issues and manual claims
  digest.py           # Shared — generate markdown digest (Matched + Triage sections)
  state.json          # Track last-seen post IDs (avoid duplicates)
  config.yaml         # Boards, users, keywords to watch

Blocked By

  • Aurora Forums (aurora2.pentarch.org) are currently down. RSS URL structure and board numbers need to be confirmed when the forum returns.

Implementation Notes

  • The script structure, matching logic, and issue integration can be built and tested against sample data before the forum comes back
  • Matching logic should use the manual's existing *(unverified — #NNN)* markers and issue titles as the keyword corpus
  • Use the GitHub API to comment on issues automatically when a High or Medium confidence match is found
