Summary
Build a script that periodically scans the Aurora Forums (aurora2.pentarch.org) for new content relevant to the manual, matches it against open verification issues, and generates digests of actionable updates.
Steve Walmsley forum posts are the #2 verification source after the game database. Currently we only find relevant posts when manually searching for a specific claim. Automated monitoring would surface new mechanics clarifications, changelog entries, and bug confirmations as they happen.
What to Monitor
| Source | Why | Priority |
|---|---|---|
| Steve Walmsley posts (all) | Developer statements are authoritative for game logic | Highest |
| v2.8.0 Changes List (topic 13884) | Active changelog, directly affects version-specific content | High |
| Bug report threads (confirmed bugs) | Clarify how mechanics actually work vs assumptions | Medium |
| Mechanics discussion threads | Community testing results that could verify/contradict unverified claims | Lower |
How to Monitor
SMF (Simple Machines Forum) supports RSS natively:
- aurora2.pentarch.org/index.php?action=.xml;type=rss (recent posts feed)
- aurora2.pentarch.org/index.php?action=.xml;type=rss;board=N (per-board feeds)
RSS is the polite approach — no authentication, minimal server load, explicitly offered by the forum software.
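A minimal sketch of the RSS fetch, using only the Python standard library. The `https://` scheme, the `FEED_URL` constant, the User-Agent string, and the function names are assumptions for illustration; the feed URL structure still has to be confirmed against the live forum, and the field names assume a standard RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

# Assumed URL: the scheme and exact query string are unconfirmed until the forum is reachable.
FEED_URL = "https://aurora2.pentarch.org/index.php?action=.xml;type=rss"


def parse_rss(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "guid": item.findtext("guid", default=""),
            "published": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


def fetch_feed(url=FEED_URL, timeout=30):
    """Download the feed once and parse it. Identifies itself with a hypothetical User-Agent."""
    request = Request(url, headers={"User-Agent": "aurora-manual-forum-monitor"})
    with urlopen(request, timeout=timeout) as response:
        return parse_rss(response.read().decode("utf-8", errors="replace"))
```

Keeping `parse_rss` separate from the network call lets the parser be tested against saved fixtures while the forum is unavailable.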
Rate Limiting / Good Citizenship
- RSS only (no aggressive HTML scraping)
- Once-daily fetch at most
- Cache everything locally
- Respect robots.txt
- No authentication circumvention
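The robots.txt rule could be enforced with the standard library's `urllib.robotparser`. This is a sketch only: the `USER_AGENT` value and the `is_allowed` helper are hypothetical names, and the robots.txt text is passed in as a string so the check can run against a cached copy.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "aurora-manual-forum-monitor"  # hypothetical identifier for this monitor


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A run would call `is_allowed` once per feed URL and skip any feed the forum disallows.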
Execution Environment
Runs as a GitHub Actions scheduled workflow on the public repository.
# .github/workflows/forum-monitor.yml
on:
  schedule:
    - cron: '0 8 * * *' # Daily at 8am UTC
Why GitHub Actions:
- Free for public repos (the manual repo is public)
- Native GITHUB_TOKEN for commenting on issues and posting digests
- Built-in secrets management for any API keys
- No infrastructure to provision or maintain
- Logs and run history visible in the Actions tab
State persistence: state.json committed to the aurora-monitor/ folder on the main branch to track seen post IDs across runs.
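One possible shape for the state file helpers is sketched below. The `seen_post_ids` key and the function names are assumptions; the issue only specifies that state.json tracks seen post IDs in the aurora-monitor/ folder.

```python
import json
from pathlib import Path

STATE_PATH = Path("aurora-monitor/state.json")


def load_state(path=STATE_PATH):
    """Return the saved state dict, or None if no state file exists yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(state, path=STATE_PATH):
    """Write state as stable, sorted JSON so committed diffs stay small."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2, sort_keys=True) + "\n", encoding="utf-8")


def mark_seen(state, post_id):
    """Record post_id under the assumed 'seen_post_ids' key; return True if it was new."""
    seen = set(state.get("seen_post_ids", []))
    if post_id in seen:
        return False
    seen.add(post_id)
    state["seen_post_ids"] = sorted(seen)
    return True
```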
First Run: Historical Backfill
On first run, state.json doesn't exist. Without special handling, every reachable historical post would be treated as "new" — potentially spamming dozens of GitHub issue comments and generating a massive digest mixing years of content.
The monitor operates in two modes:
Backfill Mode (first run, manual trigger)
Triggered manually via workflow_dispatch with mode: backfill:
- PAGINATE — Walk available forum history via RSS pagination (depth depends on what SMF exposes)
- CAPTURE — Store attribution metadata for every post
- MATCH — Run all posts through matcher.py
- REPORT — Generate a one-time backfill report (NOT auto-comment on issues). Split into Matched, Triage, and Statistics sections.
- SEED — Write all seen post IDs to state.json
- PUBLISH — Post backfill report as a GitHub Discussion for human review
Key difference from steady-state: Backfill mode never auto-comments on GitHub issues. All matches go into the report for human review. This prevents noise from old posts, outdated context, and false positives from historical content.
Safety: Backfill Must Run Before Steady-State
The steady-state cron checks for state.json (or a backfill_complete flag). If missing, the workflow exits with a warning rather than accidentally treating all reachable history as new.
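The guard could look like the sketch below. It takes the stricter reading and requires both the file and a true `backfill_complete` flag, which is an assumption; the function names are also hypothetical. The `::warning::` prefix is the GitHub Actions workflow command for surfacing a warning annotation.

```python
import json
from pathlib import Path


def backfill_complete(state_path):
    """True only if state.json exists, parses, and carries backfill_complete: true."""
    path = Path(state_path)
    if not path.exists():
        return False
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(state, dict) and state.get("backfill_complete") is True


def steady_state_guard(state_path="aurora-monitor/state.json"):
    """Return True if the daily run may proceed; otherwise emit a warning and return False."""
    if backfill_complete(state_path):
        return True
    print("::warning::state.json missing or backfill incomplete; skipping steady-state run")
    return False
```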
Steady-State Mode (daily cron)
Runs automatically after backfill is complete. Only processes posts newer than the last run.
What Happens When a New Post Is Detected
End-to-End Flow
1. FETCH — Daily cron pulls new posts from Aurora Forums via RSS
2. DEDUP — Check post ID against state.json, skip if already seen
3. CAPTURE — Store attribution metadata (author, date, topic URL,
post anchor, topic title, board name)
4. MATCH — Run post text through matcher.py against two corpora:
a. Open GitHub issues labeled "unverified" (title + body keywords)
b. Manual section terminology and numeric values
5. ROUTE — Based on match result:
→ Issue match found: comment on GitHub issue + add to digest
→ Relevant, no match: add to digest Triage section for human review
→ Not relevant: record as seen, no action
6. STATE — Mark post ID as seen in state.json
7. DIGEST — Weekly: compile all matches and triage items into markdown summary
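The DEDUP, ROUTE, and STATE steps above can be sketched as a single pure function. The `Route` enum and the argument names are hypothetical; the matcher is assumed to have already produced `issue_matches` (a list of matched issue numbers) and an `is_relevant` flag.

```python
from enum import Enum


class Route(Enum):
    COMMENT_AND_DIGEST = "comment on issue and add to Matched"
    TRIAGE = "add to Triage"
    SEEN_ONLY = "record as seen"


def route_post(post_id, seen_ids, issue_matches, is_relevant):
    """Return the Route for a post, or None if it was already seen (DEDUP)."""
    if post_id in seen_ids:
        return None
    seen_ids.add(post_id)  # STATE: every processed post is marked as seen
    if issue_matches:
        return Route.COMMENT_AND_DIGEST
    if is_relevant:
        return Route.TRIAGE
    return Route.SEEN_ONLY
```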
On Issue Match
When a post matches an open unverified issue:
- Post a comment on the matching GitHub issue with:
- Forum permalink and author attribution
- Relevant quote from the post
- Authority level (#2 for Steve Walmsley, #3 for changelogs, #5 for community)
- Match confidence indicator
- Add to the Matched section of the weekly digest with full attribution
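As an illustration, the comment body could be assembled as below. The layout, the heading text, and the `build_issue_comment` name are assumptions; only the listed fields come from the spec above.

```python
def build_issue_comment(author, permalink, post_date, quote, authority, confidence):
    """Format a markdown comment carrying the fields required for an issue match."""
    quoted = "\n".join(f"> {line}" for line in quote.strip().splitlines())
    return "\n".join([
        "**Forum Monitor match**",
        "",
        f"- Source: {permalink}",
        f"- Author: {author} ({post_date})",
        f"- Authority level: {authority}",
        f"- Match confidence: {confidence}",
        "",
        quoted,
    ])
```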
On Relevant Content Without an Issue Match
Not all valuable content maps to an existing unverified claim. Four scenarios:
| Category | Description | Example |
|---|---|---|
| Potential correction | Contradicts something the manual states as verified | Steve posts "actually, the formula is X not Y" |
| New coverage | Topic the manual doesn't address at all | Developer describes a mechanic with no manual section |
| Version change | Describes v2.8.0+ behavior differing from v2.7.1 baseline | Changelog entry changing a known formula |
| Expanded detail | Adds depth to an existing section without contradicting it | Developer clarifying edge case behavior |
These go into the Triage section of the weekly digest. Each entry includes:
- Full attribution metadata (author, permalink, date, authority level)
- Suggested category (correction / new coverage / version change / expanded detail)
- The matched manual section (if any)
- Relevant quote from the post
A human reviews the triage section and either:
- Creates a new issue if the content is actionable
- Dismisses if it's noise or already known
- Routes to an existing issue that the matcher missed
No automatic issue creation for unmatched content — that would be noisy. But the content is captured and surfaced so nothing falls through the cracks.
On No Relevance
Post is recorded in state.json as seen. No action taken, no noise generated.
Attribution (MANDATORY)
Every forum post used as a verification source or content reference must be properly attributed.
When forum content is incorporated into the manual:
- Identify the author and post — capture username, post date, topic URL, and specific post anchor
- Add a manual reference in the relevant file's References section:
\hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — [topic URL] — [Author username], [post date] — [specific detail verified]
- Developer posts (Steve Walmsley) use authority level #2:
\hypertarget{ref-X.Y-N}{[N]}. Aurora Forums — Steve Walmsley — [topic URL] — [specific mechanic confirmed]
- Community posts use authority level #5 and should note the verification method (testing, observation, code analysis)
- Credit in issue comments — when closing a verification issue based on forum content, include the source post URL and author in the closing comment
- Digest attribution — every digest entry must include the post author, date, and permalink so the source can be traced
The monitor should automatically capture and store attribution metadata (author, date, URL, topic title) for every matched post so that manual incorporation never loses provenance.
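A sketch of how the captured metadata might be stored and turned into a reference line in the general (non-developer) format. The `Attribution` class, its field names, and `reference_line` are hypothetical; the output mirrors the reference template given above, including its dash separator.

```python
from dataclasses import dataclass

SEP = " \u2014 "  # the dash separator used by the manual's reference format


@dataclass(frozen=True)
class Attribution:
    author: str
    post_date: str
    topic_url: str
    post_anchor: str
    topic_title: str
    board_name: str

    @property
    def permalink(self):
        """Topic URL plus the anchor of the specific post."""
        return f"{self.topic_url}#{self.post_anchor}"


def reference_line(section, number, attribution, detail):
    """Build one References entry, e.g. section '13.1' and number 2."""
    target = f"\\hypertarget{{ref-{section}-{number}}}{{[{number}]}}."
    return SEP.join([
        f"{target} Aurora Forums",
        attribution.permalink,
        f"{attribution.author}, {attribution.post_date}",
        detail,
    ])
```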
Output Destinations
| Output | Destination | Frequency |
|---|---|---|
| Issue match comments | Posted directly on the matching GitHub issue | Each run (real-time) |
| Weekly digest | GitHub Discussion in a "Monitor Digests" category | Weekly (Sunday) |
| Backfill report | GitHub Discussion in "Monitor Digests" category | One-time (first run) |
| State file | aurora-monitor/state.json on the main branch | Each run |
| Run logs | GitHub Actions log (visible in Actions tab) | Each run |
Weekly Digest Format
Posted as a GitHub Discussion with the title [Forum Monitor] Week of YYYY-MM-DD:
## Matched (N items)
Posts matching open unverified issues. Auto-commented on the relevant issues.
| Issue | Post Author | Date | Quote | Confidence |
|-------|------------|------|-------|------------|
| #NNN | username | date | "..." | High/Medium |
## Triage (N items)
Relevant posts not matching any open issue. Requires human review.
| Category | Post Author | Date | Manual Section | Quote |
|----------|------------|------|---------------|-------|
| Correction | username | date | 13.1 | "..." |
## Statistics
- Posts scanned: N
- Matches: N
- Triage items: N
- Skipped (not relevant): N
GitHub Discussions Category
Create a "Monitor Digests" category in the repository for all four monitors to post to. Each monitor prefixes its digest title with its source name (e.g., [Forum Monitor], [Reddit Monitor]).
Design Decisions
The following decisions apply to all four monitors (#1288, #1289, #1290, #1291):
Matcher Algorithm: Keyword + Fuzzy
The shared matcher.py uses keyword matching with fuzzy string similarity (e.g., fuzzywuzzy / thefuzz). Simple, fast, transparent, and easy to debug. Each open unverified issue and manual section generates a keyword set; incoming content is scored against these sets with fuzzy matching to handle minor variations.
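The scoring idea can be sketched with the standard library's `difflib.SequenceMatcher` as a stand-in for thefuzz, so the example runs without extra dependencies. This is not the thefuzz API, and its scores will differ somewhat from thefuzz's; the sliding-window approach and the function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_score(keyword, text):
    """Best 0-100 similarity between keyword and any same-length word window of text."""
    key_words = tokenize(keyword)
    words = tokenize(text)
    if not key_words or not words:
        return 0
    target = " ".join(key_words)
    size = len(key_words)
    best = 0.0
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return round(best * 100)


def score_post(keywords, text):
    """Score every keyword in an issue's keyword set against one post."""
    return {keyword: fuzzy_score(keyword, text) for keyword in keywords}
```

Comparing word windows rather than the whole post keeps a short keyword from being diluted by a long post, and it tolerates minor misspellings such as "missle engine".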
Match Confidence Scoring: Multiple Signals
Match confidence is determined by combining multiple signals:
| Signal | Weight | Example |
|---|---|---|
| Fuzzy match score | Primary | Score >= 80 = strong keyword match |
| Keyword count | Secondary | 5+ Aurora terms in post = higher confidence |
| Author reputation | Bonus | Steve Walmsley post = auto-High |
| Multiple issue matches | Penalty | Matches 3+ issues = likely generic, lower confidence |
Thresholds:
- High: Fuzzy score >= 80 AND (2+ keywords OR known author)
- Medium: Fuzzy score 60-79 OR single strong keyword match
- Low: Fuzzy score 40-59, included in digest but not auto-commented on issues
Only High and Medium confidence matches trigger auto-comments on GitHub issues. Low confidence matches appear in the digest Triage section only.
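The thresholds above could be encoded as follows. Several details are assumptions the issue does not settle: a score of 80 or more that fails the High conditions is treated as the "single strong keyword match" Medium case, the multiple-issue penalty drops the result by one level, and scores below 40 return `None`.

```python
LEVELS = ["Low", "Medium", "High"]


def classify_confidence(score, keyword_count, known_author=False, issues_matched=1):
    """Map matcher signals to 'High', 'Medium', 'Low', or None (no match)."""
    if score >= 80 and (keyword_count >= 2 or known_author):
        level = 2
    elif score >= 60:
        level = 1  # 60-79, or a strong score without supporting signals (assumed)
    elif score >= 40:
        level = 0
    else:
        return None
    if issues_matched >= 3:
        level = max(level - 1, 0)  # assumed penalty: likely a generic post
    return LEVELS[level]


def should_auto_comment(level):
    """Only High and Medium matches are posted as issue comments."""
    return level in ("High", "Medium")
```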
Repo Structure: Same Repo (Manual)
The aurora-monitor/ folder lives in the manual repository alongside the source files. Code and state in one place, single repo to manage.
State Persistence: Dedicated Folder on Main Branch
state.json lives in a dedicated aurora-monitor/ folder in the main branch of the repository. Committed after each run. Permanent, versioned, and simple.
Weekly Digest Timing: Same Workflow, Sunday Check
The daily workflow checks if today is Sunday. If so, it compiles and posts the weekly digest to the Monitor Digests discussion category after completing the daily scan. One workflow per monitor, not two.
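The Sunday check is small enough to sketch directly. Using UTC matches the cron schedule; dating the title with the posting day is an assumption, as are the function names.

```python
from datetime import datetime, timezone


def is_digest_day(now=None):
    """True on Sundays (UTC). datetime.weekday() counts Monday as 0 and Sunday as 6."""
    now = now or datetime.now(timezone.utc)
    return now.weekday() == 6


def digest_title(now=None):
    """Discussion title in the '[Forum Monitor] Week of YYYY-MM-DD' format."""
    now = now or datetime.now(timezone.utc)
    return f"[Forum Monitor] Week of {now:%Y-%m-%d}"
```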
Authentication: GITHUB_TOKEN with Elevated Permissions
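Workflows use the default GITHUB_TOKEN with permissions declared in the workflow YAML: discussions: write for digests, plus issues: write for issue comments and contents: write for committing state.json. Any scope left out of a permissions block is set to none, so all three must be listed. No PAT or GitHub App required.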
Error Handling: Retry with Backoff, Then Skip
On external API failure (Reddit 429, yt-dlp blocked, forum down): retry up to 3 times with exponential backoff. If still failing, skip that source and log the failure. No auto-alerting — GitHub Actions shows failed steps in the workflow log.
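A minimal sketch of the retry policy, reading "retry up to 3 times" as one initial attempt plus three retries (an assumption). The `retry_with_backoff` name, the 2-second base delay, and the injectable `sleep` parameter are illustrative choices.

```python
import time


def retry_with_backoff(fetch, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt and retry.

    Returns fetch()'s result, or None if every attempt failed, so the
    caller can skip this source and continue with the rest of the run.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"attempt {attempt + 1} of {retries + 1} failed: {exc}")
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.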
Testing Strategy: Fixtures + Dry-Run
- Sample data fixtures: Real examples saved as JSON fixtures with expected match results. Run as part of CI for automated regression testing.
- Dry-run mode: A --dry-run flag that processes live data but only logs results without posting comments or digests. Used for manual validation before enabling live mode.
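One way the dry-run flag could gate side effects is sketched below with `argparse`. The `--mode` option, the `deliver` helper, and the `forum.py` program name are assumptions for illustration; only `--dry-run` comes from the text above.

```python
import argparse


def build_parser():
    """Command-line options for a monitor run."""
    parser = argparse.ArgumentParser(prog="forum.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="process live data but only log what would be posted")
    parser.add_argument("--mode", choices=["steady", "backfill"], default="steady",
                        help="hypothetical switch between daily and backfill runs")
    return parser


def deliver(message, dry_run, post):
    """Send message through post() unless dry_run is set; return True if it was sent."""
    if dry_run:
        print(f"[dry-run] would post: {message}")
        return False
    post(message)
    return True
```

Routing every comment and digest through a single function like `deliver` means the flag only has to be checked in one place.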
Prerequisites
Before the first run, create the Monitor Digests discussion category manually (GitHub does not support programmatic category creation):
- Go to github.com/ErikEvenson/aurora-manual/discussions
- Click the pencil icon next to "Categories" (or Settings > Discussions)
- Click New category
- Name: Monitor Digests
- Description: Automated weekly digests and backfill reports from community source monitors (Forum, YouTube, Reddit, Discord). See issues #1288-#1291.
- Format: Announcement (only maintainers/workflows can post, others can comment)
- Record the category ID for use in workflow config
Created. Category ID: DIC_kwDORAJjec4C2k26 | Slug: monitor-digests
Proposed Architecture
aurora-monitor/
  sources/
    forum.py      # RSS fetch, rate-limited
  matcher.py      # Shared: match against issues and manual claims
  digest.py       # Shared: generate markdown digest (Matched + Triage sections)
  state.json      # Track last-seen post IDs (avoid duplicates)
  config.yaml     # Boards, users, keywords to watch
Blocked By
- Aurora Forums (aurora2.pentarch.org) are currently down. RSS URL structure and board numbers need to be confirmed when the forum returns.
Implementation Notes
- The script structure, matching logic, and issue integration can be built and tested against sample data before the forum comes back
- Matching logic should use the manual's existing *(unverified — #NNN)* markers and issue titles as the keyword corpus
- Consider using the GitHub API to automatically comment on issues when a match is found
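Extracting the marker-based corpus could look like the sketch below. The regular expression accepts an em dash, en dash, or hyphen inside the marker, which is an assumption about how consistently the markers are written; `claims_by_issue` is a hypothetical helper name.

```python
import re

# Matches markers such as *(unverified <dash> #123)* and captures the issue number.
MARKER = re.compile(r"\*\(unverified\s*[\u2014\u2013-]\s*#(\d+)\)\*")


def claims_by_issue(markdown):
    """Map each issue number to the manual lines that carry its unverified marker."""
    claims = {}
    for line in markdown.splitlines():
        for number in MARKER.findall(line):
            text = MARKER.sub("", line).strip()
            claims.setdefault(int(number), []).append(text)
    return claims
```

The claim text returned for each issue can then be tokenized into that issue's keyword set alongside its title.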
Related
- #1289: Build YouTube monitor for Aurora content and transcript-based verification matching
- #1290: Build Reddit monitor for r/aurora4x content and verification matching
- #1291: Build Discord integration for community-driven content flagging and verification