Description
Issues with many unverified claims that contain common game keywords (e.g., "ships", "fuel", "population") attract disproportionate match counts in the Reddit monitor. Issue #707 (Civilian Economy, 32 unverified claims) matched 543 posts, 59 of them at high confidence; nearly all were false positives caused by generic term overlap.
Root Cause
token_set_ratio scores a pair of texts based on the intersection of their token sets. An issue with a long body containing many common Aurora terms creates a large token pool that overlaps with almost any Aurora-related Reddit post. The #1294 fix (8eda798) addresses the short-post side, but the long-issue side remains: an issue with 32 claims and hundreds of keywords is a magnet for matches.
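The effect can be illustrated without a fuzzy-matching library. The sketch below is a simplified stand-in for the token-intersection step, not the monitor's actual code: `post_coverage` and the sample strings are hypothetical, and it assumes plain whitespace tokenization. It measures what fraction of a post's distinct tokens also appear in an issue body.

```python
def post_coverage(post: str, issue: str) -> float:
    """Fraction of the post's distinct tokens that also appear in the issue."""
    post_tokens = set(post.lower().split())
    issue_tokens = set(issue.lower().split())
    if not post_tokens:
        return 0.0
    return len(post_tokens & issue_tokens) / len(post_tokens)


# Hypothetical sample texts, for illustration only.
post = "my ships ran out of fuel"
short_issue = "box launcher reload rate"
long_issue = (
    "civilian ships consume fuel when population runs out of minerals "
    "and my freighters ran dry"
)

print(post_coverage(post, short_issue))  # no shared tokens
print(post_coverage(post, long_issue))   # every post token is in the long issue
```

Assuming token_set_ratio here is the fuzzywuzzy/RapidFuzz-style scorer, a post whose tokens are all contained in the issue body gets the maximum score, so the longer the issue, the more unrelated posts reach full coverage.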
Suggested Approaches
- Term specificity weighting (TF-IDF style): Weight keywords by how unique they are across all issues. "box launcher" is specific; "ships" appears in every issue. Penalize matches driven by common terms.
- Issue text length normalization: Scale the fuzzy score inversely with issue text length — longer issue bodies should require higher raw scores to qualify.
- Max claims threshold: Issues with >N unverified claims could be split or excluded from automated matching, since they match everything.
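A minimal sketch of the first two approaches, under stated assumptions: all function names, the sample issue texts, and the `ref_tokens` and `step` parameters are hypothetical and are not taken from the monitor's code. `specificity` returns a factor in [0, 1] that could multiply the raw fuzzy score, and `required_score` raises the qualifying threshold as the issue body grows.

```python
import math
from collections import Counter


def tokenize(text: str) -> set[str]:
    """Lowercased whitespace tokens (an assumed, simplified tokenizer)."""
    return set(text.lower().split())


def build_idf(issue_texts: list[str]) -> dict[str, float]:
    """Inverse document frequency of each token across all issue bodies."""
    n = len(issue_texts)
    doc_freq = Counter()
    for text in issue_texts:
        doc_freq.update(tokenize(text))
    # A token present in every issue gets log(1) = 0, i.e. no weight.
    return {tok: math.log(n / df) for tok, df in doc_freq.items()}


def specificity(post: str, issue: str, idf: dict[str, float]) -> float:
    """Mean IDF of the shared tokens, normalized by the largest IDF."""
    shared = tokenize(post) & tokenize(issue)
    max_idf = max(idf.values(), default=0.0)
    if not shared or max_idf == 0.0:
        return 0.0
    mean_idf = sum(idf.get(tok, 0.0) for tok in shared) / len(shared)
    return mean_idf / max_idf


def required_score(base: float, issue_tokens: int,
                   ref_tokens: int = 50, step: float = 5.0) -> float:
    """Add `step` points per doubling of issue length beyond ref_tokens."""
    if issue_tokens <= ref_tokens:
        return base
    return min(100.0, base + step * math.log2(issue_tokens / ref_tokens))


# Hypothetical issue bodies, for illustration only.
issues = [
    "ships fuel population civilian economy",
    "ships fuel box launcher reload",
    "ships population ground combat",
]
idf = build_idf(issues)

generic = specificity("ships fuel", issues[0], idf)
specific = specificity("box launcher reload", issues[1], idf)
print(generic, specific)        # the generic match scores far lower
print(required_score(80, 200))  # longer issue body, higher threshold
```

In this toy corpus a match driven by "ships" and "fuel" is heavily discounted, while a match on "box launcher reload" keeps its full weight. The real weighting and threshold curve would need tuning against the backfill data.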
Impact
Medium: false positives waste reviewer time and add noise to issue comment threads. The #1294 fix reduces short-post false positives but does not address the long-issue attractors.
Related
- bug: Reddit monitor matcher produces false positives on short/empty posts #1294 (short text false positives, fixed in 8eda798)
- Discussion [Reddit Monitor] Backfill Report — 2026-02-17 #1296 (backfill report showing the pattern)
- Verify: Section 6.5 unverified claims (32 items) #707 (primary example: 543 false positive matches)