Reddit monitor: issues with many unverified claims attract disproportionate false positives #1298

@ErikEvenson

Description

Issues with many unverified claims containing common game keywords (e.g., "ships", "fuel", "population") attract disproportionate match counts in the Reddit monitor. Issue #707 (Civilian Economy, 32 unverified claims) matched 543 posts, 59 of them high-confidence. Nearly all were false positives caused by generic term overlap.

Root Cause

token_set_ratio bases its score on the tokens two strings share. Issues with long bodies containing many common Aurora terms create a large token pool that overlaps with almost any Aurora-related Reddit post. The #1294 fix (8eda798) addresses the short-post side, but the long-issue side remains: an issue with 32 claims and hundreds of keywords is a magnet for matches.
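The effect can be illustrated with a deliberately simplified proxy. The sketch below does not call token_set_ratio. It only measures what fraction of a post's distinct tokens also appear in the issue text, which is the overlap that the score builds on. The function name `post_coverage` and the sample strings are hypothetical and not taken from the monitor's code.

```python
def tokens(text):
    """Lowercase, whitespace-split token set (a stand-in for the real tokenizer)."""
    return set(text.lower().split())


def post_coverage(issue_text, post_text):
    """Fraction of the post's distinct tokens that also appear in the issue."""
    post = tokens(post_text)
    if not post:
        return 0.0
    return len(post & tokens(issue_text)) / len(post)


# Hypothetical texts: a narrow issue and a broad one that also lists common terms.
short_issue = "box launcher reload rate"
long_issue = (
    "ships fuel population shipyards minerals research colony fleet "
    "box launcher reload rate"
)
post = "my ships ran out of fuel near the colony"

print(post_coverage(short_issue, post))
print(post_coverage(long_issue, post))
```

The post has nothing to do with box launchers, yet the broad issue covers a third of its tokens purely through "ships", "fuel", and "colony". An issue with hundreds of such keywords would cover far more.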

Suggested Approaches

  1. Term specificity weighting (TF-IDF style): Weight keywords by how unique they are across all issues. "box launcher" is specific; "ships" appears in every issue. Penalize matches driven by common terms.
  2. Issue text length normalization: Scale the fuzzy score inversely with issue text length, so that longer issue bodies require higher raw scores to qualify.
  3. Max claims threshold: Issues with >N unverified claims could be split or excluded from automated matching, since they match everything.
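Approaches 1 and 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the monitor's implementation: the helper names (`idf_table`, `specificity`, `required_score`), the whitespace tokenizer, and every numeric default (`ref_tokens`, `step`, `cap`) are hypothetical. The `specificity` factor is meant to multiply the raw fuzzy score, and `required_score` raises the qualifying threshold as the issue body grows.

```python
import math


def tokenize(text):
    """Lowercase, whitespace-split token set (a stand-in for the real tokenizer)."""
    return set(text.lower().split())


def idf_table(issue_texts):
    """Inverse document frequency per token: 0.0 for a token found in every issue."""
    n = len(issue_texts)
    doc_freq = {}
    for text in issue_texts:
        for tok in tokenize(text):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(n / df) for tok, df in doc_freq.items()}


def specificity(issue_text, post_text, idf, n_issues):
    """Mean IDF of the shared tokens, scaled to [0, 1].

    Near 0 when the overlap consists of terms common to all issues,
    near 1 when it consists of terms unique to one issue.
    """
    shared = tokenize(issue_text) & tokenize(post_text)
    if not shared or n_issues < 2:
        return 0.0
    mean_idf = sum(idf.get(tok, 0.0) for tok in shared) / len(shared)
    return mean_idf / math.log(n_issues)


def required_score(base, n_tokens, ref_tokens=50, step=5.0, cap=95.0):
    """Raise the threshold by `step` points per doubling of issue length past `ref_tokens`."""
    if n_tokens <= ref_tokens:
        return base
    return min(cap, base + step * math.log2(n_tokens / ref_tokens))


# Hypothetical three-issue corpus; "ships" appears in all of them.
issues = [
    "ships fuel box launcher",
    "ships fuel population",
    "ships population growth",
]
idf = idf_table(issues)

print(specificity(issues[0], "box launcher reload", idf, len(issues)))
print(specificity(issues[0], "my ships need fuel", idf, len(issues)))
print(required_score(70.0, 200))
```

In this toy corpus the post that shares "box launcher" keeps its full score, while the post that shares only "ships" and "fuel" is scaled down sharply. Averaging the IDF of shared tokens is one choice among several. Taking the maximum instead would reward a single specific term even when the rest of the overlap is generic.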

Impact

Medium: false positives waste reviewer time and add noise to issue comment threads. The #1294 fix reduces short-post false positives but doesn't address the long-issue attractors.
