Problem
Currently, automated scraper PRs are created daily, but there is no validation of data quality before auto-merge. Bad scraper output (such as the 443 xAI models incident) can go straight to production if CI passes.
Proposed Solution
Add an LLM-based sanity checker as a CI step that validates scraped data before allowing auto-merge:
Checks to implement:
- Volume checks: Flag if a scraper returns an unusually high or low number of items (e.g., xAI: 443 items vs. the expected ~5)
- Data quality: Validate that items have required fields (dates, context, valid model IDs)
- Pattern detection: Detect concatenated model IDs, invalid dates, placeholder values ("N/A", "TBD")
- Delta analysis: Compare with previous data to flag suspicious changes (e.g., 10x increase in deprecations)
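Several of the checks above are deterministic and could run before the LLM is ever invoked. A minimal sketch of the volume and field checks, where `EXPECTED_COUNTS`, the field names, and the placeholder set are illustrative assumptions rather than the project's real configuration:

```python
import re

# Illustrative per-provider baselines -- assumption, not the project's real config.
EXPECTED_COUNTS = {"xai": 5, "openai": 40}

PLACEHOLDERS = {"N/A", "TBD", ""}
REQUIRED_FIELDS = ("model_id", "deprecation_date", "provider")

def volume_check(provider, items, tolerance=3.0):
    """Flag counts that deviate from the baseline by more than `tolerance`x."""
    expected = EXPECTED_COUNTS.get(provider)
    if expected is None:
        return []
    n = len(items)
    if n > expected * tolerance or n < expected / tolerance:
        return [f"{provider}: {n} items vs expected ~{expected}"]
    return []

def field_check(item):
    """Required-field, placeholder-value, and date-format checks for one item."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = item.get(field)
        if value is None or str(value).strip() in PLACEHOLDERS:
            issues.append(f"missing or placeholder field: {field}")
    date = str(item.get("deprecation_date", ""))
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        issues.append(f"invalid date format: {date}")
    return issues
```

Running these first would keep obvious failures (like the 443-item case) cheap to catch, reserving the LLM for the fuzzier pattern and delta analysis.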
Implementation approach:
- Add a new CI job, sanity-check, that uses the Claude API (Haiku for speed/cost) to analyze the scraped data
- Output structured report with pass/fail + warnings
- Block auto-merge if critical issues found
- Allow auto-merge if only warnings (log for review)
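The block-vs-allow decision on the checker's structured report could be as small as this sketch, which assumes the `{pass, critical_issues, warnings}` report shape proposed in this issue (a nonzero return code fails the CI job and so blocks auto-merge):

```python
def gate(report: dict) -> int:
    """Return a CI exit code: nonzero blocks auto-merge, zero allows it."""
    for issue in report.get("critical_issues", []):
        print(f"CRITICAL: {issue}")
    for warning in report.get("warnings", []):
        print(f"WARNING: {warning}")  # logged for review, does not block
    if report.get("critical_issues") or not report.get("pass", False):
        return 1  # critical issues or an explicit fail -> block auto-merge
    return 0      # warnings only -> merge allowed
```

A CI step would then end with something like `sys.exit(gate(report))` after loading the checker's JSON output.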
Benefits:
- Catch data quality issues before they hit production RSS feed
- Maintain automation while ensuring quality
- Provide detailed feedback for debugging scraper issues
- Can evolve checks over time without changing scrapers
Example prompt structure:
Analyze this scraped deprecation data and check for:
1. Unusual item counts per provider
2. Missing required fields
3. Invalid date formats
4. Concatenated or malformed model IDs
5. Suspicious patterns
Data: {json_data}
Previous counts: {historical_counts}
Return JSON with: {pass: bool, critical_issues: [], warnings: []}
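Wrapping that prompt for a Haiku call might look like the sketch below; the template text mirrors the prompt structure above, while `build_prompt` and `parse_report` are hypothetical helper names. The actual Messages API call is omitted, but a robust checker should treat an unparseable model reply as a critical failure rather than a pass:

```python
import json

PROMPT_TEMPLATE = """Analyze this scraped deprecation data and check for:
1. Unusual item counts per provider
2. Missing required fields
3. Invalid date formats
4. Concatenated or malformed model IDs
5. Suspicious patterns

Data: {json_data}
Previous counts: {historical_counts}

Return JSON with: {{"pass": bool, "critical_issues": [], "warnings": []}}"""

def build_prompt(data, historical_counts):
    """Fill the template with the scraped payload and historical baselines."""
    return PROMPT_TEMPLATE.format(
        json_data=json.dumps(data, sort_keys=True),
        historical_counts=json.dumps(historical_counts, sort_keys=True),
    )

def parse_report(raw: str) -> dict:
    """Parse the model's JSON reply; malformed output becomes a critical failure."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return {"pass": False,
                "critical_issues": ["unparseable checker output"],
                "warnings": []}
    # Normalise missing keys so downstream gating code can rely on them.
    report.setdefault("pass", False)
    report.setdefault("critical_issues", [])
    report.setdefault("warnings", [])
    return report
```

Failing closed on unparseable output keeps a flaky model reply from silently approving a bad PR.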
Acceptance Criteria:
- CI job added that runs LLM sanity check
- Blocks auto-merge on critical issues
- Logs warnings but allows merge
- Uses cost-effective model (Haiku)
- Completes in <30 seconds