Open
Description
Problem
The parse_date() method in base_scraper.py doesn't handle dates with additional context:
- "May 20, 2025 (us-east-1 and us-west-2)" → Should parse to "2025-05-20"
- "January 15th, 2026 (all Regions)" → Should parse to "2026-01-15"
- "Not sooner than 2025-06-24" → Should parse to "2025-06-24"
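The failure can be reproduced in isolation. The sketch below assumes the current parser relies on `datetime.strptime` with fixed formats (the existing implementation is not shown in this issue); `strict_parse_fails` is a hypothetical helper used only for illustration.

```python
from datetime import datetime

# The three inputs from the problem statement, each paired with the
# strptime format that would match it if the extra context were absent.
samples = [
    ("May 20, 2025 (us-east-1 and us-west-2)", "%B %d, %Y"),
    ("January 15th, 2026 (all Regions)", "%B %d, %Y"),
    ("Not sooner than 2025-06-24", "%Y-%m-%d"),
]


def strict_parse_fails(text: str, fmt: str) -> bool:
    """Return True if datetime.strptime rejects text for the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return True
    return False


for text, fmt in samples:
    print(f"{text!r}: rejected={strict_parse_fails(text, fmt)}")
```

`strptime` requires the whole string to match the format, so trailing region notes, ordinal suffixes, and leading phrases each raise `ValueError` unless they are removed first.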
Proposed Fix
Update parse_date() in src/base_scraper.py:
# Required at the top of src/base_scraper.py
import re
from datetime import datetime

def parse_date(self, date_str: str) -> str:
    """Parse various date formats to ISO format (YYYY-MM-DD)."""
    if not date_str:
        return ""
    # Strip surrounding whitespace before any matching
    date_str = date_str.strip()
    # Already in ISO format
    if re.match(r"^\d{4}-\d{2}-\d{2}$", date_str):
        return date_str
    # Remove region information in trailing parentheses
    date_str = re.sub(r'\s*\([^)]+\)\s*$', '', date_str)
    # Extract date from "Not sooner than DATE" patterns; capture the rest of
    # the string so multi-word dates such as "May 20, 2025" are kept whole
    sooner_match = re.search(r'not\s+sooner\s+than\s+(.+)', date_str, re.IGNORECASE)
    if sooner_match:
        date_str = sooner_match.group(1)
    # Remove ordinal suffixes (1st, 2nd, 3rd, 4th)
    date_str = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', date_str)
    # Common formats to try, in order
    formats = [
        "%B %d, %Y",  # January 31, 2025
        "%b %d, %Y",  # Jan 31, 2025
        "%Y-%m-%d",   # 2025-01-31
        "%m/%d/%Y",   # 01/31/2025 (tried first, so 03/04/2025 is read as March 4)
        "%d/%m/%Y",   # 31/01/2025 (reached only when %m/%d/%Y fails)
    ]
    for fmt in formats:
        try:
            dt = datetime.strptime(date_str.strip(), fmt)
            return dt.strftime("%Y-%m-%d")
        except ValueError:
            continue
    # If no format matches, return empty string
    # (Changed from returning original to enforce ISO format)
    return ""
Test Cases
def test_parse_date_with_region():
    scraper = EnhancedBaseScraper()
    assert scraper.parse_date("May 20, 2025 (us-east-1 and us-west-2)") == "2025-05-20"
    assert scraper.parse_date("January 15th, 2026 (all Regions)") == "2026-01-15"
    assert scraper.parse_date("Not sooner than 2025-06-24") == "2025-06-24"
    assert scraper.parse_date("March 8, 2025") == "2025-03-08"
Benefits
- Consistent ISO date format across all providers
- Handles real-world date variations from different providers
- Makes dates programmatically parseable
- Improves RSS feed quality