
Improve date parsing to handle dates with region/context information #72

@leblancfg

Problem

The parse_date() method in src/base_scraper.py doesn't handle date strings that carry extra context, such as a trailing region note, an ordinal suffix, or a "Not sooner than" prefix:

  • "May 20, 2025 (us-east-1 and us-west-2)" → Should parse to "2025-05-20"
  • "January 15th, 2026 (all Regions)" → Should parse to "2026-01-15"
  • "Not sooner than 2025-06-24" → Should parse to "2025-06-24"
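For context, here is a minimal sketch of why inputs like these trip up a plain strptime call. The issue does not show the current implementation, so try_parse below is a hypothetical stand-in that assumes the raw string is passed to datetime.strptime with a single format and no cleanup:

```python
from datetime import datetime


def try_parse(date_str, fmt="%B %d, %Y"):
    """Hypothetical stand-in: parse with one format and no cleanup."""
    try:
        return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
    except ValueError:
        # strptime rejects trailing text and ordinal suffixes such as "15th"
        return None


for raw in [
    "March 8, 2025",
    "May 20, 2025 (us-east-1 and us-west-2)",
    "January 15th, 2026 (all Regions)",
]:
    print(repr(raw), "->", try_parse(raw))
```

Only the first string parses; the other two raise ValueError inside strptime, which is why the cleanup steps have to run before the format loop.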

Proposed Fix

Update parse_date() in src/base_scraper.py (the module needs `import re` and `from datetime import datetime`):

def parse_date(self, date_str: str) -> str:
    """Parse various date formats to ISO format."""
    if not date_str:
        return ""

    # Strip surrounding whitespace first so the ISO check below sees a clean string
    date_str = date_str.strip()

    # Already in ISO format
    if re.match(r"^\d{4}-\d{2}-\d{2}$", date_str):
        return date_str
    
    # Remove region information in parentheses
    date_str = re.sub(r'\s*\([^)]+\)\s*$', '', date_str)
    
    # Extract date from "Not sooner than DATE" patterns.
    # Capture the whole remainder so multi-word dates ("June 24, 2025") survive.
    sooner_match = re.search(r'not\s+sooner\s+than\s+(.+)', date_str, re.IGNORECASE)
    if sooner_match:
        date_str = sooner_match.group(1)
    
    # Remove ordinal suffixes (1st, 2nd, 3rd, 4th); \b keeps the match to whole tokens
    date_str = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', date_str)
    
    # Common formats to try
    formats = [
        "%B %d, %Y",  # January 31, 2025
        "%b %d, %Y",  # Jan 31, 2025
        "%Y-%m-%d",   # 2025-01-31
        "%m/%d/%Y",   # 01/31/2025
        "%d/%m/%Y",   # 31/01/2025
    ]
    
    for fmt in formats:
        try:
            dt = datetime.strptime(date_str.strip(), fmt)
            return dt.strftime("%Y-%m-%d")
        except ValueError:
            continue
    
    # If no format matches, return empty string
    # (Changed from returning original to enforce ISO format)
    return ""
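One consequence of the format list worth keeping in mind: "%m/%d/%Y" is tried before "%d/%m/%Y", so an ambiguous slash date resolves as month-first. A small sketch of that behaviour, where first_match is a hypothetical helper that mirrors only the format loop:

```python
from datetime import datetime

# Same relative order as the two slash formats in the proposed list
SLASH_FORMATS = ["%m/%d/%Y", "%d/%m/%Y"]


def first_match(date_str):
    """Hypothetical helper: return the first successful parse as ISO, else ""."""
    for fmt in SLASH_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""


print(first_match("03/04/2025"))  # month-first wins: 2025-03-04
print(first_match("31/01/2025"))  # only day-first is valid: 2025-01-31
```

If a provider publishes day-first dates, values with a day of 12 or lower would be silently read as month-first, so a per-provider format hint may be worth considering.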

Test Cases

def test_parse_date_with_region():
    scraper = EnhancedBaseScraper()
    assert scraper.parse_date("May 20, 2025 (us-east-1 and us-west-2)") == "2025-05-20"
    assert scraper.parse_date("January 15th, 2026 (all Regions)") == "2026-01-15"
    assert scraper.parse_date("Not sooner than 2025-06-24") == "2025-06-24"
    assert scraper.parse_date("March 8, 2025") == "2025-03-08"
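A few extra edge cases may be worth covering as well. The sketch below uses a standalone, module-level copy of the proposed logic so it can run without EnhancedBaseScraper. That copy is an assumption for experimentation, not the project's code: it strips whitespace before the ISO check and captures everything after "Not sooner than", which may differ from what ends up in src/base_scraper.py.

```python
import re
from datetime import datetime

FORMATS = ["%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]


def parse_date(date_str: str) -> str:
    """Standalone copy of the proposed logic, for experimentation only."""
    if not date_str:
        return ""
    date_str = date_str.strip()
    if re.match(r"^\d{4}-\d{2}-\d{2}$", date_str):
        return date_str
    # Drop a trailing parenthesised note such as "(all Regions)"
    date_str = re.sub(r"\s*\([^)]+\)\s*$", "", date_str)
    # Keep what follows "Not sooner than" (assumption: the whole remainder)
    match = re.search(r"not\s+sooner\s+than\s+(.+)", date_str, re.IGNORECASE)
    if match:
        date_str = match.group(1)
    # Strip ordinal suffixes: "15th" -> "15"
    date_str = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", date_str)
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""


EDGE_CASES = {
    "": "",                                         # empty input
    "   2025-06-24  ": "2025-06-24",                # padded ISO date
    "Jan 3rd, 2026": "2026-01-03",                  # abbreviated month + ordinal
    "not sooner than June 24, 2025": "2025-06-24",  # lowercase prefix, long date
    "TBD": "",                                      # unparseable placeholder
}

for raw, expected in EDGE_CASES.items():
    assert parse_date(raw) == expected, (raw, parse_date(raw))
```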

Benefits

  • Consistent ISO date format across all providers
  • Handles real-world date variations from different providers
  • Makes dates programmatically parseable
  • Improves RSS feed quality
