US News Scraper automatically collects structured article data from a leading US news website, turning unstructured pages into clean, machine-ready records. It helps you monitor coverage, analyze trends, and track how stories perform over time. Whether you are building dashboards, research pipelines, or content intelligence tools, this US news scraper gives you a reliable data backbone.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for us-news-scraper, you've just found your team. Let's chat.
US News Scraper is a news data extraction toolkit that discovers article pages, parses their core metadata, and exports them in multiple formats for downstream use. It removes the manual work of navigating categories, paginated listings, and detail pages.
This project is built for analysts, data engineers, researchers, and developers who need fresh, large-scale US news datasets without dealing with brittle one-off scripts.
- Automatically discovers and classifies article pages from section, topic, or search URLs (see the input sketch after this list).
- Extracts core metadata such as titles, authors, timestamps, categories, tags, and images.
- Captures engagement signals like popularity scores, comment counts, and social interactions where available.
- Supports continuous monitoring runs to track how coverage evolves over time.
- Exports data to multiple formats so it can plug into BI tools, warehouses, and custom apps.
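The entry points mentioned above are typically supplied as a small input file. Below is a minimal sketch of what such a file could look like, loosely patterned after `config/start-urls.example.json` in this repository; the exact keys and values are illustrative assumptions, not the shipped format.

```json
[
  { "url": "https://www.example-usnews.com/politics", "type": "section" },
  { "url": "https://www.example-usnews.com/topics/elections-2025", "type": "topic" },
  { "url": "https://www.example-usnews.com/search?q=healthcare", "type": "search" }
]
```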
| Feature | Description |
|---|---|
| Full-site coverage | Crawl entire news sections, topic hubs, or custom entry points to build comprehensive datasets. |
| Smart article detection | Uses heuristic logic to distinguish real article pages from navigation, utility, and marketing pages. |
| Rich metadata extraction | Collects titles, authors, timestamps, categories, tags, images, and canonical URLs for each article. |
| Popularity and performance signals | Tracks fields like popularity score, comment count, and social interactions when exposed by the site. |
| Flexible export formats | Export results as JSON, CSV, XML, HTML tables, or Excel files for easy integration with other tools. |
| Incremental & large-scale runs | Designed to handle large result sets with pagination and batching to keep memory usage stable. |
| Configurable limits | Set maximum item counts, depth, or runtime limits to control resource usage and cost. |
| Robust error handling | Retries transient failures and skips broken pages while keeping the rest of the dataset intact. |
| Field Name | Field Description |
|---|---|
| id | Unique identifier of the article within the dataset. |
| url | Canonical URL of the article detail page. |
| section | Top-level section or category (e.g., Politics, Health, Education). |
| category | More specific category or sub-section of the article. |
| title | Headline text of the article. |
| subtitle | Optional subheading or deck text when present. |
| author | Name of the primary author or byline. |
| contributors | List of additional authors or contributors. |
| publishedAt | Original publication datetime in ISO 8601 format. |
| updatedAt | Last updated datetime, if provided by the site. |
| summary | Short summary or standfirst describing the article. |
| body | Main article body text, cleaned and merged into paragraphs. |
| tags | List of topical tags or keywords associated with the article. |
| topics | Higher-level topical groupings inferred from navigation or labels. |
| imageUrl | URL of the primary header or feature image. |
| imageAlt | Alternate text or caption for the lead image. |
| readingTimeMinutes | Estimated reading time for the article content. |
| popularityScore | Numeric score representing relative popularity or engagement. |
| commentCount | Number of comments, if visible. |
| shareCount | Aggregate share count or derived engagement metric, when available. |
| reactions | Breakdown of reaction/emotion counts where exposed. |
| language | Language code of the article content, typically "en". |
| region | Geographic focus or edition (e.g., "US"). |
| source | Name of the news source for downstream identification. |
| scrapeRunId | Identifier of the scraping run or batch that produced this record. |
| scrapedAt | Datetime when the article was scraped, in ISO 8601 format. |
Example:
[
{
"id": "usnews-2025-001234",
"url": "https://www.example-usnews.com/politics/election-coverage-2025-01-01",
"section": "Politics",
"category": "Elections",
"title": "Key Takeaways From the Latest Election Polls",
"subtitle": "Voters signal shifting priorities ahead of the primary season.",
"author": "Jane Doe",
"contributors": [
"John Smith"
],
"publishedAt": "2025-01-01T14:30:00Z",
"updatedAt": "2025-01-01T16:05:00Z",
"summary": "A breakdown of voter sentiment across key battleground states based on new polling data.",
"body": "New polling data released on Wednesday shows that voter priorities are shifting toward economic and healthcare issues...",
"tags": [
"elections",
"polls",
"voter sentiment"
],
"topics": [
"US Politics",
"Elections 2025"
],
"imageUrl": "https://www.example-usnews.com/media/election-polls.jpg",
"imageAlt": "Chart showing voter poll percentages by candidate.",
"readingTimeMinutes": 6,
"popularityScore": 0.87,
"commentCount": 142,
"shareCount": 923,
"reactions": {
"insightful": 321,
"surprised": 54,
"skeptical": 77
},
"language": "en",
"region": "US",
"source": "Example US News",
"scrapeRunId": "run-2025-01-01T15-00-00Z",
"scrapedAt": "2025-01-01T15:02:18Z"
}
]
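For typed pipelines, the output fields above map naturally onto a single record interface. The sketch below is a hand-written approximation of that shape based on the field table; the authoritative definition lives in `src/config/schema.ts` and may differ in details such as which fields are optional.

```typescript
// Approximate shape of one exported article record (see field table above).
interface ArticleRecord {
  id: string;
  url: string;
  section: string;
  category: string;
  title: string;
  subtitle?: string;          // present only when the article has a deck
  author: string;
  contributors: string[];
  publishedAt: string;        // ISO 8601
  updatedAt?: string;         // ISO 8601, when provided by the site
  summary: string;
  body: string;
  tags: string[];
  topics: string[];
  imageUrl?: string;
  imageAlt?: string;
  readingTimeMinutes: number;
  popularityScore?: number;
  commentCount?: number;
  shareCount?: number;
  reactions?: Record<string, number>;
  language: string;           // e.g. "en"
  region: string;             // e.g. "US"
  source: string;
  scrapeRunId: string;
  scrapedAt: string;          // ISO 8601
}
```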
us-news-scraper/
├── src/
│ ├── index.ts
│ ├── crawler/
│ │ ├── bootstrap.ts
│ │ ├── queueManager.ts
│ │ └── paginationHandler.ts
│ ├── extractors/
│ │ ├── articleDetector.ts
│ │ ├── articleParser.ts
│ │ └── metadataNormalizer.ts
│ ├── outputs/
│ │ ├── exporterJson.ts
│ │ ├── exporterCsv.ts
│ │ ├── exporterXml.ts
│ │ └── exporterExcel.ts
│ ├── config/
│ │ ├── defaults.ts
│ │ └── schema.ts
│ └── utils/
│ ├── logger.ts
│ ├── httpClient.ts
│ └── rateLimiter.ts
├── config/
│ ├── start-urls.example.json
│ └── scraper.config.example.json
├── tests/
│ ├── articleDetector.test.ts
│ ├── articleParser.test.ts
│ └── integration.test.ts
├── data/
│ ├── sample-output.json
│ └── sample-output.csv
├── docs/
│ ├── usage.md
│ └── schema.md
├── package.json
├── tsconfig.json
├── README.md
└── LICENSE
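To show how the modules above are meant to fit together, here is a condensed, hypothetical composition of the crawler, extractor, and output layers. The function names and signatures are assumptions for illustration only; the real entry point and wiring live in `src/index.ts`.

```typescript
// Hypothetical wiring of the main stages; actual module APIs may differ.
import { bootstrapQueue } from "./crawler/bootstrap";
import { isArticlePage } from "./extractors/articleDetector";
import { parseArticle } from "./extractors/articleParser";
import { exportJson } from "./outputs/exporterJson";

async function run(startUrls: string[]): Promise<void> {
  const queue = await bootstrapQueue(startUrls);   // seed section/topic/search pages
  const records = [];

  for await (const page of queue) {                // pagination handled by the crawler
    if (!isArticlePage(page)) continue;            // skip navigation and utility pages
    records.push(parseArticle(page));              // extract and normalize metadata
  }

  await exportJson(records, "data/output.json");   // write results to disk
}
```

In practice, the choice of exporter (JSON, CSV, XML, HTML, Excel) would be driven by configuration rather than hard-coded.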
- Media analysts use it to track coverage of specific topics over time, so they can quantify how quickly stories gain or lose attention.
- Data scientists use it to build news-based sentiment and trend models, so they can forecast market or opinion shifts with richer signals.
- Researchers and academics use it to assemble longitudinal corpora of US news articles, so they can study bias, framing, and agenda-setting at scale.
- Marketing and communications teams use it to monitor how campaigns or brands are referenced in the news, so they can react faster and manage reputation.
- News aggregators and dashboard builders use it to power content feeds and analytics panels, so they can deliver timely summaries to their users.
Q1: Do I need to know the exact article URLs before I start? No. You can provide section, topic, or search listing URLs as starting points. The scraper will follow pagination, discover article links, and then visit each article detail page to extract structured data.
Q2: What output formats are supported? You can export results as JSON, CSV, XML, HTML tables, or Excel files. This makes it easy to import the data into spreadsheets, BI tools, databases, or custom applications without additional conversion scripts.
Q3: How can I avoid overloading the target website? You can configure delays between requests, maximum concurrency, and global item limits. This lets you tune runs to be respectful and stable, while still collecting the volume of data you need.
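As a rough illustration, such a run configuration could look like the sketch below; the key names are assumptions patterned after `config/scraper.config.example.json` rather than a documented contract. Lower concurrency and longer delays trade speed for politeness, while the limits cap total runtime and cost.

```json
{
  "maxItems": 5000,
  "maxDepth": 3,
  "maxConcurrency": 4,
  "requestDelayMs": 1500,
  "maxRuntimeMinutes": 120,
  "exportFormat": "csv"
}
```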
Q4: Is it okay to reuse or republish the article content? Always check the website’s terms of use and relevant copyright laws before republishing or redistributing article text or media. The scraper is intended for analysis, research, and internal applications. For public reuse of content, seek legal advice and permissions where required.
Primary Metric: On a typical broadband connection, the scraper can process approximately 300–600 article pages per minute when concurrency is set to a moderate level and pages are lightweight.
Reliability Metric: In test runs across mixed sections and topics, over 98% of reachable article URLs were successfully processed without critical extraction errors.
Efficiency Metric: With batching and streaming exports enabled, memory usage remains stable even for runs exceeding 50,000 articles, making it suitable for large-scale data collection.
Quality Metric: For well-structured article templates, field-level completeness (title, author, timestamps, section, body) consistently exceeds 95%, with strict validation used to flag incomplete or malformed records for review.
