US News Scraper

US News Scraper automatically collects structured article data from a leading US news website, turning unstructured pages into clean, machine-ready records. It helps you monitor coverage, analyze trends, and track how stories perform over time. Whether you are building dashboards, research pipelines, or content intelligence tools, this US news scraper gives you a reliable data backbone.


Telegram   WhatsApp   Gmail   Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for us-news-scraper, you've just found your team. Let's Chat! 👆👆

Introduction

US News Scraper is a news data extraction toolkit that discovers article pages, parses their core metadata, and exports them in multiple formats for downstream use. It removes the manual work of navigating categories, paginated listings, and detail pages.

This project is built for analysts, data engineers, researchers, and developers who need fresh, large-scale US news datasets without dealing with brittle one-off scripts.

News Intelligence and Monitoring

  • Automatically discovers and classifies article pages from section, topic, or search URLs.
  • Extracts core metadata such as titles, authors, timestamps, categories, tags, and images.
  • Captures engagement signals like popularity scores, comment counts, and social interactions where available.
  • Supports continuous monitoring runs to track how coverage evolves over time.
  • Exports data to multiple formats so it can plug into BI tools, warehouses, and custom apps (a minimal run sketch follows this list).
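
To make the workflow concrete, here is a minimal sketch of a monitoring run. The Scraper class, its option names, and the import path are hypothetical placeholders rather than the project's confirmed API; see src/index.ts for the real entry point.

// Hypothetical usage sketch — class name, options, and import path are
// illustrative placeholders, not the project's confirmed API.
import { Scraper } from "./src";

async function main(): Promise<void> {
  const scraper = new Scraper({
    startUrls: ["https://www.example-usnews.com/politics"], // section entry point
    maxItems: 500,        // stop after 500 articles
    exportFormat: "json", // also: csv, xml, html, xlsx
  });

  const records = await scraper.run(); // crawl, detect articles, extract metadata
  console.log(`Collected ${records.length} articles`);
}

main().catch(console.error);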

Features

  • Full-site coverage: Crawl entire news sections, topic hubs, or custom entry points to build comprehensive datasets.
  • Smart article detection: Uses heuristic logic to distinguish real article pages from navigation, utility, and marketing pages.
  • Rich metadata extraction: Collects titles, authors, timestamps, categories, tags, images, and canonical URLs for each article.
  • Popularity and performance signals: Tracks fields like view score, comment count, and social interactions when exposed by the site.
  • Flexible export formats: Export results as JSON, CSV, XML, HTML tables, or Excel files for easy integration with other tools.
  • Incremental & large-scale runs: Designed to handle large result sets with pagination and batching to keep memory usage stable.
  • Configurable limits: Set maximum item counts, depth, or runtime limits to control resource usage and cost.
  • Robust error handling: Retries transient failures and skips broken pages while keeping the rest of the dataset intact (a sketch of this retry pattern follows the list).
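
The error handling described above typically follows a retry-with-backoff pattern. Below is a minimal, self-contained TypeScript sketch of that pattern; it is not the project's actual httpClient.ts, whose internals are not documented here.

// Generic retry-with-exponential-backoff sketch. Transient failures are
// retried; after the final attempt the page is skipped (null) so the rest
// of the dataset stays intact.
async function fetchWithRetry(url: string, maxAttempts = 3): Promise<string | null> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt === maxAttempts) {
        console.warn(`Skipping ${url} after ${maxAttempts} attempts: ${err}`);
        return null;
      }
      // Back off 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 500));
    }
  }
  return null;
}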

What Data This Scraper Extracts

  • id: Unique identifier of the article within the dataset.
  • url: Canonical URL of the article detail page.
  • section: Top-level section or category (e.g., Politics, Health, Education).
  • category: More specific category or sub-section of the article.
  • title: Headline text of the article.
  • subtitle: Optional subheading or deck text when present.
  • author: Name of the primary author or byline.
  • contributors: List of additional authors or contributors.
  • publishedAt: Original publication datetime in ISO 8601 format.
  • updatedAt: Last updated datetime, if provided by the site.
  • summary: Short summary or standfirst describing the article.
  • body: Main article body text, cleaned and merged into paragraphs.
  • tags: List of topical tags or keywords associated with the article.
  • topics: Higher-level topical groupings inferred from navigation or labels.
  • imageUrl: URL of the primary header or feature image.
  • imageAlt: Alternate text or caption for the lead image.
  • readingTimeMinutes: Estimated reading time for the article content.
  • popularityScore: Numeric score representing relative popularity or engagement.
  • commentCount: Number of comments, if visible.
  • shareCount: Aggregate share count or derived engagement metric, when available.
  • reactions: Breakdown of reaction/emotion counts where exposed.
  • language: Language code of the article content, typically "en".
  • region: Geographic focus or edition (e.g., "US").
  • source: Name of the news source for downstream identification.
  • scrapeRunId: Identifier of the scraping run or batch that produced this record.
  • scrapedAt: Datetime when the article was scraped, in ISO 8601 format.
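
For reference, the record shape above can be expressed as a TypeScript interface. This is a sketch derived directly from the field list; which fields are optional is an assumption based on the "if provided" / "when available" notes, not a confirmed schema (see src/config/schema.ts).

// Sketch of the output record; optionality is assumed, not confirmed.
interface ArticleRecord {
  id: string;
  url: string;
  section: string;
  category: string;
  title: string;
  subtitle?: string;          // present only when the page has a deck
  author: string;
  contributors: string[];
  publishedAt: string;        // ISO 8601
  updatedAt?: string;         // ISO 8601, if provided by the site
  summary: string;
  body: string;
  tags: string[];
  topics: string[];
  imageUrl: string;
  imageAlt: string;
  readingTimeMinutes: number;
  popularityScore: number;
  commentCount?: number;      // if visible
  shareCount?: number;        // when available
  reactions?: Record<string, number>; // where exposed
  language: string;           // e.g. "en"
  region: string;             // e.g. "US"
  source: string;
  scrapeRunId: string;
  scrapedAt: string;          // ISO 8601
}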

Example Output


[
  {
    "id": "usnews-2025-001234",
    "url": "https://www.example-usnews.com/politics/election-coverage-2025-01-01",
    "section": "Politics",
    "category": "Elections",
    "title": "Key Takeaways From the Latest Election Polls",
    "subtitle": "Voters signal shifting priorities ahead of the primary season.",
    "author": "Jane Doe",
    "contributors": [
      "John Smith"
    ],
    "publishedAt": "2025-01-01T14:30:00Z",
    "updatedAt": "2025-01-01T16:05:00Z",
    "summary": "A breakdown of voter sentiment across key battleground states based on new polling data.",
    "body": "New polling data released on Wednesday shows that voter priorities are shifting toward economic and healthcare issues...",
    "tags": [
      "elections",
      "polls",
      "voter sentiment"
    ],
    "topics": [
      "US Politics",
      "Elections 2025"
    ],
    "imageUrl": "https://www.example-usnews.com/media/election-polls.jpg",
    "imageAlt": "Chart showing voter poll percentages by candidate.",
    "readingTimeMinutes": 6,
    "popularityScore": 0.87,
    "commentCount": 142,
    "shareCount": 923,
    "reactions": {
      "insightful": 321,
      "surprised": 54,
      "skeptical": 77
    },
    "language": "en",
    "region": "US",
    "source": "Example US News",
    "scrapeRunId": "run-2025-01-01T15-00-00Z",
    "scrapedAt": "2025-01-01T15:02:18Z"
  }
]

Directory Structure Tree

us-news-scraper/
├── src/
│   ├── index.ts
│   ├── crawler/
│   │   ├── bootstrap.ts
│   │   ├── queueManager.ts
│   │   └── paginationHandler.ts
│   ├── extractors/
│   │   ├── articleDetector.ts
│   │   ├── articleParser.ts
│   │   └── metadataNormalizer.ts
│   ├── outputs/
│   │   ├── exporterJson.ts
│   │   ├── exporterCsv.ts
│   │   ├── exporterXml.ts
│   │   └── exporterExcel.ts
│   ├── config/
│   │   ├── defaults.ts
│   │   └── schema.ts
│   └── utils/
│       ├── logger.ts
│       ├── httpClient.ts
│       └── rateLimiter.ts
├── config/
│   ├── start-urls.example.json
│   └── scraper.config.example.json
├── tests/
│   ├── articleDetector.test.ts
│   ├── articleParser.test.ts
│   └── integration.test.ts
├── data/
│   ├── sample-output.json
│   └── sample-output.csv
├── docs/
│   ├── usage.md
│   └── schema.md
├── package.json
├── tsconfig.json
├── README.md
└── LICENSE
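
One plausible way the stages in this tree compose, sketched as pseudo-wiring. None of these function names or signatures are confirmed exports of the listed files; they simply mirror the folder layout.

// Hypothetical wiring of the pipeline stages — the imported names are
// illustrative and mirror the directory layout, not verified exports.
import { discoverArticleUrls } from "./crawler/queueManager";
import { parseArticle } from "./extractors/articleParser";
import { exportJson } from "./outputs/exporterJson";

export async function run(startUrls: string[]): Promise<void> {
  // Crawl entry points, follow pagination, collect candidate article URLs.
  const articleUrls: string[] = await discoverArticleUrls(startUrls);

  // Visit each detail page and extract structured metadata.
  const records = [];
  for (const url of articleUrls) {
    records.push(await parseArticle(url));
  }

  // Hand the results to one of the supported exporters.
  await exportJson(records, "data/output.json");
}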

Use Cases

  • Media analysts use it to track coverage of specific topics over time, so they can quantify how quickly stories gain or lose attention.
  • Data scientists use it to build news-based sentiment and trend models, so they can forecast market or opinion shifts with richer signals.
  • Researchers and academics use it to assemble longitudinal corpora of US news articles, so they can study bias, framing, and agenda-setting at scale.
  • Marketing and communications teams use it to monitor how campaigns or brands are referenced in the news, so they can react faster and manage reputation.
  • News aggregators and dashboard builders use it to power content feeds and analytics panels, so they can deliver timely summaries to their users.

FAQs

Q1: Do I need to know the exact article URLs before I start?
No. You can provide section, topic, or search listing URLs as starting points. The scraper will follow pagination, discover article links, and then visit each article detail page to extract structured data. A sample start-URL list is sketched below.
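
An illustrative start-URL list in the spirit of config/start-urls.example.json. The actual file's schema may differ; the "type" field here is an assumption.

[
  { "url": "https://www.example-usnews.com/politics", "type": "section" },
  { "url": "https://www.example-usnews.com/topics/healthcare", "type": "topic" },
  { "url": "https://www.example-usnews.com/search?q=elections", "type": "search" }
]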

Q2: What output formats are supported?
You can export results as JSON, CSV, XML, HTML tables, or Excel files. This makes it easy to import the data into spreadsheets, BI tools, databases, or custom applications without additional conversion scripts.

Q3: How can I avoid overloading the target website?
You can configure delays between requests, maximum concurrency, and global item limits. This lets you tune runs to be respectful and stable while still collecting the volume of data you need. An illustrative throttling config follows.
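
A throttling configuration in the spirit of config/scraper.config.example.json. The field names are assumptions for illustration, not the confirmed schema.

{
  "requestDelayMs": 1000,
  "maxConcurrency": 4,
  "maxItems": 10000,
  "maxRetries": 3
}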

Q4: Is it okay to reuse or republish the article content?
Always check the website’s terms of use and relevant copyright laws before republishing or redistributing article text or media. The scraper is intended for analysis, research, and internal applications. For public reuse of content, seek legal advice and permissions where required.


Performance Benchmarks and Results

Primary Metric: On a typical broadband connection, the scraper can process approximately 300–600 article pages per minute when concurrency is set to a moderate level and pages are lightweight.

Reliability Metric: In test runs across mixed sections and topics, over 98% of reachable article URLs were successfully processed without critical extraction errors.

Efficiency Metric: With batching and streaming exports enabled, memory usage remains stable even for runs exceeding 50,000 articles, making it suitable for large-scale data collection.
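
Streaming export here means writing each record as it is produced instead of buffering the whole run in memory. A minimal Node.js sketch of that technique (not the project's actual exporter code):

import { createWriteStream } from "node:fs";

// Append each record as one JSON line (NDJSON) as soon as it is scraped,
// so memory stays flat no matter how many articles the run produces.
const out = createWriteStream("data/output.ndjson");

function writeRecord(record: object): void {
  out.write(JSON.stringify(record) + "\n");
}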

Quality Metric: For well-structured article templates, field-level completeness (title, author, timestamps, section, body) consistently exceeds 95%, with strict validation used to flag incomplete or malformed records for review.

Book a Call | Watch on YouTube

Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery.
Bitbash nailed it."

Syed
Digital Strategist
★★★★★