news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filtering for targeted research.
Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.
User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
```shell
pip install news-watch
playwright install chromium
```

Development setup: see https://okky.dev/news-watch/getting-started/
To run the scraper from the command line:

```shell
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v
```

### Command-Line Arguments
| Argument | Description |
|---|---|
| `-k, --keywords` | Required. Comma-separated keywords to scrape (e.g., "ojk,bank,npl") |
| `-sd, --start_date` | Required. Start date in YYYY-MM-DD format (e.g., 2025-01-01) |
| `-s, --scrapers` | Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail) |
| `-of, --output_format` | Output format: csv, xlsx, or json (default: csv) |
| `-o, --output_path` | Custom output file path (optional) |
| `-v, --verbose` | Show detailed logging output (default: silent) |
| `--list_scrapers` | List all supported scrapers and exit |
```shell
# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -s "detik" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers
```

```python
import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df["source"].value_counts())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg",
    start_date="2025-01-01",
    output_path="financial_news.xlsx",
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)
```

See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.
You can also run news-watch on Google Colab.
The scraped articles are saved as a CSV, XLSX, or JSON file in the current working directory, named `news-watch-{keywords}-YYYYMMDD_HH`.
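The timestamped name can be reproduced for downstream tooling. A minimal sketch, assuming the keywords string appears verbatim in the filename (the package may sanitize it differently):

```python
from datetime import datetime

def output_name(keywords: str, fmt: str = "csv") -> str:
    # Builds news-watch-{keywords}-YYYYMMDD_HH plus the extension,
    # e.g. news-watch-ihsg-20250101_09.csv
    stamp = datetime.now().strftime("%Y%m%d_%H")
    return f"news-watch-{keywords}-{stamp}.{fmt}"

print(output_name("ihsg"))
```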
The output file contains the following columns:
- `title`
- `publish_date`
- `author`
- `content`
- `keyword`
- `category`
- `source`
- `link`
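As a sketch of working with these fields downstream, using invented placeholder records in the documented shape (the `publish_date` string format shown is an assumption, not guaranteed by the package):

```python
from collections import Counter
from datetime import datetime

# Placeholder records mimicking news-watch output rows (not real scraped data)
articles = [
    {"title": "IHSG menguat", "publish_date": "2025-01-02 09:00:00",
     "author": "Redaksi", "content": "...", "keyword": "ihsg",
     "category": "ekonomi", "source": "kompas.com", "link": "https://example.com/a"},
    {"title": "Kredit bank tumbuh", "publish_date": "2025-01-03 10:30:00",
     "author": "Redaksi", "content": "...", "keyword": "bank",
     "category": "ekonomi", "source": "detik.com", "link": "https://example.com/b"},
]

# Count articles per source
per_source = Counter(a["source"] for a in articles)

# Parse publish_date for date-based filtering
dates = [datetime.strptime(a["publish_date"], "%Y-%m-%d %H:%M:%S") for a in articles]
earliest = min(dates).date()
print(per_source, earliest)
```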
- Antaranews.com
- Bisnis.com
- Bloomberg Technoz
- CNBC Indonesia
- CNN Indonesia
- Detik.com
- IDN Times
- Jawapos.com
- Katadata.co.id
- Kompas.com
- Kontan.co.id
- Liputan6.com
- Media Indonesia
- Metrotvnews.com
- Okezone.com
- Kumparan
- Merdeka
- Republika
- Suara
- Tempo.co
- Tirto
- Tribunnews.com
- Viva.co.id
Note:
- On Linux platforms / cloud environments: Katadata relies on bearer-token capture (via Playwright) and may fail in restricted environments, so it is automatically excluded in `auto` mode on Linux.
- Use `-s all` to force-run all scrapers (may cause errors/timeouts).
- Limitation: the Kontan scraper fetches at most 50 pages.
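The `auto` selection rule above can be sketched as follows; the scraper-name list here is an abbreviated, hypothetical subset for illustration, not the package's actual internals:

```python
import sys

# Hypothetical subset of scraper names, for illustration only
ALL_SCRAPERS = ["kompas", "detik", "kontan", "katadata"]

def auto_scrapers(platform: str = sys.platform) -> list:
    # "auto" mode: drop katadata on Linux, where its Playwright-based
    # bearer-token capture may fail in restricted environments
    if platform.startswith("linux"):
        return [s for s in ALL_SCRAPERS if s != "katadata"]
    return list(ALL_SCRAPERS)

print(auto_scrapers("linux"))
```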
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.
```bibtex
@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title  = {news-watch},
  year   = {2025},
  doi    = {10.5281/zenodo.14908389}
}
```