Demonstrates a practical data product: headless JS fetchers capture dynamic pipeline assets, Python + LLMs parse them into structured tables, then market sizes and sources are appended for analysis.

takers2018/medical-indication-market-sizing-scraper

Medical Indication Market Sizing Scraper

Two-part pipeline:

  1. LLM-based scraper extracts structured drug-candidate data from pharma pipeline pages (including pipeline images).
  2. Market sizing companion finds market sizes for those indications and appends source-backed figures.

Why it matters

  • Pipeline diagrams are often images or dynamic DOM content.
  • This repo demonstrates a practical, reproducible path from messy public artifacts → structured, analyzable data.
  • Combining drug-candidate and market-size data makes it possible to estimate how much companies are investing in specific indications.

Tech Highlights

  • Python: LLM calls, image→JSON extraction, summary generation for each drug candidate, CSV consolidation.
  • JS/TS: headless/stealth asset fetching for dynamic pipelines and pipeline image URLs (Puppeteer/Playwright).
  • Data hygiene: JSON schemas, normalized fields, deterministic exports.
  • Master runner: process multiple companies simultaneously and export a consolidated CSV.
  • Secrets: .env pattern; nothing sensitive committed.
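
The data-hygiene points above (normalized fields, deterministic exports) can be illustrated with a minimal sketch. The field names (`company`, `drug_name`, `indication`, `phase`), the `Candidate` class, and the phase alias table are assumptions made for illustration; the repository's actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical required fields; the real JSON schema may use other names.
REQUIRED = ("company", "drug_name", "indication", "phase")

# Assumed mapping from free-text phase labels to canonical values.
PHASE_ALIASES = {
    "preclinical": "Preclinical",
    "pre-clinical": "Preclinical",
    "phase 1": "Phase 1",
    "phase i": "Phase 1",
    "phase 2": "Phase 2",
    "phase ii": "Phase 2",
    "phase 3": "Phase 3",
    "phase iii": "Phase 3",
}


@dataclass(frozen=True)
class Candidate:
    company: str
    drug_name: str
    indication: str
    phase: str


def normalize(raw: dict) -> Candidate:
    """Collapse whitespace, check required fields, canonicalize the phase."""
    clean = {k: " ".join(str(raw.get(k) or "").split()) for k in REQUIRED}
    missing = [k for k, v in clean.items() if not v]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Unknown phase labels are kept as-is rather than guessed.
    clean["phase"] = PHASE_ALIASES.get(clean["phase"].lower(), clean["phase"])
    return Candidate(**clean)


def export(candidates) -> str:
    """Serialize with a fixed row order and key order so reruns are identical."""
    rows = sorted(
        (asdict(c) for c in candidates),
        key=lambda r: (r["company"], r["drug_name"], r["indication"]),
    )
    return json.dumps(rows, sort_keys=True, indent=2, ensure_ascii=False)
```

Sorting rows and keys before serializing means two runs over the same inputs produce byte-identical output, which keeps diffs of exported files meaningful.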

Quickstart

  1. Requirements

    • Python 3.10+ with pip
    • Node 18+ (if using Puppeteer/Playwright)
    • Git LFS (optional, for large files)
  2. Install

    # Python
    pip install -r requirements.txt
    # Node
    npm install
    

Run The Flow

  1. Fetch pipeline asset(s) (example: Ethris)

    node scrapers/js/ethris_fetch.js

  2. Analyze pipeline image -> structured JSON

    python scrapers/python/analyze_pipeline.py --input data/raw/ethris_pipeline.png --out outputs/ethris.json

  3. Consolidate JSON -> CSV

    python scripts/consolidate_to_csv.py --input outputs --out outputs/combined.csv

  4. Market sizing: append size + source to the dataset

    python market_sizing/fetch_market_sizes.py --in outputs/combined.csv --out outputs/with_sizes.csv
    python market_sizing/enrich_dataset.py --in outputs/with_sizes.csv --out outputs/final.csv
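
To make steps 3 and 4 concrete, here is a rough sketch of what consolidating per-company JSON and appending market sizes could look like. It is not the code in `consolidate_to_csv.py` or `enrich_dataset.py`: the column names, the assumption that each JSON file holds a list of candidate objects, and the in-memory `sizes` lookup keyed by indication are all hypothetical.

```python
import csv
import io
import json
from pathlib import Path

# Hypothetical column names; the real scripts may use different ones.
CANDIDATE_FIELDS = ["company", "drug_name", "indication", "phase"]
SIZE_FIELDS = ["market_size_usd", "market_size_source"]


def load_records(json_dir: str) -> list:
    """Read every *.json file in json_dir, in sorted order for determinism.

    Assumes each file contains a JSON list of candidate dicts.
    """
    records = []
    for path in sorted(Path(json_dir).glob("*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def enrich(records: list, sizes: dict) -> list:
    """Left-join market sizes onto candidates by case-insensitive indication.

    Candidates without a match keep empty size and source fields, so no
    figure is ever invented for an unmatched indication.
    """
    lookup = {key.strip().lower(): value for key, value in sizes.items()}
    rows = []
    for rec in records:
        row = {field: rec.get(field, "") for field in CANDIDATE_FIELDS}
        match = lookup.get(str(row["indication"]).strip().lower(), {})
        for field in SIZE_FIELDS:
            row[field] = match.get(field, "")
        rows.append(row)
    return rows


def to_csv(rows: list) -> str:
    """Write rows with a fixed column order and newline style."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=CANDIDATE_FIELDS + SIZE_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the source next to each figure in the same row means every market size in the final CSV can be traced back to where it came from.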

Credits

Portions of this project were inspired by the approach and examples in mishushakov/llm-scraper (MIT). Files adapted from or inspired by that project include:

  • scrapers/js/ethris_fetch.js (adapted)
  • scrapers/python/analyze_pipeline.py (inspired by LLM-driven extraction patterns)

We retained original notices where appropriate and reference the upstream project here.
