Medical Indication Market Sizing Scraper

Two-part pipeline:

1. An LLM-based scraper extracts structured drug-candidate data from pharma pipeline pages (including pipeline images).
2. A market sizing companion finds market sizes for those indications and appends source-backed figures.
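
To make "structured drug-candidate data" concrete, here is a minimal sketch of how an LLM's JSON output might be checked before it enters the dataset. The DrugCandidate fields and the parse_candidates helper are assumptions for illustration only; the actual schema used by the scripts in this repo may differ.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these field names are assumptions, not the repo's format.
REQUIRED = ("company", "candidate", "indication", "phase")


@dataclass(frozen=True)
class DrugCandidate:
    company: str
    candidate: str
    indication: str
    phase: str


def _clean(value) -> str:
    """Return a trimmed string, or an empty string for non-string values."""
    return value.strip() if isinstance(value, str) else ""


def parse_candidates(llm_output: str) -> list[DrugCandidate]:
    """Parse a JSON array returned by the LLM and keep only complete records."""
    parsed = []
    for rec in json.loads(llm_output):
        if not isinstance(rec, dict):
            continue
        fields = {k: _clean(rec.get(k)) for k in REQUIRED}
        # Skip records with a missing field rather than guessing a value.
        if all(fields.values()):
            parsed.append(DrugCandidate(**fields))
    return parsed
```

Dropping incomplete records instead of filling them in keeps later market-size lookups from matching on invented values.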

Why it matters

Pipeline diagrams are often images or dynamic DOM content.
This repo demonstrates a practical, reproducible path from messy public artifacts → structured, analyzable data.
By combining drug-candidate and market-size data, one can estimate how much companies are investing in specific indications.

Tech Highlights

- Python: LLM calls, image→JSON extraction, summary generation for each drug candidate, CSV consolidation.
- JS/TS: headless/stealth asset fetching for dynamic pipelines and pipeline image URLs (Puppeteer/Playwright).
- Data hygiene: JSON schemas, normalized fields, deterministic exports.
- Master runner: process multiple companies simultaneously and export a consolidated CSV.
- Secrets: .env pattern; nothing sensitive committed.
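
The "normalized fields", "deterministic exports", and "master runner" points can be sketched roughly as follows. This is not the repo's runner: extract_company is a hypothetical stand-in for the real fetch + LLM step, and the column set in FIELDS is assumed.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

FIELDS = ["company", "candidate", "indication", "phase"]  # assumed column set


def extract_company(company: str) -> list[dict]:
    """Hypothetical stand-in for the per-company fetch + LLM extraction step."""
    return [{"company": company, "candidate": f"{company}-001",
             "indication": "  Example Indication ", "phase": "phase 1"}]


def normalize(row: dict) -> dict:
    """Collapse whitespace and standardize phase casing so equal values compare equal."""
    clean = {k: " ".join(str(row.get(k, "")).split()) for k in FIELDS}
    clean["phase"] = clean["phase"].title()
    return clean


def run_all(companies: list[str], worker=extract_company) -> str:
    """Process companies concurrently and return one consolidated CSV string."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = list(pool.map(worker, companies))
    rows = [normalize(r) for batch in batches for r in batch]
    # Sorting makes the export independent of the order companies were given in.
    rows.sort(key=lambda r: tuple(r[k] for k in FIELDS))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Threads are a reasonable fit here because the per-company work is mostly waiting on network and LLM responses. The fixed column order, explicit sort, and fixed line terminator are what keep repeated runs byte-identical.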

Quickstart

Requirements
- Python 3.10+ with pip
- Node 18+ (if using Puppeteer/Playwright)
- Git LFS (optional, for large files)

Install

# Python
pip install -r requirements.txt
# Node
npm install

Run The Flow

1. Fetch pipeline asset(s) (example: Ethris):
    node scrapers/js/ethris_fetch.js
2. Analyze pipeline image -> structured JSON:
    python scrapers/python/analyze_pipeline.py --input data/raw/ethris_pipeline.png --out outputs/ethris.json
3. Consolidate JSON -> CSV:
    python scripts/consolidate_to_csv.py --input outputs --out outputs/combined.csv
4. Market sizing: append size + source to the dataset:
    python market_sizing/fetch_market_sizes.py --in outputs/combined.csv --out outputs/with_sizes.csv
    python market_sizing/enrich_dataset.py --in outputs/with_sizes.csv --out outputs/final.csv
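
As a rough illustration of what "append size + source" could mean at the data level, the sketch below joins a consolidated CSV against a lookup of market figures keyed by indication. The column names market_size and market_size_source, the indication column, and the append_market_sizes helper are all assumptions; the real market_sizing scripts may work differently.

```python
import csv
import io


def key(indication: str) -> str:
    """Build a case- and whitespace-insensitive lookup key."""
    return " ".join(indication.lower().split())


def append_market_sizes(candidates_csv: str,
                        sizes: dict[str, tuple[str, str]]) -> str:
    """Add size and source columns; leave both blank when no figure is known."""
    lookup = {key(name): value for name, value in sizes.items()}
    reader = csv.DictReader(io.StringIO(candidates_csv))
    fields = list(reader.fieldnames or []) + ["market_size", "market_size_source"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Assumes the input has an "indication" column.
        size, source = lookup.get(key(row["indication"]), ("", ""))
        row["market_size"], row["market_size_source"] = size, source
        writer.writerow(row)
    return buf.getvalue()
```

Leaving unmatched rows blank, rather than estimating a figure, keeps every populated value traceable to its source.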

Credits

Portions of this project were inspired by the approach and examples in mishushakov/llm-scraper (MIT). Directly adapted files include:

- scrapers/js/ethris_fetch.js (adapted)
- scrapers/python/analyze_pipeline.py (inspired by LLM-driven extraction patterns)

We retained original notices where appropriate and reference the upstream project here.

takers2018/medical-indication-market-sizing-scraper

Folders and files

Latest commit

History

Repository files navigation

Medical Indication Market Sizing Scraper

Why it matters

Tech Highlights

Quickstart

Run The Flow

Credits

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages