Two-part pipeline:
- LLM-based scraper extracts structured drug-candidate data from pharma pipeline pages (including pipeline images).
- Market sizing companion finds market sizes for those indications and appends source-backed figures.
- Pipeline diagrams are often images or dynamic DOM content.
- This repo demonstrates a practical, reproducible path from messy public artifacts → structured, analyzable data.
- By combining drug-candidate and market-size data, one can estimate how much companies are investing in specific indications.
- Python: LLM calls, image→JSON extraction, summary generation for each drug candidate, CSV consolidation.
- JS/TS: headless/stealth asset fetching for dynamic pipelines and pipeline image URLs (Puppeteer/Playwright).
- Data hygiene: JSON schemas, normalized fields, deterministic exports.
- Master runner: process multiple companies simultaneously and export a consolidated CSV.
- Secrets: .env pattern; nothing sensitive committed.
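The data-hygiene points above (normalized fields, deterministic exports) can be illustrated with a minimal sketch. The field names, the phase alias table, and both function names are assumptions made for illustration; they are not taken from the repo's actual schema or code.

```python
import json

# Hypothetical required fields; the repo's real schema may differ.
REQUIRED_FIELDS = ("company", "candidate", "indication", "phase")

# Hypothetical alias table mapping free-text phase labels to one canonical form.
PHASE_ALIASES = {
    "preclinical": "Preclinical",
    "pre-clinical": "Preclinical",
    "phase 1": "Phase 1",
    "phase i": "Phase 1",
    "phase 2": "Phase 2",
    "phase ii": "Phase 2",
    "phase 3": "Phase 3",
    "phase iii": "Phase 3",
}


def normalize_record(raw: dict) -> dict:
    """Check required fields, collapse whitespace, and canonicalize the phase."""
    missing = [
        field for field in REQUIRED_FIELDS
        if raw.get(field) is None or not str(raw[field]).strip()
    ]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # " ".join(s.split()) trims the ends and collapses internal runs of whitespace.
    record = {field: " ".join(str(raw[field]).split()) for field in REQUIRED_FIELDS}
    record["phase"] = PHASE_ALIASES.get(record["phase"].lower(), record["phase"])
    return record


def dump_deterministic(records: list[dict]) -> str:
    """Serialize records so that the same input set always yields the same text."""
    ordered = sorted(records, key=lambda r: (r["company"], r["candidate"]))
    return json.dumps(ordered, sort_keys=True, indent=2, ensure_ascii=False)
```

Sorting both the records and the keys means that re-running an export on unchanged data produces a byte-identical file, which keeps diffs meaningful.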
- Requirements
  - Python 3.10+ with pip
  - Node 18+ (if using Puppeteer/Playwright)
  - Git LFS (optional, for large files)
- Install
      # Python
      pip install -r requirements.txt
      # Node
      npm install
- Fetch pipeline asset(s) (example: Ethris)
      node scrapers/js/ethris_fetch.js
- Analyze pipeline image -> structured JSON
      python scrapers/python/analyze_pipeline.py ^
        --input data/raw/ethris_pipeline.png ^
        --out outputs/ethris.json
- Consolidate JSON -> CSV
      python scripts/consolidate_to_csv.py --input outputs --out outputs/combined.csv
- Market sizing: append size + source to the dataset
      python market_sizing/fetch_market_sizes.py --in outputs/combined.csv --out outputs/with_sizes.csv
      python market_sizing/enrich_dataset.py --in outputs/with_sizes.csv --out outputs/final.csv
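As a rough illustration of what the consolidation and market-sizing steps above might do, the sketch below merges per-company JSON files into one CSV and then appends a market size and its source per indication. It is not the code in scripts/consolidate_to_csv.py or market_sizing/; the column names, the function names, and the in-memory `sizes` lookup are all hypothetical.

```python
import csv
import json
from pathlib import Path

# Hypothetical column set; the repo's real CSV layout may differ.
FIELDS = ["company", "candidate", "indication", "phase"]


def consolidate(input_dir: str, out_csv: str) -> int:
    """Merge every *.json file in input_dir into one CSV; return the row count."""
    rows = []
    # Sorted file order keeps the export deterministic across runs.
    for path in sorted(Path(input_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        rows.extend(data if isinstance(data, list) else [data])
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        # Unknown keys are dropped and missing ones are written as empty strings.
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore", restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def append_market_sizes(in_csv: str, out_csv: str, sizes: dict) -> None:
    """Add market size and source columns.

    `sizes` maps a lowercase indication to (market_size_usd, source).
    Indications without an entry get empty cells rather than a guessed value.
    """
    with open(in_csv, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    out_fields = FIELDS + ["market_size_usd", "market_size_source"]
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=out_fields, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            key = (row.get("indication") or "").strip().lower()
            size, source = sizes.get(key, ("", ""))
            writer.writerow({**row, "market_size_usd": size, "market_size_source": source})
```

Writing the source alongside each figure keeps the enrichment auditable: a row with an empty source is visibly unbacked instead of silently filled.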
Portions of this project were inspired by the approach and examples in [mishushakov/llm-scraper] (MIT). Directly adapted files include:
- scrapers/js/ethris_fetch.js (adapted)
- scrapers/python/analyze_pipeline.py (inspired by LLM-driven extraction patterns)
We retained original notices where appropriate and reference the upstream project here.