An automated LLM-driven pipeline for collecting, validating, and structuring species data for native plant gardening applications, including traits, care requirements, phenology, pollinator interactions, and imagery.
This repository is a core component of Pollinator Yards, an AI-driven system that integrates spatial modeling, large ensemble learning, and LLM-based knowledge pipelines to deliver yard-scale native plant recommendations for pollinator conservation.
The species knowledge produced here is consumed by downstream modeling and decision layers within the Pollinator Yards framework.
Main project repository:
https://github.com/kpmainali/pollinator_yards
This project automates the collection of comprehensive plant species information through three main pipelines:
- PDF Collection - Scrapes species information from USDA, iNaturalist, and academic sources
- JSON Generation - Uses multi-LLM validation to extract structured data from PDFs
- Image Collection - Fetches and verifies plant images using GPT-4o vision
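The three stages can also be chained from one driver script. The following is a minimal sketch, not code from this repository: it only assembles the `python -m` module invocations that this README documents, in pipeline order. The helper names `build_commands` and `run_pipeline` are hypothetical, and the sketch assumes the `--species` flag applies only to the JSON and image stages, since that is all the README shows.

```python
import subprocess
import sys
from typing import Optional

# Module names are the ones this README invokes with `python -m`.
# The helper names below are hypothetical, for illustration only.
PDF_MODULES = [
    "pdf_scrapers.usda",
    "pdf_scrapers.inaturalist",
    "pdf_scrapers.web_scraper",
]
SPECIES_MODULES = ["json_generator.main", "image_scraper.main"]


def build_commands(species: Optional[str] = None) -> list[list[str]]:
    """Return the stage commands in order: PDFs, then JSON, then images."""
    commands = [[sys.executable, "-m", module] for module in PDF_MODULES]
    for module in SPECIES_MODULES:
        command = [sys.executable, "-m", module]
        if species is not None:
            # Assumption: --species is only passed to the JSON and image stages.
            command += ["--species", species]
        commands.append(command)
    return commands


def run_pipeline(species: Optional[str] = None) -> None:
    """Run each stage in order and stop at the first failing stage."""
    for command in build_commands(species):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("Aquilegia canadensis")` from the project root would run the five commands one after another; `check=True` makes a failing stage raise instead of letting later stages run on incomplete inputs.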
Pollinator_Yard/
├── pdf_scrapers/ # PDF collection scripts
│ ├── usda.py # USDA Plant Database scraper
│ ├── inaturalist.py # iNaturalist page downloader
│ └── web_scraper.py # Academic/gov site scraper
│
├── json_generator/ # JSON creation pipeline
│ ├── main.py # Main orchestrator
│ ├── excel_parser.py # Species extraction from Excel
│ ├── pdf_extractor.py # PDF text extraction
│ ├── llm_clients.py # OpenAI/Anthropic API clients
│ ├── validator.py # Multi-LLM validation
│ ├── json_writer.py # JSON output handling
│ └── template.json # JSON schema template
│
├── image_scraper/ # Image collection pipeline
│ ├── main.py # Image pipeline orchestrator
│ ├── sources/ # API clients (iNaturalist, Flickr)
│ └── verification/ # GPT-4o vision verification
│
├── utils/ # Shared utilities
├── scripts/ # Utility scripts
├── docs/ # Documentation
│
├── Species initial list.xlsx # Input species list
├── requirements.txt # Python dependencies
└── .env.example # API key template
cd Pollinator_Yard
pip install -r requirements.txt
# Install Playwright browser (required for PDF scrapers)
playwright install chromium
cp .env.example .env
# Edit .env with your API keys

Required keys:
- OPENAI_API_KEY - For GPT models and image verification
- ANTHROPIC_API_KEY - For Claude models
Optional keys:
- INATURALIST_USERNAME / INATURALIST_PASSWORD - For iNaturalist guide downloads
- FLICKR_API_KEY - For additional image sources
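A quick pre-flight check can catch a missing key before any API call is made. This is a minimal sketch under stated assumptions: the key names come from the lists above, the helper `missing_keys` is hypothetical, and the values are assumed to be already present in the process environment (how the project loads `.env` is not described here).

```python
import os
from typing import Mapping

# Key names as listed in this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
OPTIONAL_KEYS = ("INATURALIST_USERNAME", "INATURALIST_PASSWORD", "FLICKR_API_KEY")


def missing_keys(env: Mapping[str, str], names=REQUIRED_KEYS) -> list[str]:
    """Return the names that are unset or blank in the given environment."""
    return [name for name in names if not env.get(name, "").strip()]


def describe_environment(env: Mapping[str, str]) -> str:
    """Summarize which required and optional keys are missing."""
    required = missing_keys(env, REQUIRED_KEYS)
    optional = missing_keys(env, OPTIONAL_KEYS)
    return (
        f"missing required: {', '.join(required) or 'none'}; "
        f"missing optional: {', '.join(optional) or 'none'}"
    )
```

For example, `print(describe_environment(os.environ))` reports the state of the current shell. Passing the environment as a mapping keeps the check easy to test with a plain dictionary.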
Place your species list Excel file at the project root. The file should have sheets with a "Scientific Name" column.
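To illustrate the expected layout, the sketch below collects names from every sheet that has a "Scientific Name" column. It is not the project's `excel_parser.py`: the function name `scientific_names` is hypothetical, and it assumes the header sits in the first row of each sheet, that sheets without the column are skipped, and that duplicates are dropped. It works on rows that have already been read from the workbook, for instance via openpyxl's `worksheet.iter_rows(values_only=True)`.

```python
from typing import Iterable, Mapping, Sequence

HEADER = "Scientific Name"


def scientific_names(sheets: Mapping[str, Iterable[Sequence[object]]]) -> list[str]:
    """Collect unique, non-empty names from each sheet that has the header column."""
    names: list[str] = []
    seen: set[str] = set()
    for rows in sheets.values():
        row_iter = iter(rows)
        header = next(row_iter, None)
        if header is None:
            continue  # empty sheet
        labels = [str(cell).strip() if cell is not None else "" for cell in header]
        if HEADER not in labels:
            continue  # assumption: sheets without the column are ignored
        column = labels.index(HEADER)
        for row in row_iter:
            cell = row[column] if column < len(row) else None
            name = str(cell).strip() if cell is not None else ""
            if name and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

Each sheet is passed as a sequence of rows, so the same function can be exercised with small in-memory tables before pointing it at the real workbook.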
# Download USDA fact sheets
python -m pdf_scrapers.usda
# Download iNaturalist pages
python -m pdf_scrapers.inaturalist
# Download from web sources
python -m pdf_scrapers.web_scraper

# Process all species
python -m json_generator.main
# Process specific species
python -m json_generator.main --species "Aquilegia canadensis"
# Skip already processed species
python -m json_generator.main --skip-existing
# Preview what would be processed
python -m json_generator.main --dry-run

# Process all species
python -m image_scraper.main
# Process specific species
python -m image_scraper.main --species "Aquilegia canadensis"
# Limit to first N species
python -m image_scraper.main --limit 5
# Skip already processed
python -m image_scraper.main --skip-existing

Each species JSON file includes:
{
"scientific_name": "Aquilegia canadensis",
"common_name": "Red columbine",
"traits": {
"flower_color": ["red", "yellow"],
"bloom_season": ["April", "May", "June"],
"sun": ["partial shade", "full shade"],
"water": "Medium moisture",
"deer_resistant": false,
"pollinators": ["hummingbirds", "bees"],
"height": {"min": 30, "max": 60, "unit": "cm"},
"native_regions": ["Eastern North America"],
"soil_type": ["well-drained"],
"wildlife_value": ["nectar source"]
},
"care_summary": "Plant care description...",
"description": "Botanical description...",
"usda_Hardiness_zones": {"min": 3, "max": 8}
}

The JSON generator uses a three-LLM validation approach:
- LLM 1 (GPT) - Generates JSON independently from PDF evidence
- LLM 2 (Claude) - Generates JSON independently from same evidence
- LLM 3 (Validator) - Compares, resolves conflicts, produces final JSON
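Cross-validating two independently generated outputs is intended to improve accuracy and reduce hallucination: a value that only one model produced is reviewed by the validator rather than accepted as-is.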
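The compare-and-resolve step can be sketched as follows. This is an illustration under stated assumptions, not the logic in `validator.py`: the function names are hypothetical, records are compared at the top-level keys only, and the third LLM is stood in for by a plain callable that receives the field name and both candidate values.

```python
from typing import Any, Callable

Record = dict[str, Any]
# Stand-in for the validator LLM: (field name, value A, value B) -> chosen value.
Resolver = Callable[[str, Any, Any], Any]


def find_conflicts(first: Record, second: Record) -> list[str]:
    """Return the top-level keys on which the two candidate records disagree."""
    keys = sorted(set(first) | set(second))
    return [key for key in keys if first.get(key) != second.get(key)]


def merge_candidates(first: Record, second: Record, resolve: Resolver) -> Record:
    """Keep values both candidates agree on; pass each disagreement to resolve."""
    merged: Record = {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        merged[key] = a if a == b else resolve(key, a, b)
    return merged


def prefer_first_present(key: str, a: Any, b: Any) -> Any:
    """Toy resolver: take the first candidate's value unless it is missing."""
    return a if a is not None else b
```

In the pipeline described above, the resolver role would be played by the validator model, which can also consult the PDF evidence. Listing conflicts separately with `find_conflicts` makes it possible to log which fields the two generators disagreed on.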
- API keys are stored in .env (git-ignored)
- Downloaded PDFs and images are stored locally (git-ignored)
- No user data is collected or transmitted
For internal use within the Pollinator Yards project.
- Kumar Mainali
- Sahil Kharel