Multi-LLM pipeline for validated extraction and structuring of species-level botanical and ecological knowledge.

kpmainali/species-knowledge-base

Pollinator Yards – Species Knowledge Base (LLM-Driven)

An automated LLM-driven pipeline for collecting, validating, and structuring species data for native plant gardening applications, including traits, care requirements, phenology, pollinator interactions, and imagery.

Relationship to Pollinator Yards

This repository is a core component of Pollinator Yards, an AI-driven system that integrates spatial modeling, large ensemble learning, and LLM-based knowledge pipelines to deliver yard-scale native plant recommendations for pollinator conservation.

The species knowledge produced here is consumed by downstream modeling and decision layers within the Pollinator Yards framework.

Main project repository:
https://github.com/kpmainali/pollinator_yards

Overview

This project automates the collection of comprehensive plant species information through three main pipelines:

  1. PDF Collection - Scrapes species information from USDA, iNaturalist, and academic sources
  2. JSON Generation - Uses multi-LLM validation to extract structured data from PDFs
  3. Image Collection - Fetches and verifies plant images using GPT-4o vision

Project Structure

Pollinator_Yard/
├── pdf_scrapers/          # PDF collection scripts
│   ├── usda.py            # USDA Plant Database scraper
│   ├── inaturalist.py     # iNaturalist page downloader
│   └── web_scraper.py     # Academic/gov site scraper
│
├── json_generator/        # JSON creation pipeline
│   ├── main.py            # Main orchestrator
│   ├── excel_parser.py    # Species extraction from Excel
│   ├── pdf_extractor.py   # PDF text extraction
│   ├── llm_clients.py     # OpenAI/Anthropic API clients
│   ├── validator.py       # Multi-LLM validation
│   ├── json_writer.py     # JSON output handling
│   └── template.json      # JSON schema template
│
├── image_scraper/         # Image collection pipeline
│   ├── main.py            # Image pipeline orchestrator
│   ├── sources/           # API clients (iNaturalist, Flickr)
│   └── verification/      # GPT-4o vision verification
│
├── utils/                 # Shared utilities
├── scripts/               # Utility scripts
├── docs/                  # Documentation
│
├── Species initial list.xlsx  # Input species list
├── requirements.txt       # Python dependencies
└── .env.example           # API key template

Quick Start

1. Install Dependencies

cd Pollinator_Yard
pip install -r requirements.txt

# Install Playwright browser (required for PDF scrapers)
playwright install chromium

2. Configure API Keys

cp .env.example .env
# Edit .env with your API keys

Required keys:

  • OPENAI_API_KEY - For GPT models and image verification
  • ANTHROPIC_API_KEY - For Claude models

Optional keys:

  • INATURALIST_USERNAME / INATURALIST_PASSWORD - For iNaturalist guide downloads
  • FLICKR_API_KEY - For additional image sources
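
A quick way to confirm the keys above are present before launching a pipeline is sketched below. It assumes the values from .env have already been loaded into the process environment (for example by a dotenv loader or by exporting them in the shell); `missing_keys` is a hypothetical helper, not a function from this repository.

```python
import os

# Key names taken from the lists above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
OPTIONAL_KEYS = ("INATURALIST_USERNAME", "INATURALIST_PASSWORD", "FLICKR_API_KEY")


def missing_keys(env, names=REQUIRED_KEYS):
    """Return the names that are unset or blank in the given mapping."""
    return [name for name in names if not env.get(name, "").strip()]


# Typical use (hypothetical): check the real environment and stop early.
#   missing = missing_keys(os.environ)
#   if missing:
#       raise SystemExit("Missing required keys: " + ", ".join(missing))
```

Passing the environment in as a plain mapping keeps the check easy to test without touching real credentials.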

3. Prepare Input Data

Place your species list Excel file (Species initial list.xlsx in the project structure above) at the project root. Its sheets should include a "Scientific Name" column.
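
To illustrate the expected layout, here is a minimal sketch of how names could be gathered once the workbook has been read into memory. It assumes each sheet is already available as a list of rows with the header in the first row (loading the .xlsx itself, for instance with an Excel library, is left out). `collect_scientific_names` is a hypothetical helper and may differ from what excel_parser.py actually does.

```python
def collect_scientific_names(sheets, column="Scientific Name"):
    """Collect unique, non-empty names from every sheet that has the column.

    sheets: mapping of sheet name -> list of rows; rows[0] is the header row.
    Sheets without the column are skipped. First-seen order is preserved.
    """
    names = []
    seen = set()
    for rows in sheets.values():
        if not rows:
            continue
        header = ["" if cell is None else str(cell).strip() for cell in rows[0]]
        if column not in header:
            continue
        idx = header.index(column)
        for row in rows[1:]:
            # Skip short rows and empty cells.
            if idx >= len(row) or row[idx] is None:
                continue
            name = str(row[idx]).strip()
            if name and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```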

Usage

PDF Collection

# Download USDA fact sheets
python -m pdf_scrapers.usda

# Download iNaturalist pages
python -m pdf_scrapers.inaturalist

# Download from web sources
python -m pdf_scrapers.web_scraper

JSON Generation

# Process all species
python -m json_generator.main

# Process specific species
python -m json_generator.main --species "Aquilegia canadensis"

# Skip already processed species
python -m json_generator.main --skip-existing

# Preview what would be processed
python -m json_generator.main --dry-run
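
For readers curious how the flags above could be wired up, the following is a minimal argparse sketch. The flag names come from the commands above; the help strings, the `build_parser` name, and the parsing behavior are assumptions and may not match json_generator/main.py.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="json_generator.main")
    parser.add_argument(
        "--species",
        help='process a single species, e.g. "Aquilegia canadensis"',
    )
    parser.add_argument(
        "--skip-existing",
        action="store_true",
        help="skip species that already have output",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="list what would be processed without processing it",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--species", "Aquilegia canadensis", "--dry-run"])
```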

Image Collection

# Process all species
python -m image_scraper.main

# Process specific species
python -m image_scraper.main --species "Aquilegia canadensis"

# Limit to first N species
python -m image_scraper.main --limit 5

# Skip already processed
python -m image_scraper.main --skip-existing

JSON Schema

Each species JSON file includes:

{
  "scientific_name": "Aquilegia canadensis",
  "common_name": "Red columbine",
  "traits": {
    "flower_color": ["red", "yellow"],
    "bloom_season": ["April", "May", "June"],
    "sun": ["partial shade", "full shade"],
    "water": "Medium moisture",
    "deer_resistant": false,
    "pollinators": ["hummingbirds", "bees"],
    "height": {"min": 30, "max": 60, "unit": "cm"},
    "native_regions": ["Eastern North America"],
    "soil_type": ["well-drained"],
    "wildlife_value": ["nectar source"]
  },
  "care_summary": "Plant care description...",
  "description": "Botanical description...",
  "usda_Hardiness_zones": {"min": 3, "max": 8}
}
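
A light structural check on a generated file could look like the sketch below. The key names are taken from the example above; the type expectations and the `check_record` helper are assumptions for illustration, not the project's own validation logic.

```python
# Top-level keys from the example above, with the types they appear to hold.
EXPECTED_TOP_LEVEL = {
    "scientific_name": str,
    "common_name": str,
    "traits": dict,
    "care_summary": str,
    "description": str,
    "usda_Hardiness_zones": dict,
}


def _range_problem(label, value):
    """Return a message if value is not a {min, max} object with min <= max."""
    if not isinstance(value, dict) or not {"min", "max"} <= set(value):
        return f"{label}: expected an object with min and max"
    if value["min"] > value["max"]:
        return f"{label}: min exceeds max"
    return None


def check_record(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key, expected in EXPECTED_TOP_LEVEL.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    # Range checks for the two min/max objects in the example.
    zones = record.get("usda_Hardiness_zones")
    if isinstance(zones, dict):
        problem = _range_problem("usda_Hardiness_zones", zones)
        if problem:
            problems.append(problem)
    traits = record.get("traits")
    if isinstance(traits, dict) and "height" in traits:
        problem = _range_problem("traits.height", traits["height"])
        if problem:
            problems.append(problem)
    return problems
```

A record loaded with `json.load` can be passed straight to `check_record`.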

Multi-LLM Validation

The JSON generator uses a three-LLM validation approach:

  1. LLM 1 (GPT) - Generates JSON independently from PDF evidence
  2. LLM 2 (Claude) - Generates JSON independently from same evidence
  3. LLM 3 (Validator) - Compares, resolves conflicts, produces final JSON

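Because the first two models work independently from the same evidence, disagreements between them flag fields that need review. Cross-validating the outputs in this way is intended to improve accuracy and reduce hallucination.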
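
The compare-and-resolve step can be sketched as follows. This is an illustration under stated assumptions, not the code in validator.py: the two candidates are plain dicts, and `resolve` is a hypothetical callback standing in for the third LLM, called once per disagreeing field.

```python
def find_conflicts(candidate_a, candidate_b, prefix=""):
    """Return dotted paths of fields where the two candidates differ.

    A key missing from one candidate is treated as None on that side.
    """
    conflicts = []
    for key in sorted(set(candidate_a) | set(candidate_b)):
        path = f"{prefix}{key}"
        a, b = candidate_a.get(key), candidate_b.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            conflicts.extend(find_conflicts(a, b, prefix=path + "."))
        elif a != b:
            conflicts.append(path)
    return conflicts


def reconcile(candidate_a, candidate_b, resolve, prefix=""):
    """Keep agreed values; call resolve(path, a, b) for each disagreement."""
    merged = {}
    for key in sorted(set(candidate_a) | set(candidate_b)):
        path = f"{prefix}{key}"
        a, b = candidate_a.get(key), candidate_b.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            merged[key] = reconcile(a, b, resolve, prefix=path + ".")
        elif a == b:
            merged[key] = a
        else:
            merged[key] = resolve(path, a, b)
    return merged


# Example with two hypothetical model outputs that disagree on one trait.
gpt_json = {
    "scientific_name": "Aquilegia canadensis",
    "traits": {"deer_resistant": False, "water": "Medium moisture"},
}
claude_json = {
    "scientific_name": "Aquilegia canadensis",
    "traits": {"deer_resistant": True, "water": "Medium moisture"},
}
conflicts = find_conflicts(gpt_json, claude_json)
# Stand-in resolver that always prefers the first candidate.
final_json = reconcile(gpt_json, claude_json, lambda path, a, b: a)
```

Restricting the validator's attention to the conflicting paths keeps its task small; whether the actual pipeline passes only the conflicts or both full documents to the validator is not stated here.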

Data Privacy

  • API keys are stored in .env (git-ignored)
  • Downloaded PDFs and images are stored locally (git-ignored)
  • No user data is collected or transmitted

License

Internal use for the Pollinator Yards project.

Contributors

  • Kumar Mainali
  • Sahil Kharel
