Texas Government Data Scraper Toolkit


A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs and the Google Places API

Features • Installation • Usage • Docs • Contributing • Support


A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs, including the Socrata Open Data Portal and the Texas Comptroller API. Features GPU acceleration (CUDA/cuDNN), intelligent data merging, automated deduplication, and precise business information via the Google Places API.

Features

Core Capabilities

  • Dual API Support: Socrata Open Data Portal + Texas Comptroller API
  • GPU Acceleration: CUDA/cuDNN optimized for NVIDIA RTX 3060
  • Interactive CLI: User-friendly menus for all operations
  • Multi-Format Export: JSON, CSV, and Excel with automatic formatting
  • Smart Data Merging: Intelligent field prioritization and conflict resolution
  • Advanced Deduplication: Multiple strategies with merge capabilities
  • Rate Limiting: Intelligent throttling with automatic token management
  • Comprehensive Logging: Detailed logs with rotation and compression
  • Progress Persistence: Resume interrupted downloads from checkpoints (v1.1.0)
  • Export Verification: SHA-256 checksums for data integrity (v1.1.0)
  • Smart Field Detection: Case-insensitive matching with 20+ field variations (v1.2.0)
  • Global Auto-Deduplication: Automatically skips already-scraped records (v1.2.0)
  • Append-to-Existing Exports: Single consolidated file per dataset (v1.2.0)
  • Process ALL Socrata Files: Bulk process all datasets through Comptroller (v1.3.0)
  • Separate Comptroller Files: Source-specific filenames per dataset (v1.3.0)
  • Master Combine All: Full pipeline merge of all Socrata + Comptroller data (v1.3.0)
  • 9 Manual Combine Options: Granular control over file merging (v1.3.0)
  • Outlet Data Enricher: Extract outlet fields from duplicate Socrata records (v1.4.0)
  • Persistent Disk Caching: Comptroller cache survives restarts, making long runs truly resumable (v1.4.0)
  • Network Retry with Backoff: Automatic recovery from internet outages (v1.4.0)
  • Configurable Comptroller Settings: Fine-tune concurrent requests, chunk size, delays (v1.4.0)
  • Google Places API Integration: Get phone, website, ratings, hours from Google (v1.5.0)
  • Two-Step Places Workflow: Find Place IDs → Get Place Details (v1.5.0)
  • Final Data Combiner: Merge Google Places with polished taxpayer data (v1.5.0)
  • New Places API v1: Migrated to places.googleapis.com/v1 with header-based auth (v1.5.1)

Data Sources

  • Franchise Tax Permit Holders
  • Sales Tax Permit Holders
  • Mixed Beverage Tax Permit Holders
  • Tax Registration Data
  • Detailed Taxpayer Information
  • FTAS (Franchise Tax Account Status) Records

Requirements

System Requirements

  • Python 3.8+
  • NVIDIA GPU with CUDA support (optional, for GPU acceleration)
  • CUDA Toolkit 11.8 or newer (12.x and 13.x supported)
  • cuDNN 8.9.x or later
  • 8GB+ RAM (16GB recommended)
  • 10GB+ free disk space

API Keys (Free)

  1. Socrata API Token (optional but recommended)

  2. Texas Comptroller API Key (required for full access)

Installation

1. Clone the Repository

git clone https://github.com/chanderbhanswami/texas-data-scraper.git
cd texas-data-scraper

2. Create Virtual Environment

python -m venv venv

# Windows
venv\Scripts\activate

# Linux/Mac
source venv/bin/activate

3. Install Dependencies

For CPU-Only Installation:

pip install -r requirements.txt

For GPU-Accelerated Installation:

# First, ensure CUDA Toolkit and cuDNN are installed
# Download from: https://developer.nvidia.com/cuda-downloads
# and: https://developer.nvidia.com/cudnn

# Then install GPU requirements
pip install -r requirements-gpu.txt

4. Configure Environment Variables

# Copy the example environment file
cp config/.env.example .env

# Edit .env and add your API keys
nano .env  # or use your preferred editor

Required Environment Variables:

# Socrata API Token (optional but recommended)
SOCRATA_APP_TOKEN=your_token_here

# Comptroller API Key (required)
COMPTROLLER_API_KEY=your_key_here

# GPU Settings (if using GPU)
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240
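
Under the hood, config/settings.py presumably loads these values via python-dotenv; a minimal sketch of the pattern (the toolkit's exact variable handling may differ):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

COMPTROLLER_API_KEY = os.environ["COMPTROLLER_API_KEY"]    # required
SOCRATA_APP_TOKEN = os.getenv("SOCRATA_APP_TOKEN")         # optional
USE_GPU = os.getenv("USE_GPU", "false").lower() == "true"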

5. Verify Installation

# Test API endpoints
python scripts/api_tester.py

# Check GPU availability (if using GPU)
python -c "import cupy; print('GPU Available:', cupy.cuda.is_available())"

Usage Guide

1. Socrata Data Scraper

Download data from Texas Open Data Portal:

python scripts/socrata_scraper.py

Interactive Menu Options:

  • Download full datasets (Franchise Tax, Sales Tax, etc.)
  • Download with custom record limits
  • Search by business name, city, ZIP code, agent name, etc.
  • View dataset metadata
  • Export data in multiple formats

Example Workflow:

  1. Select "1" for Franchise Tax (full dataset)
  2. Wait for download to complete
  3. Choose "Yes" to export data
  4. Files saved to exports/socrata/ in JSON, CSV, and Excel formats

2. Comptroller Data Scraper

Fetch detailed taxpayer information:

python scripts/comptroller_scraper.py

Features:

  • Auto-detect Socrata export files
  • Batch process taxpayer IDs
  • Single taxpayer lookup (terminal-only display)
  • Async processing for faster results
  • Combined details + FTAS records

Example Workflow:

  1. First, run Socrata scraper to get taxpayer IDs
  2. Select "1" for auto-detect Socrata files
  3. Choose the most recent export
  4. Select async processing method
  5. Wait for batch processing
  6. Export enriched data

3. Data Combiner

Merge Socrata and Comptroller data intelligently:

python scripts/data_combiner.py

Features:

  • Smart field merging with priority (Comptroller > Socrata); see the sketch after this list
  • Automatic conflict resolution
  • Support for JSON, CSV, and Excel
  • Auto-detect latest exports
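
As a rough illustration of the merge priority, here is a minimal sketch, assuming records are plain dicts keyed by taxpayer_id (the actual combiner in src/processors/data_combiner.py is more involved):

def merge_records(socrata_rec: dict, comptroller_rec: dict) -> dict:
    """Start from the Socrata fields, then let any non-empty
    Comptroller field override them (Comptroller > Socrata)."""
    merged = dict(socrata_rec)
    for field, value in comptroller_rec.items():
        if value not in (None, ""):
            merged[field] = value
    return merged

def combine(socrata_records, comptroller_records, key="taxpayer_id"):
    # Index Comptroller records by taxpayer ID for O(1) lookups.
    by_id = {r[key]: r for r in comptroller_records}
    return [merge_records(r, by_id.get(r.get(key), {})) for r in socrata_records]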

Example Workflow:

  1. Select "4" for auto-detect and combine
  2. Confirm the detected files
  3. View combination statistics
  4. Export combined data

4. Deduplicator

Remove duplicate records and polish data:

python scripts/deduplicator.py

Deduplication Strategies:

  • taxpayer_id: Remove duplicates by taxpayer ID (fastest)
  • exact: Remove exact duplicate records
  • fuzzy: Fuzzy matching on key fields

Advanced Options:

  • Deduplicate with merge (combines duplicate records)
  • Deduplicate by confidence (keeps most complete record)
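
For intuition, the taxpayer_id and exact strategies map naturally onto pandas; a minimal sketch, assuming records are loaded into a DataFrame with a taxpayer_id column (fuzzy matching would need a string-similarity library and is omitted). The taxpayer_id branch also shows the "keep most complete record" idea behind confidence-based deduplication:

import pandas as pd

def deduplicate(df: pd.DataFrame, strategy: str = "taxpayer_id") -> pd.DataFrame:
    if strategy == "exact":
        return df.drop_duplicates()  # rows identical across all columns
    if strategy == "taxpayer_id":
        # Keep the most complete record per ID: score rows by non-null
        # field count, then let drop_duplicates keep the richest row.
        return (df.assign(_score=df.notna().sum(axis=1))
                  .sort_values("_score", ascending=False)
                  .drop_duplicates(subset=["taxpayer_id"])
                  .drop(columns="_score"))
    raise ValueError(f"unsupported strategy: {strategy}")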

Example Workflow:

  1. Select "4" to deduplicate all combined exports
  2. Review deduplication statistics
  3. Files saved to exports/deduplicated/

5. Outlet Data Enricher (v1.4.0)

Enrich deduplicated data with outlet information from duplicate records:

python scripts/outlet_enricher.py

Features:

  • Extract outlet fields from duplicate Socrata records
  • Enrich deduplicated data with outlet info
  • GPU acceleration support
  • Handles multiple outlets per taxpayer

Outlet Fields Extracted:

  • outlet_number, outlet_name, outlet_address
  • outlet_city, outlet_state, outlet_zip_code
  • outlet_county_code, outlet_naics_code
  • outlet_permit_issue_date, outlet_first_sales_date
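
Conceptually, the enricher groups raw Socrata rows by taxpayer ID and collects the outlet columns from every duplicate row. A minimal sketch of that idea, assuming plain dict records (the real implementation in src/processors/outlet_enricher.py may differ):

OUTLET_FIELDS = [
    "outlet_number", "outlet_name", "outlet_address",
    "outlet_city", "outlet_state", "outlet_zip_code",
    "outlet_county_code", "outlet_naics_code",
    "outlet_permit_issue_date", "outlet_first_sales_date",
]

def collect_outlets(socrata_rows):
    """Group duplicate rows by taxpayer ID; each duplicate row
    contributes one outlet dict, so N permit rows yield N outlets."""
    outlets = {}
    for row in socrata_rows:
        outlet = {f: row[f] for f in OUTLET_FIELDS if row.get(f)}
        if outlet:
            outlets.setdefault(row["taxpayer_id"], []).append(outlet)
    return outlets  # taxpayer_id -> list of outlet dicts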

Example Workflow:

  1. Select "1" for Auto-Enrich
  2. Choose Socrata source file
  3. Choose Deduplicated file
  4. Files saved to exports/polished/

6. Google Places Scraper (v1.5.0)

Enrich data with Google Places business information:

python scripts/google_places_scraper.py

Two-Step Workflow:

  1. Step 1: Find Place IDs - Search Google Places using business name + address
  2. Step 2: Get Details - Fetch full business info using Place IDs
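
For orientation, the two steps can be sketched against the new v1 endpoints (places.googleapis.com/v1 with header-based auth, per v1.5.1). This is a standalone illustration, not the toolkit's client in src/api/google_places_client.py; note that the v1 API uses camelCase field names in its field masks:

import requests

API_KEY = "your_google_api_key"  # in practice, read from .env
HEADERS = {"X-Goog-Api-Key": API_KEY}

def find_place_id(name, address):
    # Step 1: text search; the field mask limits the response to place IDs.
    resp = requests.post(
        "https://places.googleapis.com/v1/places:searchText",
        json={"textQuery": f"{name}, {address}"},
        headers={**HEADERS, "X-Goog-FieldMask": "places.id"},
    )
    places = resp.json().get("places", [])
    return places[0]["id"] if places else None

def get_place_details(place_id):
    # Step 2: fetch only the needed fields for one place.
    mask = "nationalPhoneNumber,websiteUri,rating,userRatingCount,businessStatus"
    resp = requests.get(
        f"https://places.googleapis.com/v1/places/{place_id}",
        headers={**HEADERS, "X-Goog-FieldMask": mask},
    )
    return resp.json()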

Fields Extracted:

  • formatted_phone_number, international_phone_number
  • website, url (Google Maps link)
  • rating, user_ratings_total
  • business_status, types (categories)
  • opening_hours, geometry (lat/lng)
  • reviews, photos

Example Workflow:

  1. Select "1" for Auto-Find Place IDs (from polished data)
  2. Export to exports/place_ids/
  3. Continue to get details
  4. Export to exports/places_details/
  5. Use Data Combiner option 13 to merge with polished data
  6. Final output in exports/final/

7. API Endpoint Tester

Test all API endpoints:

python scripts/api_tester.py

Tests:

  • Socrata API connection and token validation
  • Comptroller API connection and key validation
  • Dataset access and search functionality
  • Pagination and metadata retrieval
  • Error handling

Project Structure

texas-data-scraper/
│
├── .cache/                           # Cache directory
│   ├── progress/                     # Progress checkpoints for resume
│   ├── comptroller/                  # Comptroller API response cache (v1.4.0)
│   └── google_places/                # Google Places API cache (v1.5.0)
│       ├── place_ids/                # Cached place ID lookups
│       └── details/                  # Cached place details
│
├── config/
│   ├── __init__.py                   # Config package initialization
│   ├── settings.py                   # Configuration management
│   └── .env.example                  # Environment variables template
│
├── docs/
│   ├── ABSOLUTELY_FINAL_SUMMARY.md   # Final project summary
│   ├── DEPLOYMENT_GUIDE.md           # Deployment instructions
│   ├── FINAL_COMPLETE_CHECKLIST.md   # Complete feature checklist
│   ├── INSTALLATION_CHECKLIST.md     # Installation guide
│   └── QUICK_START.md                # Quick start guide
│
├── exports/                          # Output directory for exported data
│   ├── combined/                     # Combined data exports
│   ├── comptroller/                  # Comptroller data exports
│   ├── deduplicated/                 # Deduplicated data exports
│   ├── polished/                     # Outlet-enriched data exports (v1.4.0)
│   ├── place_ids/                    # Google Place IDs exports (v1.5.0)
│   ├── places_details/               # Google Places details exports (v1.5.0)
│   ├── final/                        # Final combined data exports (v1.5.0)
│   └── socrata/                      # Socrata data exports
│
├── logs/                             # Log files directory
│
├── scripts/
│   ├── api_tester.py                 # API endpoint testing
│   ├── batch_processor.py            # Batch processing CLI
│   ├── comptroller_scraper.py        # Main Comptroller scraper CLI
│   ├── data_combiner.py              # Data combination CLI
│   ├── deduplicator.py               # Deduplication CLI
│   ├── google_places_scraper.py      # Google Places API CLI (v1.5.0)
│   ├── outlet_enricher.py            # Outlet data enrichment CLI (v1.4.0)
│   └── socrata_scraper.py            # Main Socrata scraper CLI
│
├── src/
│   ├── __init__.py                   # Source package initialization
│   │
│   ├── api/
│   │   ├── __init__.py               # API package initialization
│   │   ├── comptroller_client.py     # Comptroller API client
│   │   ├── google_places_client.py   # Google Places API client (v1.5.0)
│   │   ├── rate_limiter.py           # Rate limiting logic
│   │   └── socrata_client.py         # Socrata API client
│   │
│   ├── exporters/
│   │   ├── __init__.py               # Exporters package initialization
│   │   └── file_exporter.py          # Export to JSON/CSV/Excel
│   │
│   ├── processors/
│   │   ├── __init__.py               # Processors package initialization
│   │   ├── data_combiner.py          # Combine Socrata + Comptroller data
│   │   ├── data_validator.py         # Data validation
│   │   ├── deduplicator.py           # Remove duplicates
│   │   └── outlet_enricher.py        # Outlet data enrichment (v1.4.0)
│   │
│   ├── scrapers/
│   │   ├── __init__.py               # Scrapers package initialization
│   │   ├── comptroller_scraper.py    # Comptroller data scraper
│   │   ├── google_places_scraper.py  # Google Places scraper (v1.5.0)
│   │   ├── gpu_accelerator.py        # GPU acceleration utilities
│   │   └── socrata_scraper.py        # Socrata data scraper
│   │
│   └── utils/
│       ├── __init__.py               # Utils package initialization
│       ├── checksum.py               # File checksum verification
│       ├── helpers.py                # Helper functions
│       ├── logger.py                 # Logging utilities
│       ├── menu.py                   # Interactive CLI menu
│       └── progress_manager.py       # Progress persistence for downloads
│
├── tests/
│   ├── __init__.py                   # Tests package initialization
│   ├── test_comptroller_api.py       # Comptroller API tests
│   ├── test_google_places_api.py     # Google Places API tests
│   ├── test_integration.py           # Integration tests
│   ├── test_processors.py            # Processor tests
│   ├── test_scrapers.py              # Scraper tests
│   └── test_socrata_api.py           # Socrata API tests
│
├── .env                              # Environment variables (gitignored)
├── .gitignore                        # Git ignore file
├── CHANGELOG.md                      # Project changelog
├── CONTRIBUTING.md                   # Contribution guidelines
├── LICENSE                           # Project license
├── Makefile                          # Make commands for automation
├── PROJECT_STRUCTURE.md              # This file - project structure docs
├── PROJECT_SUMMARY.md                # Detailed project summary
├── README.md                         # Main documentation
├── requirements.txt                  # Python dependencies
├── requirements-gpu.txt              # GPU-specific dependencies
├── run.py                            # Main entry point runner
├── setup.py                          # Package setup
└── setup_project.py                  # Project setup/initialization script

Complete Workflow Example

Full Pipeline: Socrata → Comptroller → Combine → Deduplicate

# Step 1: Download Socrata data
python scripts/socrata_scraper.py
# Select: 1 (Franchise Tax full dataset)
# Export: Yes

# Step 2: Enrich with Comptroller data
python scripts/comptroller_scraper.py
# Select: 1 (Auto-detect Socrata files)
# Choose: Latest export
# Method: 2 (Async)
# Export: Yes

# Step 3: Combine both datasets
python scripts/data_combiner.py
# Select: 4 (Auto-detect and combine)
# Export: Yes (all formats)

# Step 4: Remove duplicates
python scripts/deduplicator.py
# Select: 4 (Deduplicate all combined)
# Strategy: taxpayer_id

# Final data location: exports/deduplicated/

Configuration

Rate Limits

Socrata API:

  • Without token: 1,000 requests/hour
  • With token: 50,000 requests/hour

Comptroller API:

  • 100 requests/minute (with API key)
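
To illustrate the throttling idea behind src/api/rate_limiter.py, a minimal sliding-window limiter sized for the Comptroller budget above might look like this (a sketch, not the toolkit's actual implementation):

import time

class RateLimiter:
    """Sliding-window throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit, self.window, self.calls = limit, window, []

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            time.sleep(self.window - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RateLimiter(limit=100, window=60)  # Comptroller: 100 requests/minute
# call limiter.wait() before each API request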

GPU Settings

Optimize for your RTX 3060:

# .env file
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240  # 10GB for RTX 3060 (12GB total)
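
With CuPy, GPU_MEMORY_LIMIT can be honored by capping the default memory pool; a hedged sketch of that mapping (the toolkit's own wiring lives in src/scrapers/gpu_accelerator.py and may differ):

import os
import cupy as cp

# Map the .env settings onto CuPy: select the device, then cap the
# default memory pool at GPU_MEMORY_LIMIT megabytes.
cp.cuda.Device(int(os.getenv("GPU_DEVICE_ID", "0"))).use()
limit_mb = int(os.getenv("GPU_MEMORY_LIMIT", "10240"))
cp.get_default_memory_pool().set_limit(size=limit_mb * 1024**2)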

Batch Processing

BATCH_SIZE=100              # Records per batch
CONCURRENT_REQUESTS=5       # Simultaneous requests

Troubleshooting

Common Issues

1. GPU Not Detected

# Check CUDA installation
nvidia-smi

# Verify CuPy installation
python -c "import cupy; cupy.cuda.runtime.getDeviceProperties(0)"

# If issues persist, use CPU-only mode:
USE_GPU=false

2. Rate Limit Errors

  • Add Socrata API token to .env
  • Reduce CONCURRENT_REQUESTS in settings
  • Increase REQUEST_DELAY
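
If rate-limit errors persist, the v1.4.0 retry-with-backoff behavior follows the general pattern below; a minimal sketch, not the toolkit's exact code:

import time
import requests

def get_with_backoff(url, retries=5, base_delay=1.0, **kwargs):
    """Retry transient failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30, **kwargs)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))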

3. Memory Errors

  • Reduce BATCH_SIZE
  • Lower GPU_MEMORY_LIMIT
  • Process smaller datasets

4. Import Errors

# Reinstall dependencies
pip install --upgrade -r requirements.txt

# For GPU issues
pip install --force-reinstall -r requirements-gpu.txt

Output Formats

All exports include:

  • JSON: Human-readable, preserves data types
  • CSV: Excel-compatible (UTF-8 with BOM)
  • Excel: Formatted with headers, auto-sized columns
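
Exports can also be verified against the SHA-256 checksums introduced in v1.1.0 (backed by src/utils/checksum.py); recomputing a file's digest needs only the standard library, as in this sketch:

import hashlib
from pathlib import Path

def sha256_of(path):
    # Stream in 1MB chunks so large exports never load fully into RAM.
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("exports/socrata/socrata_franchise_tax_20251226_143052.json"))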

File Naming Convention

[source]_[dataset]_[timestamp].[ext]
Example: socrata_franchise_tax_20251226_143052.json
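
Generating names in this pattern is a one-liner with the standard library; a sketch (the helper name is illustrative, not the toolkit's code):

from datetime import datetime

def export_filename(source: str, dataset: str, ext: str) -> str:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{source}_{dataset}_{stamp}.{ext}"

print(export_filename("socrata", "franchise_tax", "json"))
# e.g. socrata_franchise_tax_20251226_143052.json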

Data Privacy & Security

  • API keys stored in .env (gitignored)
  • No data transmitted to third parties
  • All processing done locally
  • Logs exclude sensitive information

Performance Tips

  1. Use GPU acceleration for large datasets (10k+ records)
  2. Enable Socrata API token for faster downloads
  3. Use async processing in Comptroller scraper
  4. Batch process large taxpayer ID lists
  5. Clear GPU memory between large operations

Testing

Run the test suite:

# API endpoint tests
python scripts/api_tester.py

# Unit tests (if implemented)
pytest tests/

# Integration tests
pytest tests/test_integration.py

Logging

Logs are saved to logs/ directory:

  • texas_scraper_YYYY-MM-DD.log - All operations
  • errors_YYYY-MM-DD.log - Errors only
  • Automatic rotation at 100MB
  • Compressed archives retained
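
One common way to get size-based rotation with compressed archives is loguru, shown here as an illustration; the toolkit's own configuration in src/utils/logger.py may differ:

from loguru import logger  # pip install loguru

logger.add(
    "logs/texas_scraper_{time:YYYY-MM-DD}.log",
    rotation="100 MB",    # start a new file once the log reaches 100MB
    compression="zip",    # compress rotated archives
)
logger.add("logs/errors_{time:YYYY-MM-DD}.log", level="ERROR", rotation="100 MB")
logger.info("scraper started")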

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

This project is distributed under the terms of the LICENSE file in the repository root.

Support

For issues or questions:

  • Check troubleshooting section
  • Review logs in logs/ directory
  • Check API status pages
  • Create an issue on GitHub

Screenshots & Demo


Socrata Scraper Menu

╔══════════════════════════════════════════════════════════════╗
║           TEXAS DATA SCRAPER - SOCRATA MENU                  ║
╠══════════════════════════════════════════════════════════════╣
║  1. Download Franchise Tax (Full Dataset)                    ║
║  2. Download Sales Tax (Full Dataset)                        ║
║  3. Download Mixed Beverage Tax (Full Dataset)               ║
║  4. Search by Business Name                                  ║
║  5. Search by City                                           ║
║  ...                                                         ║
╚══════════════════════════════════════════════════════════════╝

Data Processing Pipeline

Socrata API → Raw Data → Comptroller Enrichment → Merge → Deduplicate → Export
     ↓            ↓              ↓                  ↓         ↓           ↓
  50k+ records  JSON/CSV     +FTAS data         Combined   Cleaned    JSON/CSV/Excel

Roadmap

See our project roadmap for upcoming features.

Phase 1.0: Core Data Pipeline ✅

  • Socrata Open Data Portal integration
  • Texas Comptroller API integration
  • GPU acceleration with CUDA
  • Multi-format export (JSON, CSV, Excel)
  • Data deduplication
  • Interactive CLI menus

Phase 1.1: Resilience & Reliability (v1.1.0) ✅

  • Progress persistence (resume interrupted downloads)
  • Export checksum verification (SHA-256)
  • Data validation and quality reports
  • GPU-accelerated merging and deduplication
  • Scraper wrappers with scrape_with_progress()

Phase 1.2: Smart Data Handling (v1.2.0) ✅

  • Smart field detection (case-insensitive ID matching)
  • Semantic field normalization (zipcode → zip_code)
  • Global auto-deduplication (skips already-scraped records)
  • Append-to-existing export mode
  • Cross-dataset deduplication

Phase 1.3: Bulk Operations & Master Combine (v1.3.0) ✅

  • Process ALL Socrata files through Comptroller at once
  • Separate Comptroller files per dataset (source traceability)
  • Master Combine All (full pipeline automation)
  • 9 Manual Combine Options (granular control)
  • Smart format detection (JSON-only for bulk)

Phase 1.4: Outlet Enrichment & Resilience (v1.4.0) ✅

  • Outlet Data Enricher (extract outlets from duplicates)
  • Persistent disk caching (survives restarts)
  • Network retry with exponential backoff
  • Configurable Comptroller API settings
  • New exports/polished/ directory

Phase 2: Business Enrichment (v1.5.0) ✅

  • Google Places API integration
    • Business phone numbers
    • Business websites
    • Ratings and reviews
    • Operating hours
    • Business status
  • Clearbit API integration (Planned)
    • Company emails
    • Social media profiles
    • Company logo and branding
    • Industry classification

Phase 3: Advanced Features (Planned)

  • Web dashboard interface
  • Scheduled automatic scraping
  • Email notifications
  • Cloud deployment support
  • API rate limit analytics
  • Data visualization exports
  • Unified company profile generation

FAQ

Q: Do I need an NVIDIA GPU to use this tool?

No! GPU acceleration is optional. The toolkit automatically falls back to CPU processing if no GPU is detected. GPU acceleration is recommended for datasets with 10,000+ records.

Q: Are the API keys free?

Yes! Both the Socrata API token and Texas Comptroller API key are free. The Socrata token increases your rate limit from 1,000 to 50,000 requests per hour.

Q: What data can I access?

You can access public Texas government data including Franchise Tax Permit Holders, Sales Tax Permit Holders, Mixed Beverage Tax Permit Holders, and detailed taxpayer information through the Comptroller API.

Q: Is the data up to date?

The data is fetched directly from official Texas government APIs in real-time, ensuring you always get the most current publicly available information.

Q: Can I use this for commercial purposes?

Please review the LICENSE file and the terms of use for the Texas Open Data Portal and Comptroller API for commercial use guidelines.


If you find this project useful, please consider giving it a ⭐!


Contact & Social Media


Project Maintainer: Chanderbhanswami

📧 Email: chanderbhanswami@gmail.com

🐦 X: Chanderbhanswa7


Citation

If you use this tool in your research or project, please cite it as:

@software{texas_data_scraper,
  author = {Chanderbhan Swami},
  title = {Texas Government Data Scraper Toolkit},
  year = {2025},
  url = {https://github.com/chanderbhanswami/texas-data-scraper}
}

Made with ❤️ by Chanderbhan Swami for data enthusiasts and researchers

Happy Scraping, Y'all!
