Texas Government Data Scraper Toolkit


A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs and the Google Places API

Features • Installation • Usage • Docs • Contributing • Support


A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs, including the Socrata Open Data Portal and the Texas Comptroller API. Features GPU acceleration (CUDA/cuDNN), intelligent data merging, automated deduplication, and precise business information via the Google Places API.

Features

Core Capabilities

  • Dual API Support: Socrata Open Data Portal + Texas Comptroller API
  • GPU Acceleration: CUDA/cuDNN optimized for NVIDIA RTX 3060
  • Interactive CLI: User-friendly menus for all operations
  • Multi-Format Export: JSON, CSV, and Excel with automatic formatting
  • Smart Data Merging: Intelligent field prioritization and conflict resolution
  • Advanced Deduplication: Multiple strategies with merge capabilities
  • Rate Limiting: Intelligent throttling with automatic token management
  • Comprehensive Logging: Detailed logs with rotation and compression
  • Progress Persistence: Resume interrupted downloads from checkpoints (v1.1.0)
  • Export Verification: SHA-256 checksums for data integrity (v1.1.0)
  • Smart Field Detection: Case-insensitive matching with 20+ field variations (v1.2.0)
  • Global Auto-Deduplication: Automatically skips already-scraped records (v1.2.0)
  • Append-to-Existing Exports: Single consolidated file per dataset (v1.2.0)
  • Process ALL Socrata Files: Bulk process all datasets through Comptroller (v1.3.0)
  • Separate Comptroller Files: Source-specific filenames per dataset (v1.3.0)
  • Master Combine All: Full pipeline merge of all Socrata + Comptroller data (v1.3.0)
  • 9 Manual Combine Options: Granular control over file merging (v1.3.0)
  • Outlet Data Enricher: Extract outlet fields from duplicate Socrata records (v1.4.0)
  • Persistent Disk Caching: Comptroller cache survives restarts, making long runs truly resumable (v1.4.0)
  • Network Retry with Backoff: Automatic recovery from internet outages (v1.4.0)
  • Configurable Comptroller Settings: Fine-tune concurrent requests, chunk size, delays (v1.4.0)
  • Google Places API Integration: Get phone, website, ratings, hours from Google (v1.5.0)
  • Two-Step Places Workflow: Find Place IDs → Get Place Details (v1.5.0)
  • Final Data Combiner: Merge Google Places with polished taxpayer data (v1.5.0)
  • New Places API v1: Migrated to places.googleapis.com/v1 with header-based auth (v1.5.1)

Data Sources

  • Franchise Tax Permit Holders
  • Sales Tax Permit Holders
  • Mixed Beverage Tax Permit Holders
  • Tax Registration Data
  • Detailed Taxpayer Information
  • FTAS (Franchise Tax Account Status) Records

Requirements

System Requirements

  • Python 3.8+
  • NVIDIA GPU with CUDA support (optional, for GPU acceleration)
  • CUDA Toolkit 11.8 or newer (12.x and 13.x supported)
  • cuDNN 8.9.x or later
  • 8GB+ RAM (16GB recommended)
  • 10GB+ free disk space

API Keys (Free)

  1. Socrata API Token (optional but recommended)

  2. Texas Comptroller API Key (required for full access)

Installation

1. Clone the Repository

git clone https://github.com/chanderbhanswami/texas-data-scraper.git
cd texas-data-scraper

2. Create Virtual Environment

python -m venv venv

# Windows
venv\Scripts\activate

# Linux/Mac
source venv/bin/activate

3. Install Dependencies

For CPU-Only Installation:

pip install -r requirements.txt

For GPU-Accelerated Installation:

# First, ensure CUDA Toolkit and cuDNN are installed
# Download from: https://developer.nvidia.com/cuda-downloads
# and: https://developer.nvidia.com/cudnn

# Then install GPU requirements
pip install -r requirements-gpu.txt

4. Configure Environment Variables

# Copy the example environment file
cp config/.env.example .env

# Edit .env and add your API keys
nano .env  # or use your preferred editor

Required Environment Variables:

# Socrata API Token (optional but recommended)
SOCRATA_APP_TOKEN=your_token_here

# Comptroller API Key (required)
COMPTROLLER_API_KEY=your_key_here

# GPU Settings (if using GPU)
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240
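
Under the hood, config/settings.py presumably loads these values via python-dotenv; a minimal sketch of the pattern (the toolkit's exact variable handling may differ):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

COMPTROLLER_API_KEY = os.environ["COMPTROLLER_API_KEY"]    # required
SOCRATA_APP_TOKEN = os.getenv("SOCRATA_APP_TOKEN")         # optional
USE_GPU = os.getenv("USE_GPU", "false").lower() == "true"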

5. Verify Installation

# Test API endpoints
python scripts/api_tester.py

# Check GPU availability (if using GPU)
python -c "import cupy; print('GPU Available:', cupy.cuda.is_available())"

Usage Guide

1. Socrata Data Scraper

Download data from Texas Open Data Portal:

python scripts/socrata_scraper.py

Interactive Menu Options:

  • Download full datasets (Franchise Tax, Sales Tax, etc.)
  • Download with custom record limits
  • Search by business name, city, ZIP code, agent name, etc.
  • View dataset metadata
  • Export data in multiple formats

Example Workflow:

  1. Select "1" for Franchise Tax (full dataset)
  2. Wait for download to complete
  3. Choose "Yes" to export data
  4. Files saved to exports/socrata/ in JSON, CSV, and Excel formats

2. Comptroller Data Scraper

Fetch detailed taxpayer information:

python scripts/comptroller_scraper.py

Features:

  • Auto-detect Socrata export files
  • Batch process taxpayer IDs
  • Single taxpayer lookup (terminal-only display)
  • Async processing for faster results
  • Combined details + FTAS records

Example Workflow:

  1. First, run Socrata scraper to get taxpayer IDs
  2. Select "1" for auto-detect Socrata files
  3. Choose the most recent export
  4. Select async processing method
  5. Wait for batch processing
  6. Export enriched data

3. Data Combiner

Merge Socrata and Comptroller data intelligently:

python scripts/data_combiner.py

Features:

  • Smart field merging with priority (Comptroller > Socrata); see the sketch after this list
  • Automatic conflict resolution
  • Support for JSON, CSV, and Excel
  • Auto-detect latest exports
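
As a rough illustration of the merge priority, here is a minimal sketch, assuming records are plain dicts keyed by taxpayer_id (the actual combiner in src/processors/data_combiner.py is more involved):

def merge_records(socrata_rec: dict, comptroller_rec: dict) -> dict:
    """Start from the Socrata fields, then let any non-empty
    Comptroller field override them (Comptroller > Socrata)."""
    merged = dict(socrata_rec)
    for field, value in comptroller_rec.items():
        if value not in (None, ""):
            merged[field] = value
    return merged

def combine(socrata_records, comptroller_records, key="taxpayer_id"):
    # Index Comptroller records by taxpayer ID for O(1) lookups.
    by_id = {r[key]: r for r in comptroller_records}
    return [merge_records(r, by_id.get(r.get(key), {})) for r in socrata_records]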

Example Workflow:

  1. Select "4" for auto-detect and combine
  2. Confirm the detected files
  3. View combination statistics
  4. Export combined data

4. Deduplicator

Remove duplicate records and polish data:

python scripts/deduplicator.py

Deduplication Strategies:

  • taxpayer_id: Remove duplicates by taxpayer ID (fastest)
  • exact: Remove exact duplicate records
  • fuzzy: Fuzzy matching on key fields

Advanced Options:

  • Deduplicate with merge (combines duplicate records)
  • Deduplicate by confidence (keeps most complete record)
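
For intuition, the taxpayer_id and exact strategies map naturally onto pandas; a minimal sketch, assuming records are loaded into a DataFrame with a taxpayer_id column (fuzzy matching would need a string-similarity library and is omitted). The taxpayer_id branch also shows the "keep most complete record" idea behind confidence-based deduplication:

import pandas as pd

def deduplicate(df: pd.DataFrame, strategy: str = "taxpayer_id") -> pd.DataFrame:
    if strategy == "exact":
        return df.drop_duplicates()  # rows identical across all columns
    if strategy == "taxpayer_id":
        # Keep the most complete record per ID: score rows by non-null
        # field count, then let drop_duplicates keep the richest row.
        return (df.assign(_score=df.notna().sum(axis=1))
                  .sort_values("_score", ascending=False)
                  .drop_duplicates(subset=["taxpayer_id"])
                  .drop(columns="_score"))
    raise ValueError(f"unsupported strategy: {strategy}")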

Example Workflow:

  1. Select "4" to deduplicate all combined exports
  2. Review deduplication statistics
  3. Files saved to exports/deduplicated/

5. Outlet Data Enricher (v1.4.0)

Enrich deduplicated data with outlet information from duplicate records:

python scripts/outlet_enricher.py

Features:

  • Extract outlet fields from duplicate Socrata records
  • Enrich deduplicated data with outlet info
  • GPU acceleration support
  • Handles multiple outlets per taxpayer

Outlet Fields Extracted:

  • outlet_number, outlet_name, outlet_address
  • outlet_city, outlet_state, outlet_zip_code
  • outlet_county_code, outlet_naics_code
  • outlet_permit_issue_date, outlet_first_sales_date
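
Conceptually, the enricher groups raw Socrata rows by taxpayer ID and collects the outlet columns from every duplicate row. A minimal sketch of that idea, assuming plain dict records (the real implementation in src/processors/outlet_enricher.py may differ):

OUTLET_FIELDS = [
    "outlet_number", "outlet_name", "outlet_address",
    "outlet_city", "outlet_state", "outlet_zip_code",
    "outlet_county_code", "outlet_naics_code",
    "outlet_permit_issue_date", "outlet_first_sales_date",
]

def collect_outlets(socrata_rows):
    """Group duplicate rows by taxpayer ID; each duplicate row
    contributes one outlet dict, so N permit rows yield N outlets."""
    outlets = {}
    for row in socrata_rows:
        outlet = {f: row[f] for f in OUTLET_FIELDS if row.get(f)}
        if outlet:
            outlets.setdefault(row["taxpayer_id"], []).append(outlet)
    return outlets  # taxpayer_id -> list of outlet dicts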

Example Workflow:

  1. Select "1" for Auto-Enrich
  2. Choose Socrata source file
  3. Choose Deduplicated file
  4. Files saved to exports/polished/

6. Google Places Scraper (v1.5.0)

Enrich data with Google Places business information:

python scripts/google_places_scraper.py

Two-Step Workflow:

  1. Step 1: Find Place IDs - Search Google Places using business name + address
  2. Step 2: Get Details - Fetch full business info using Place IDs
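
For orientation, the two steps can be sketched against the new v1 endpoints (places.googleapis.com/v1 with header-based auth, per v1.5.1). This is a standalone illustration, not the toolkit's client in src/api/google_places_client.py; note that the v1 API uses camelCase field names in its field masks:

import requests

API_KEY = "your_google_api_key"  # in practice, read from .env
HEADERS = {"X-Goog-Api-Key": API_KEY}

def find_place_id(name, address):
    # Step 1: text search; the field mask limits the response to place IDs.
    resp = requests.post(
        "https://places.googleapis.com/v1/places:searchText",
        json={"textQuery": f"{name}, {address}"},
        headers={**HEADERS, "X-Goog-FieldMask": "places.id"},
    )
    places = resp.json().get("places", [])
    return places[0]["id"] if places else None

def get_place_details(place_id):
    # Step 2: fetch only the needed fields for one place.
    mask = "nationalPhoneNumber,websiteUri,rating,userRatingCount,businessStatus"
    resp = requests.get(
        f"https://places.googleapis.com/v1/places/{place_id}",
        headers={**HEADERS, "X-Goog-FieldMask": mask},
    )
    return resp.json()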

Fields Extracted:

  • formatted_phone_number, international_phone_number
  • website, url (Google Maps link)
  • rating, user_ratings_total
  • business_status, types (categories)
  • opening_hours, geometry (lat/lng)
  • reviews, photos

Example Workflow:

  1. Select "1" for Auto-Find Place IDs (from polished data)
  2. Export to exports/place_ids/
  3. Continue to get details
  4. Export to exports/places_details/
  5. Use Data Combiner option 13 to merge with polished data
  6. Final output in exports/final/

7. API Endpoint Tester

Test all API endpoints:

python scripts/api_tester.py

Tests:

  • Socrata API connection and token validation
  • Comptroller API connection and key validation
  • Dataset access and search functionality
  • Pagination and metadata retrieval
  • Error handling

Project Structure

texas-data-scraper/
│
├── .cache/                           # Cache directory
│   ├── progress/                     # Progress checkpoints for resume
│   ├── comptroller/                  # Comptroller API response cache (v1.4.0)
│   └── google_places/                # Google Places API cache (v1.5.0)
│       ├── place_ids/                # Cached place ID lookups
│       └── details/                  # Cached place details
│
├── config/
│   ├── __init__.py                   # Config package initialization
│   ├── settings.py                   # Configuration management
│   └── .env.example                  # Environment variables template
│
├── docs/
│   ├── ABSOLUTELY_FINAL_SUMMARY.md   # Final project summary
│   ├── DEPLOYMENT_GUIDE.md           # Deployment instructions
│   ├── FINAL_COMPLETE_CHECKLIST.md   # Complete feature checklist
│   ├── INSTALLATION_CHECKLIST.md     # Installation guide
│   └── QUICK_START.md                # Quick start guide
│
├── exports/                          # Output directory for exported data
│   ├── combined/                     # Combined data exports
│   ├── comptroller/                  # Comptroller data exports
│   ├── deduplicated/                 # Deduplicated data exports
│   ├── polished/                     # Outlet-enriched data exports (v1.4.0)
│   ├── place_ids/                    # Google Place IDs exports (v1.5.0)
│   ├── places_details/               # Google Places details exports (v1.5.0)
│   ├── final/                        # Final combined data exports (v1.5.0)
│   └── socrata/                      # Socrata data exports
│
├── logs/                             # Log files directory
│
├── scripts/
│   ├── api_tester.py                 # API endpoint testing
│   ├── batch_processor.py            # Batch processing CLI
│   ├── comptroller_scraper.py        # Main Comptroller scraper CLI
│   ├── data_combiner.py              # Data combination CLI
│   ├── deduplicator.py               # Deduplication CLI
│   ├── google_places_scraper.py      # Google Places API CLI (v1.5.0)
│   ├── outlet_enricher.py            # Outlet data enrichment CLI (v1.4.0)
│   └── socrata_scraper.py            # Main Socrata scraper CLI
│
├── src/
│   ├── __init__.py                   # Source package initialization
│   │
│   ├── api/
│   │   ├── __init__.py               # API package initialization
│   │   ├── comptroller_client.py     # Comptroller API client
│   │   ├── google_places_client.py   # Google Places API client (v1.5.0)
│   │   ├── rate_limiter.py           # Rate limiting logic
│   │   └── socrata_client.py         # Socrata API client
│   │
│   ├── exporters/
│   │   ├── __init__.py               # Exporters package initialization
│   │   └── file_exporter.py          # Export to JSON/CSV/Excel
│   │
│   ├── processors/
│   │   ├── __init__.py               # Processors package initialization
│   │   ├── data_combiner.py          # Combine Socrata + Comptroller data
│   │   ├── data_validator.py         # Data validation
│   │   ├── deduplicator.py           # Remove duplicates
│   │   └── outlet_enricher.py        # Outlet data enrichment (v1.4.0)
│   │
│   ├── scrapers/
│   │   ├── __init__.py               # Scrapers package initialization
│   │   ├── comptroller_scraper.py    # Comptroller data scraper
│   │   ├── google_places_scraper.py  # Google Places scraper (v1.5.0)
│   │   ├── gpu_accelerator.py        # GPU acceleration utilities
│   │   └── socrata_scraper.py        # Socrata data scraper
│   │
│   └── utils/
│       ├── __init__.py               # Utils package initialization
│       ├── checksum.py               # File checksum verification
│       ├── helpers.py                # Helper functions
│       ├── logger.py                 # Logging utilities
│       ├── menu.py                   # Interactive CLI menu
│       └── progress_manager.py       # Progress persistence for downloads
│
├── tests/
│   ├── __init__.py                   # Tests package initialization
│   ├── test_comptroller_api.py       # Comptroller API tests
│   ├── test_google_places_api.py     # Google Places API tests
│   ├── test_integration.py           # Integration tests
│   ├── test_processors.py            # Processor tests
│   ├── test_scrapers.py              # Scraper tests
│   └── test_socrata_api.py           # Socrata API tests
│
├── .env                              # Environment variables (gitignored)
├── .gitignore                        # Git ignore file
├── CHANGELOG.md                      # Project changelog
├── CONTRIBUTING.md                   # Contribution guidelines
├── LICENSE                           # Project license
├── Makefile                          # Make commands for automation
├── PROJECT_STRUCTURE.md              # This file - project structure docs
├── PROJECT_SUMMARY.md                # Detailed project summary
├── README.md                         # Main documentation
├── requirements.txt                  # Python dependencies
├── requirements-gpu.txt              # GPU-specific dependencies
├── run.py                            # Main entry point runner
├── setup.py                          # Package setup
└── setup_project.py                  # Project setup/initialization script

Complete Workflow Example

Full Pipeline: Socrata → Comptroller → Combine → Deduplicate

# Step 1: Download Socrata data
python scripts/socrata_scraper.py
# Select: 1 (Franchise Tax full dataset)
# Export: Yes

# Step 2: Enrich with Comptroller data
python scripts/comptroller_scraper.py
# Select: 1 (Auto-detect Socrata files)
# Choose: Latest export
# Method: 2 (Async)
# Export: Yes

# Step 3: Combine both datasets
python scripts/data_combiner.py
# Select: 4 (Auto-detect and combine)
# Export: Yes (all formats)

# Step 4: Remove duplicates
python scripts/deduplicator.py
# Select: 4 (Deduplicate all combined)
# Strategy: taxpayer_id

# Final data location: exports/deduplicated/

Configuration

Rate Limits

Socrata API:

  • Without token: 1,000 requests/hour
  • With token: 50,000 requests/hour

Comptroller API:

  • 100 requests/minute (with API key)
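
To illustrate the throttling idea behind src/api/rate_limiter.py, a minimal sliding-window limiter sized for the Comptroller budget above might look like this (a sketch, not the toolkit's actual implementation):

import time

class RateLimiter:
    """Sliding-window throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit, self.window, self.calls = limit, window, []

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            time.sleep(self.window - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RateLimiter(limit=100, window=60)  # Comptroller: 100 requests/minute
# call limiter.wait() before each API request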

GPU Settings

Optimize for your RTX 3060:

# .env file
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240  # 10GB for RTX 3060 (12GB total)
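
With CuPy, GPU_MEMORY_LIMIT can be honored by capping the default memory pool; a hedged sketch of that mapping (the toolkit's own wiring lives in src/scrapers/gpu_accelerator.py and may differ):

import os
import cupy as cp

# Map the .env settings onto CuPy: select the device, then cap the
# default memory pool at GPU_MEMORY_LIMIT megabytes.
cp.cuda.Device(int(os.getenv("GPU_DEVICE_ID", "0"))).use()
limit_mb = int(os.getenv("GPU_MEMORY_LIMIT", "10240"))
cp.get_default_memory_pool().set_limit(size=limit_mb * 1024**2)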

Batch Processing

BATCH_SIZE=100              # Records per batch
CONCURRENT_REQUESTS=5       # Simultaneous requests

Troubleshooting

Common Issues

1. GPU Not Detected

# Check CUDA installation
nvidia-smi

# Verify CuPy installation
python -c "import cupy; cupy.cuda.runtime.getDeviceProperties(0)"

# If issues persist, use CPU-only mode:
USE_GPU=false

2. Rate Limit Errors

  • Add Socrata API token to .env
  • Reduce CONCURRENT_REQUESTS in settings
  • Increase REQUEST_DELAY
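
If rate-limit errors persist, the v1.4.0 retry-with-backoff behavior follows the general pattern below; a minimal sketch, not the toolkit's exact code:

import time
import requests

def get_with_backoff(url, retries=5, base_delay=1.0, **kwargs):
    """Retry transient failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30, **kwargs)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))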

3. Memory Errors

  • Reduce BATCH_SIZE
  • Lower GPU_MEMORY_LIMIT
  • Process smaller datasets

4. Import Errors

# Reinstall dependencies
pip install --upgrade -r requirements.txt

# For GPU issues
pip install --force-reinstall -r requirements-gpu.txt

Output Formats

All exports include:

  • JSON: Human-readable, preserves data types
  • CSV: Excel-compatible (UTF-8 with BOM)
  • Excel: Formatted with headers, auto-sized columns
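
Exports can also be verified against the SHA-256 checksums introduced in v1.1.0 (backed by src/utils/checksum.py); recomputing a file's digest needs only the standard library, as in this sketch:

import hashlib
from pathlib import Path

def sha256_of(path):
    # Stream in 1MB chunks so large exports never load fully into RAM.
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("exports/socrata/socrata_franchise_tax_20251226_143052.json"))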

File Naming Convention

[source]_[dataset]_[timestamp].[ext]
Example: socrata_franchise_tax_20251226_143052.json
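
Generating names in this pattern is a one-liner with the standard library; a sketch (the helper name is illustrative, not the toolkit's code):

from datetime import datetime

def export_filename(source: str, dataset: str, ext: str) -> str:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{source}_{dataset}_{stamp}.{ext}"

print(export_filename("socrata", "franchise_tax", "json"))
# e.g. socrata_franchise_tax_20251226_143052.json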

Data Privacy & Security

  • API keys stored in .env (gitignored)
  • No data transmitted to third parties
  • All processing done locally
  • Logs exclude sensitive information

Performance Tips

  1. Use GPU acceleration for large datasets (10k+ records)
  2. Enable Socrata API token for faster downloads
  3. Use async processing in Comptroller scraper
  4. Batch process large taxpayer ID lists
  5. Clear GPU memory between large operations

Testing

Run the test suite:

# API endpoint tests
python scripts/api_tester.py

# Unit tests (if implemented)
pytest tests/

# Integration tests
pytest tests/test_integration.py

Logging

Logs are saved to logs/ directory:

  • texas_scraper_YYYY-MM-DD.log - All operations
  • errors_YYYY-MM-DD.log - Errors only
  • Automatic rotation at 100MB
  • Compressed archives retained
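
One common way to get size-based rotation with compressed archives is loguru, shown here as an illustration; the toolkit's own configuration in src/utils/logger.py may differ:

from loguru import logger  # pip install loguru

logger.add(
    "logs/texas_scraper_{time:YYYY-MM-DD}.log",
    rotation="100 MB",    # start a new file once the log reaches 100MB
    compression="zip",    # compress rotated archives
)
logger.add("logs/errors_{time:YYYY-MM-DD}.log", level="ERROR", rotation="100 MB")
logger.info("scraper started")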

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

This project is distributed under the terms of the LICENSE file in the repository root.

Support

For issues or questions:

  • Check troubleshooting section
  • Review logs in logs/ directory
  • Check API status pages
  • Create an issue on GitHub

Screenshots & Demo


Socrata Scraper Menu

╔══════════════════════════════════════════════════════════════╗
║           TEXAS DATA SCRAPER - SOCRATA MENU                  ║
╠══════════════════════════════════════════════════════════════╣
║  1. Download Franchise Tax (Full Dataset)                    ║
║  2. Download Sales Tax (Full Dataset)                        ║
║  3. Download Mixed Beverage Tax (Full Dataset)               ║
║  4. Search by Business Name                                  ║
║  5. Search by City                                           ║
║  ...                                                         ║
╚══════════════════════════════════════════════════════════════╝

Data Processing Pipeline

Socrata API → Raw Data → Comptroller Enrichment → Merge → Deduplicate → Export
     ↓            ↓              ↓                  ↓         ↓           ↓
  50k+ records  JSON/CSV     +FTAS data         Combined   Cleaned    JSON/CSV/Excel

Roadmap

See our project roadmap for upcoming features.

Phase 1.0: Core Data Pipeline ✅

  • Socrata Open Data Portal integration
  • Texas Comptroller API integration
  • GPU acceleration with CUDA
  • Multi-format export (JSON, CSV, Excel)
  • Data deduplication
  • Interactive CLI menus

Phase 1.1: Resilience & Reliability (v1.1.0) ✅

  • Progress persistence (resume interrupted downloads)
  • Export checksum verification (SHA-256)
  • Data validation and quality reports
  • GPU-accelerated merging and deduplication
  • Scraper wrappers with scrape_with_progress()

Phase 1.2: Smart Data Handling (v1.2.0) ✅

  • Smart field detection (case-insensitive ID matching)
  • Semantic field normalization (zipcode → zip_code)
  • Global auto-deduplication (skips already-scraped records)
  • Append-to-existing export mode
  • Cross-dataset deduplication

Phase 1.3: Bulk Operations & Master Combine (v1.3.0) ✅

  • Process ALL Socrata files through Comptroller at once
  • Separate Comptroller files per dataset (source traceability)
  • Master Combine All (full pipeline automation)
  • 9 Manual Combine Options (granular control)
  • Smart format detection (JSON-only for bulk)

Phase 1.4: Outlet Enrichment & Resilience (v1.4.0) ✅

  • Outlet Data Enricher (extract outlets from duplicates)
  • Persistent disk caching (survives restarts)
  • Network retry with exponential backoff
  • Configurable Comptroller API settings
  • New exports/polished/ directory

Phase 2: Business Enrichment (v1.5.0) ✅

  • Google Places API integration
    • Business phone numbers
    • Business websites
    • Ratings and reviews
    • Operating hours
    • Business status
  • Clearbit API integration (Planned)
    • Company emails
    • Social media profiles
    • Company logo and branding
    • Industry classification

Phase 3: Advanced Features (Planned)

  • Web dashboard interface
  • Scheduled automatic scraping
  • Email notifications
  • Cloud deployment support
  • API rate limit analytics
  • Data visualization exports
  • Unified company profile generation

FAQ

Q: Do I need an NVIDIA GPU to use this tool?

No! GPU acceleration is optional. The toolkit automatically falls back to CPU processing if no GPU is detected. GPU acceleration is recommended for datasets with 10,000+ records.

Q: Are the API keys free?

Yes! Both the Socrata API token and Texas Comptroller API key are free. The Socrata token increases your rate limit from 1,000 to 50,000 requests per hour.

Q: What data can I access?

You can access public Texas government data including Franchise Tax Permit Holders, Sales Tax Permit Holders, Mixed Beverage Tax Permit Holders, and detailed taxpayer information through the Comptroller API.

Q: Is the data up to date?

The data is fetched directly from official Texas government APIs in real-time, ensuring you always get the most current publicly available information.

Q: Can I use this for commercial purposes?

Please review the LICENSE file and the terms of use for the Texas Open Data Portal and Comptroller API for commercial use guidelines.


If you find this project useful, please consider giving it a ⭐!


Contact & Social Media


Project Maintainer: Chanderbhanswami

📧 Email: chanderbhanswami@gmail.com

🐦 X: Chanderbhanswa7


Citation

If you use this tool in your research or project, please cite it as:

@software{texas_data_scraper,
  author = {Chanderbhan Swami},
  title = {Texas Government Data Scraper Toolkit},
  year = {2025},
  url = {https://github.com/chanderbhanswami/texas-data-scraper}
}

Made with ❤️ by Chanderbhan Swami for data enthusiasts and researchers

Happy Scraping, Y'all!
