A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs & Google Places API
Features • Installation • Usage • Docs • Contributing • Support
A comprehensive, production-ready toolkit for scraping and processing data from Texas government APIs, including the Socrata Open Data Portal and the Texas Comptroller API. Features GPU acceleration (CUDA/cuDNN), intelligent data merging, automated deduplication, and precise business lookups through the Google Places API.
- Dual API Support: Socrata Open Data Portal + Texas Comptroller API
- GPU Acceleration: CUDA/cuDNN optimized for NVIDIA RTX 3060
- Interactive CLI: User-friendly menus for all operations
- Multi-Format Export: JSON, CSV, and Excel with automatic formatting
- Smart Data Merging: Intelligent field prioritization and conflict resolution
- Advanced Deduplication: Multiple strategies with merge capabilities
- Rate Limiting: Intelligent throttling with automatic token management
- Comprehensive Logging: Detailed logs with rotation and compression
- Progress Persistence: Resume interrupted downloads from checkpoints (v1.1.0)
- Export Verification: SHA-256 checksums for data integrity (v1.1.0)
- Smart Field Detection: Case-insensitive matching with 20+ field variations (v1.2.0)
- Global Auto-Deduplication: Automatically skips already-scraped records (v1.2.0)
- Append-to-Existing Exports: Single consolidated file per dataset (v1.2.0)
- Process ALL Socrata Files: Bulk process all datasets through Comptroller (v1.3.0)
- Separate Comptroller Files: Source-specific filenames per dataset (v1.3.0)
- Master Combine All: Full pipeline merge of all Socrata + Comptroller data (v1.3.0)
- 9 Manual Combine Options: Granular control over file merging (v1.3.0)
- Outlet Data Enricher: Extract outlet fields from duplicate Socrata records (v1.4.0)
- Persistent Disk Caching: Comptroller cache survives restarts - truly resumable (v1.4.0)
- Network Retry with Backoff: Automatic recovery from internet outages (v1.4.0)
- Configurable Comptroller Settings: Fine-tune concurrent requests, chunk size, delays (v1.4.0)
- Google Places API Integration: Get phone, website, ratings, hours from Google (v1.5.0)
- Two-Step Places Workflow: Find Place IDs → Get Place Details (v1.5.0)
- Final Data Combiner: Merge Google Places with polished taxpayer data (v1.5.0)
- New Places API v1: Migrated to `places.googleapis.com/v1` with header-based auth (v1.5.1)
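As an illustration of what "header-based auth" means here, the sketch below builds a request against the new Places API v1 endpoint. This is not the project's actual client code: the helper name and the exact field mask are hypothetical, but the `X-Goog-Api-Key` / `X-Goog-FieldMask` headers are how the v1 API authenticates (the legacy `key=` query parameter is gone).

```python
# Hypothetical sketch of a Places API v1 request (not the project's client).
PLACES_SEARCH_URL = "https://places.googleapis.com/v1/places:searchText"

def build_places_request(api_key: str, query: str) -> dict:
    """Return keyword arguments for a requests.post() call against Places API v1."""
    return {
        "url": PLACES_SEARCH_URL,
        "headers": {
            # Auth moved from the ?key= query parameter into headers:
            "X-Goog-Api-Key": api_key,
            # v1 requires an explicit field mask for every request:
            "X-Goog-FieldMask": "places.id,places.displayName,places.formattedAddress",
        },
        "json": {"textQuery": query},
    }

# With the `requests` library installed, a call would look like:
# resp = requests.post(**build_places_request("YOUR_KEY", "Franklin Barbecue, Austin TX"))
```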
- Franchise Tax Permit Holders
- Sales Tax Permit Holders
- Mixed Beverage Tax Permit Holders
- Tax Registration Data
- Detailed Taxpayer Information
- FTAS (Franchise Tax Account Status) Records
- Python 3.8+
- NVIDIA GPU with CUDA support (optional, for GPU acceleration)
- CUDA Toolkit 11.8+ or 12.x or 13.x
- cuDNN 8.9.x or later
- 8GB+ RAM (16GB recommended)
- 10GB+ free disk space
- Socrata API Token (optional but recommended)
- Increases rate limit from 1,000 to 50,000 requests/hour
- Get at: https://data.texas.gov/profile/edit/developer_settings
- Texas Comptroller API Key (required for full access)
```bash
git clone <repository-url>
cd texas-data-scraper
```

```bash
python -m venv venv

# Windows
venv\Scripts\activate

# Linux/Mac
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
# First, ensure CUDA Toolkit and cuDNN are installed
# Download from: https://developer.nvidia.com/cuda-downloads
# and: https://developer.nvidia.com/cudnn

# Then install GPU requirements
pip install -r requirements-gpu.txt
```

```bash
# Copy the example environment file
cp config/.env.example .env

# Edit .env and add your API keys
nano .env  # or use your preferred editor
```

```env
# Socrata API Token (optional but recommended)
SOCRATA_APP_TOKEN=your_token_here

# Comptroller API Key (required)
COMPTROLLER_API_KEY=your_key_here

# GPU Settings (if using GPU)
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240
```

```bash
# Test API endpoints
python scripts/api_tester.py

# Check GPU availability (if using GPU)
python -c "import cupy; print('GPU Available:', cupy.cuda.is_available())"
```

Download data from the Texas Open Data Portal:
```bash
python scripts/socrata_scraper.py
```

Interactive Menu Options:
- Download full datasets (Franchise Tax, Sales Tax, etc.)
- Download with custom record limits
- Search by business name, city, ZIP code, agent name, etc.
- View dataset metadata
- Export data in multiple formats
Example Workflow:
- Select "1" for Franchise Tax (full dataset)
- Wait for download to complete
- Choose "Yes" to export data
- Files saved to `exports/socrata/` in JSON, CSV, and Excel formats
Fetch detailed taxpayer information:
```bash
python scripts/comptroller_scraper.py
```

Features:
- Auto-detect Socrata export files
- Batch process taxpayer IDs
- Single taxpayer lookup (terminal-only display)
- Async processing for faster results
- Combined details + FTAS records
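The async batch processing above can be sketched as follows. This is illustrative, not the project's actual scraper: `fetch_one` stands in for whatever coroutine performs a single Comptroller lookup, and the semaphore caps how many requests are in flight at once.

```python
import asyncio

# Illustrative sketch of async batch lookups (not the project's actual code).
async def fetch_batch(fetch_one, taxpayer_ids, concurrency=5):
    sem = asyncio.Semaphore(concurrency)

    async def guarded(tid):
        async with sem:  # at most `concurrency` lookups in flight at once
            return await fetch_one(tid)

    # Results come back in the same order as the input IDs.
    return await asyncio.gather(*(guarded(t) for t in taxpayer_ids))
```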
Example Workflow:
- First, run Socrata scraper to get taxpayer IDs
- Select "1" for auto-detect Socrata files
- Choose the most recent export
- Select async processing method
- Wait for batch processing
- Export enriched data
Merge Socrata and Comptroller data intelligently:
```bash
python scripts/data_combiner.py
```

Features:
- Smart field merging with priority (Comptroller > Socrata)
- Automatic conflict resolution
- Support for JSON, CSV, and Excel
- Auto-detect latest exports
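The merge priority (Comptroller > Socrata) can be summarized in a few lines. This is an assumed minimal sketch of the rule, not the exact implementation in `src/processors/data_combiner.py`: Comptroller values win, but empty values never overwrite existing data.

```python
# Minimal sketch of the priority rule (assumed, not the exact implementation):
# Comptroller fields win over Socrata fields, but empty values never overwrite.
def merge_records(socrata: dict, comptroller: dict) -> dict:
    merged = dict(socrata)
    for key, value in comptroller.items():
        if value not in (None, ""):
            merged[key] = value
    return merged
```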
Example Workflow:
- Select "4" for auto-detect and combine
- Confirm the detected files
- View combination statistics
- Export combined data
Remove duplicate records and polish data:
```bash
python scripts/deduplicator.py
```

Deduplication Strategies:
- `taxpayer_id`: Remove duplicates by taxpayer ID (fastest)
- `exact`: Remove exact duplicate records
- `fuzzy`: Fuzzy matching on key fields
Advanced Options:
- Deduplicate with merge (combines duplicate records)
- Deduplicate by confidence (keeps most complete record)
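A hypothetical sketch of the `taxpayer_id` strategy combined with confidence-based selection: for each ID, keep the record with the most non-empty fields. The function name and scoring rule are illustrative, not the toolkit's exact code.

```python
# Hypothetical sketch: keep the most complete record per taxpayer ID.
def dedupe_by_taxpayer_id(records):
    best = {}
    for rec in records:
        tid = rec.get("taxpayer_id")
        # "Confidence" here = count of non-empty fields in the record.
        score = sum(1 for v in rec.values() if v not in (None, ""))
        if tid not in best or score > best[tid][0]:
            best[tid] = (score, rec)
    return [rec for _, rec in best.values()]
```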
Example Workflow:
- Select "4" to deduplicate all combined exports
- Review deduplication statistics
- Files saved to `exports/deduplicated/`
Enrich deduplicated data with outlet information from duplicate records:
```bash
python scripts/outlet_enricher.py
```

Features:
- Extract outlet fields from duplicate Socrata records
- Enrich deduplicated data with outlet info
- GPU acceleration support
- Handles multiple outlets per taxpayer
Outlet Fields Extracted:
`outlet_number`, `outlet_name`, `outlet_address`, `outlet_city`, `outlet_state`, `outlet_zip_code`, `outlet_county_code`, `outlet_naics_code`, `outlet_permit_issue_date`, `outlet_first_sales_date`
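The extraction step can be sketched like this: group duplicate Socrata rows by taxpayer ID and collect any non-empty `outlet_*` fields from each duplicate. This is a hypothetical illustration, not the enricher's actual implementation.

```python
# Hypothetical sketch: collect outlet_* fields from duplicate rows, grouped by taxpayer.
def extract_outlets(rows):
    outlets = {}
    for row in rows:
        # Pick up every non-empty field whose name starts with "outlet_".
        outlet = {k: v for k, v in row.items() if k.startswith("outlet_") and v}
        if outlet:
            outlets.setdefault(row.get("taxpayer_id"), []).append(outlet)
    return outlets
```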
Example Workflow:
- Select "1" for Auto-Enrich
- Choose Socrata source file
- Choose Deduplicated file
- Files saved to `exports/polished/`
Enrich data with Google Places business information:
```bash
python scripts/google_places_scraper.py
```

Two-Step Workflow:
- Step 1: Find Place IDs - Search Google Places using business name + address
- Step 2: Get Details - Fetch full business info using Place IDs
Fields Extracted:
`formatted_phone_number`, `international_phone_number`, `website`, `url` (Google Maps link), `rating`, `user_ratings_total`, `business_status`, `types` (categories), `opening_hours`, `geometry` (lat/lng), `reviews`, `photos`
Example Workflow:
- Select "1" for Auto-Find Place IDs (from polished data)
- Export to `exports/place_ids/`
- Continue to get details
- Export to `exports/places_details/`
- Use Data Combiner option 13 to merge with polished data
- Final output in `exports/final/`
Test all API endpoints:
```bash
python scripts/api_tester.py
```

Tests:
- Socrata API connection and token validation
- Comptroller API connection and key validation
- Dataset access and search functionality
- Pagination and metadata retrieval
- Error handling
texas-data-scraper/
│
├── .cache/ # Cache directory
│ ├── progress/ # Progress checkpoints for resume
│ ├── comptroller/ # Comptroller API response cache (v1.4.0)
│ └── google_places/ # Google Places API cache (v1.5.0)
│ ├── place_ids/ # Cached place ID lookups
│ └── details/ # Cached place details
│
├── config/
│ ├── __init__.py # Config package initialization
│ ├── settings.py # Configuration management
│ └── .env.example # Environment variables template
│
├── docs/
│ ├── ABSOLUTELY_FINAL_SUMMARY.md # Final project summary
│ ├── DEPLOYMENT_GUIDE.md # Deployment instructions
│ ├── FINAL_COMPLETE_CHECKLIST.md # Complete feature checklist
│ ├── INSTALLATION_CHECKLIST.md # Installation guide
│ └── QUICK_START.md # Quick start guide
│
├── exports/ # Output directory for exported data
│ ├── combined/ # Combined data exports
│ ├── comptroller/ # Comptroller data exports
│ ├── deduplicated/ # Deduplicated data exports
│ ├── polished/ # Outlet-enriched data exports (v1.4.0)
│ ├── place_ids/ # Google Place IDs exports (v1.5.0)
│ ├── places_details/ # Google Places details exports (v1.5.0)
│ ├── final/ # Final combined data exports (v1.5.0)
│ └── socrata/ # Socrata data exports
│
├── logs/ # Log files directory
│
├── scripts/
│ ├── api_tester.py # API endpoint testing
│ ├── batch_processor.py # Batch processing CLI
│ ├── comptroller_scraper.py # Main Comptroller scraper CLI
│ ├── data_combiner.py # Data combination CLI
│ ├── deduplicator.py # Deduplication CLI
│ ├── google_places_scraper.py # Google Places API CLI (v1.5.0)
│ ├── outlet_enricher.py # Outlet data enrichment CLI (v1.4.0)
│ └── socrata_scraper.py # Main Socrata scraper CLI
│
├── src/
│ ├── __init__.py # Source package initialization
│ │
│ ├── api/
│ │ ├── __init__.py # API package initialization
│ │ ├── comptroller_client.py # Comptroller API client
│ │ ├── google_places_client.py # Google Places API client (v1.5.0)
│ │ ├── rate_limiter.py # Rate limiting logic
│ │ └── socrata_client.py # Socrata API client
│ │
│ ├── exporters/
│ │ ├── __init__.py # Exporters package initialization
│ │ └── file_exporter.py # Export to JSON/CSV/Excel
│ │
│ ├── processors/
│ │ ├── __init__.py # Processors package initialization
│ │ ├── data_combiner.py # Combine Socrata + Comptroller data
│ │ ├── data_validator.py # Data validation
│ │ ├── deduplicator.py # Remove duplicates
│ │ └── outlet_enricher.py # Outlet data enrichment (v1.4.0)
│ │
│ ├── scrapers/
│ │ ├── __init__.py # Scrapers package initialization
│ │ ├── comptroller_scraper.py # Comptroller data scraper
│ │ ├── google_places_scraper.py # Google Places scraper (v1.5.0)
│ │ ├── gpu_accelerator.py # GPU acceleration utilities
│ │ └── socrata_scraper.py # Socrata data scraper
│ │
│ └── utils/
│ ├── __init__.py # Utils package initialization
│ ├── checksum.py # File checksum verification
│ ├── helpers.py # Helper functions
│ ├── logger.py # Logging utilities
│ ├── menu.py # Interactive CLI menu
│ └── progress_manager.py # Progress persistence for downloads
│
├── tests/
│ ├── __init__.py # Tests package initialization
│ ├── test_comptroller_api.py # Comptroller API tests
│ ├── test_google_places_api.py # Google Places API tests
│ ├── test_integration.py # Integration tests
│ ├── test_processors.py # Processor tests
│ ├── test_scrapers.py # Scraper tests
│ └── test_socrata_api.py # Socrata API tests
│
├── .env # Environment variables (gitignored)
├── .gitignore # Git ignore file
├── CHANGELOG.md # Project changelog
├── CONTRIBUTING.md # Contribution guidelines
├── LICENSE # Project license
├── Makefile # Make commands for automation
├── PROJECT_STRUCTURE.md # This file - project structure docs
├── PROJECT_SUMMARY.md # Detailed project summary
├── README.md # Main documentation
├── requirements.txt # Python dependencies
├── requirements-gpu.txt # GPU-specific dependencies
├── run.py # Main entry point runner
├── setup.py # Package setup
└── setup_project.py # Project setup/initialization script
```bash
# Step 1: Download Socrata data
python scripts/socrata_scraper.py
# Select: 1 (Franchise Tax full dataset)
# Export: Yes

# Step 2: Enrich with Comptroller data
python scripts/comptroller_scraper.py
# Select: 1 (Auto-detect Socrata files)
# Choose: Latest export
# Method: 2 (Async)
# Export: Yes

# Step 3: Combine both datasets
python scripts/data_combiner.py
# Select: 4 (Auto-detect and combine)
# Export: Yes (all formats)

# Step 4: Remove duplicates
python scripts/deduplicator.py
# Select: 4 (Deduplicate all combined)
# Strategy: taxpayer_id

# Final data location: exports/deduplicated/
```

Socrata API:
- Without token: 1,000 requests/hour
- With token: 50,000 requests/hour
Comptroller API:
- 100 requests/minute (with API key)
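To make the throttling concrete, here is a minimal illustrative limiter (not the project's `rate_limiter.py`): it spaces calls so they never exceed `rate` requests per `period` seconds.

```python
import time

# Illustrative throttle sketch (not the project's rate_limiter.py).
class SimpleRateLimiter:
    def __init__(self, rate, period=60.0):
        self.min_interval = period / rate  # minimum seconds between calls
        self._last = 0.0

    def wait(self):
        """Block just long enough to stay under the configured rate."""
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

# limiter = SimpleRateLimiter(rate=100, period=60)  # Comptroller: 100 requests/minute
```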
Optimize for your RTX 3060:
```env
# .env file
USE_GPU=true
GPU_DEVICE_ID=0
GPU_MEMORY_LIMIT=10240   # 10GB for RTX 3060 (12GB total)
BATCH_SIZE=100           # Records per batch
CONCURRENT_REQUESTS=5    # Simultaneous requests
```

1. GPU Not Detected
```bash
# Check CUDA installation
nvidia-smi

# Verify CuPy installation
python -c "import cupy; cupy.cuda.runtime.getDeviceProperties(0)"
```

If issues persist, use CPU-only mode:

```env
USE_GPU=false
```

2. Rate Limit Errors
- Add Socrata API token to `.env`
- Reduce `CONCURRENT_REQUESTS` in settings
- Increase `REQUEST_DELAY`
3. Memory Errors
- Reduce `BATCH_SIZE`
- Lower `GPU_MEMORY_LIMIT`
- Process smaller datasets
4. Import Errors
```bash
# Reinstall dependencies
pip install --upgrade -r requirements.txt

# For GPU issues
pip install --force-reinstall -r requirements-gpu.txt
```

All exports include:
- JSON: Human-readable, preserves data types
- CSV: Excel-compatible (UTF-8 with BOM)
- Excel: Formatted with headers, auto-sized columns
`[source]_[dataset]_[timestamp].[ext]`

Example: `franchise_tax_20251226_143052.json`
- API keys stored in `.env` (gitignored)
- No data transmitted to third parties
- All processing done locally
- Logs exclude sensitive information
- Use GPU acceleration for large datasets (10k+ records)
- Enable Socrata API token for faster downloads
- Use async processing in Comptroller scraper
- Batch process large taxpayer ID lists
- Clear GPU memory between large operations
Run the test suite:
```bash
# API endpoint tests
python scripts/api_tester.py

# Unit tests (if implemented)
pytest tests/

# Integration tests
pytest tests/test_integration.py
```

Logs are saved to the `logs/` directory:
- `texas_scraper_YYYY-MM-DD.log` - All operations
- `errors_YYYY-MM-DD.log` - Errors only
- Automatic rotation at 100MB
- Compressed archives retained
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For issues or questions:
- Check troubleshooting section
- Review logs in the `logs/` directory
- Check API status pages
- Create an issue on GitHub
- Socrata API Documentation
- Texas Open Data Portal
- Texas Comptroller API
- CUDA Toolkit Documentation
- cuDNN Documentation
Click to view screenshots
╔══════════════════════════════════════════════════════════════╗
║ TEXAS DATA SCRAPER - SOCRATA MENU ║
╠══════════════════════════════════════════════════════════════╣
║ 1. Download Franchise Tax (Full Dataset) ║
║ 2. Download Sales Tax (Full Dataset) ║
║ 3. Download Mixed Beverage Tax (Full Dataset) ║
║ 4. Search by Business Name ║
║ 5. Search by City ║
║ ... ║
╚══════════════════════════════════════════════════════════════╝
```
Socrata API → Raw Data → Comptroller Enrichment → Merge → Deduplicate → Export
     ↓            ↓               ↓                  ↓          ↓           ↓
 50k+ records  JSON/CSV      +FTAS data          Combined   Cleaned  JSON/CSV/Excel
```
See our project roadmap for upcoming features.
- Socrata Open Data Portal integration
- Texas Comptroller API integration
- GPU acceleration with CUDA
- Multi-format export (JSON, CSV, Excel)
- Data deduplication
- Interactive CLI menus
- Progress persistence (resume interrupted downloads)
- Export checksum verification (SHA-256)
- Data validation and quality reports
- GPU-accelerated merging and deduplication
- Scraper wrappers with `scrape_with_progress()`
- Smart field detection (case-insensitive ID matching)
- Semantic field normalization (`zipcode` → `zip_code`)
- Global auto-deduplication (skips already-scraped records)
- Append-to-existing export mode
- Cross-dataset deduplication
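The v1.2.0 field normalization can be sketched as a case-insensitive alias lookup. The alias table below is hypothetical (the toolkit reportedly matches 20+ variations); only the `zipcode` → `zip_code` mapping is taken from the list above.

```python
# Hypothetical alias table; the real toolkit matches 20+ field variations.
FIELD_ALIASES = {
    "taxpayer_id": {"taxpayer_id", "taxpayer_number", "taxpayerid"},
    "zip_code": {"zip_code", "zipcode", "zip"},
}

def normalize_field(name):
    """Case-insensitively map a raw column name to its canonical form."""
    key = name.strip().lower().replace(" ", "_")
    for canonical, aliases in FIELD_ALIASES.items():
        if key in aliases:
            return canonical
    return key  # unknown fields pass through unchanged
```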
- Process ALL Socrata files through Comptroller at once
- Separate Comptroller files per dataset (source traceability)
- Master Combine All (full pipeline automation)
- 9 Manual Combine Options (granular control)
- Smart format detection (JSON-only for bulk)
- Outlet Data Enricher (extract outlets from duplicates)
- Persistent disk caching (survives restarts)
- Network retry with exponential backoff
- Configurable Comptroller API settings
- New `exports/polished/` directory
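The v1.4.0 network retry with exponential backoff follows a standard pattern: double the wait after each failed attempt, then re-raise once retries are exhausted. An illustrative sketch, not the project's exact implementation (the injectable `sleep` parameter is added here for testability):

```python
import time

# Illustrative retry helper: exponential backoff doubles the delay per attempt.
def retry_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```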
- Google Places API integration
- Business phone numbers
- Business websites
- Ratings and reviews
- Operating hours
- Business status
- Clearbit API integration (Planned)
- Company emails
- Social media profiles
- Company logo and branding
- Industry classification
- Web dashboard interface
- Scheduled automatic scraping
- Email notifications
- Cloud deployment support
- API rate limit analytics
- Data visualization exports
- Unified company profile generation
Q: Do I need an NVIDIA GPU to use this tool?
No! GPU acceleration is optional. The toolkit automatically falls back to CPU processing if no GPU is detected. GPU acceleration is recommended for datasets with 10,000+ records.
Q: Are the API keys free?
Yes! Both the Socrata API token and Texas Comptroller API key are free. The Socrata token increases your rate limit from 1,000 to 50,000 requests per hour.
Q: What data can I access?
You can access public Texas government data including Franchise Tax Permit Holders, Sales Tax Permit Holders, Mixed Beverage Tax Permit Holders, and detailed taxpayer information through the Comptroller API.
Q: Is the data up to date?
The data is fetched directly from official Texas government APIs in real-time, ensuring you always get the most current publicly available information.
Q: Can I use this for commercial purposes?
Please review the LICENSE file and the terms of use for the Texas Open Data Portal and Comptroller API for commercial use guidelines.
- Texas Open Data Portal - For providing open access to state data
- Texas Comptroller of Public Accounts - For the comprehensive taxpayer API
- Socrata - For their excellent Open Data API
- NVIDIA - For CUDA toolkit and GPU acceleration support
- All contributors who help improve this project
Project Maintainer: Chanderbhanswami
📧 Email: chanderbhanswami@gmail.com
🐦 X: Chanderbhanswa7
If you use this tool in your research or project, please cite it as:
@software{texas_data_scraper,
author = {Chanderbhan Swami},
title = {Texas Government Data Scraper Toolkit},
year = {2025},
url = {https://github.com/chanderbhanswami/texas-data-scraper}
}