A Python utility that converts scanned invoices (PDF, images, or text files) into structured CSV/JSON data using OCR technology.
Manual processing of scanned invoices is time-consuming and error-prone. This tool automates the extraction of key invoice information including vendor details, line items, and totals from PDF documents, images, and text files, outputting structured data for further processing.
The solution uses a robust OCR-based approach with Tesseract OCR:
- Document Processing: PDF files are converted to high-resolution images using PyMuPDF
- Image Enhancement: Images are preprocessed with denoising and adaptive thresholding for better OCR accuracy
- OCR Processing: Tesseract OCR extracts text with optimized configurations for invoice documents
- Data Extraction: Smart parsing logic uses regex patterns to extract structured invoice data
- Validation & Output: Results are validated and exported to CSV and JSON formats
- Multi-format Support: Processes PDF files, images (PNG, JPG, JPEG, TIFF, BMP), and text files
- Intelligent Extraction: Extracts vendor name, invoice number, date, currency, and total amounts
- Image Enhancement: Advanced preprocessing for improved OCR accuracy
- Robust Error Handling: Graceful handling of corrupted files, missing fields, and OCR failures
- Dual Output Formats: Generates both CSV files and JSON with extracted data
- Tesseract Integration: Uses industry-standard Tesseract OCR engine
- Smart Pattern Matching: Advanced regex patterns for accurate data extraction
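The regex-based data extraction step listed above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual parsing code: the pattern names, the `extract_fields` function, the rule that the first non-empty line is the vendor name, and the symbol-to-currency map are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; real invoices usually need more variants.
INVOICE_NUMBER_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
# The leading \b keeps "Subtotal" from being mistaken for "Total".
TOTAL_RE = re.compile(
    r"\b(?:grand\s+total|total\s+due|total)\b\s*[:\-]?\s*([$€£])?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # assumed mapping


def _first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def extract_fields(text: str) -> dict:
    """Pull header fields out of OCR text; missing fields are set to None."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    total_match = TOTAL_RE.search(text)
    return {
        # Assumption: the vendor name is the first non-empty line of the page.
        "vendor_name": lines[0] if lines else None,
        "invoice_number": _first_group(INVOICE_NUMBER_RE, text),
        "invoice_date": _first_group(DATE_RE, text),
        "currency": CURRENCY_SYMBOLS.get(total_match.group(1)) if total_match else None,
        "grand_total": total_match.group(2).replace(",", "") if total_match else None,
    }
```

Returning `None` for fields that were not found, rather than raising, lets a later validation step report them as missing.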
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Input Files   │───▶│ File Validation  │───▶│   Load Image    │
│  (PDF/Images)   │    │   & Filtering    │    │   Processing    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Data Validation │◀───│ Data Extraction  │◀───│  Tesseract OCR  │
│   & Cleaning    │    │    & Parsing     │    │   Processing    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐
│   CSV Output    │    │   JSON Output    │
│ (Header/Lines)  │    │   (Raw + Meta)   │
└─────────────────┘    └──────────────────┘
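The "File Validation & Filtering" stage in the diagram could look something like the sketch below. The function name `find_invoice_files`, the exact extension set (in particular `.txt` for text files), and the rule that empty files are skipped are assumptions for illustration; the tool's own checks may differ.

```python
from pathlib import Path

# Extensions taken from the supported formats listed in this README;
# ".txt" for text files is an assumption based on the sample invoices.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".txt"}


def find_invoice_files(in_dir):
    """Return supported, non-empty files in in_dir, sorted by path."""
    root = Path(in_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Input directory not found: {root}")
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file()
        and path.suffix.lower() in SUPPORTED_EXTENSIONS  # case-insensitive match
        and path.stat().st_size > 0  # skip empty files early
    )
```

Filtering before OCR keeps unsupported or empty files out of the slower processing stages.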
- Clone or download this repository
- Install dependencies:
  pip install -r requirements.txt
- Install Tesseract OCR:
  - Download from: https://github.com/UB-Mannheim/tesseract/wiki
  - Install the Windows executable to the default location
  - Or install to a custom location and either update the path in the code or pass it with --tesseract_path
# Process all invoices in a directory
python scan2csv.py --in_dir "Sample Documents/Sample Documents" --out_csv results.csv --out_json results.json
# Process with custom Tesseract path
python scan2csv.py --in_dir invoices --out_csv results.csv --out_json results.json --tesseract_path "C:\Program Files\Tesseract-OCR\tesseract.exe"
- --in_dir: Required. Directory containing invoice files (PDF, PNG, JPG, etc.)
- --out_csv: Output CSV file path (creates _header.csv and _lines.csv variants)
- --out_json: Output JSON file path for raw OCR results
- --tesseract_path: Optional. Path to the Tesseract executable if it is not in the default location
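For readers who want to see how these options fit together, here is a minimal `argparse` sketch of the command line described above. The flag names come from this README; the default values, the `build_parser` and `csv_variants` helpers, and the way the `_header`/`_lines` file names are derived are assumptions, not the tool's confirmed implementation.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        prog="scan2csv.py",
        description="Convert scanned invoices to CSV/JSON.",
    )
    parser.add_argument("--in_dir", required=True,
                        help="Directory containing invoice files")
    parser.add_argument("--out_csv", default="results.csv",
                        help="Base CSV path; _header and _lines variants are written")
    parser.add_argument("--out_json", default="results.json",
                        help="JSON path for raw OCR results")
    parser.add_argument("--tesseract_path", default=None,
                        help="Path to the Tesseract executable, if not on the default path")
    return parser


def csv_variants(out_csv: str) -> tuple:
    """Derive the header and lines CSV paths from the base --out_csv value."""
    base = Path(out_csv)
    header = base.with_name(f"{base.stem}_header{base.suffix}")
    lines = base.with_name(f"{base.stem}_lines{base.suffix}")
    return header, lines
```

With this sketch, `--out_csv results.csv` would map to `results_header.csv` and `results_lines.csv`.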
invoices_header.csv - One row per invoice:
invoice_id,file_name,vendor_name,invoice_number,invoice_date,currency,grand_total,line_items_count,source_file
1,invoice1.pdf,ACME Corp,INV-001,2024-01-15,USD,1250.00,3,/path/to/invoice1.pdf
invoices_lines.csv - One row per line item:
invoice_id,line_number,description,quantity,unit_price,amount
1,1,Product A,2,100.00,200.00
1,2,Product B,1,150.00,150.00
The JSON output contains raw OCR results, extracted data, and validation metadata:
{
"metadata": {
"total_invoices": 5,
"processing_timestamp": "2024-01-15T10:30:00",
"tool_version": "1.0.0"
},
"invoices": [
{
"vendor_name": "ACME Corp",
"invoice_number": "INV-001",
"invoice_date": "2024-01-15",
"currency": "USD",
"grand_total": "1250.00",
"line_items": [...],
"validation": {
"completeness_score": 0.85,
"is_valid": true,
"missing_fields": []
},
"raw_ocr_text": "..."
}
]
}
The repository includes three sample invoices and their processing results:
- sample_invoices/invoice_001.txt: ACME Consulting LLC - Business consulting services ($6,510.00)
- sample_invoices/invoice_002.txt: Tech Supplies Inc - Computer equipment order ($3,147.05)
- sample_invoices/invoice_003.txt: Global Solutions Group - Professional services ($11,500.00)
- sample_results_header.csv: Invoice header data (3 invoices processed)
- sample_results_lines.csv: Line item details (7 line items extracted)
- sample_results.json: Complete raw data with OCR text for audit
The sample demonstrates successful extraction of vendor names, invoice numbers, dates, currencies, line items, and totals from diverse invoice formats.
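The split between a header CSV (one row per invoice) and a lines CSV (one row per line item) can be sketched with the standard `csv` module. The column names are taken from the examples in this README; the `split_rows` and `write_csv` helpers and the shape of each line item dictionary are hypothetical, since the line item structure is not spelled out here.

```python
import csv
import io

HEADER_COLUMNS = ["invoice_id", "file_name", "vendor_name", "invoice_number",
                  "invoice_date", "currency", "grand_total", "line_items_count",
                  "source_file"]
LINE_COLUMNS = ["invoice_id", "line_number", "description", "quantity",
                "unit_price", "amount"]


def split_rows(invoices):
    """Flatten invoice dicts into header rows and line-item rows."""
    header_rows, line_rows = [], []
    for invoice_id, invoice in enumerate(invoices, start=1):
        # Assumption: each line item is a dict keyed by the LINE_COLUMNS names.
        items = invoice.get("line_items", [])
        row = {column: invoice.get(column, "") for column in HEADER_COLUMNS}
        row["invoice_id"] = invoice_id
        row["line_items_count"] = len(items)
        header_rows.append(row)
        for line_number, item in enumerate(items, start=1):
            line_rows.append({**item, "invoice_id": invoice_id,
                              "line_number": line_number})
    return header_rows, line_rows


def write_csv(rows, columns, stream):
    """Write rows to an open text stream, ignoring keys outside columns."""
    writer = csv.DictWriter(stream, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Usage: write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_csv([], LINE_COLUMNS, buffer)
```

The shared `invoice_id` column is what lets the two files be joined again downstream.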
- Processing Speed: < 3 minutes for typical invoice batches on CPU
- GPU Acceleration: Significantly faster processing with CUDA-compatible GPUs
- Memory Usage: Optimized for batch processing with configurable memory management
The tool includes comprehensive error handling for:
- Corrupted or unreadable PDF files
- Invalid image formats
- Missing or incomplete invoice data
- OCR processing failures
- Network issues during model download
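One way to get this kind of graceful handling is to isolate each file in its own `try`/`except` and attach a validation record to every invoice that does parse. The sketch below is illustrative only: `process_batch`, `validate`, the `REQUIRED_FIELDS` tuple, and the scoring rule (share of required fields present) are assumptions, and the tool's real completeness score may be computed differently.

```python
import logging

logger = logging.getLogger("scan2csv")

# Assumed set of required header fields, based on the fields this README lists.
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "invoice_date",
                   "currency", "grand_total")


def validate(invoice: dict) -> dict:
    """Report missing fields and a simple completeness ratio (assumed rule)."""
    missing = [field for field in REQUIRED_FIELDS if not invoice.get(field)]
    score = 1 - len(missing) / len(REQUIRED_FIELDS)
    return {
        "completeness_score": round(score, 2),
        "is_valid": not missing,
        "missing_fields": missing,
    }


def process_batch(paths, extract):
    """Run extract(path) per file; a failure is logged and does not stop the batch."""
    results, failures = [], []
    for path in paths:
        try:
            invoice = extract(path)
        except Exception as exc:  # corrupted file, OCR failure, bad format, ...
            logger.warning("Skipping %s: %s", path, exc)
            failures.append((str(path), str(exc)))
            continue
        invoice["validation"] = validate(invoice)
        results.append(invoice)
    return results, failures
```

Here `extract` stands in for whatever function turns one file into an invoice dictionary; passing it in keeps the error handling independent of the OCR details.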
- Python >= 3.9
- Tesseract OCR executable
- See requirements.txt for the complete dependency list
MIT License - See LICENSE file for details.
- Model Download Fails: Ensure internet connection and try manual download
- GPU Not Detected: Install CUDA-compatible PyTorch version
- PDF Processing Errors: Ensure PyMuPDF is properly installed
- Memory Issues: Reduce batch size or use CPU mode
For issues or questions:
- Check the log files for detailed error messages
- Ensure all dependencies are properly installed
- Verify input file formats are supported
- Try processing a single file first to isolate issues