Automating extract–transform–load (ETL) pipelines for scanned business documents typically demands costly, fine-tuned, layout-aware models. We present a cloud-native architecture that transforms heterogeneous documents into a unified, structured JSON schema—without any model fine-tuning. Our pipeline combines off-the-shelf OCR (Azure Document Intelligence) with a schema-constrained large language model (LLM), guided by type-checked Pydantic outputs and a one-pass swap heuristic for efficient few-shot prompting. Evaluated on the FUNSD (form) and CORD (receipt) corpora, the system achieves 0.60 and 0.83 fuzzy KV F1 scores, respectively, while processing each page in under eight seconds at under $0.004 on standard cloud quota. Scaling to a larger LLM boosts CORD accuracy to 0.89 F1 at under $0.02 per page. The entire pipeline—code, prompts, and metric scripts—is open-sourced, enabling lightweight, fully-deployable semantic ETL for small- to medium-scale workloads.
Structura/
├── docs/
│ ├── Few-Shot Optimization Pipeline.pdf
│ └── System Architecture Diagram.pdf
├── paper/
├── src/
│ ├── benchmark.py
│ ├── clients.py
│ ├── inference.py
│ ├── main.py
│ ├── metrics.py
│ ├── optimizer.py
│ ├── schemas.py
│ ├── system_prompt.py
│ ├── benchmarks/
│ ├── datasets/
│ │ ├── cord/
│ │ └── funsd/
│ └── prompts/
│ ├── cord/
│ └── funsd/
├── LICENSE
├── README.MD
└── requirements.txt
- Python 3.10+ is recommended.
- Create a virtual environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Configure environment variables. Copy the provided template and fill in values from your Azure resources:
cp .env.template .env
Environment keys consumed by the code:
# Document Intelligence Endpoint
AZUREDOCINTEL_BASE_URI: Azure Document Intelligence endpoint
AZUREDOCINTEL_TOKEN: Azure Document Intelligence API key
# Azure OpenAI Endpoint
AZUREOPENAI_BASE_URI: Azure OpenAI endpoint
AZUREOPENAI_API_TOKEN: Azure OpenAI API key
AZUREOPENAI_API_VERSION: Azure OpenAI API version
AZUREOPENAI_MODEL_NAME: Deployed Azure OpenAI model name
The library loads variables via dotenv at import time (see src/clients.py).
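For reference, the import-time loading pattern looks roughly like this (a minimal sketch, assuming python-dotenv; the exact handling in src/clients.py may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory at import time

# Hypothetical module-level constants; names mirror the keys above.
DOCINTEL_URI = os.environ["AZUREDOCINTEL_BASE_URI"]
DOCINTEL_KEY = os.environ["AZUREDOCINTEL_TOKEN"]
OPENAI_URI = os.environ["AZUREOPENAI_BASE_URI"]
OPENAI_KEY = os.environ["AZUREOPENAI_API_TOKEN"]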
Sample datasets are included under src/datasets/:
- cord/: CORD receipt corpus (images and JSON annotations)
- funsd/: FUNSD form corpus (images and JSON annotations)
Each dataset has images/ and annotations/ directories. Filenames (without extension) align between image and JSON.
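Because the stems align, pairing files is straightforward; a minimal sketch (iter_pairs is a hypothetical helper, not part of the repository):

from pathlib import Path

# Pair each image with its annotation by shared filename stem.
def iter_pairs(dataset_dir: str):
    root = Path(dataset_dir)
    for image in sorted((root / "images").iterdir()):
        annotation = root / "annotations" / f"{image.stem}.json"
        if annotation.exists():
            yield image, annotation

for image, annotation in iter_pairs("src/datasets/cord"):
    print(image.name, "->", annotation.name)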
The default entrypoint runs CORD with gpt-4o-mini, generates few-shot exemplars, evaluates, and iteratively improves the exemplar set with a one-pass swap heuristic.
python src/main.py
Artifacts are written to src/benchmarks/ as JSON, plus failure reports for timeouts and errors.
Adjust high-level settings in src/main.py:
- dataset_name: one of cord or funsd
- schema: a Pydantic schema from src/schemas.py (e.g., CORDSchema)
- model: your Azure OpenAI deployment name (e.g., gpt-4o-mini)
- fewshot_count, fewshot_z_swap, max_test_size: exemplar and evaluation sizes
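Together these form a small configuration block in src/main.py; the values below are illustrative only (the actual defaults may differ):

from src.schemas import CORDSchema

dataset_name = "cord"      # or "funsd"
schema = CORDSchema        # Pydantic schema from src/schemas.py
model = "gpt-4o-mini"      # your Azure OpenAI deployment name
fewshot_count = 3          # exemplars embedded in the system prompt (illustrative)
fewshot_z_swap = 1         # exemplars exchanged per optimizer pass (illustrative)
max_test_size = 50         # evaluation set size (illustrative)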
from src.inference import get_response
from src.schemas import CORDSchema
from src.system_prompt import get_system_prompt
system_prompt = get_system_prompt(train_set=["075", "153"], dataset="cord", overwrite=False, use_fewshot=True)
ocr_text, ocr_ms, llm_json, llm_ms = get_response(
system_prompt=system_prompt,
pydantic_schema=CORDSchema,
model_name="gpt-4o-mini",
file_path="src/datasets/cord/images/000.png",
temperature=0.3,
)
print(llm_json)
src/benchmark.py computes metrics and aggregates results. Metrics include fuzzy and exact KV F1, canonical F1 (Hungarian alignment), value quality, and confusion statistics (see src/metrics.py).
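To make the fuzzy KV F1 concrete, here is a self-contained sketch in the spirit of src/metrics.py, using the stdlib difflib; the repository's thresholds and matching rules may differ:

from difflib import SequenceMatcher

# A predicted (key, value) pair counts as a hit when the key exists in the
# gold annotation and the values are similar enough.
def fuzzy_kv_f1(pred: dict, gold: dict, threshold: float = 0.8) -> float:
    def similar(a, b):
        return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() >= threshold

    hits = sum(1 for k, v in pred.items() if k in gold and similar(v, gold[k]))
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(fuzzy_kv_f1({"total": "12.50", "date": "2025-01-01"},
                  {"total": "12.50", "date": "01/01/2025"}))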
Core execution path:
- OCR via Azure Document Intelligence prebuilt-layout with KEY_VALUE_PAIRS (src/inference.get_docintel_result).
- Prompt construction from dataset templates plus generated exemplars (src/system_prompt.py).
- LLM call through Azure OpenAI with instructor for schema-constrained Pydantic outputs (src/inference.get_instructor_response); a sketch of this pattern follows the list.
- Structured JSON validation by Pydantic models in src/schemas.py.
- Parallelization with thread pools for OCR and LLM, bounded connection pools, and lightweight rate limiting for OCR posts (src/inference.py, src/benchmark.py).
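The instructor step is what makes outputs schema-constrained: the wrapped client retries until the completion parses into the Pydantic model. A minimal sketch of that pattern, with ReceiptTotal as a stand-in schema and env-based wiring assumed (see src/inference.get_instructor_response for the repository's version):

import os
import instructor
from openai import AzureOpenAI
from pydantic import BaseModel

class ReceiptTotal(BaseModel):  # stand-in for a schema from src/schemas.py
    total: str

client = instructor.from_openai(AzureOpenAI(
    azure_endpoint=os.environ["AZUREOPENAI_BASE_URI"],
    api_key=os.environ["AZUREOPENAI_API_TOKEN"],
    api_version=os.environ["AZUREOPENAI_API_VERSION"],
))

result = client.chat.completions.create(
    model=os.environ["AZUREOPENAI_MODEL_NAME"],
    response_model=ReceiptTotal,  # instructor validates/retries against this
    messages=[
        {"role": "system", "content": "Extract fields as JSON."},
        {"role": "user", "content": "TOTAL 12.50 THANK YOU"},
    ],
)
print(result.total)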
This repository implements a one-pass swap heuristic (see src/optimizer.py):
- Select an initial exemplar set and a disjoint test set (src/system_prompt.get_random_train_set).
- Evaluate training exemplars without few-shot examples to estimate individual utility.
- Generate a test-time system prompt with few-shot examples and evaluate on the test set.
- Swap out the best-performing training exemplars for the worst-performing test samples (z-swap).
- Iterate until the train/test sets stabilize or the iteration budget is reached.
The few-shot examples are materialized into src/prompts/<dataset>/fewshot_examples.txt and combined with src/prompts/<dataset>/prompt.txt.
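Schematically, the loop looks like the sketch below; score_exemplars and evaluate are hypothetical stand-ins for the repository's evaluation calls, and the improvement-based stop is one plausible reading of "stabilize":

# Schematic sketch of the one-pass z-swap loop (not the actual
# src/optimizer.py API).
def one_pass_swap(train, test, z, budget, score_exemplars, evaluate):
    for _ in range(budget):
        utility = score_exemplars(train)             # per-exemplar utility, no few-shot
        mean_f1, per_sample = evaluate(train, test)  # few-shot prompt built from train
        best_train = sorted(train, key=utility.get, reverse=True)[:z]
        worst_test = sorted(test, key=per_sample.get)[:z]
        # z-swap: exchange the strongest exemplars for the weakest test samples
        candidate_train = [x for x in train if x not in best_train] + worst_test
        candidate_test = [x for x in test if x not in worst_test] + best_train
        new_f1, _ = evaluate(candidate_train, candidate_test)
        if new_f1 <= mean_f1:                        # no further gain: stop
            break
        train, test = candidate_train, candidate_test
    return train, test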
- FUNSD (forms), fuzzy KV F1: 0.60
- CORD (receipts), fuzzy KV F1: 0.83
- Larger LLM on CORD: 0.89
- Throughput: <8 seconds per page on standard cloud quota
- Cost: <$0.004 per page with the small model; <$0.02 with a larger model
Empirical outputs for this repository appear under src/benchmarks/.
Key tunables (edit in code):
- src/main.py: dataset/model selection, exemplar counts, temperature
- src/benchmark.py: parallelism, timeouts/retries, output file naming
- src/metrics.py: thresholds for fuzzy matching and canonical alignment
@article{gupta2025llm,
  author  = {Gupta, Shreyan},
  title   = {An LLM-Based ETL Architecture for Semantic Normalization of Unstructured Data},
  journal = {Preprint submitted to IEEE MIT Undergraduate Research Technology Conference (URTC) 2025},
  year    = {2025},
  doi     = {10.5281/zenodo.16786494},
  url     = {https://doi.org/10.5281/zenodo.16786494},
  version = {v1},
}
MIT License. See LICENSE.

