US Address Cleaner helps you normalize messy US addresses and reliably extract city, state, and ZIP code from free-form text. It’s built for teams that need clean, consistent address fields for shipping, CRM hygiene, lead enrichment, and analytics. Use it to turn noisy inputs into structured outputs with practical NLP-based parsing.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for us-address-cleaner you've just found your team — Let’s Chat. 👆👆
This project normalizes US addresses and extracts structured fields (city, state, zipcode) from non-standard address strings. It solves the common problem of inconsistent address formatting that breaks downstream systems like shipping validation, deduplication, and reporting. It’s designed for developers, data teams, and automation pipelines that need dependable US address normalization at scale.
- Parses free-form address text (commas, extra tokens, inconsistent spacing, mixed abbreviations)
- Extracts city, state (including multi-token state/region strings), and ZIP/ZIP+4 when present
- Applies normalization rules (whitespace cleanup, punctuation handling, casing standardization)
- Produces predictable, table-friendly output for databases, spreadsheets, and ETL jobs
- Supports batch processing to clean large lists of physical addresses efficiently
| Feature | Description |
|---|---|
| Free-form input parsing | Accepts messy address strings and extracts key fields without strict formatting requirements. |
| City/state/ZIP extraction | Identifies city, state, and ZIP/ZIP+4 with robust heuristics and NLP cues. |
| Address normalization | Cleans spacing, punctuation, and common abbreviations for consistent downstream use. |
| Batch processing | Processes lists of addresses in one run for ETL pipelines and data cleaning workflows. |
| Quality signals | Returns optional confidence and warnings to help detect ambiguous or incomplete inputs. |
| Developer-friendly output | Emits JSON-ready structures that are easy to store, query, and export. |
| Field Name | Field Description |
|---|---|
| input | The original raw address string provided by the user. |
| city | Extracted city name, normalized for consistent casing and spacing. |
| state | Extracted state/region token(s), preserving relevant qualifiers if present. |
| zipcode | Extracted ZIP code (5-digit) or ZIP+4 when available. |
| normalized | A cleaned version of the input (trimmed, standardized separators/spacing). |
| confidence | A 0–1 score indicating extraction certainty based on pattern matches and NLP signals. |
| warnings | Array of notes for ambiguous inputs (missing ZIP, multiple candidate cities, etc.). |
[
{
"input": "Elgin, IL, US, 60120",
"city": "Elgin",
"state": "IL, US",
"zipcode": "60120",
"normalized": "Elgin, IL, US, 60120",
"confidence": 0.93,
"warnings": []
}
]
US address cleaner/
├── src/
│ ├── __init__.py
│ ├── cli.py
│ ├── runner.py
│ ├── parsers/
│ │ ├── __init__.py
│ │ ├── address_parser.py
│ │ ├── zipcode.py
│ │ └── state_city_rules.py
│ ├── nlp/
│ │ ├── __init__.py
│ │ ├── tokenizer.py
│ │ └── features.py
│ ├── normalization/
│ │ ├── __init__.py
│ │ ├── cleaners.py
│ │ └── abbreviations.py
│ ├── schemas/
│ │ ├── __init__.py
│ │ └── output_schema.py
│ └── utils/
│ ├── __init__.py
│ └── logging.py
├── tests/
│ ├── test_address_parser.py
│ ├── test_zipcode.py
│ └── test_normalization.py
├── data/
│ ├── inputs.sample.txt
│ └── sample.output.json
├── scripts/
│ ├── run_batch.sh
│ └── benchmark.py
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
- E-commerce ops teams use it to clean checkout/shipping addresses, so they can reduce failed deliveries and support tickets.
- CRM/data teams use it to standardize contact addresses, so they can deduplicate records and improve segmentation accuracy.
- Lead generation agencies use it to extract city/state/ZIP from scraped leads, so they can enrich lists and target campaigns better.
- Analytics engineers use it to normalize address fields in pipelines, so they can build reliable geo-based dashboards and reports.
- Logistics platforms use it to pre-structure addresses before validation, so they can speed up downstream address verification.
1) What types of inputs does it handle best? It’s optimized for US-style address snippets where city/state/ZIP appear in the string, even if the string includes extra tokens (country hints, stray commas, inconsistent spacing). It performs best when at least two of the three target fields (city/state/ZIP) are present.
2) Does it validate that a ZIP code matches the city/state? By default, it focuses on extraction and normalization rather than authoritative validation. If you need ZIP-to-city/state validation, you can add a post-step using a ZIP reference dataset or an external validation service.
3) How does it behave when the input is ambiguous or incomplete?
When multiple candidates are detected (or a key field is missing), the output includes a lower confidence score and a warnings list describing what was uncertain (e.g., “missing_zipcode”, “multiple_city_candidates”).
4) Can I run it on a large list of addresses? Yes—batch processing is supported via the runner/CLI flow. For large datasets, run in chunks and store outputs incrementally to keep memory usage stable.
Primary Metric: ~6,000–10,000 addresses/minute on a modern 8-core CPU for typical “city, ST ZIP” style inputs.
Reliability Metric: ~98–99% successful structured output generation (always returns a JSON object; ambiguous cases are flagged via warnings).
Efficiency Metric: ~60–140 MB RAM for steady-state batch runs (10k–50k inputs) depending on tokenizer/NLP feature configuration.
Quality Metric: ~92–96% precision on city/state/ZIP extraction for semi-structured real-world lists; precision increases when ZIP is present and decreases on city-only inputs with heavy noise.
