Zero-Configuration Data Quality Framework Powered by Polars
Sniffs out bad data
Beta Release: Core features are stable, but APIs may still change in minor versions.
Truthound is a data quality validation framework built on Polars, a Rust-based DataFrame library. The framework provides zero-configuration validation through automatic schema inference and supports a wide range of validation scenarios from basic schema checks to statistical drift detection.
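
To make the idea of schema inference followed by validation concrete, here is a minimal plain-Python sketch. It is not Truthound's implementation: `infer_schema` and `check_rows` are hypothetical names, and a real engine would work on Polars columns rather than lists of dicts.

```python
from typing import Any

def infer_schema(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Infer the observed type names and nullability of each column."""
    schema: dict[str, dict[str, Any]] = {}
    for row in rows:
        for column, value in row.items():
            entry = schema.setdefault(column, {"types": set(), "nullable": False})
            if value is None:
                entry["nullable"] = True
            else:
                entry["types"].add(type(value).__name__)
    return schema

def check_rows(rows: list[dict[str, Any]], schema: dict[str, dict[str, Any]]) -> list[str]:
    """Return one message per value that deviates from the learned schema."""
    issues = []
    for index, row in enumerate(rows):
        for column, rule in schema.items():
            value = row.get(column)
            if value is None:
                if not rule["nullable"]:
                    issues.append(f"row {index}: '{column}' is null or missing")
            elif type(value).__name__ not in rule["types"]:
                issues.append(f"row {index}: '{column}' has type {type(value).__name__}")
    return issues

# Learn from a baseline, then check new data against it.
baseline = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
schema = infer_schema(baseline)
new_rows = [{"id": 3, "email": None}, {"id": "4", "email": "c@example.com"}]
issues = check_rows(new_rows, schema)
print(issues)
```

The baseline never contained a null email or a string id, so both new rows are flagged without any hand-written rules.
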
Documentation: Document Site
Related Projects
| Project | Description | Status |
|---|---|---|
| truthound-orchestration | Workflow integration for Airflow, Dagster, Prefect, and dbt | α test |
| truthound-dashboard | Web-based data quality monitoring dashboard | α test |

| Metric | Value |
|---|---|
| Test Cases | 8,585+ |
| Validators | 264 |
| Validator Categories | 28 |
| VE Test Cases | 316 (Validation Engine Enhancement) |

pip install truthound
# With optional features
pip install truthound[all]

import truthound as th
# Basic validation
report = th.check("data.csv")
# Schema-based validation
schema = th.learn("baseline.csv")
report = th.check("new_data.csv", schema=schema)
# Drift detection
drift = th.compare("train.csv", "production.csv")
# PII scanning and masking
pii_report = th.scan(df)
masked_df = th.mask(df, strategy="hash")
# Statistical profiling
profile = th.profile("data.csv")
# Validation Engine Enhancement features
report = th.check(
    "data.csv",
    result_format="complete",  # 4-level detail control
    catch_exceptions=True,     # Exception isolation
    max_retries=2,             # Auto-retry for transient errors
    parallel=True,             # DAG-based parallel execution
)

truthound check data.csv # Validate
truthound check data.csv --rf complete # With full result detail
truthound check data.csv --catch-exceptions --max-retries 2 # Resilient mode
truthound compare baseline.csv current.csv # Drift detection
truthound scan data.csv # PII scanning
truthound auto-profile data.csv # Profiling
truthound new validator my_validator # Code scaffolding

| Command | Description | Key Options |
|---|---|---|
| learn | Learn schema from data | --output, --no-constraints |
| check | Validate data quality | --validators, --min-severity, --schema, --strict, --format, --rf, --catch-exceptions, --max-retries |
| scan | Scan for PII | --format, --output |
| mask | Mask sensitive data | --columns, --strategy (redact/hash/fake), --strict |
| profile | Generate data profile | --format, --output |
| compare | Detect data drift | --method (auto/ks/psi/chi2/js), --threshold, --strict |
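
The hash strategy listed for mask can be pictured as replacing each sensitive value with a one-way digest, so equal inputs stay joinable while the original text is not exposed. The sketch below is an assumption-laden illustration, not Truthound's documented behavior: the function name `mask_hash`, the use of salted SHA-256, and the 16-character truncation are all hypothetical choices.

```python
import hashlib
from typing import Optional

def mask_hash(value: Optional[str], salt: str = "") -> Optional[str]:
    """Replace a value with a truncated salted SHA-256 digest; keep nulls as nulls."""
    if value is None:
        return None
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]  # truncation is for readability only (assumption)

emails = ["alice@example.com", None, "alice@example.com", "bob@example.com"]
masked = [mask_hash(email, salt="demo-salt") for email in emails]
print(masked)
```

Identical inputs map to identical tokens, which preserves uniqueness and join checks on the masked column.
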

| Command | Description | Key Options |
|---|---|---|
| auto-profile | Profile with auto-detection | --patterns, --correlations, --sample, --top-n |
| generate-suite | Generate validation rules from profile | --strictness, --preset, --code-style |
| quick-suite | Profile and generate rules in one step | --strictness, --sample-size |
| list-formats | List supported output formats | - |
| list-presets | List available presets | - |
| list-categories | List rule categories | - |

| Command | Description | Key Options |
|---|---|---|
| checkpoint run | Run validation pipeline | --config, --data, --strict, --slack, --webhook |
| checkpoint list | List available checkpoints | --config, --format |
| checkpoint validate | Validate configuration | --strict |
| checkpoint init | Initialize sample config | --output, --format |

| Command | Description | Key Options |
|---|---|---|
| ml anomaly | Detect anomalies | --method (zscore/iqr/mad/isolation_forest), --contamination |
| ml drift | Detect data drift | --method (distribution/feature/multivariate), --threshold |
| ml learn-rules | Learn validation rules | --strictness, --min-confidence, --max-rules |
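
Of the anomaly methods listed for ml anomaly, mad (median absolute deviation) is the easiest to show in a few lines. The sketch below uses the common modified z-score, 0.6745 * (x - median) / MAD, with the conventional 3.5 cutoff. It illustrates the general technique only: `mad_outliers` is a hypothetical name, and Truthound's actual thresholds and edge-case handling may differ.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to measure against; a production detector needs a fallback here.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - center) / mad) > threshold
    ]

readings = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = mad_outliers(readings)
print(outliers)
```

Because the median and MAD are barely moved by the extreme reading, the single outlier does not mask itself the way it can with a mean-and-standard-deviation z-score.
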

| Command | Description | Key Options |
|---|---|---|
| docs generate | Generate HTML/PDF report | --theme, --format (html/pdf), --title |
| docs themes | List available themes | - |

| Command | Description | Key Options |
|---|---|---|
| lineage show | Display lineage information | --node, --direction (upstream/downstream/both) |
| lineage impact | Analyze change impact | --max-depth, --output |
| lineage visualize | Generate lineage visualization | --renderer (d3/cytoscape/graphviz/mermaid), --theme |
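
Impact analysis with a depth limit amounts to a bounded breadth-first walk over the lineage graph. The sketch below shows that idea on a hand-written adjacency map. The function name `downstream_impact`, the dataset names, and the dict-based graph are illustrative assumptions and say nothing about how Truthound stores lineage.

```python
from collections import deque

def downstream_impact(edges, start, max_depth):
    """Return {node: depth} for nodes reachable downstream of start within max_depth hops."""
    reached = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in edges.get(node, []):
            if child not in reached and child != start:
                reached[child] = depth + 1
                queue.append((child, depth + 1))
    return reached

# Hypothetical lineage: each key feeds the datasets in its list.
edges = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["exec_dashboard"],
}
impact = downstream_impact(edges, "raw_orders", max_depth=2)
print(impact)
```

With a depth limit of 2, the dashboard three hops away is left out of the result.
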

| Command | Description | Key Options |
|---|---|---|
| realtime validate | Validate streaming data | --batch-size, --max-batches |
| realtime monitor | Monitor validation metrics | --interval, --duration |
| realtime checkpoint list | List checkpoints | --dir |
| realtime checkpoint show | Show checkpoint details | --dir |
| realtime checkpoint delete | Delete checkpoint | --dir, --force |

| Command | Description | Key Options |
|---|---|---|
| benchmark run | Run performance benchmarks | --suite (quick/ci/full), --size, --iterations |
| benchmark list | List available benchmarks | --format |
| benchmark compare | Compare benchmark results | --threshold |

| Command | Description | Key Options |
|---|---|---|
| new validator | Create custom validator | --template (basic/column/pattern/range/comparison/composite/full) |
| new reporter | Create custom reporter | --template (basic/full), --extension |
| new plugin | Create plugin package | --type (validator/reporter/hook/datasource/action/full) |
| new list | List scaffold types | --verbose |
| new templates | List available templates | - |

| Command | Description | Key Options |
|---|---|---|
| plugin list | List discovered plugins | --type, --state, --verbose |
| plugin info | Show plugin details | --json |
| plugin load | Load a plugin | --activate/--no-activate |
| plugin unload | Unload a plugin | - |
| plugin enable | Enable a plugin | - |
| plugin disable | Disable a plugin | - |
| plugin create | Create plugin template | --type, --author |

| Command | Description | Key Options |
|---|---|---|
| dashboard | Launch interactive dashboard | --profile, --port, --host, --debug |

| Guide | Description |
|---|---|
| Categories | 28 validator categories overview |
| Built-in | 264 built-in validators reference |
| Custom Validators | @custom_validator decorator, ValidatorBuilder fluent API |
| Enterprise SDK | Sandbox, signing, licensing, fuzzing |
| Security | ReDoS protection, SQL injection prevention |
| i18n | 7-language error messages |
| Optimization | Expression batch execution, DAG parallel |

| Guide | Description |
|---|---|
| Files | CSV, JSON, Parquet, NDJSON, JSONL |
| Databases | PostgreSQL, MySQL, SQLite, Oracle, SQL Server |
| Cloud Warehouses | BigQuery, Snowflake, Redshift, Databricks |
| Streaming | Kafka, Kinesis, Pub/Sub adapters |
| Custom Sources | IDataSource protocol implementation |

| Guide | Description |
|---|---|
| Basics | Column statistics, distribution analysis |
| Patterns | Email, phone, credit card detection |
| Rule Generation | Auto-generate validation rules |
| Drift Detection | KS, PSI, Chi-square, Wasserstein |
| Quality Scoring | Data quality metrics |
| Sampling | Block, multi-stage, progressive sampling |
| Caching | xxhash fingerprint-based caching |
| ML Inference | ML-based rule generation |
| Threshold Tuning | Automatic threshold optimization |
| Visualization | Profile visualization |
| i18n | Localized profiling output |
| Schema Evolution | Change detection, compatibility analysis |
| Distributed | Spark, Dask, Ray backends |
| Enterprise Sampling | 100M+ row sampling strategies |
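
PSI (Population Stability Index), one of the drift measures listed above, compares the binned proportions of a baseline sample (e) and a current sample (a): PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i). The sketch below is a minimal pure-Python version under stated assumptions: equal-width bins taken from the baseline range, out-of-range values clamped into the edge bins, and a small epsilon to avoid log of zero. The function name `psi` and these binning choices are illustrative and may not match Truthound's implementation.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [float(v) for v in range(100)]
shifted = [v + 30.0 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A widely quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as significant drift, though the appropriate threshold depends on the data and the bin count.
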

| Guide | Description |
|---|---|
| HTML Reports | Report generation pipeline |
| Charts | ApexCharts, Chart.js, Plotly.js, SVG |
| Sections | Report sections configuration |
| Themes | 6 built-in themes, white-labeling |
| Versioning | 4 versioning strategies |
| PDF Export | Chunked rendering, parallel processing |
| Custom Renderers | Jinja2, String, File, Callable templates |
| Dashboard | Interactive dashboard integration |

| Guide | Description |
|---|---|
| Console | Rich terminal output |
| JSON/YAML | Structured output formats |
| HTML/Markdown | Document formats |
| CI Reporters | JUnit, GitHub Actions, GitLab CI |
| Custom SDK | IReporter protocol implementation |

| Guide | Description |
|---|---|
| Filesystem | Local file storage |
| Cloud Storage | S3, GCS, Azure Blob |
| Versioning | Incremental, Semantic, Timestamp, GitLike |
| Retention | Time, Count, Size, Status, Tag policies |
| Tiering | Hot/Warm/Cold/Archive migration |
| Caching | LRU, LFU, TTL backends |
| Replication | Sync/Async/Semi-Sync cross-region |
| Observability | Audit, Metrics, Tracing |

| Guide | Description |
|---|---|
| Basics | Checkpoint configuration |
| Triggers | Event-based triggering |
| Actions | Notifications, Webhook, Storage, Incident |
| Routing | Python + Jinja2 rule engine, 11 built-in rules |
| Deduplication | InMemory/Redis, 4 window strategies |
| Throttling | Token Bucket, Fixed/Sliding Window |
| Escalation | Multi-level policies, state machine |
| Async | Celery, Ray, Kubernetes backends |
| CI Platforms | GitHub Actions, GitLab CI, Jenkins, etc. |
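
Token Bucket, the first throttling strategy named above, allows short bursts up to a fixed capacity and then limits notifications to a steady refill rate. The following is a generic sketch of the algorithm, not Truthound's throttler: the class name `TokenBucket`, its parameters, and the injectable clock are hypothetical.

```python
import time

class TokenBucket:
    """Allow up to `capacity` events in a burst, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Consume one token if available; return False when throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: fake_now[0])
results = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call is throttled
fake_now[0] = 1.0  # one second later, one token has been refilled
results.append(bucket.allow())
print(results)
```

Passing the clock in as a parameter is a design choice for testability; production code would normally rely on the `time.monotonic` default.
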

| Guide | Description |
|---|---|
| Environment Variables | Environment-based configuration |
| Sources | Data source configuration |
| Datasource Config | Connection settings |
| Store Config | Storage backend settings |
| Profiler Config | Profiler settings |
| Checkpoint Config | CI/CD pipeline settings |
| Logging | JSON format, ELK/Loki integration |
| Metrics | Prometheus counters, gauges, histograms |
| Audit | Operation trail, compliance reporting |
| Encryption | AES-256-GCM, Cloud KMS integration |
| Resilience | Circuit breaker, retry, bulkhead |

| Guide | Description |
|---|---|
| ML Anomaly | Isolation Forest, LOF, One-Class SVM |
| Lineage | DAG tracking, OpenLineage integration |
| Plugins | Security sandbox, signing, hot reload |
| Performance | Optimization strategies |
Truthound v1.3.0 introduces a Validation Engine Enhancement, inspired by Great Expectations (GX), whose five phases strengthen the validation pipeline's expressiveness, performance, and fault tolerance.
| Phase | Feature | Description |
|---|---|---|
| VE-1 | Result Format System | 4-level detail control (BOOLEAN_ONLY < BASIC < SUMMARY < COMPLETE) with progressive enrichment |
| VE-2 | Structured Results | ValidationDetail dataclass mirroring GX ExpectationValidationResult.result |
| VE-3 | Metric Deduplication | SharedMetricStore with MetricKey-based caching, CommonMetrics (11 standard metrics), deduplication across validators |
| VE-4 | Dependency DAG | SkipCondition for conditional execution, should_skip() based on prior results, priority-based level grouping |
| VE-5 | Exception Isolation | ExceptionInfo with 4-category classification, 3-tier fallback (batch → per-validator → per-expression), exponential backoff retry |
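
The exception isolation and retry behavior described for VE-5 can be sketched generically as follows. This is a simplified illustration rather than Truthound's engine: it shows only per-validator isolation (one tier of the three-tier fallback) plus exponential backoff, and `TransientError`, `run_with_retry`, and `run_isolated` are hypothetical names.

```python
import time

class TransientError(Exception):
    """Stand-in for an error category worth retrying (hypothetical)."""

def run_with_retry(validator, data, max_retries=2, base_delay=0.01, sleep=time.sleep):
    """Run one validator, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return {"name": validator.__name__, "ok": validator(data), "error": None}
        except TransientError as exc:
            if attempt == max_retries:
                return {"name": validator.__name__, "ok": None, "error": repr(exc)}
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
        except Exception as exc:
            # Non-transient failure: record it and do not retry.
            return {"name": validator.__name__, "ok": None, "error": repr(exc)}

def run_isolated(validators, data, **retry_options):
    """Run every validator; one failure never aborts the others."""
    return [run_with_retry(v, data, **retry_options) for v in validators]

calls = {"flaky": 0}

def not_empty(rows):
    return len(rows) > 0

def flaky(rows):
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("temporary backend hiccup")
    return True

def broken(rows):
    raise ValueError("bug in validator")

delays = []  # capture the backoff delays instead of actually sleeping
results = run_isolated([not_empty, flaky, broken], [1, 2, 3], max_retries=2, sleep=delays.append)
print(results, delays)
```

The flaky validator succeeds on its third attempt after two growing delays, while the broken one is reported as an error entry alongside the successful results.
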

import truthound as th
# Result format control (VE-1)
report = th.check("data.csv", result_format="complete")
# Exception isolation with retry (VE-5)
report = th.check("data.csv", catch_exceptions=True, max_retries=3)
# Access structured results (VE-2)
for issue in report.issues:
    if issue.result:
        print(f"Unexpected: {issue.result.unexpected_count} ({issue.result.unexpected_percent:.1%})")

| Category | Description |
|---|---|
| schema | Column structure, types, relationships |
| completeness | Null detection, required fields |
| uniqueness | Duplicates, primary keys, composite keys |
| distribution | Range, outliers, statistical tests |
| string | Regex, email, URL, JSON validation |
| datetime | Format, range, sequence validation |
| aggregate | Mean, median, sum constraints |
| cross_table | Multi-table relationships |
| multi_column | Column comparisons, conditional logic |
| query | SQL/Polars expression validation |
| table | Row count, freshness, metadata |
| geospatial | Coordinates, bounding boxes |
| drift | KS, PSI, Chi-square, Wasserstein |
| anomaly | IQR, Z-score, Isolation Forest, LOF |
| business_rule | Luhn, IBAN, VAT, ISBN validation |
| localization | Korean, Japanese, Chinese identifiers |
| ml_feature | Leakage detection, correlation |
| profiling | Cardinality, entropy, frequency |
| referential | Foreign keys, orphan records |
| timeseries | Gaps, seasonality, trend detection |
| privacy | PII detection and compliance rules |
| security | SQL injection, ReDoS protection |
| sdk | Custom validator development |
| timeout | Distributed timeout management |
| i18n | Internationalized error messages |
| streaming | Streaming data validation |
| memory | Memory-aware processing |
| optimization | Execution optimization |
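
As an example of what a business_rule check involves, the Luhn checksum used for payment card numbers fits in a few lines. This is the standard public algorithm written as a standalone sketch; the function name `luhn_valid` is hypothetical and does not refer to a Truthound validator.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]  # ignore spaces and dashes
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("7992 7398 713"), luhn_valid("79927398710"))
```

The first number is the usual textbook example of a valid Luhn sequence; changing its final digit breaks the checksum.
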

| Category | Sources |
|---|---|
| DataFrame | Polars, Pandas, PySpark |
| Core SQL | PostgreSQL, MySQL, SQLite |
| Cloud DW | BigQuery, Snowflake, Redshift, Databricks |
| Enterprise | Oracle, SQL Server |
| File | CSV, Parquet, JSON, NDJSON |
| Streaming | Kafka, Kinesis, Pub/Sub |

# Core installation
pip install truthound
# Feature-specific extras
pip install truthound[drift] # Drift detection (scipy)
pip install truthound[anomaly] # Anomaly detection (scikit-learn)
pip install truthound[pdf] # PDF export (weasyprint)
# Data source extras
pip install truthound[bigquery] # Google BigQuery
pip install truthound[snowflake] # Snowflake
pip install truthound[redshift] # Amazon Redshift
pip install truthound[databricks] # Databricks
pip install truthound[oracle] # Oracle Database
pip install truthound[sqlserver] # SQL Server
# Security extras
pip install truthound[encryption] # Encryption (cryptography)
# Full installation
pip install truthound[all]

- Python 3.11+
- Polars 1.x
- PyYAML
- Rich
- Typer

git clone https://github.com/seadonggyun4/Truthound.git
cd Truthound
pip install hatch
hatch env create
hatch run test

- Polars Documentation. https://pola.rs/
- Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione"
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
- Breunig, M. M., et al. (2000). "LOF: Identifying Density-Based Local Outliers"
Apache License 2.0
Built with Polars, Rich, Typer, scikit-learn, and SciPy.
