Truthound

Zero-Configuration Data Quality Framework Powered by Polars

Sniffs out bad data

Beta Release: Core features are stable, but APIs may still change in minor versions.

Abstract

Truthound is a data quality validation framework built on Polars, a Rust-based DataFrame library. The framework provides zero-configuration validation through automatic schema inference and supports a wide range of validation scenarios from basic schema checks to statistical drift detection.
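
To make "automatic schema inference" concrete, the following is a minimal sketch of the general idea using only the Python standard library: read a sample, guess the narrowest type per column, and note whether empty values occur. The helper names `infer_type` and `infer_schema` are hypothetical and are not Truthound's API; the framework itself relies on Polars for this step.

```python
import csv
import io


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "null"
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(csv_text):
    """Return {column: {"type": ..., "nullable": ...}} for a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    schema = {}
    for col in columns:
        values = [row[col] for row in rows]
        schema[col] = {
            "type": infer_type(values),
            "nullable": any(v == "" for v in values),
        }
    return schema


sample = "id,price,city\n1,9.5,Seoul\n2,,Busan\n3,12,Daegu\n"
schema = infer_schema(sample)
print(schema)
```

A learned schema of this kind can then serve as the baseline that later data is checked against, which is the role `th.learn` and `th.check(..., schema=schema)` play in the Quick Start.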

Documentation: Document Site

Related Projects

Project	Description	Status
truthound-orchestration	Workflow integration for Airflow, Dagster, Prefect, and dbt	Alpha
truthound-dashboard	Web-based data quality monitoring dashboard	Alpha

Metrics

Metric	Value
Test Cases	8,585+
Validators	264
Validator Categories	28
VE Test Cases	316 (Validation Engine Enhancement)

Quick Start

Installation

pip install truthound

# With optional features
pip install truthound[all]

Python API

import polars as pl
import truthound as th

# Basic validation
report = th.check("data.csv")

# Schema-based validation
schema = th.learn("baseline.csv")
report = th.check("new_data.csv", schema=schema)

# Drift detection
drift = th.compare("train.csv", "production.csv")

# PII scanning and masking (shown here on a Polars DataFrame)
df = pl.read_csv("data.csv")
pii_report = th.scan(df)
masked_df = th.mask(df, strategy="hash")

# Statistical profiling
profile = th.profile("data.csv")

# Validation Engine Enhancement features
report = th.check(
    "data.csv",
    result_format="complete",  # 4-level detail control
    catch_exceptions=True,     # Exception isolation
    max_retries=2,             # Auto-retry for transient errors
    parallel=True,             # DAG-based parallel execution
)
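
As an illustration of what a hash-based masking strategy generally does, here is a small standard-library sketch. It assumes `strategy="hash"` means a deterministic one-way digest, so equal inputs map to equal tokens and joins still work after masking. The function `mask_hash`, the salt, and the 16-character truncation are all hypothetical choices, not Truthound's documented behavior.

```python
import hashlib


def mask_hash(value, salt=""):
    """Replace a value with a truncated SHA-256 digest; keep None as None."""
    if value is None:
        return None
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


emails = ["a@example.com", "b@example.com", "a@example.com", None]
masked = [mask_hash(e, salt="demo") for e in emails]
print(masked)
```

Because the digest is deterministic, the first and third entries receive the same token, while missing values stay missing.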

CLI

truthound check data.csv                    # Validate
truthound check data.csv --rf complete      # With full result detail
truthound check data.csv --catch-exceptions --max-retries 2  # Resilient mode
truthound compare baseline.csv current.csv  # Drift detection
truthound scan data.csv                     # PII scanning
truthound auto-profile data.csv             # Profiling
truthound new validator my_validator        # Code scaffolding

CLI Reference

Core Commands

Command	Description	Key Options
`learn`	Learn schema from data	`--output`, `--no-constraints`
`check`	Validate data quality	`--validators`, `--min-severity`, `--schema`, `--strict`, `--format`, `--rf`, `--catch-exceptions`, `--max-retries`
`scan`	Scan for PII	`--format`, `--output`
`mask`	Mask sensitive data	`--columns`, `--strategy` (redact/hash/fake), `--strict`
`profile`	Generate data profile	`--format`, `--output`
`compare`	Detect data drift	`--method` (auto/ks/psi/chi2/js), `--threshold`, `--strict`
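
Of the drift methods listed for `compare`, PSI (Population Stability Index) is the simplest to show by hand. The sketch below is a plain-Python illustration of the standard formula, not Truthound's implementation; the function name `psi`, the equal-width binning on the baseline range, and the `eps` floor for empty bins are assumptions made for the example.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [i / 100 for i in range(100)]
same = psi(baseline, baseline)
shifted = psi(baseline, [x + 0.5 for x in baseline])
print(same, shifted)
```

Identical samples give a PSI of zero, and the score grows as the two distributions diverge. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, which is the kind of cutoff an option like `--threshold` would control.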

Profiler Commands

Command	Description	Key Options
`auto-profile`	Profile with auto-detection	`--patterns`, `--correlations`, `--sample`, `--top-n`
`generate-suite`	Generate validation rules from profile	`--strictness`, `--preset`, `--code-style`
`quick-suite`	Profile and generate rules in one step	`--strictness`, `--sample-size`
`list-formats`	List supported output formats	-
`list-presets`	List available presets	-
`list-categories`	List rule categories	-

Checkpoint Commands (CI/CD)

Command	Description	Key Options
`checkpoint run`	Run validation pipeline	`--config`, `--data`, `--strict`, `--slack`, `--webhook`
`checkpoint list`	List available checkpoints	`--config`, `--format`
`checkpoint validate`	Validate configuration	`--strict`
`checkpoint init`	Initialize sample config	`--output`, `--format`

ML Commands

Command	Description	Key Options
`ml anomaly`	Detect anomalies	`--method` (zscore/iqr/mad/isolation_forest), `--contamination`
`ml drift`	Detect data drift	`--method` (distribution/feature/multivariate), `--threshold`
`ml learn-rules`	Learn validation rules	`--strictness`, `--min-confidence`, `--max-rules`
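
Two of the `ml anomaly` methods, `iqr` and `mad`, are simple enough to sketch with the standard library. The code below illustrates the textbook definitions only; the function names, the 1.5 IQR multiplier, and the 3.5 modified z-score cutoff are conventional defaults chosen for the example and may differ from what Truthound uses.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


readings = [10, 11, 9, 10, 12, 11, 10, 95]
print(iqr_outliers(readings))
print(mad_outliers(readings))
```

Both methods rest on medians and quartiles rather than the mean, so a single extreme reading does not distort the bounds used to detect it.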

Docs Commands

Command	Description	Key Options
`docs generate`	Generate HTML/PDF report	`--theme`, `--format` (html/pdf), `--title`
`docs themes`	List available themes	-

Lineage Commands

Command	Description	Key Options
`lineage show`	Display lineage information	`--node`, `--direction` (upstream/downstream/both)
`lineage impact`	Analyze change impact	`--max-depth`, `--output`
`lineage visualize`	Generate lineage visualization	`--renderer` (d3/cytoscape/graphviz/mermaid), `--theme`

Realtime Commands (Streaming)

Command	Description	Key Options
`realtime validate`	Validate streaming data	`--batch-size`, `--max-batches`
`realtime monitor`	Monitor validation metrics	`--interval`, `--duration`
`realtime checkpoint list`	List checkpoints	`--dir`
`realtime checkpoint show`	Show checkpoint details	`--dir`
`realtime checkpoint delete`	Delete checkpoint	`--dir`, `--force`

Benchmark Commands

Command	Description	Key Options
`benchmark run`	Run performance benchmarks	`--suite` (quick/ci/full), `--size`, `--iterations`
`benchmark list`	List available benchmarks	`--format`
`benchmark compare`	Compare benchmark results	`--threshold`

Scaffolding Commands

Command	Description	Key Options
`new validator`	Create custom validator	`--template` (basic/column/pattern/range/comparison/composite/full)
`new reporter`	Create custom reporter	`--template` (basic/full), `--extension`
`new plugin`	Create plugin package	`--type` (validator/reporter/hook/datasource/action/full)
`new list`	List scaffold types	`--verbose`
`new templates`	List available templates	-

Plugin Commands

Command	Description	Key Options
`plugin list`	List discovered plugins	`--type`, `--state`, `--verbose`
`plugin info`	Show plugin details	`--json`
`plugin load`	Load a plugin	`--activate/--no-activate`
`plugin unload`	Unload a plugin	-
`plugin enable`	Enable a plugin	-
`plugin disable`	Disable a plugin	-
`plugin create`	Create plugin template	`--type`, `--author`

Dashboard Command

Command	Description	Key Options
`dashboard`	Launch interactive dashboard	`--profile`, `--port`, `--host`, `--debug`

Python API Guides

Validators

Guide	Description
Categories	28 validator categories overview
Built-in	264 built-in validators reference
Custom Validators	`@custom_validator` decorator, `ValidatorBuilder` fluent API
Enterprise SDK	Sandbox, signing, licensing, fuzzing
Security	ReDoS protection, SQL injection prevention
i18n	7-language error messages
Optimization	Expression batch execution, DAG parallel

Data Sources

Guide	Description
Files	CSV, JSON, Parquet, NDJSON, JSONL
Databases	PostgreSQL, MySQL, SQLite, Oracle, SQL Server
Cloud Warehouses	BigQuery, Snowflake, Redshift, Databricks
Streaming	Kafka, Kinesis, Pub/Sub adapters
Custom Sources	IDataSource protocol implementation

Profiler

Guide	Description
Basics	Column statistics, distribution analysis
Patterns	Email, phone, credit card detection
Rule Generation	Auto-generate validation rules
Drift Detection	KS, PSI, Chi-square, Wasserstein
Quality Scoring	Data quality metrics
Sampling	Block, multi-stage, progressive sampling
Caching	xxhash fingerprint-based caching
ML Inference	ML-based rule generation
Threshold Tuning	Automatic threshold optimization
Visualization	Profile visualization
i18n	Localized profiling output
Schema Evolution	Change detection, compatibility analysis
Distributed	Spark, Dask, Ray backends
Enterprise Sampling	100M+ row sampling strategies

Data Docs

Guide	Description
HTML Reports	Report generation pipeline
Charts	ApexCharts, Chart.js, Plotly.js, SVG
Sections	Report sections configuration
Themes	6 built-in themes, white-labeling
Versioning	4 versioning strategies
PDF Export	Chunked rendering, parallel processing
Custom Renderers	Jinja2, String, File, Callable templates
Dashboard	Interactive dashboard integration

Reporters

Guide	Description
Console	Rich terminal output
JSON/YAML	Structured output formats
HTML/Markdown	Document formats
CI Reporters	JUnit, GitHub Actions, GitLab CI
Custom SDK	IReporter protocol implementation

Storage

Guide	Description
Filesystem	Local file storage
Cloud Storage	S3, GCS, Azure Blob
Versioning	Incremental, Semantic, Timestamp, GitLike
Retention	Time, Count, Size, Status, Tag policies
Tiering	Hot/Warm/Cold/Archive migration
Caching	LRU, LFU, TTL backends
Replication	Sync/Async/Semi-Sync cross-region
Observability	Audit, Metrics, Tracing

Checkpoint & CI/CD

Guide	Description
Basics	Checkpoint configuration
Triggers	Event-based triggering
Actions	Notifications, Webhook, Storage, Incident
Routing	Python + Jinja2 rule engine, 11 built-in rules
Deduplication	InMemory/Redis, 4 window strategies
Throttling	Token Bucket, Fixed/Sliding Window
Escalation	Multi-level policies, state machine
Async	Celery, Ray, Kubernetes backends
CI Platforms	GitHub Actions, GitLab CI, Jenkins, etc.
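
The Throttling row mentions a token bucket. As a rough sketch of that technique in general (not Truthound's class or configuration), the example below allows a short burst of notifications and then refills at a fixed rate. The `TokenBucket` name, its parameters, and the injectable clock are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `capacity` events in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
fake_now[0] += 1.0
after_refill = bucket.allow()
print(burst, after_refill)
```

The third call in the burst is rejected because the bucket is empty, and one second later a single token has been refilled, so the next call passes.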

Configuration

Guide	Description
Environment Variables	Environment-based configuration
Sources	Data source configuration
Datasource Config	Connection settings
Store Config	Storage backend settings
Profiler Config	Profiler settings
Checkpoint Config	CI/CD pipeline settings
Logging	JSON format, ELK/Loki integration
Metrics	Prometheus counters, gauges, histograms
Audit	Operation trail, compliance reporting
Encryption	AES-256-GCM, Cloud KMS integration
Resilience	Circuit breaker, retry, bulkhead

Advanced

Guide	Description
ML Anomaly	Isolation Forest, LOF, One-Class SVM
Lineage	DAG tracking, OpenLineage integration
Plugins	Security sandbox, signing, hot reload
Performance	Optimization strategies

Validation Engine Enhancement (VE)

Truthound v1.3.0 introduces the Validation Engine Enhancement (VE), inspired by Great Expectations (GX). It comprises five phases that strengthen the validation pipeline's expressiveness, performance, and fault tolerance.

Phase	Feature	Description
VE-1	Result Format System	4-level detail control (`BOOLEAN_ONLY` < `BASIC` < `SUMMARY` < `COMPLETE`) with progressive enrichment
VE-2	Structured Results	`ValidationDetail` dataclass mirroring GX `ExpectationValidationResult.result`
VE-3	Metric Deduplication	`SharedMetricStore` with `MetricKey`-based caching, `CommonMetrics` (11 standard metrics), deduplication across validators
VE-4	Dependency DAG	`SkipCondition` for conditional execution, `should_skip()` based on prior results, priority-based level grouping
VE-5	Exception Isolation	`ExceptionInfo` with 4-category classification, 3-tier fallback (batch → per-validator → per-expression), exponential backoff retry
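
The VE-5 row combines two ideas: one failing validator should not abort the others, and transient failures should be retried with exponential backoff. The sketch below shows that pattern in plain Python under stated assumptions: `TransientError`, `run_isolated`, and the delay schedule are hypothetical names and values for illustration, not Truthound's `ExceptionInfo` machinery or its 3-tier fallback.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a dropped connection."""


def run_isolated(validators, max_retries=2, base_delay=0.1, sleep=time.sleep):
    """Run each validator; retry transient errors, record other failures."""
    results = {}
    for name, fn in validators.items():
        for attempt in range(max_retries + 1):
            try:
                results[name] = ("ok", fn())
                break
            except TransientError as exc:
                if attempt == max_retries:
                    results[name] = ("error", repr(exc))
                else:
                    # Delay doubles on each retry: 0.1s, 0.2s, 0.4s, ...
                    sleep(base_delay * 2 ** attempt)
            except Exception as exc:
                # Non-transient failure: record it and move on.
                results[name] = ("error", repr(exc))
                break
    return results


calls = {"flaky": 0}


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("temporary")
    return True


def broken():
    raise ValueError("bad column")


delays = []  # Collect the delays instead of actually sleeping.
results = run_isolated(
    {"flaky": flaky, "broken": broken, "fine": lambda: True},
    sleep=delays.append,
)
print(results, delays)
```

Here the flaky validator succeeds on its third attempt, the broken one is recorded as an error, and the remaining validator still runs.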

Key API Additions

import truthound as th

# Result format control (VE-1)
report = th.check("data.csv", result_format="complete")

# Exception isolation with retry (VE-5)
report = th.check("data.csv", catch_exceptions=True, max_retries=3)

# Access structured results (VE-2)
for issue in report.issues:
    if issue.result:
        print(f"Unexpected: {issue.result.unexpected_count} ({issue.result.unexpected_percent:.1%})")
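
The ordering `BOOLEAN_ONLY` < `BASIC` < `SUMMARY` < `COMPLETE` suggests that each level adds fields on top of the previous one. A minimal sketch of that progressive enrichment follows. The `ResultFormat` enum and `build_result` function are hypothetical; `unexpected_count` and `unexpected_percent` come from the example above, while `partial_unexpected_list` and `unexpected_list` are borrowed from GX naming as an assumption about what the higher levels might carry.

```python
from enum import IntEnum


class ResultFormat(IntEnum):
    BOOLEAN_ONLY = 0
    BASIC = 1
    SUMMARY = 2
    COMPLETE = 3


def build_result(values, is_valid, level):
    """Add more detail to the result as the requested level increases."""
    bad = [v for v in values if not is_valid(v)]
    result = {"success": not bad}
    if level >= ResultFormat.BASIC:
        result["unexpected_count"] = len(bad)
        result["unexpected_percent"] = len(bad) / len(values) if values else 0.0
    if level >= ResultFormat.SUMMARY:
        result["partial_unexpected_list"] = bad[:3]
    if level >= ResultFormat.COMPLETE:
        result["unexpected_list"] = bad
    return result


def in_range(age):
    return 0 <= age <= 120


ages = [34, -1, 27, 250, 41]
basic = build_result(ages, in_range, ResultFormat.BASIC)
complete = build_result(ages, in_range, ResultFormat.COMPLETE)
print(basic)
print(complete)
```

Lower levels skip the work of collecting and returning offending values, which is the usual reason to offer such a setting on large datasets.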

Validator Categories

Category	Description
schema	Column structure, types, relationships
completeness	Null detection, required fields
uniqueness	Duplicates, primary keys, composite keys
distribution	Range, outliers, statistical tests
string	Regex, email, URL, JSON validation
datetime	Format, range, sequence validation
aggregate	Mean, median, sum constraints
cross_table	Multi-table relationships
multi_column	Column comparisons, conditional logic
query	SQL/Polars expression validation
table	Row count, freshness, metadata
geospatial	Coordinates, bounding boxes
drift	KS, PSI, Chi-square, Wasserstein
anomaly	IQR, Z-score, Isolation Forest, LOF
business_rule	Luhn, IBAN, VAT, ISBN validation
localization	Korean, Japanese, Chinese identifiers
ml_feature	Leakage detection, correlation
profiling	Cardinality, entropy, frequency
referential	Foreign keys, orphan records
timeseries	Gaps, seasonality, trend detection
privacy	PII detection and compliance rules
security	SQL injection, ReDoS protection
sdk	Custom validator development
timeout	Distributed timeout management
i18n	Internationalized error messages
streaming	Streaming data validation
memory	Memory-aware processing
optimization	Execution optimization
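
As a concrete instance of the `business_rule` category, the Luhn checksum used for card-number validation can be written in a few lines. This is the standard algorithm shown for illustration; the function name `luhn_valid` is hypothetical and is not one of Truthound's validators.

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))
print(luhn_valid("79927398710"))
```

The check catches any single mistyped digit, which makes it a cheap first filter before more expensive lookups.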

Data Sources

Category	Sources
DataFrame	Polars, Pandas, PySpark
Core SQL	PostgreSQL, MySQL, SQLite
Cloud DW	BigQuery, Snowflake, Redshift, Databricks
Enterprise	Oracle, SQL Server
File	CSV, Parquet, JSON, NDJSON
Streaming	Kafka, Kinesis, Pub/Sub

Installation Options

# Core installation
pip install truthound

# Feature-specific extras
pip install truthound[drift]      # Drift detection (scipy)
pip install truthound[anomaly]    # Anomaly detection (scikit-learn)
pip install truthound[pdf]        # PDF export (weasyprint)

# Data source extras
pip install truthound[bigquery]   # Google BigQuery
pip install truthound[snowflake]  # Snowflake
pip install truthound[redshift]   # Amazon Redshift
pip install truthound[databricks] # Databricks
pip install truthound[oracle]     # Oracle Database
pip install truthound[sqlserver]  # SQL Server

# Security extras
pip install truthound[encryption] # Encryption (cryptography)

# Full installation
pip install truthound[all]

Requirements

Python 3.11+
Polars 1.x
PyYAML
Rich
Typer

Development

git clone https://github.com/seadonggyun4/Truthound.git
cd Truthound
pip install hatch
hatch env create
hatch run test

References

Polars Documentation. https://pola.rs/
Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione"
Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
Breunig, M. M., et al. (2000). "LOF: Identifying Density-Based Local Outliers"

License

Apache License 2.0

Acknowledgments

Built with Polars, Rich, Typer, scikit-learn, and SciPy.
