seadonggyun4/truthound

Truthound

Zero-Configuration Data Quality Framework Powered by Polars

Sniffs out bad data

Beta Release: Core features are stable, but APIs may still change in minor versions.


Abstract

Truthound is a data quality validation framework built on Polars, a Rust-based DataFrame library. The framework provides zero-configuration validation through automatic schema inference and supports a wide range of validation scenarios from basic schema checks to statistical drift detection.
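
As an illustration of what "validation through automatic schema inference" can mean, here is a minimal sketch in plain Python. It is not Truthound's implementation: ColumnSchema, learn_schema, and check_rows are hypothetical names, and the learned constraints (type, nullability, numeric range) are assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnSchema:
    """Constraints learned for one column (hypothetical structure)."""
    dtype: type
    nullable: bool
    minimum: float | None = None
    maximum: float | None = None


def learn_schema(rows: list[dict]) -> dict[str, ColumnSchema]:
    """Infer type, nullability, and numeric range per column from baseline rows."""
    schema = {}
    columns = {name for row in rows for name in row}
    for name in sorted(columns):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        dtype = type(present[0]) if present else type(None)
        numeric = dtype in (int, float)
        schema[name] = ColumnSchema(
            dtype=dtype,
            nullable=len(present) < len(values),
            minimum=min(present) if numeric else None,
            maximum=max(present) if numeric else None,
        )
    return schema


def check_rows(rows: list[dict], schema: dict[str, ColumnSchema]) -> list[str]:
    """Return one message per value that violates the learned schema."""
    issues = []
    for index, row in enumerate(rows):
        for name, col in schema.items():
            value = row.get(name)
            if value is None:
                if not col.nullable:
                    issues.append(f"row {index}: {name} is null")
            elif not isinstance(value, col.dtype):
                issues.append(f"row {index}: {name} has type {type(value).__name__}")
            elif col.minimum is not None and not (col.minimum <= value <= col.maximum):
                issues.append(f"row {index}: {name}={value} outside learned range")
    return issues


baseline = [{"name": "a", "age": 30}, {"name": "b", "age": 45}]
schema = learn_schema(baseline)
print(check_rows([{"name": "c", "age": 200}, {"name": None, "age": 40}], schema))
```

The point of the sketch is the workflow: no rules are written by hand, and the baseline data alone determines what counts as a violation.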

Documentation: Document Site

Related Projects

| Project | Description | Status |
| --- | --- | --- |
| truthound-orchestration | Workflow integration for Airflow, Dagster, Prefect, and dbt | α test |
| truthound-dashboard | Web-based data quality monitoring dashboard | α test |

Metrics

| Metric | Value |
| --- | --- |
| Test Cases | 8,585+ |
| Validators | 264 |
| Validator Categories | 28 |
| VE Test Cases | 316 (Validation Engine Enhancement) |

Quick Start

Installation

pip install truthound

# With optional features
pip install truthound[all]

Python API

import polars as pl
import truthound as th

# Basic validation
report = th.check("data.csv")

# Schema-based validation
schema = th.learn("baseline.csv")
report = th.check("new_data.csv", schema=schema)

# Drift detection
drift = th.compare("train.csv", "production.csv")

# PII scanning and masking (df is a DataFrame, loaded here with Polars)
df = pl.read_csv("data.csv")
pii_report = th.scan(df)
masked_df = th.mask(df, strategy="hash")

# Statistical profiling
profile = th.profile("data.csv")

# Validation Engine Enhancement features
report = th.check("data.csv",
    result_format="complete",       # 4-level detail control
    catch_exceptions=True,          # Exception isolation
    max_retries=2,                  # Auto-retry for transient errors
    parallel=True,                  # DAG-based parallel execution
)
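
To give a feel for what a "hash" masking strategy does to PII values, here is a small standard-library sketch. It is an assumption about the general technique, not Truthound's implementation: mask_value and mask_rows are hypothetical helpers, and the choice of SHA-256, the optional salt, and the 16-character truncation are illustrative.

```python
import hashlib


def mask_value(value: str, salt: str = "") -> str:
    """Replace a value with a truncated SHA-256 digest (illustrative choice)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_rows(rows: list, columns: set, salt: str = "") -> list:
    """Hash the selected columns; leave other columns and nulls untouched."""
    return [
        {
            key: mask_value(str(val), salt) if key in columns and val is not None else val
            for key, val in row.items()
        }
        for row in rows
    ]


rows = [{"email": "a@example.com", "n": 1}, {"email": "a@example.com", "n": 2}]
masked = mask_rows(rows, {"email"})
print(masked[0]["email"] == masked[1]["email"])  # equal inputs map to equal tokens
```

Hashing keeps equal values equal, so joins and duplicate checks still work on masked data, which is the usual reason to prefer it over plain redaction.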

CLI

truthound check data.csv                    # Validate
truthound check data.csv --rf complete      # With full result detail
truthound check data.csv --catch-exceptions --max-retries 2  # Resilient mode
truthound compare baseline.csv current.csv  # Drift detection
truthound scan data.csv                     # PII scanning
truthound auto-profile data.csv             # Profiling
truthound new validator my_validator        # Code scaffolding

CLI Reference

Core Commands

| Command | Description | Key Options |
| --- | --- | --- |
| learn | Learn schema from data | --output, --no-constraints |
| check | Validate data quality | --validators, --min-severity, --schema, --strict, --format, --rf, --catch-exceptions, --max-retries |
| scan | Scan for PII | --format, --output |
| mask | Mask sensitive data | --columns, --strategy (redact/hash/fake), --strict |
| profile | Generate data profile | --format, --output |
| compare | Detect data drift | --method (auto/ks/psi/chi2/js), --threshold, --strict |
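
The psi option of compare refers to the Population Stability Index. The sketch below shows the standard formula in plain Python so the meaning of the score and of --threshold is easier to see. It is not Truthound's code: the function name, the equal-width binning on the baseline range, and the eps floor for empty bins are assumptions made for this example.

```python
import math


def psi(baseline: list, current: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two numeric samples.

    Bin edges come from the baseline range; values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: list) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


train = list(range(100))
production = [v + 50 for v in train]
print(psi(train, train), psi(train, production))
```

A commonly cited rule of thumb treats values above roughly 0.25 as significant drift, but the right threshold depends on the data and the binning.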

Profiler Commands

| Command | Description | Key Options |
| --- | --- | --- |
| auto-profile | Profile with auto-detection | --patterns, --correlations, --sample, --top-n |
| generate-suite | Generate validation rules from profile | --strictness, --preset, --code-style |
| quick-suite | Profile and generate rules in one step | --strictness, --sample-size |
| list-formats | List supported output formats | - |
| list-presets | List available presets | - |
| list-categories | List rule categories | - |

Checkpoint Commands (CI/CD)

| Command | Description | Key Options |
| --- | --- | --- |
| checkpoint run | Run validation pipeline | --config, --data, --strict, --slack, --webhook |
| checkpoint list | List available checkpoints | --config, --format |
| checkpoint validate | Validate configuration | --strict |
| checkpoint init | Initialize sample config | --output, --format |

ML Commands

| Command | Description | Key Options |
| --- | --- | --- |
| ml anomaly | Detect anomalies | --method (zscore/iqr/mad/isolation_forest), --contamination |
| ml drift | Detect data drift | --method (distribution/feature/multivariate), --threshold |
| ml learn-rules | Learn validation rules | --strictness, --min-confidence, --max-rules |
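
Two of the ml anomaly methods, iqr and mad, are simple enough to sketch with the standard library. The functions below illustrate the textbook definitions only; the names, the 1.5 IQR multiplier, and the 3.5 modified z-score cutoff are conventional defaults assumed for this example, not values taken from Truthound.

```python
import statistics


def iqr_outliers(values: list, k: float = 1.5) -> list:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def mad_outliers(values: list, threshold: float = 3.5) -> list:
    """Values whose modified z-score (based on the median absolute deviation) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


data = [10, 11, 12, 13, 12, 11, 10, 100]
print(iqr_outliers(data), mad_outliers(data))
```

Both methods rely on medians and quartiles rather than the mean, so a single extreme value does not distort the bounds it is tested against.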

Docs Commands

| Command | Description | Key Options |
| --- | --- | --- |
| docs generate | Generate HTML/PDF report | --theme, --format (html/pdf), --title |
| docs themes | List available themes | - |

Lineage Commands

| Command | Description | Key Options |
| --- | --- | --- |
| lineage show | Display lineage information | --node, --direction (upstream/downstream/both) |
| lineage impact | Analyze change impact | --max-depth, --output |
| lineage visualize | Generate lineage visualization | --renderer (d3/cytoscape/graphviz/mermaid), --theme |

Realtime Commands (Streaming)

| Command | Description | Key Options |
| --- | --- | --- |
| realtime validate | Validate streaming data | --batch-size, --max-batches |
| realtime monitor | Monitor validation metrics | --interval, --duration |
| realtime checkpoint list | List checkpoints | --dir |
| realtime checkpoint show | Show checkpoint details | --dir |
| realtime checkpoint delete | Delete checkpoint | --dir, --force |

Benchmark Commands

| Command | Description | Key Options |
| --- | --- | --- |
| benchmark run | Run performance benchmarks | --suite (quick/ci/full), --size, --iterations |
| benchmark list | List available benchmarks | --format |
| benchmark compare | Compare benchmark results | --threshold |

Scaffolding Commands

| Command | Description | Key Options |
| --- | --- | --- |
| new validator | Create custom validator | --template (basic/column/pattern/range/comparison/composite/full) |
| new reporter | Create custom reporter | --template (basic/full), --extension |
| new plugin | Create plugin package | --type (validator/reporter/hook/datasource/action/full) |
| new list | List scaffold types | --verbose |
| new templates | List available templates | - |

Plugin Commands

| Command | Description | Key Options |
| --- | --- | --- |
| plugin list | List discovered plugins | --type, --state, --verbose |
| plugin info | Show plugin details | --json |
| plugin load | Load a plugin | --activate/--no-activate |
| plugin unload | Unload a plugin | - |
| plugin enable | Enable a plugin | - |
| plugin disable | Disable a plugin | - |
| plugin create | Create plugin template | --type, --author |

Dashboard Command

| Command | Description | Key Options |
| --- | --- | --- |
| dashboard | Launch interactive dashboard | --profile, --port, --host, --debug |

Python API Guides

Validators

| Guide | Description |
| --- | --- |
| Categories | 28 validator categories overview |
| Built-in | 264 built-in validators reference |
| Custom Validators | @custom_validator decorator, ValidatorBuilder fluent API |
| Enterprise SDK | Sandbox, signing, licensing, fuzzing |
| Security | ReDoS protection, SQL injection prevention |
| i18n | 7-language error messages |
| Optimization | Expression batch execution, DAG parallel |

Data Sources

| Guide | Description |
| --- | --- |
| Files | CSV, JSON, Parquet, NDJSON, JSONL |
| Databases | PostgreSQL, MySQL, SQLite, Oracle, SQL Server |
| Cloud Warehouses | BigQuery, Snowflake, Redshift, Databricks |
| Streaming | Kafka, Kinesis, Pub/Sub adapters |
| Custom Sources | IDataSource protocol implementation |

Profiler

| Guide | Description |
| --- | --- |
| Basics | Column statistics, distribution analysis |
| Patterns | Email, phone, credit card detection |
| Rule Generation | Auto-generate validation rules |
| Drift Detection | KS, PSI, Chi-square, Wasserstein |
| Quality Scoring | Data quality metrics |
| Sampling | Block, multi-stage, progressive sampling |
| Caching | xxhash fingerprint-based caching |
| ML Inference | ML-based rule generation |
| Threshold Tuning | Automatic threshold optimization |
| Visualization | Profile visualization |
| i18n | Localized profiling output |
| Schema Evolution | Change detection, compatibility analysis |
| Distributed | Spark, Dask, Ray backends |
| Enterprise Sampling | 100M+ row sampling strategies |
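
The Caching entry above mentions fingerprint-based caching. The idea can be sketched as follows: derive a key from the content of the data, and reuse the stored profile whenever the same content appears again. This is an illustration under stated assumptions, not Truthound's cache: FingerprintCache and fingerprint are hypothetical names, and the standard-library blake2b hash stands in for xxhash so the example has no third-party dependency.

```python
import hashlib
from typing import Callable


def fingerprint(data: bytes) -> str:
    """Content fingerprint; blake2b is a stdlib stand-in for xxhash."""
    return hashlib.blake2b(data, digest_size=16).hexdigest()


class FingerprintCache:
    """Reuse an expensive result when the input content is unchanged."""

    def __init__(self, compute: Callable[[bytes], dict]):
        self._compute = compute
        self._store = {}
        self.hits = 0

    def get(self, data: bytes) -> dict:
        key = fingerprint(data)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._compute(data)
        return self._store[key]


cache = FingerprintCache(lambda data: {"bytes": len(data), "lines": data.count(b"\n")})
cache.get(b"id,age\n1,30\n")
cache.get(b"id,age\n1,30\n")  # same content, served from the cache
print(cache.hits)
```

Keying on content rather than on a file path means a renamed but unchanged file still hits the cache, while an edited file with the same name does not.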

Data Docs

| Guide | Description |
| --- | --- |
| HTML Reports | Report generation pipeline |
| Charts | ApexCharts, Chart.js, Plotly.js, SVG |
| Sections | Report sections configuration |
| Themes | 6 built-in themes, white-labeling |
| Versioning | 4 versioning strategies |
| PDF Export | Chunked rendering, parallel processing |
| Custom Renderers | Jinja2, String, File, Callable templates |
| Dashboard | Interactive dashboard integration |

Reporters

| Guide | Description |
| --- | --- |
| Console | Rich terminal output |
| JSON/YAML | Structured output formats |
| HTML/Markdown | Document formats |
| CI Reporters | JUnit, GitHub Actions, GitLab CI |
| Custom SDK | IReporter protocol implementation |

Storage

| Guide | Description |
| --- | --- |
| Filesystem | Local file storage |
| Cloud Storage | S3, GCS, Azure Blob |
| Versioning | Incremental, Semantic, Timestamp, GitLike |
| Retention | Time, Count, Size, Status, Tag policies |
| Tiering | Hot/Warm/Cold/Archive migration |
| Caching | LRU, LFU, TTL backends |
| Replication | Sync/Async/Semi-Sync cross-region |
| Observability | Audit, Metrics, Tracing |

Checkpoint & CI/CD

| Guide | Description |
| --- | --- |
| Basics | Checkpoint configuration |
| Triggers | Event-based triggering |
| Actions | Notifications, Webhook, Storage, Incident |
| Routing | Python + Jinja2 rule engine, 11 built-in rules |
| Deduplication | InMemory/Redis, 4 window strategies |
| Throttling | Token Bucket, Fixed/Sliding Window |
| Escalation | Multi-level policies, state machine |
| Async | Celery, Ray, Kubernetes backends |
| CI Platforms | GitHub Actions, GitLab CI, Jenkins, etc. |
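
The Throttling guide lists Token Bucket as one strategy for limiting notifications. A minimal sketch of that algorithm follows. The class and its parameters are hypothetical and only illustrate the technique; how Truthound configures or names its throttler is not shown in this README.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the action may proceed now."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_rate=1.0)
print([bucket.allow() for _ in range(3)])  # the third call exceeds the burst size
```

The clock is injectable so the behaviour can be tested without sleeping. Compared with a fixed window, a token bucket tolerates short bursts while still enforcing the long-run rate.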

Configuration

| Guide | Description |
| --- | --- |
| Environment Variables | Environment-based configuration |
| Sources | Data source configuration |
| Datasource Config | Connection settings |
| Store Config | Storage backend settings |
| Profiler Config | Profiler settings |
| Checkpoint Config | CI/CD pipeline settings |
| Logging | JSON format, ELK/Loki integration |
| Metrics | Prometheus counters, gauges, histograms |
| Audit | Operation trail, compliance reporting |
| Encryption | AES-256-GCM, Cloud KMS integration |
| Resilience | Circuit breaker, retry, bulkhead |
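
For readers unfamiliar with the circuit breaker pattern named in the Resilience guide, here is a compact sketch. It shows the general pattern only: CircuitBreaker, CircuitOpenError, and the threshold and timeout parameters are hypothetical and are not Truthound's configuration keys.

```python
import time
from typing import Callable


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Timeout elapsed: half-open, let one trial call through.
            self._opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success closes the circuit fully
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
print(breaker.call(lambda: "connected"))
```

After the threshold is reached, calls fail fast instead of waiting on a data source that is known to be down, and a single trial call after the timeout decides whether to close the circuit again.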

Advanced

| Guide | Description |
| --- | --- |
| ML Anomaly | Isolation Forest, LOF, One-Class SVM |
| Lineage | DAG tracking, OpenLineage integration |
| Plugins | Security sandbox, signing, hot reload |
| Performance | Optimization strategies |

Validation Engine Enhancement (VE)

Truthound v1.3.0 introduces the Validation Engine Enhancement (VE), inspired by Great Expectations (GX). It comprises five phases that strengthen the validation pipeline's expressiveness, performance, and fault tolerance.

| Phase | Feature | Description |
| --- | --- | --- |
| VE-1 | Result Format System | 4-level detail control (BOOLEAN_ONLY < BASIC < SUMMARY < COMPLETE) with progressive enrichment |
| VE-2 | Structured Results | ValidationDetail dataclass mirroring GX ExpectationValidationResult.result |
| VE-3 | Metric Deduplication | SharedMetricStore with MetricKey-based caching, CommonMetrics (11 standard metrics), deduplication across validators |
| VE-4 | Dependency DAG | SkipCondition for conditional execution, should_skip() based on prior results, priority-based level grouping |
| VE-5 | Exception Isolation | ExceptionInfo with 4-category classification, 3-tier fallback (batch → per-validator → per-expression), exponential backoff retry |
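
The VE-5 row combines two ideas: a failing validator must not abort the others, and transient failures are retried with exponentially growing delays. The sketch below illustrates both in plain Python. It is not Truthound's engine: Outcome, run_isolated, and the choice of TimeoutError and ConnectionError as "transient" are assumptions for this example, and the 4-category classification and 3-tier fallback are not modelled.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable

# Assumption: these built-in exception types stand in for "transient" errors.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


@dataclass
class Outcome:
    """Result of running one validator (hypothetical structure)."""
    name: str
    success: bool
    attempts: int
    error: str | None = None


def run_isolated(
    validators: dict[str, Callable[[], None]],
    max_retries: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> list[Outcome]:
    """Run every validator; one failure never aborts the others."""
    outcomes = []
    for name, func in validators.items():
        attempts = 0
        while True:
            attempts += 1
            try:
                func()
            except TRANSIENT_ERRORS as exc:
                if attempts <= max_retries:
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempts - 1))
                    continue
                outcomes.append(Outcome(name, False, attempts, repr(exc)))
            except Exception as exc:  # non-transient: record it, do not retry
                outcomes.append(Outcome(name, False, attempts, repr(exc)))
            else:
                outcomes.append(Outcome(name, True, attempts))
            break
    return outcomes


def broken() -> None:
    raise ValueError("bad column")


for outcome in run_isolated({"broken": broken, "ok": lambda: None}):
    print(outcome)
```

Passing sleep as a parameter keeps the retry schedule observable in tests. With max_retries=2 and base_delay=0.1, a validator that keeps timing out is attempted three times with waits of 0.1 s and 0.2 s in between.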

Key API Additions

import truthound as th

# Result format control (VE-1)
report = th.check("data.csv", result_format="complete")

# Exception isolation with retry (VE-5)
report = th.check("data.csv", catch_exceptions=True, max_retries=3)

# Access structured results (VE-2)
for issue in report.issues:
    if issue.result:
        print(f"Unexpected: {issue.result.unexpected_count} ({issue.result.unexpected_percent:.1%})")
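
The ordering BOOLEAN_ONLY < BASIC < SUMMARY < COMPLETE suggests that each level adds fields on top of the previous one. The following sketch models that idea with an IntEnum. It is a guess at the concept, not the library's code: ResultFormat and build_result are hypothetical, and which field appears at which level is an assumption (field names are borrowed from the GX result vocabulary).

```python
from enum import IntEnum
from typing import Callable


class ResultFormat(IntEnum):
    BOOLEAN_ONLY = 0
    BASIC = 1
    SUMMARY = 2
    COMPLETE = 3


def build_result(values: list, is_valid: Callable, level: ResultFormat) -> dict:
    """Build a result dict whose detail grows with the requested level."""
    unexpected_index = [i for i, v in enumerate(values) if not is_valid(v)]
    result = {"success": not unexpected_index}
    if level >= ResultFormat.BASIC:
        result["unexpected_count"] = len(unexpected_index)
        # Stored as a fraction so it can be printed with the :.1% format.
        result["unexpected_percent"] = len(unexpected_index) / len(values) if values else 0.0
    if level >= ResultFormat.SUMMARY:
        result["partial_unexpected_list"] = [values[i] for i in unexpected_index[:3]]
    if level >= ResultFormat.COMPLETE:
        result["unexpected_index_list"] = unexpected_index
    return result


ages = [25, -1, 40, 300]
for level in ResultFormat:
    print(level.name, build_result(ages, lambda v: 0 <= v <= 120, level))
```

Lower levels are cheaper to produce and to serialize, which matters when a check runs over many columns and only a pass or fail answer is needed.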

Validator Categories

| Category | Description |
| --- | --- |
| schema | Column structure, types, relationships |
| completeness | Null detection, required fields |
| uniqueness | Duplicates, primary keys, composite keys |
| distribution | Range, outliers, statistical tests |
| string | Regex, email, URL, JSON validation |
| datetime | Format, range, sequence validation |
| aggregate | Mean, median, sum constraints |
| cross_table | Multi-table relationships |
| multi_column | Column comparisons, conditional logic |
| query | SQL/Polars expression validation |
| table | Row count, freshness, metadata |
| geospatial | Coordinates, bounding boxes |
| drift | KS, PSI, Chi-square, Wasserstein |
| anomaly | IQR, Z-score, Isolation Forest, LOF |
| business_rule | Luhn, IBAN, VAT, ISBN validation |
| localization | Korean, Japanese, Chinese identifiers |
| ml_feature | Leakage detection, correlation |
| profiling | Cardinality, entropy, frequency |
| referential | Foreign keys, orphan records |
| timeseries | Gaps, seasonality, trend detection |
| privacy | PII detection and compliance rules |
| security | SQL injection, ReDoS protection |
| sdk | Custom validator development |
| timeout | Distributed timeout management |
| i18n | Internationalized error messages |
| streaming | Streaming data validation |
| memory | Memory-aware processing |
| optimization | Execution optimization |
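
As a concrete example of what a business_rule validator checks, here is the Luhn checksum used for payment card numbers. This is the standard algorithm written as a standalone function; the name luhn_valid is chosen for this sketch and is not a Truthound API.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # a widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

Spaces and dashes are ignored, so the function accepts numbers in the formats people usually type them.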

Data Sources

| Category | Sources |
| --- | --- |
| DataFrame | Polars, Pandas, PySpark |
| Core SQL | PostgreSQL, MySQL, SQLite |
| Cloud DW | BigQuery, Snowflake, Redshift, Databricks |
| Enterprise | Oracle, SQL Server |
| File | CSV, Parquet, JSON, NDJSON |
| Streaming | Kafka, Kinesis, Pub/Sub |

Installation Options

# Core installation
pip install truthound

# Feature-specific extras
pip install truthound[drift]      # Drift detection (scipy)
pip install truthound[anomaly]    # Anomaly detection (scikit-learn)
pip install truthound[pdf]        # PDF export (weasyprint)

# Data source extras
pip install truthound[bigquery]   # Google BigQuery
pip install truthound[snowflake]  # Snowflake
pip install truthound[redshift]   # Amazon Redshift
pip install truthound[databricks] # Databricks
pip install truthound[oracle]     # Oracle Database
pip install truthound[sqlserver]  # SQL Server

# Security extras
pip install truthound[encryption] # Encryption (cryptography)

# Full installation
pip install truthound[all]

Requirements

  • Python 3.11+
  • Polars 1.x
  • PyYAML
  • Rich
  • Typer

Development

git clone https://github.com/seadonggyun4/Truthound.git
cd Truthound
pip install hatch
hatch env create
hatch run test

References

  1. Polars Documentation. https://pola.rs/
  2. Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione"
  3. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
  4. Breunig, M. M., et al. (2000). "LOF: Identifying Density-Based Local Outliers"

License

Apache License 2.0


Acknowledgments

Built with Polars, Rich, Typer, scikit-learn, and SciPy.
