<img src="docs/assets/duckguard-logo.svg" alt="DuckGuard" width="420">

<h3>Data Quality That Just Works</h3>
<p><strong>3 lines of code</strong> &bull; <strong>Any data source</strong> &bull; <strong>10x faster</strong></p>

<p><em>One API for CSV, Parquet, Snowflake, Databricks, BigQuery, and 15+ sources. No boilerplate.</em></p>

[![PyPI version](https://img.shields.io/pypi/v/duckguard.svg)](https://pypi.org/project/duckguard/)
[![Downloads](https://static.pepy.tech/badge/duckguard)](https://pepy.tech/project/duckguard)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://xdatahubai.github.io/duckguard/)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/XDataHubAI/duckguard/blob/main/examples/getting_started.ipynb)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/XDataHubAI/duckguard/blob/main/examples/kaggle_data_quality.ipynb)
</div>

---

```python
from duckguard import connect

orders = connect("s3://warehouse/orders.parquet")  # Cloud, local, or warehouse
assert orders.customer_id.is_not_null()            # Just like pytest!
assert orders.total_amount.between(0, 10000)       # Readable validations
assert orders.status.isin(["pending", "shipped", "delivered"])

quality = orders.score()
print(f"Grade: {quality.grade}")  # A, B, C, D, or F
```

**That's it.** The same 3 lines work whether your data lives in S3, Snowflake, Databricks, or a local CSV. No context. No datasource. No validator. No expectation suite. Just data quality.

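Because every check is a plain `assert`, the same calls drop straight into a pytest suite with no plugin or fixture layer. Below is a minimal sketch, assuming a local `orders.parquet` and the `connect`/`score()` API shown above; the file and test names are illustrative.

```python
# test_orders_quality.py -- run with `pytest` (illustrative sketch)
from duckguard import connect


def test_orders_columns():
    orders = connect("orders.parquet")

    # A failed check raises AssertionError, so pytest reports it like any other test.
    assert orders.customer_id.is_not_null()
    assert orders.total_amount.between(0, 10000)
    assert orders.status.isin(["pending", "shipped", "delivered"])


def test_orders_grade():
    # Gate the build on the overall quality score (grades A through F, as above).
    quality = connect("orders.parquet").score()
    assert quality.grade not in ("D", "F")
```
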
### Works with Your Data Stack

```python
from duckguard import connect

# Data Lakes
orders = connect("s3://bucket/orders.parquet")     # AWS S3
orders = connect("gs://bucket/orders.parquet")     # Google Cloud
orders = connect("az://container/orders.parquet")  # Azure Blob

# Data Warehouses
orders = connect("snowflake://account/db", table="orders")     # Snowflake
orders = connect("databricks://host/catalog", table="orders")  # Databricks
orders = connect("bigquery://project", table="orders")         # BigQuery
orders = connect("redshift://cluster/db", table="orders")      # Redshift

# Modern Table Formats
orders = connect("delta://path/to/delta_table")      # Delta Lake
orders = connect("iceberg://path/to/iceberg_table")  # Apache Iceberg

# Databases
orders = connect("postgres://localhost/db", table="orders")  # PostgreSQL
orders = connect("mysql://localhost/db", table="orders")     # MySQL

# Files & DataFrames
orders = connect("orders.parquet")  # Parquet, CSV, JSON, Excel
orders = connect(pandas_dataframe)  # pandas DataFrame
```

> **15+ connectors.** Install what you need: `pip install duckguard[snowflake]`, `duckguard[databricks]`, or `duckguard[all]`.
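
Because every source hands back the same kind of object, one routine can validate a local file in development and the warehouse table in production. The sketch below is illustrative rather than part of the DuckGuard API: the `ENV` variable and `validate()` helper are hypothetical, and the Snowflake URL assumes the `duckguard[snowflake]` extra is installed.

```python
import os

from duckguard import connect


def validate(orders) -> str:
    """Run the same checks no matter where the data lives."""
    assert orders.customer_id.is_not_null()
    assert orders.total_amount.between(0, 10000)
    return orders.score().grade  # "A".."F", as in the quick start


if os.getenv("ENV") == "prod":
    # Production: warehouse table (hypothetical account/db names).
    orders = connect("snowflake://account/db", table="orders")
else:
    # Development: local file.
    orders = connect("orders.parquet")

print(f"Grade: {validate(orders)}")
```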

---

```python
from duckguard import connect

orders = connect(
    "snowflake://account/db",
    table="orders"
)

assert orders.customer_id.is_not_null()
assert orders.total_amount.between(0, 10000)
```

---

## Cookbook