Skip to content

Commit 8eb2111

Browse files
author
if
committed
feat: add expression rules, registry, and pre-commit hooks
1 parent f4aa51a commit 8eb2111

File tree

11 files changed

+373
-68
lines changed

11 files changed

+373
-68
lines changed

.pre-commit-config.yaml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
repos:
2+
- repo: local
3+
hooks:
4+
- id: ruff
5+
name: ruff
6+
entry: ruff check src tests
7+
language: system
8+
pass_filenames: false
9+
- id: black
10+
name: black
11+
entry: black --check src tests
12+
language: system
13+
pass_filenames: false
14+
- id: pytest
15+
name: pytest
16+
entry: pytest -vv
17+
language: system
18+
pass_filenames: false

README.md

Lines changed: 40 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3,32 +3,68 @@
33
Polars-first data-quality toolkit delivering deterministic validation, structured logging, and a composable rule registry.
44

55
## Why Aqualisys?
6-
- **Declarative rules**: ship reusable expectations such as not-null, uniqueness, accepted-values, and referential checks.
6+
- **Declarative rules**: ship reusable expectations such as not-null, uniqueness, accepted-values, referential checks, and full Polars expression rules.
77
- **Deterministic logging**: every run is persisted to SQLite (JSON-friendly) for audits and debugging.
88
- **Pipeline-ready**: run from Python code or via `aqualisys validate configs/orders.yml` in CI.
99

1010
## Quick Start
1111
```bash
1212
python -m venv .venv && source .venv/bin/activate
1313
pip install -e .[dev]
14+
pre-commit install
1415
pytest
1516
aqualisys validate configs/orders.yml
1617
```
1718

1819
## Usage Example
1920
```python
2021
import polars as pl
21-
from aqualisys import DataQualityChecker, NotNullRule, UniqueRule, SQLiteRunLogger
22+
from aqualisys import (
23+
DataQualityChecker,
24+
ExpressionRule,
25+
NotNullRule,
26+
UniqueRule,
27+
SQLiteRunLogger,
28+
)
2229

23-
df = pl.DataFrame({"order_id": [1, 2, 3], "status": ["pending", "shipped", "shipped"]})
30+
df = pl.DataFrame(
31+
{
32+
"order_id": [1, 2, 3],
33+
"status": ["pending", "shipped", "shipped"],
34+
"total": [10, 20, 10],
35+
}
36+
)
2437
checker = DataQualityChecker(
25-
rules=[NotNullRule("order_id"), UniqueRule("order_id")],
38+
rules=[
39+
NotNullRule("order_id"),
40+
UniqueRule("order_id"),
41+
ExpressionRule("pl.col('total') >= 0", description="Totals stay positive"),
42+
],
2643
logger=SQLiteRunLogger("artifacts/example_runs.db"),
2744
)
2845
report = checker.run(df, dataset_name="orders")
2946
assert report.passed
3047
```
3148

49+
## Rule Catalog
50+
51+
Rules are registered via metadata so configs can reference them by type and even override severity:
52+
53+
```yaml
54+
rules:
55+
- type: not_null
56+
column: order_id
57+
- type: accepted_values
58+
column: order_status
59+
allowed_values: ["pending", "shipped", "delivered", "cancelled"]
60+
- type: expression
61+
expression: "pl.col('total') >= 0"
62+
severity: warn
63+
description: "Order totals must be non-negative"
64+
```
65+
66+
Available built-in types today: `not_null`, `unique`, `accepted_values`, `relationship`, and `expression`. Use `severity: warn|error` per rule and add descriptions for richer logging.
67+
3268
## Project Structure
3369
- `src/aqualisys/`: library source (rules, checker, logging, CLI).
3470
- `tests/`: pytest suites (unit + integration).

configs/orders.yml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,3 +13,7 @@ rules:
1313
- type: accepted_values
1414
column: order_status
1515
allowed_values: ["pending", "shipped", "delivered", "cancelled"]
16+
- type: expression
17+
expression: "pl.col('total') >= 0"
18+
severity: warn
19+
description: "Order totals must be non-negative"

docs/ROADMAP.md

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -13,21 +13,21 @@
1313

1414
## Milestones
1515
1. **Foundation (Week 1)**
16-
- Scaffold repo: `pyproject.toml`, `src/aqualisys`, `tests`.
17-
- Implement minimal Polars rule set (unique, not_null) and SQLite logger.
18-
- Provide quick-start documentation + architecture diagrams in README.
16+
- Scaffold repo: `pyproject.toml`, `src/aqualisys`, `tests`.
17+
- Implement minimal Polars rule set (unique, not_null) and SQLite logger.
18+
- Provide quick-start documentation + architecture diagrams in README.
1919
2. **Rule Expansion (Week 2)**
20-
- Add accepted-values, referential-integrity, expression-based checks.
21-
- Introduce rule registry + tagging for bundles.
22-
- Emit structured results (JSON + SQLite) to support downstream observability.
20+
- Add accepted-values, referential-integrity, expression-based checks (new `ExpressionRule`).
21+
- Introduce rule registry + tagging for bundles (config now resolves via metadata, severity overrides supported).
22+
- Emit structured results (JSON + SQLite) to support downstream observability.
2323
3. **Configuration & CLI (Week 3)**
2424
- YAML config parser, CLI wrappers for running suites locally or in CI.
2525
- Support `--fail-fast`, severity overrides, include/exclude selectors.
2626
- Harden logging with retries + summary tables.
2727
4. **DX & Publishing (Week 4)**
2828
- Add docs site snippets, end-to-end demo notebook, telemetry opt-in.
29-
- Set up `uv build`, publish to TestPyPI, smoke-test install, then promote to PyPI.
30-
- Configure CI (lint, type-check, pytest with coverage) and badges.
29+
- Set up `uv build`, publish to TestPyPI, smoke-test install, then promote to PyPI.
30+
- Configure CI (lint, type-check, pytest with coverage) and badges.
3131

3232
## Success Metrics
3333
- Unit + integration coverage ≥90% on validators/loggers.

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ dev = [
2222
"mypy>=1.8.0",
2323
"pytest>=7.4.4",
2424
"pytest-cov>=4.1.0",
25+
"pre-commit>=3.5.0",
2526
"ruff>=0.2.1",
2627
]
2728

src/aqualisys/__init__.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,19 @@
55
"""
66

77
from .checker import DataQualityChecker, RuleBundle
8-
from .checks.rules import AcceptedValuesRule, NotNullRule, RelationshipRule, UniqueRule
8+
from .checks.rules import (
9+
AcceptedValuesRule,
10+
ExpressionRule,
11+
NotNullRule,
12+
RelationshipRule,
13+
UniqueRule,
14+
)
915
from .logging.sqlite import SQLiteRunLogger
1016

1117
__all__ = [
1218
"AcceptedValuesRule",
1319
"DataQualityChecker",
20+
"ExpressionRule",
1421
"NotNullRule",
1522
"RelationshipRule",
1623
"RuleBundle",

src/aqualisys/checks/registry.py

Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
from __future__ import annotations
2+
3+
from collections.abc import Callable, Iterable, Mapping
4+
from dataclasses import dataclass
5+
from pathlib import Path
6+
from typing import Any
7+
8+
import polars as pl
9+
10+
from .base import BaseRule, RuleSeverity
11+
from .rules import (
12+
AcceptedValuesRule,
13+
ExpressionRule,
14+
NotNullRule,
15+
RelationshipRule,
16+
UniqueRule,
17+
)
18+
19+
RuleFactory = Callable[[Mapping[str, Any]], BaseRule]
20+
21+
22+
@dataclass(frozen=True)
23+
class RuleDefinition:
24+
name: str
25+
description: str
26+
tags: frozenset[str]
27+
builder: RuleFactory
28+
29+
30+
_REGISTRY: dict[str, RuleDefinition] = {}
31+
32+
33+
def _register_builtin(
34+
name: str,
35+
builder: RuleFactory,
36+
*,
37+
description: str,
38+
tags: Iterable[str],
39+
) -> None:
40+
register_rule(name, builder, description=description, tags=tags)
41+
42+
43+
def register_rule(
44+
name: str,
45+
builder: RuleFactory,
46+
*,
47+
description: str = "",
48+
tags: Iterable[str] | None = None,
49+
) -> None:
50+
key = name.lower()
51+
if key in _REGISTRY:
52+
raise ValueError(f"rule '{name}' is already registered")
53+
_REGISTRY[key] = RuleDefinition(
54+
name=name,
55+
description=description,
56+
tags=frozenset(tags or ()),
57+
builder=builder,
58+
)
59+
60+
61+
def get_rule(name: str) -> RuleDefinition:
62+
try:
63+
return _REGISTRY[name.lower()]
64+
except KeyError as exc: # pragma: no cover - defensive
65+
raise KeyError(f"unknown rule type: {name}") from exc
66+
67+
68+
def list_rules(tag: str | None = None) -> list[RuleDefinition]:
69+
definitions = _REGISTRY.values()
70+
if tag:
71+
tag = tag.lower()
72+
definitions = [
73+
definition for definition in definitions if tag in definition.tags
74+
]
75+
return sorted(definitions, key=lambda definition: definition.name)
76+
77+
78+
def _resolve_severity(config: Mapping[str, Any]) -> RuleSeverity:
79+
level = config.get("severity")
80+
if not level:
81+
return RuleSeverity.ERROR
82+
try:
83+
return RuleSeverity(level.lower())
84+
except ValueError as exc:
85+
raise ValueError(f"unknown severity '{level}'") from exc
86+
87+
88+
def _resolve_description(
89+
config: Mapping[str, Any],
90+
fallback: str,
91+
) -> str:
92+
return config.get("description") or fallback
93+
94+
95+
def _build_not_null(config: Mapping[str, Any]) -> BaseRule:
96+
return NotNullRule(
97+
column=config["column"],
98+
severity=_resolve_severity(config),
99+
description=_resolve_description(config, f"NotNull on {config['column']}"),
100+
)
101+
102+
103+
def _build_unique(config: Mapping[str, Any]) -> BaseRule:
104+
return UniqueRule(
105+
column=config["column"],
106+
severity=_resolve_severity(config),
107+
description=_resolve_description(config, f"Unique on {config['column']}"),
108+
)
109+
110+
111+
def _build_accepted_values(config: Mapping[str, Any]) -> BaseRule:
112+
return AcceptedValuesRule(
113+
column=config["column"],
114+
allowed_values=config["allowed_values"],
115+
severity=_resolve_severity(config),
116+
)
117+
118+
119+
def _build_relationship(config: Mapping[str, Any]) -> BaseRule:
120+
reference_cfg = config["reference"]
121+
ref_path = Path(reference_cfg["path"])
122+
ref_format = reference_cfg.get("format", "parquet")
123+
if ref_format == "parquet":
124+
reference_df = pl.read_parquet(ref_path)
125+
elif ref_format == "csv":
126+
reference_df = pl.read_csv(ref_path)
127+
else: # pragma: no cover - validated via config tests
128+
raise ValueError(f"unsupported reference format: {ref_format}")
129+
return RelationshipRule(
130+
column=config["column"],
131+
reference_df=reference_df,
132+
reference_column=reference_cfg["column"],
133+
severity=_resolve_severity(config),
134+
)
135+
136+
137+
def _build_expression(config: Mapping[str, Any]) -> BaseRule:
138+
return ExpressionRule(
139+
expression=config["expression"],
140+
severity=_resolve_severity(config),
141+
description=_resolve_description(
142+
config,
143+
f"Expression rule {config['expression']}",
144+
),
145+
)
146+
147+
148+
_register_builtin(
149+
name="not_null",
150+
builder=_build_not_null,
151+
description="Fails when the specified column contains null values.",
152+
tags=("nulls", "integrity"),
153+
)
154+
_register_builtin(
155+
name="unique",
156+
builder=_build_unique,
157+
description="Fails when duplicate values are detected in the column.",
158+
tags=("uniqueness", "integrity"),
159+
)
160+
_register_builtin(
161+
name="accepted_values",
162+
builder=_build_accepted_values,
163+
description="Ensures all column values are part of an allowed set.",
164+
tags=("reference", "categorical"),
165+
)
166+
_register_builtin(
167+
name="relationship",
168+
builder=_build_relationship,
169+
description="Verifies referential integrity with an on-disk reference dataset.",
170+
tags=("reference", "integrity"),
171+
)
172+
_register_builtin(
173+
name="expression",
174+
builder=_build_expression,
175+
description="Evaluates a boolean Polars expression defined as a string.",
176+
tags=("expression", "flexible"),
177+
)

src/aqualisys/checks/rules.py

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -128,3 +128,55 @@ def evaluate(self, df: pl.DataFrame) -> RuleResult:
128128
"reference_size": len(reference_set),
129129
},
130130
)
131+
132+
133+
class ExpressionRule(BaseRule):
134+
"""Evaluates a boolean Polars expression string for every row."""
135+
136+
def __init__(
137+
self,
138+
expression: str,
139+
*,
140+
severity: RuleSeverity = RuleSeverity.ERROR,
141+
description: str | None = None,
142+
) -> None:
143+
self.expression = expression
144+
self.severity = severity
145+
self.description = description or f"ExpressionRule on {expression}"
146+
147+
@property
148+
def name(self) -> str:
149+
return f"ExpressionRule::{self.expression}"
150+
151+
def _compile(self) -> pl.Expr:
152+
try:
153+
compiled = eval(self.expression, {"pl": pl}, {})
154+
except Exception as exc: # pragma: no cover - exercised via tests
155+
raise ValueError(f"invalid expression: {self.expression}") from exc
156+
if not isinstance(compiled, pl.Expr):
157+
raise ValueError(
158+
"expression must evaluate to a Polars expression, got "
159+
f"{type(compiled)!r}"
160+
)
161+
return compiled
162+
163+
def evaluate(self, df: pl.DataFrame) -> RuleResult:
164+
expr = self._compile()
165+
result_series = df.select(expr.alias("result")).to_series()
166+
violations = int((~result_series).sum())
167+
status = RuleStatus.PASSED if violations == 0 else RuleStatus.FAILED
168+
message = (
169+
"expression satisfied for all rows"
170+
if status is RuleStatus.PASSED
171+
else f"{violations} expression violations detected"
172+
)
173+
return RuleResult(
174+
rule_name=self.name,
175+
status=status,
176+
message=message,
177+
severity=self.severity,
178+
metrics={
179+
"expression": self.expression,
180+
"violation_count": violations,
181+
},
182+
)

0 commit comments

Comments
 (0)