Being Discussed in #162
Originally posted by addloopy July 5, 2025
RFC00001: Snowflake Feature Pull Connector for BharatMLStack
Metadata
- Author: Adarsha Das
- Date Created: 2025-07-05
- Status: Draft
- Target Release: v1.1.0
- Discussion Channel: #connectors-snowflake
1. Summary
This proposal introduces native support for Snowflake as an offline feature store and describes an end-to-end pipeline that ports those features into BharatMLStack's online feature store so they can be served for real-time inference. The connector retrieves features from Snowflake using the snowflake.ml.feature_store API, applies the transformation/deduplication logic in data_helpers.py, then relies on spark_feature_push_client to serialize the resulting Spark DataFrame into Protobuf and push it—via Kafka or another configured sink—into the online store. The aim is to provide a single, frictionless path from Snowflake to online serving with zero intermediate exports.
2. Motivation
- Industry Adoption – Snowflake is the de facto cloud data warehouse for many ML teams.
- Unified Governance – Keeping features in Snowflake allows teams to reuse established data contracts, lineage, and RBAC.
- Performance & Scale – Snowflake's automatic scaling matches BharatMLStack's design philosophy of elastic cost.
- Developer Ergonomics – The `snowflake.ml.feature_store` SDK abstracts low-level SQL and improves developer productivity.

Without this connector, users have to export data to object storage before ingesting it, introducing latency and operational overhead.
3. Goals
- Add `SNOWFLAKE_TABLE` as a new `source_type` value consumed by `read_from_source()`.
- Use the `snowflake.ml.feature_store.FeatureStore` abstraction for reading data, not raw JDBC.
- Support predicate pushdown on a configurable partition column (default: `ingestion_timestamp`).
- Preserve existing semantics: the helper must return a Spark DataFrame deduplicated on entity keys.
- Provide clear configuration and credential-management guidelines.
- Seamlessly push the transformed Snowflake features to BharatMLStack's online feature store through `spark_feature_push_client`, enabling immediate consumption at inference time.
4. Non-Goals
- Online serving from Snowflake (future work).
- Bi-directional sync or writing features back into Snowflake.
- Support for Snowflake external tables or stages (out of scope for first version).
5. Detailed Design
5.1 Data Pull Flow
- Session Creation – A `snowpark.Session` is established using env vars or a passed-in config map.
- Feature Store Bind – Instantiate `FeatureStore(session)`.
- Table Selection – Retrieve the maximum value of the partition column via the Feature Store API.
- Data Retrieval – Load the filtered rows into a Spark DataFrame through `feature_store.read().to_spark()`.
- Schema Alignment – Column names are normalized and optionally renamed according to `feature_mapping` (see the sketch after this list).
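The sketch below illustrates these steps end to end. It is a minimal, hedged example: `FeatureStore(session)` and the `read().to_spark()` chain mirror the pseudocode above rather than verified SDK signatures, and `pull_snowflake_features`, its parameters, and the incremental-filter handling are hypothetical names introduced only for illustration.

```python
import os

from snowflake.snowpark import Session
from snowflake.ml.feature_store import FeatureStore


def pull_snowflake_features(table_name, connection_config=None,
                            partition_col="ingestion_timestamp",
                            last_processed_value=None, feature_mapping=None):
    # Session Creation: use a passed-in config map, falling back to env vars
    # (keys follow the parameter table in Section 5.4).
    if connection_config is None:
        connection_config = {
            "account": os.environ["SNOWFLAKE_ACCOUNT"],
            "user": os.environ["SNOWFLAKE_USER"],
            "password": os.environ["SNOWFLAKE_PASSWORD"],
            "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
            "database": os.environ["SNOWFLAKE_DATABASE"],
            "schema": os.environ["SNOWFLAKE_SCHEMA"],
        }
    session = Session.builder.configs(connection_config).create()

    # Feature Store Bind: FeatureStore(session) mirrors the step above; the
    # released SDK may require additional arguments (database, warehouse, etc.).
    feature_store = FeatureStore(session)

    # Data Retrieval: push the incremental predicate down on the partition
    # column, then convert to a Spark DataFrame. read()/to_spark() follow the
    # RFC pseudocode and are not verified SDK calls.
    snowpark_df = feature_store.read(table_name)
    if last_processed_value is not None:
        snowpark_df = snowpark_df.filter(f"{partition_col} > '{last_processed_value}'")
    spark_df = snowpark_df.to_spark()

    # Schema Alignment: normalize column names, then apply feature_mapping renames.
    for col_name in spark_df.columns:
        spark_df = spark_df.withColumnRenamed(col_name, col_name.lower())
    for src_col, dst_col in (feature_mapping or {}).items():
        spark_df = spark_df.withColumnRenamed(src_col, dst_col)
    return spark_df
```

Accepting an explicit connection config map keeps the helper usable in multi-account scenarios (Section 5.4) while still defaulting to env vars.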
5.2 Data Push Flow
- Null Handling & Default Filling – Invoke
fill_na_features_with_default_valuesto replace missing values based on feature metadata. - Proto Generation – Use
OnlineFeatureStorePyClient.generate_df_with_protobuf_messages(fromspark_feature_push_client) to convert rows into Protobuf messages. - Batching & Kafka Write – Call
write_protobuf_df_to_kafka, optionally in batches, to stream messages to the online feature-store ingress topic. - Audit Sink (Optional) – If
features_write_to_cloud_storageistrue, persist the transformed DataFrame to the configured cloud path for replay or debugging. - Online Availability – Once consumed by BharatMLStack's ingestion service, features become queryable for real-time inference within SLA (p95 < X s).
5.3 API Touch-Points
| Function | Change | Rationale |
|---|---|---|
| `read_from_source()` | Add an `elif source_type == "SNOWFLAKE_TABLE"` branch that performs the steps in 5.1 (sketched below) | Aligns with existing connector pattern |
| `src_type_to_partition_col_map` | Append `"SNOWFLAKE_TABLE": "ingestion_timestamp"` | Ensures incremental loads |
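The shape of these two touch-points is sketched below. The surrounding function, its existing branches, its real signature, and the existing map contents belong to the current codebase and are only stubbed here; `pull_snowflake_features` is the hypothetical helper from the 5.1 sketch.

```python
src_type_to_partition_col_map = {
    # ... existing source types ...
    "SNOWFLAKE_TABLE": "ingestion_timestamp",
}


def read_from_source(source_type, source_config):
    # ... existing branches for other source types would precede this as elif ...
    if source_type == "SNOWFLAKE_TABLE":
        return pull_snowflake_features(
            table_name=source_config["table"],
            connection_config=source_config.get("connection_config"),
            partition_col=src_type_to_partition_col_map["SNOWFLAKE_TABLE"],
            last_processed_value=source_config.get("last_processed_value"),
            feature_mapping=source_config.get("feature_mapping"),
        )
    raise ValueError(f"Unsupported source_type: {source_type}")
```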
5.4 Configuration
| Parameter | Description | Example |
|---|---|---|
| `SNOWFLAKE_ACCOUNT` | Snowflake account identifier | `xy12345.us-east-1` |
| `SNOWFLAKE_USER` | Login user | `ml_readonly` |
| `SNOWFLAKE_PASSWORD` | Password or OAuth token | `****` |
| `SNOWFLAKE_WAREHOUSE` | Virtual warehouse used for queries | `ML_WH` |
| `SNOWFLAKE_DATABASE` | Database containing feature tables | `FEATURE_DB` |
| `SNOWFLAKE_SCHEMA` | Default schema | `PUBLIC` |
Credentials are loaded by helper utilities and injected into the Snowpark session. Users may override them through function arguments for multi-account scenarios.
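One possible loading helper is sketched below, assuming the env-var names from the table above; the function name and the override mechanism are illustrative.

```python
import os


def load_snowflake_connection_config(overrides=None):
    """Collect Snowflake connection parameters from env vars, allowing
    per-call overrides for multi-account scenarios."""
    config = {
        "account": os.environ.get("SNOWFLAKE_ACCOUNT"),
        "user": os.environ.get("SNOWFLAKE_USER"),
        "password": os.environ.get("SNOWFLAKE_PASSWORD"),
        "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE"),
        "database": os.environ.get("SNOWFLAKE_DATABASE"),
        "schema": os.environ.get("SNOWFLAKE_SCHEMA"),
    }
    if overrides:
        config.update(overrides)
    missing = [key for key, value in config.items() if not value]
    if missing:
        raise ValueError(f"Missing Snowflake connection parameters: {missing}")
    return config
```

Failing fast on missing parameters surfaces misconfiguration before the Snowpark session is built, where such errors are harder to diagnose.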
5.5 Error Handling & Observability
- Retry Logic: Exponential backoff for transient Snowflake errors (`XYZ-0001`, `XYZ-0002`); a backoff sketch follows this list.
- Metrics:
  - `snowflake.connector.latency_ms`
  - `snowflake.connector.rows_read`
  - `snowflake.connector.bytes_scanned`
- Logging: Obfuscate sensitive parameters before logging connection strings.
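A sketch of the retry behaviour is below. It assumes the transient codes above (which are placeholders in this RFC) can be matched against an attribute of the raised exception; the attribute names and backoff parameters are illustrative.

```python
import logging
import random
import time

# Placeholder transient error codes from the list above.
TRANSIENT_SNOWFLAKE_ERROR_CODES = {"XYZ-0001", "XYZ-0002"}


def with_retries(fn, max_attempts=5, base_delay_s=1.0):
    """Retry a Snowflake call with exponential backoff and jitter on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow to the connector's error type in practice
            code = str(getattr(exc, "errno", "") or getattr(exc, "sqlstate", ""))
            if code not in TRANSIENT_SNOWFLAKE_ERROR_CODES or attempt == max_attempts:
                raise
            delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            logging.warning(
                "Transient Snowflake error %s; retrying in %.1fs (attempt %d/%d)",
                code, delay, attempt, max_attempts,
            )
            time.sleep(delay)
```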
6. Test Plan
- Unit Tests – Mock `snowflake.ml.feature_store` to simulate data fetches (see the test sketch after this list).
- Integration Tests – Use Snowflake's `TPCH_SF1` sample to validate partition filtering, column renaming, null handling, Protobuf generation, and a successful push to the online feature store.
- Schema Drift Tests – Verify the connector surfaces errors when expected columns are missing or types change.
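A unit-test sketch for the first item, assuming the pull helper from the 5.1 sketch lives in a hypothetical `snowflake_connector` module; the mock targets and assertions would need to track the real module layout.

```python
import unittest
from unittest import mock


class TestSnowflakePull(unittest.TestCase):
    @mock.patch("snowflake_connector.FeatureStore")
    @mock.patch("snowflake_connector.Session")
    def test_partition_filter_applied(self, mock_session, mock_feature_store):
        # Stand in for the Section 5.1 read chain so no Snowflake account is needed.
        fake_df = mock.MagicMock()
        fake_df.filter.return_value = fake_df
        fake_df.to_spark.return_value = mock.MagicMock(columns=[])
        mock_feature_store.return_value.read.return_value = fake_df

        from snowflake_connector import pull_snowflake_features
        pull_snowflake_features("USER_FEATURES", connection_config={},
                                last_processed_value="2025-07-01")

        # The incremental predicate should be pushed down on the default partition column.
        fake_df.filter.assert_called_once_with("ingestion_timestamp > '2025-07-01'")


if __name__ == "__main__":
    unittest.main()
```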
7. Rollout Strategy
| Phase | Description |
|---|---|
| Alpha | Flag-guarded release for internal users. Collect feedback. |
| Beta | Publicly documented. Backwards compatibility guaranteed. |
| GA | Remove feature flag. Add SLA monitoring. |
8. Alternatives Considered
- Spark JDBC Connector – Simpler, but lacks feature-store metadata awareness and type mapping.
- Materializing Views to Parquet – Adds latency and duplication of storage.
9. Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Snowflake API instability | Ingest failures | Pin to major version; add compatibility tests |
| Credential leakage | Security breach | Use secrets manager integration; no plaintext logs |
| Large data pulls | Cost overruns | Enforce partition filters; sample row counts during CI |
10. Success Metrics
- 100% schema parity with Snowflake source tables.
11. Future Work
- Automatic statistics push to BharatML observability dashboard.
12. References
- Snowflake ML Feature Store Documentation (2024-04) — "snowflake.ml.feature_store".

