Summary
Introduce optional Sentry-based error monitoring and observability support in akavesdk-py to improve production debugging, failure visibility, and operational insight without introducing a hard dependency.
Problem
Currently, akavesdk-py does not provide centralized error monitoring or structured runtime observability.
SDK users can encounter issues such as:
- Network or transport failures
- O3 streaming errors
- CID resolution failures
- Authentication / UCAN validation errors
- Retry exhaustion from tenacity
There is no built-in mechanism to:
- Capture stack traces centrally
- Monitor error frequency
- Attach contextual metadata to failures
- Trace performance bottlenecks
- Debug real-world production issues
This limits visibility for both SDK maintainers and integrators.
Proposed Solution
Integrate Sentry (Python SDK) as an optional observability layer that:
- Captures unhandled exceptions
- Captures structured SDK-level errors
- Tracks retry exhaustion events
- Attaches contextual metadata (dataset_id, CID, node endpoint, operation name)
- Supports async workflows
- Optionally enables performance tracing
The integration must:
- Be optional (no forced dependency)
- Be configurable via environment variables
- Avoid logging sensitive information (private keys, tokens, auth headers)
- Introduce minimal performance overhead
- Not create breaking changes
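A minimal sketch of how the optional, environment-driven initialization could look. The `AKAVE_SENTRY_*` variable names and the exact `init_sentry` signature are assumptions for illustration, not existing akavesdk-py settings. The guarded import keeps `sentry_sdk` out of the required dependencies, so the function is a no-op unless the package is installed (for example via the proposed `akavesdk[sentry]` extra) and a DSN is configured.

```python
import os

try:  # Sentry stays optional: the SDK must import cleanly without it.
    import sentry_sdk
except ImportError:
    sentry_sdk = None


def init_sentry(dsn=None, before_send=None):
    """Enable Sentry reporting if it is installed and configured.

    Returns True when Sentry was initialized, False otherwise.
    """
    # Hypothetical variable names; the issue does not specify them.
    dsn = dsn or os.environ.get("AKAVE_SENTRY_DSN")
    if sentry_sdk is None or not dsn:
        return False  # Nothing to do: existing behavior is unchanged.

    sentry_sdk.init(
        dsn=dsn,
        environment=os.environ.get("AKAVE_SENTRY_ENVIRONMENT", "production"),
        # Tracing is off unless a sample rate is set explicitly.
        traces_sample_rate=float(
            os.environ.get("AKAVE_SENTRY_TRACES_SAMPLE_RATE", "0.0")
        ),
        send_default_pii=False,   # do not attach user data by default
        before_send=before_send,  # optional sanitization hook
    )
    return True
```

Because the function returns `False` instead of raising when Sentry is unavailable, integrators could call it unconditionally at startup without risking a breaking change.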
High-Level Implementation Plan
- Add Sentry as an optional dependency (akavesdk[sentry])
- Create a centralized observability module inside the SDK
- Provide an initialization function (e.g., init_sentry)
- Instrument core SDK error boundaries
- Capture retry exhaustion events from tenacity
- Optionally add tracing spans around streaming/upload operations
- Add documentation explaining setup and configuration
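The error-boundary and retry-exhaustion steps could be sketched roughly as follows. The helper names (`report_error`, `error_boundary`, `on_retry_exhausted`) and the `akave.operation` tag are hypothetical, and the scope handling assumes sentry-sdk 2.x, where `sentry_sdk.new_scope()` is available. When Sentry is not installed, every helper degrades to a no-op and the original exception still propagates.

```python
from contextlib import contextmanager

try:
    import sentry_sdk
except ImportError:
    sentry_sdk = None


def report_error(exc, operation, **context):
    """Send exc to Sentry with SDK context; return False if Sentry is absent."""
    if sentry_sdk is None:
        return False
    # Assumes sentry-sdk 2.x: a temporary scope keeps the metadata from
    # leaking into unrelated events.
    with sentry_sdk.new_scope() as scope:
        scope.set_tag("akave.operation", operation)
        scope.set_context("akave", context)  # e.g. dataset_id, cid, endpoint
        sentry_sdk.capture_exception(exc)
    return True


@contextmanager
def error_boundary(operation, **context):
    """Report any exception raised inside the block, then re-raise it."""
    try:
        yield
    except Exception as exc:
        report_error(exc, operation, **context)
        raise


def on_retry_exhausted(retry_state):
    """Intended for tenacity's retry_error_callback: report, then re-raise."""
    outcome = retry_state.outcome
    exc = outcome.exception()
    if exc is None:
        # Retries stopped on a returned value rather than an exception.
        return outcome.result()
    report_error(exc, "retry_exhausted", attempts=retry_state.attempt_number)
    raise exc
```

An SDK call site could wrap an upload in `with error_boundary("upload", dataset_id=..., cid=...):`, and a tenacity-decorated function could pass `retry_error_callback=on_retry_exhausted` to `tenacity.retry`. Note that with this callback the caller sees the last underlying exception instead of tenacity's `RetryError`, which is a design choice the implementation would need to confirm.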
Security Considerations
- Ensure private keys and UCAN tokens are never logged
- Mask authentication headers
- Avoid sending large payload bodies
- Provide a sanitization hook if needed
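One way to approach the sanitization hook is a `before_send` callback, which the Sentry Python SDK calls with `(event, hint)` before an event is transmitted. The sketch below is illustrative only: the list of sensitive key fragments, the size limit, and the placeholder strings are assumptions that would need review against the fields akavesdk-py actually handles.

```python
# Hypothetical key fragments; matched case-insensitively against dict keys.
SENSITIVE_KEYS = ("authorization", "cookie", "token", "ucan", "private_key", "secret")
MAX_VALUE_LENGTH = 1024  # assumed cap to avoid shipping large payload bodies


def _is_sensitive(key):
    name = str(key).lower().replace("-", "_")
    return any(marker in name for marker in SENSITIVE_KEYS)


def _scrub(value):
    """Return a copy of value with secrets masked and large blobs dropped."""
    if isinstance(value, dict):
        return {
            key: "[Filtered]" if _is_sensitive(key) else _scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [_scrub(item) for item in value]
    if isinstance(value, (str, bytes)) and len(value) > MAX_VALUE_LENGTH:
        return "[Truncated]"
    return value


def before_send(event, hint):
    """Sentry before_send hook: sanitize the event before it leaves the process."""
    return _scrub(event)
```

The hook would be passed as `sentry_sdk.init(before_send=before_send)`. It builds a new structure rather than mutating the event in place, so the original data remains untouched for local logging.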
Expected Impact
- Production-grade observability
- Faster debugging for SDK consumers
- Better insight into decentralized network reliability
- Improved SDK trust and enterprise readiness
- Clearer operational visibility during distributed ML workloads
Acceptance Criteria
- Optional dependency added
- Observability module implemented
- Core error boundaries instrumented
- Retry exhaustion captured
- Documentation updated
- No breaking changes introduced