Add Sentry Integration for Error Monitoring in akavesdk-py #122

@gitsofaryan

Summary

Introduce optional Sentry-based error monitoring and observability support in akavesdk-py to improve production debugging, failure visibility, and operational insight without introducing a hard dependency.


Problem

Currently, akavesdk-py does not provide centralized error monitoring or structured runtime observability.

SDK users can encounter failures such as:

  • Network or transport failures
  • O3 streaming errors
  • CID resolution failures
  • Authentication / UCAN validation errors
  • Retry exhaustion from tenacity

When these failures occur, there is no built-in mechanism to:

  • Capture stack traces centrally
  • Monitor error frequency
  • Attach contextual metadata to failures
  • Trace performance bottlenecks
  • Debug real-world production issues

This limits visibility for both SDK maintainers and integrators.


Proposed Solution

Integrate Sentry (Python SDK) as an optional observability layer that:

  • Captures unhandled exceptions
  • Captures structured SDK-level errors
  • Tracks retry exhaustion events
  • Attaches contextual metadata (dataset_id, CID, node endpoint, operation name)
  • Supports async workflows
  • Optionally enables performance tracing

The integration must:

  • Be optional (no forced dependency)
  • Be configurable via environment variables
  • Avoid logging sensitive information (private keys, tokens, auth headers)
  • Introduce minimal performance overhead
  • Not create breaking changes
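
As a rough illustration of the "optional" and "configurable via environment variables" requirements, an initializer could look like the sketch below. The environment variable names (AKAVE_SENTRY_DSN, AKAVE_SENTRY_ENVIRONMENT, AKAVE_SENTRY_TRACES_SAMPLE_RATE) and the `environ` parameter are assumptions made for this example, not an agreed interface.

```python
import os

try:  # sentry-sdk is optional; the SDK must keep working without it
    import sentry_sdk
except ImportError:
    sentry_sdk = None


def init_sentry(environ=None) -> bool:
    """Initialise Sentry when it is installed and a DSN is configured.

    Returns True if Sentry was initialised, False if monitoring stays off.
    """
    env = os.environ if environ is None else environ
    dsn = env.get("AKAVE_SENTRY_DSN", "")  # hypothetical variable name
    if sentry_sdk is None or not dsn:
        return False  # silently disabled: no hard dependency, no breaking change
    sentry_sdk.init(
        dsn=dsn,
        environment=env.get("AKAVE_SENTRY_ENVIRONMENT", "production"),
        # Tracing is off unless a sample rate is set explicitly.
        traces_sample_rate=float(env.get("AKAVE_SENTRY_TRACES_SAMPLE_RATE", "0.0")),
        send_default_pii=False,
    )
    return True
```

With no DSN set, or without sentry-sdk installed, `init_sentry()` returns False and the SDK behaves exactly as it does today.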

High-Level Implementation Plan

  1. Add Sentry as an optional dependency (akavesdk[sentry])
  2. Create a centralized observability module inside the SDK
  3. Provide an initialization function (e.g., init_sentry)
  4. Instrument core SDK error boundaries
  5. Capture retry exhaustion events from tenacity
  6. Optionally add tracing spans around streaming/upload operations
  7. Add documentation explaining setup and configuration
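
Steps 2 to 5 could be sketched roughly as follows. The helper names (`capture_sdk_error`, `make_retry_exhausted_callback`) and the `akave.operation` tag are hypothetical; the sketch assumes that `sentry_sdk.capture_exception` accepts `tags` and `contexts` keyword arguments and that tenacity passes a retry state exposing `outcome` and `attempt_number` to its `retry_error_callback`.

```python
from typing import Any, Callable

try:  # optional dependency, as in the plan above
    import sentry_sdk
except ImportError:
    sentry_sdk = None


def capture_sdk_error(exc: BaseException, operation: str, **context: Any) -> bool:
    """Report exc with SDK metadata; do nothing if sentry-sdk is missing."""
    if sentry_sdk is None:
        return False
    sentry_sdk.capture_exception(
        exc,
        tags={"akave.operation": operation},  # hypothetical tag name
        contexts={"akave": context},  # e.g. dataset_id, cid, node endpoint
    )
    return True


def make_retry_exhausted_callback(
    operation: str,
    report: Callable[..., Any] = capture_sdk_error,
) -> Callable[[Any], Any]:
    """Build a callback that reports the final failure, then re-raises it."""

    def _on_exhausted(retry_state: Any) -> Any:
        exc = retry_state.outcome.exception()
        if exc is None:
            # Retries ended on a rejected result rather than an exception.
            return retry_state.outcome.result()
        report(exc, operation, attempts=retry_state.attempt_number)
        raise exc

    return _on_exhausted
```

The returned callback is meant to be passed as `retry_error_callback=` to tenacity's `retry` decorator, so callers still see the original exception while the exhaustion event is recorded once.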

Security Considerations

  • Ensure private keys and UCAN tokens are never logged
  • Mask authentication headers
  • Avoid sending large payload bodies
  • Provide a sanitization hook if needed
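
One possible shape for the sanitization hook is a function matching Sentry's `before_send(event, hint)` signature that scrubs events before they leave the process. The key markers and the size limit below are illustrative assumptions and would need review against the SDK's real field names.

```python
from typing import Any

# Illustrative markers only; matched as substrings of lower-cased keys.
SENSITIVE_KEYS = ("authorization", "private_key", "ucan", "token", "secret")
MAX_VALUE_LENGTH = 1024  # assumed cap to avoid shipping large payload bodies


def _scrub(value: Any, key: str = "") -> Any:
    """Return a copy of value with sensitive or oversized entries masked."""
    if any(marker in key.lower() for marker in SENSITIVE_KEYS):
        return "[Filtered]"
    if isinstance(value, dict):
        return {k: _scrub(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_scrub(v) for v in value]
    if isinstance(value, (str, bytes)) and len(value) > MAX_VALUE_LENGTH:
        return f"[Truncated, length {len(value)}]"
    return value


def before_send(event: dict, hint: dict) -> dict:
    """Sentry hook: mask secrets and large bodies in the outgoing event."""
    return _scrub(event)
```

The hook would be supplied as `before_send=before_send` when calling `sentry_sdk.init`. It builds a new event rather than mutating the original, so the caller's data is left untouched.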

Expected Impact

  • Production-grade observability
  • Faster debugging for SDK consumers
  • Better insight into decentralized network reliability
  • Improved SDK trust and enterprise readiness
  • Clearer operational visibility during distributed ML workloads

Acceptance Criteria

  • Optional dependency added
  • Observability module implemented
  • Core error boundaries instrumented
  • Retry exhaustion captured
  • Documentation updated
  • No breaking changes introduced
