
enhance observability with structured logging and metrics #2098

@guoer9

Description

[Feature] Enhance Qlib Observability: Structured Logging, Metrics, and Workflow Tracing

Enhance Qlib's observability infrastructure to support structured logging, performance metrics collection, and workflow tracing.

Currently, Qlib uses basic Python logging via get_module_logger and TimeInspector for timing. This proposal aims to build upon the existing infrastructure to provide comprehensive observability capabilities that help users monitor, debug, and optimize their quantitative research workflows.


Motivation

1. Application Scenarios

Data Pipeline Debugging

When processing large datasets through the DatasetH → DataHandlerLP → Processor chain, it's difficult to identify bottlenecks:

  • Cache hit/miss rates (ExpressionCache, DatasetCache) are not exposed
  • Memory usage during D.features() calls is invisible

Model Training Monitoring

TrainerR / TrainerRM lack visibility into per-epoch resource consumption:

  • No metrics for comparing training efficiency across different models in qlib/contrib/model/
  • DelayTrainer execution timeline is hard to trace

Backtest Performance Analysis

  • Exchange order execution timing not captured
  • Executor decision-making latency invisible
  • Nested executor scenarios (NestedExecutor) are especially hard to debug

Online Serving Observability

  • OnlineManager model update cycles lack monitoring
  • Rolling training (qlib/contrib/rolling/) progress tracking is limited

2. Related Works

OpenTelemetry Python — a vendor-neutral standard for emitting logs, metrics, and traces from Python applications; its API/SDK split would let Qlib expose instrumentation hooks without forcing a heavy runtime dependency on users.


3. Important Information

Current infrastructure to build upon

  • qlib/log.py: QlibLogger, TimeInspector, get_module_logger
  • qlib/workflow/recorder.py: log_metrics() for experiment metrics
  • qlib/config.py: logging_config for log configuration

Proposed Solution

Phase 1: Enhanced Logging (Low effort, High value)

Extend qlib/log.py to support structured logging.

Example usage

from qlib.log import get_module_logger

logger = get_module_logger("data.handler", structured=True)
logger.info("Dataset loaded", extra={
   "dataset_size": len(dataset),
   "features_count": 158,
   "time_range": "2020-01-01 to 2023-12-31",
   "cache_hit": True
})

Configuration via qlib.init()

qlib.init(
    provider_uri="~/.qlib/qlib_data/cn_data",
    logging_config={
        "structured": True,
        "format": "json",  # or "console"
    }
)
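As a sketch of what the `structured=True` path could look like internally, the snippet below extends the standard library's `logging.Formatter` to emit JSON and to pick up the fields passed through `extra`. `StructuredFormatter` is a hypothetical name, not an existing Qlib class:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """Hypothetical sketch: render each record as one JSON object,
    merging any fields supplied via logger.info(..., extra={...})."""

    # attributes present on every LogRecord; anything beyond these came from `extra`
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # copy over the structured fields that `extra` attached to the record
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload, default=str)

logger = logging.getLogger("data.handler")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Dataset loaded", extra={"features_count": 158, "cache_hit": True})
```

The `format: "console"` option would simply keep a human-readable formatter on the same handler.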

Phase 2: Performance Metrics Collection

Add optional metrics collection to key components.

Example (data layer)

# In qlib/data/data.py (sketch — get_memory_usage and cache_hit_count are illustrative)
class LocalDatasetProvider:
    def dataset(self, *args, **kwargs):
        with MetricsCollector.timer("data.dataset.load_time"):
            # ... existing dataset-loading logic ...
            MetricsCollector.gauge("data.dataset.memory_mb", get_memory_usage())
            MetricsCollector.counter("data.dataset.cache_hits", cache_hit_count)

Expose metrics via

  • Prometheus-compatible endpoint (optional)
  • R.log_metrics() integration for experiment correlation
  • Console summary at workflow end
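`MetricsCollector` does not exist in Qlib today; a minimal in-process sketch of the `timer`/`gauge`/`counter` API used above (single-threaded, no export backend) could look like:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsCollector:
    """Hypothetical collector: accumulates metrics in process memory.
    A real version would be thread-safe and pluggable into exporters."""

    _timers = defaultdict(list)   # name -> list of elapsed seconds
    _gauges = {}                  # name -> last observed value
    _counters = defaultdict(int)  # name -> running total

    @classmethod
    @contextmanager
    def timer(cls, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            cls._timers[name].append(time.perf_counter() - start)

    @classmethod
    def gauge(cls, name, value):
        cls._gauges[name] = value

    @classmethod
    def counter(cls, name, delta=1):
        cls._counters[name] += delta

    @classmethod
    def summary(cls):
        # basis for the console summary printed at workflow end
        return {
            "timers": {k: sum(v) for k, v in cls._timers.items()},
            "gauges": dict(cls._gauges),
            "counters": dict(cls._counters),
        }

with MetricsCollector.timer("data.dataset.load_time"):
    MetricsCollector.gauge("data.dataset.memory_mb", 512)
    MetricsCollector.counter("data.dataset.cache_hits", 3)
```

The Prometheus and `R.log_metrics()` paths would read from the same accumulated state, so instrumented call sites never need to know which exporter is active.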

Phase 3: Workflow Tracing (Optional)

Add context propagation for complex workflows.

Automatic span creation for key operations

with R.start(experiment_name="test"):
    # trace_id automatically propagated
    dataset = init_instance_by_config(task["dataset"])  # span: dataset.init
    model.fit(dataset)  # span: model.fit
    backtest(...)  # span: backtest.execute

Configuration Design

# workflow_config.yaml
qlib_init:
  provider_uri: "~/.qlib/qlib_data/cn_data"
  observability:
    enabled: true
    structured_logging: true
    metrics:
      enabled: true
      export: "prometheus"  # or "console", "mlflow"
    tracing:
      enabled: false  # opt-in for advanced users
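One way the config loader might apply these defaults is a recursive merge, so users only override the keys they care about. `merge_observability` is illustrative, not an existing Qlib function:

```python
# hypothetical defaults matching the YAML block above
DEFAULTS = {
    "enabled": False,
    "structured_logging": False,
    "metrics": {"enabled": False, "export": "console"},
    "tracing": {"enabled": False},
}

def merge_observability(user_cfg):
    """Recursively overlay user settings on DEFAULTS; unspecified keys keep defaults."""
    def merge(base, override):
        out = dict(base)
        for key, value in (override or {}).items():
            out[key] = merge(base.get(key, {}), value) if isinstance(value, dict) else value
        return out
    return merge(DEFAULTS, user_cfg)
```

For example, passing only `{"metrics": {"enabled": True}}` would enable metrics while leaving `export` at its `"console"` default.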

Alternatives

  • Keep current approach: Use TimeInspector.logt() manually — lacks structured data and aggregation
  • External APM tools: Requires significant integration effort and may not understand Qlib-specific semantics
  • MLflow-only: Already integrated but focused on experiment tracking, not system observability

Additional Notes

Backward Compatibility

  • All features opt-in via configuration
  • Default behavior unchanged
  • Zero overhead when disabled

Implementation Priority

  1. Structured logging in qlib/log.py (1–2 PRs)
  2. Key metrics in qlib/data/ and qlib/model/trainer.py (2–3 PRs)
  3. Backtest metrics in qlib/backtest/ (1–2 PRs)
  4. Tracing (future, based on community feedback)

Affected Modules

  • qlib/log.py — Core changes
  • qlib/config.py — New configuration options
  • qlib/data/data.py, qlib/data/cache.py — Data layer metrics
  • qlib/model/trainer.py — Training metrics
  • qlib/backtest/exchange.py, qlib/backtest/executor.py — Backtest metrics

Are you willing to submit a PR?
