Description
[Feature] Enhance Qlib Observability: Structured Logging, Metrics, and Workflow Tracing
Enhance Qlib's observability infrastructure to support structured logging, performance metrics collection, and workflow tracing.
Currently, Qlib uses basic Python logging via get_module_logger and TimeInspector for timing. This proposal aims to build upon the existing infrastructure to provide comprehensive observability capabilities that help users monitor, debug, and optimize their quantitative research workflows.
Motivation
1. Application Scenarios
Data Pipeline Debugging
When processing large datasets through the DatasetH → DataHandlerLP → Processor chain, it's difficult to identify bottlenecks:
- Cache hit/miss rates (ExpressionCache, DatasetCache) are not exposed
- Memory usage during D.features() calls is invisible
Model Training Monitoring
TrainerR / TrainerRM lack visibility into per-epoch resource consumption:
- No metrics for comparing training efficiency across different models in qlib/contrib/model/
- DelayTrainer execution timeline is hard to trace
Backtest Performance Analysis
- Exchange order execution timing not captured
- Executor decision-making latency invisible
- Nested executor scenarios (NestedExecutor) are especially hard to debug
Online Serving Observability
- OnlineManager model update cycles lack monitoring
- Rolling training (qlib/contrib/rolling/) progress tracking is limited
2. Related Works
OpenTelemetry Python
3. Important Information
Current infrastructure to build upon
- qlib/log.py: QlibLogger, TimeInspector, get_module_logger
- qlib/workflow/recorder.py: log_metrics() for experiment metrics
- qlib/config.py: logging_config for log configuration
Proposed Solution
Phase 1: Enhanced Logging (Low effort, High value)
Extend qlib/log.py to support structured logging.
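As a rough illustration of what "structured" could mean here, the sketch below uses only the standard library: a `logging.Formatter` subclass that renders each record as a single JSON object and copies any fields passed through `extra`. The `JsonFormatter` class, the `demo` helper, and the logger name are assumptions made for this example; they are not existing Qlib APIs.

```python
import io
import json
import logging

# Attributes present on every LogRecord; anything else arrived via `extra`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy user-supplied `extra` fields into the payload.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-JSON types (dates, paths) from raising.
        return json.dumps(payload, default=str)


def demo() -> dict:
    """Log one record into a buffer and return the parsed JSON."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("qlib.demo.structured")  # illustrative name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    logger.info("Dataset loaded", extra={"features_count": 158, "cache_hit": True})
    logger.removeHandler(handler)
    return json.loads(stream.getvalue())
```

Because the formatter is attached to a handler rather than to the logger, existing call sites would not need to change; only the handler configuration would differ between JSON and console output.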
Example usage
from qlib.log import get_module_logger

logger = get_module_logger("data.handler", structured=True)
logger.info("Dataset loaded", extra={
    "dataset_size": len(dataset),
    "features_count": 158,
    "time_range": "2020-01-01 to 2023-12-31",
    "cache_hit": True,
})
Configuration via qlib.init()
qlib.init(
    provider_uri="~/.qlib/qlib_data/cn_data",
    logging_config={
        "structured": True,
        "format": "json",  # or "console"
    },
)
Phase 2: Performance Metrics Collection
Add optional metrics collection to key components.
Example (data layer)
# In qlib/data/data.py
class LocalDatasetProvider:
    def dataset(self, ...):
        with MetricsCollector.timer("data.dataset.load_time"):
            # existing logic
            MetricsCollector.gauge("data.dataset.memory_mb", get_memory_usage())
            MetricsCollector.counter("data.dataset.cache_hits", cache_hit_count)
Expose metrics via
- Prometheus-compatible endpoint (optional)
- R.log_metrics() integration for experiment correlation
- Console summary at workflow end
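To make the `timer` / `gauge` / `counter` interface more concrete, here is a minimal in-process sketch. `MetricsCollector` does not exist in Qlib today; everything below is an assumption for illustration, and it uses an instance rather than the class-level calls shown in the data-layer example, purely to keep the sketch easy to test.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsCollector:
    """Hypothetical in-process collector; not part of Qlib."""

    def __init__(self):
        self.timings = defaultdict(list)  # name -> durations in seconds
        self.gauges = {}                  # name -> last observed value
        self.counters = defaultdict(int)  # name -> running total

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)

    def gauge(self, name, value):
        self.gauges[name] = value

    def counter(self, name, increment=1):
        self.counters[name] += increment

    def summary(self):
        """Flatten all metrics into a {name: float} dict."""
        flat = {f"{k}.total_s": sum(v) for k, v in self.timings.items()}
        flat.update({k: float(v) for k, v in self.gauges.items()})
        flat.update({k: float(v) for k, v in self.counters.items()})
        return flat
```

A flat name-to-float dict of this shape could serve as the console summary, and is probably also the easiest form to hand to an experiment recorder or a Prometheus exporter.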
Phase 3: Workflow Tracing (Optional)
Add context propagation for complex workflows.
Automatic span creation for key operations
with R.start(experiment_name="test"):
    # trace_id automatically propagated
    dataset = init_instance_by_config(task["dataset"])  # span: dataset.init
    model.fit(dataset)                                  # span: model.fit
    backtest(...)                                       # span: backtest.execute
Configuration Design
# workflow_config.yaml
qlib_init:
  provider_uri: "~/.qlib/qlib_data/cn_data"
  observability:
    enabled: true
    structured_logging: true
    metrics:
      enabled: true
      export: "prometheus"  # or "console", "mlflow"
    tracing:
      enabled: false  # opt-in for advanced users
Alternatives
- Keep current approach: Use TimeInspector.logt() manually; this lacks structured data and aggregation
- External APM tools: Require significant integration effort and may not understand Qlib-specific semantics
- MLflow-only: Already integrated but focused on experiment tracking, not system observability
Additional Notes
Backward Compatibility
- All features opt-in via configuration
- Default behavior unchanged
- Zero overhead when disabled
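One way the disabled path could be kept close to free is to return a shared no-op context manager instead of building a timer. The `Observability` class below is a hypothetical sketch, not a Qlib API. Strictly speaking, a method call and a branch still run when disabled, so "zero" here means negligible rather than literally none.

```python
import time
from contextlib import contextmanager, nullcontext

_NOOP = nullcontext()  # shared, reusable no-op context manager


class Observability:
    """Hypothetical on/off switch; names are illustrative only."""

    def __init__(self, enabled=False):
        self.enabled = enabled
        self.timings = {}  # name -> last duration in seconds

    def timer(self, name):
        # Disabled path: one attribute check, no allocation, no clock reads.
        if not self.enabled:
            return _NOOP
        return self._timed(name)

    @contextmanager
    def _timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - start
```

With this shape, instrumented call sites look identical whether observability is on or off, which keeps default behavior unchanged.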
Implementation Priority
- Structured logging in qlib/log.py (1–2 PRs)
- Key metrics in qlib/data/ and qlib/model/trainer.py (2–3 PRs)
- Backtest metrics in qlib/backtest/ (1–2 PRs)
- Tracing (future, based on community feedback)
Affected Modules
- qlib/log.py: Core changes
- qlib/config.py: New configuration options
- qlib/data/data.py, qlib/data/cache.py: Data layer metrics
- qlib/model/trainer.py: Training metrics
- qlib/backtest/exchange.py, qlib/backtest/executor.py: Backtest metrics