
enhance observability with structured logging and metrics #2098

@guoer9

Description

[Feature] Enhance Qlib Observability: Structured Logging, Metrics, and Workflow Tracing

Enhance Qlib's observability infrastructure to support structured logging, performance metrics collection, and workflow tracing.

Currently, Qlib uses basic Python logging via get_module_logger and TimeInspector for timing. This proposal aims to build upon the existing infrastructure to provide comprehensive observability capabilities that help users monitor, debug, and optimize their quantitative research workflows.


Motivation

1. Application Scenarios

Data Pipeline Debugging

When processing large datasets through the DatasetH → DataHandlerLP → Processor chain, it's difficult to identify bottlenecks:

  • Cache hit/miss rates (ExpressionCache, DatasetCache) are not exposed
  • Memory usage during D.features() calls is invisible

Model Training Monitoring

TrainerR / TrainerRM lack visibility into per-epoch resource consumption:

  • No metrics for comparing training efficiency across different models in qlib/contrib/model/
  • DelayTrainer execution timeline is hard to trace

Backtest Performance Analysis

  • Exchange order execution timing not captured
  • Executor decision-making latency invisible
  • Nested executor scenarios (NestedExecutor) are especially hard to debug

Online Serving Observability

  • OnlineManager model update cycles lack monitoring
  • Rolling training (qlib/contrib/rolling/) progress tracking is limited

2. Related Works

OpenTelemetry Python — a vendor-neutral standard for emitting logs, metrics, and traces from Python applications; its API/SDK split would let Qlib expose instrumentation hooks without forcing a heavy runtime dependency on users.


3. Important Information

Current infrastructure to build upon

  • qlib/log.py: QlibLogger, TimeInspector, get_module_logger
  • qlib/workflow/recorder.py: log_metrics() for experiment metrics
  • qlib/config.py: logging_config for log configuration

Proposed Solution

Phase 1: Enhanced Logging (Low effort, High value)

Extend qlib/log.py to support structured logging.

Example usage

from qlib.log import get_module_logger

logger = get_module_logger("data.handler", structured=True)
logger.info("Dataset loaded", extra={
   "dataset_size": len(dataset),
   "features_count": 158,
   "time_range": "2020-01-01 to 2023-12-31",
   "cache_hit": True
})

Configuration via qlib.init()

qlib.init(
    provider_uri="~/.qlib/qlib_data/cn_data",
    logging_config={
        "structured": True,
        "format": "json",  # or "console"
    }
)
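As a sketch of what the `structured=True` path could look like internally, the snippet below extends the standard library's `logging.Formatter` to emit JSON and to pick up the fields passed through `extra`. `StructuredFormatter` is a hypothetical name, not an existing Qlib class:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """Hypothetical sketch: render each record as one JSON object,
    merging any fields supplied via logger.info(..., extra={...})."""

    # attributes present on every LogRecord; anything beyond these came from `extra`
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # copy over the structured fields that `extra` attached to the record
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload, default=str)

logger = logging.getLogger("data.handler")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Dataset loaded", extra={"features_count": 158, "cache_hit": True})
```

The `format: "console"` option would simply keep a human-readable formatter on the same handler.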

Phase 2: Performance Metrics Collection

Add optional metrics collection to key components.

Example (data layer)

# In qlib/data/data.py (sketch — get_memory_usage and cache_hit_count are illustrative)
class LocalDatasetProvider:
    def dataset(self, *args, **kwargs):
        with MetricsCollector.timer("data.dataset.load_time"):
            # ... existing dataset-loading logic ...
            MetricsCollector.gauge("data.dataset.memory_mb", get_memory_usage())
            MetricsCollector.counter("data.dataset.cache_hits", cache_hit_count)

Expose metrics via

  • Prometheus-compatible endpoint (optional)
  • R.log_metrics() integration for experiment correlation
  • Console summary at workflow end
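`MetricsCollector` does not exist in Qlib today; a minimal in-process sketch of the `timer`/`gauge`/`counter` API used above (single-threaded, no export backend) could look like:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsCollector:
    """Hypothetical collector: accumulates metrics in process memory.
    A real version would be thread-safe and pluggable into exporters."""

    _timers = defaultdict(list)   # name -> list of elapsed seconds
    _gauges = {}                  # name -> last observed value
    _counters = defaultdict(int)  # name -> running total

    @classmethod
    @contextmanager
    def timer(cls, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            cls._timers[name].append(time.perf_counter() - start)

    @classmethod
    def gauge(cls, name, value):
        cls._gauges[name] = value

    @classmethod
    def counter(cls, name, delta=1):
        cls._counters[name] += delta

    @classmethod
    def summary(cls):
        # basis for the console summary printed at workflow end
        return {
            "timers": {k: sum(v) for k, v in cls._timers.items()},
            "gauges": dict(cls._gauges),
            "counters": dict(cls._counters),
        }

with MetricsCollector.timer("data.dataset.load_time"):
    MetricsCollector.gauge("data.dataset.memory_mb", 512)
    MetricsCollector.counter("data.dataset.cache_hits", 3)
```

The Prometheus and `R.log_metrics()` paths would read from the same accumulated state, so instrumented call sites never need to know which exporter is active.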

Phase 3: Workflow Tracing (Optional)

Add context propagation for complex workflows.

Automatic span creation for key operations

with R.start(experiment_name="test"):
    # trace_id automatically propagated
    dataset = init_instance_by_config(task["dataset"])  # span: dataset.init
    model.fit(dataset)  # span: model.fit
    backtest(...)  # span: backtest.execute

Configuration Design

# workflow_config.yaml
qlib_init:
  provider_uri: "~/.qlib/qlib_data/cn_data"
  observability:
    enabled: true
    structured_logging: true
    metrics:
      enabled: true
      export: "prometheus"  # or "console", "mlflow"
    tracing:
      enabled: false  # opt-in for advanced users
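One way the config loader might apply these defaults is a recursive merge, so users only override the keys they care about. `merge_observability` is illustrative, not an existing Qlib function:

```python
# hypothetical defaults matching the YAML block above
DEFAULTS = {
    "enabled": False,
    "structured_logging": False,
    "metrics": {"enabled": False, "export": "console"},
    "tracing": {"enabled": False},
}

def merge_observability(user_cfg):
    """Recursively overlay user settings on DEFAULTS; unspecified keys keep defaults."""
    def merge(base, override):
        out = dict(base)
        for key, value in (override or {}).items():
            out[key] = merge(base.get(key, {}), value) if isinstance(value, dict) else value
        return out
    return merge(DEFAULTS, user_cfg)
```

For example, passing only `{"metrics": {"enabled": True}}` would enable metrics while leaving `export` at its `"console"` default.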

Alternatives

  • Keep current approach: Use TimeInspector.logt() manually — lacks structured data and aggregation
  • External APM tools: Requires significant integration effort and may not understand Qlib-specific semantics
  • MLflow-only: Already integrated but focused on experiment tracking, not system observability

Additional Notes

Backward Compatibility

  • All features opt-in via configuration
  • Default behavior unchanged
  • Zero overhead when disabled

Implementation Priority

  1. Structured logging in qlib/log.py (1–2 PRs)
  2. Key metrics in qlib/data/ and qlib/model/trainer.py (2–3 PRs)
  3. Backtest metrics in qlib/backtest/ (1–2 PRs)
  4. Tracing (future, based on community feedback)

Affected Modules

  • qlib/log.py — Core changes
  • qlib/config.py — New configuration options
  • qlib/data/data.py, qlib/data/cache.py — Data layer metrics
  • qlib/model/trainer.py — Training metrics
  • qlib/backtest/exchange.py, qlib/backtest/executor.py — Backtest metrics

Are you willing to submit a PR?
