91 changes: 91 additions & 0 deletions .cursor/rules/guidelines.mdc
@@ -0,0 +1,91 @@
---
description: Core coding guidelines and best practices for refinery-ml-exec-env
alwaysApply: true
---

# Core Coding Guidelines

## Project Overview

This is a containerized execution environment for active transfer learning in refinery. The codebase processes embeddings and labels, trains ML models (classification and extraction), and performs inference using scikit-learn and sequence-learn. It runs as a containerized function-as-a-service that receives input data via HTTP and returns ML predictions.

## General Principles

### Code Quality
- **Readability over cleverness**: Code is read far more often than written
- **Explicit over implicit**: Make intentions clear through naming and structure
- **Fail fast**: Validate inputs early and raise clear errors
- **Single responsibility**: Functions should do one thing well

### Error Handling
- Use specific exception types, not bare `except Exception`
- Preserve traceback information when re-raising exceptions
- Provide clear, actionable error messages with context
- For containerized execution, print error messages to stdout/stderr with `flush=True`
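
A minimal sketch of these points, assuming a hypothetical `load_payload` helper (function name, fields, and messages are illustrative, not taken from the codebase):

```python
import json


def load_payload(raw: str) -> dict:
    """Parse the HTTP request body, failing fast with a clear, chained error."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:  # specific exception, not bare Exception
        # Print for the container logs, then re-raise with the traceback preserved.
        print(f"Could not parse request body: {exc}", flush=True)
        raise ValueError("Request body is not valid JSON.") from exc
    if "embeddings" not in payload:
        raise KeyError("Missing required field 'embeddings' in request body.")
    return payload
```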

### Naming Conventions
- Functions and variables: `snake_case`
- Constants: `UPPER_SNAKE_CASE` (e.g., `CONSTANT__OUTSIDE`)
- Classes: `PascalCase`
- Private functions: Prefix with `__` (double underscore) only when truly private
- Avoid magic suffixes like `_a2vybg` in new code - use descriptive names

### Type Hints
- Use type hints for all function signatures
- Prefer modern Python 3.10+ syntax when possible:
- `str | None` instead of `Optional[str]` (if Python 3.10+)
- `list[str]` instead of `List[str]` (if Python 3.10+)
- `dict[str, int]` instead of `Dict[str, int]` (if Python 3.10+)
- Use `Any` sparingly - prefer specific types or protocols
- Use `object` instead of `Any` when accepting any object
- Note: Current codebase uses `typing` module imports (`List`, `Dict`, `Tuple`, `Optional`) for compatibility
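
For illustration, the same signature in both styles (the function itself is hypothetical):

```python
from typing import Dict, List, Optional  # style currently used in the codebase


def select_labels_legacy(names: Optional[List[str]], counts: Dict[str, int]) -> List[str]:
    return [n for n in (names or []) if counts.get(n, 0) > 0]


# Equivalent Python 3.10+ style:
def select_labels(names: list[str] | None, counts: dict[str, int]) -> list[str]:
    return [n for n in (names or []) if counts.get(n, 0) > 0]
```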

### Global Variables
- **Avoid global variables** when possible
- If globals are necessary (e.g., runtime configuration), document why
- Consider using a configuration class or dataclass instead
- Group related globals into a single namespace object
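
A sketch of the dataclass alternative; the field names are examples, not the project's actual configuration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Runtime configuration passed explicitly instead of module-level globals."""

    information_source_id: str
    min_confidence: float = 0.8
    train_test_split: float = 0.5


config = RunConfig(information_source_id="abc-123")
```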

### Progress Reporting
- **Use print statements with `flush=True`** for containerized execution environments
- Format: `print("progress: X", flush=True)` where X is a float between 0.0 and 1.0
- Progress messages are consumed by the container orchestrator
- Include informative messages: `print("Preparing data for machine learning.", flush=True)`
- Report progress at key milestones (e.g., 0.05, 0.5, 0.8, 0.9, 0.95, 1.0)
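
A small helper along these lines can keep the format consistent (the helper itself is hypothetical; only the `progress: X` message format is fixed by the orchestrator):

```python
def report_progress(fraction: float) -> None:
    """Emit a progress line the container orchestrator can parse."""
    print(f"progress: {min(max(fraction, 0.0), 1.0)}", flush=True)


report_progress(0.05)
print("Preparing data for machine learning.", flush=True)
```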

### Magic Numbers and Strings
- Extract magic numbers to named constants
- Use enums or constants for string literals that represent states/types
- Document the meaning of non-obvious values
- Example: `CONSTANT__OUTSIDE = "OUTSIDE"` - enum from graphql-gateway
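
As a sketch, related string states can also be grouped in an `Enum` (the member values below are illustrative, not taken from the codebase):

```python
from enum import Enum

CONSTANT__OUTSIDE = "OUTSIDE"  # enum from graphql-gateway
DEFAULT_MIN_CONFIDENCE = 0.8  # named default instead of an inline 0.8


class TaskType(str, Enum):
    """Task types handled by this environment (string values are illustrative)."""

    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
```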

### Function Length
- Keep functions focused and under 50 lines when possible
- Extract complex logic into helper functions
- Use descriptive function names that explain purpose

### Comments
- Write self-documenting code first
- Add comments for "why", not "what"
- Document complex algorithms or business logic
- Keep comments up-to-date with code changes
- Document external dependencies (e.g., "enum from graphql-gateway")

## ML-Specific Guidelines

### Model Persistence
- Save trained models using `pickle` to the `/inference` directory when it exists
- Use descriptive filenames: `f"active-learner-{information_source_id}.pkl"`
- Check directory existence before writing: `if os.path.exists("/inference")`

### Data Processing
- Use pandas DataFrames for structured data manipulation
- Use numpy arrays for numerical computations and embeddings
- Handle ragged arrays appropriately (e.g., variable-length sequences)
- Transform embeddings and labels according to task type (classification vs extraction)
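
For example, token-level embeddings for extraction are ragged (a different token count per record), so they cannot be stacked into one rectangular array (the shapes below are illustrative):

```python
import numpy as np

# Two records with different token counts - a single rectangular np.array() is not possible.
token_embeddings = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # record with 3 tokens
    [[0.7, 0.8]],                          # record with 1 token
]

# Keep a list of per-record arrays instead of forcing one ndarray.
per_record = [np.asarray(tokens, dtype=float) for tokens in token_embeddings]
token_counts = [arr.shape[0] for arr in per_record]  # [3, 1]
```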

### Active Transfer Learning
- Implement `BaseModel` abstract base class for ML models
- Use `fit_predict` pattern for training and inference
- Support both classification (`ATLClassifier`) and extraction (`ATLExtractor`) tasks
- Filter predictions by confidence threshold (`min_confidence`) and label names
271 changes: 271 additions & 0 deletions .cursor/rules/ml-patterns.mdc
@@ -0,0 +1,271 @@
---
description: Machine learning patterns and best practices for active transfer learning
globs: **/*.py
alwaysApply: false
---

# ML Patterns and Best Practices

## Active Transfer Learning Architecture

### Base Model Pattern
All ML models should inherit from `BaseModel` and implement the required abstract methods:

```python
# ✅ GOOD - Proper BaseModel implementation
from abc import ABC, abstractmethod
from . import util

class BaseModel(ABC):
    @abstractmethod
    def __init__(self):
        super().__init__()
        self.embedding_name = None
        self.min_confidence = None
        self.label_names = None

    @abstractmethod
    def fit(self, embeddings, labels):
        """Train the model on embeddings and labels."""
        pass

    @abstractmethod
    def predict_proba(self, embeddings):
        """Return prediction probabilities."""
        pass

    def fit_predict(self, embeddings, labels, records_ids, training_ids):
        """Fit model and return predictions for all records."""
        self.records_ids = records_ids
        self.training_ids = training_ids
        self.fit(embeddings, labels)
        predictions = self.predict_proba(embeddings)
        if self.label_names is None:
            self.label_names = self.model.classes_
        return predictions
```

### Classification vs Extraction
- **Classification**: Single label per record (`ATLClassifier`)
- **Extraction**: Sequence labeling with multiple spans per record (`ATLExtractor`)
- Determine task type: `is_extractor = any([isinstance(val, list) for val in corpus_labels["manual"]])`

## Data Processing Patterns

### Embedding Handling
```python
# ✅ GOOD - Handle different embedding types
if embedding_type == "ON_ATTRIBUTE":
    # Single embedding per record: [num_records x embedding_dim]
    embeddings = {
        embedding_name: [
            [float(y) for y in x[1:-1].split(", ")]
            for x in embedding_df.data
            if x != "data"
        ]
    }
else:
    # Multiple embeddings per record: [num_records x num_tokens x embedding_dim]
    embeddings = {
        embedding_name: [
            [[float(z) for z in y.split(", ")] for y in x[2:-2].split("], [")]
            for x in embedding_df.data
            if x != "data"
        ]
    }
```

### Training Data Filtering
```python
# ✅ GOOD - Filter training data by training_ids
def transform_corpus_classification_fit(
    embeddings, labels_training, record_ids, training_ids
):
    """Filter embeddings and labels to only training examples."""
    training_mask = [True if id in training_ids else False for id in record_ids]
    labels_training = np.array(labels_training)[training_mask]
    embedding_training = np.array(embeddings)[training_mask]
    return embedding_training, labels_training
```

### Label Vector Construction (Extraction)
```python
# ✅ GOOD - Build label vectors for sequence labeling
def build_label_vector(embedding_length: int, label_annotations: pd.DataFrame) -> list:
    """Create label vector with OUTSIDE tokens."""
    label_vector = np.full([embedding_length], None)
    for _, row in label_annotations.iterrows():
        for token_idx in row.token_list:
            label_vector[token_idx] = row.label_name
    # Fill None values with OUTSIDE
    np.place(label_vector, label_vector == None, CONSTANT__OUTSIDE)
    return label_vector.tolist()
```

## Model Persistence

### Saving Models
```python
# ✅ GOOD - Save trained models with descriptive names
if os.path.exists("/inference"):
    pickle_path = os.path.join(
        "/inference", f"active-learner-{information_source_id}.pkl"
    )
    with open(pickle_path, "wb") as f:
        pickle.dump(classifier, f)
    print("Saved model to disk", flush=True)

# ❌ BAD - Hardcoded paths or missing directory check
pickle.dump(classifier, open("model.pkl", "wb")) # No directory check
```

## Prediction Filtering

### Confidence and Label Filtering
```python
# ✅ GOOD - Filter predictions by confidence and valid labels
ml_results_by_record_id = {}
for record_id, (probability, prediction) in zip(
    corpus_ids, predictions_with_probabilities
):
    if (
        probability > classifier.min_confidence
        and prediction in classifier.label_names
    ):
        ml_results_by_record_id[record_id] = (probability, prediction)

# ❌ BAD - No filtering
ml_results_by_record_id[record_id] = (probability, prediction) # Includes low-confidence predictions
```

### Extraction Result Processing
```python
# ✅ GOOD - Process extraction predictions with span detection
df = pd.DataFrame(
    list(zip(prediction, probability)),
    columns=["prediction", "probability"],
)
df["next"] = df["prediction"].shift(-1)
predictions_with_probabilities = []
new_start_idx = True

for idx, row in df.loc[
    (df.prediction != CONSTANT__OUTSIDE)
    & (df.prediction.isin(extractor.label_names))
    & (df.probability > extractor.min_confidence)
].iterrows():
    if new_start_idx:
        start_idx = idx
        new_start_idx = False
    if row.prediction != row.next:
        prob = df.loc[start_idx:idx].probability.mean()
        end_idx = idx + 1
        predictions_with_probabilities.append(
            [float(prob), row.prediction, start_idx, end_idx]
        )
        new_start_idx = True
```

## Data Transformation Utilities

### Corpus Transformation Functions
- `transform_corpus_classification_fit`: Filter training data for classification
- `transform_corpus_extraction_fit`: Build label vectors and filter for extraction
- `transform_corpus_classification_inference`: Convert embeddings to numpy array
- `transform_corpus_extraction_inference`: Pass through embeddings (already in correct format)
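
Based on the descriptions above, the inference-side transforms are thin wrappers; a sketch (the exact signatures in the codebase may differ):

```python
import numpy as np


def transform_corpus_classification_inference(embeddings):
    """Classification: convert the embedding list to a dense numpy array."""
    return np.array(embeddings)


def transform_corpus_extraction_inference(embeddings):
    """Extraction: embeddings are already per-token lists, so pass them through."""
    return embeddings
```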

### Pandas DataFrame Operations
```python
# ✅ GOOD - Use pandas for structured data manipulation
df = pd.DataFrame(
    list(zip(prediction, probability)),
    columns=["prediction", "probability"],
)
df["next"] = df["prediction"].shift(-1)  # Look ahead for span detection

# ✅ GOOD - Disable chained assignment warnings when intentional
pd.options.mode.chained_assignment = None  # default='warn'
```

## Error Handling for ML Operations

### Embedding Parsing Errors
```python
# ✅ GOOD - Clear error messages for data parsing failures
try:
    embeddings = parse_embeddings(embedding_df)
except Exception as exc:
    print("Can't parse the embedding. Please contact the support.", flush=True)
    raise ValueError("Can't parse the embedding. Please contact the support.") from exc

# ❌ BAD - Silent failures or vague errors
try:
    embeddings = parse_embeddings(embedding_df)
except Exception:
    pass  # Swallows error
```

### Empty Prediction Results
```python
# ✅ GOOD - Informative message when no predictions
if len(ml_results_by_record_id) == 0:
    print(
        "No records were predicted. Try lowering the confidence threshold.",
        flush=True,
    )

# ❌ BAD - Silent failure
if len(ml_results_by_record_id) == 0:
    pass  # No feedback to user
```

## Progress Reporting for ML Operations

### Progress Milestones
Report progress at key stages:
- `0.05`: Initialization complete
- `0.5`: Model training complete (for extraction)
- `0.8`: Model training complete (for classification)
- `0.9`: Post-processing complete
- `0.95`: Results filtering complete
- `1.0`: Final results ready

```python
# ✅ GOOD - Progress reporting at milestones
print("progress: 0.05", flush=True)
classifier = ATLClassifier()
prediction_probabilities = classifier.fit_predict(...)
print("progress: 0.8", flush=True)
# ... post-processing ...
print("progress: 0.9", flush=True)
# ... filtering ...
print("progress: 0.95", flush=True)
print("progress: 1", flush=True)
```

## Decorator Patterns

### Parametrized Decorators
The codebase uses parametrized decorators for model configuration:

```python
from typing import Callable, List, Optional


@parametrized
def params_fit(function: Callable, embedding_name: str, train_test_split: float):
    """Decorator to prepare training data before fit."""
    # Transforms corpus embeddings and labels
    # Filters to training_ids
    # Calls original function with transformed data
    pass


@parametrized
def params_inference(
    function: Callable,
    label_names: Optional[List[str]] = None,
    min_confidence: float = 0.8,
):
    """Decorator to prepare inference data and set model parameters."""
    # Transforms corpus embeddings for inference
    # Sets min_confidence and label_names
    # Calls original function with transformed embeddings
    pass
```
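
`parametrized` itself is usually the standard decorator-of-decorators recipe; a sketch of how such a helper is commonly written (the project's own implementation may differ):

```python
import functools
from typing import Callable


def parametrized(decorator: Callable) -> Callable:
    """Let a decorator accept its own arguments before receiving the function."""

    @functools.wraps(decorator)
    def layer(*args, **kwargs):
        def apply(function: Callable):
            return decorator(function, *args, **kwargs)

        return apply

    return layer
```

With this recipe, `@params_fit(embedding_name, train_test_split)` first captures the configuration arguments and then receives the undecorated `fit` function.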