Commit 1aab217

Cursor files refinery-embedder (#184)
* Cursor files refinery-embedder
* Fix rules
* submodules merge
1 parent f8750e0 commit 1aab217

10 files changed

+754
-2
lines changed

.cursor/rules/api-models.mdc

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
description: Rules for Pydantic models and request/response validation
globs: ["src/data/data_type.py"]
alwaysApply: true
---

# API Models Guidelines

Pydantic models validate request bodies and ensure type safety. Models are defined in `src/data/data_type.py`.

## Model Definition

**Basic structure:**
```python
from typing import Dict, List, Any
from pydantic import BaseModel

class EmbeddingRequest(BaseModel):
    project_id: str
    embedding_id: str

class EmbeddingRebuildRequest(BaseModel):
    # example request structure:
    # {"<embedding_id>":[{"record_id":"<record_id>","attribute_name":"<attribute_name>","sub_key":<sub_key>}]}
    # note that sub_key is optional and only relevant for embedding lists
    # also, sub_key is an int but is converted to a string in the request
    changes: Dict[str, List[Dict[str, Any]]]

class EmbeddingCalcTensorByPkl(BaseModel):
    texts: List[str]
```
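
The comments on `EmbeddingRebuildRequest` describe a `changes` payload keyed by embedding ID, in which `sub_key` is optional and arrives as a string. A minimal sketch of how a consumer might flatten and normalize such a payload, using only the standard library; the helper name `normalize_changes` is hypothetical and not part of the repository:

```python
from typing import Any, Dict, List, Optional, Tuple


# Hypothetical helper (not from the repository): flattens a `changes` payload
# into (embedding_id, record_id, attribute_name, sub_key) tuples.
def normalize_changes(
    changes: Dict[str, List[Dict[str, Any]]]
) -> List[Tuple[str, str, str, Optional[int]]]:
    rows = []
    for embedding_id, entries in changes.items():
        for entry in entries:
            # sub_key is optional and string-encoded; convert it back to int
            raw_sub_key = entry.get("sub_key")
            sub_key = int(raw_sub_key) if raw_sub_key is not None else None
            rows.append(
                (embedding_id, entry["record_id"], entry["attribute_name"], sub_key)
            )
    return rows
```

Entries without a `sub_key` come back with `None` in the last position, so callers can tell plain attributes from embedding-list items.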

## Naming Conventions

- Request models: `EmbeddingRequest`, `EmbeddingRebuildRequest`, `EmbeddingCalcTensorByPkl`
- Use descriptive names that indicate the operation and data type
- Use `Request` suffix for request body models

## Usage in Routes

```python
# Imports and app setup below are assumed; the original rule shows only the routes.
from fastapi import FastAPI, responses, status

import controller
from src.data import data_type

app = FastAPI()

@app.post("/embed")
def embed(request: data_type.EmbeddingRequest) -> responses.PlainTextResponse:
    status_code = controller.manage_encoding_thread(
        request.project_id, request.embedding_id
    )
    return responses.PlainTextResponse(status_code=status_code)

@app.post("/re_embed_records/{project_id}")
def re_embed_record(
    project_id: str,
    request: data_type.EmbeddingRebuildRequest
) -> responses.PlainTextResponse:
    controller.re_embed_records(project_id, request.changes)
    return responses.PlainTextResponse(status_code=status.HTTP_200_OK)
```

## Field Validation

```python
from pydantic import BaseModel, Field, field_validator

class EmbeddingRequest(BaseModel):
    project_id: str = Field(min_length=1)
    embedding_id: str = Field(min_length=1)

    @field_validator('project_id', 'embedding_id')
    @classmethod
    def validate_ids(cls, v: str) -> str:
        # Reject whitespace-only IDs and return the trimmed value
        if not v or not v.strip():
            raise ValueError('ID cannot be empty')
        return v.strip()
```

## Best Practices

1. Use standard Python types (`str`, `int`, `List`, `Dict`) - Pydantic handles validation
2. Provide defaults for optional fields using `Optional[Type] = None`
3. Use descriptive model names that indicate purpose
4. Document complex nested structures with comments
5. Use proper type hints for all fields
6. Keep models focused on request/response data structure
7. Use `Dict[str, Any]` for flexible nested structures when needed

.cursor/rules/controllers.mdc

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
---
description: Rules for controller module and business logic
globs: ["controller.py"]
alwaysApply: true
---

# Controllers Guidelines

The controller module (`controller.py`) contains business logic for embedding operations and orchestrates interactions between routes, submodules, embedders, and external services.

## Import Patterns

```python
# Submodules
from submodules.model.business_objects import (
    attribute,
    embedding,
    general,
    project,
    record,
    tokenization,
    notification,
    organization,
)
from submodules.model import enums, daemon
from submodules.s3 import controller as s3

# Embedders
from src.embedders import Transformer, util
from src.embedders.classification.contextual import (
    OpenAISentenceEmbedder,
    HuggingFaceSentenceEmbedder,
)
from src.util import request_util
from src.util.decorator import param_throttle
from src.util.embedders import get_embedder
from src.util.notification import send_project_update
```

## Function Patterns

**Async embedding operations:**
```python
from submodules.model import daemon
from fastapi import status

def manage_encoding_thread(project_id: str, embedding_id: str) -> int:
    daemon.run_without_db_token(prepare_run, project_id, embedding_id)
    return status.HTTP_200_OK
```

**Embedding lifecycle:**
```python
def delete_embedding(project_id: str, embedding_id: str) -> int:
    object_name = f"embedding_tensors_{embedding_id}.csv.bz2"
    org_id = organization.get_id_by_project_id(project_id)
    s3.delete_object(org_id, f"{project_id}/{object_name}")
    request_util.delete_embedding_from_neural_search(embedding_id)
    json_path = util.INFERENCE_DIR / project_id / f"embedder-{embedding_id}.json"
    json_path.unlink(missing_ok=True)
    return status.HTTP_200_OK
```

**Embedding state management:**
```python
# `...` marks parameters and arguments elided in the original rule.
def run_encoding(project_id: str, user_id: str, embedding_id: str, ...) -> int:
    session_token = general.get_ctx_token()
    try:
        # Update embedding state
        embedding.update_embedding_state_encoding(project_id, embedding_id, with_commit=True)
        send_project_update(project_id, f"embedding:{embedding_id}:state:{enums.EmbeddingState.ENCODING.value}")

        # Process batches
        for pair in generate_batches(...):
            embedding.create_tensors(project_id, embedding_id, record_ids_batched, tensors, with_commit=True)
            send_progress_update_throttle(project_id, embedding_id, state, initial_count)

        # Finalize
        embedding.update_embedding_state_finished(project_id, embedding_id, with_commit=True)
    finally:
        general.remove_and_refresh_session(session_token)
    return status.HTTP_200_OK
```

## Business Logic Patterns

**Batch processing:**
```python
from typing import Any, Dict, Iterator, List

# Each yielded dict is keyed by strings; a `List[str]` key type would describe
# an unhashable key, so the return annotation uses `Dict[str, List[Any]]`.
def generate_batches(
    project_id: str,
    record_ids: List[str],
    embedding_type: str,
    attribute_values_raw: List[str],
    embedder: Transformer,
    attribute_name: str,
    for_delta: bool = False,
) -> Iterator[Dict[str, List[Any]]]:
    # Process records in batches using embedder.batch_size
    # Yield batches of record_ids and embeddings
    pass
```
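
The stub above only names the batching step. A minimal, self-contained sketch of the slicing it describes could look like the following; the function name `iter_batches`, the `encode` callable, and the dictionary keys `record_ids` and `embeddings` are assumptions for illustration, not the repository's implementation:

```python
from typing import Any, Callable, Dict, Iterator, List


# Hypothetical sketch: `encode` stands in for an embedder's transform call.
def iter_batches(
    record_ids: List[str],
    values: List[str],
    encode: Callable[[List[str]], List[Any]],
    batch_size: int,
) -> Iterator[Dict[str, List[Any]]]:
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Slice IDs and values with the same window so they stay aligned
    for start in range(0, len(record_ids), batch_size):
        id_batch = record_ids[start:start + batch_size]
        value_batch = values[start:start + batch_size]
        yield {"record_ids": id_batch, "embeddings": encode(value_batch)}
```

Because this is a generator, only one batch of embeddings is held in memory at a time, which is the usual reason for yielding batches instead of returning one large list.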

**Session management:**
```python
def prepare_run(project_id: str, embedding_id: str) -> None:
    session_token = general.get_ctx_token()
    try:
        t = __prepare_encoding(project_id, embedding_id)
    finally:
        general.remove_and_refresh_session(session_token)
    if t:
        run_encoding(*t)
```
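
The acquire/try/finally shape above recurs wherever a session token is used. One possible way to package it, shown only as a sketch, is a small context manager; `managed_session` is a hypothetical name, and the acquire and release callables are passed in so the example stays independent of the submodules:

```python
from contextlib import contextmanager
from typing import Callable, Iterator


# Hypothetical helper (not from the repository): guarantees that `release`
# runs with the acquired token, even if the body raises.
@contextmanager
def managed_session(
    acquire: Callable[[], str], release: Callable[[str], None]
) -> Iterator[str]:
    token = acquire()
    try:
        yield token
    finally:
        release(token)
```

Under that assumption, a caller would write `with managed_session(general.get_ctx_token, general.remove_and_refresh_session):` around the database work.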

**Error handling with notifications:**
```python
try:
    # Embedding operation
    pass
except Exception as e:
    embedding.update_embedding_state_failed(project_id, embedding_id, with_commit=True)
    send_project_update(project_id, f"embedding:{embedding_id}:state:{enums.EmbeddingState.FAILED.value}")
    notification.create(
        project_id,
        user_id,
        str(e),
        enums.Notification.ERROR.value,
        enums.NotificationType.EMBEDDING_CREATION_FAILED.value,
        True,
    )
    return status.HTTP_500_INTERNAL_SERVER_ERROR
```

**Throttled progress updates:**
```python
@param_throttle(seconds=5)
def send_progress_update_throttle(
    project_id: str, embedding_id: str, state: str, initial_count: int
) -> None:
    progress = resolve_progress(embedding_id, state, initial_count)
    send_project_update(project_id, f"embedding:{embedding_id}:progress:{progress}")
```
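
The rule does not show how `param_throttle` is implemented. As a rough illustration of the idea, a decorator that drops repeated calls with the same positional arguments inside a time window might look like this; `param_throttle_sketch` is a hypothetical stand-in, and keying on the full argument tuple is an assumption:

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, Optional, Tuple


# Hypothetical sketch, not the repository's `param_throttle`.
def param_throttle_sketch(seconds: float) -> Callable:
    def decorator(func: Callable) -> Callable:
        # Time of the last executed call, per argument tuple
        last_call: Dict[Tuple[Any, ...], float] = {}

        @wraps(func)
        def wrapper(*args: Any) -> Optional[Any]:
            now = time.monotonic()
            previous = last_call.get(args)
            if previous is not None and now - previous < seconds:
                return None  # throttled: skip this call
            last_call[args] = now
            return func(*args)

        return wrapper

    return decorator
```

In this sketch the arguments must be hashable, and a skipped call returns `None`, which suits fire-and-forget functions such as progress notifications.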

## Best Practices

1. Single responsibility per function
2. Always validate inputs and check embedding existence
3. Use type hints for all parameters
4. Use `with_commit=True` when modifying database state
5. Use submodule business objects, never SQLAlchemy directly
6. Manage database sessions with `general.get_ctx_token()` and `general.remove_and_refresh_session()`
7. Use `daemon.run_without_db_token()` for background operations
8. Update embedding state and send project updates for progress tracking
9. Clean up resources (delete embedders, call `gc.collect()`) after operations
10. Handle errors gracefully with appropriate notifications and state updates

.cursor/rules/exceptions.mdc

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
description: Rules for exception handling and custom exceptions
globs: ["**/*.py"]
alwaysApply: true
---

# Exceptions Guidelines

## Exception Locations

**Submodule exceptions:**
```python
from submodules.model.exceptions import EntityNotFoundException, EntityAlreadyExistsException
```

**Standard Python exceptions:**
- `ValueError` - Invalid input values
- `Exception` - General errors (with specific messages)

## Usage Patterns

**Raising exceptions:**
```python
# Validation
if not embedding.get(project_id, embedding_id):
    raise ValueError(f"Embedding {embedding_id} not found in project {project_id}")

# Not found (from submodules)
embedding_item = embedding.get(project_id, embedding_id)
if not embedding_item:
    # Handle gracefully - return early or raise
    return

# Business logic errors
if not embedder:
    raise Exception(
        f"couldn't find matching embedder for requested embedding with type {embedding_type} model {model} and platform {platform}"
    )
```

**Handling in controllers:**
```python
import traceback

try:
    embedder = get_embedder(...)
    if not embedder:
        raise Exception("Could not initialize embedder")
except Exception as e:
    print(traceback.format_exc(), flush=True)
    embedding.update_embedding_state_failed(project_id, embedding_id, with_commit=True)
    send_project_update(project_id, f"embedding:{embedding_id}:state:{enums.EmbeddingState.FAILED.value}")
    notification.create(...)
    return status.HTTP_422_UNPROCESSABLE_ENTITY
```

**Handling in routes:**
```python
@app.post("/calc-tensor-by-pkl/{project_id}/{embedding_id}")
def calc_tensor(...):
    if tensor := controller.calc_tensors(project_id, embedding_id, request.texts):
        return responses.JSONResponse(status_code=status.HTTP_200_OK, content={"tensor": tensor})
    return responses.PlainTextResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content="Error while calculating tensor",
    )
```

## HTTP Status Code Mapping

- `200`: Successful operations
- `422`: `UnprocessableEntity` - Invalid input or model initialization failures
- `500`: `InternalServerError` - Runtime errors, API connection errors, general exceptions
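
The mapping above can be sketched as a small function that turns an outcome into a status code. This is only an illustration: `EmbedderInitError` and `status_for` are hypothetical names, and treating `ValueError` as the marker for invalid input is an assumption drawn from the usage patterns earlier in this file.

```python
from typing import Optional


class EmbedderInitError(Exception):
    """Hypothetical: raised when an embedder cannot be initialized."""


# Hypothetical helper (not from the repository).
def status_for(error: Optional[Exception]) -> int:
    if error is None:
        return 200  # successful operation
    if isinstance(error, (ValueError, EmbedderInitError)):
        return 422  # invalid input or model initialization failure
    return 500  # runtime errors, connection errors, anything else
```

Keeping the mapping in one place like this would make it harder for individual controllers to drift apart in which code they return for the same failure.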

## Error Handling Best Practices

1. Use specific exception types when available from submodules
2. Provide clear error messages with context (project_id, embedding_id, etc.)
3. Log exceptions with `print(traceback.format_exc(), flush=True)` for debugging
4. Update embedding state to `FAILED` when errors occur
5. Send project updates to notify users of failures
6. Create notifications for user-facing errors
7. Return appropriate HTTP status codes from routes
8. Clean up resources (sessions, embedders) in `finally` blocks
9. Don't swallow exceptions silently - always handle or propagate
10. Use early returns for validation failures to avoid deep nesting
