Commit c6be48d

FradSer and claude committed

fix: resolve sqlalchemy metadata field conflicts and pydantic model_ namespace issues

- rename metadata columns to extra_metadata in GenerationRecord and DatasetItem
- fix pydantic model field naming conflicts (model_used -> models_used)
- update claude.md with accurate architecture details and file references

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

1 parent 6cff643 commit c6be48d

File tree

3 files changed: +1001 −0 lines changed


CLAUDE.md

Lines changed: 349 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an AI fine-tuning dataset generator that uses knowledge distillation techniques to create high-quality training data. The system transforms expensive teacher models into cost-effective student models through intelligent evaluation and automated optimization workflows.

## Architecture

This is a full-stack application with the following key components:

### Backend (FastAPI + Python)

- **Framework**: FastAPI with async/await support using Pydantic v2 models
- **Database**: PostgreSQL with SQLAlchemy ORM, async support via asyncpg
- **Cache/Queue**: Redis for caching and Celery for background tasks
- **LLM Integration**: OpenAI, Anthropic, and Google APIs for model distillation
- **Location**: `./backend/` directory
- **Key Files**:
  - `services.py`: Business logic with DatasetService, QualityService, CostService, ExportService
  - `models.py`: Pydantic models with enums for DataType, ExportFormat, OptimizationLevel
  - `database.py`: SQLAlchemy models (GenerationRecord, DatasetItem, CostRecord, QualityValidation)
  - `config.py`: Settings using pydantic-settings with environment-based configuration

### Frontend (Next.js + React + TypeScript)

- **Framework**: Next.js 14 with React 18 (App Router)
- **Language**: TypeScript with strict configuration
- **Styling**: Tailwind CSS with custom configurations
- **Components**: Headless UI, Lucide React icons, Framer Motion for animations
- **State Management**: React Query v3 for server state, React Hook Form for forms
- **UI Libraries**: Recharts for analytics, React Hot Toast for notifications
- **Location**: `./frontend/` directory
- **Key Pages**: `/generate`, `/datasets`, `/analytics` (App Router structure)

### Knowledge Distillation System

- **Purpose**: Teacher-student model architecture for cost optimization (70-90% cost reduction)
- **Location**: `./distillation/` directory
- **Key Files**:
  - `core.py`: Core orchestrator, teacher/student models, quality validation
  - `providers.py`: Multi-provider LLM integration (OpenAI, Anthropic, Google, Local)
  - `integration.py`: API integration layer and configuration management
  - `transfer.py`: Knowledge transfer algorithms and adaptive learning
- **Configuration**: `./config/distillation.json`

## Development Commands

### Starting the Development Environment

```bash
# Start full development environment with hot reload
./start-dev.sh

# Alternative: Start individual services
docker-compose -f docker-compose.dev.yml up -d postgres redis  # Basic services
docker-compose -f docker-compose.dev.yml up -d --build         # Full services
```

### Starting Production Environment

```bash
# Start production environment
./start-prod.sh
# OR
docker-compose up -d

# With monitoring (Prometheus + Grafana)
docker-compose --profile monitoring up -d

# With background tasks (Celery)
docker-compose --profile background-tasks up -d
```

### Backend Development

```bash
cd backend

# Install dependencies
pip install -r requirements.txt

# Run locally (requires database and Redis running)
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Code formatting and linting
black .  # Format code
isort .  # Sort imports
mypy .   # Type checking

# Run tests
pytest                     # All tests
pytest tests/unit/         # Unit tests only
pytest tests/integration/  # Integration tests only
pytest -v -s               # Verbose output
pytest --cov=app           # With coverage
```

### Frontend Development

```bash
cd frontend

# Install dependencies
npm install

# Development server
npm run dev

# Build for production
npm run build

# Type checking
npm run type-check

# Linting
npm run lint
```

### Service Management

```bash
# View service status
docker-compose ps
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose logs -f [service_name]
docker-compose -f docker-compose.dev.yml logs -f backend

# Stop all services
./stop-dev.sh
# OR
docker-compose down
docker-compose -f docker-compose.dev.yml down

# Restart a specific service
docker-compose restart backend
```

## Environment Configuration

### Required Environment Variables (.env)

```bash
# Database Configuration
POSTGRES_PASSWORD=password123
DATABASE_URL=postgresql://postgres:password123@localhost:5432/qa_generator

# Redis Configuration
REDIS_PASSWORD=redis123
REDIS_URL=redis://:redis123@localhost:6379/0

# LLM API Keys (at least one required)
OPENAI_API_KEY=sk-your-openai-api-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-api-key
GOOGLE_API_KEY=your-google-api-key
GROQ_API_KEY=your_groq_api_key
QIANFAN_ACCESS_KEY=your_qianfan_access_key
QIANFAN_SECRET_KEY=your_qianfan_secret_key
OPENAI_BASE_URL=your_openai_base_url

# Application Configuration
DEBUG=false
NODE_ENV=production
LOG_LEVEL=INFO
SECRET_KEY=your-secret-key-change-in-production-very-long-and-random

# Frontend Configuration
NEXT_PUBLIC_API_BASE_URL=http://localhost:8000

# Optional: Monitoring
GRAFANA_PASSWORD=admin123

# Optional: Background Tasks
CELERY_BROKER_URL=redis://:redis123@localhost:6379/1
CELERY_RESULT_BACKEND=redis://:redis123@localhost:6379/1

# Optional: Production SSL
SSL_CERT_PATH=/path/to/cert.pem
SSL_KEY_PATH=/path/to/key.pem

# Optional: Email Notifications
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your-email@gmail.com
SMTP_PASS=your-email-password

# Optional: AWS File Storage
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
S3_BUCKET_NAME=qa-generator-storage

# Optional: Third-party Services
SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id
ANALYTICS_ID=GA-your-google-analytics-id
```

### Creating Environment File

- Copy `.env.example` to `.env`: `cp .env.example .env`
- The `./start-dev.sh` script automatically creates a basic `.env` file if missing
- **Important**: Replace placeholder API keys with actual keys for LLM providers
- **Minimum Required**: At least one LLM API key (OpenAI, Anthropic, Google, Groq, or Qianfan)
- **Extended Providers**: The system supports Groq and Qianfan (Baidu) in addition to the standard providers

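Since at least one key is required, a startup check along these lines can fail fast with a clear message (a stdlib sketch only; the actual configuration code in `backend/config.py` uses pydantic-settings, and the function name here is invented):

```python
import os


def first_llm_key() -> str:
    """Return the name of the first usable LLM API key, or raise."""
    for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY",
                "GROQ_API_KEY", "QIANFAN_ACCESS_KEY"):
        value = os.environ.get(var, "")
        if value and "your" not in value:  # skip .env.example placeholders
            return var
    raise RuntimeError("Set at least one LLM API key in .env")


os.environ["OPENAI_API_KEY"] = "sk-demo"  # illustration only
print(first_llm_key())  # OPENAI_API_KEY
```
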
## Key Architectural Patterns

### Service Layer Architecture

The backend uses a service-oriented pattern with distinct responsibilities:

- **DatasetService**: Manages dataset creation, storage, and retrieval (`backend/services.py:25`)
- **QualityService**: Handles quality analytics, validation, and metrics (`backend/services.py:141`)
- **CostService**: Provides cost estimation, tracking, and optimization analysis (`backend/services.py:227`)
- **ExportService**: Manages data export in multiple formats (JSON, JSONL, CSV, Hugging Face, OpenAI) (`backend/services.py:315`)
- **BatchService**: Handles concurrent batch processing of generation requests (`backend/services.py:432`)

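As a toy illustration of the single-responsibility split (method names are hypothetical; see `backend/services.py` for the real interfaces):

```python
from dataclasses import dataclass, field


@dataclass
class DatasetService:
    """Owns dataset creation and retrieval; other concerns live in sibling services."""
    _store: dict = field(default_factory=dict)

    def create(self, name: str) -> None:
        self._store[name] = []

    def add_item(self, name: str, item: dict) -> None:
        self._store[name].append(item)

    def get(self, name: str) -> list:
        return self._store[name]


svc = DatasetService()
svc.create("demo")
svc.add_item("demo", {"question": "q", "answer": "a"})
print(len(svc.get("demo")))  # 1
```
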
### Knowledge Distillation Architecture

The distillation system implements a sophisticated teacher-student learning framework:

- **KnowledgeDistillationOrchestrator**: Main orchestrator coordinating the 4-stage process (`distillation/core.py:313`)
- **TeacherModel**: High-capacity model for seed data generation (`distillation/core.py:80`)
- **StudentModel**: Cost-efficient model learning from teacher examples (`distillation/core.py:140`)
- **QualityValidator**: Automated quality assessment using teacher models (`distillation/core.py:246`)
- **ProviderFactory**: Multi-provider abstraction with optimal model pairing (`distillation/providers.py:399`)

### Knowledge Distillation Pipeline (4-Stage Process)

1. **Teacher Seed Generation**: The high-capacity teacher model generates roughly 10% of the dataset as high-quality seed data (`TeacherModel.generate_seed_data`)
2. **Student Learning**: Patterns are extracted from the teacher's examples (`StudentModel.learn_from_teacher`)
3. **Bulk Generation**: The student model generates the remaining 90% using the learned patterns (`StudentModel.generate_bulk_data`)
4. **Quality Validation**: Automated quality assessment, with sampling for cost efficiency (`QualityValidator.validate_batch`)

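A dependency-free sketch of how the 10%/90% split above might be planned (the function and key names are invented for illustration; the real orchestration lives in `distillation/core.py`):

```python
def plan_distillation(total_items: int, seed_ratio: float = 0.10) -> dict:
    """Split a generation job into the four stages described above."""
    seed_count = max(1, round(total_items * seed_ratio))
    return {
        "teacher_seed": seed_count,                    # stage 1: seed data
        "student_bulk": total_items - seed_count,      # stage 3: bulk output
        "validation_sample": max(1, seed_count // 2),  # stage 4: sampled QA
    }


print(plan_distillation(1000))
# {'teacher_seed': 100, 'student_bulk': 900, 'validation_sample': 50}
```
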
### Multi-Provider LLM Integration

- **Provider Classes**: `OpenAIProvider`, `AnthropicProvider`, `GoogleProvider`, `LocalLlamaProvider` (`distillation/providers.py`)
- **Real API Integration**: Actual HTTP calls to OpenAI, Anthropic Claude, Google Gemini, and Ollama endpoints
- **Cost Optimization**: Real 2024 pricing data with intelligent teacher-student model pairing
- **Rate Limiting**: Per-provider request throttling with the `RateLimiter` class
- **Concurrent Processing**: Async batch processing with configurable batch sizes

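A minimal sliding-window throttle in the spirit of the `RateLimiter` class mentioned above (a sketch only; the actual implementation in `distillation/providers.py` may differ, e.g. it is likely async):

```python
import time
from collections import deque


class SimpleRateLimiter:
    """Allow at most `max_calls` per `per_seconds` window (blocking sketch)."""

    def __init__(self, max_calls: int, per_seconds: float) -> None:
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self._calls: deque = deque()

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] > self.per_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window (simplified).
            time.sleep(self.per_seconds - (now - self._calls[0]))
        self._calls.append(time.monotonic())


limiter = SimpleRateLimiter(max_calls=3, per_seconds=1.0)
for _ in range(3):
    limiter.acquire()  # the first three calls pass without sleeping
print(len(limiter._calls))  # 3
```
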
### Data Models & Storage

- **SQLAlchemy Models** (`backend/database.py`):
  - `GenerationRecord`: Tracks generation requests, costs, quality metrics, metadata
  - `DatasetItem`: Individual data items with content, quality scores, validation status
  - `CostRecord`: Detailed cost tracking per provider/model with token usage
  - `QualityValidation`: Validation results with quality breakdowns and recommendations
- **Pydantic Models** (`backend/models.py`): API request/response validation with strict typing
- **IMPORTANT**: Avoid reserved field names like `metadata` (use `extra_metadata`) and `model_*` prefixes

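The reserved-name rule above is easy to trip over. A minimal Pydantic v2 sketch of the renames this commit introduced (field names are illustrative, not the actual `backend/models.py` definitions):

```python
# Pydantic v2 treats the `model_` prefix as a protected namespace, so a field
# named `model_used` triggers a warning; `models_used` does not. Similarly,
# SQLAlchemy's Declarative API reserves the `metadata` attribute on mapped
# classes, which motivates the `extra_metadata` column name.
from pydantic import BaseModel


class GenerationResponse(BaseModel):
    models_used: list[str]               # renamed from `model_used`
    extra_metadata: dict[str, str] = {}  # renamed from `metadata`


resp = GenerationResponse(models_used=["gpt-4o-mini"],
                          extra_metadata={"source": "teacher"})
print(resp.models_used)
```
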
## Testing Strategy

### Backend Testing

```bash
cd backend
pytest                     # Run all tests
pytest tests/unit/         # Unit tests only
pytest tests/integration/  # Integration tests only
pytest --cov=app           # With coverage
```

### Frontend Testing

```bash
cd frontend
npm test               # Run tests
npm run test:watch     # Watch mode
npm run test:coverage  # With coverage
```

## Deployment Profiles

The system supports multiple deployment configurations via Docker Compose profiles:

- **Default**: Basic application (frontend + backend + database + Redis)
- **monitoring**: Adds Prometheus + Grafana for metrics
- **background-tasks**: Adds Celery workers + Flower monitoring
- **production**: Adds an Nginx reverse proxy with SSL

## API Documentation

- **Development**: http://localhost:8000/docs (FastAPI auto-generated)
- **Health Check**: http://localhost:8000/api/system/status
- **Frontend**: http://localhost:3000

## Common Development Tasks

### Adding a New LLM Provider

1. Implement a provider class in `distillation/providers.py` extending `LLMProvider`
2. Add the provider to `ProviderFactory.PROVIDER_MAP`
3. Update `config/distillation.json` with the new provider configuration
4. Add pricing information and role suitability scoring
5. Add tests for the new provider integration

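Step 1 might look like the following sketch (the base-class signature is assumed for illustration, not copied from `distillation/providers.py`, and `EchoProvider` is a hypothetical stand-in):

```python
import asyncio
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Assumed shape of the provider base class; the real one will differ."""

    @abstractmethod
    async def generate(self, prompt: str, model: str) -> str: ...


class EchoProvider(LLMProvider):
    """Stand-in provider used only to illustrate the extension point."""

    async def generate(self, prompt: str, model: str) -> str:
        return f"[{model}] {prompt}"


# Step 2: registration analogous to ProviderFactory.PROVIDER_MAP
PROVIDER_MAP = {"echo": EchoProvider}

result = asyncio.run(PROVIDER_MAP["echo"]().generate("ping", "demo-model"))
print(result)  # [demo-model] ping
```
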
### Modifying the Distillation Strategy

1. Update the distillation configuration in `config/distillation.json`
2. Implement the strategy in `distillation/transfer.py` or `distillation/core.py`
3. Update the `DistillationStrategy` enum in `distillation/core.py`
4. Add quality metrics and validation in the `QualityValidator` class
5. Test with different teacher-student combinations

### Database Schema Changes

1. Create an Alembic migration: `cd backend && alembic revision --autogenerate -m "description"`
2. Apply the migration: `alembic upgrade head`
3. Update the SQLAlchemy models in `backend/database.py` (not `models.py`, which holds the Pydantic models)

### Working with Data Types

Supported data types are defined in the `DataType` enum:

- `qa`: Question-answer pairs
- `classification`: Text classification data
- `generation`: Text generation prompts
- `code`: Code generation and completion
- `translation`: Translation pairs
- `ner`: Named entity recognition data

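The enum presumably looks roughly like this (values assumed to match the identifiers listed above; `backend/models.py` is authoritative):

```python
from enum import Enum


class DataType(str, Enum):
    QA = "qa"
    CLASSIFICATION = "classification"
    GENERATION = "generation"
    CODE = "code"
    TRANSLATION = "translation"
    NER = "ner"


# str-valued enums round-trip cleanly through API payloads:
print(DataType("qa").value)  # qa
```
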
### Cost Management

Cost estimation and tracking includes:

- Base cost per data type with quality multipliers
- Teacher-student ratio impact on pricing
- Budget limits and utilization tracking
- Provider-specific token and cost recording
- ROI analysis vs. traditional annotation methods

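A back-of-envelope sketch of how those factors could combine (the multiplier values and function name are invented for illustration; the real pricing logic lives in `CostService`):

```python
def estimate_cost(items: int, base_cost_per_item: float,
                  quality_multiplier: float, teacher_ratio: float,
                  teacher_premium: float = 10.0) -> float:
    """Blend cheap student generation with a premium teacher-generated share."""
    student_share = items * (1 - teacher_ratio) * base_cost_per_item
    teacher_share = items * teacher_ratio * base_cost_per_item * teacher_premium
    return (student_share + teacher_share) * quality_multiplier


# 1000 items, $0.002 base cost, 1.5x quality multiplier, 10% teacher share:
print(round(estimate_cost(1000, 0.002, 1.5, 0.10), 2))  # 5.7
```
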
## Performance Considerations

- **Concurrent Processing**: The system supports 1000+ concurrent requests
- **Rate Limiting**: Configured per LLM provider to avoid quota issues
- **Caching**: Redis caching for frequently accessed data and API responses
- **Background Tasks**: Use Celery for long-running distillation processes
- **Database**: Connection pooling and query optimization for large datasets

319+
## Security Features
320+
321+
- **API Authentication**: JWT tokens with configurable secret keys
322+
- **Data Encryption**: All API communications over HTTPS in production
323+
- **Input Validation**: Pydantic models for request/response validation
324+
- **Rate Limiting**: Per-endpoint and per-user request throttling
325+
- **Environment Isolation**: Separate configurations for dev/staging/production
326+
327+
## Troubleshooting

### Common Issues

- **Port conflicts**: Ensure ports 3000, 8000, 5432, and 6379 are available
- **API key errors**: Verify the LLM provider API keys in `.env`
- **Database connection**: Check PostgreSQL container health and credentials
- **Memory issues**: Increase Docker memory limits for large dataset processing

### Debug Commands

```bash
# Check container health
docker-compose ps
docker-compose logs [service]

# Database connection test
docker-compose exec postgres pg_isready -U postgres

# Redis connection test
docker-compose exec redis redis-cli ping

# API health check
curl http://localhost:8000/api/system/status
```
