## Overview

Implement the dataset acquisition and curation pipeline for training Tritter's single-persona (Rust/Python/Triton specialist) foundation model.

- Target: ~400 GB total dataset stack
- Model progression: 1B baseline → 3B → optionally 7B
- Hardware: RTX 5080 (16 GB)
## Dataset Tiers (from research)

A download-script sketch follows the tier lists below.

### Primary Tier (~150 GB): With Quality Annotations
- NVIDIA OpenCodeReasoning (735K reasoning traces)
- Strandset-Rust-v1 (191K Rust examples)
- GPUMODE Triton datasets (18.2K + 864 kernels)
- CodeReviewer + CodeUltraFeedback (alignment)
### Secondary Tier (~200 GB): Pretraining Base
- OpenCoder RefineCode (Rust/Python/Triton subset)
- The Stack v2 (filtered, permissive Rust/Python)
- Rust ecosystem: Clippy, RFCs, Rustlings
### Tertiary Tier: rust-ai Ecosystem
- Extract training data from tzervas crates (docs.rs + source)
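As a starting point for the download scripts, here is a minimal sketch using `huggingface_hub.snapshot_download`. The repo IDs are placeholders derived from the dataset names above, not verified paths; check each one against the research doc before running.

```python
# scripts/data/download_primary_tier.py
# Minimal sketch: mirror primary-tier datasets into data/raw/.
# Repo IDs below are assumptions -- verify against the research doc.
from pathlib import Path

from huggingface_hub import snapshot_download

PRIMARY_TIER = [
    "nvidia/OpenCodeReasoning",       # reasoning traces (ID unverified)
    "example-org/Strandset-Rust-v1",  # hypothetical repo ID
    "example-org/gpumode-triton",     # hypothetical repo ID
]

DATA_ROOT = Path("data/raw/primary")


def main() -> None:
    for repo_id in PRIMARY_TIER:
        target = DATA_ROOT / repo_id.split("/")[-1]
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)
        print(f"downloaded {repo_id} -> {target}")


if __name__ == "__main__":
    main()
```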
## Training Phases
| Phase | Data Size | Datasets |
|---|---|---|
| Pretraining (1B) | ~20B tokens | OpenCoder RefineCode, Stack v2 subset |
| Instruction Tuning | ~5B tokens | OpenCodeReasoning, Strandset-Rust, GPUMODE |
| Alignment | ~500M tokens | CodeUltraFeedback, CodeReviewer |
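To keep the phase plan machine-readable, the schedule above could be expressed as plain data that the training scripts consume. A sketch, with illustrative dataset keys (the final identifiers come from the curation pipeline):

```python
# Sketch: phase schedule as data, mirroring the table above.
# Token budgets are the targets from this issue; dataset keys are
# illustrative names, not final identifiers.
TRAINING_PHASES = {
    "pretrain_1b": {
        "token_budget": 20_000_000_000,  # ~20B tokens
        "datasets": ["opencoder_refinecode", "stack_v2_subset"],
    },
    "instruct": {
        "token_budget": 5_000_000_000,   # ~5B tokens
        "datasets": ["opencodereasoning", "strandset_rust", "gpumode"],
    },
    "align": {
        "token_budget": 500_000_000,     # ~500M tokens
        "datasets": ["codeultrafeedback", "codereviewer"],
    },
}
```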
## Licensing Matrix

All datasets must be MIT-compatible:

- ✓ Allowed: MIT, Apache 2.0, BSD, CC-BY 4.0
- ⚠️ Review: OpenRAIL-M (The Stack v2)
- ❌ Exclude: CC-BY-NC, C-UDA, MPL 2.0
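A default-deny license gate in the curation pipeline would enforce this matrix mechanically. A sketch, assuming records carry SPDX-style license identifiers (the exact ID strings are assumptions):

```python
# Sketch of a license gate for the curation pipeline. SPDX-style
# identifiers are illustrative; map each dataset's license metadata
# onto these buckets before ingestion.
ALLOWED = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "CC-BY-4.0"}
NEEDS_REVIEW = {"OpenRAIL-M"}  # e.g. The Stack v2
EXCLUDED = {"CC-BY-NC-4.0", "C-UDA", "MPL-2.0"}


def license_gate(spdx_id: str) -> str:
    """Classify a record's license into allow / review / exclude."""
    if spdx_id in ALLOWED:
        return "allow"
    if spdx_id in NEEDS_REVIEW:
        return "review"
    return "exclude"  # default-deny anything unknown or excluded
```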
## Deliverables
- Dataset download scripts in `scripts/data/`
- Curation pipeline using the existing `tritter.curation` module
- Quality scoring via the `tr-curate` CLI
- JSONL shards in `data/pretrain/`, `data/instruct/`, `data/align/` (shard-writer sketch below)
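For the JSONL shards, a small writer along these lines would keep the shard layout consistent across the three splits. The record schema and shard size are assumptions; align them with whatever `tritter.curation` actually emits.

```python
# Sketch of a JSONL shard writer for data/pretrain/, data/instruct/,
# data/align/. Record schema and shard size are assumptions.
import json
from pathlib import Path
from typing import Iterable


def write_shards(records: Iterable[dict], out_dir: Path,
                 shard_size: int = 100_000) -> None:
    """Write records as <out_dir>/shard-NNNNN.jsonl files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shard: list[dict] = []
    idx = 0
    for rec in records:
        shard.append(rec)
        if len(shard) >= shard_size:
            _flush(shard, out_dir, idx)
            shard, idx = [], idx + 1
    if shard:
        _flush(shard, out_dir, idx)


def _flush(shard: list[dict], out_dir: Path, idx: int) -> None:
    path = out_dir / f"shard-{idx:05d}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for rec in shard:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```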
## References
- Research doc: `docs/research/dataset-research.md`
- Curation spec: `docs/specs/SPEC-007-dataset-quality-gates.md`
- Training strategy: `docs/TRAINING_STRATEGY.md`