Dataset Pipeline: Single-Persona Training Data for 1B-3B Model #75

@tzervas

Overview

Implement a dataset acquisition and curation pipeline for training Tritter's single-persona (Rust/Python/Triton specialist) foundation model.

Target: ~400GB total dataset stack
Model progression: 1B baseline → 3B → optionally 7B
Hardware: RTX 5080 16GB

Dataset Tiers (from research)

Primary Tier (~150GB) - With Quality Annotations

  • NVIDIA OpenCodeReasoning (735K reasoning traces)
  • Strandset-Rust-v1 (191K Rust examples)
  • GPUMODE Triton datasets (18.2K + 864 kernels)
  • CodeReviewer + CodeUltraFeedback (alignment)

Secondary Tier (~200GB) - Pretraining Base

  • OpenCoder RefineCode (Rust/Python/Triton subset)
  • The Stack v2 (filtered, permissive Rust/Python)
  • Rust ecosystem: Clippy, RFCs, Rustlings

Tertiary Tier - rust-ai Ecosystem

  • Extract training data from tzervas crates (docs.rs + source)

Training Phases

| Phase | Data Size | Datasets |
|---|---|---|
| Pretraining (1B) | ~20B tokens | OpenCoder RefineCode, Stack v2 subset |
| Instruction Tuning | ~5B tokens | OpenCodeReasoning, Strandset-Rust, GPUMODE |
| Alignment | ~500M tokens | CodeUltraFeedback, CodeReviewer |
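
As a quick sanity check on the phase budgets above, the arithmetic can be sketched in Python. The token counts are the approximate figures from the table. The dictionary keys, the `phase_shares` helper, and the tokens-per-parameter ratio are illustrative assumptions, not part of the existing codebase.

```python
# Approximate token budgets taken from the Training Phases table.
# Key names are hypothetical; they are not identifiers from the Tritter repo.
PHASE_TOKENS = {
    "pretraining": 20_000_000_000,
    "instruction_tuning": 5_000_000_000,
    "alignment": 500_000_000,
}

# Parameter count of the 1B baseline model.
MODEL_PARAMS = 1_000_000_000


def phase_shares(phase_tokens: dict[str, int]) -> dict[str, float]:
    """Return each phase's fraction of the total token budget."""
    total = sum(phase_tokens.values())
    return {name: tokens / total for name, tokens in phase_tokens.items()}


total_tokens = sum(PHASE_TOKENS.values())
tokens_per_param = PHASE_TOKENS["pretraining"] / MODEL_PARAMS

if __name__ == "__main__":
    print(f"total budget: {total_tokens / 1e9:.1f}B tokens")
    print(f"pretraining tokens per parameter: {tokens_per_param:.0f}")
    for name, share in phase_shares(PHASE_TOKENS).items():
        print(f"{name}: {share:.1%}")
```

Under these figures the three phases sum to roughly 25.5B tokens, with pretraining at about 20 tokens per parameter for the 1B baseline.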

Licensing Matrix

All datasets must be MIT-compatible:

  • ✓ MIT, Apache 2.0, BSD, CC-BY 4.0
  • ⚠️ Review: OpenRAIL-M (Stack v2)
  • ❌ Exclude: CC-BY-NC, C-UDA, MPL 2.0
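
The matrix above could be enforced with a small license gate during curation. This is a minimal sketch, assuming each record carries an SPDX-style license string. The function name `classify_license`, the normalized identifiers, and the choice to send unknown licenses to manual review are assumptions for illustration.

```python
# Hypothetical license gate mirroring the Licensing Matrix.
# Identifiers are lowercased SPDX-style strings; adjust to match real metadata.
ALLOWED = {"mit", "apache-2.0", "cc-by-4.0"}
REVIEW = {"openrail-m"}
EXCLUDED_PREFIXES = ("cc-by-nc", "c-uda", "mpl-2.0")


def classify_license(license_id: str) -> str:
    """Return "allow", "review", or "exclude" for a license identifier."""
    key = license_id.strip().lower()
    # Check exclusions first so variants such as "cc-by-nc-4.0" are caught.
    if key.startswith(EXCLUDED_PREFIXES):
        return "exclude"
    if key in REVIEW:
        return "review"
    # Accept plain "bsd" as well as variants such as "bsd-3-clause".
    if key in ALLOWED or key == "bsd" or key.startswith("bsd-"):
        return "allow"
    # Assumption: anything unrecognized is held for manual review, not dropped.
    return "review"


if __name__ == "__main__":
    for lic in ["MIT", "BSD-3-Clause", "OpenRAIL-M", "CC-BY-NC-4.0", "GPL-3.0"]:
        print(lic, "->", classify_license(lic))
```

Holding unknown licenses for review rather than excluding them keeps the gate conservative without silently discarding data; a stricter pipeline might exclude them instead.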

Deliverables

  1. Dataset download scripts in `scripts/data/`
  2. Curation pipeline using the existing `tritter.curation` module
  3. Quality scoring via the `tr-curate` CLI
  4. JSONL shards in `data/pretrain/`, `data/instruct/`, `data/align/`
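
The JSONL shard deliverable could look roughly like the sketch below. It uses only the Python standard library and does not call `tritter.curation` or `tr-curate`, whose interfaces are not described here. The function name `write_jsonl_shards`, the shard naming scheme, and the per-shard record limit are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def write_jsonl_shards(
    records: Iterable[dict[str, Any]],
    out_dir: str,
    max_records_per_shard: int = 10_000,
    prefix: str = "shard",
) -> list[Path]:
    """Write records as JSON Lines, starting a new file every N records.

    Hypothetical helper: file naming and shard size are illustrative only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    handle = None
    count = 0
    try:
        for record in records:
            # Open the first shard, or roll over once the current one is full.
            if handle is None or count >= max_records_per_shard:
                if handle is not None:
                    handle.close()
                path = out / f"{prefix}-{len(paths):05d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
                count = 0
            # One JSON object per line; keep non-ASCII source text readable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    finally:
        if handle is not None:
            handle.close()
    return paths
```

For example, `write_jsonl_shards(records, "data/pretrain")` would fill `data/pretrain/` with files named `shard-00000.jsonl`, `shard-00001.jsonl`, and so on. Sharding by record count keeps the sketch simple; sharding by byte size may be a better fit for a dataset of this scale.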

References

  • Research doc: `docs/research/dataset-research.md`
  • Curation spec: `docs/specs/SPEC-007-dataset-quality-gates.md`
  • Training strategy: `docs/TRAINING_STRATEGY.md`

Labels: enhancement
