"Reproduction in science is not about running the code without errors; it is about independently verifying core hypotheses and exploring the boundaries of robustness."
Data GPT is not merely a codebase for training Large Language Models (LLMs); it is a research instrument designed to cultivate "Research Sense" in the domain of Data-Centric AI.
Moving beyond the engineering paradigm of "building for throughput," this project adopts the scientific paradigm of "dissecting for causality." We systematically reproduce, ablate, and stress-test state-of-the-art data processing algorithms—including Deduplication (MinHash/SemDeDup), Domain Reweighting (DoReMi), and Gradient-based Selection (LESS).
Our primary objective is to quantify the causal link between data artifacts (e.g., near-duplicates, domain mixtures) and training dynamics, with a rigorous emphasis on documenting Negative Results and Failed Attempts.
This project operates on three core epistemological principles:
- Hypothesis-Driven Implementation: Every line of code serves to verify a specific hypothesis (e.g., "Does semantic deduplication retain more reasoning diversity than n-gram deduplication?").
- The Value of Negative Results: A divergent loss curve is not a bug; it is a data point. We treat failed experiments as critical findings regarding the fragility of published methods.
- Controlled Environments: We use the Pythia model suite and DataComp benchmarks to ensure that all performance gains are attributable to data interventions rather than architectural variation.
We investigate four pillars of Data-Centric AI. Each pillar is associated with specific Research Questions (RQs).
Pillar 1: Deduplication
Focus: Trade-offs between information density and semantic diversity.
- Algorithms: MinHash LSH, SemDeDup (Embedding-based).
- Key Hypotheses:
- H_1: Aggressive deduplication (low Jaccard threshold) harms generalization by removing "near-duplicate" concept reinforcements.
- H_2: Semantic deduplication outperforms surface-level deduplication on reasoning tasks (ARC, HellaSwag) but incurs O(N^2) computational cost.
- Experimental Control: S-Curve manipulation (b, r parameters) to control False Positive Rates; see the sketch after this list.
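The MinHash LSH candidate probability follows the standard S-curve P(s) = 1 - (1 - s^r)^b, where s is the Jaccard similarity, r the rows per band, and b the number of bands. Below is a minimal sketch for inspecting how b and r shape the false-positive region; the parameter values are illustrative, not the project's defaults:

```python
# Minimal sketch: the MinHash LSH S-curve, P(candidate) = 1 - (1 - s^r)^b.
# b (bands) and r (rows per band) must satisfy b * r = num_perm; values here are illustrative.

def candidate_probability(jaccard: float, bands: int, rows: int) -> float:
    """Probability that two items with the given Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands

bands, rows = 32, 4  # e.g., num_perm = 128 split into 32 bands of 4 rows
for jaccard in (0.2, 0.4, 0.6, 0.8, 0.9):
    p = candidate_probability(jaccard, bands, rows)
    print(f"Jaccard={jaccard:.1f} -> candidate probability {p:.3f}")
```

Increasing b (or decreasing r) shifts the curve toward lower similarities, trading false negatives for the false positives examined in Log 002.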
Pillar 2: Domain Reweighting
Focus: DoReMi and Distributionally Robust Optimization (DRO).
- Algorithms: DoReMi (Domain Reweighting with Minimax Optimization); a sketch of the weight update follows this list.
- Key Hypotheses:
- H_1: The Proxy-Target Gap: Domain weights learned by a small proxy model (e.g., 160M) are suboptimal for larger target models (e.g., 1B) due to capacity constraints.
- H_2: Tokenizer mismatch between Proxy and Reference models leads to catastrophic weight collapse.
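At its core, DoReMi performs an exponentiated-gradient update on the domain weights, driven by each domain's excess loss (proxy loss minus reference loss). A minimal sketch of that update follows; the step size, smoothing constant, and domain names are illustrative, not the project's exact implementation:

```python
import numpy as np

# Minimal sketch of a DoReMi-style domain-weight update (exponentiated gradients).
# It assumes per-domain excess losses (proxy loss minus reference loss) are already computed;
# step size, smoothing constant, and domain names are illustrative.

def update_domain_weights(weights, excess_loss, step_size=1.0, smoothing=1e-3):
    """One multiplicative-weights step: upweight domains where the proxy still lags the reference."""
    excess = np.clip(excess_loss, a_min=0.0, a_max=None)  # only positive excess loss contributes
    logits = np.log(weights) + step_size * excess          # exponentiated-gradient ascent step
    new_weights = np.exp(logits - logits.max())            # numerically stable renormalization
    new_weights /= new_weights.sum()
    k = len(weights)
    # Mixing with the uniform distribution is DoReMi's guard against collapse to one-hot weights.
    return (1.0 - smoothing) * new_weights + smoothing / k

domains = ["web", "code", "books", "wiki"]
weights = np.ones(len(domains)) / len(domains)
excess = np.array([0.10, 0.45, 0.02, 0.20])  # example per-domain excess losses for one step
weights = update_domain_weights(weights, excess)
print(dict(zip(domains, weights.round(3))))
```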
Pillar 3: Gradient-based Selection
Focus: Instance-level selection using Influence Functions.
- Algorithms: LESS (Low-rank gradiEnt Similarity Search); a selection sketch follows this list.
- Key Hypotheses:
- H_1: Small-to-Large Transferability: Data selected by a 160M model effectively improves a 1B model.
- H_2: Gradient-based selection introduces a bias towards short sequence lengths.
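A minimal sketch of the selection step behind these hypotheses: per-example training gradients are projected to a low-rank space and ranked by similarity to a target (validation) gradient. Random Gaussian projections and cosine similarity are used here as a stand-in for the full LESS pipeline, and all dimensions and arrays are illustrative placeholders:

```python
import numpy as np

# Minimal sketch of gradient-similarity data selection (LESS-style), assuming
# per-example gradients have already been extracted. All dimensions are illustrative.
rng = np.random.default_rng(0)

n_examples, grad_dim, proj_dim = 1_000, 4_096, 64
train_grads = rng.normal(size=(n_examples, grad_dim))  # stand-in for per-example gradients
target_grad = rng.normal(size=grad_dim)                # stand-in for the validation-task gradient

# Random projection approximately preserves inner products (Johnson-Lindenstrauss),
# which is what makes the low-rank similarity search tractable.
projection = rng.normal(size=(grad_dim, proj_dim)) / np.sqrt(proj_dim)
low_rank_train = train_grads @ projection
low_rank_target = target_grad @ projection

# Cosine similarity between each training gradient and the target gradient.
scores = (low_rank_train @ low_rank_target) / (
    np.linalg.norm(low_rank_train, axis=1) * np.linalg.norm(low_rank_target) + 1e-8
)

top_k = 100
selected = np.argsort(scores)[::-1][:top_k]  # indices of the highest-scoring examples
print(selected[:10], scores[selected[:10]].round(3))
```

H_2 can be probed directly on this output by checking whether the selected indices skew toward short sequences.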
Pillar 4: Instruction Evolution
Focus: Evol-Instruct and complexity metrics.
- Algorithms: Evol-Instruct, Tree-of-Thought (ToT) generation.
- Key Hypotheses:
- H_1: Verbosity ≠ Complexity. Many "evolved" samples increase token count without increasing logic depth (measured by Type-Token Ratio); see the sketch below.
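A minimal sketch of the Type-Token Ratio probe behind H_1, comparing a seed instruction with an "evolved" rewrite; the whitespace tokenization and both example strings are illustrative:

```python
# Minimal sketch: Type-Token Ratio (unique tokens / total tokens) as a cheap proxy
# for whether an "evolved" sample adds diversity or just length.
# Whitespace tokenization and the example strings are illustrative.

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

seed = "Write a function that reverses a string."
evolved = (
    "Write a function that reverses a string. The function should reverse the string "
    "and return the reversed string as the final reversed string output."
)

for name, text in [("seed", seed), ("evolved", evolved)]:
    print(f"{name}: {len(text.split())} tokens, TTR={type_token_ratio(text):.2f}")
# A longer sample with a *lower* TTR is verbose rather than more complex,
# which is the pattern H_1 predicts for many evolved samples.
```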
To ensure strict reproducibility and academic comparability:
| Component | Specification | Rationale |
|---|---|---|
| Model Architecture | Pythia (160M, 410M, 1B) | Transparent checkpoints; consistent data ordering. |
| Baseline Dataset | CommonCrawl (DataComp-Small) | Standardized baseline for SOTA comparison. |
| Evaluation | MMLU, HellaSwag, Perplexity | Standard academic metrics (via lm-evaluation-harness). |
| Infrastructure | PyTorch, Hydra, Accelerate, WandB | Modern, scalable research stack. |
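Benchmark accuracies come from lm-evaluation-harness, while held-out perplexity can be computed directly from token-level cross-entropy. A minimal sketch, assuming the transformers library and a public Pythia checkpoint; the text list is a placeholder for an actual held-out split:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: held-out perplexity for a Pythia checkpoint.
# The text list is a placeholder for a real held-out evaluation split.
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

texts = ["Deduplication changes what the model sees, not how it is wired."]
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])
        n_predicted = enc["input_ids"].shape[1] - 1  # labels are shifted internally
        total_nll += out.loss.item() * n_predicted   # out.loss is the mean cross-entropy
        total_tokens += n_predicted

print(f"perplexity = {math.exp(total_nll / total_tokens):.2f}")
```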
The structure reflects the research workflow, separating hypothesis configuration from implementation.
data-gpt/
├── configs/
│ ├── hypothesis/ # Specific configs for testing assumptions (e.g., high_recall_minhash)
│ ├── experimental_setup/ # Control vs. Treatment definitions
│ └── ...
├── reports/
│ ├── 01_deduplication/
│ │ ├── README.md # Main Findings
│ │ └── failed_attempts.md # rigorous analysis of what DIDN'T work
│ └── ...
├── notebooks/ # Analysis of S-Curves, Gradient Norms, and Token distributions
├── src/ # Core implementation of algorithms
└── ...
Science is built on the graveyard of failed hypotheses. The logs linked below document experiments that failed to reproduce reported results or diverged during training, each with a root-cause analysis.
- Log 001: The Tokenizer Mismatch in DoReMi - How a vocabulary difference of 200 tokens caused domain weights to collapse to one-hot vectors (see the diagnostic sketch below).
- Log 002: MinHash Sensitivity - Why Jaccard=0.6 deleted 25% of valid code samples.
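For Log 001, a cheap pre-flight check is to compare the proxy and reference tokenizers before computing excess losses, since DoReMi's per-domain comparison assumes both models score identical token sequences. A minimal diagnostic sketch, assuming the transformers library; the checkpoint names are placeholders rather than the actual proxy/reference pair:

```python
from transformers import AutoTokenizer

# Minimal diagnostic sketch for the Log 001 failure mode: DoReMi's excess loss is only
# meaningful if the proxy and reference models tokenize text identically.
# Checkpoint names are placeholders, not the pair analyzed in Log 001.
proxy_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
reference_tok = AutoTokenizer.from_pretrained("gpt2")

proxy_vocab = set(proxy_tok.get_vocab())
reference_vocab = set(reference_tok.get_vocab())
print(f"proxy vocab: {len(proxy_vocab)}, reference vocab: {len(reference_vocab)}")
print(f"tokens only in proxy: {len(proxy_vocab - reference_vocab)}")
print(f"tokens only in reference: {len(reference_vocab - proxy_vocab)}")

sample = "def minhash_signature(tokens, num_perm=128):"
if proxy_tok.tokenize(sample) != reference_tok.tokenize(sample):
    print("Tokenization differs: per-token losses are not directly comparable across models.")
```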
- Python 3.10+
- CUDA 11.8+
- uv (Recommended for dependency management)
# Clone the laboratory
git clone https://github.com/duoan/data-gpt.git
cd data-gpt
# Set up the environment (uv sync installs dependencies from the project's lockfile)
uv sync
# Run a baseline training on raw data (DataComp-Small) using Pythia-160M
python src/train.py experiment=control_baseline model=pythia_160m
# Run training with specific deduplication hypothesis
python src/train.py experiment=dedup_minhash_conservative \
hypothesis.minhash.num_perm=128 \
hypothesis.minhash.threshold=0.8
This project is deeply inspired by the following works. If you find this repository useful, please cite the original papers:
- Pythia: Biderman et al., 2023. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.
- DataComp: Gadre et al., 2023. DataComp: In search of the next generation of multimodal datasets.
- DoReMi: Xie et al., 2023. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.
- LESS: Xia et al., 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning.
Author: Victor An | Last Updated: December 2025