"Reproduction in science is not about running the code without errors; it is about independently verifying core hypotheses and exploring the boundaries of robustness."
Data GPT is not merely a codebase for training Large Language Models (LLMs); it is a research instrument designed to cultivate "Research Sense" in the domain of Data-Centric AI.
Moving beyond the engineering paradigm of "building for throughput," this project adopts the scientific paradigm of "dissecting for causality." We systematically reproduce, ablate, and stress-test state-of-the-art data processing algorithms—including Deduplication (MinHash/SemDeDup), Domain Reweighting (DoReMi), and Gradient-based Selection (LESS).
Our primary objective is to quantify the causal link between data artifacts (e.g., near-duplicates, domain mixtures) and training dynamics, with a rigorous emphasis on documenting Negative Results and Failed Attempts.
This project operates on three core epistemological principles:
- Hypothesis-Driven Implementation: Every line of code serves to verify a specific hypothesis (e.g., "Does semantic deduplication retain more reasoning diversity than n-gram deduplication?").
- The Value of Negative Results: A divergent loss curve is not a bug; it is a data point. We treat failed experiments as critical findings regarding the fragility of published methods.
- Controlled Environments: We use the Pythia model suite and DataComp benchmarks to ensure that all performance gains are attributable to data interventions rather than architectural variation.
We investigate four pillars of Data-Centric AI. Each pillar is associated with specific Research Questions (RQs).
Pillar 1: Deduplication
Focus: Trade-offs between information density and semantic diversity.
- Algorithms: MinHash LSH, SemDeDup (Embedding-based).
- Key Hypotheses:
- H_1: Aggressive deduplication (low Jaccard threshold) harms generalization by removing "near-duplicate" concept reinforcements.
- H_2: Semantic deduplication outperforms surface-level deduplication on reasoning tasks (ARC, HellaSwag) but incurs O(N^2) computational cost.
- Experimental Control: S-Curve manipulation (b, r parameters) to control False Positive Rates; see the sketch after this list.
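The MinHash LSH candidate probability follows the standard S-curve P(s) = 1 - (1 - s^r)^b, where s is the Jaccard similarity, r the rows per band, and b the number of bands. Below is a minimal sketch for inspecting how b and r shape the false-positive region; the parameter values are illustrative, not the project's defaults:

```python
# Minimal sketch: the MinHash LSH S-curve, P(candidate) = 1 - (1 - s^r)^b.
# b (bands) and r (rows per band) must satisfy b * r = num_perm; values here are illustrative.

def candidate_probability(jaccard: float, bands: int, rows: int) -> float:
    """Probability that two items with the given Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands

bands, rows = 32, 4  # e.g., num_perm = 128 split into 32 bands of 4 rows
for jaccard in (0.2, 0.4, 0.6, 0.8, 0.9):
    p = candidate_probability(jaccard, bands, rows)
    print(f"Jaccard={jaccard:.1f} -> candidate probability {p:.3f}")
```

Increasing b (or decreasing r) shifts the curve toward lower similarities, trading false negatives for the false positives examined in Log 002.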
Pillar 2: Domain Reweighting
Focus: DoReMi and Distributionally Robust Optimization (DRO).
- Algorithms: DoReMi (Domain Reweighting with Minimax Optimization); a sketch of the weight update follows this list.
- Key Hypotheses:
- H_1: The Proxy-Target Gap: Domain weights learned by a small proxy model (e.g., 160M) are suboptimal for larger target models (e.g., 1B) due to capacity constraints.
- H_2: Tokenizer mismatch between Proxy and Reference models leads to catastrophic weight collapse.
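At its core, DoReMi performs an exponentiated-gradient update on the domain weights, driven by each domain's excess loss (proxy loss minus reference loss). A minimal sketch of that update follows; the step size, smoothing constant, and domain names are illustrative, not the project's exact implementation:

```python
import numpy as np

# Minimal sketch of a DoReMi-style domain-weight update (exponentiated gradients).
# It assumes per-domain excess losses (proxy loss minus reference loss) are already computed;
# step size, smoothing constant, and domain names are illustrative.

def update_domain_weights(weights, excess_loss, step_size=1.0, smoothing=1e-3):
    """One multiplicative-weights step: upweight domains where the proxy still lags the reference."""
    excess = np.clip(excess_loss, a_min=0.0, a_max=None)  # only positive excess loss contributes
    logits = np.log(weights) + step_size * excess          # exponentiated-gradient ascent step
    new_weights = np.exp(logits - logits.max())            # numerically stable renormalization
    new_weights /= new_weights.sum()
    k = len(weights)
    # Mixing with the uniform distribution is DoReMi's guard against collapse to one-hot weights.
    return (1.0 - smoothing) * new_weights + smoothing / k

domains = ["web", "code", "books", "wiki"]
weights = np.ones(len(domains)) / len(domains)
excess = np.array([0.10, 0.45, 0.02, 0.20])  # example per-domain excess losses for one step
weights = update_domain_weights(weights, excess)
print(dict(zip(domains, weights.round(3))))
```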
Pillar 3: Gradient-based Selection
Focus: Instance-level selection using Influence Functions.
- Algorithms: LESS (Low-rank gradiEnt Similarity Search); a selection sketch follows this list.
- Key Hypotheses:
- H_1: Small-to-Large Transferability: Data selected by a 160M model effectively improves a 1B model.
- H_2: Gradient-based selection introduces a bias towards short sequence lengths.
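A minimal sketch of the selection step behind these hypotheses: per-example training gradients are projected to a low-rank space and ranked by similarity to a target (validation) gradient. Random Gaussian projections and cosine similarity are used here as a stand-in for the full LESS pipeline, and all dimensions and arrays are illustrative placeholders:

```python
import numpy as np

# Minimal sketch of gradient-similarity data selection (LESS-style), assuming
# per-example gradients have already been extracted. All dimensions are illustrative.
rng = np.random.default_rng(0)

n_examples, grad_dim, proj_dim = 1_000, 4_096, 64
train_grads = rng.normal(size=(n_examples, grad_dim))  # stand-in for per-example gradients
target_grad = rng.normal(size=grad_dim)                # stand-in for the validation-task gradient

# Random projection approximately preserves inner products (Johnson-Lindenstrauss),
# which is what makes the low-rank similarity search tractable.
projection = rng.normal(size=(grad_dim, proj_dim)) / np.sqrt(proj_dim)
low_rank_train = train_grads @ projection
low_rank_target = target_grad @ projection

# Cosine similarity between each training gradient and the target gradient.
scores = (low_rank_train @ low_rank_target) / (
    np.linalg.norm(low_rank_train, axis=1) * np.linalg.norm(low_rank_target) + 1e-8
)

top_k = 100
selected = np.argsort(scores)[::-1][:top_k]  # indices of the highest-scoring examples
print(selected[:10], scores[selected[:10]].round(3))
```

H_2 can be probed directly on this output by checking whether the selected indices skew toward short sequences.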
Pillar 4: Instruction Evolution
Focus: Evol-Instruct and complexity metrics.
- Algorithms: Evol-Instruct, Tree-of-Thought (ToT) generation.
- Key Hypotheses:
- H_1: Verbosity ≠ Complexity. Many "evolved" samples increase token count without increasing logic depth (measured by Type-Token Ratio); see the sketch below.
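A minimal sketch of the Type-Token Ratio probe behind H_1, comparing a seed instruction with an "evolved" rewrite; the whitespace tokenization and both example strings are illustrative:

```python
# Minimal sketch: Type-Token Ratio (unique tokens / total tokens) as a cheap proxy
# for whether an "evolved" sample adds diversity or just length.
# Whitespace tokenization and the example strings are illustrative.

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

seed = "Write a function that reverses a string."
evolved = (
    "Write a function that reverses a string. The function should reverse the string "
    "and return the reversed string as the final reversed string output."
)

for name, text in [("seed", seed), ("evolved", evolved)]:
    print(f"{name}: {len(text.split())} tokens, TTR={type_token_ratio(text):.2f}")
# A longer sample with a *lower* TTR is verbose rather than more complex,
# which is the pattern H_1 predicts for many evolved samples.
```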
To ensure strict reproducibility and academic comparability:
| Component | Specification | Rationale |
|---|---|---|
| Model Architecture | Pythia (160M, 410M, 1B) | Transparent checkpoints; consistent data ordering. |
| Baseline Dataset | CommonCrawl (DataComp-Small) | Standardized baseline for SOTA comparison. |
| Evaluation | MMLU, HellaSwag, Perplexity | Standard academic metrics (via lm-evaluation-harness). |
| Infrastructure | PyTorch, Hydra, Accelerate, WandB | Modern, scalable research stack. |
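Benchmark accuracies come from lm-evaluation-harness, while held-out perplexity can be computed directly from token-level cross-entropy. A minimal sketch, assuming the transformers library and a public Pythia checkpoint; the text list is a placeholder for an actual held-out split:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: held-out perplexity for a Pythia checkpoint.
# The text list is a placeholder for a real held-out evaluation split.
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

texts = ["Deduplication changes what the model sees, not how it is wired."]
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])
        n_predicted = enc["input_ids"].shape[1] - 1  # labels are shifted internally
        total_nll += out.loss.item() * n_predicted   # out.loss is the mean cross-entropy
        total_tokens += n_predicted

print(f"perplexity = {math.exp(total_nll / total_tokens):.2f}")
```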
The structure reflects the research workflow, separating hypothesis configuration from implementation.
data-gpt/
├── configs/
│ ├── hypothesis/ # Specific configs for testing assumptions (e.g., high_recall_minhash)
│ ├── experimental_setup/ # Control vs. Treatment definitions
│ └── ...
├── reports/
│ ├── 01_deduplication/
│ │ ├── README.md # Main Findings
│ │ └── failed_attempts.md # rigorous analysis of what DIDN'T work
│ └── ...
├── notebooks/ # Analysis of S-Curves, Gradient Norms, and Token distributions
├── src/ # Core implementation of algorithms
└── ...
Science is built on the graveyard of failed hypotheses. The logs linked below document experiments that failed to reproduce reported results or diverged during training, each with a root-cause analysis.
- Log 001: The Tokenizer Mismatch in DoReMi - How a vocabulary difference of 200 tokens caused domain weights to collapse to one-hot vectors (see the diagnostic sketch below).
- Log 002: MinHash Sensitivity - Why Jaccard=0.6 deleted 25% of valid code samples.
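For Log 001, a cheap pre-flight check is to compare the proxy and reference tokenizers before computing excess losses, since DoReMi's per-domain comparison assumes both models score identical token sequences. A minimal diagnostic sketch, assuming the transformers library; the checkpoint names are placeholders rather than the actual proxy/reference pair:

```python
from transformers import AutoTokenizer

# Minimal diagnostic sketch for the Log 001 failure mode: DoReMi's excess loss is only
# meaningful if the proxy and reference models tokenize text identically.
# Checkpoint names are placeholders, not the pair analyzed in Log 001.
proxy_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
reference_tok = AutoTokenizer.from_pretrained("gpt2")

proxy_vocab = set(proxy_tok.get_vocab())
reference_vocab = set(reference_tok.get_vocab())
print(f"proxy vocab: {len(proxy_vocab)}, reference vocab: {len(reference_vocab)}")
print(f"tokens only in proxy: {len(proxy_vocab - reference_vocab)}")
print(f"tokens only in reference: {len(reference_vocab - proxy_vocab)}")

sample = "def minhash_signature(tokens, num_perm=128):"
if proxy_tok.tokenize(sample) != reference_tok.tokenize(sample):
    print("Tokenization differs: per-token losses are not directly comparable across models.")
```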
- Python 3.10+
- CUDA 11.8+
- uv (Recommended for dependency management)
# Clone the laboratory
git clone https://github.com/duoan/data-gpt.git
cd data-gpt
# Set up the environment (uv sync installs dependencies from the project's lockfile)
uv sync
# Run a baseline training on raw data (DataComp-Small) using Pythia-160M
python src/train.py experiment=control_baseline model=pythia_160m
# Run training with specific deduplication hypothesis
python src/train.py experiment=dedup_minhash_conservative \
hypothesis.minhash.num_perm=128 \
hypothesis.minhash.threshold=0.8
This project is deeply inspired by the following works. If you find this repository useful, please cite the original papers:
- Pythia: Biderman et al., 2023. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.
- DataComp: Gadre et al., 2023. DataComp: In search of the next generation of multimodal datasets.
- DoReMi: Xie et al., 2023. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.
- LESS: Xia et al., 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning.
Author: Victor An | Last Updated: December 2025