PEPO

A project for preference alignment of language models, implementing PEPO (Pessimistic Ensemble Preference Optimization) alongside related techniques such as DPO (Direct Preference Optimization) and RLHF (Reinforcement Learning from Human Feedback). The codebase is designed to be modular, efficient, and scalable, supporting ensemble-based methods through parallelized training.

Installation

The installation process is simplified for CUDA-enabled systems (Linux/Windows):

  1. Edit pyproject.toml to set the correct CUDA version for your system. Update the url in the [[tool.uv.index]] section to match your CUDA version:

    [[tool.uv.index]]
    name = "pytorch"
    url = "https://download.pytorch.org/whl/cu126"  # Change cu126 to your CUDA version (e.g., cu118, cu121)
  2. Install dependencies:

    uv sync

    This will create a virtual environment and install all required packages, including PyTorch with the specified CUDA version.

  3. The alpaca_eval library is included as a git submodule. After cloning this repo, initialize it with:

    git submodule update --init --recursive
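
After installation, you can optionally verify that the CUDA-enabled PyTorch build sees your GPU (a quick sanity check, not a required setup step):

uv run python -c "import torch; print(torch.cuda.is_available())"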

Adding Dependencies

To add new dependencies to the project, use uv add:

uv add package-name

This will automatically update pyproject.toml and uv.lock with the new dependency.

Environment Variables

The project uses environment variables for configuration. Create a .env file in the project root (.env is already in .gitignore):

cp .env.example .env

Then edit .env and fill in all the values:

# HuggingFace Token (needs WRITE permissions for pushing models)
# Get from: https://huggingface.co/settings/tokens
HF_TOKEN=your_huggingface_token_here

# Weights & Biases Configuration (for experiment tracking)
WANDB_API_KEY=your_wandb_token_here
WANDB_ENTITY=your_wandb_entity

# HuggingFace Hub Base Directory (custom cache/storage location)
HF_HUB_BASE_DIR=your_hf_hub_base_dir
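
To illustrate how these variables are typically consumed, the sketch below loads the .env file with python-dotenv and reads the values from the process environment. This is a minimal example assuming python-dotenv is used; the repository's actual loading code may differ.

import os

from dotenv import load_dotenv

# Load variables from .env in the project root into the process environment.
load_dotenv()

hf_token = os.environ["HF_TOKEN"]            # required for pushing models to the Hub
wandb_entity = os.getenv("WANDB_ENTITY")     # optional: W&B entity for experiment tracking
hub_base_dir = os.getenv("HF_HUB_BASE_DIR")  # optional: custom cache/storage location

print(f"W&B entity: {wandb_entity}, HF hub dir: {hub_base_dir}")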

Running the Project

Basic Usage

Run scripts using uv run:

uv run scripts/eval.py

Or run the training script:

uv run scripts/train.py

SLURM Scripts

For running on SLURM clusters, use the scripts in scripts/slurm/:

  • get_interactive.sh: Allocates an interactive node with 4 GPUs for 12 hours. Use this to get a shell on a compute node for development and testing.
  • *.slurm: Batch job scripts for training and evaluation (e.g., train.slurm, eval.slurm).
  • connect_to_node.sh: Connects to an existing interactive job. Shows a menu of all active jobs with details (job name, node, start time, time remaining) to help you choose which node to connect to. It can also connect to batch jobs.
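
For example, a batch job is typically submitted with sbatch; the resource requests live inside the .slurm file itself:

sbatch scripts/slurm/train.slurm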

Configuration Management

This project uses Hydra for configuration management. Configuration files are stored in the configs directory, with configs/train.yaml as the default (specified in scripts/train.py).

Overriding Parameters

Override any configuration parameter from the command line using dot notation:

uv run scripts/train.py hub.push=false L=1 log_level=debug

This example:

  • Disables pushing models to HuggingFace Hub (hub.push=false)
  • Sets the number of ensemble networks to 1 (L=1)
  • Sets the log level to debug (log_level=debug)

For more details, see the Hydra documentation.
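
To make the override flow concrete, here is a minimal sketch of a Hydra entry point in the style of scripts/train.py (the config path, config name, and printed fields are assumptions based on the layout described below, not the repository's actual code):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="../configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # By this point Hydra has merged train.yaml, its defaults list, and any
    # command-line overrides (e.g., hub.push=false L=1) into a single cfg.
    print(OmegaConf.to_yaml(cfg))  # dump the fully resolved configuration

if __name__ == "__main__":
    main()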

Configuration File Structure

The configs/train.yaml file uses Hydra's defaults: to compose configurations:

defaults:
  - model: smollm
  - dataset: ultrafeedback
  - _self_

This loads:

  • Model configuration from configs/model/smollm.yaml (directory name matches the field name)
  • Dataset configuration from configs/dataset/ultrafeedback.yaml
  • The train.yaml config itself (via _self_), which can override the defaults

For example, train.yaml overrides the number of ensemble networks using a variable reference:

model:
  num_networks: ${L}

Values set in train.yaml can in turn be overridden from the command line, as shown above.
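
For instance, because of the ${L} interpolation above, overriding the top-level L also changes model.num_networks (the value 4 here is just an illustration):

uv run scripts/train.py L=4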

Development

Pre-commit hooks

This project uses pre-commit hooks to ensure code quality and consistency.

# Install development dependencies
uv sync --group dev

# Install pre-commit hooks
uv run pre-commit install
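
Once installed, the hooks run automatically on every commit. To check the entire codebase manually, you can run:

uv run pre-commit run --all-files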

Note for macOS/non-CUDA users: While you cannot run the training scripts locally without a CUDA device, you can still contribute to the codebase. The pre-commit hooks are configured to use uvx, so they will run in isolated environments without requiring you to install the project's heavy CUDA dependencies.
