
LPO

LINEAR PREFERENCE OPTIMIZATION: DECOUPLED GRADIENT CONTROL VIA ABSOLUTE REGULARIZATION

Paper | Code


Overview

We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that LPO consistently improves performance on various tasks, including general text tasks, math tasks, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we release the source code, models, and training data publicly.
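The full loss is defined in the paper; for intuition only, the sketch below shows one way the three ideas could be combined in a DPO-style pairwise setup. The function name, the `offset`, `lam`, and `alpha` coefficients, and the exact form of each term are illustrative assumptions, not the repository's implementation.

```python
# Illustrative sketch only: a DPO-style pairwise loss combining the three LPO
# ideas above (absolute-difference term, chosen-side regularization, tunable
# rejection suppression). Names and exact terms are assumptions.
import torch
import torch.nn.functional as F

def lpo_loss_sketch(pi_logp_chosen, pi_logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, offset=1.0, lam=1.0, alpha=0.1):
    """Inputs are per-sample sequence log-probabilities under the policy (pi_*)
    and the frozen reference model (ref_*)."""
    r_c = beta * (pi_logp_chosen - ref_logp_chosen)      # implicit chosen reward
    r_r = beta * (pi_logp_rejected - ref_logp_rejected)  # implicit rejected reward

    # (1) Absolute-difference preference term with an offset, replacing
    #     -log(sigmoid(.)): its gradient magnitude is constant, so updates are
    #     not modulated by sigmoid saturation.
    pref = torch.abs(offset - (r_c - r_r))

    # (2) Positive regularization that keeps the chosen log-likelihood from
    #     falling below the reference model, preserving chosen-response quality.
    keep_chosen = lam * F.relu(ref_logp_chosen - pi_logp_chosen)

    # (3) Rejection suppression: alpha linearly scales how fast the rejected
    #     response's probability is pushed down.
    suppress = alpha * r_r

    return (pref + keep_chosen + suppress).mean()
```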

Available models

We conducted comprehensive experiments across general text tasks, mathematics-specific domains, and text-to-speech (TTS) systems. The results demonstrate consistent improvements across all models. Below are the corresponding trained open-source models for each scenario.

| Model Version | HuggingFace | Application |
| --- | --- | --- |
| General Text Task on preference dataset | HuggingFace | Text Q&A |
| General Text Task on instruct dataset | HuggingFace | Text Q&A |
| Math Model | HuggingFace | Math Q&A |
| TTS Model | HuggingFace | Chinese TTS inference / text dialogue / long-CoT |
| ASR Model | HuggingFace | ASR |

Install

Clone and Install

  • Clone the repo
git clone git@github.com:IDEA-Emdoor-Lab/LPO.git
cd LPO
  • Set up the environment
conda create -n lpo -y python=3.10
conda activate lpo
pip install -r requirements.txt

Training Usage

We have open-sourced our data construction pipeline and training code, detailed below:

Step 1: Data Pipeline. The data construction process consists of three components: candidate generation, logit computation for the data, and conversion to mmap-format training data (a rough sketch of the mmap-conversion step follows the commands below).

cd ./toolkits/lpo_data_preprocessing
sbatch scripts/gene_lpo_lam_cand.sh
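For concreteness, the sketch below shows one way the final mmap-conversion stage could look. The file names, JSON fields, and tokenizer are placeholders; the repository's actual scripts (e.g. the sbatch pipeline above) may differ.

```python
# Hypothetical sketch of the mmap-conversion stage; file names, JSON fields and
# the tokenizer are assumptions, not the repository's actual pipeline.
import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # assumed base model

sequences = []
with open("lpo_candidates.jsonl") as f:            # hypothetical candidate file
    for line in f:
        sample = json.loads(line)
        for key in ("chosen", "rejected"):         # hypothetical field names
            ids = tokenizer.encode(sample["prompt"] + sample[key])
            sequences.append(np.asarray(ids, dtype=np.int32))

# Flatten all token sequences into one memory-mapped binary plus an offset
# index, so training can read samples without loading the dataset into RAM.
flat = np.concatenate(sequences)
mm = np.memmap("lpo_train.bin", dtype=np.int32, mode="w+", shape=flat.shape)
mm[:] = flat
mm.flush()
np.save("lpo_train_idx.npy", np.cumsum([0] + [len(s) for s in sequences]))
```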

Step 2: Model Training

cd ./examples/qwen2_5
sbatch run_mcore_qwen_xpo.sh

Evaluation

To validate the effectiveness of the algorithm, comprehensive experiments were conducted in four domains: general text tasks, domain-specific mathematical text tasks, TTS speech generation tasks, and ASR tasks.

RESULTS ON GENERAL TASKS

During the alignment phase, we validated the algorithm's robustness using both noisy training data and high-quality preference training data.

  1. Infinity-Preference: An open-source, high-quality preference dataset characterized by subtle distinctions between chosen and rejected responses, minimal noise, and greater learning difficulty.

  2. Infinity-instruct-1w: We randomly selected 10,000 samples from the remaining Infinity-instruct data. The original Infinity-instruct responses serve as the chosen set, while responses generated by Qwen2.5-SFT (with temperature=1.0, top_p=1.0) serve as the rejected set, as sketched below. This constructed training dataset is of lower quality than Infinity-Preference.
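The rejected responses can be reproduced with standard sampling; the sketch below shows one possible way to do it with Hugging Face transformers. The checkpoint path and prompt are placeholders; only the sampling settings (temperature=1.0, top_p=1.0) come from the description above.

```python
# Sketch of sampling a rejected response from the SFT model; the checkpoint
# path and prompt are placeholders, not the repository's actual script.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/Qwen2.5-SFT"   # hypothetical local SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings from the dataset description: temperature=1.0, top_p=1.0.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=1.0,
                        max_new_tokens=512)
rejected = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```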

RESULTS ON MATH TASKS

LPO achieves a score of 88.86 on the GSM8K benchmark, representing a 4.71-point improvement over the SFT model and surpassing the performance of Qwen2.5-Instruct. In contrast, DPO exhibits a 1.81-point degradation compared to the SFT baseline. As noted in DPOP, DPO often fails to achieve strong results on mathematical reasoning tasks.

RESULTS ON TEXT-TO-SPEECH TASKS

The LPO algorithm demonstrates significant improvements in emotional expressiveness and fidelity compared to the SFT model, while exhibiting a slight decrease in stability. This outcome validates the effectiveness of the LPO algorithm in the field of speech generation.

RESULTS ON ASR TASKS

Constrained by the base model's fundamental capabilities, our model did not achieve state-of-the-art (SOTA) performance after SFT; nevertheless, the LPO algorithm effectively reduced the speech recognition error rate.

Citation

@misc{wang2025linearpreferenceoptimizationdecoupled,
      title={Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization}, 
      author={Rui Wang and Qianguo Sun and Chao Song and Junlong Wu and Tianrong Chen and Zhiyun Zeng and Yu Li},
      year={2025},
      eprint={2508.14947},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.14947}, 
}

License

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization © 2025 by Rui Wang, Qianguo Sun, Chao Song, Junlong Wu, Tianrong Chen, Zhiyun Zeng, Yu Li is licensed under CC BY-NC-ND 4.0
