
LPO

LINEAR PREFERENCE OPTIMIZATION: DECOUPLED GRADIENT CONTROL VIA ABSOLUTE REGULARIZATION

Paper | Code


Overview

We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that LPO consistently improves performance on various tasks, including general text tasks, math tasks, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we release the source code, models, and training data publicly.
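The full loss is defined in the paper; for intuition only, the sketch below shows one way the three ideas could be combined in a DPO-style pairwise setup. The function name, the `offset`, `lam`, and `alpha` coefficients, and the exact form of each term are illustrative assumptions, not the repository's implementation.

```python
# Illustrative sketch only: a DPO-style pairwise loss combining the three LPO
# ideas above (absolute-difference term, chosen-side regularization, tunable
# rejection suppression). Names and exact terms are assumptions.
import torch
import torch.nn.functional as F

def lpo_loss_sketch(pi_logp_chosen, pi_logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, offset=1.0, lam=1.0, alpha=0.1):
    """Inputs are per-sample sequence log-probabilities under the policy (pi_*)
    and the frozen reference model (ref_*)."""
    r_c = beta * (pi_logp_chosen - ref_logp_chosen)      # implicit chosen reward
    r_r = beta * (pi_logp_rejected - ref_logp_rejected)  # implicit rejected reward

    # (1) Absolute-difference preference term with an offset, replacing
    #     -log(sigmoid(.)): its gradient magnitude is constant, so updates are
    #     not modulated by sigmoid saturation.
    pref = torch.abs(offset - (r_c - r_r))

    # (2) Positive regularization that keeps the chosen log-likelihood from
    #     falling below the reference model, preserving chosen-response quality.
    keep_chosen = lam * F.relu(ref_logp_chosen - pi_logp_chosen)

    # (3) Rejection suppression: alpha linearly scales how fast the rejected
    #     response's probability is pushed down.
    suppress = alpha * r_r

    return (pref + keep_chosen + suppress).mean()
```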

Available models

We conducted comprehensive experiments across general text tasks, mathematics-specific domains, and text-to-speech (TTS) systems. The results demonstrate consistent improvements across all models. Below are the corresponding trained open-source models for each scenario.

| Model Version | HuggingFace | Application |
| --- | --- | --- |
| General Text Task on preference dataset | HuggingFace | Text Q&A |
| General Text Task on instruct dataset | HuggingFace | Text Q&A |
| Math Model | HuggingFace | Math Q&A |
| TTS Model | HuggingFace | Chinese TTS inference / text dialogue / long-CoT |
| ASR Model | HuggingFace | ASR |

Install

Clone and Install

  • Clone the repo
git clone git@github.com:IDEA-Emdoor-Lab/LPO.git
cd LPO
  • Set up the environment
conda create -n lpo -y python=3.10
conda activate lpo
pip install -r requirements.txt

Training Usage

We have open-sourced our data construction pipeline and training code, detailed below:

Step 1: Data Pipeline. The data construction process consists of three components: candidate generation, logit computation for the data, and conversion to mmap-format training data (a rough sketch of the mmap-conversion step follows the commands below).

cd ./toolkits/lpo_data_preprocessing
sbatch scripts/gene_lpo_lam_cand.sh
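For concreteness, the sketch below shows one way the final mmap-conversion stage could look. The file names, JSON fields, and tokenizer are placeholders; the repository's actual scripts (e.g. the sbatch pipeline above) may differ.

```python
# Hypothetical sketch of the mmap-conversion stage; file names, JSON fields and
# the tokenizer are assumptions, not the repository's actual pipeline.
import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # assumed base model

sequences = []
with open("lpo_candidates.jsonl") as f:            # hypothetical candidate file
    for line in f:
        sample = json.loads(line)
        for key in ("chosen", "rejected"):         # hypothetical field names
            ids = tokenizer.encode(sample["prompt"] + sample[key])
            sequences.append(np.asarray(ids, dtype=np.int32))

# Flatten all token sequences into one memory-mapped binary plus an offset
# index, so training can read samples without loading the dataset into RAM.
flat = np.concatenate(sequences)
mm = np.memmap("lpo_train.bin", dtype=np.int32, mode="w+", shape=flat.shape)
mm[:] = flat
mm.flush()
np.save("lpo_train_idx.npy", np.cumsum([0] + [len(s) for s in sequences]))
```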

Step 2: Model Training

cd ./examples/qwen2_5
sbatch run_mcore_qwen_xpo.sh

Evaluation

To validate the effectiveness of the algorithm, comprehensive experiments were conducted in four domains: general text tasks, domain-specific mathematical text tasks, TTS speech generation tasks, and ASR tasks.

RESULTS ON GENERAL TASKS

During the alignment phase, we validated the algorithm's robustness using both noisy training data and high-quality preference training data.

  1. Infinity-Preference: An open-source, high-quality preference dataset characterized by subtle distinctions between chosen and rejected responses, minimal noise, and greater learning difficulty.

  2. Infinity-instruct-1w: We randomly selected 10,000 samples from the remaining Infinity-instruct data. The original Infinity-instruct responses serve as the chosen set, while responses generated by Qwen2.5-SFT (with temperature=1.0, top_p=1.0) serve as the rejected set, as sketched below. This constructed training dataset is of lower quality than Infinity-Preference.
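The rejected responses can be reproduced with standard sampling; the sketch below shows one possible way to do it with Hugging Face transformers. The checkpoint path and prompt are placeholders; only the sampling settings (temperature=1.0, top_p=1.0) come from the description above.

```python
# Sketch of sampling a rejected response from the SFT model; the checkpoint
# path and prompt are placeholders, not the repository's actual script.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/Qwen2.5-SFT"   # hypothetical local SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings from the dataset description: temperature=1.0, top_p=1.0.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=1.0,
                        max_new_tokens=512)
rejected = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```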

RESULTS ON MATH TASKS

LPO achieves a score of 88.86 on the GSM8K benchmark, representing a 4.71-point improvement over the SFT model and surpassing the performance of Qwen2.5-Instruct. In contrast, DPO exhibits a 1.81-point degradation compared to the SFT baseline. As noted in DPOP, DPO often fails to achieve strong results on mathematical reasoning tasks.

RESULTS ON TEXT-TO-SPEECH TASKS

The LPO algorithm demonstrates significant improvements in emotional expressiveness and fidelity compared to the SFT model, while exhibiting a slight decrease in stability. This outcome validates the effectiveness of the LPO algorithm in the field of speech generation.

RESULTS ON ASR TASKS

Constrained by the base model's fundamental capabilities, our model did not achieve state-of-the-art (SOTA) performance after SFT; nevertheless, the LPO algorithm effectively reduced the speech recognition error rate.

Citation

@misc{wang2025linearpreferenceoptimizationdecoupled,
      title={Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization}, 
      author={Rui Wang and Qianguo Sun and Chao Song and Junlong Wu and Tianrong Chen and Zhiyun Zeng and Yu Li},
      year={2025},
      eprint={2508.14947},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.14947}, 
}

License

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization © 2025 by Rui Wang, Qianguo Sun, Chao Song, Junlong Wu, Tianrong Chen, Zhiyun Zeng, Yu Li is licensed under CC BY-NC-ND 4.0
