We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term that preserves the quality of the chosen response. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Extensive experiments demonstrate that LPO consistently improves performance across general text, math, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we publicly release the source code, models, and training data.
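
For intuition, the sketch below shows one way the three components described above could be combined in PyTorch. It is only an illustration under assumed hyperparameter names (`offset`, `lam_pos`, `lam_neg`); the exact formulation and gradient-separation mechanism are defined in the paper and the training code and may differ from this sketch.

```python
# Illustrative sketch of an LPO-style objective. NOT the exact loss from the paper:
# the offset/coefficient names and the way the terms are combined are assumptions.
import torch

def lpo_loss_sketch(
    policy_chosen_logps: torch.Tensor,    # (batch,) log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # (batch,) log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # (batch,) log p_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # (batch,) log p_ref(rejected | prompt)
    beta: float = 0.1,       # reward scaling, as in DPO
    offset: float = 1.0,     # assumed target margin for the offset constraint
    lam_pos: float = 0.05,   # assumed weight of the positive (chosen-preserving) regularizer
    lam_neg: float = 0.05,   # assumed tunable coefficient for rejection suppression
) -> torch.Tensor:
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # (1) Gradient decoupling: an absolute difference to a target margin replaces
    #     -log sigmoid, decoupling the gradient scale from the reward gap.
    margin_loss = torch.abs(offset - (chosen_reward - rejected_reward))

    # (2) Positive regularization: discourage the chosen log-likelihood from degrading.
    positive_reg = -lam_pos * policy_chosen_logps

    # (3) Controllable rejection suppression: a linear, separately weighted term that
    #     pushes the rejected log-likelihood down at a rate set by lam_neg.
    rejection_term = lam_neg * policy_rejected_logps

    return (margin_loss + positive_reg + rejection_term).mean()
```
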
We conducted comprehensive experiments across general text tasks, mathematics-specific domains, and speech tasks (TTS and ASR). The results demonstrate consistent improvements across all models. The corresponding trained open-source models for each scenario are listed below.
| Model Version | Huggingface | Application |
|---|---|---|
| General Text Task on preference dataset | HuggingFace | Text Q&A |
| General Text Task on instruct dataset | HuggingFace | Text Q&A |
| Math Model | HuggingFace | Math Q&A |
| TTS Model | HuggingFace | Chinese TTS inference / text dialogue / long-CoT |
| ASR Model | HuggingFace | ASR |
Clone and Install

- Clone the repo

```bash
git clone git@github.com:IDEA-Emdoor-Lab/LPO.git
cd LPO
```

- Set up the environment

```bash
conda create -n lpo -y python=3.10
conda activate lpo
pip install -r requirements.txt
```

We have open-sourced our data construction pipeline and training code, detailed below:
Step 1: Data Pipeline

The data construction process consists of three components: candidate generation, logit computation for the data, and conversion to mmap-format training data.

```bash
cd ./toolkits/lpo_data_preprocessing
sbatch scripts/gene_lpo_lam_cand.sh
```
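
For a conceptual view of the logit-computation component (the actual pipeline is driven by the sbatch script above), the sketch below computes the sequence log-probability of a response under a model, which is the per-sample quantity a chosen/rejected pair needs before being packed into mmap format. The model path is a placeholder, and the released scripts may compute this differently (e.g., with batching and exact prompt masking).

```python
# Minimal sketch of per-sample sequence log-probability computation.
# Assumes the prompt tokenization is a prefix of the joint tokenization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; use the model being aligned
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def sequence_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1]               # logits predicting token t+1
    labels = full_ids[:, 1:]
    token_logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = token_logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()   # keep only response tokens
```
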
Step 2: Model Training

```bash
cd ./examples/qwen2_5
sbatch run_mcore_qwen_xpo.sh
```

To validate the effectiveness of the algorithm, we conducted comprehensive experiments in four areas: general text tasks, domain-specific mathematical text tasks, TTS speech generation, and ASR.
During the alignment phase, we validated the algorithm's robustness using both noisy training data and high-quality preference training data.
- Infinity-Preference: An open-source, high-quality preference dataset characterized by subtle distinctions between chosen and rejected responses, minimal noise, and greater learning difficulty.
- Infinity-instruct-1w: We randomly selected 10,000 samples from the remaining Infinity-Instruct data. The responses in the original Infinity-Instruct dataset serve as the chosen set, while responses generated by Qwen2.5-SFT (temperature=1.0, top_p=1.0) serve as the rejected set (see the sketch below). This constructed training dataset is of lower quality than Infinity-Preference.
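
For concreteness, here is a minimal sketch of how such chosen/rejected pairs could be assembled with the sampling settings above. The checkpoint path, field names, and chat-template usage are placeholders; the repository's data-construction scripts may differ.

```python
# Sketch of building an Infinity-instruct-1w-style preference pair.
# Placeholders: model path and the "prompt"/"response" field names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/qwen2.5-sft"  # placeholder for the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def build_pair(sample: dict) -> dict:
    """Keep the original response as `chosen`; sample a `rejected` one from the SFT model."""
    messages = [{"role": "user", "content": sample["prompt"]}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, do_sample=True, temperature=1.0, top_p=1.0, max_new_tokens=1024)
    rejected = tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    return {"prompt": sample["prompt"], "chosen": sample["response"], "rejected": rejected}
```
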

LPO achieves a score of 88.86 on the GSM8K benchmark, representing a 4.71-point improvement over the SFT model and surpassing the performance of Qwen2.5-Instruct. In contrast, DPO exhibits a 1.81-point degradation compared to the SFT baseline. As noted in DPOP, DPO often fails to achieve strong results on mathematical reasoning tasks.

The LPO algorithm demonstrates significant improvements in emotional expressiveness and fidelity compared to the SFT model, while exhibiting a slight decrease in stability. This outcome validates the effectiveness of the LPO algorithm in the field of speech generation.

For ASR, our model is constrained by the base model's fundamental capabilities and did not reach state-of-the-art (SOTA) performance at the SFT stage; nevertheless, the LPO algorithm effectively reduced the speech recognition error rate.

Citation

```bibtex
@misc{wang2025linearpreferenceoptimizationdecoupled,
      title={Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization},
      author={Rui Wang and Qianguo Sun and Chao Song and Junlong Wu and Tianrong Chen and Zhiyun Zeng and Yu Li},
      year={2025},
      eprint={2508.14947},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.14947},
}
```
Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization © 2025 by Rui Wang, Qianguo Sun, Chao Song, Junlong Wu, Tianrong Chen, Zhiyun Zeng, Yu Li is licensed under CC BY-NC-ND 4.0


