OptiMHC is an optimum rescoring pipeline for immunopeptidomics data. It enhances peptide identification by integrating multiple feature generators and machine learning-based rescoring.
git clone https://github.com/5h4ng/OptiMHC.git
cd OptiMHC
pip install -e .Using a YAML configuration file is recommended because it provides a more flexible and user-friendly way to configure the pipeline.
optimhc pipeline --config /path/to/config.yamlNote: The default configuration is stored in optimhc/core/config.py. Your custom configuration will override the default values.
The pipeline can be configured by using a YAML file. This file defines the input settings, the list of feature generators, rescore parameters, and (optionally) experiment configurations. Below you will find a table summarizing the main configuration parameters along with examples and descriptions.
| Parameter | Type | Example | Description |
|---|---|---|---|
experimentName |
String | classI_example |
Name of the experiment and output subdirectory name. |
inputType |
String | pepxml |
Type of input file. Supported values: pepxml, pin. |
inputFile |
String or List | ./data/xxx.pep.xml |
Path(s) to the input PSM file(s). |
decoyPrefix |
String | DECOY_ |
Prefix used to identify decoy sequences. |
outputDir |
String | ./results |
Base directory where output files, logs and figures are stored. |
visualization |
Boolean | True |
Enable or disable generation of visualization plots. |
removePreNxtAA |
Boolean | False |
Remove pre/post neighboring amino acids in sequence processing. |
numProcesses |
Integer | 32 |
Number of parallel processes to use. |
showProgress |
Boolean | True |
Show progress information during execution. |
logLevel |
String | INFO |
Logging level (DEBUG, INFO, WARNING, ERROR). Default is "INFO". |
modificationMap |
Dictionary | { '147.035385': 'UNIMOD:35' } |
Maps FULL modified residue masses (amino acid+modification) to their 'UNIMOD' identifiers. These masses can be found in the pepXML parameters section. See https://www.unimod.org/ for details. |
allele |
List | [HLA-A*02:02] |
List of alleles for which predictions will be computed. |
toFlashLFQ |
Boolean | True |
Whether to export the rescored results at the FDR threshold defined in rescore.testFDR into a FlashLFQ‑compatible format for downstream quantification. |
featureGenerator |
List of Dictionaries | See table below | List of feature generator configurations (each with a name and optional params). |
rescore |
Dictionary | See table below | Rescore settings including FDR threshold, model and number of jobs. |
Each feature generator is specified with its name and an optional params subsection. Some common generators include:
| Generator Name | Example Parameters | Description |
|---|---|---|
Basic |
N/A | Generates basic sequence features. |
SpectralSimilarity |
mzmlDir: ./dataspectrumIdPattern: (.+?)\.\d+\.\d+\.\d+model: AlphaPeptDeep_ms2_genericcollisionEnergy: 28instrument: LUMOStolerance: 20numTopPeaks: 36url: koina.wilhelmlab.org:443 |
Computes features based on the similarity between experimental spectra and predicted spectra. The spectrumIdPattern is a regular expression used to extract mzML file names from spectrum IDs. The default pattern (.+?)\.\d+\.\d+\.\d+ expects spectrum IDs in the format "filename.scan.scan.charge". The tolerance parameter (10-50 ppm) sets the mass tolerance for peak matching. See more options on https://koina.proteomicsdb.org/ |
DeepLC |
calibrationCriteria: expectlowerIsBetter: TruecalibrationSize: 0.1 |
Creates retention time predictions by calibrating using DeepLC. The calibrationCriteria should be set to a score field in the PSM data (e.g., expect, xcorr, hyperscore). |
OverlappingPeptide |
minOverlapLength: 7minLength: 7maxLength: 20overlappingScore: expect |
Generates overlapping peptide features for grouping similar peptides. The overlappingScore should be set to a score field in the PSM data (e.g., expect, xcorr, hyperscore). |
PWM |
class: I |
Generates position weight matrix features for MHC class I and class II peptides. |
MHCflurry |
N/A | Predicts class I binding affinities using the MHCflurry model. |
NetMHCpan |
N/A | Predicts class I peptide-MHC binding affinity using NetMHCpan. |
NetMHCIIpan |
N/A | Predicts class II peptide-MHC binding affinity using NetMHCIIpan. |
Rescore parameters control how the rescoring step is executed and include:
| Parameter | Type | Example | Description |
|---|---|---|---|
testFDR |
Float | 0.01 |
The false-discovery rate threshold at which to evaluate the learned models. |
model |
String | Percolator |
Model to use for rescoring (valid options include Percolator, XGBoost, or RandomForest). |
numJobs |
Integer | 4 |
The number of parallel jobs to run. This value is passed to Scikit-learn's n_jobs parameter to control parallelism for model training or scoring. Set to -1 to use all available CPU cores. |
Below is an example YAML configuration for class I based on the latest pipeline version:
experimentName: classI_example
inputType: pepxml
inputFile:
- ./examples/data/YE_20180428_SK_HLA_A0202_3Ips_a50mio_R1_01.pep.xml
decoyPrefix: DECOY_
outputDir: ./examples/results/
visualization: True
removePreNxtAA: False
numProcesses: 32
showProgress: True
# Mapping of FULL modified residue masses (residue+modification) to UNIMOD IDs
# These masses can be found in pepXML parameters section
modificationMap:
"147.035385": "UNIMOD:35" # Oxidation (M) - full modified residue mass
"160.030649": "UNIMOD:4" # Carbamidomethyl (C) - full modified residue mass
# Allele settings
allele:
- HLA-A*02:02
# Feature generator configurations
featureGenerator:
- name: Basic
- name: SpectralSimilarity
params:
mzmlDir: ./examples/data
spectrumIdPattern: (.+?)\.\d+\.\d+\.\d+
model: AlphaPeptDeep_ms2_generic
collisionEnergy: 28
instrument: LUMOS
tolerance: 20
numTopPeaks: 36
url: koina.wilhelmlab.org:443
- name: DeepLC
params:
calibrationCriteria: expect
lowerIsBetter: True
calibrationSize: 0.1
- name: OverlappingPeptide
params:
minOverlapLength: 7
minLength: 7
maxLength: 20
overlappingScore: expect
- name: PWM
params:
class: I
- name: MHCflurry
- name: NetMHCpan
# Rescore settings
rescore:
testFDR: 0.01
model: Percolator
numJobs: 4While we recommend using the YAML configuration file, you can also use command-line parameters to configure the pipeline:
Note: The command-line configuration mode has not been fully tested.
optimhc pipeline \
--inputType pepxml \
--inputFile ./data/YE_20180428_SK_HLA_A0202_3Ips_a50mio_R1_01.pep.xml \
--decoyPrefix DECOY_ \
--outputDir ./results \
--visualization \
--numProcesses 32 \
--allele HLA-A*02:02 \
--logLevel INFO \
--featureGenerator '{"name": "Basic"}' \
--testFDR 0.01 \
--model PercolatorNote: If you use both YAML configuration file and command-line parameters, command-line parameters will override the corresponding values in the YAML configuration file.
The --featureGenerator option accepts JSON formatted strings that define the feature generator configuration. You can specify multiple feature generators by using the option multiple times.
But be careful that if you use --featureGenerator in command-line, all your feature generator configurations in YAML file (--config) will be ignored.
Thus, rather than using both methods simultaneously, use either command-line arguments or YAML for feature generator configuration.
Here are some examples:
Basic feature generator (no parameters)
--featureGenerator '{"name": "Basic"}'SpectralSimilarity with parameters
--featureGenerator '{
"name": "SpectralSimilarity",
"params": {
"mzmlDir": "./data",
"spectrumIdPattern": "(.+?)\.\d+\.\d+\.\d+",
"model": "AlphaPeptDeep_ms2_generic",
"collisionEnergy": 28,
"instrument": "LUMOS",
"tolerance": 20,
"numTopPeaks": 36,
"url": "koina.wilhelmlab.org:443"
}
}'Multiple feature generators
--featureGenerator '{"name": "Basic"}' \
--featureGenerator '{
"name": "SpectralSimilarity",
"params": {
"mzmlDir": "./data",
"model": "AlphaPeptDeep_ms2_generic"
}
}' \
--featureGenerator '{
"name": "DeepLC",
"params": {
"calibrationCriteria": "expect",
"lowerIsBetter": true,
"calibrationSize": 0.1
}
}'Some tips for JSON format
- Use single quotes (
') to wrap the entire JSON string - All JSON strings must be valid JSON format (e.g., use
trueinstead ofTrue,falseinstead ofFalse) - For complex parameters, you can use a single line with proper escaping:
--featureGenerator '{"name":"SpectralSimilarity","params":{"mzmlDir":"./data","model":"AlphaPeptDeep_ms2_generic"}}'optimhc guioptimhc --help
optimhc pipeline --help
optimhc experiment --help