Fix issues with MaxQuant and Fragpipe tables #88

Merged
ammarcsj merged 10 commits into main from investigate_mq_fragpipe_issues
Mar 17, 2025

Conversation

@ammarcsj
Member

@ammarcsj ammarcsj commented Mar 14, 2025

Addressing issue #85

  • Change the intable config for the GUI back to the default version that is also used in the Python version
  • Update config for MaxQuant evidence.txt
  • Extend sample name extraction in GUI to parquet files
  • Extend sample name extraction in GUI to wide-format tables (MaxQuant peptides.txt and Fragpipe)
  • Disable ML for wide-format tables
  • Add quicktests that run short AlphaQuant runs on different MaxQuant, Fragpipe, DIA-NN, and Spectronaut tables
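
As a rough illustration of the wide-format sample name extraction described above, here is a minimal sketch. It assumes that sample columns are marked by a fixed prefix or suffix (the `quant_pre_or_suffix` config entry mentioned in the review below); the helper name `sample_names_from_wide_header` and the example table are hypothetical and are not part of AlphaQuant.

```python
import csv
import io


def sample_names_from_wide_header(columns, quant_pre_or_suffix):
    """Hypothetical helper: derive sample names from wide-format column headers.

    A column counts as a sample column if it starts or ends with the given
    marker; the marker is stripped only from that end of the column name.
    """
    if not quant_pre_or_suffix:
        # No marker configured, so no sample columns can be identified.
        return set()
    names = set()
    for col in columns:
        if col.startswith(quant_pre_or_suffix):
            names.add(col[len(quant_pre_or_suffix):])
        elif col.endswith(quant_pre_or_suffix):
            names.add(col[: -len(quant_pre_or_suffix)])
    return names


# Made-up, tab-separated wide table; only the header row is read.
table = io.StringIO(
    "Sequence\tIntensity sample_A\tIntensity sample_B\n"
    "PEPTIDEK\t1.0\t2.0\n"
)
header = next(csv.reader(table, delimiter="\t"))
sample_names = sample_names_from_wide_header(header, "Intensity ")
```

Reading only the header row keeps the lookup cheap for large tables, and stripping the marker from one end avoids altering sample names that happen to contain the marker text elsewhere.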

@ammarcsj ammarcsj marked this pull request as ready for review March 14, 2025 14:12
@ammarcsj ammarcsj requested a review from mschwoer March 14, 2025 14:13
@github-actions

The following feedback could not be added to specific lines, but still contains valuable information:

Looking at the code changes, I've identified several issues that need to be addressed:

[
  {
    "change_id": 1,
    "file_name": "./alphaquant/diffquant/condpair_analysis.py",
    "start_line": 237,
    "end_line": 238,
    "proposed_code": "def check_if_has_precursor_nodes(condpair_node):\n    try:\n        return condpair_node.children[0].children[0].children[0].children[0].type == \"mod_seq_charge\"\n    except (IndexError, AttributeError):\n        return False",
    "comment": "The original code doesn't handle cases where the node structure might not have the expected depth. This can cause IndexError exceptions when processing certain data types. The try-except block provides a safer way to check for precursor nodes."
  },
  {
    "change_id": 2,
    "file_name": "./alphaquant/config/quant_reader_config.yaml",
    "start_line": 145,
    "end_line": 154,
    "proposed_code": "maxquant_evidence:\n  format: longtable\n  sample_ID: Experiment\n  quant_ID:\n    precursor: Intensity\n  protein_cols:\n   - Gene names\n  ion_hierarchy:\n    precursor:\n      order: [SEQ, MOD, CHARGE]\n      mapping:\n        SEQ:\n          - Sequence\n        MOD:\n          - Mass\n        CHARGE:\n          - Charge\n  filters:\n    reverse:\n      param: Reverse\n      comparator: \"!=\"\n      value: \"+\"\n    contaminant:\n      param: Potential contaminant\n      comparator: \"!=\"\n      value: \"+\"\n  ml_level: SEQ\n  use_iontree: False",
    "comment": "The MaxQuant evidence configuration was updated to use a hierarchical structure for ion representation rather than a flat list of ion_cols. This improves compatibility with the AlphaQuant pipeline and provides better support for proper precursor quantification. The filters for reverse and contaminant proteins were also added to improve data quality."
  },
  {
    "change_id": 3,
    "file_name": "./alphaquant/run_pipeline.py",
    "start_line": 124,
    "end_line": 125,
    "proposed_code": "    input_type, config_dict, _ = config_dict_loader.get_input_type_and_config_dict(input_file_original, input_type_to_use)\n    annotation_file = load_annotation_file(input_file_original, input_type, annotation_columns)\n    use_ml = check_if_table_supports_ml(config_dict)",
    "comment": "The original code doesn't capture the config_dict from the return value, which is needed later to check if ML is supported. Also, a new use_ml variable is introduced to track whether ML should be used for the current input type."
  },
  {
    "change_id": 4,
    "file_name": "./alphaquant/run_pipeline.py",
    "start_line": 240,
    "end_line": 240,
    "proposed_code": "def check_if_table_supports_ml(config_dict):\n    return config_dict[\"format\"] == \"longtable\"",
    "comment": "Added a helper function to check if machine learning is supported for the current input format. Currently, ML is only supported for longtable formats."
  },
  {
    "change_id": 5,
    "file_name": "./alphaquant/run_pipeline.py",
    "start_line": 130,
    "end_line": 136,
    "proposed_code": "        input_file_reformat = load_ptm_input_file(input_file = input_file_original, input_type_to_use = \"spectronaut_ptm_fragion\", results_dir = results_dir, samplemap_df = samplemap_df, modification_type = modification_type, organism = organism)\n        if use_ml:\n            ml_input_file = load_ml_info_file(input_file_original, input_type, modification_type)\n\n    elif \"fragment_precursorfiltered.matrix\" in input_file_original:\n        alphadia_tableprocessor = aq_table_alphadiareader.AlphaDIAFragTableProcessor(input_file_original)\n        input_file_reformat = alphadia_tableprocessor.input_file_reformat\n        if use_ml:\n            ml_input_file = alphadia_tableprocessor.ml_info_file",
    "comment": "Modified the code to check if ML is supported before loading ML information files. This prevents errors when trying to use ML with formats that don't support it."
  },
  {
    "change_id": 6,
    "file_name": "./alphaquant/run_pipeline.py",
    "start_line": 142,
    "end_line": 143,
    "proposed_code": "        input_file_reformat = load_input_file(input_file_original, input_type)\n        if use_ml:\n            ml_input_file = load_ml_info_file(input_file_original, input_type)",
    "comment": "Added the same ML support check in the general case, ensuring ML info is only loaded when the format supports it."
  },
  {
    "change_id": 7,
    "file_name": "./alphaquant/utils/reader_utils.py",
    "start_line": 7,
    "end_line": 10,
    "proposed_code": "def read_file(file_path, decimal=\".\", usecols=None, chunksize=None, sep=None, nrows=None):\n    file_path = str(file_path)\n    if \".parquet\" in file_path:\n        if nrows is not None:\n            LOGGER.warning(f\"nrows parameter is set, but not supported for parquet files. Ignoring nrows parameter.\")\n        return _read_parquet_file(file_path, usecols=usecols, chunksize=chunksize)",
    "comment": "Added 'nrows' parameter to the read_file function to allow reading only a subset of rows. This is helpful for examining large files or for testing. The function also adds a warning when nrows is used with parquet files since that format doesn't directly support it."
  },
  {
    "change_id": 8,
    "file_name": "./alphaquant/utils/reader_utils.py",
    "start_line": 28,
    "end_line": 30,
    "proposed_code": "            usecols=usecols,\n            encoding=\"latin1\",\n            chunksize=chunksize,\n            nrows=nrows,\n        )",
    "comment": "Added the nrows parameter to the pandas.read_csv call to enable partial file reading."
  },
  {
    "change_id": 9,
    "file_name": "./alphaquant/ui/dashboard_parts_run_pipeline.py",
    "start_line": 886,
    "end_line": 903,
    "proposed_code": "\t\t\t\tinput_file = self.path_analysis_file.value\n\t\t\t\t_, config_dict, sep = config_dict_loader.get_input_type_and_config_dict(input_file)\n\t\t\t\tif config_dict[\"format\"] == \"longtable\":\n\t\t\t\t\tsample_column = config_dict[\"sample_ID\"]\n\t\t\t\t\tsample_names = set()\n\n\t\t\t\t\tfor chunk in aq_reader_utils.read_file(input_file, sep=sep, usecols=[sample_column], chunksize=400000):\n\t\t\t\t\t\tsample_names.update(chunk[sample_column].unique())\n\t\t\t\t\tself.sample_names = sample_names\n\t\t\t\telif config_dict[\"format\"] == \"widetable\":\n\t\t\t\t\t# Read the headers first to identify sample columns\n\t\t\t\t\theaders = aq_reader_utils.read_file(input_file, sep=sep, nrows=0).columns.tolist()\n\n\t\t\t\t\tquant_pre_or_suffix = config_dict.get(\"quant_pre_or_suffix\")\n\t\t\t\t\t# Filter headers to find those with the prefix or suffix\n\t\t\t\t\tsample_columns = [\n\t\t\t\t\t\tcol for col in headers if (\n\t\t\t\t\t\t\tcol.startswith(quant_pre_or_suffix) or\n\t\t\t\t\t\t\tcol.endswith(quant_pre_or_suffix)\n\t\t\t\t\t\t)\n\t\t\t\t\t]\n\t\t\t\t\tself.sample_names = set([col.replace(quant_pre_or_suffix, \"\") for col in sample_columns])\n\t\t\t\telse:\n\t\t\t\t\tprint(\"ERROR: Could not identify sample names in the input file.\")\n\t\t\t\t\tself.run_pipeline_error.object = \"Could not identify sample names. Please check your input file.\"\n\t\t\t\t\tself.run_pipeline_error.visible = True",
    "comment": "Added support for wide format tables in the UI sample detection code. Previously it only worked with longtable formats. The code now checks the format type and uses different methods to extract sample names based on the table format."
  }
]
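
For readers who want to try the ideas behind changes 1 and 4 outside the codebase, the following is a minimal sketch, not the AlphaQuant implementation. The `Node` class, the tree layout, and the function names `has_precursor_nodes` and `table_supports_ml` are assumptions made for illustration; only the `"mod_seq_charge"` type string and the `"longtable"` format value come from the proposals above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree node with a type and child nodes."""
    type: str
    children: list = field(default_factory=list)


def has_precursor_nodes(condpair_node, depth=4, expected_type="mod_seq_charge"):
    """Follow the first child `depth` times and compare the type found there.

    Returns False instead of raising when the tree is shallower than expected.
    """
    node = condpair_node
    for _ in range(depth):
        children = getattr(node, "children", None)
        if not children:
            return False
        node = children[0]
    return getattr(node, "type", None) == expected_type


def table_supports_ml(config_dict):
    """Treat ML as available only for long-format tables; a missing key means no."""
    return config_dict.get("format") == "longtable"


# A tree that reaches precursor level, and one that stops early.
deep_tree = Node("condpair", [Node("gene", [Node("seq", [Node("mod_seq", [Node("mod_seq_charge")])])])])
shallow_tree = Node("condpair", [Node("gene")])

deep_result = has_precursor_nodes(deep_tree)
shallow_result = has_precursor_nodes(shallow_tree)
```

Walking the tree explicitly avoids a broad exception handler, so unrelated errors still surface while a too-shallow tree simply yields `False`.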

@github-actions

Number of tokens: input_tokens=42831 output_tokens=2417 max_tokens=4096
review_instructions=''
config={}
thinking: []

Contributor

@mschwoer mschwoer left a comment

apparently, the code-review bot also prefers ruff-formatted code :-p

@ammarcsj ammarcsj merged commit 7c67554 into main Mar 17, 2025
5 checks passed
@ammarcsj ammarcsj deleted the investigate_mq_fragpipe_issues branch March 17, 2025 13:13