Skip to content

feat(prepro): add function to process phenotype_values and extract regex#5885

Open
anna-parker wants to merge 9 commits intomainfrom
improve-nextclade-metadata
Open

feat(prepro): add function to process phenotype_values and extract regex#5885
anna-parker wants to merge 9 commits intomainfrom
improve-nextclade-metadata

Conversation

@anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Jan 27, 2026

resolves #5902, #5901

Screenshot

This PR is currently being used on the private virus preview:

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude bot added the preprocessing Issues related to the preprocessing component label Jan 27, 2026
@anna-parker anna-parker changed the title feat: add function to process phenotype_values feat: add function to process phenotype_values and extract regex Jan 27, 2026
@anna-parker anna-parker force-pushed the restructure-anya branch 3 times, most recently from 03c0980 to 0deeeb0 Compare January 29, 2026 19:32
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from bfea53c to a874d49 Compare January 30, 2026 11:04
Base automatically changed from restructure-anya to main January 30, 2026 15:28
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from a874d49 to b62278b Compare January 30, 2026 19:59
@anna-parker anna-parker marked this pull request as ready for review February 2, 2026 13:31
@anna-parker anna-parker changed the title feat: add function to process phenotype_values and extract regex feat(prepro): add function to process phenotype_values and extract regex Feb 2, 2026
@claude
Copy link
Contributor

claude bot commented Feb 2, 2026

Claude finished @anna-parker's task in 1m 39s —— View job


Code Review Complete

I've reviewed the PR that adds functionality to process phenotype_values and extract regex patterns. Here's my assessment:

Summary

The implementation adds two useful preprocessing functions (extract_regex and implicit support for phenotype_values processing) that extend the metadata processing capabilities. The code is generally well-structured and follows existing patterns. However, there are several areas that need attention before merging.

Key Issues

Critical:

  1. Missing input validation in process_phenotype_values (preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py:1243) - The function doesn't validate that args contains the required "name" key, which could cause KeyError exceptions.

  2. Overly broad exception handling (line 1249) - Using bare except Exception could mask programming errors. Should be more specific about expected exceptions from ast.literal_eval.

Important:
3. Incomplete documentation - The process_phenotype_values function is used in the code but not documented in the README alongside the other preprocessing functions.

  1. Limited test coverage - The tests for process_phenotype_values (test_nextclade_preprocessing.py:1250) only cover the happy path. Missing tests for error cases, edge cases, and invalid inputs.

  2. Ambiguous behavior with duplicates (line 1246) - If multiple phenotype entries have the same name, the function silently returns the last match without warning.

Minor:
6. Type handling - The function converts all values to strings, which might lose type information (e.g., distinguishing between 0.0 and 0).

  1. Error messages could be more informative by including which phenotype name was being searched for.

Positive Aspects

  • Clean separation of concerns with the process_phenotype_values function
  • Good test coverage for the extract_regex functionality with both success and failure cases
  • Consistent code style matching the existing codebase
  • The regex extraction implementation is robust with proper error handling

Recommendations

  1. Add input validation to process_phenotype_values
  2. Make exception handling more specific
  3. Add comprehensive tests for edge cases and error conditions
  4. Document the process_phenotype_values function in the README
  5. Consider whether duplicate phenotype names should trigger a warning

The functionality is sound and the implementation follows good patterns, but the missing validation and documentation need to be addressed.


@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from 2641a12 to d189ec4 Compare February 4, 2026 10:54
@anna-parker
Copy link
Contributor Author

Comment from @rneher in the private preview we use the extract_regex function for obtaining the subtype from the reference name and there it would be good to have the subtype capitalized.

I proposed adding this as an option to the function.

@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from cd7c474 to d950ab8 Compare February 6, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prepro nextclade: add function to extract/parse string of reference name

1 participant