feat(prepro): add function to process phenotype_values and extract regex by anna-parker · Pull Request #5885 · loculus-project/loculus

anna-parker · 2026-01-27T15:32:00Z

resolves #5902, #5901

Screenshot

This PR is currently being used on the private virus preview:

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

claude · 2026-02-02T13:31:39Z

Claude finished @anna-parker's task in 1m 39s —— View job

Code Review Complete

I've reviewed the PR that adds functionality to process phenotype_values and extract regex patterns. Here's my assessment:

Summary

The implementation adds two useful preprocessing functions (extract_regex and implicit support for phenotype_values processing) that extend the metadata processing capabilities. The code is generally well-structured and follows existing patterns. However, there are several areas that need attention before merging.

Key Issues

Critical:

Missing input validation in process_phenotype_values (preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py:1243) - The function doesn't validate that args contains the required "name" key, which could cause KeyError exceptions.
Overly broad exception handling (line 1249) - Using bare except Exception could mask programming errors. Should be more specific about expected exceptions from ast.literal_eval.

Important:
3. Incomplete documentation - The process_phenotype_values function is used in the code but not documented in the README alongside the other preprocessing functions.

Limited test coverage - The tests for process_phenotype_values (test_nextclade_preprocessing.py:1250) only cover the happy path. Missing tests for error cases, edge cases, and invalid inputs.
Ambiguous behavior with duplicates (line 1246) - If multiple phenotype entries have the same name, the function silently returns the last match without warning.

Minor:
6. Type handling - The function converts all values to strings, which might lose type information (e.g., distinguishing between 0.0 and 0).

Error messages could be more informative by including which phenotype name was being searched for.

Positive Aspects

Clean separation of concerns with the process_phenotype_values function
Good test coverage for the extract_regex functionality with both success and failure cases
Consistent code style matching the existing codebase
The regex extraction implementation is robust with proper error handling

Recommendations

Add input validation to process_phenotype_values
Make exception handling more specific
Add comprehensive tests for edge cases and error conditions
Document the process_phenotype_values function in the README
Consider whether duplicate phenotype names should trigger a warning

The functionality is sound and the implementation follows good patterns, but the missing validation and documentation need to be addressed.

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

preprocessing/nextclade/tests/test_nextclade_preprocessing.py

preprocessing/nextclade/README.md

anna-parker · 2026-02-05T10:24:43Z

Comment from @rneher in the private preview we use the extract_regex function for obtaining the subtype from the reference name and there it would be good to have the subtype capitalized.

I proposed adding this as an option to the function.

claude bot added the preprocessing Issues related to the preprocessing component label Jan 27, 2026

anna-parker changed the title ~~feat: add function to process phenotype_values~~ feat: add function to process phenotype_values and extract regex Jan 27, 2026

anna-parker force-pushed the restructure-anya branch 3 times, most recently from 03c0980 to 0deeeb0 Compare January 29, 2026 19:32

anna-parker force-pushed the improve-nextclade-metadata branch from bfea53c to a874d49 Compare January 30, 2026 11:04

This was referenced Jan 30, 2026

prepro nextclade: add support for parsing results with phenotypeValues #5901

Open

prepro nextclade: add function to extract/parse string of reference name #5902

Open

Base automatically changed from restructure-anya to main January 30, 2026 15:28

anna-parker force-pushed the improve-nextclade-metadata branch from a874d49 to b62278b Compare January 30, 2026 19:59

anna-parker marked this pull request as ready for review February 2, 2026 13:31

anna-parker changed the title ~~feat: add function to process phenotype_values and extract regex~~ feat(prepro): add function to process phenotype_values and extract regex Feb 2, 2026