Performance & correctness improvements in haplogroup prediction and pileup parsing #40
Open
cascadingstyletrees wants to merge 9 commits into genid:master
cascadingstyletrees (Contributor) commented on Jan 28, 2026
- Optimize pileup base counting by stripping indel sequences with regex and using Counter, ensuring insertions/deletions are skipped correctly while computing base frequencies.
- Reduce memory overhead when loading the reference genome by caching a single concatenated string instead of a list of characters.
- Avoid repeated tree/table loads in haplogroup prediction by instantiating the Tree once for multiprocessing workers and reusing the intermediate table data in the legacy predictor loop.
- Ensure only the most specific haplogroup nodes are considered by tracking covered nodes during scoring and pruning less-specific paths in the prediction loop.
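To illustrate the last bullet, here is a minimal sketch of keeping only the most specific nodes by tracking which ancestors are already covered. It is not the code from this PR: the `parent` mapping, the function name `most_specific`, and the haplogroup labels are hypothetical, and the tree is assumed to be acyclic.

```python
def most_specific(candidates, parent):
    """Return the candidates that are not an ancestor of another candidate.

    `parent` maps each node to its parent; root nodes have no entry.
    (Hypothetical data layout, assumed for illustration only.)
    """

    def depth(node):
        # Number of steps from the node up to its root.
        steps = 0
        while node in parent:
            node = parent[node]
            steps += 1
        return steps

    covered = set()  # ancestors of nodes that were already kept
    kept = []
    # Visit the deepest (most specific) nodes first.
    for node in sorted(candidates, key=depth, reverse=True):
        if node in covered:
            continue  # a more specific descendant was already kept
        kept.append(node)
        ancestor = parent.get(node)
        while ancestor is not None:
            covered.add(ancestor)
            ancestor = parent.get(ancestor)
    return kept


parent = {"R1": "R", "R1b": "R1", "R1a": "R1", "J2": "J"}
print(most_specific(["R", "R1", "R1b", "J"], parent))
```

Visiting the deepest nodes first means every ancestor is marked as covered before it is reached, so a single pass is enough.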
Moved the reading of the intermediate tree table outside the sample processing loop. This avoids reading the same file from disk for every sample, improving performance. Measured improvement: ~23% speedup on 50 samples (4.47s -> 3.41s).
Moved `Tree` object creation outside the sample loop to avoid repeated I/O and parsing. Passed the `tree` object to the worker function via `partial`.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
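The pattern described in this commit could look roughly like the sketch below. The `Tree` class here is a small hypothetical stand-in, not Yleaf's actual class, and `predict_sample` and `predict_all` are names assumed for illustration.

```python
from functools import partial
from multiprocessing import Pool


class Tree:
    """Hypothetical stand-in for an expensive-to-build tree object."""

    def __init__(self, edges):
        self.parent = dict(edges)  # child -> parent

    def depth(self, node):
        steps = 0
        while node in self.parent:
            node = self.parent[node]
            steps += 1
        return steps


def predict_sample(sample, tree):
    """Worker: receives the prebuilt tree instead of constructing its own."""
    name, node = sample
    return name, tree.depth(node)


def predict_all(samples, tree, processes=2):
    # Bind the shared tree once; the pool then only has to supply each sample.
    worker = partial(predict_sample, tree=tree)
    with Pool(processes) as pool:
        return pool.map(worker, samples)


tree = Tree([("R1", "R"), ("R1b", "R1")])  # built once, outside any loop
samples = [("sample_a", "R1b"), ("sample_b", "R")]
worker = partial(predict_sample, tree=tree)
print(list(map(worker, samples)))  # serial run with the same worker
```

On platforms that start worker processes with `spawn`, a call to `predict_all` belongs under an `if __name__ == "__main__":` guard. The tree is pickled when tasks are sent to the workers, so this approach assumes the object is picklable.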
…d of list of characters. (#5)
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
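As a rough illustration of why a single string is cheaper than a list of characters, the sketch below caches a concatenated sequence and compares container sizes. The function names and the module-level `_REFERENCE_CACHE` are assumptions made for this example, not identifiers from the PR.

```python
import sys

_REFERENCE_CACHE: dict[str, str] = {}  # hypothetical cache, keyed by path


def sequence_from_fasta_lines(lines) -> str:
    """Join all non-header FASTA lines into one string."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def load_reference(path: str) -> str:
    """Read the reference once and reuse the same string afterwards."""
    seq = _REFERENCE_CACHE.get(path)
    if seq is None:
        with open(path) as handle:
            seq = sequence_from_fasta_lines(handle)
        _REFERENCE_CACHE[path] = seq
    return seq


seq = sequence_from_fasta_lines([">chrY\n", "ACGT\n", "NNAC\n"])
as_str = seq * 1000
as_list = list(as_str)
# The list stores one pointer per base; the string stores one byte per base.
print(sys.getsizeof(as_list), sys.getsizeof(as_str))
```

Indexing works the same way for both containers (`as_str[i] == as_list[i]`), so callers that only read positions do not need to change.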
- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py`.
- Achieved ~2x performance improvement in benchmarks.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
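A sketch of the general technique, assuming the samtools pileup base-string conventions (`.`/`,` for reference matches, `^` plus one quality character at a read start, `$` at a read end, `+N`/`-N` followed by N indel bases, `*` for a deleted base). This is not the code from the PR; `strip_indels` and `count_bases` are names invented for this example.

```python
import re
from collections import Counter

READ_START = re.compile(r"\^.")          # '^' plus one mapping-quality character
INDEL_HEADER = re.compile(r"[+-](\d+)")  # e.g. '+2' or '-1'; the bases follow


def strip_indels(bases: str) -> str:
    """Drop every indel header together with exactly N following characters."""
    kept = []
    pos = 0
    for match in INDEL_HEADER.finditer(bases):
        if match.start() < pos:
            continue  # defensive: header lies inside an already skipped span
        kept.append(bases[pos:match.start()])
        pos = match.end() + int(match.group(1))  # skip the indel sequence
    kept.append(bases[pos:])
    return "".join(kept)


def count_bases(ref_base: str, bases: str) -> Counter:
    """Count A/C/G/T calls at one pileup position, ignoring indels."""
    # Remove read-start markers first: their quality character may be '+' or '-'.
    cleaned = strip_indels(READ_START.sub("", bases))
    counts = Counter(cleaned.upper())
    matches = counts.pop(".", 0) + counts.pop(",", 0)
    if matches:
        counts[ref_base.upper()] += matches
    # Keep only real base calls; '$', '*', 'N' and other symbols are dropped.
    return Counter({base: counts[base] for base in "ACGT" if counts[base]})


print(count_bases("A", ".,+2GGT-1c*^+.$"))
```

A plain `re.sub(r"[+-]\d+[ACGTNacgtn]+", "", bases)` would be shorter, but it is greedy and can swallow real base calls that directly follow an indel, which is why this sketch reads the declared length instead.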
Contributor
Author
@dionzand Could you please review when you get the chance?
- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py` and updated `.gitignore`.
- Achieved ~2x performance improvement in benchmarks.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
#8)
- Replaced `setup.py` and `requirements.txt` with `pyproject.toml` (setuptools backend).
- Updated Python requirement to >=3.10.
- Unpinned dependencies (`pandas`, `numpy`, etc.) to allow for modern versions.
- Added `bcftools` to `environment_yleaf.yaml`.
- Added `pytest` and `ruff` dev dependencies.
- Applied `ruff` fixes for code modernization (f-strings, type hints, etc.) and formatting.
- Updated `yleaf/Yleaf.py` to use `on_bad_lines='skip'` (replacing deprecated `error_bad_lines`).
- Updated `yleaf/predict_haplogroup.py` to use `math.prod`.
- Removed `six` dependency.
- Verified functionality with existing tests and CLI checks.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
Introduced `EXPECTED_STATES_CACHE` to persist parsed backbone table data across `get_qc1_score` calls. Previously, these files were re-read for every sample whenever the per-sample `QC1_SCORE_CACHE` missed. This change reduces disk I/O significantly. A benchmark showed a ~37x speedup (0.24ms -> 0.0065ms per call) in a synthetic loop. Added `tests/test_predict_haplogroup.py` to verify correctness.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
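A module-level cache of this kind might be sketched as follows. Only the name `EXPECTED_STATES_CACHE` comes from the commit message; the loader function, the two-column tab-separated layout, and the marker names are assumptions made for illustration.

```python
import os
import tempfile

# Parsed table contents, keyed by file path, kept for the life of the process.
EXPECTED_STATES_CACHE: dict[str, dict[str, str]] = {}


def load_expected_states(path: str) -> dict[str, str]:
    """Parse a marker/state table once and serve later calls from memory."""
    cached = EXPECTED_STATES_CACHE.get(path)
    if cached is None:
        cached = {}
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:  # assumed layout: marker<TAB>state
                    cached[fields[0]] = fields[1]
        EXPECTED_STATES_CACHE[path] = cached
    return cached


# Demo: the second call succeeds even though the file no longer exists.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("marker_1\tA\nmarker_2\tD\n")
first = load_expected_states(tmp.name)
os.remove(tmp.name)
second = load_expected_states(tmp.name)  # served from the cache
print(first is second)
```

Because every caller receives the same dictionary object, callers should treat it as read-only; a cache like this also never notices if the file changes on disk during a run.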
Replaced temporary file creation/reading/deletion in `run_vcf` with a direct `subprocess` pipe to `pandas.read_csv`. Also updated `yleaf/predict_haplogroup.py` to use modern type hints (Python 3.10+) to fix a `NameError` in tests. Added `tests/test_vcf_pipe.py` to verify the new pipe implementation.
Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
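The idea of reading a command's output straight from a pipe, without a temporary file, can be sketched as below. To stay runnable without third-party packages, this example swaps in the standard-library `csv` module for pandas and a small Python one-liner for the real command; `pandas.read_csv` also accepts a file-like object such as `proc.stdout`. The function name `read_table_from_command` is hypothetical.

```python
import csv
import subprocess
import sys


def read_table_from_command(cmd: list[str]) -> list[list[str]]:
    """Run `cmd` and parse its tab-separated stdout directly from the pipe."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        # The parser consumes the stream while the process is still running.
        rows = list(csv.reader(proc.stdout, delimiter="\t"))
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return rows


# Stand-in command that prints one tab-separated record.
demo_cmd = [sys.executable, "-c", "print('chrY\\t2781\\tA\\tG')"]
print(read_table_from_command(demo_cmd))
```

Checking the return code after the stream is consumed matters here: a failing command can otherwise look like an empty but valid table.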