Performance & correctness improvements in haplogroup prediction and pileup parsing by cascadingstyletrees · Pull Request #40 · genid/Yleaf

cascadingstyletrees · 2026-01-28T19:01:24Z

Optimize pileup base counting by stripping indel sequences with regex and using Counter, ensuring insertions/deletions are skipped correctly while computing base frequencies.
Reduce memory overhead when loading the reference genome by caching a single concatenated string instead of a list of characters.
Avoid repeated tree/table loads in haplogroup prediction by instantiating the Tree once for multiprocessing workers and reusing the intermediate table data in the legacy predictor loop.
Ensure only the most specific haplogroup nodes are considered by tracking covered nodes during scoring and pruning less-specific paths in the prediction loop.
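The "most specific node" point above can be illustrated with a minimal sketch. This is not Yleaf's code: the `PARENT` mapping, the haplogroup names in it, and the helpers `ancestors` and `most_specific` are all hypothetical, and the real `Tree` class and scoring loop may be organized quite differently.

```python
# Hypothetical toy tree: child -> parent (illustrative names only).
PARENT = {
    "R1b": "R1",
    "R1a": "R1",
    "R1": "R",
    "R": "ROOT",
    "I1": "I",
    "I": "ROOT",
}


def ancestors(node, parent=PARENT):
    """Yield every ancestor of `node`, nearest first."""
    while node in parent:
        node = parent[node]
        yield node


def most_specific(candidates, parent=PARENT):
    """Keep only candidates that are not an ancestor of another candidate."""
    covered = set()
    for node in candidates:
        # Everything above a candidate is "covered" by that candidate.
        covered.update(ancestors(node, parent))
    return [node for node in candidates if node not in covered]
```

With the toy tree, `most_specific(["R", "R1", "R1b", "I"])` keeps `R1b` and `I`, because `R` and `R1` both lie on the path above `R1b`.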

Moved the reading of the intermediate tree table outside the sample processing loop. This avoids reading the same file from disk for every sample, improving performance. Measured improvement: ~23% speedup on 50 samples (4.47s -> 3.41s).
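A rough sketch of the pattern, assuming a tab-separated table and using hypothetical names (`load_intermediate_table`, `predict_all`, `predict_one`) rather than the functions actually touched in this commit:

```python
import csv


def load_intermediate_table(path):
    """Read a tab-separated table from disk into a list of rows."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def predict_all(sample_names, table_path, predict_one):
    """Run `predict_one(name, table)` per sample with a single table read."""
    # The read happens once, before the loop, instead of once per sample.
    table = load_intermediate_table(table_path)
    return {name: predict_one(name, table) for name in sample_names}
```

Every call to `predict_one` receives the same in-memory `table` object, so the cost of the disk read no longer scales with the number of samples.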

Moved `Tree` object creation outside the sample loop to avoid repeated I/O and parsing. Passed the `tree` object to the worker function via `partial`.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
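The `partial` approach might look roughly like the sketch below. The `Tree` class here is a hypothetical stand-in (Yleaf's own `Tree` has a different constructor and interface), and `process_sample` and `run` are illustrative names.

```python
from functools import partial
from multiprocessing import Pool


class Tree:
    """Hypothetical stand-in for an object that is expensive to build."""

    def __init__(self, edges):
        self.parent = dict(edges)  # child -> parent

    def depth(self, node):
        """Number of steps from `node` up to the root."""
        steps = 0
        while node in self.parent:
            node = self.parent[node]
            steps += 1
        return steps


def process_sample(sample, tree):
    """Worker: takes one sample plus the shared, pre-built tree."""
    name, node = sample
    return name, tree.depth(node)


def run(samples, tree, processes=2):
    """Bind the tree once with partial, then map samples over a pool."""
    worker = partial(process_sample, tree=tree)
    with Pool(processes) as pool:
        return pool.map(worker, samples)
```

The tree is built once in the parent process. Note that `partial` objects are pickled when tasks are sent to workers, so the bound object has to be picklable; calling `run(...)` from a script should sit under an `if __name__ == "__main__":` guard.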

…d of list of characters. (#5) Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
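The memory argument behind the reference-genome change can be checked with a small experiment. This is a sketch on synthetic data in CPython, not a measurement of Yleaf itself; the sequence and variable names are made up for illustration.

```python
import sys

# 1,000,000 synthetic bases (placeholder data, not a real reference).
sequence = "ACGT" * 250_000

as_list = list(sequence)      # the list stores one pointer per base
as_string = "".join(as_list)  # an ASCII str stores one byte per base in CPython

# sys.getsizeof reports the size of the container object itself.
list_bytes = sys.getsizeof(as_list)
string_bytes = sys.getsizeof(as_string)

print(f"list: {list_bytes:,} bytes, str: {string_bytes:,} bytes")
```

Indexing and slicing behave the same for both (`as_string[i] == as_list[i]`), so positional lookups do not need the list form.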

- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py`.
- Achieved ~2x performance improvement in benchmarks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
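For readers unfamiliar with the pileup format, the sketch below shows one way to combine `re` and `Counter` for this job. It is an illustration under assumptions, not the code in this PR: `strip_indels` and `count_bases` are hypothetical names, and it assumes samtools-style base strings where `+N<seq>` / `-N<seq>` mark indels, `^` plus one character marks a read start, and `.` / `,` mean "matches the reference".

```python
import re
from collections import Counter

INDEL_RE = re.compile(r"[+-](\d+)")  # indel header, e.g. "+2" or "-13"
READ_START_RE = re.compile(r"\^.")   # "^" plus one mapping-quality character


def strip_indels(bases):
    """Remove every +N<seq> / -N<seq> block, where N gives the length of <seq>."""
    parts = []
    pos = 0
    while True:
        match = INDEL_RE.search(bases, pos)
        if match is None:
            parts.append(bases[pos:])
            return "".join(parts)
        parts.append(bases[pos:match.start()])
        # Jump over the header and the N inserted/deleted characters.
        pos = match.end() + int(match.group(1))


def count_bases(bases, ref_base):
    """Count A/C/G/T calls at one position, ignoring indels and read markers."""
    # Strip read-start markers first so a quality character such as "+"
    # is never mistaken for an indel header.
    cleaned = strip_indels(READ_START_RE.sub("", bases))
    counts = Counter(cleaned.upper())
    matches = counts.pop(".", 0) + counts.pop(",", 0)
    result = Counter({base: counts[base] for base in "ACGT" if counts[base]})
    if matches:
        result[ref_base.upper()] += matches
    return result
```

For example, `count_bases("..+2AG,t^].", "A")` counts four reference matches as `A` and one `T`; the inserted `AG` and the `^]` read-start marker contribute nothing.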

cascadingstyletrees · 2026-01-28T19:23:38Z

@dionzand Could you please review when you get the chance?

- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py` and updated `.gitignore`.
- Achieved ~2x performance improvement in benchmarks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>

#8)

- Replaced `setup.py` and `requirements.txt` with `pyproject.toml` (setuptools backend).
- Updated Python requirement to >=3.10.
- Unpinned dependencies (`pandas`, `numpy`, etc.) to allow for modern versions.
- Added `bcftools` to `environment_yleaf.yaml`.
- Added `pytest` and `ruff` dev dependencies.
- Applied `ruff` fixes for code modernization (f-strings, type hints, etc.) and formatting.
- Updated `yleaf/Yleaf.py` to use `on_bad_lines='skip'` (replacing deprecated `error_bad_lines`).
- Updated `yleaf/predict_haplogroup.py` to use `math.prod`.
- Removed `six` dependency.
- Verified functionality with existing tests and CLI checks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
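As a small illustration of the `math.prod` item: the commit does not say what the previous code looked like, so the `reduce`-based version below is only an assumed "before", and the `scores` values are made up.

```python
import math
import operator
from functools import reduce

scores = [0.9, 0.8, 0.5]  # made-up per-marker scores

# One common pre-3.8 idiom for a product (an assumption about the old code).
via_reduce = reduce(operator.mul, scores, 1)

# Standard-library equivalent, available since Python 3.8.
via_prod = math.prod(scores)
```

Both expressions multiply the values left to right, starting from 1, and `math.prod([])` returns 1 just as the `reduce` call with an initial value of 1 does.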

Introduced `EXPECTED_STATES_CACHE` to persist parsed backbone table data across `get_qc1_score` calls. Previously, these files were re-read for each sample whenever the per-sample `QC1_SCORE_CACHE` missed. This change significantly reduces disk I/O. A benchmark showed a ~37x speedup (0.24ms -> 0.0065ms per call) in a synthetic loop. Added `tests/test_predict_haplogroup.py` to verify correctness.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
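A module-level cache of this kind can be sketched as follows. Only the name `EXPECTED_STATES_CACHE` comes from the commit message; the loader `load_expected_states` and the two-column, tab-separated file layout are assumptions made for the example.

```python
# Module-level cache: file path -> parsed table. Lives for the whole process.
EXPECTED_STATES_CACHE = {}


def load_expected_states(path):
    """Parse a tab-separated file once per path, then serve it from memory.

    The two-column layout (marker, expected state) is an assumption
    made for this sketch.
    """
    cached = EXPECTED_STATES_CACHE.get(path)
    if cached is None:
        cached = {}
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    cached[fields[0]] = fields[1]
        EXPECTED_STATES_CACHE[path] = cached
    return cached
```

The second and later calls for the same path return the same dictionary object without touching the disk. The trade-off is that callers must treat the returned data as read-only, and edits to the file during a run are not picked up.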

Replaced temporary file creation/reading/deletion in `run_vcf` with a direct `subprocess` pipe to `pandas.read_csv`. Also updated `yleaf/predict_haplogroup.py` to use modern type hints (Python 3.10+) to fix `NameError` in tests. Added `tests/test_vcf_pipe.py` to verify the new pipe implementation.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
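The pipe idea, in outline: the sketch below streams a command's stdout straight into a parser with no temporary file. To stay dependency-free it uses the standard-library `csv` reader; `pandas.read_csv` also accepts a file-like object, so `proc.stdout` could be passed to it in the same position. The helper name `read_table_from_command` is hypothetical, and the actual `bcftools` invocation used in `run_vcf` is not shown here.

```python
import csv
import subprocess


def read_table_from_command(cmd):
    """Run `cmd` and parse its tab-separated stdout without a temp file."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        # Rows are consumed directly from the pipe as the child writes them.
        rows = list(csv.reader(proc.stdout, delimiter="\t"))
    # Leaving the `with` block waits for the child, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return rows
```

Checking the return code after the pipe is drained matters: a child that fails partway through would otherwise yield a silently truncated table.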

cascadingstyletrees and others added 5 commits January 27, 2026 18:19

Fix broken loop pruning logic in predict_haplogroup

6574a7d

Optimize reference genome loading memory usage by using string instea…

dfb6f5d


cascadingstyletrees and others added 4 commits January 28, 2026 13:40

cascadingstyletrees wants to merge 9 commits into genid:master from cascadingstyletrees:master
