Performance & correctness improvements in haplogroup prediction and pileup parsing #40

Open
cascadingstyletrees wants to merge 9 commits into genid:master from cascadingstyletrees:master

@cascadingstyletrees
Contributor

  • Optimize pileup base counting by stripping indel sequences with regex and using Counter, ensuring insertions/deletions are skipped correctly while computing base frequencies.
  • Reduce memory overhead when loading the reference genome by caching a single concatenated string instead of a list of characters.
  • Avoid repeated tree/table loads in haplogroup prediction by instantiating the Tree once for multiprocessing workers and reusing the intermediate table data in the legacy predictor loop.
  • Ensure only the most specific haplogroup nodes are considered by tracking covered nodes during scoring and pruning less-specific paths in the prediction loop.
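
The last point can be illustrated with a minimal sketch. It does not reproduce the code in this PR: the function names (`ancestors`, `most_specific`), the `parent` mapping, and the toy tree are all hypothetical. It assumes each node has a single parent and that candidates are visited deepest first, so every ancestor of an accepted node is marked as covered and skipped later.

```python
def ancestors(node, parent):
    """Yield every ancestor of `node`, walking up the hypothetical parent map."""
    while node in parent:
        node = parent[node]
        yield node


def most_specific(candidates, parent):
    """Keep only candidates that are not an ancestor of another kept candidate."""
    def depth(node):
        return sum(1 for _ in ancestors(node, parent))

    covered = set()  # ancestors of nodes that were already accepted
    kept = []
    # Deepest nodes first, so a specific node always claims its path
    # before any of its less specific ancestors is looked at.
    for node in sorted(candidates, key=depth, reverse=True):
        if node in covered:
            continue  # a more specific descendant was already kept
        kept.append(node)
        covered.update(ancestors(node, parent))
    return kept


# Toy tree (illustrative only): P is the root of this fragment.
parent = {"R": "P", "Q": "P", "R1": "R", "R1a": "R1", "R1b": "R1"}
result = most_specific(["R", "R1b", "Q", "R1"], parent)
print(result)
```

Here `R` and `R1` are dropped because `R1b` already covers that path, while `Q` survives because it sits on a separate branch.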

cascadingstyletrees and others added 5 commits January 27, 2026 18:19
Moved the reading of the intermediate tree table outside the sample processing loop.
This avoids reading the same file from disk for every sample, improving performance.

Measured improvement: ~23% speedup on 50 samples (4.47s -> 3.41s).
Moved `Tree` object creation outside the sample loop to avoid repeated I/O and parsing.
Passed the `tree` object to the worker function via `partial`.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
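
A minimal sketch of the pattern described in this commit, under stated assumptions: the `Tree` class below is a stand-in that only counts how often it is built, and `predict_sample` is a hypothetical worker, not the function in the repository. The point is that the tree is constructed once and bound to the worker with `functools.partial`.

```python
from functools import partial


class Tree:
    """Stand-in for an expensive-to-build tree; counts constructions."""

    instances = 0

    def __init__(self, path):
        Tree.instances += 1
        self.path = path
        # Toy depth table; a real tree would be parsed from `path`.
        self.depth = {"R": 1, "R1": 2, "R1b": 3}


def predict_sample(sample, tree):
    """Hypothetical worker: return the deepest node observed for a sample."""
    name, nodes = sample
    return name, max(nodes, key=tree.depth.get)


samples = [("s1", ["R", "R1"]), ("s2", ["R", "R1", "R1b"])]

tree = Tree("tree.json")  # built once, outside the per-sample loop
worker = partial(predict_sample, tree=tree)

# Built-in map keeps the sketch simple; a multiprocessing Pool's map
# accepts the same partial object, provided the worker is a top-level
# function and the bound arguments can be pickled.
results = list(map(worker, samples))
print(results, Tree.instances)
```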
…d of list of characters. (#5)

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py`.
- Achieved ~2x performance improvement in benchmarks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
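
For readers unfamiliar with the pileup base column, here is a hedged sketch of the general approach, not the code from this commit. The names `strip_indels` and `count_bases` are hypothetical. It assumes the samtools mpileup conventions: `^` plus one quality character marks a read start, `$` a read end, `.` and `,` a match to the reference, and `+N`/`-N` is followed by exactly N inserted or deleted bases.

```python
import re
from collections import Counter

_READ_START = re.compile(r"\^.")   # '^' plus the mapping-quality character
_INDEL = re.compile(r"[+-](\d+)")  # indel marker with its length


def strip_indels(bases):
    """Remove every '+N<seq>' / '-N<seq>' run, where <seq> has N characters."""
    out = []
    pos = 0
    for match in _INDEL.finditer(bases):
        if match.start() < pos:
            continue  # defensive: marker lies inside an already skipped span
        out.append(bases[pos:match.start()])
        # Skip the marker itself plus the N sequence characters after it.
        pos = match.end() + int(match.group(1))
    out.append(bases[pos:])
    return "".join(out)


def count_bases(bases, ref_base):
    """Count A/C/G/T calls at one position, ignoring indel sequences."""
    cleaned = strip_indels(_READ_START.sub("", bases))
    # '.' and ',' mean "same as reference" on the forward/reverse strand.
    cleaned = cleaned.replace(".", ref_base).replace(",", ref_base).upper()
    counts = Counter(cleaned)
    return Counter({base: counts[base] for base in "ACGT" if counts[base]})


example = count_bases("..+2AG,T^I.$-1c*", "A")
print(example)
```

Read-start markers are removed before indels so that a quality character that happens to be `+`, `-`, or a digit is not mistaken for an indel marker.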
@cascadingstyletrees
Contributor Author

@dionzand Could you please review when you get the chance?

cascadingstyletrees and others added 4 commits January 28, 2026 13:40
- Replaced manual loop with `re` and `Counter` for parsing pileup strings.
- Fixed a bug where indel skipping logic was unreachable, ensuring inserted sequences are correctly skipped.
- Removed legacy helper `find_digit` and `NUM_SET`.
- Added unit tests in `tests/test_Yleaf.py` and updated `.gitignore`.
- Achieved ~2x performance improvement in benchmarks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
#8)

- Replaced `setup.py` and `requirements.txt` with `pyproject.toml` (setuptools backend).
- Updated Python requirement to >=3.10.
- Unpinned dependencies (`pandas`, `numpy`, etc.) to allow for modern versions.
- Added `bcftools` to `environment_yleaf.yaml`.
- Added `pytest` and `ruff` dev dependencies.
- Applied `ruff` fixes for code modernization (f-strings, type hints, etc.) and formatting.
- Updated `yleaf/Yleaf.py` to use `on_bad_lines='skip'` (replacing deprecated `error_bad_lines`).
- Updated `yleaf/predict_haplogroup.py` to use `math.prod`.
- Removed `six` dependency.
- Verified functionality with existing tests and CLI checks.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
Introduced `EXPECTED_STATES_CACHE` to persist parsed backbone table data across `get_qc1_score` calls.
Previously, these files were re-read for every sample whenever the per-sample `QC1_SCORE_CACHE` missed.
This change reduces disk I/O significantly.

Benchmark showed a ~37x speedup (0.24ms -> 0.0065ms per call) in a synthetic loop.

Added `tests/test_predict_haplogroup.py` to verify correctness.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
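
The caching idea can be sketched as follows. Only the name `EXPECTED_STATES_CACHE` comes from the commit message; the loader `load_expected_states`, the two-column tab-separated layout, and the toy file contents are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Module-level cache: parsed table contents keyed by file path.
EXPECTED_STATES_CACHE: dict[str, dict[str, str]] = {}


def load_expected_states(path):
    """Parse a table once; later calls for the same path reuse the result."""
    key = str(path)
    if key not in EXPECTED_STATES_CACHE:
        states = {}
        with open(path) as handle:
            for line in handle:
                if line.strip():
                    # Assumed layout: marker name, tab, expected state.
                    marker, state = line.rstrip("\n").split("\t")
                    states[marker] = state
        EXPECTED_STATES_CACHE[key] = states
    return EXPECTED_STATES_CACHE[key]


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "backbone.tsv"
    table.write_text("M207\tD\nM343\tD\n")  # toy content
    first = load_expected_states(table)

# The file no longer exists, so this call can only be served from memory.
second = load_expected_states(table)
print(first is second)
```

A cache like this never notices when a file changes on disk, which is acceptable for static reference tables but would need invalidation otherwise.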
Replaced temporary file creation/reading/deletion in `run_vcf` with a direct `subprocess` pipe to `pandas.read_csv`.
Also updated `yleaf/predict_haplogroup.py` to use modern type hints (Python 3.10+) to fix `NameError` in tests.
Added `tests/test_vcf_pipe.py` to verify the new pipe implementation.

Co-authored-by: cascadingstyletrees <25812029+cascadingstyletrees@users.noreply.github.com>
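
A dependency-free sketch of reading a command's output through a pipe instead of a temporary file. Two substitutions are made so the example runs anywhere: a small Python child process stands in for the external tool, and the standard-library `csv.reader` stands in for `pandas.read_csv`, which likewise accepts an open file-like object such as `proc.stdout`. The helper name `read_tsv_from_command` is hypothetical.

```python
import csv
import subprocess
import sys


def read_tsv_from_command(cmd):
    """Run `cmd` and parse its tab-separated stdout without a temp file."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        # Rows are consumed straight from the pipe as the child writes them.
        rows = list(csv.reader(proc.stdout, delimiter="\t"))
    # Leaving the `with` block waits for the child, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return rows


# Stand-in child process that prints two tab-separated records.
child_code = "print('chrY\\t100\\tA\\tG'); print('chrY\\t200\\tC\\tT')"
rows = read_tsv_from_command([sys.executable, "-c", child_code])
print(rows)
```

Checking the return code after the pipe is drained matters here, because a failing child can otherwise look like an empty but valid result.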