Add prep_datafile argument and modify PCA data handling in newref_con…#137
Open
Conversation
Author
To be tested, in progress @nvnieuwk
matthdsm reviewed Nov 26, 2025
- bioconda
dependencies:
  - setuptools
  - pandas
Do we also need to pin pandas? IIRC, there were some issues with incompatible (newer) versions.
There was a problem hiding this comment.
I copy-pasted what was in the meta.yml of the conda package. What version should it be pinned to?
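For reference, a conda environment file can carry a version bound directly on the dependency line. The bound below is purely a placeholder, since the thread leaves the actual pin undecided:

```yaml
channels:
  - bioconda
dependencies:
  - setuptools
  - pandas <2  # hypothetical upper bound; replace with the version the tool was actually tested against
```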
Author
@nvnieuwk I can't review my own PR 😜

You're not the reviewer 😉
Optimize Memory Usage in newref Command

Problem
The newref command was experiencing memory bloat of over 500 GB when training references with a 10 kb binsize. The issue was caused by multi-threaded execution in which each thread independently loaded the full pca_corrected_data matrix from disk, creating multiple copies of the dataset in memory (e.g., 10 threads = 10 copies). Additionally, unused data was being stored in intermediate files.

Solution
Implemented memory optimizations to reduce memory usage from O(N) to O(1) relative to thread count.

Changes Made
- Optimized data storage in tool_newref_prep (src/wisecondorx/newref_control.py): removed masked_data from the compressed .npz file to reduce I/O overhead, and moved the pca_corrected_data matrix to a separate uncompressed .npy file to enable memory mapping.
- Optimized multi-threaded loading in tool_newref_main (src/wisecondorx/newref_control.py): the pca_corrected_data matrix is loaded once using np.load(..., mmap_mode='r') for memory mapping.
- Updated worker function _tool_newref_part (src/wisecondorx/newref_control.py): receives the memory-mapped pca_corrected_data array as a parameter.
- Configuration updates (src/wisecondorx/main.py): added args.prepdatafile to handle the temporary data file path.
- Cleanup (src/wisecondorx/newref_control.py): the temporary .npy data file is properly removed after processing.

Benefits
masked_data
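The pattern these changes describe can be sketched as follows. This is an illustrative toy, not the WisecondorX implementation: prep, part_sum, and the file name are hypothetical, while np.save, np.load, and mmap_mode='r' are real NumPy API.

```python
import os
import tempfile
import numpy as np

def prep(matrix, workdir):
    # Prep step: write the PCA-corrected matrix uncompressed (.npy) so it
    # can be memory-mapped later; members of a compressed .npz cannot be.
    path = os.path.join(workdir, "pca_corrected_data.npy")
    np.save(path, matrix)
    return path

def part_sum(data, start, stop):
    # Stand-in for a per-thread worker: it receives the mapped array as a
    # parameter and reads only its own slice, so only those pages hit RAM.
    return float(data[start:stop].sum())

workdir = tempfile.mkdtemp()
matrix = np.arange(12.0).reshape(4, 3)
path = prep(matrix, workdir)

# Load once with mmap_mode='r': the data stays on disk and every worker
# shares the same mapping instead of holding a private in-memory copy.
data = np.load(path, mmap_mode="r")
totals = [part_sum(data, i, i + 1) for i in range(data.shape[0])]

os.remove(path)  # cleanup: remove the temporary .npy once processing is done
```

Because each worker only gets a view into one shared mapping, peak memory no longer scales with the number of threads, which is the O(N) → O(1) claim above.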