Add prep_datafile argument and modify PCA data handling in newref_con… #137

Open

tomiles wants to merge 2 commits into master from optimize_new_ref

Conversation

tomiles commented Nov 26, 2025

Optimize Memory Usage in newref Command

Problem

The newref command was experiencing memory bloat of over 500 GB when training references at a 10 kb bin size. The cause was multi-threaded execution in which each thread independently loaded the full pca_corrected_data matrix from disk, creating one copy of the dataset in memory per thread (e.g., 10 threads = 10 copies). Additionally, unused data was being stored in intermediate files.
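To make the failure mode concrete, here is a minimal sketch of the pre-fix pattern, assuming illustrative names (`_worker`, `prep.npz`, the column striding) rather than the actual WisecondorX code: every worker re-reads the matrix, so peak memory grows linearly with the worker count.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _worker(npz_path: str, part: int) -> float:
    # Pre-fix pattern: every worker independently loads and decompresses
    # the full matrix, so 10 workers keep 10 private copies alive at once.
    data = np.load(npz_path)["pca_corrected_data"]
    return float(data[:, part::10].mean())  # placeholder computation

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(_worker, ["prep.npz"] * 10, range(10)))
```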

Solution

Implemented memory optimizations to reduce memory usage from O(N) to O(1) relative to thread count:

Changes Made

  1. Optimized Data Storage in tool_newref_prep (src/wisecondorx/newref_control.py):

    • Removed the unused masked_data from the compressed .npz file to reduce I/O overhead
    • Saved the large pca_corrected_data matrix to a separate uncompressed .npy file, since arrays inside a compressed .npz cannot be memory-mapped
  2. Optimized Multi-threaded Loading in tool_newref_main (src/wisecondorx/newref_control.py):

    • Load pca_corrected_data once using np.load(..., mmap_mode='r') for memory mapping
    • Pass the single memory-mapped array object to the worker threads instead of having each thread load the file independently
    • This lets all threads share the same physical memory pages via the OS page cache (see the sketch after this list)
  3. Updated Worker Function _tool_newref_part (src/wisecondorx/newref_control.py):

    • Modified to accept the pre-loaded pca_corrected_data array as a parameter
    • Removed redundant data loading from within the worker function
  4. Configuration Updates (src/wisecondorx/main.py):

    • Added args.prepdatafile to handle the temporary data file path
  5. Cleanup (src/wisecondorx/newref_control.py):

    • Ensured the new .npy data file is properly removed after processing
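The following sketch ties changes 1–3 and 5 together. Function names, file names, and the chunking by `part` are illustrative assumptions, not the exact WisecondorX implementation:

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def prep_sketch(outprefix, masked_data, pca_corrected_data, **small_arrays):
    # Change 1: keep the small arrays compressed, deliberately drop the
    # unused masked_data, and write the large matrix as a plain .npy
    # file -- only plain .npy files support mmap_mode on load.
    np.savez_compressed(outprefix + ".npz", **small_arrays)
    np.save(outprefix + ".data.npy", pca_corrected_data)

def part_sketch(pca_corrected_data, part, n_parts):
    # Change 3: the worker receives the already-opened memmap as a
    # parameter; slicing it pages data in from disk on demand.
    return pca_corrected_data[:, part::n_parts].mean(axis=1)

def main_sketch(outprefix, n_workers=10):
    # Change 2: open the matrix once as a read-only memory map; all
    # workers then share the same physical pages via the OS page cache.
    data = np.load(outprefix + ".data.npy", mmap_mode="r")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(part_sketch, [data] * n_workers,
                                range(n_workers), [n_workers] * n_workers))
    os.remove(outprefix + ".data.npy")  # change 5: remove the temp file
    return results
```

The same effect carries over to process-based workers, since read-only file-backed pages are shared through the OS page cache rather than copied per process.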

Benefits

  • Massive Memory Reduction: Memory usage is now O(1) relative to thread count instead of O(N), enabling efficient multi-threading without memory bloat
  • Reduced I/O: Eliminated storage and transfer of unused masked_data
  • Better Performance: Memory mapping allows the OS to manage data access more efficiently
  • Maintained Functionality: All existing behavior is preserved while dramatically improving resource efficiency

tomiles requested a review from matthdsm on November 26, 2025 at 08:56
tomiles (author) commented Nov 26, 2025

To be tested, in progress @nvnieuwk

```yaml
- bioconda
dependencies:
  - setuptools
  - pandas
```


Do we also need to pin pandas? IIRC, there were some issues with incompatible (newer) versions
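For illustration only, a pin in conda recipe syntax would look like the following; the `<2.0` bound is purely hypothetical and is exactly the open question in this thread:

```yaml
dependencies:
  - setuptools
  - pandas <2.0  # hypothetical upper bound, not a confirmed version
```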


I copy-pasted what was in the meta.yml of the conda package. What version should it be pinned to?

tomiles (author) commented Nov 26, 2025

@nvnieuwk I can't review my own PR 😜

nvnieuwk commented

You're not the reviewer 😉
