Add prep_datafile argument and modify PCA data handling in newref_con… #137

Open

tomiles wants to merge 2 commits into master from optimize_new_ref

Conversation

tomiles commented Nov 26, 2025

Optimize Memory Usage in newref Command

Problem

The newref command was experiencing memory bloat of over 500 GB when training references at a 10 kb bin size. The cause was multi-threaded execution in which each thread independently loaded the full pca_corrected_data matrix from disk, creating one copy of the dataset in memory per thread (e.g., 10 threads = 10 copies). Additionally, unused data was being stored in intermediate files.
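To make the failure mode concrete, here is a minimal sketch of the pre-fix pattern, assuming illustrative names (`_worker`, `prep.npz`, the column striding) rather than the actual WisecondorX code: every worker re-reads the matrix, so peak memory grows linearly with the worker count.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _worker(npz_path: str, part: int) -> float:
    # Pre-fix pattern: every worker independently loads and decompresses
    # the full matrix, so 10 workers keep 10 private copies alive at once.
    data = np.load(npz_path)["pca_corrected_data"]
    return float(data[:, part::10].mean())  # placeholder computation

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(_worker, ["prep.npz"] * 10, range(10)))
```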

Solution

Implemented memory optimizations to reduce memory usage from O(N) to O(1) relative to thread count:

Changes Made

  1. Optimized Data Storage in tool_newref_prep (src/wisecondorx/newref_control.py):

    • Removed the unused masked_data from the compressed .npz file to reduce I/O overhead
    • Saved the large pca_corrected_data matrix to a separate uncompressed .npy file, since arrays inside a compressed .npz cannot be memory-mapped
  2. Optimized Multi-threaded Loading in tool_newref_main (src/wisecondorx/newref_control.py):

    • Load pca_corrected_data once using np.load(..., mmap_mode='r') for memory mapping
    • Pass the single memory-mapped array object to the worker threads instead of having each thread load the file independently
    • This lets all threads share the same physical memory pages via the OS page cache (see the sketch after this list)
  3. Updated Worker Function _tool_newref_part (src/wisecondorx/newref_control.py):

    • Modified to accept the pre-loaded pca_corrected_data array as a parameter
    • Removed redundant data loading from within the worker function
  4. Configuration Updates (src/wisecondorx/main.py):

    • Added args.prepdatafile to handle the temporary data file path
  5. Cleanup (src/wisecondorx/newref_control.py):

    • Ensured the new .npy data file is properly removed after processing
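The following sketch ties changes 1–3 and 5 together. Function names, file names, and the chunking by `part` are illustrative assumptions, not the exact WisecondorX implementation:

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def prep_sketch(outprefix, masked_data, pca_corrected_data, **small_arrays):
    # Change 1: keep the small arrays compressed, deliberately drop the
    # unused masked_data, and write the large matrix as a plain .npy
    # file -- only plain .npy files support mmap_mode on load.
    np.savez_compressed(outprefix + ".npz", **small_arrays)
    np.save(outprefix + ".data.npy", pca_corrected_data)

def part_sketch(pca_corrected_data, part, n_parts):
    # Change 3: the worker receives the already-opened memmap as a
    # parameter; slicing it pages data in from disk on demand.
    return pca_corrected_data[:, part::n_parts].mean(axis=1)

def main_sketch(outprefix, n_workers=10):
    # Change 2: open the matrix once as a read-only memory map; all
    # workers then share the same physical pages via the OS page cache.
    data = np.load(outprefix + ".data.npy", mmap_mode="r")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(part_sketch, [data] * n_workers,
                                range(n_workers), [n_workers] * n_workers))
    os.remove(outprefix + ".data.npy")  # change 5: remove the temp file
    return results
```

The same effect carries over to process-based workers, since read-only file-backed pages are shared through the OS page cache rather than copied per process.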

Benefits

  • Massive Memory Reduction: Memory usage is now O(1) relative to thread count instead of O(N), enabling efficient multi-threading without memory bloat
  • Reduced I/O: Eliminated storage and transfer of unused masked_data
  • Better Performance: Memory mapping allows the OS to manage data access more efficiently
  • Maintained Functionality: All existing behavior is preserved while dramatically improving resource efficiency

tomiles requested a review from matthdsm on November 26, 2025 at 08:56
tomiles (author) commented Nov 26, 2025

To be tested, in progress @nvnieuwk

```yaml
- bioconda
dependencies:
  - setuptools
  - pandas
```


Do we also need to pin pandas? IIRC, there were some issues with incompatible (newer) versions
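For illustration only, a pin in conda recipe syntax would look like the following; the `<2.0` bound is purely hypothetical and is exactly the open question in this thread:

```yaml
dependencies:
  - setuptools
  - pandas <2.0  # hypothetical upper bound, not a confirmed version
```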


I copy-pasted what was in the meta.yml of the conda package. What version should it be pinned to?

tomiles (author) commented Nov 26, 2025

@nvnieuwk I can't review my own PR 😜

nvnieuwk commented

You're not the reviewer 😉
