HNSW index build reproducibility - use single thread instead #139

@hsuknowledge

Hello. Based on the notes you left for RcppHNSW [1][2], I suggest using a single thread for the hnsw_build step, and only for that step [3].

Here are my test log messages. In my tests, the number of graph edges fluctuates if and only if the index-building function uses multiple threads. Both the edge count and the final UMAP result become reproducible when I fix n_threads = 1 in the call to hnsw_build.

umap2() messages
22:08:09 Using HNSW for nearest neighbor search
22:08:09 UMAP embedding parameters a = 0.9922 b = 1.112
22:08:09 Setting random seed 1
22:08:09 Read 107723 rows and found 489 numeric columns
22:08:09 Building HNSW index with metric 'l2' ef = 200 M = 16 using 1 threads
22:09:12 Finished building index
22:09:12 Searching HNSW index with ef = 100 and 20 threads
22:09:17 Finished searching
22:09:17 Commencing smooth kNN distance calibration using 20 threads with target n_neighbors = 100
22:09:21 Initializing from normalized Laplacian + noise (using RSpectra)
22:09:29 Range-scaling initial input columns to 0-10
22:09:34 Commencing optimization for 500 epochs, with 17984770 positive edges using 20 threads
22:09:34 Using rng type: pcg
Using method 'umap'
Optimizing with Adam alpha = 1 beta1 = 0.5 beta2 = 0.9 eps = 1e-07
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
22:09:39 Optimization finished
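
For anyone who wants to check this outside of uwot, here is a minimal sketch of the comparison using RcppHNSW directly. The data matrix `X` is synthetic and the helper `build_and_search` is a hypothetical name I made up for illustration; the M and ef values simply mirror the log above, and k = 15 is an arbitrary choice. It builds the index twice with n_threads = 1 and checks that the neighbor indices agree.

```r
library(RcppHNSW)

# Synthetic data, for illustration only (not the data set from the log above)
set.seed(42)
X <- matrix(rnorm(2000 * 20), nrow = 2000)

# Hypothetical helper: build an index with a given thread count for the
# build step, then query the k nearest neighbors of every row of X.
build_and_search <- function(X, n_threads) {
  ann <- hnsw_build(X, distance = "l2", M = 16, ef = 200,
                    n_threads = n_threads)
  hnsw_search(X, ann, k = 15, ef = 100)
}

# Two independent single-threaded builds
nn_a <- build_and_search(X, n_threads = 1)
nn_b <- build_and_search(X, n_threads = 1)

# TRUE when both builds return the same neighbor indices
same_single <- identical(nn_a$idx, nn_b$idx)
print(same_single)
```

Repeating the same comparison with a larger n_threads in the build step is how I would expect the run-to-run differences to show up, although on a small matrix like this one they may not appear on every run.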

[1] jlmelville/rcpphnsw@6c54753

[2] jlmelville/rcpphnsw#23

[3]

n_threads = n_threads,
