Reduce the Capacity template for IVF-Flat search#1681
lowener wants to merge 4 commits into rapidsai:main
Conversation
  rmm::cuda_stream_view stream)
{
-  const int capacity = raft::bound_by_power_of_two(k);
+  const int capacity = bound_by_power_of_four(k);
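The `bound_by_power_of_four` helper is introduced by this PR, so as a point of reference here is a minimal sketch of what such a pair of bounds might look like, assuming the power-of-four version rounds `k` up to the next power of four the same way `raft::bound_by_power_of_two` rounds up to the next power of two (the `_sketch` names are hypothetical, not the real raft/cuvs API):

```cpp
// Sketch only, not the actual raft/cuvs helpers.
// Round k up to the next power of two (1, 2, 4, 8, ...).
constexpr int bound_by_power_of_two_sketch(int k)
{
  int r = 1;
  while (r < k) { r <<= 1; }
  return r;
}

// Round k up to the next power of four (1, 4, 16, 64, 256):
// take the power-of-two bound and promote odd powers of two
// (2, 8, 32, ...) one step further.
constexpr int bound_by_power_of_four_sketch(int k)
{
  const int r = bound_by_power_of_two_sketch(k);
  // 0x55555555 has bits set at even positions only, i.e. exactly
  // at the powers of four; otherwise double to reach the next one.
  return (r & 0x55555555) ? r : r * 2;
}
```

Restricting the bound this way halves the set of `Capacity` values that ever need to be instantiated, which is where the binary-size saving comes from.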
Have you compared the binary size reduction of this approach vs converting capacity to a run-time parameter?
Maybe for the first step, we may try to just estimate the potential size reduction without worrying too much about performance or even correctness.
In the kernel function (https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L810)
Capacity is mainly used in two places.
https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L828
=> Just replacing constexpr with const will be sufficient for an initial best-case estimate.
https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L852
=> Looks more involved (need to dig into the internals of block_sort_t), but for the initial estimate, we may just set Capacity here to an arbitrary value (e.g. 4) to quickly get an idea of the upper limit on the binary size reduction.
If the size you get with this approach is significantly smaller, then it might be worth further investigation. If the size reduction is comparable or even less, yeah, better not bother.
By fixing the Capacity template to 0, the size of libcuvs.so drops to 133 Mb:
Power-of-two: libcuvs.so 157 Mb (baseline)
Power-of-four: libcuvs.so 146 Mb (7% reduction)
Capacity=0: libcuvs.so 133 Mb (18% reduction)
I just noticed that the same recursive pattern for Capacity appears in IVF-PQ (from 1 to 128, though the number of template-parameter combinations matters less there): https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh#L533-L543
So if we decide to switch block_sort_t to dynamic parameters, both index types would benefit.
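The recursive pattern in question can be modeled as a small host-side toy (assumed shape only, not the actual cuVS code): each recursion level the compiler reaches instantiates one more `Capacity` specialization, which is what inflates the binary.

```cpp
// Toy model of the recursive Capacity dispatch shared by IVF-Flat and
// IVF-PQ (assumed shape, not the actual cuVS code). Each distinct
// Capacity value reached here becomes a separate kernel instantiation
// in the real library.
template <int Capacity>
int select_capacity(int k)
{
  if constexpr (Capacity > 1) {
    // A tighter power-of-two capacity still fits k: keep recursing.
    if (k * 2 <= Capacity) { return select_capacity<Capacity / 2>(k); }
  }
  return Capacity;  // the real code would launch the kernel specialized on Capacity
}
```

For example, `select_capacity<256>(10)` resolves to 16, instantiating the function for 256, 128, 64, 32, and 16 along the way; making the parameter dynamic collapses all of those into one instantiation.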
@lowener is this per architecture? Any idea what the savings are for the binary when all architectures are compiled?
Signed-off-by: Mickael Ide <mide@nvidia.com>
Force-pushed 02ff8bf to 95075bf
Yes, this seems to be per architecture. Reduction of 79 Mb (or 7%) with the following architectures compiled:
    }
  }
  if constexpr (Capacity > 1) {
    if (k_max * 2 <= Capacity) {
Have you verified that this check (k_max * 2 <= Capacity) will still remain the same?
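One way to reason about it: the guard exists to guarantee that the smaller capacity the recursion steps down to still covers k_max. If the recursion divides Capacity by 4 instead of 2, that suggests the guard becomes k_max * 4 <= Capacity. A sketch of that assumption (to be verified against the PR, not its actual code):

```cpp
// Hypothetical power-of-four variant of the dispatch (an assumption,
// not the PR's final code): Capacity walks 256 -> 64 -> 16 -> 4 -> 1,
// and the guard ensures Capacity / 4 is still >= k_max before stepping.
template <int Capacity>
int select_capacity_pow4(int k_max)
{
  if constexpr (Capacity > 1) {
    if (k_max * 4 <= Capacity) { return select_capacity_pow4<Capacity / 4>(k_max); }
  }
  return Capacity;
}
```

Under this sketch, k_max = 17 resolves to Capacity = 64 rather than 32, which is consistent with the power-of-four bound but means the old `k_max * 2` check would no longer be the right guard.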
Currently the Capacity template goes from 1 to 256 by powers of 2. By changing it to powers of 4 from 1 to 256, we can reduce the size of libcuvs from 157 Mb to 146 Mb (11 Mb, or a 7% reduction).
After some tests on mnist-784-euclidean, across multiple topk values and an nprobe of 1 or 5, the impact on throughput is around 4%. The measurements are noisy, as the power-of-4 version is sometimes faster than the base version. The benchmarks are reproducible by running the script in the first commit of the PR.