
Reduce the Capacity template for IVF-Flat search #1681

Open
lowener wants to merge 4 commits into rapidsai:main from lowener:26.02-flat-kernel

@lowener (Contributor) commented Jan 7, 2026

Currently, the `Capacity` template parameter goes from 1 to 256 in powers of 2 (1, 2, 4, ..., 256: nine values).
Changing it to powers of 4 over the same range (1, 4, 16, 64, 256: five values) reduces the size of libcuvs from 157 MB to 146 MB (an 11 MB, or 7%, reduction).
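A minimal sketch of the rounding, assuming `bound_by_power_of_four` rounds k up to the next power of four the same way `raft::bound_by_power_of_two` rounds up to the next power of two (the PR's actual helper body is not shown in this thread). The function names below are hypothetical stand-ins, not RAFT or cuVS APIs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: smallest power of `base` that is >= k (requires k >= 1).
// Not RAFT's or the PR's actual implementation.
int bound_by_power_of(int k, int base) {
  assert(k >= 1 && base >= 2);
  int capacity = 1;
  while (capacity < k) capacity *= base;
  return capacity;
}

int bound_by_power_of_two_sketch(int k) { return bound_by_power_of(k, 2); }
int bound_by_power_of_four_sketch(int k) { return bound_by_power_of(k, 4); }

// Distinct Capacity values that must be instantiated to cover every k in
// [1, max_capacity]; each value is a separate kernel in the binary.
std::vector<int> capacity_values(int base, int max_capacity) {
  std::vector<int> values;
  for (int c = 1; c <= max_capacity; c *= base) values.push_back(c);
  return values;
}
```

The trade-off is visible in the rounding: with powers of four, k = 5 maps to a capacity of 16 instead of 8, so some queries plausibly run with a larger buffer than strictly needed, in exchange for five instantiations instead of nine.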

I tested on mnist-784-euclidean across several top-k values with n_probes set to 1 or 5; on average, the throughput cost is around 4%. The measurements are noisy: the power-of-4 version is sometimes faster than the base version. The benchmarks can be reproduced with the script included in the first commit of this PR.

| Top-k | n_probes | QPS (base, power of 2) | QPS (power of 4) | Power of 4 / base |
|------:|---------:|-----------------------:|-----------------:|------------------:|
| 1 | 1 | 341,646 | 300,844 | 88% |
| 1 | 5 | 269,131 | 257,179 | 96% |
| 2 | 1 | 328,880 | 293,591 | 89% |
| 2 | 5 | 224,674 | 264,695 | 118% |
| 4 | 1 | 308,350 | 296,900 | 96% |
| 4 | 5 | 227,393 | 220,282 | 97% |
| 5 | 1 | 340,225 | 296,276 | 87% |
| 5 | 5 | 296,486 | 278,676 | 94% |
| 10 | 1 | 301,967 | 308,025 | 102% |
| 10 | 5 | 234,487 | 286,652 | 122% |
| 20 | 1 | 335,355 | 311,835 | 93% |
| 20 | 5 | 231,498 | 256,806 | 111% |
| 50 | 1 | 336,700 | 310,101 | 92% |
| 50 | 5 | 293,545 | 241,445 | 82% |
| 100 | 1 | 337,883 | 277,521 | 82% |
| 100 | 5 | 227,633 | 223,234 | 98% |
| Average | | | | 96% |
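The last column is QPS(power of 4) divided by QPS(base). A minimal sketch that recomputes it from the rows of the table and aggregates it two ways; the thread does not say which aggregate produced the 96% figure. A geometric mean, the usual choice for averaging ratios, comes out at roughly 96% for these rows, while a plain arithmetic mean of the ratios is closer to 97%. The struct and function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Row {
  int topk;
  int n_probes;
  double qps_base;  // power-of-two capacities
  double qps_pow4;  // power-of-four capacities
};

// Rows copied from the benchmark table.
const std::vector<Row> kRows = {
    {1, 1, 341646, 300844},   {1, 5, 269131, 257179},   {2, 1, 328880, 293591},
    {2, 5, 224674, 264695},   {4, 1, 308350, 296900},   {4, 5, 227393, 220282},
    {5, 1, 340225, 296276},   {5, 5, 296486, 278676},   {10, 1, 301967, 308025},
    {10, 5, 234487, 286652},  {20, 1, 335355, 311835},  {20, 5, 231498, 256806},
    {50, 1, 336700, 310101},  {50, 5, 293545, 241445},  {100, 1, 337883, 277521},
    {100, 5, 227633, 223234},
};

// Mean of the per-row ratios qps_pow4 / qps_base.
double arithmetic_mean_ratio(const std::vector<Row>& rows) {
  assert(!rows.empty());
  double sum = 0.0;
  for (const auto& r : rows) sum += r.qps_pow4 / r.qps_base;
  return sum / rows.size();
}

// Geometric mean of the same ratios: exp(mean(log(ratio))).
double geometric_mean_ratio(const std::vector<Row>& rows) {
  assert(!rows.empty());
  double log_sum = 0.0;
  for (const auto& r : rows) log_sum += std::log(r.qps_pow4 / r.qps_base);
  return std::exp(log_sum / rows.size());
}
```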

```diff
   rmm::cuda_stream_view stream)
 {
-  const int capacity = raft::bound_by_power_of_two(k);
+  const int capacity = bound_by_power_of_four(k);
```
Contributor (review comment):

Have you compared the binary size reduction of this approach with converting `Capacity` into a run-time parameter?

As a first step, we could just estimate the potential size reduction without worrying too much about performance or even correctness.

In the kernel function (https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L810), `Capacity` is mainly used in two places:

https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L828
=> Simply replacing `constexpr` with `const` is enough for an initial best-case estimate.

https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L852
=> This one looks more involved (it requires digging into the internals of `block_sort_t`), but for the initial estimate we could just set `Capacity` here to an arbitrary value (e.g., 4) to quickly get an idea of the upper bound on the binary size reduction.

If the size you get with this approach is significantly smaller, it might be worth investigating further. If the reduction is comparable or even smaller, it's better not to bother.
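To make the reviewer's comparison concrete, here is a hypothetical CPU-side sketch (not cuVS code; `select_k_smallest`, `topk_compile_time`, and `topk_run_time` are illustrative names). With a compile-time capacity, every distinct `Capacity` value is a separate template instantiation, which is what multiplies the kernel count in the binary. With a run-time capacity, one instantiation serves every value, but the buffer size is no longer known to the compiler. On a GPU, a compile-time size is typically what lets such a buffer live in registers, so a run-time size usually requires a bound-sized or shared-memory buffer instead.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Keeps the k smallest distances in buf[0..n) and returns them sorted.
// Shared by both variants below; requires k >= 1.
template <typename Buffer>
std::vector<float> select_k_smallest(Buffer& buf, const std::vector<float>& dists, int k) {
  int n = 0;
  for (float d : dists) {
    if (n < k) {
      buf[n++] = d;  // buffer not full yet
    } else {
      // Replace the current worst candidate if d is better.
      auto worst = std::max_element(buf.begin(), buf.begin() + n);
      if (d < *worst) *worst = d;
    }
  }
  std::sort(buf.begin(), buf.begin() + n);
  return std::vector<float>(buf.begin(), buf.begin() + n);
}

// Compile-time capacity: each distinct Capacity is a separate instantiation.
template <int Capacity>
std::vector<float> topk_compile_time(const std::vector<float>& dists, int k) {
  assert(k >= 1 && k <= Capacity);
  std::array<float, Capacity> buf{};  // size fixed at compile time
  return select_k_smallest(buf, dists, k);
}

// Run-time capacity: a single instantiation serves every capacity.
std::vector<float> topk_run_time(const std::vector<float>& dists, int k, int capacity) {
  assert(k >= 1 && k <= capacity);
  std::vector<float> buf(capacity);  // size known only at run time
  return select_k_smallest(buf, dists, k);
}
```

Both variants return the same result; what differs is how many copies of the code the compiler emits and how the buffer can be stored.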

Contributor (author):

Fixing the `Capacity` template parameter to 0 brings libcuvs.so down to 133 MB:
  • Power-of-two: libcuvs.so 157 MB
  • Power-of-four: libcuvs.so 146 MB (7% reduction)
  • Capacity = 0: libcuvs.so 133 MB (15% reduction)
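Expressed relative to the power-of-two baseline size $S_{\text{pow2}} = 157$ MB, the reductions work out as below. Dividing the 24 MB saving by the smaller, new size instead would give about 18% for the 133 MB build.

```math
\text{reduction} = \frac{S_{\text{pow2}} - S_{\text{new}}}{S_{\text{pow2}}}, \qquad
\frac{157 - 146}{157} \approx 7.0\%, \qquad
\frac{157 - 133}{157} \approx 15.3\%
```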

Contributor (author):

I just noticed that the same recursive pattern for `Capacity` appears in IVF-PQ (from 1 to 128, although there are fewer combinations of template parameters there): https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh#L533-L543
So if we decide to switch `block_sort_t` to dynamic parameters, both index types would benefit.

@cjnolet (Member) commented Jan 14, 2026

> we can reduce the size of libcuvs from 157 Mb to 146 Mb (11 Mb or 7% reduction).

@lowener is this per architecture? Any idea what the savings is for the binary when all architectures are compiled?

@cjnolet added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jan 26, 2026
@lowener requested review from a team as code owners January 26, 2026 14:40
@lowener requested a review from msarahan January 26, 2026 14:40
Signed-off-by: Mickael Ide <mide@nvidia.com>
@lowener removed request for a team January 26, 2026 14:44
@lowener removed request for a team and msarahan January 26, 2026 14:44
@lowener (Contributor, author) commented Jan 26, 2026

> > we can reduce the size of libcuvs from 157 Mb to 146 Mb (11 Mb or 7% reduction).
>
> @lowener is this per architecture? Any idea what the savings is for the binary when all architectures are compiled?

Yes, this seems to be per architecture.
Compiling with --allgpuarch (CUDA 12.2) gives me the following numbers:
  • Power-of-two: libcuvs.so 1096 MB
  • Power-of-four: libcuvs.so 1017 MB

That is a reduction of 79 MB (about 7%) with the following architectures compiled:

  • 70-real
  • 75-real
  • 80-real
  • 86-real
  • 90a-real
  • 100f-real
  • 120a-real
  • 120

```cpp
  }
}
if constexpr (Capacity > 1) {
  if (k_max * 2 <= Capacity) {
```
Contributor (review comment):

Have you verified that this check (`k_max * 2 <= Capacity`) should stay the same with the power-of-four capacities?
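The reviewer's concern can be made concrete with a small host-side model. It assumes (the full diff is not shown in this thread) that after the PR the recursion divides `Capacity` by 4 at each step instead of by 2, and that a kernel instantiated with a given `Capacity` can only hold `k_max <= Capacity` candidates. Under those assumptions, the step-down test has to use the same factor as the step. `run_with_capacity`, `dispatch`, and `dispatch_mismatched` are hypothetical names, not cuVS functions.

```cpp
#include <cassert>

// Hypothetical stand-in for the scan kernel instantiated with a compile-time
// Capacity. It returns that Capacity so the dispatch decision can be checked;
// a real kernel would need k_max <= Capacity to hold all candidates.
template <int Capacity>
int run_with_capacity(int k_max) {
  assert(k_max >= 1);
  return Capacity;
}

// Recursive dispatch modelled on the `if constexpr (Capacity > 1)` pattern:
// step down by a factor of Step while the smaller capacity still fits k_max,
// i.e. while k_max <= Capacity / Step, written as k_max * Step <= Capacity.
template <int Capacity, int Step>
int dispatch(int k_max) {
  if constexpr (Capacity > 1) {
    if (k_max * Step <= Capacity) { return dispatch<Capacity / Step, Step>(k_max); }
  }
  return run_with_capacity<Capacity>(k_max);
}

// Same recursion, but keeping the power-of-two test (k_max * 2) while
// stepping down by 4: it can descend to a capacity smaller than k_max.
template <int Capacity>
int dispatch_mismatched(int k_max) {
  if constexpr (Capacity > 1) {
    if (k_max * 2 <= Capacity) { return dispatch_mismatched<Capacity / 4>(k_max); }
  }
  return run_with_capacity<Capacity>(k_max);
}
```

For example, with k_max = 8 the matched version settles on a capacity of 16, while the mismatched one descends to 4, which is too small to hold 8 candidates.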


Labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)


4 participants