Skip to content

Add optional centroid reporting to Butina clustering#82

Merged
scal444 merged 2 commits intoNVIDIA-Digital-Bio:mainfrom
scal444:butina_FR
Jan 30, 2026
Merged

Add optional centroid reporting to Butina clustering#82
scal444 merged 2 commits intoNVIDIA-Digital-Bio:mainfrom
scal444:butina_FR

Conversation

@scal444
Copy link
Collaborator

@scal444 scal444 commented Jan 30, 2026

Addresses #74

@scal444 scal444 requested a review from evasnow1992 January 30, 2026 14:09
@scal444 scal444 self-assigned this Jan 30, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Added optional centroid reporting to Butina clustering, allowing users to retrieve the centroid index for each cluster. The implementation threads centroids through all CUDA kernels and properly handles cluster renumbering.

Key changes:

  • Added return_centroids parameter to Python API that returns a tuple (clusters, centroids) when enabled
  • Centroids are tracked at cluster assignment time in three kernels: attemptAssignClustersFromNeighborlist, butinaWriteClusterValue, and assignSingletonIdsKernel
  • Implemented remapCentroidsKernel to handle centroid index remapping when clusters are renumbered by size
  • Memory management properly addressed with new toOwnedPyArray helper function
  • Both Python and C++ tests added to validate centroid correctness

Tests verify:

  • Centroids array has correct size (equal to number of clusters)
  • Each centroid is actually in its corresponding cluster
  • All cluster members are adjacent to their centroid (within cutoff distance)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is thorough and well-tested. The feature is backward-compatible (optional parameter with default false), centroids are properly threaded through all kernels, memory management concerns from previous review were addressed with the toOwnedPyArray helper, and comprehensive tests validate correctness at both Python and C++ levels
  • No files require special attention

Important Files Changed

Filename Overview
nvmolkit/clustering.cpp Added toOwnedPyArray helper and conditional centroid return logic with proper memory management
src/butina.h Added optional centroids parameter and return value for cluster count to both GPU entry points
src/butina.cu Threaded centroids through all kernels, added remapCentroidsKernel, and returns cluster count

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@evasnow1992 evasnow1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the user request. Code changes are clean and valid.

@scal444 scal444 merged commit aeec3dc into NVIDIA-Digital-Bio:main Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants