
Move AsyncMinHashLSH from experimental to datasketch.aio #301

Open

ekzhu wants to merge 7 commits into master from move-async-api-to-main

Conversation

ekzhu (Owner) commented Jan 20, 2026

  • Create new datasketch/aio/ submodule with AsyncMinHashLSH, AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession
  • Add optional AsyncMinHashLSH export to main datasketch package
  • Deprecate experimental module with warnings pointing to new location
  • Update tests to use new import path
  • Rename experimental_aio dependency group to aio (keep alias)
  • Remove experimental from coverage omit

New import paths:

  • from datasketch.aio import AsyncMinHashLSH
  • from datasketch import AsyncMinHashLSH

Old imports still work but emit DeprecationWarning.
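For illustration only, the migration looks like this from user code; the commented-out alternatives and the deprecated path are taken from the description above, not from the diff itself:

```python
# New, stable import path:
from datasketch.aio import AsyncMinHashLSH

# Equivalent top-level import, available when the optional async dependencies
# (the renamed `aio` dependency group) are installed:
# from datasketch import AsyncMinHashLSH

# Old location: still re-exported for backward compatibility, but importing
# from it emits a DeprecationWarning pointing at datasketch.aio.
# from datasketch.experimental.aio import AsyncMinHashLSH
```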

Checklist

  • Are unit tests passing?
  • Documentation added/updated for all public APIs?
  • Is this a breaking change? If yes, add "[BREAKING]" to the PR title.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
gemini-code-assist bot commented

Summary of Changes

Hello @ekzhu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the asynchronous MinHash LSH functionality by promoting it from an experimental module to a dedicated, stable datasketch.aio submodule. This change aims to formalize the async LSH implementation within the library, making it more accessible and indicating its readiness for broader use. It includes updates to import paths, dependency management, and ensures a smooth transition for existing users through deprecation warnings and backward-compatible re-exports.

Highlights

  • Module Relocation: The AsyncMinHashLSH implementation and its related classes (AsyncMinHashLSHInsertionSession, AsyncMinHashLSHDeleteSession) have been moved from the datasketch.experimental.aio module to a new, stable datasketch.aio submodule.
  • Top-Level Export: The AsyncMinHashLSH class is now optionally exported directly from the main datasketch package, allowing for a simpler import path (from datasketch import AsyncMinHashLSH) if the necessary asynchronous dependencies are installed.
  • Deprecation Strategy: The datasketch.experimental and datasketch.experimental.aio modules are now officially deprecated. Importing from these paths will emit DeprecationWarning messages, guiding users to the new datasketch.aio location, while still re-exporting the classes for backward compatibility (a sketch of this shim pattern appears after this list).
  • Dependency Group Renaming: The experimental_aio dependency group in pyproject.toml has been renamed to aio, with experimental_aio maintained as a deprecated alias for compatibility.
  • Coverage Configuration Update: The experimental directory has been removed from the coverage omit list in pyproject.toml, reflecting its transition out of an experimental status.
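As a rough illustration of the deprecation strategy and the `noqa: E402` note in the commits further down, a shim module along these lines (e.g. a hypothetical `datasketch/experimental/aio/__init__.py`) would re-export the moved classes after emitting a warning. This is a minimal sketch assuming the module and class names listed in the PR description, not the exact code merged here:

```python
"""Deprecated location; use datasketch.aio instead."""
# Minimal sketch of a deprecation shim; the real file in this PR may differ.
import warnings

warnings.warn(
    "datasketch.experimental.aio is deprecated; import from datasketch.aio instead.",
    DeprecationWarning,
    stacklevel=2,
)

# Re-export the relocated classes so old imports keep working.
from datasketch.aio import (  # noqa: E402
    AsyncMinHashLSH,
    AsyncMinHashLSHDeleteSession,
    AsyncMinHashLSHInsertionSession,
)

__all__ = [
    "AsyncMinHashLSH",
    "AsyncMinHashLSHDeleteSession",
    "AsyncMinHashLSHInsertionSession",
]
```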


gemini-code-assist bot left a comment

Code Review

This pull request successfully moves the asynchronous LSH implementation from the experimental module to a new datasketch.aio module, which is a great step towards stabilizing this feature. The changes include creating the new submodule, deprecating the old one with warnings, and updating dependencies and test configurations.

My review focuses on ensuring the correctness and robustness of the newly located async code. I've found a few critical issues in the async storage backends, particularly with Redis, that could lead to data loss or incorrect behavior. I've also identified a high-severity bug related to the MongoDB storage implementation and several medium-severity issues concerning documentation examples, code efficiency, and minor bugs. Addressing these points will significantly improve the quality and reliability of the async functionality.

Comment on lines 318 to 321

```python
    async def execute_command(self, *args, **kwargs):
        if len(self.command_stack) >= self._buffer_size:
            self.execute()
        await super(AsyncRedisBuffer, self).execute_command(*args, **kwargs)
```


critical

`self.execute()` is a coroutine and needs to be awaited. Without `await`, this will raise `RuntimeWarning: coroutine 'Pipeline.execute' was never awaited` and the buffer will not be flushed, which can lead to data loss.

Suggested change:

```diff
     async def execute_command(self, *args, **kwargs):
         if len(self.command_stack) >= self._buffer_size:
-            self.execute()
+            await self.execute()
         await super(AsyncRedisBuffer, self).execute_command(*args, **kwargs)
```

Comment on lines 361 to 366

```python
    async def getmany(self, *keys):
        pipe = self._redis.pipeline()
        pipe.multi()
        for key in keys:
            await self._get_items(pipe, self.redis_key(key))
        return await pipe.execute()
```


critical

This implementation of `getmany` is incorrect and will fail at runtime. `pipe.multi()` is a no-op in recent versions of redis-py. More importantly, `await self._get_items(pipe, ...)` is wrong because methods on a pipeline object are not coroutines; they just queue commands. The implementation can be simplified to directly queue the `lrange` commands and execute them once.

Suggested change:

```diff
     async def getmany(self, *keys):
         pipe = self._redis.pipeline()
-        pipe.multi()
         for key in keys:
-            await self._get_items(pipe, self.redis_key(key))
+            pipe.lrange(self.redis_key(key), 0, -1)
         return await pipe.execute()
```

"""See :class:`datasketch.MinHashLSH`."""
key_set = list(set(keys))
hashtables = [unordered_storage({"type": "dict"}) for _ in range(self.b)]
Hss = await self.keys.getmany(*key_set)


high

The method `get_subset_counts` calls `self.keys.getmany(...)`. When using aiomongo storage, `self.keys` is an `AsyncMongoListStorage` instance. This class does not implement an `async def getmany` method, causing this call to fail. It inherits a synchronous `getmany` from `datasketch.storage.Storage`, which returns a list of coroutines without awaiting them.
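A fix along the lines of the later commit ("Add getmany() method to AsyncMongoListStorage") would define an awaited `getmany` on the async storage class. The sketch below assumes the class exposes an async per-key `get(key)` accessor; that interface is an assumption, and this is not the exact code that was merged:

```python
import asyncio


class AsyncMongoListStorage:  # illustrative stub; the real class lives in datasketch
    async def get(self, key):
        """Assumed async per-key accessor returning the stored list for `key`."""
        raise NotImplementedError

    async def getmany(self, *keys):
        # Await every per-key lookup so callers get results,
        # not the un-awaited coroutines returned by the inherited sync getmany.
        return await asyncio.gather(*(self.get(key) for key in keys))
```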

```python
        ) as lsh:
            m = MinHash(num_perm=128)
            m.update(b"data")
            await lsh.insert("key", m)
```


medium

The example code will raise a `TypeError` because it inserts a string key `"key"` when using aiomongo storage. By default (`prepickle=False`), aiomongo storage requires keys to be bytes.

Suggested change:

```diff
-            await lsh.insert("key", m)
+            await lsh.insert(b"key", m)
```

"""See :class:`datasketch.MinHashLSH`."""
await self._insert(key, minhash, check_duplication=check_duplication, buffer=False)

def insertion_session(self, batch_size=10000):


medium

The example in the `insertion_session` docstring is not runnable. It is missing necessary imports (`islice`, `random`, `string`), uses a misleading variable name (`_storage_config_redis` for a mongo config), and the keys are not bytes, which will cause a `TypeError` with `prepickle=False`. The example in `delete_session` has similar issues and uses an undefined variable `keys_to_remove`.
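For reference, a self-contained example of the kind this comment asks for might look roughly like the sketch below. The storage host/port/db values are placeholders, and it assumes the aiomongo backend with its default `prepickle=False` (hence bytes keys); it is not the text that was ultimately added to the docstring:

```python
import asyncio

from datasketch import MinHash
from datasketch.aio import AsyncMinHashLSH

# Placeholder aiomongo config; point it at a real MongoDB instance to run.
_storage = {"type": "aiomongo", "mongo": {"host": "localhost", "port": 27017, "db": "lsh_test"}}


async def main():
    async with AsyncMinHashLSH(storage_config=_storage, threshold=0.5, num_perm=128) as lsh:
        # Bytes keys, because aiomongo storage with prepickle=False requires them.
        data = []
        for i in range(100):
            m = MinHash(num_perm=128)
            m.update(f"document {i}".encode("utf8"))
            data.append((f"key-{i}".encode("utf8"), m))

        # Buffered bulk insertion; the session flushes every `batch_size` inserts.
        async with lsh.insertion_session(batch_size=50) as session:
            for key, minhash in data:
                await session.insert(key, minhash)


asyncio.run(main())
```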

ekzhu and others added 5 commits January 19, 2026 23:25
- Sort __all__ lists alphabetically (RUF022)
- Add noqa: E402 for intentional imports after deprecation warnings
- Add blank line after docstring Example section (D413)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bug fixes:
- Add missing await on self.execute() in AsyncRedisBuffer
- Fix getmany() to correctly use Redis pipeline (don't await individual commands)
- Add getmany() method to AsyncMongoListStorage for get_subset_counts support
- Fix typo: "AsyncMinHash" -> "AsyncMinHashLSH" in error message
- Remove dead code: await self.keys (storage objects aren't awaitable)

Performance:
- Change tuple buffers to lists in AsyncMongoBuffer for O(1) append

Documentation:
- Fix module docstring example to use prepickle=True for string keys
- Simplify insertion_session and delete_session docstring examples

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This test exercises the getmany() method that was fixed in a previous
commit, ensuring proper coverage for the async implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow triggers from workflow_run (which runs after main Test
completes and doesn't show as PR checks) to push/pull_request triggers.
This makes integration test results visible directly in PR status checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow trigger from workflow_run to push/pull_request triggers
for consistency with MongoDB and Redis test workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ekzhu (Owner, Author) commented Jan 20, 2026

@copilot please see if all reviews have been addressed; if not, please address them.

Copilot AI (Contributor) commented Jan 20, 2026

@ekzhu I've opened a new pull request, #302, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan

* Update documentation and fix RedisStorage import issue

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

* Add clarifying comments for optional Redis dependency handling

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>
ekzhu (Owner, Author) commented Jan 20, 2026

@gemini-code-assist are all reviews addressed?

gemini-code-assist bot commented

@ekzhu It looks like @Copilot has created a new pull request, #302, to address the changes and will request your review once it's ready. I recommend reviewing that pull request once it's available to confirm that all reviews have been addressed to your satisfaction. I am unable to directly verify the status of reviews in a separate pull request.

ekzhu (Owner, Author) commented Jan 20, 2026

@gemini-code-assist

That PR has been merged; please take a look at the latest version of the current PR.
