Replace hyperloglogplus with Apache DataSketches HLL (lg_k=11)#2837

Merged
guilload merged 3 commits into main from congxie/replaceHll
Feb 12, 2026

@congx4
Collaborator

@congx4 congx4 commented Feb 11, 2026

What

Switch tantivy's cardinality aggregation from the hyperloglogplus crate (HyperLogLog++ with p=16) to the official Apache DataSketches HLL implementation (datasketches crate v0.2.0, lg_k=11, Hll4).

Why

Pomsky currently fabricates fake HLL sketches from scalar cardinality estimates (generate_sketch(count)). When event query merges these across shards, the fake hashes collide — producing incorrect cardinality. For counts >256, it returns an empty HLL, losing data entirely.
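
To illustrate why a real sketch has to travel between shards, here is a toy register-based HLL. It is a sketch under stated assumptions, not the DataSketches implementation: `ToyHll` and `demo_overlapping_shards` are hypothetical names, and it hashes with Rust's `DefaultHasher` rather than MurmurHash3. It shows that merging register arrays yields the same state as sketching the union directly, while adding per-shard scalar estimates double-counts values seen on both shards.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const LG_K: u32 = 11; // same register count as the PR, 2^11
const K: usize = 1 << LG_K;

/// Toy HyperLogLog: one u8 register per bucket (illustration only).
#[derive(Clone)]
struct ToyHll {
    regs: Vec<u8>,
}

impl ToyHll {
    fn new() -> Self {
        Self { regs: vec![0; K] }
    }

    /// Hash the value, pick a bucket from the top LG_K bits, and keep
    /// the largest "leading zeros + 1" rank seen in that bucket.
    fn update<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let x = hasher.finish();
        let idx = (x >> (64 - LG_K)) as usize;
        let rest = x << LG_K;
        let rank = (rest.leading_zeros().min(64 - LG_K) + 1) as u8;
        if rank > self.regs[idx] {
            self.regs[idx] = rank;
        }
    }

    /// Merging is a register-wise maximum, so it is lossless.
    fn merge(&mut self, other: &ToyHll) {
        for (a, b) in self.regs.iter_mut().zip(&other.regs) {
            *a = (*a).max(*b);
        }
    }

    /// Classic HLL estimate with the linear-counting fallback for small counts.
    fn estimate(&self) -> f64 {
        let k = K as f64;
        let sum: f64 = self.regs.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = 0.7213 / (1.0 + 1.079 / k) * k * k / sum;
        let zeros = self.regs.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * k && zeros > 0 {
            k * (k / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Two shards that share 3000 of their ids; the true distinct count is 9000.
/// Returns (estimate of the merged sketch, sum of per-shard estimates).
fn demo_overlapping_shards() -> (f64, f64) {
    let mut shard_a = ToyHll::new();
    let mut shard_b = ToyHll::new();
    for id in 0u64..6000 {
        shard_a.update(&id);
    }
    for id in 3000u64..9000 {
        shard_b.update(&id);
    }
    let summed = shard_a.estimate() + shard_b.estimate();
    let mut merged = shard_a.clone();
    merged.merge(&shard_b);
    (merged.estimate(), summed)
}
```

The merged estimate should land near 9000, while the summed scalars land near 12000. A sketch rebuilt from a scalar count cannot recover which values overlapped, which is why forwarding the real sketch bytes matters.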

Why DataSketches over other options

Official crate (datasketches = 0.2.0, released 2026-01-14).
Binary-compatible with datasketches-java (cross-language tests in the crate).
Same hash function (MurmurHash3 x64-128, seed=9001).
Same lg_k=11 as Java Union(LOG2M=11).
Zero conversion needed — pomsky just forwards raw bytes.

Logs-backend already supports DataSketches HLL via the hll_data_sketch_value protobuf field (field 10). No changes needed in logs-backend or cloudprem-bridge.
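
For a rough sense of what moving from p=16 to lg_k=11 costs in accuracy, the arithmetic below uses the textbook HyperLogLog approximation of about 1.04/sqrt(m) for the relative standard error with m registers. This is an assumption for illustration: DataSketches documents its own error bounds, which depend on the estimator in use, and the byte count covers only a register array at 4 bits per slot (as in Hll4), ignoring the header and any auxiliary data. The function names are hypothetical.

```rust
/// Number of registers for a given log2 register count.
fn registers(lg_k: u32) -> u64 {
    1u64 << lg_k
}

/// Textbook HyperLogLog relative standard error, roughly 1.04 / sqrt(m).
fn approx_rse(lg_k: u32) -> f64 {
    1.04 / (registers(lg_k) as f64).sqrt()
}

/// Size in bytes of the register array alone at `bits_per_register` bits each.
fn register_bytes(lg_k: u32, bits_per_register: u64) -> u64 {
    registers(lg_k) * bits_per_register / 8
}
```

Under this approximation, lg_k=11 (2048 registers) gives roughly 2.3% relative standard error versus roughly 0.4% for p=16 (65536 registers), in exchange for a register array of about 1 KiB at 4 bits per slot.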

Changes

  • Cargo.toml: hyperloglogplus 0.4.1 → datasketches 0.2.0
  • CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> → HllSketch(lg_k=11, Hll4)
  • Custom Serde impl using HllSketch binary serialization format (for cross-shard transfer)
  • New to_sketch_bytes() method for external consumers (pomsky will call this)
  • Salt preserved via (salt, value) tuple hashing for column type disambiguation
  • Removed BuildSaltedHasher struct
  • 4 new unit tests: serde roundtrip, merge, binary format compat, salt differentiation
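
The salt item above can be pictured with a minimal sketch. It assumes a one-byte salt per column type and uses the standard library's `DefaultHasher` purely for illustration; the PR itself feeds the tuple to the sketch, which hashes with MurmurHash3. `salted_hash` is a hypothetical helper, not a function from tantivy or the datasketches crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a (salt, value) tuple so that the same raw value coming from
/// columns with different salts produces different hashes.
fn salted_hash<T: Hash>(salt: u8, value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    (salt, value).hash(&mut hasher);
    hasher.finish()
}
```

With this shape, `salted_hash(1, &42u64)` and `salted_hash(2, &42u64)` differ, so the raw value 42 stored under two different column types is counted as two distinct entries rather than one.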

Testing

All 169 aggregation tests pass, including 10 cardinality-specific tests (6 existing + 4 new).

Follow-up PRs

  • pomsky: Replace generate_sketch hack → call cardinality.to_sketch_bytes(), return as hll_data_sketch_value
  • pomsky: Delete generate_sketch function and its test

Switch tantivy's cardinality aggregation from the hyperloglogplus crate
(HyperLogLog++ with p=16) to the official Apache DataSketches HLL
implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).

This enables returning raw HLL sketch bytes from pomsky to Datadog's
event query, where they can be properly deserialized and merged using
the same DataSketches library (Java). The previous implementation
required pomsky to fabricate fake HLL sketches from scalar cardinality
estimates, which produced incorrect results when merged.

Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
@PSeitz
Collaborator

PSeitz commented Feb 11, 2026

Can you check the performance before and after?

cargo bench cardinality

@congx4
Collaborator Author

congx4 commented Feb 11, 2026

> Can you check the performance before and after?
>
> cargo bench cardinality

before:
Screenshot 2026-02-11 at 3 09 41 PM

after:
Screenshot 2026-02-11 at 3 09 50 PM

This is an open-source repo — replace references to Datadog's event query
with generic cross-language compatibility descriptions.
@guilload guilload merged commit 51f340f into main Feb 12, 2026
8 checks passed
@guilload guilload deleted the congxie/replaceHll branch February 12, 2026 22:19