
Support lz4 compression in hdfs #18982

Open
FrankChen021 wants to merge 10 commits into apache:master from FrankChen021:lz4-compression-cherrypick

Conversation

@FrankChen021
Member

Description

Currently zip (default level 6) is used to compress all files when uploading to HDFS.
The zip compression takes CPU resources and time when uploading large files. Another consequence is that it significantly increases task duration and may introduce task pending during task rollover, which also introduces high lag in Kafka.
This PR adds support for lz4 compression (fastest level) when uploading to HDFS and for decompression when downloading from HDFS.
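
For illustration only, a minimal Java sketch of the idea (not the PR's actual code): the archive stream written to or read from HDFS is wrapped with the lz4-java block streams that Druid already bundles for column compression. Whether the PR uses the block or frame format, and the class names below, are assumptions.

```java
import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;

import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: lz4-compress the segment archive on push and
// transparently decompress it on pull.
public class Lz4StreamingSketch
{
  // Push path: compress while writing the archive to deep storage.
  // The default constructor uses lz4-java's fast compressor, matching the
  // "fastest level" trade-off described above.
  public static OutputStream compressing(OutputStream rawHdfsOut)
  {
    return new LZ4BlockOutputStream(rawHdfsOut);
  }

  // Pull path: decompress while reading the archive back.
  public static InputStream decompressing(InputStream rawHdfsIn)
  {
    return new LZ4BlockInputStream(rawHdfsIn);
  }
}
```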

Metrics Comparison

  • Because task duration is reduced, we no longer observed task pending.
(screenshot)
  • Push duration is significantly reduced, from up to 2 minutes down to about 30 seconds, while the pull duration (pull + decompression) stays the same.
(screenshot)
  • File size
    Although lz4 may increase the size of the segments stored in HDFS, by roughly 10%-20%, the extra storage cost is acceptable.

  • Task hand-off
    Hand-off time was reduced because tasks spend less time compressing the data.

(screenshot)
  • CPU resources
    During the push phase of ingestion tasks, lz4 does not consume more CPU than zip.
(screenshot)

For Historicals, we did not observe any increase in CPU usage in a cluster where lz4-formatted segments are frequently downloaded.

New Configuration

  • druid.storage.compressionFormat is added; if not set, it defaults to zip to keep the current behaviour (see the configuration sketch below).
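
A minimal configuration sketch, assuming lz4 is the accepted value for the new property (only the property name and the zip default are stated above; the other lines are the usual HDFS deep storage settings):

```properties
# common.runtime.properties, HDFS deep storage
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

# Added by this PR: compression format for pushed segment archives.
# Defaults to zip when unset; lz4 (assumed value name) uses the fastest level.
druid.storage.compressionFormat=lz4
```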

Metrics

Added the following metrics:

  • hdfs/pull/size
  • hdfs/pull/duration
  • hdfs/push/size
  • hdfs/push/duration

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@clintropolis
Member

quite a lot of the segment already uses lz4 by default, so I'm curious how effective running the whole thing through lz4 again would be. Did you do any experiments to compare with just not 'externally' compressing at all in deep storage?

S3 support for honoring druid.storage.zip was added in #18544, and we've been using it in some of our clusters alongside the virtual storage functionality introduced in #18176, since not having to decompress the segments to load them speeds things up quite a lot. It trades some extra space, but has been worth it for our use case.

Also, the new 'V10' segment format introduced in #18880 was built around some future ideas I have about improving the virtual storage functionality: start by downloading only the V10 metadata, to enable partial downloads of just the parts of the segment that are needed to take part in a query. That will require the segment not be 'externally' compressed in deep storage (columns inside the segment can obviously still be compressed).

Fwiw, I'm not necessarily opposed to making this 'external' segment compression stuff more configurable (but it is certainly a bit tedious with the current interfaces since it needs to be handled by each implementation of segment pusher/puller separately).

@FrankChen021
Member Author

FrankChen021 commented Feb 4, 2026

@clintropolis thanks for the comments.

quite a lot of the segment already uses lz4 by default, so I'm curious how effective running the whole thing through lz4 again would be. Did you do any experiments to compare with just not 'externally' compressing at all in deep storage?

I think you're asking two questions here.
The first is: after lz4 compression of the columns, is there still a gain from compressing again? Here's a table showing the raw segment size and the sizes after zip/lz4 compression:

(screenshot of the size comparison table)

You can see that the decompressed segment is 11 GB, while in HDFS (with the current default zip) it is 2.48 GB.

The other question I think you're asking is whether we did experiments uploading the raw files to deep storage.
No.
First, the table above shows that compressing the whole directory still gives a gain.
Second, HDFS deep storage currently does not support uploading the segment file by file.

As for V10, I do know that it can support partial download, but HDFS as deep storage is different from object storage:
HDFS cannot handle as many files as object storage, nor can it provide the same level of concurrent access, so querying data directly from HDFS is much slower than querying data from object storage. For us, HDFS is the main storage and I don't see a way for us to migrate from HDFS to object storage; we will stay on HDFS for a very long time.

since it needs to be handled by each implementation of segment pusher/puller

No need to worry about that. The compression configuration is only provided for HDFS and is implemented inside the HDFS extension, not at a higher level that would require every deep storage implementation to support it.
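
A hypothetical sketch of what that per-extension handling could look like (the class, enum, and helper names are illustrative, not the PR's actual code):

```java
import net.jpountz.lz4.LZ4BlockOutputStream;

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical: the compressionFormat switch lives entirely inside the HDFS
// pusher, so other deep-storage implementations keep their existing behaviour.
public class HdfsPushSketch
{
  public enum CompressionFormat { ZIP, LZ4 }

  public long push(File segmentDir, OutputStream rawHdfsOut, CompressionFormat format) throws IOException
  {
    if (format == CompressionFormat.LZ4) {
      // New path: fastest-level lz4 over the archived segment directory.
      try (OutputStream lz4Out = new LZ4BlockOutputStream(rawHdfsOut)) {
        return archiveDirectory(segmentDir, lz4Out);
      }
    }
    // Default path: keep today's zip behaviour.
    return zipDirectory(segmentDir, rawHdfsOut);
  }

  // Hypothetical stand-ins for the real archiving/zipping utilities.
  private long archiveDirectory(File dir, OutputStream out) throws IOException
  {
    return 0; // elided
  }

  private long zipDirectory(File dir, OutputStream out) throws IOException
  {
    return 0; // elided
  }
}
```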

@clintropolis
Member

You can see that the decompressed segment is 11 GB, while in HDFS (with the current default zip) it is 2.48 GB.

interesting, that is a lot bigger of a difference than I expected, though perhaps if there are a lot of complex columns that are not using compression (off by default; the option was added in #16863) then a difference of that size makes sense. Basically, where I'm thinking things are heading is moving away from generic compression in favor of all of the contents of the segment file being compressed.

The other question I think you're asking is whether we did experiments uploading the raw files to deep storage.

Ah, this was mostly just me wondering about uncompressed sizing, though I expect most of the perf stuff would look better without any extra compression/decompression. For your segments at least, it seems like some additional stuff would need to happen to make that viable.

As for V10, I do know that it can support partial download, but HDFS as deep storage is different from object storage:
HDFS cannot handle as many files as object storage, nor can it provide the same level of concurrent access, so querying data directly from HDFS is much slower than querying data from object storage. For us, HDFS is the main storage and I don't see a way for us to migrate from HDFS to object storage; we will stay on HDFS for a very long time.

V10 segments store everything in a single file, druid.segment, so in terms of file count it should be no different than the single .zip (or whatever) there is today with externally compressed V9 segments. It is fair that partial downloads would potentially increase concurrent access, though with smaller fetches, so maybe not quite as bad. None of the partial-download stuff exists yet though, and virtual storage mode is optional.
