Support lz4 compression in hdfs #18982
Conversation
Quite a lot of the segment already uses lz4 by default, so I'm curious how effective running the whole thing through lz4 again would be. Did you do any experiments to compare with just not 'externally' compressing at all in deep storage? Also, the new 'V10' segment format introduced in #18880 was built around some future ideas I have about improving the virtual storage functionality: downloading only the V10 metadata at first, to enable partial downloads of just the parts of the segment that are needed to take part in a query. That will require the segment not be 'externally' compressed in deep storage (columns inside the segment can obviously still be compressed). Fwiw, I'm not necessarily opposed to making this 'external' segment compression stuff more configurable (but it is certainly a bit tedious with the current interfaces, since it needs to be handled by each implementation of segment pusher/puller separately).
@clintropolis thanks for the comments.
I think you're asking two questions here.
On compression effectiveness: you can see that the decompressed segment has a size of 11GB, while in HDFS (with the current default zip) it has a size of 2.48GB. The other question, I think, is whether we did experiments uploading the raw files into deep storage. As for V10, I do know that it can support partial download, but HDFS as deep storage is different from object storage.
No need to worry about it: the compression configuration is only provided for HDFS, and it's implemented in the HDFS extension itself, not at a higher level that would require every deep storage implementation to do the same.
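
To make that concrete, here is a minimal sketch of how the format choice can live entirely inside the HDFS extension: the pusher simply wraps its raw output stream, and no other pusher/puller is touched. The `CompressionFormat` enum and `wrap` helper are assumed names for illustration, not this PR's actual interface; the block streams come from lz4-java, which Druid already bundles for column compression.

```java
// Hypothetical sketch: the compression choice is made inside the HDFS
// extension only, so other segment pushers/pullers are unaffected.
import net.jpountz.lz4.LZ4BlockOutputStream;

import java.io.OutputStream;
import java.util.zip.ZipOutputStream;

public class HdfsCompressionChoice
{
  // Hypothetical config value; the real property name comes from this PR.
  enum CompressionFormat { ZIP, LZ4 }

  /**
   * Wrap the raw HDFS output stream according to the configured format.
   * Note: a real zip writer must also call putNextEntry() per file, while
   * the lz4 block stream is a plain pass-through stream.
   */
  static OutputStream wrap(OutputStream rawHdfsStream, CompressionFormat format)
  {
    if (format == CompressionFormat.LZ4) {
      // Defaults to the fastest lz4 compressor with 64 KB blocks.
      return new LZ4BlockOutputStream(rawHdfsStream);
    }
    return new ZipOutputStream(rawHdfsStream);
  }
}
```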
Interesting, that is a lot bigger of a difference than I expected, though perhaps if there are a lot of complex columns that are not using compression (off by default, added in #16863) then a difference of that size makes sense. Basically, where I'm thinking things are heading is moving away from generic compression in favor of all of the contents of the segment file being compressed.
Ah, this was mostly just me wondering about uncompressed sizing, though I expect most of the perf stuff would look better not having to do any extra compression/decompression; for your segments at least, it seems like some additional work would need to happen to make that viable.
V10 segments store everything in a single file.

Description
Currently zip (default level 6) is used to compress all files when uploading to HDFS.
Zip compression takes CPU resources and time when uploading large files. Another consequence is that it significantly increases task duration and can leave tasks pending during task rollover, which in turn introduces high lag in Kafka.
This PR adds support for lz4 compression (fastest level) when uploading to HDFS and lz4 decompression when downloading from it.
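
As a hedged illustration of the mechanism (a sketch under assumed names, not this PR's actual code), the round trip could use the lz4 block streams from lz4-java, which Druid already ships:

```java
// Illustrative sketch only: lz4 (fastest level) compression on push and
// decompression on pull, using lz4-java's block stream format.
import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class Lz4SegmentIo
{
  /** Compress one local file into the deep-storage stream (push side). */
  static void compress(Path localFile, OutputStream deepStorageSink) throws IOException
  {
    try (InputStream in = new BufferedInputStream(Files.newInputStream(localFile));
         // The default constructor uses the fastest lz4 compressor and 64 KB blocks.
         LZ4BlockOutputStream out = new LZ4BlockOutputStream(deepStorageSink)) {
      in.transferTo(out);  // requires Java 9+
    }
  }

  /** Decompress the deep-storage stream back to a local file (pull side). */
  static void decompress(InputStream deepStorageSource, Path localFile) throws IOException
  {
    try (LZ4BlockInputStream in = new LZ4BlockInputStream(deepStorageSource);
         OutputStream out = new BufferedOutputStream(Files.newOutputStream(localFile))) {
      in.transferTo(out);
    }
  }
}
```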
Metrics Comparison
file size
Although lz4 may increase the size of segments stored in HDFS, perhaps by 10%-20%, the extra storage cost is acceptable.
task handoff
Task handoff time was reduced because tasks take less time to compress the data.
During the push phase of ingestion tasks, lz4 does not use more CPU than zip,
and for Historicals we did not observe any CPU increase in a cluster where lz4-formatted segments are frequently downloaded.
New Configuration
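Purely as a hypothetical illustration of what the new setting might look like in runtime properties: `druid.storage.type` is a real Druid property, while `druid.storage.compressionFormat` is an assumed placeholder name; the real key is whatever this PR's docs change defines.

```properties
# Hypothetical example; the real property name is defined by this PR's docs.
druid.storage.type=hdfs
druid.storage.compressionFormat=lz4
```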
Metrics
Added the following metrics:
This PR has: