fix: Iceberg commit message when writer rotated#16365
fix: Iceberg commit message when writer rotated#16365PingLiuPing wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
mbasmanova
left a comment
There was a problem hiding this comment.
@PingLiuPing Is it possible to add a test to protect this behavior?
velox/connectors/hive/HiveDataSink.h
Outdated
| /// Size of the file in bytes. | ||
| uint64_t fileSize{0}; | ||
| /// Number of rows in the file. | ||
| uint64_t rowCount{0}; |
There was a problem hiding this comment.
nit: rowCount -> numRows
| // It contains the minimal subset of metadata. | ||
| // TODO: Complete metrics is missing now and this could lead to suboptimal | ||
| // query plan, will collect full iceberg metrics in following PR. | ||
| // clang-format off |
There was a problem hiding this comment.
// clang-format off
this should go inside the loop
@mbasmanova Thank you for the quick review. I attempted to add a test for this behavior. And found that triggering writer rotation requires writing a large amount of data, since the default Parquet flush policy rotates only after 1,024,000 rows or 128MB. One possible approach is to allow IcebergInsertTableHandle to accept WriterOptions, so that tests can inject customized writer options with a smaller flush threshold. For example: This would allow us to trigger writer rotation with a small dataset in tests. Let me explore whether there is a cleaner or more appropriate approach before proceeding. |
|
Ideally, we should be able to set kMaxTargetFileSizeSession to an arbitrary value and rely on the writer to rotate once the file size reaches that limit. This is likely because Parquet’s flush logic is driven by row group thresholds (row count or row group size), rather than the target file size: This issue may also exist in the DWRF writer, although its flush policy is more complex. Parquet’s behavior is relatively straightforward since it is purely based on row count and row group size. Would it make sense for me to open a separate PR to address this issue and add proper test coverage there? |
3f6398b to
b8956f6
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
Would it make sense for me to open a separate PR to address this issue and add proper test coverage there?
Thank would be nice. Thanks.
|
@jainxrohit has imported this pull request. If you are a Meta employee, you can view this in D93157234. |
|
@jainxrohit merged this pull request in 9cb3cd1. |
Fixes commit message correctness for multi-file writes by updating Iceberg writer rotation handling and tracking per-file row counts and emitting one commit task per rotated file.
Fix parquet writer does not honor smaller values of
max-target-file-sizeconfig.These are the adjustments required after #16077.