Use varint length field for last_path encoding to support longer GCP object names#72
Conversation
|
Hello, @r-uehara0219 Thank you for proposing this PR. We will check this PR. Please wait for a while. By the way, We are Looking for long-term maintainers around the Embulk eco-system. If you (and your team) are willing to raise your hand for the community on the total work of GCS plugins, you're welcome. |
dmikurube
left a comment
There was a problem hiding this comment.
Thanks for your contribution, @r-uehara0219 !
This part seems to come from #47, which was to support Japanese characters, but the 128 limitation was a kind of tentative/ad-hoc one.
This pull-request fixes the ad-hoc one. It basically looks good to me! Then approved, but left a couple of minor comments. Could you address them?
|
|
||
| int utf8EncodeLength = utf8.length; | ||
| if (utf8EncodeLength >= 128) { | ||
| if (utf8EncodeLength >= 65_535) { |
There was a problem hiding this comment.
It's understandable that this base64Encode method can handle 65536 bytes at most by itself (regardless of the limitation for task.getLastPath()), but it is a little bit confusing for code readers that it has a different limitation from task.getLastPath().
Can you align this with task.getLastPath() and leave a comment to explain why limited by that number?
There was a problem hiding this comment.
Aligned byte limit to 1024 bytes and added a comment.
db7e7b9
| // see: https://protobuf.dev/programming-guides/encoding/#varints | ||
| private static byte[] encodeVarint(int value) |
There was a problem hiding this comment.
I understand that this method is indirectly tested via base64Encode, but for sure and for readability, can you add some more tests directly for this encodeVarint method?
There was a problem hiding this comment.
Tests were added. I use reflect to test private methods, but please point out if this should not be used as repository policy.
9f691e6
There was a problem hiding this comment.
Thanks for your consideration! I think that the method can be just package-private (no visibility modifier) so that the method is visible from test classes.
There was a problem hiding this comment.
Fixed to package-private. Thanks for the comment!
dea553b
Co-authored-by: Dai MIKURUBE <dmikurube@users.noreply.github.com>
…/embulk-input-gcs into feature/varint-length-field
|
Hmm, the tests are failing on GitHub Actions. Some of them might be for different reasons (ex. GCP credentials?), but at least |
|
Sorry, only some tests were run in my local environment and the test results were not properly verified. Please wait while I fix the test. |
The test EXPECTED was incorrect and has been corrected. |
|
Thanks for working on the fix! Confirmed that |
|
Most of the errors seem to occur when reading the P12 key in AuthUtils.java:98. @dmikurube |
|
They're updated 3 years ago, but #74 passed just a couple of minutes ago... Hmm, could be a kind of permission issues by external contributors? Let me see. |
|
Yeah, that's the cause. We should have used the GitHub https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/ We've already set this repo not to run any Workflow against a fork pull-req automatically without approval, then using Okay, let me merge this pull-req now, and let's see the Thanks for your contribution! |

fix #71
Overview
This PR updates the last_path encoding mechanism in embulk-input-gcs. Instead of using a fixed 1-byte length field (limiting paths to 127 bytes), the new implementation uses a varint length field, which can span 1–2 bytes, to represent the length of the UTF-8 encoded string.
Why Is It Necessary?
GCP object names can be up to 1024 bytes in length. The previous implementation’s 1-byte length field restricted last_path to 128 characters, causing errors when handling longer object names. Switching to a varint length field removes this limitation and ensures that the plugin can support all valid GCP object names.