Conversation
Agreed, this is very useful when dealing with UTF-8 databases. I wonder whether it should have a UTF-8 variant that does not have to re-truncate and can instead just look at the byte patterns at the border. The current version does not handle UTF-16 code units properly (substring might cut a surrogate pair in half).
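A rough sketch of the UTF-8 variant suggested above, for illustration only. The class and method names (`Utf8BorderSketch`, `truncateUtf8`) are hypothetical and not part of this PR. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so the cut point can be moved back until it no longer lands on one.

```java
import java.nio.charset.StandardCharsets;

public final class Utf8BorderSketch {

    // Hypothetical helper: truncates to at most maxBytes UTF-8 bytes
    // without splitting a multi-byte sequence.
    static String truncateUtf8(String str, int maxBytes) {
        if (str == null) {
            return null; // assumption: null in, null out
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return str;
        }
        int cut = maxBytes;
        // bytes[cut] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a sequence, so
        // step back to the lead byte of that sequence.
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return new String(bytes, 0, cut, StandardCharsets.UTF_8);
    }
}
```

This keeps code points intact but, like the rest of the discussion, says nothing about grapheme clusters: a combining sequence can still be cut between its code points.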
garydgregory
left a comment
Hello all,
I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770
I added some test cases for emoji characters 🚀✨🎉 After adding |
@kiddos
Oh, right.
I'm not requesting support for grapheme clusters in the runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770
…ller or equal then expected and not null
For the test case, I only check that the final output is not null and that its byte size is actually smaller than the specified byte size.
}

@Test
void testTruncateToByteLength() {
These should be separate test methods for different tests so they can pass or fail independently.
Yes, you can reuse the test class, but this is about 10 different tests that should each be a separate method.
The test suite in Commons IO is parameterized and takes into consideration the introduction of support for Unicode 15 in JDK 20 (grapheme clusters and so on): https://github.com/apache/commons-io/blob/b4ee32c53c0036429d64c0d6fe82a62a0fc6dae2/src/test/java/org/apache/commons/io/FileSystemTest.java#L171
Maybe have one method testing the various null cases (more than at the moment), one for the usual single-byte ASCII cases, one for BMP tests, one for grapheme cases, and one for dynamic cases. I would also avoid emoji literals in source.
In apache/commons-io#781 I used a similar approach to introduce equivalent functionality in Commons IO. While the approaches might differ, I think we can reuse the test cases.
We sometimes need to store Unicode text in a fixed space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and truncate it at 16 bytes, since that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) handles this by truncating the string to a byte length while preserving valid character boundaries.
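One way such a method could work is sketched below. This is not the PR's implementation: the class name `ByteTruncateSketch` and the null and negative-argument handling are assumptions. The idea is to let a `CharsetEncoder` encode into a buffer of exactly `maxBytes` bytes. The encoder only writes whole characters, so the number of chars it consumed marks a safe cut point.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.util.Objects;

public final class ByteTruncateSketch {

    // Illustrative only; the signature mirrors the one in the description.
    static String truncateToByteLength(String str, int maxBytes, Charset charset) {
        Objects.requireNonNull(charset, "charset");
        if (str == null) {
            return null; // assumption: null in, null out
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(str);
        // Encoding stops with OVERFLOW once the next character no longer
        // fits; the input position then points just past the last char
        // that was fully encoded.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }
}
```

For example, with UTF-8, `truncateToByteLength("h\u00E9llo", 2, StandardCharsets.UTF_8)` keeps only `"h"`, because the two-byte `é` does not fit in the remaining byte. Surrogate pairs are encoded as one unit, so they are not split, but grapheme clusters made of several code points can still be cut.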