Add StringUtils.truncateToByteLength by kiddos · Pull Request #1392 · apache/commons-lang

kiddos · 2025-05-27T17:34:58Z

We sometimes need to store Unicode text in a fixed space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and truncate it at a fixed byte count, since that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) handles this by truncating the string based on byte length while preserving valid character boundaries.
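To make the idea concrete, here is a minimal sketch of one way such a method could work. It is not the code in this PR: the class name `ByteTruncateSketch` is hypothetical, and the approach (letting a `CharsetEncoder` fill a fixed-size buffer and reading back how many chars it consumed) is an assumption chosen for illustration. It also assumes a stateless charset such as UTF-8 or ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public final class ByteTruncateSketch {

    /** Hypothetical stand-in for the proposed method; not the PR's implementation. */
    static String truncateToByteLength(String str, int maxBytes, Charset charset) {
        if (str == null) {
            return null;
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        // Replace malformed or unmappable input, mirroring String.getBytes(Charset).
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(str);
        // The encoder stops when the next character no longer fits in 'out';
        // 'in.position()' is then the number of chars that were fully encoded.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }
}
```

With UTF-8, `truncateToByteLength("h\u00E9llo", 2, StandardCharsets.UTF_8)` keeps only `"h"`, because the two-byte `é` would not fit in the remaining byte.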

ecki · 2025-05-28T00:07:55Z

Agreed, this is very useful when dealing with UTF-8 databases. I wonder if it should have a UTF-8 variant that does not have to re-truncate and can just look at the byte patterns at the border.
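A UTF-8-specific variant along these lines could be sketched as follows. The class and method names are hypothetical and this is an illustration of the idea, not code from the PR. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so a cut that lands on a continuation byte can be moved back to the start of that sequence.

```java
import java.nio.charset.StandardCharsets;

public final class Utf8BorderSketch {

    /** Hypothetical UTF-8-only truncation that inspects the bytes at the cut point. */
    static String truncateUtf8(String str, int maxBytes) {
        if (str == null) {
            return null;
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return str;
        }
        int cut = maxBytes;
        // bytes[cut] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut falls inside a multi-byte
        // sequence, so back up until it sits on a lead byte.
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return new String(bytes, 0, cut, StandardCharsets.UTF_8);
    }
}
```

This encodes the string once and never re-encodes a shortened candidate, which is the saving the comment above is pointing at.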

The current version does not deal with UTF-16 surrogate pairs properly. (Substring might cut a pair in half.)
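The surrogate problem can be shown with a small sketch. The helper name `safePrefix` is hypothetical and is not part of the PR; it only illustrates how a `substring` end index can be moved back by one char when it would otherwise fall between the two halves of a pair.

```java
public final class SurrogateSketch {

    /** Hypothetical helper: prefix of at most 'end' chars that never splits a surrogate pair. */
    static String safePrefix(String s, int end) {
        if (end > 0 && end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // step back so the whole pair is excluded
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String rocket = "\uD83D\uDE80"; // U+1F680, stored as two UTF-16 chars
        // A plain substring keeps only the high surrogate, which is not a valid character.
        System.out.println(Character.isHighSurrogate(rocket.substring(0, 1).charAt(0)));
        // The guarded version drops the pair entirely instead.
        System.out.println(safePrefix("a" + rocket + "b", 2).equals("a"));
    }
}
```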

garydgregory

Hello all,

I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770

kiddos · 2025-05-28T17:39:48Z

I added some test cases for emoji characters 🚀✨🎉
I did some testing and found that, with the current implementation, the escaped characters work ("\uD83D\uDE80\u2728\uD83C\uDF89")
but the literal "🚀✨🎉" doesn't.

After adding <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in pom.xml, "🚀✨🎉" seems to work.

garydgregory · 2025-05-28T18:07:51Z

@kiddos
Please see my previous comment.

kiddos · 2025-05-28T19:11:07Z

Oh, right.
It's just tricky to handle grapheme clusters.
The code point solution you mentioned does seem to work.
I'll add more tests using grapheme clusters.

garydgregory · 2025-05-28T19:30:19Z

I'm not requesting support for grapheme clusters in the runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770
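As an illustration of the kind of expectation a test could pin down, the sketch below uses a hypothetical code-point-safe UTF-8 truncation (not the PR's implementation) on a sequence of three code points, MAN + ZERO WIDTH JOINER + WOMAN, which Unicode's segmentation rules treat as a single extended grapheme cluster. The truncation yields a valid string, yet it cuts the cluster apart, which is exactly the "not supported" behavior a test might document.

```java
public final class GraphemeExpectationSketch {

    // U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F469 WOMAN: 4 + 3 + 4 = 11 UTF-8 bytes
    static final String COUPLE = "\uD83D\uDC68\u200D\uD83D\uDC69";

    /** Hypothetical truncation that respects code points but knows nothing about grapheme clusters. */
    static String truncateByCodePoint(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) {
                break; // the next code point would exceed the byte limit
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    /** Number of UTF-8 bytes for a code point (an unpaired surrogate is counted as 3, an overestimate). */
    static int utf8Length(int cp) {
        if (cp < 0x80) {
            return 1;
        }
        if (cp < 0x800) {
            return 2;
        }
        if (cp < 0x10000) {
            return 3;
        }
        return 4;
    }
}
```

Calling `truncateByCodePoint(COUPLE, 8)` returns MAN followed by a dangling ZERO WIDTH JOINER (7 bytes): no surrogate pair is split, but the cluster is.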

kiddos · 2025-06-29T15:25:21Z

For the test case, I only check that the final output is not null and that its byte size is actually smaller than the specified byte size.
Is that OK?

elharo · 2025-09-18T10:05:01Z

src/test/java/org/apache/commons/lang3/StringUtilsTest.java

    }

+    @Test
+    void testTruncateToByteLength() {


These should be separate test methods for different tests so they can pass or fail independently.

The test suite in Commons IO is parameterized and takes into consideration the introduction of support for Unicode 15 in JDK 20 (grapheme clusters and so on): https://github.com/apache/commons-io/blob/b4ee32c53c0036429d64c0d6fe82a62a0fc6dae2/src/test/java/org/apache/commons/io/FileSystemTest.java#L171

Maybe have one method testing the various null cases (more than at the moment), one for the usual single-byte ASCII cases, one for BMP tests, then one for grapheme cases, and one for dynamic cases. I would also avoid emoji literals in source.

ppkarwasz · 2025-09-18T10:22:08Z

In apache/commons-io#781 I used a similar approach to introduce an equivalent functionality in Commons IO.

While approaches might differ, I think we can reuse the test cases.

elharo · 2025-09-18T10:30:05Z

Yes, you can reuse the test class, but this is about 10 different tests that should each be a separate method.

Add StringUtils.truncateToByteLength

375e9cf

garydgregory requested changes May 28, 2025

fix case for emojis

b12a9ee

kiddos added 3 commits June 8, 2025 22:34

add test cases

4ba9a41

Merge branch 'master' of github.com:apache/commons-lang

ea235b3

case with graphene cluster only check if output bytes is actually sma…

e9a1a31

…ller or equal then expected and not null

elharo suggested changes Sep 18, 2025
