feat: unify left and right functions and benches by theirix · Pull Request #20114 · apache/datafusion

theirix · 2026-02-02T17:09:00Z

Which issue does this PR close?

Closes feat: unify left and right functions #20103

Rationale for this change

A refactoring PR for performance improvement PRs for left #19749 and right #20068.

What changes are included in this PR?

Removed a lot of code duplication by extracting a common stringarray / stringview implementation. Now left and right UDFs entry points are leaner. Differences are only in slicing - from the left or from the right - which is implemented in a generic trait parameter, following the design of trim.
Switched left to use make_view to avoid buffer tinkering in datafusion code.
Combine left and right benches together

Are these changes tested?

Existing unit tests
Existing SLTs passed
Benches show the same performance improvement of 60-85%

Bench results against pre-optimisation commit 458b491:

Details

left size=1024/string_array positive n/1024 time: [34.150 µs 34.694 µs 35.251 µs] change: [−71.694% −70.722% −69.818%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild left size=1024/string_array negative n/1024 time: [30.860 µs 31.396 µs 31.998 µs] change: [−85.846% −85.294% −84.759%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 2 (2.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe

left size=4096/string_array positive n/4096
time: [112.19 µs 114.28 µs 116.98 µs]
change: [−71.673% −70.934% −70.107%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
left size=4096/string_array negative n/4096
time: [126.71 µs 129.06 µs 131.26 µs]
change: [−84.204% −83.809% −83.455%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) low mild
2 (2.00%) high mild

left size=1024/string_view_array positive n/1024
time: [30.249 µs 30.887 µs 31.461 µs]
change: [−75.288% −74.499% −73.743%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) low mild
1 (1.00%) high mild
left size=1024/string_view_array negative n/1024
time: [48.404 µs 49.007 µs 49.608 µs]
change: [−66.827% −65.727% −64.652%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe

left size=4096/string_view_array positive n/4096
time: [145.25 µs 148.47 µs 151.85 µs]
change: [−68.913% −67.836% −66.770%] (p = 0.00 < 0.05)
Performance has improved.
left size=4096/string_view_array negative n/4096
time: [203.11 µs 206.31 µs 209.98 µs]
change: [−57.411% −56.773% −56.142%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low mild
13 (13.00%) high mild
1 (1.00%) high severe

right size=1024/string_array positive n/1024
time: [30.820 µs 31.674 µs 32.627 µs]
change: [−84.230% −83.842% −83.402%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
right size=1024/string_array negative n/1024
time: [32.434 µs 33.170 µs 33.846 µs]
change: [−88.796% −88.460% −88.164%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

right size=4096/string_array positive n/4096
time: [124.71 µs 126.54 µs 128.27 µs]
change: [−83.321% −82.902% −82.537%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
right size=4096/string_array negative n/4096
time: [125.05 µs 127.67 µs 130.35 µs]
change: [−89.376% −89.193% −89.004%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

right size=1024/string_view_array positive n/1024
time: [29.110 µs 29.608 µs 30.141 µs]
change: [−79.807% −79.330% −78.683%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
right size=1024/string_view_array negative n/1024
time: [44.883 µs 45.656 µs 46.511 µs]
change: [−71.157% −70.546% −69.874%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe

right size=4096/string_view_array positive n/4096
time: [139.57 µs 142.18 µs 144.96 µs]
change: [−75.610% −75.088% −74.549%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe
right size=4096/string_view_array negative n/4096
time: [221.47 µs 224.47 µs 227.72 µs]
change: [−64.625% −64.047% −63.504%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

Are there any user-facing changes?

theirix · 2026-02-02T18:10:28Z

A final left/right PR - @Jefffrey , could you please have a look?

Jefffrey · 2026-02-03T02:24:50Z

datafusion/functions/src/unicode/common.rs

+        if n == 0 {
+            // Return nothing for `n=0`
+            0..0
+        } else {
+            0..left_right_byte_length(string, n)
+        }


Suggested change

if n == 0 {

// Return nothing for `n=0`

0..0

} else {

0..left_right_byte_length(string, n)

}

0..left_right_byte_length(string, n)

Technically we already check for 0 inside left_right_byte_length

martin-g · 2026-02-03T13:17:13Z

datafusion/functions/benches/left_right.rs

+                    })
+                });
+
+                // Benchmark with negative n (triggers optimization)


The code below is very similar to the "positive" logic above.
Would it be a good idea to add for is_negative in [false, true] { before/after for is_string_view in [false, true] and merge them ?

Good point, I simplified this part and generalized slicing. Now it looks much more compact and concise.

martin-g · 2026-02-03T13:37:29Z

datafusion/functions/src/unicode/common.rs

+                let result_bytes = &string.as_bytes()[range.clone()];
+
+                let byte_view = ByteView::from(view);
+                // New offsets starts at 0 for left, and at `range.start` for right,


Suggested change

// New offsets starts at 0 for left, and at `range.start` for right,

// New offset starts at 0 for left, and at `range.start` for right,

Jefffrey · 2026-02-05T01:13:20Z

Thanks @theirix & @martin-g

theirix added 2 commits February 2, 2026 12:39

Combine left and right benchmarks

c8087d8

Unify left and right implementation

204921c

github-actions bot added the functions Changes to functions implementation label Feb 2, 2026

theirix added 2 commits February 2, 2026 17:34

Move common to unicode mod

aa58ff0

Add license header

8794229

theirix marked this pull request as ready for review February 2, 2026 18:09

Jefffrey approved these changes Feb 3, 2026

View reviewed changes

martin-g reviewed Feb 3, 2026

View reviewed changes

theirix added 3 commits February 3, 2026 21:20

Combine positive and negative benches

614dc5b

Fix typo

0bb6925

Simplify slicer for left

426997f

Jefffrey enabled auto-merge February 5, 2026 01:11

Jefffrey approved these changes Feb 5, 2026

View reviewed changes

Jefffrey added this pull request to the merge queue Feb 5, 2026

Merged via the queue into apache:main with commit eb33141 Feb 5, 2026
31 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: unify left and right functions and benches#20114

feat: unify left and right functions and benches#20114
Jefffrey merged 7 commits intoapache:mainfrom
theirix:left-right-unify

theirix commented Feb 2, 2026

Uh oh!

theirix commented Feb 2, 2026

Uh oh!

Jefffrey Feb 3, 2026

Uh oh!

martin-g Feb 3, 2026

Uh oh!

theirix Feb 3, 2026

Uh oh!

martin-g Feb 3, 2026

Uh oh!

Uh oh!

Jefffrey commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// New offsets starts at 0 for left, and at `range.start` for right,
	// New offset starts at 0 for left, and at `range.start` for right,

Conversation

theirix commented Feb 2, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

theirix commented Feb 2, 2026

Uh oh!

Jefffrey Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

martin-g Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

theirix Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

martin-g Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jefffrey commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants