Skip to content

Conversation

@wilkolbrzym-coder
Copy link

I've been grinding on the SIMD string escape paths to see how far we could push the throughput, and I finally nailed a version that's significantly faster across the board.
What's changed:
Aggressive Unrolling: I've implemented unrolled loops for AVX2 and SSE2 to fully saturate the CPU execution ports and minimize loop control overhead.
Speculative Stores: The code now writes to the buffer speculatively before checking for escape characters. This keeps the pipeline moving and provides a massive boost for "clean" strings.
Small String Dispatcher: I added a lightweight dispatcher for small strings. This ensures we don't pay any SIMD setup penalty for tiny keys, maintaining our baseline performance while crushing it on everything else.
UTF-8 & High-Byte Stability: I managed to completely clear the regressions on non-ASCII data. The new logic is now consistently faster than baseline even on UTF-8 payloads.
Verification:
I've verified everything on native Ubuntu. All 82 core tests passed, so correctness is 100% solid across UTF-8, control characters, and edge cases.

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225507 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225510 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225508 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225509 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/544198 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225511 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097435/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225520 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225519 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225521 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/544203 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225522 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for 87683da. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225523 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097440/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225464 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097421/

@packit-as-a-service
Copy link

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225487 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097428/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225489 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097428/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225485 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097428/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3225488 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10097428/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3227184 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10098015/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/copr/3227185 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10098015/

@packit-as-a-service
Copy link

One of the tests failed for d96b176. @admin check logs None, packit dashboard https://dashboard.packit.dev/jobs/srpm/545069 and external service dashboard https://copr.fedorainfracloud.org/coprs/build/10099694/

@stephenberry
Copy link
Owner

I just ran the string writing benchmarks and they have some correctness checks that are failing:

MISMATCH in 'single null' (input len 1):
SWAR (3 bytes): "
SIMD (4 bytes): "
First diff at byte 2: SWAR=0x22 SIMD=0x00
Length difference: SWAR=3 SIMD=4
MISMATCH in 'single 0x1F' (input len 1):
SWAR (3 bytes): "�"
SIMD (4 bytes): "
First diff at byte 1: SWAR=0x1F SIMD=0x00
Length difference: SWAR=3 SIMD=4
MISMATCH in 'all_control_chars' (input len 512):
SWAR (1019 bytes): "...
SIMD (1026 bytes): "...
First diff at byte 1011: SWAR=0x19 SIMD=0x00
Length difference: SWAR=1019 SIMD=1026
MISMATCH in 'boundary_0x1F 7' (input len 7):
SWAR (9 bytes): "�xxxxxx"
SIMD (10 bytes): "
First diff at byte 1: SWAR=0x1F SIMD=0x00
Length difference: SWAR=9 SIMD=10
MISMATCH in 'boundary_0x1F 8' (input len 8):
SWAR (11 bytes): "
SIMD (12 bytes): "
First diff at byte 9: SWAR=0x1F SIMD=0x00
Length difference: SWAR=11 SIMD=12
MISMATCH in 'boundary_0x1F 15' (input len 15):
SWAR (19 bytes): "
SIMD (20 bytes): "
First diff at byte 17: SWAR=0x1F SIMD=0x00
Length difference: SWAR=19 SIMD=20
MISMATCH in 'boundary_0x1F 64' (input len 64):
SWAR (75 bytes): "
SIMD (76 bytes): "
First diff at byte 73: SWAR=0x1F SIMD=0x00
Length difference: SWAR=75 SIMD=76
MISMATCH in 'alternating_0x1F_0x20 8' (input len 8):
SWAR (11 bytes): "
SIMD (14 bytes): "
First diff at byte 4: SWAR=0x1F SIMD=0x00
Length difference: SWAR=11 SIMD=14
MISMATCH in 'alternating_0x1F_0x20 16' (input len 16):
SWAR (23 bytes): "
SIMD (26 bytes): "
First diff at byte 16: SWAR=0x1F SIMD=0x00
Length difference: SWAR=23 SIMD=26
MISMATCH in 'alternating_0x1F_0x20 32' (input len 32):
SWAR (47 bytes): "
SIMD (50 bytes): "
First diff at byte 40: SWAR=0x1F SIMD=0x00
Length difference: SWAR=47 SIMD=50
MISMATCH in 'alternating_0x1F_0x20 64' (input len 64):
SWAR (95 bytes): "
SIMD (98 bytes): "
First diff at byte 88: SWAR=0x1F SIMD=0x00
Length difference: SWAR=95 SIMD=98
MISMATCH in 'alternating_0x1F_0x20 256' (input len 256):
SWAR (383 bytes): "...
SIMD (386 bytes): "...
First diff at byte 376: SWAR=0x1F SIMD=0x00
Length difference: SWAR=383 SIMD=386
CORRECTNESS FAILURE: 12 tests failed! Aborting.
Error: Process completed with exit code 134.

@stephenberry
Copy link
Owner

There are functional changes here that I think will regress performance rather than improve it.

Scalar tail short-circuits the SSE2/SWAR cascade

The existing design is intentionally layered: AVX2 (32 bytes) → SSE2 (16 bytes) → SWAR (8 bytes) → scalar tail. Adding a byte-by-byte scalar tail inside the SIMD functions causes it to consume all remaining bytes, preventing the more efficient downstream paths from ever running.

For example, with a 48-byte string on AVX2, the old code processes 32 bytes via AVX2 + 16 bytes via SSE2. The new code processes 32 bytes via AVX2 + 16 bytes byte-by-byte. That's a meaningful regression for strings whose lengths aren't multiples of 32 (or 16 for SSE2-only).

Removing the speculative store regresses the escape path

The speculative store is a deliberate optimization, not dead code. The old pattern:

_mm256_storeu_si256(reinterpret_cast<__m256i*>(data), v); // one fixed-size SIMD store
// ...if escape at offset `length`:
data += length;  // clean bytes already in place — zero extra work
write_escape();

The new pattern replaces this with a variable-length memcpy(data, c, len) in the escape path. Since len is a runtime value (0–31), the compiler can't lower it to a single instruction — it'll emit a memcpy call or a branchy inline sequence. The old code got those bytes into place for free as part of the SIMD store it was already doing.

In the no-escape path, memcpy(data, c, 32) will compile to the same vmovdqu store, so it's neutral there — but the old code also got to reuse the already-loaded register v, whereas the memcpy approach may cause a redundant reload depending on the compiler.

Minor issues

  • License headers removed — both files lost their // Glaze Library / // For the license information refer to glaze.hpp comments. The NEON file still has them, so this looks unintentional.
  • Missing trailing newline — both files now end with #endif with no newline, which will trigger warnings on some compilers.
  • Behavioral change in tail — the new scalar tail routes all control chars through write_escape(), while the old scalar tail in write.hpp copied through control chars that don't have char_escape_table entries when check_escape_control_characters is disabled.

The const qualification on v and the comment explaining why (v & 0xE0) == 0 is correct for UTF-8 are nice touches, but they don't warrant the functional regressions above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants