fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip #3329

Open
Yuof wants to merge 2 commits into dolanmiu:master from Yuof:fix/utf8-encode-before-zip

Yuof commented Jan 29, 2026

Problem

When JSZip processes large XML content (> 16 KB), it chunks strings using `substring()`, which operates on UTF-16 code units. A chunk boundary can therefore fall in the middle of a surrogate pair for characters above U+FFFF (like emoji or Material Design Icons).

Each surrogate then gets encoded as a separate 3-byte sequence, producing CESU-8-style bytes that are invalid as UTF-8. This corrupts the docx file and causes XML parsing errors.
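
To make the failure mode concrete, here is a minimal sketch of the split. The tiny `BLOCK_SIZE`, the `chunkByCodeUnits()` function, and the `encodeLoneUnit()` function are hypothetical stand-ins written for this illustration; they are not JSZip code, and only mimic the behavior described above.

```javascript
// Illustration only: a tiny block size stands in for JSZip's 16 * 1024.
const BLOCK_SIZE = 4;

// U+1F600 is stored in a JavaScript string as the surrogate pair D83D DE00.
const text = "abc\u{1F600}def";

// Slice by UTF-16 code units, the way substring() does.
function chunkByCodeUnits(input, blockSize) {
  const chunks = [];
  for (let index = 0; index < input.length; index += blockSize) {
    chunks.push(input.substring(index, index + blockSize));
  }
  return chunks;
}

const isHighSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Encode one code unit in the range 0x0800..0xFFFF as a 3-byte sequence,
// which is what happens to a surrogate that is encoded on its own.
function encodeLoneUnit(unit) {
  return [0xe0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3f), 0x80 | (unit & 0x3f)];
}

const chunks = chunkByCodeUnits(text, BLOCK_SIZE);
const high = chunks[0].charCodeAt(chunks[0].length - 1); // 0xD83D ends chunk 0
const low = chunks[1].charCodeAt(0);                     // 0xDE00 starts chunk 1

// Per-chunk encoding yields ED A0 BD ED B8 80 (six bytes, CESU-8 style).
const perChunkBytes = [...encodeLoneUnit(high), ...encodeLoneUnit(low)];

// Encoding the intact pair yields F0 9F 98 80 (four bytes, valid UTF-8).
const wholeBytes = Array.from(new TextEncoder().encode("\u{1F600}"));

console.log(isHighSurrogate(high), isLowSurrogate(low), perChunkBytes, wholeBytes);
```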

Root Cause

JSZip's `DataWorker` uses:
```javascript
DEFAULT_BLOCK_SIZE = 16 * 1024
data.substring(index, nextIndex)
```

The `Utf8EncodeWorker` then processes each chunk independently, without handling split surrogates. I've opened a fix PR to JSZip: Stuk/jszip#963
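
For context, one possible way a chunker could avoid splitting a pair is to extend a chunk by one code unit whenever it would end on a high surrogate. The sketch below is an assumption made for illustration: `chunkWithoutSplittingPairs()` is a hypothetical function and is not taken from JSZip or from Stuk/jszip#963, which may take a different approach.

```javascript
// Hypothetical surrogate-aware chunker (not JSZip code).
// blockSize is assumed to be a positive integer.
function chunkWithoutSplittingPairs(input, blockSize) {
  const chunks = [];
  let index = 0;
  while (index < input.length) {
    let nextIndex = Math.min(index + blockSize, input.length);
    const lastUnit = input.charCodeAt(nextIndex - 1);
    const endsOnHighSurrogate = lastUnit >= 0xd800 && lastUnit <= 0xdbff;
    // If the chunk would end on a high surrogate and more input follows,
    // pull the next code unit (the low surrogate) into this chunk.
    if (endsOnHighSurrogate && nextIndex < input.length) {
      nextIndex += 1;
    }
    chunks.push(input.substring(index, nextIndex));
    index = nextIndex;
  }
  return chunks;
}

// With a block size of 4, the emoji stays whole in the first chunk.
const safeChunks = chunkWithoutSplittingPairs("abc\u{1F600}def", 4);
console.log(safeChunks);
```

Chunks produced this way can be slightly longer than the block size, which is harmless for encoding purposes because each chunk then contains only complete pairs.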

Solution

This PR works around the issue by pre-encoding strings to UTF-8 using `TextEncoder` before passing them to `zip.file()`. Since we pass a `Uint8Array` instead of a string, JSZip skips the problematic `Utf8EncodeWorker` entirely.
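
A minimal sketch of the idea, assuming the helper is a thin wrapper around `TextEncoder` (the body shown here is illustrative and may differ from the code in this PR; the `word/document.xml` path in the comment is only an example):

```javascript
const encoder = new TextEncoder();

// Illustrative shape of the helper: string in, UTF-8 bytes out.
function encodeUtf8(text) {
  return encoder.encode(text);
}

// Place an emoji right after 16 * 1024 filler characters so that it sits
// near a 16 KB block boundary in the string form.
const xml = "<w:t>" + "a".repeat(16 * 1024) + "\u{1F600}</w:t>";
const bytes = encodeUtf8(xml);

// The whole string is encoded in one pass, so the surrogate pair becomes a
// single 4-byte UTF-8 sequence regardless of where it sits in the string.
// A fatal decoder throws on invalid UTF-8, so a clean round trip confirms it.
const roundTrip = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
console.log(bytes instanceof Uint8Array, roundTrip === xml);

// In the packer, the call would then look roughly like:
//   zip.file("word/document.xml", encodeUtf8(xmlString));
```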

Changes

  • Added `encodeUtf8()` helper function in `next-compiler.ts` and `from-docx.ts`
  • Modified `zip.file()` calls to use pre-encoded UTF-8 bytes for XML content

Testing

  • Build succeeds
  • TypeScript compiles without errors
  • This is a minimal, low-risk change that ensures UTF-8 encoding happens correctly before JSZip processing

Copilot AI review requested due to automatic review settings January 29, 2026 10:50

Copilot AI left a comment

Pull request overview

This PR addresses a bug in JSZip where string chunking can split UTF-16 surrogate pairs when processing large XML content (> 16KB), resulting in invalid CESU-8 encoding instead of proper UTF-8. The fix pre-encodes XML strings to UTF-8 using TextEncoder before passing them to JSZip, bypassing the problematic chunking behavior.

Changes:

  • Added encodeUtf8() helper function to pre-encode strings to UTF-8 bytes
  • Modified all zip.file() calls for XML content to use the pre-encoded UTF-8 bytes
  • Binary content (images, fonts) continues to be passed directly without encoding

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/patcher/from-docx.ts | Added encodeUtf8() helper and applied it to XML content in the patching workflow |
| src/export/packer/next-compiler.ts | Added encodeUtf8() helper and applied it to all XML file generation in the compiler |
