fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip by Yuof · Pull Request #3329 · dolanmiu/docx

Yuof · 2026-01-29T10:50:35Z

Problem

When JSZip processes large XML content (> 16KB), it chunks strings using `substring()`, which operates on UTF-16 code units. This can split surrogate pairs for characters above U+FFFF (like emoji or Material Design Icons).

Each half of a split pair then gets encoded as its own 3-byte sequence, producing CESU-8-style bytes that are not valid UTF-8. This corrupts the docx file and causes XML parsing errors.
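To make the byte-level effect concrete, here is a minimal sketch. `encodeChunk` is a hypothetical stand-in for a per-chunk encoder, not JSZip's actual code: it combines a surrogate pair when both halves are in the same chunk and otherwise emits a lone surrogate as 3 bytes.

```typescript
// Hypothetical per-chunk encoder (an illustration, not JSZip's implementation).
function encodeChunk(chunk: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < chunk.length; i++) {
    let c = chunk.charCodeAt(i);
    // Combine a high surrogate with its low surrogate only if both are in this chunk.
    if ((c & 0xfc00) === 0xd800 && i + 1 < chunk.length) {
      const c2 = chunk.charCodeAt(i + 1);
      if ((c2 & 0xfc00) === 0xdc00) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      // A lone surrogate lands here and becomes a 3-byte sequence.
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      bytes.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    }
  }
  return bytes;
}

// "a" followed by U+1F600, which is stored as the surrogate pair D83D DE00.
const sample = "a\uD83D\uDE00";

// Encoded in one piece: 5 bytes (61 F0 9F 98 80), valid UTF-8.
const encodedWhole = encodeChunk(sample);

// Encoded after a substring() split between the two surrogates:
// 7 bytes (61 ED A0 BD ED B8 80), which strict UTF-8 decoders reject.
const encodedSplit = encodeChunk(sample.substring(0, 2)).concat(encodeChunk(sample.substring(2)));
```

The split lands inside the pair only when a chunk boundary happens to fall there, which would explain why the corruption depends on document size and content.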

Root Cause

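JSZip's `DataWorker` uses:
```javascript
// Block size used when chunking; for string input it counts UTF-16 code units
DEFAULT_BLOCK_SIZE = 16 * 1024
// Slices by code-unit index, so a chunk boundary can fall inside a surrogate pair
data.substring(index, nextIndex)
```

The `Utf8EncodeWorker` then processes each chunk independently, without handling split surrogates. I've opened a fix PR to JSZip: Stuk/jszip#963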
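For illustration, one way a chunker can avoid the split is to extend a chunk by one code unit whenever it would end on a high surrogate. This is only a sketch under that assumption; it is not necessarily the approach taken in Stuk/jszip#963, and the function name is hypothetical.

```typescript
// Hypothetical surrogate-aware chunker (not JSZip's code).
function chunkWithoutSplittingSurrogates(data: string, blockSize: number): string[] {
  const chunks: string[] = [];
  let index = 0;
  while (index < data.length) {
    let nextIndex = Math.min(index + blockSize, data.length);
    if (nextIndex < data.length) {
      const last = data.charCodeAt(nextIndex - 1);
      // If the chunk would end on a high surrogate, pull in the next code unit too.
      if (last >= 0xd800 && last <= 0xdbff) {
        nextIndex += 1;
      }
    }
    chunks.push(data.substring(index, nextIndex));
    index = nextIndex;
  }
  return chunks;
}
```

With a block size of 3, the string `"ab\uD83D\uDE00cd"` yields one 4-unit chunk holding the whole pair, followed by `"cd"`, so each chunk can be encoded independently.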

Solution

This PR works around the issue by pre-encoding strings to UTF-8 using `TextEncoder` before passing them to `zip.file()`. Since we pass `Uint8Array` instead of strings, JSZip skips the problematic `Utf8EncodeWorker` entirely.
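A minimal sketch of the workaround follows. The helper name `encodeUtf8` comes from this PR, but the body shown here is an assumption rather than the exact code in the diff; `ZipLike` and `addXml` are hypothetical names used only to keep the example self-contained.

```typescript
// Hypothetical minimal shape of the zip object. JSZip's file() accepts
// Uint8Array data, which is the only overload this sketch relies on.
interface ZipLike {
  file(path: string, data: Uint8Array): unknown;
}

// Encode the whole string in one pass, so no surrogate pair is ever
// seen in isolation by a chunk-by-chunk encoder.
const encodeUtf8 = (xml: string): Uint8Array => new TextEncoder().encode(xml);

// Hand the zip library bytes instead of a string.
function addXml(zip: ZipLike, path: string, xml: string): void {
  zip.file(path, encodeUtf8(xml));
}
```

Binary content such as images and fonts is already bytes and would be passed through unchanged, so only the XML parts need this step.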

Changes

Added `encodeUtf8()` helper function in `next-compiler.ts` and `from-docx.ts`
Modified `zip.file()` calls to use pre-encoded UTF-8 bytes for XML content

Testing

Build succeeds
TypeScript compiles without errors
This is a minimal, low-risk change that ensures UTF-8 encoding happens correctly before JSZip processing

Copilot

Pull request overview

This PR addresses a bug in JSZip where string chunking can split UTF-16 surrogate pairs when processing large XML content (> 16KB), resulting in invalid CESU-8 encoding instead of proper UTF-8. The fix pre-encodes XML strings to UTF-8 using TextEncoder before passing them to JSZip, bypassing the problematic chunking behavior.

Changes:

Added encodeUtf8() helper function to pre-encode strings to UTF-8 bytes
Modified all zip.file() calls for XML content to use the pre-encoded UTF-8 bytes
Binary content (images, fonts) continues to be passed directly without encoding

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/patcher/from-docx.ts | Added `encodeUtf8()` helper and applied it to XML content in the patching workflow |
| src/export/packer/next-compiler.ts | Added `encodeUtf8()` helper and applied it to all XML file generation in the compiler |


fix: pre-encode XML strings to UTF-8 to avoid JSZip surrogate split bug

0a800fd

Copilot AI review requested due to automatic review settings January 29, 2026 10:50

Copilot started reviewing on behalf of Yuof January 29, 2026 10:50

Copilot AI reviewed Jan 29, 2026


Address review feedback: extract encodeUtf8 to shared utility and add…

29329a7

… tests

Yuof wants to merge 2 commits into dolanmiu:master from Yuof:fix/utf8-encode-before-zip
