
Fix unexpected document end when importing drops with large XLIFF files #1049

Open
wadimw wants to merge 2 commits into upstream-patched from large-drop-import-fix


@wadimw wadimw commented Feb 3, 2026

This PR fixes the following error:

com.box.l10n.mojito.service.drop.importdropexception: com.ctc.wstx.exc.wstxeofexception: unexpected eof; was expecting a close tag for element <note>
 at [row,col,system-id]: [128,206,"/some/file/path/to/be/read/from/db"]

This occurred when importing a drop containing XLIFF files larger than 8 KiB. In the UI, it appeared as "Import Failed" on the Project Requests page and resulted in a partial import (translations for strings before the 8 KiB mark were imported correctly).


Full stack trace:
2026-01-29t07:42:26.609-08:00 debug 97 --- [ pollabletask-5] c.b.l.mojito.service.drop.dropservice    : error when importing file, keep importing other files

com.box.l10n.mojito.service.drop.importdropexception: com.ctc.wstx.exc.wstxeofexception: unexpected eof; was expecting a close tag for element <note>
 at [row,col,system-id]: [128,206,"/some/file/path/to/be/read/from/db"]
	at com.box.l10n.mojito.service.drop.dropservice.updatetmwithlocalizedxliff_aroundbody10(dropservice.java:341)
	at com.box.l10n.mojito.service.drop.dropservice$ajcclosure11.run(dropservice.java:1)
	at org.aspectj.runtime.reflect.joinpointimpl.proceed(joinpointimpl.java:270)
	at com.box.l10n.mojito.service.pollabletask.pollablecallable.call(pollablecallable.java:50)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.syncexecute(pollableaspect.java:111)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.ajc$inlineaccessmethod$com_box_l10n_mojito_service_pollabletask_pollableaspect$com_box_l10n_mojito_service_pollabletask_pollableaspect$syncexecute(pollableaspect.java:1)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.createpollablewrapper(pollableaspect.java:79)
	at com.box.l10n.mojito.service.drop.dropservice.updatetmwithlocalizedxliff(dropservice.java:331)
	at com.box.l10n.mojito.service.drop.dropservice.importfile_aroundbody8(dropservice.java:313)
	at com.box.l10n.mojito.service.drop.dropservice$ajcclosure9.run(dropservice.java:1)
	at org.aspectj.runtime.reflect.joinpointimpl.proceed(joinpointimpl.java:270)
	at com.box.l10n.mojito.service.pollabletask.pollablecallable.call(pollablecallable.java:50)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.syncexecute(pollableaspect.java:111)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.ajc$inlineaccessmethod$com_box_l10n_mojito_service_pollabletask_pollableaspect$com_box_l10n_mojito_service_pollabletask_pollableaspect$syncexecute(pollableaspect.java:1)
	at com.box.l10n.mojito.service.pollabletask.pollableaspect.createpollablewrapper(pollableaspect.java:79)
	at com.box.l10n.mojito.service.drop.dropservice.importfile(dropservice.java:301)
	at com.box.l10n.mojito.service.drop.dropservice.importdrop_aroundbody6(dropservice.java:256)
	at com.box.l10n.mojito.service.drop.dropservice$ajcclosure7.run(dropservice.java:1)
	at org.aspectj.runtime.reflect.joinpointimpl.proceed(joinpointimpl.java:270)
	at com.box.l10n.mojito.service.pollabletask.pollablecallable.call(pollablecallable.java:50)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at org.springframework.security.concurrent.delegatingsecuritycontextrunnable.run(delegatingsecuritycontextrunnable.java:94)
	at java.base/java.util.concurrent.executors$runnableadapter.call(executors.java:572)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at java.base/java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1144)
	at java.base/java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:642)
	at java.base/java.lang.thread.run(thread.java:1583)
caused by: net.sf.okapi.common.exceptions.okapiioexception: com.ctc.wstx.exc.wstxeofexception: unexpected eof; was expecting a close tag for element <note>
 at [row,col,system-id]: [128,206,"/some/file/path/to/be/read/from/db"]
	at net.sf.okapi.filters.xliff.xlifffilter.processnote(xlifffilter.java:3671)
	at net.sf.okapi.filters.xliff.xlifffilter.processtransunit(xlifffilter.java:1688)
	at net.sf.okapi.filters.xliff.xlifffilter.read(xlifffilter.java:643)
	at net.sf.okapi.filters.xliff.xlifffilter.next(xlifffilter.java:360)
	at net.sf.okapi.steps.common.rawdocumenttofiltereventsstep.handleevent(rawdocumenttofiltereventsstep.java:166)
	at net.sf.okapi.common.pipeline.pipeline.execute(pipeline.java:117)
	at net.sf.okapi.common.pipeline.pipeline.process(pipeline.java:227)
	at net.sf.okapi.common.pipeline.pipeline.process(pipeline.java:199)
	at net.sf.okapi.common.pipelinedriver.pipelinedriver.processbatch(pipelinedriver.java:182)
	at com.box.l10n.mojito.service.tm.tmservice.updatetmwithxliff(tmservice.java:921)
	at com.box.l10n.mojito.service.tm.tmservice.updatetmwithtranslationkitxliff(tmservice.java:840)
	at com.box.l10n.mojito.service.drop.dropservice.updatetmwithlocalizedxliff_aroundbody10(dropservice.java:338)
	... 28 common frames omitted
caused by: com.ctc.wstx.exc.wstxeofexception: unexpected eof; was expecting a close tag for element <note>
 at [row,col,system-id]: [128,206,"/some/file/path/to/be/read/from/db"]
	at com.ctc.wstx.sr.streamscanner.throwunexpectedeof(streamscanner.java:701)
	at com.ctc.wstx.sr.basicstreamreader.throwunexpectedeof(basicstreamreader.java:5612)
	at com.ctc.wstx.sr.basicstreamreader.nextfromtree(basicstreamreader.java:2811)
	at com.ctc.wstx.sr.basicstreamreader.next(basicstreamreader.java:1122)
	at net.sf.okapi.filters.xliff.xlifffilter.processnote(xlifffilter.java:3642)
	... 39 common frames omitted

2026-01-29t07:42:26.610-08:00 debug 97 --- [ pollabletask-5] c.b.l.m.boxsdk.boxapiconnectionprovider  : getting box api connection
2026-01-29t07:42:27.110-08:00 debug 97 --- [ pollabletask-5] c.b.l.m.s.p.pollabletaskexceptionutils   : error happened during task execution

com.box.l10n.mojito.service.drop.importdropexception: number of files not imported: 1, check sub task for more information
	at com.box.l10n.mojito.service.drop.dropservice.importdrop_aroundbody6(dropservice.java:271)
	at com.box.l10n.mojito.service.drop.dropservice$ajcclosure7.run(dropservice.java:1)
	at org.aspectj.runtime.reflect.joinpointimpl.proceed(joinpointimpl.java:270)
	at com.box.l10n.mojito.service.pollabletask.pollablecallable.call(pollablecallable.java:50)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at org.springframework.security.concurrent.delegatingsecuritycontextrunnable.run(delegatingsecuritycontextrunnable.java:94)
	at java.base/java.util.concurrent.executors$runnableadapter.call(executors.java:572)
	at java.base/java.util.concurrent.futuretask.run(futuretask.java:317)
	at java.base/java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1144)
	at java.base/java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:642)
	at java.base/java.lang.thread.run(thread.java:1583)

Note that the logged error position [128,206] corresponds to exactly 8192 characters into the file. Additionally, this happens regardless of the selected DropExporter (i.e. it is not caused by the Box SDK failing to provide document content).

The root cause was one step in the Okapi pipeline (IntegrityCheckStep) advancing the underlying document stream to the end while another step (RawDocumentToFilterEventsStep with XLIFFFilter) was in the middle of parsing it. This issue was introduced in #731, which changed how the content of the whole document is retrieved, from RawDocument#getCharSequence to RawDocument#getReader. According to the findings described in #1049 (comment), it seems the reader/stream may only be accessed within BasePipelineStep#handleRawDocument, not later (in any filter event handlers).

The fix is covered by the new test DropServiceTest#forTranslationLargeXliffFile.

@wadimw wadimw changed the base branch from upstream-patched to refactor-drop-service-test-data February 3, 2026 14:42

wadimw commented Feb 3, 2026

Notes on the go: when testing the flow of TMService#updateTMWithXliff in isolation, removing the IntegrityCheckStep allows the test to handle large documents correctly. That step breaks things because it calls CharStreams.toString(rawDocument#getReader), which somehow has side effects on the underlying stream that is created on the fly in the RawDocument constructor. This was introduced in #731 and probably was never noticed because, per the Guava documentation, CharStreams#toString does not close the reader when it's done with it.

If we revert to a CharSequence-based document, large files work fine, but the encoding-related tests break. In theory we could use RawDocument#setEncoding to change the encoding, but in practice this does not work: the method is a no-op for documents created from a CharSequence and keeps the default UTF-16, so it's impossible to force the UTF-8 encoding that we want.

So, a hacky way to fix this is to update the encoding through reflection; subsequent processing then produces UTF-8 output streams. This might be fine, since we're already using reflection to force-set the URI on the RawDocument anyway. However, it goes against the Okapi RawDocument contract, which says you can have only ONE of a CharSequence, a stream, or a URI, and it may introduce unexpected runtime behaviour, so we should probably avoid adding more hacky reflection.

The funny part is that this solution (i.e. reverting to a CharSequence-backed document and then force-changing the encoding) actually works: https://github.com/box/mojito/actions/runs/21634652420/job/62356567792

@wadimw wadimw added the upstream-patched Experimental features ported from legacy branch label Feb 3, 2026

wadimw commented Feb 3, 2026

Possibly the same issue as #1021

Base automatically changed from refactor-drop-service-test-data to upstream-patched February 4, 2026 11:26
@wadimw wadimw force-pushed the large-drop-import-fix branch from 5b00af0 to 211a757 Compare February 5, 2026 15:53
@wadimw wadimw force-pushed the large-drop-import-fix branch from 211a757 to b25a3d0 Compare February 5, 2026 15:55

wadimw commented Feb 5, 2026

New findings:

It seems we can get rid of the workaround that injects the RawDocument URI through reflection if we instead manually set the document name on the START_DOCUMENT event to avoid the NPE from QualityChecker. This can be done by overriding QualityCheckStep#handleStartDocument, because this event exposes a public method, StartDocument#setName, so it's a bit less hacky than reflection. Additionally, this makes the workaround more contained, since it's implemented on the particular class that requires it. I previously thought this URI change might even be the root cause of this issue, but it seems that's not the case - more on that below. This means, however, that removing the URI injection is merely a refactor, and so it's outside the scope of this issue. We may still want to include it here, or not. (Note: after discussing with @ehoogerbeets we've decided to move the URI refactor into a separate PR #1052)

Now, onto the actual issue. The way a stream-based RawDocument works is that it provides the same cached InputStream instance every time RawDocument#getStream or RawDocument#getReader is called - it's just that these methods implicitly reopen that stream (if it has been closed elsewhere) and reset it to the beginning. So, if we retrieve this stream anywhere, we must be aware that this introduces side effects which may impact any other piece of code that also refers to it. The RawDocumentToFilterEventsStep with XLIFFFilter uses that stream within the XML parser of the XLIFFFilter. That means that if the parser requests the stream before our IntegrityCheckStep#handleStartDocument does, our request (and the subsequent read using CharStreams.toString) puts the XLIFFFilter in a weird state, because it might have been in the middle of the document and we've suddenly advanced the stream to the end. So basically, it looks like we shouldn't have two steps in a single pipeline that use the stream at the same time.
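To make the side effect concrete, here is a minimal stdlib-only sketch of the behaviour described above. `CachedStreamDocument` is a hypothetical stand-in written for illustration, not Okapi's `RawDocument`; it only mimics the "same cached instance, rewound on every access" part, and the sample content is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a stream-backed document (not Okapi's RawDocument). */
final class CachedStreamDocument {
    private final ByteArrayInputStream cached;

    CachedStreamDocument(String content) {
        this.cached = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    /** Hands out the same instance on every call, rewound to the beginning. */
    InputStream getStream() {
        cached.reset(); // no mark was set, so this rewinds to offset 0
        return cached;
    }
}

final class SharedStreamSketch {
    /** Reads up to n bytes (fewer at EOF) and decodes them as UTF-8 text. */
    static String readText(InputStream in, int n) throws IOException {
        return new String(in.readNBytes(n), StandardCharsets.UTF_8);
    }

    /** Plays a "parser" and a "checker" against the same document and records what the parser sees. */
    static List<String> demo() {
        try {
            List<String> parserSaw = new ArrayList<>();
            CachedStreamDocument doc = new CachedStreamDocument("<xliff><note>hello</note></xliff>");

            InputStream parser = doc.getStream();
            parserSaw.add(readText(parser, 7));

            // A second consumer asks for the stream: same object, silently rewound.
            InputStream checker = doc.getStream();
            parserSaw.add(parser == checker ? "same instance" : "different instance");
            parserSaw.add(readText(parser, 7)); // start of file again, not "<note>h"

            // If the second consumer reads to the end, the parser hits EOF.
            checker.readAllBytes();
            parserSaw.add(readText(parser, 7));
            return parserSaw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [<xliff>, same instance, <xliff>, ]
    }
}
```

The second `<xliff>` corresponds to the "unexpected character <" case (the stream was rewound under the parser), and the final empty string corresponds to the unexpected EOF case (the stream was drained under the parser).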

Note that this behaviour didn't surface earlier because CharStreams#toString does not implicitly close the stream, so there was no obvious way to notice that the stream had been tampered with. Additionally, it seems that the XML parser of XLIFFFilter caches some of the read data in memory, which may explain why nothing went wrong for small files. I also now realize that this must mean that the IntegrityCheckStep never really got the WHOLE document, since the XLIFFFilter would probably read ahead beforehand <- not true, because RawDocument#getReader implicitly resets the stream to the beginning.

To confirm these findings, I first tried adding reader.close(), and then another rawDocument.getReader() immediately after closing, to verify that the side effects actually impact XLIFFFilter behaviour. The findings match: closing the stream causes the XLIFFFilter parser to throw when it tries to read more data than it has cached beforehand, while re-requesting the reader causes the XLIFFFilter parser to throw due to an unexpected character < (i.e. the underlying shared stream has been reset to the beginning of the file). The latter also confirms the behaviour we ran into originally: the unclosed stream, already read to the end by our step, would report EOF to the XLIFFFilter parser even though the parser had buffered only 8k of the input document content.
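A small stdlib-only sketch of why buffering would hide the problem for small files and cut large ones off at the 8k mark. It assumes the parser behaves like a plain `BufferedInputStream` with an 8192-byte buffer; the real Woodstox buffering is not reproduced here, and `PARSER_BUFFER` and `bytesSeenByParser` are names made up for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BufferMaskSketch {
    /** Assumed parser buffer size, chosen to match the 8k observed in the logs. */
    static final int PARSER_BUFFER = 8192;

    /**
     * Returns how many bytes a buffered "parser" manages to read when another
     * consumer drains the shared underlying stream right after the parser's first read.
     */
    static int bytesSeenByParser(int fileSize) {
        ByteArrayInputStream shared = new ByteArrayInputStream(new byte[fileSize]);
        BufferedInputStream parser = new BufferedInputStream(shared, PARSER_BUFFER);
        try {
            int seen = 0;
            if (parser.read() >= 0) seen++;     // first read fills the parser's buffer
            shared.readAllBytes();              // another consumer reads the shared stream to the end
            while (parser.read() >= 0) seen++;  // parser can only serve what it already buffered
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesSeenByParser(5_000));  // 5000: fits in the buffer, nothing is lost
        System.out.println(bytesSeenByParser(20_000)); // 8192: truncated at the buffer boundary
    }
}
```

Under these assumptions, any file up to 8192 bytes is read completely from the buffer, while a larger file ends abruptly at byte 8192, which is consistent with the error position reported in the stack trace.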

Using a debugger, I discovered that the XLIFFFilter actually calls RawDocument#getStream multiple times, so it seems that the reopen-reset behaviour is at least somewhat expected in the overall design of the Okapi pipeline. More importantly though, I noticed that this is only called during RawDocumentToFilterEventsStep#handleRawDocument, as opposed to our IntegrityCheckStep#handleStartDocument. This suggests that it's actually fine to retrieve (and implicitly reopen-reset) the document stream multiple times - but only at the beginning of the pipeline, before RawDocumentToFilterEventsStep with XLIFFFilter starts reading step by step and emitting text unit events.

The idea then is to move the CharStreams#toString call to an earlier moment within the pipeline, where apparently it is fine to mess with the stream. So, I moved this code to IntegrityCheckStep#handleRawDocument. This caused it to never run, because it turned out that the RawDocumentToFilterEventsStep suppresses the RAW_DOCUMENT event - which I assume is exactly why this code was initially placed in the START_DOCUMENT handler. Note again that this placement was not a problem initially, because before #731 the RawDocument was based on a CharSequence, so the integrity check used RawDocument#getCharSequence rather than the reader and thus didn't touch the stream at all. To get IntegrityCheckStep#handleRawDocument to execute, I had to move IntegrityCheckStep before the RawDocumentToFilterEventsStep. But then I ran into the other side of the same issue: the checks implemented in IntegrityCheckStep#handleTextUnit didn't run, because the TEXT_UNIT events had not been produced yet.

So, eventually I decided to split the integrity checks into two pipeline steps - one for document-level checks that read the whole file content from its stream while that is still allowed (i.e. before RawDocumentToFilterEventsStep), the other for text-unit-level checks (after RawDocumentToFilterEventsStep, because it relies on the events emitted by it).
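The ordering constraint behind this split can be sketched with a toy pipeline. Everything below is hypothetical and stdlib-only: `Step`, `RecordingStep` and `FilterStep` are illustration names rather than Okapi types, and the pipeline is simplified to process events in batches per step, whereas a real pipeline passes events through one at a time.

```java
import java.util.ArrayList;
import java.util.List;

final class PipelineOrderingSketch {
    enum EventType { RAW_DOCUMENT, START_DOCUMENT, TEXT_UNIT, END_DOCUMENT }

    /** A step maps one incoming event to zero or more outgoing events. */
    interface Step {
        List<EventType> handle(EventType event);
    }

    /** Records which events reached it, then passes them through unchanged. */
    static final class RecordingStep implements Step {
        final List<EventType> seen = new ArrayList<>();

        @Override
        public List<EventType> handle(EventType event) {
            seen.add(event);
            return List.of(event);
        }
    }

    /** Stand-in for the filter step: swallows RAW_DOCUMENT and emits filter events instead. */
    static final class FilterStep implements Step {
        @Override
        public List<EventType> handle(EventType event) {
            if (event == EventType.RAW_DOCUMENT) {
                return List.of(EventType.START_DOCUMENT, EventType.TEXT_UNIT,
                        EventType.TEXT_UNIT, EventType.END_DOCUMENT);
            }
            return List.of(event);
        }
    }

    /** Feeds the initial event through the steps in order (batch-per-step simplification). */
    static void run(List<Step> steps, EventType initial) {
        List<EventType> current = List.of(initial);
        for (Step step : steps) {
            List<EventType> next = new ArrayList<>();
            for (EventType event : current) {
                next.addAll(step.handle(event));
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        RecordingStep documentLevelCheck = new RecordingStep(); // placed before the filter
        RecordingStep textUnitLevelCheck = new RecordingStep(); // placed after the filter
        run(List.of(documentLevelCheck, new FilterStep(), textUnitLevelCheck), EventType.RAW_DOCUMENT);

        System.out.println(documentLevelCheck.seen); // [RAW_DOCUMENT]
        System.out.println(textUnitLevelCheck.seen); // [START_DOCUMENT, TEXT_UNIT, TEXT_UNIT, END_DOCUMENT]
    }
}
```

A step placed before the filter only ever sees RAW_DOCUMENT, and a step placed after it never does, so a single step cannot handle both the raw document and the text units; hence the two separate steps.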

EDIT: IT WOOOOOORKS WOOHOOOOOOOO https://github.com/box/mojito/actions/runs/21718444824/job/62640989530?pr=1049

@wadimw wadimw requested a review from ehoogerbeets February 5, 2026 16:25
@wadimw wadimw force-pushed the large-drop-import-fix branch from b25a3d0 to e0890a4 Compare February 6, 2026 11:56
@wadimw wadimw changed the title Large drop import fix Fix unexpected document end when importing drops with large XLIFF files Feb 6, 2026
@wadimw wadimw marked this pull request as ready for review February 6, 2026 13:40