Refactor workaround for Okapi in-memory document handling #1052

Draft
wadimw wants to merge 2 commits into upstream-patched from okapi-document-uri-refactor

wadimw commented Feb 6, 2026

Split off from #1049

This PR removes the reflection-based URI injection into RawDocument, which is potentially fragile. It replaces that injection with an override of the QualityCheckStep START_DOCUMENT event handler, which injects the expected value directly into the StartDocument resource through its public method StartDocument#setName.

Additionally, it slightly refactors the empty RawDocument approach to make it more explicit that the content of this document is not relevant: it is only used to "drive" the Okapi pipeline and to provide Locale configuration.


Rationale for the URI refactor:

Forcibly setting a URI on a RawDocument backed by an InputStream or a CharSequence breaks its contract as specified here:

The RawDocument object has one (and only one) of three input objects: a CharSequence, a URI, or an InputStream.

Even RawDocument's own internal logic depends on whether the URI field is set (e.g. here), so this workaround is potentially fragile and might introduce unexpected behaviour.
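To illustrate the concern above, here is a minimal, self-contained sketch of how a reflective write can silently change what a class does internally. `InMemoryDocument`, its `inputURI` field and `describeSource` are hypothetical stand-ins invented for this example; they are not Okapi's RawDocument, whose actual fields and internals may differ.

```java
import java.lang.reflect.Field;
import java.net.URI;

public class RawDocumentContractSketch {

    /** Hypothetical stand-in for a document backed by exactly one input. NOT the Okapi class. */
    static class InMemoryDocument {
        private final CharSequence text;
        private URI inputURI; // stays null for a CharSequence-backed document

        InMemoryDocument(CharSequence text) {
            this.text = text;
        }

        /** Internal logic that branches on whether the URI field is set. */
        String describeSource() {
            return inputURI != null ? "uri:" + inputURI : "chars:" + text.length();
        }
    }

    /** The kind of reflective write this PR removes (checked exceptions wrapped for brevity). */
    static void forceUri(InMemoryDocument doc, URI uri) {
        try {
            Field field = InMemoryDocument.class.getDeclaredField("inputURI");
            field.setAccessible(true);
            field.set(doc, uri);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InMemoryDocument doc = new InMemoryDocument("hello");
        System.out.println(doc.describeSource()); // CharSequence branch

        forceUri(doc, URI.create("file:///virtual.txt"));
        System.out.println(doc.describeSource()); // now takes the URI branch
    }
}
```

After the reflective write, the object holds two inputs at once and its internal branch flips, even though no public API was called.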

After removal of the URI injection from Mojito code, the only thing that seems to break is QualityCheckStep. Specifically, it throws a NullPointerException on the following line:

@Override
public void processStartDocument (StartDocument sd,
	List<String> sigList)
{
	currentDocId = (new File(sd.getName())).toURI();
// ...

This can be avoided by adding an override to our existing QualityCheckStep subclass that sets this name if it is null. This way, the workaround is localized to the only class that actually needs it, rather than affecting every step of every Okapi pipeline that relies on RawDocument.
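The override described above can be sketched roughly as follows. To keep the example runnable without Okapi on the classpath, `StartDocument` and `QualityCheckStep` here are simplified stand-ins that model only `getName`/`setName` and the line quoted above. The handler name `handleStartDocument`, its signature, the subclass name and the placeholder value `in-memory-document` are all assumptions made for illustration, not the code in this PR.

```java
import java.io.File;
import java.net.URI;

public class StartDocumentNameSketch {

    /** Simplified stand-in for Okapi's StartDocument: only the name accessors are modelled. */
    static class StartDocument {
        private String name;

        String getName() {
            return name;
        }

        void setName(String name) {
            this.name = name;
        }
    }

    /** Simplified stand-in for the upstream step; mirrors the line that throws. */
    static class QualityCheckStep {
        URI currentDocId;

        void handleStartDocument(StartDocument sd) {
            // new File((String) null) throws NullPointerException
            currentDocId = new File(sd.getName()).toURI();
        }
    }

    /** Hypothetical subclass: fills in a name only when none is present. */
    static class PatchedQualityCheckStep extends QualityCheckStep {
        static final String PLACEHOLDER_NAME = "in-memory-document"; // assumed value

        @Override
        void handleStartDocument(StartDocument sd) {
            if (sd.getName() == null) {
                sd.setName(PLACEHOLDER_NAME); // public setter, no reflection needed
            }
            super.handleStartDocument(sd);
        }
    }

    public static void main(String[] args) {
        StartDocument sd = new StartDocument();
        PatchedQualityCheckStep step = new PatchedQualityCheckStep();
        step.handleStartDocument(sd);
        System.out.println(sd.getName());
        System.out.println(step.currentDocId);
    }
}
```

Because the name is set inside the subclass just before delegating to the parent handler, other pipeline steps never see a modified RawDocument, and a name that is already present is left untouched.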

Forced URI injection into RawDocument through reflection breaks its contract (which states that a RawDocument can be backed by only one of URI, CharSequence or InputStream at a time) and impacts the way it handles content at runtime.

Since this workaround is currently only needed for Okapi's QualityCheckStep, this commit minimizes its impact. The required property (StartDocument#name) is injected within Mojito's existing QualityCheckStep subclass, so that only the StartDocument event is affected. Additionally, this approach does not require reflection, because the StartDocument resource provides a public method to set its name.

wadimw added the upstream-patched label (Experimental features ported from legacy branch) Feb 6, 2026
wadimw changed the title from "Okapi document uri refactor" to "Refactor workaround for Okapi in-memory document handling" Feb 6, 2026
wadimw commented Feb 6, 2026

TODO should probably also get rid of

in this PR
