[Bug]: DocumentBlock incorrectly coerces empty string fields to None #20462

@Alioth99

Bug Description

Similar to the behavior previously fixed in #19302, the DocumentBlock class treats empty strings ("") as falsy for its optional fields (such as document_mimetype, url, and title) and converts them to None during initialization or validation. In the reproduction below, document_mimetype is the field that ends up as None, while url is preserved.

This behavior violates data integrity: if a user explicitly provides an empty string, the library should preserve it rather than replace it with None. This matters for serialization and downstream validation, where a str is expected.
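To illustrate the kind of check that could produce this, here is a minimal sketch. It is not the llama-index source: `guess_mimetype`, `resolve_truthy`, and `resolve_explicit` are hypothetical names, and the assumption is that the validator falls back to a guessed value using a truthiness test rather than an explicit `is None` test.

```python
from typing import Optional


def guess_mimetype(data: bytes) -> Optional[str]:
    """Hypothetical stand-in for a mimetype guesser; returns None when it cannot guess."""
    return "application/pdf" if data.startswith(b"%PDF") else None


def resolve_truthy(mimetype: Optional[str], data: bytes) -> Optional[str]:
    # Assumed buggy pattern: "" is falsy, so the explicit value is discarded
    # and the guess (possibly None) is used instead.
    return mimetype or guess_mimetype(data)


def resolve_explicit(mimetype: Optional[str], data: bytes) -> Optional[str]:
    # Proposed pattern: fall back to the guess only when the value is really missing.
    return mimetype if mimetype is not None else guess_mimetype(data)


if __name__ == "__main__":
    print(repr(resolve_truthy("", b"")))    # empty string is lost
    print(repr(resolve_explicit("", b"")))  # empty string is preserved
```

With an explicit `is None` check, a missing value (None) still gets the guessed default, but a deliberately supplied empty string passes through unchanged.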

Version

llama-index-core==0.14.12

Steps to Reproduce

from llama_index.core.llms import DocumentBlock

doc = DocumentBlock(
    data=b"",
    url="",
    title="",
    document_mimetype=""
)

doc.document_validation()

print(f"URL: {doc.url!r}") 
print(f"Mimetype: {doc.document_mimetype!r}")

assert doc.url == "", f"Expected empty string, but got {doc.url!r}"
assert doc.document_mimetype == "", f"Expected empty string, but got {doc.document_mimetype!r}"

Relevant Logs/Tracebacks

URL: ''
Mimetype: None
Traceback (most recent call last):
  line 16, in <module>
    assert doc.document_mimetype == "", f"Expected empty string, but got {doc.document_mimetype!r}"
AssertionError: Expected empty string, but got None

Labels

bug, triage
