
BioCJSONEncoder does not match to_dict methods in dataclasses #272

@AdrianDAlessandro

Describe the bug

Calling the to_dict method on some of the ac_bioc dataclasses gives a different result from serialising the same object with the BioCJSON class (whose output is determined by the BioCJSONEncoder).

To Reproduce

For example, serialising the BioCPassage dataclass returns a very different result:

>>> from autocorpus.ac_bioc import BioCJSON, BioCPassage
>>> p = BioCPassage()
>>> p
BioCPassage(text='', offset=0, infons={}, sentences=[], annotations=[], relations=[])
>>> p.to_dict() # Does not include "annotations" or "relations"
{'text': '', 'offset': 0, 'infons': {}, 'sentences': []}
>>> print(BioCJSON.dumps(p)) # Does not include "sentences"
{"offset": 0, "infons": {}, "text": "", "annotations": [], "relations": []}

Expected behavior

These two serialisation approaches should yield the same set of fields and values.

Suggested solution

I suggest changing the default method of the BioCJSONEncoder to just call the to_dict methods, and adjusting the to_dict methods to match the desired behaviour.
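A minimal sketch of that idea, using hypothetical stand-in classes (`Sentence`, `Passage` and `ToDictEncoder` are illustrative names, not the real ac_bioc classes or the real BioCJSONEncoder). The encoder's `default` only delegates to `to_dict`, so the two code paths cannot drift apart:

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sentence:  # hypothetical stand-in for BioCSentence
    text: str = ""
    offset: int = 0

    def to_dict(self) -> dict[str, Any]:
        return {"text": self.text, "offset": self.offset}


@dataclass
class Passage:  # hypothetical stand-in for BioCPassage
    text: str = ""
    offset: int = 0
    sentences: list[Sentence] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        return {
            "text": self.text,
            "offset": self.offset,
            "sentences": [s.to_dict() for s in self.sentences],
        }


class ToDictEncoder(json.JSONEncoder):  # hypothetical encoder, for illustration only
    def default(self, o: Any) -> Any:
        # Delegate to the object's own to_dict so there is a single source of truth.
        to_dict = getattr(o, "to_dict", None)
        if callable(to_dict):
            return to_dict()
        # Anything else falls through to the standard TypeError.
        return super().default(o)


p = Passage(sentences=[Sentence("hello", 2)])
# Round-trip through JSON and compare with the direct to_dict result.
assert json.loads(json.dumps(p, cls=ToDictEncoder)) == p.to_dict()
```

With this shape, any field added to or removed from `to_dict` is automatically reflected in the JSON output.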

If you just want to include every field automatically and never need to update the to_dict method, I suggest using the asdict function from the dataclasses module. This recursively unpacks nested dataclasses. For example:

>>> from autocorpus.ac_bioc import BioCPassage, BioCSentence
>>> p = BioCPassage(sentences=[BioCSentence("hello", 2)])
>>> p
BioCPassage(text='', offset=0, infons={}, sentences=[BioCSentence(text='hello', offset=2, infons={}, annotations=[], relations=[])], annotations=[], relations=[])
>>> p.to_dict() # Missing fields from both dataclasses
{'text': '', 'offset': 0, 'infons': {}, 'sentences': [{'text': 'hello', 'offset': 2, 'infons': {}, 'annotations': []}]}
>>> from dataclasses import asdict
>>> asdict(p) # All fields present and converts the nested "sentences" to a dict
{'text': '', 'offset': 0, 'infons': {}, 'sentences': [{'text': 'hello', 'offset': 2, 'infons': {}, 'annotations': [], 'relations': []}], 'annotations': [], 'relations': []}

Including this in a dataclass is as simple as:

from dataclasses import dataclass, asdict

@dataclass
class MyClass():
    field1: int

    to_dict = asdict
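As a rough illustration of how this could combine with a JSON encoder, here is a sketch with hypothetical classes (`Inner`, `Outer` and `AsDictEncoder` are made-up names, not part of Auto-CORPus). Because `to_dict` has no type annotation, the dataclass decorator does not treat it as a field, and `asdict` binds like any other method:

```python
import json
from dataclasses import asdict, dataclass, field, is_dataclass
from typing import Any


@dataclass
class Inner:  # hypothetical nested dataclass
    text: str = ""
    offset: int = 0

    to_dict = asdict  # no annotation, so this is a method, not a field


@dataclass
class Outer:  # hypothetical outer dataclass
    text: str = ""
    items: list[Inner] = field(default_factory=list)

    to_dict = asdict


class AsDictEncoder(json.JSONEncoder):  # hypothetical encoder, for illustration only
    def default(self, o: Any) -> Any:
        # asdict only accepts dataclass instances, not dataclass types.
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)
        return super().default(o)


o = Outer(items=[Inner("hello", 2)])
# Nested dataclasses are converted recursively.
assert o.to_dict() == {"text": "", "items": [{"text": "hello", "offset": 2}]}
# The encoder and to_dict agree by construction.
assert json.loads(json.dumps(o, cls=AsDictEncoder)) == o.to_dict()
```

One trade-off worth noting: `asdict` deep-copies values and always emits every field, so this approach suits the case where no field should ever be omitted or renamed in the output.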

Context

Please, complete the following to better understand the system you are using to run Auto-CORPus.

  • Operating system (eg. Windows 10): MacOS 14.7.6
  • Auto-CORPus version (eg. 1.0.0): Current main branch
  • Installation method (eg. pipx, pip, development mode): dev mode with poetry
  • Python version (you can get this running python --version): 3.13.1

Labels: bug (Something isn't working)