Silent deduplication of entities #459

@aricohen93

Hi @percevalw,
By default, get_spans deduplicates spans, so entities are silently lost when documents are written to disk.
I suggest adding a deduplicate argument to the converters, with a default value of False.

For example, here the get_spans call deduplicates the spans, so fewer entities than expected are written to disk:

spans = get_spans(doc, self.span_getter)
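To make the proposal concrete, here is a minimal sketch of what an opt-in deduplicate flag could look like. It is not the library's code: collect_spans is a hypothetical helper, and spans are modelled as plain (start, end, label) tuples rather than real Span objects.

```python
from itertools import chain


def collect_spans(span_groups, deduplicate=False):
    """Flatten several span groups into a single list.

    Hypothetical stand-in for a converter option. Spans are modelled as
    (start, end, label) tuples, not as real Span objects.
    """
    spans = list(chain.from_iterable(span_groups))
    if deduplicate:
        # dict.fromkeys keeps the first occurrence of each span, in order.
        spans = list(dict.fromkeys(spans))
    return spans


# The same span appears in two groups.
groups = [[(0, 2, "drug"), (5, 7, "dose")], [(0, 2, "drug")]]

kept = collect_spans(groups)                      # default: nothing is dropped
unique = collect_spans(groups, deduplicate=True)  # explicit opt-in
```

With the default of False, the writer would see every span, and deduplication becomes a visible choice instead of a side effect.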

Additionally, this line also drops duplicate spans:

for i, ent in enumerate(sorted(dict.fromkeys(spans)))

I suggest replacing it with:

for i, ent in enumerate(sorted(spans))
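The difference between the two loops can be reproduced in isolation. The snippet below is only an illustration: it uses (start, end, label) tuples as stand-ins for spans, on the assumption that two spans with the same boundaries and label compare and hash as equal.

```python
# Two of the three spans are identical.
spans = [(5, 7, "dose"), (0, 2, "drug"), (0, 2, "drug")]

# Current behaviour: dict.fromkeys collapses equal keys before sorting.
deduplicated = sorted(dict.fromkeys(spans))

# Suggested behaviour: sort only, keeping every span.
preserved = sorted(spans)

print(len(deduplicated), len(preserved))  # 2 3
```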
