Open
Description
Hi @percevalw ,
The default behaviour of get_spans causes entities to be lost when writing documents to disk.
I suggest adding a deduplicate argument to the converters, with a default value of False.
For example, here the get_spans function deduplicates the spans, so fewer entities than expected are written to disk:
edsnlp/edsnlp/data/converters.py
Line 612 in 879e340
spans = get_spans(doc, self.span_getter)
Additionally, this line also drops duplicate spans:
edsnlp/edsnlp/data/converters.py
Line 645 in 879e340
for i, ent in enumerate(sorted(dict.fromkeys(spans)))
I suggest replacing it with:
for i, ent in enumerate(sorted(spans))
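To illustrate the effect, here is a minimal sketch that does not use edsnlp or spaCy. `FakeSpan` and `ordered_spans` are hypothetical stand-ins, and the sketch assumes that two spans with the same boundaries and label compare equal and hash identically, which is what makes `dict.fromkeys` collapse them. The `deduplicate` parameter only shows how the proposed opt-in flag could behave, not how the converters are actually written.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a span: equal fields give equal hashes,
# and order=True makes instances sortable by (start, end, label).
@dataclass(frozen=True, order=True)
class FakeSpan:
    start: int
    end: int
    label: str


def ordered_spans(spans, deduplicate=False):
    """Sort spans, dropping duplicates only when explicitly requested."""
    if deduplicate:
        # dict.fromkeys keeps the first occurrence of each equal span
        spans = dict.fromkeys(spans)
    return sorted(spans)


spans = [
    FakeSpan(5, 7, "drug"),
    FakeSpan(0, 2, "disease"),
    FakeSpan(5, 7, "drug"),  # same boundaries and label as the first span
]

deduplicated = sorted(dict.fromkeys(spans))  # current behaviour
kept = sorted(spans)  # suggested behaviour

print(len(deduplicated))  # 2
print(len(kept))  # 3
```

With `deduplicate=False` as the default, all three spans reach the writer, and callers who rely on the current behaviour can still request it explicitly.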