This is probably a rare case that only occurs when adding a spacymoji step to the pipeline of a StanfordNLPLanguage instance. What happens is that the spacymoji constructor uses the StanfordNLPLanguage's tokenizer to convert each emoji into a Doc for the PhraseMatcher. For standard spaCy Language instances this is efficient, since the tokenizer does only the minimal work needed here. But the StanfordNLPLanguage's tokenizer runs the entire StanfordNLP pipeline in one go, which makes this step very slow: on my laptop, simply creating a spacymoji instance takes over 2 minutes, versus under 1 second in the normal case.
I'm not sure how to fix this elegantly. One option would be to make the code conditional on the class of the nlp object. Another would be to always load the default English tokenizer and use it to process the emoji, instead of using the passed nlp object's tokenizer.
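To make the idea concrete, here is a rough, self-contained sketch of the "use a cheap tokenizer just for pattern construction" approach. This is a schematic mock, not the real spacymoji or spacy-stanfordnlp code: `SlowTokenizer`, `FastTokenizer`, `Language`, and `build_emoji_patterns` are all hypothetical stand-ins for, respectively, the StanfordNLPLanguage tokenizer (which runs the whole pipeline), a default lightweight tokenizer, the nlp object, and the spacymoji constructor's pattern-building step.

```python
class SlowTokenizer:
    """Stand-in for StanfordNLPLanguage's tokenizer: imagine the whole
    StanfordNLP pipeline (tagging, parsing, ...) running on every call."""
    def __call__(self, text):
        return text.split()

class FastTokenizer:
    """Stand-in for a default lightweight tokenizer that does only the
    minimal work needed to turn an emoji string into a pattern."""
    def __call__(self, text):
        return text.split()

class Language:
    """Minimal mock of an nlp object that exposes a tokenizer attribute."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

def build_emoji_patterns(nlp, emoji_list):
    # spacymoji-style setup step: tokenize each emoji string into a
    # pattern for the PhraseMatcher.
    return [nlp.tokenizer(e) for e in emoji_list]

nlp = Language(SlowTokenizer())

# Workaround: temporarily swap in the cheap tokenizer for pattern
# creation only, then restore the original one for normal processing.
original = nlp.tokenizer
nlp.tokenizer = FastTokenizer()
patterns = build_emoji_patterns(nlp, ["😀", "👍"])
nlp.tokenizer = original
```

In the real libraries, the equivalent would be constructing the emoji patterns with spaCy's default English tokenizer (or any tokenizer over the same vocab) rather than calling the full StanfordNLP pipeline once per emoji.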
I'm happy to create a PR if we agree on the best solution.