Tokens not properly split when using emoji with modifiers #10

@timonmohaupt

Description

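import spacy
from spacymoji import Emoji

def test():
    nlp = spacy.load('en_core_web_sm')
    # Register the emoji component at the start of the pipeline
    emoji = Emoji(nlp, merge_spans=True)
    nlp.add_pipe(emoji, first=True)
    # Case 1: emoji with a dark skin tone modifier, no space after "!"
    doc = nlp('Word!👍🏿')
    for token in doc:
        print(token)
    # Case 2: same emoji, separated from "!" by a space
    doc = nlp('Word! 👍🏿')
    for token in doc:
        print(token)
    # Case 3: emoji without a modifier, no space after "!"
    doc = nlp('Word!👍')
    for token in doc:
        print(token)
    return doc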

This reproduces the problem: "Word!" is not split into "Word" and "!" when the thumbs-up emoji has a dark skin tone modifier.
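For context, here is a small sketch of how the two emoji strings differ at the code point level. It uses only the Python standard library and does not touch spaCy or spacymoji, so it does not show where the tokenization goes wrong; it only illustrates that the modified emoji is a sequence of two code points while the plain one is a single code point. The helper name `describe_codepoints` and the constants are hypothetical, chosen for this illustration.

```python
import unicodedata

# Written as escapes so the exact code points are unambiguous.
PLAIN = "\U0001F44D"            # thumbs up
DARK = "\U0001F44D\U0001F3FF"   # thumbs up followed by a skin tone modifier


def describe_codepoints(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


for label, text in (("plain", PLAIN), ("dark", DARK)):
    print(label, len(text), describe_codepoints(text))
```

If the modifier turns out to be the trigger, one thing worth checking is whether the tokenizer sees the base emoji and the modifier as separate characters before the emoji component merges them.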
