Conversation
def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    xs = [a for a in seg.segment(x_long[0]) if len(a) > 0]
    return tuple(xs)
Hey!
For texts with sentences that exceed 256 tokens, we get this (in ner_mult_long_demo):
RuntimeError: input sequence after bert tokenization shouldn't exceed 256 tokens.
Would it be possible to add some code here that splits each sentence in xs into chunks of at most 256 tokens each, using transformers.BertTokenizerFast or something like that?
I know it may cause some inaccuracy in some cases, for example if a sentence gets split in the middle of "Michael Jackson", but at least it won't cause a RuntimeError and fail the whole thing :)
There's just one important caveat: if a chunk boundary would fall in the middle of a word, e.g. "un|believable", the word should remain intact and move to the next chunk, rather than being split.
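For illustration, here is a minimal sketch of the kind of word-preserving chunking I mean, assuming transformers.BertTokenizerFast and its offset mapping; the function name, the max_tokens default and the overall structure are only illustrative, not a concrete proposal:

from transformers import BertTokenizerFast

# Illustrative only: chunk a sentence into pieces of at most `max_tokens`
# BERT subword tokens without ever splitting a word across two chunks.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

def chunk_keeping_words(sentence: str, max_tokens: int = 256) -> list:
    enc = tokenizer(sentence, add_special_tokens=False,
                    return_offsets_mapping=True)
    word_ids = enc.word_ids()          # subword index -> word index
    offsets = enc["offset_mapping"]    # subword index -> (char_start, char_end)
    chunks, chunk_start, chunk_tokens, prev_end = [], None, 0, 0
    i, n = 0, len(word_ids)
    while i < n:
        # Collect all subword tokens that belong to the same word.
        j = i
        while j < n and word_ids[j] == word_ids[i]:
            j += 1
        word_tokens = j - i
        word_start, word_end = offsets[i][0], offsets[j - 1][1]
        if chunk_start is None:
            chunk_start = word_start
        # If the whole word doesn't fit, close the current chunk first.
        # (A single word longer than max_tokens still becomes its own chunk.)
        if chunk_tokens and chunk_tokens + word_tokens > max_tokens:
            chunks.append(sentence[chunk_start:prev_end])
            chunk_start, chunk_tokens = word_start, 0
        chunk_tokens += word_tokens
        prev_end = word_end
        i = j
    if chunk_start is not None:
        chunks.append(sentence[chunk_start:prev_end])
    return chunks

Slicing by character offsets keeps the original spacing and punctuation inside each chunk, and only whole words ever move between chunks.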
Many thanks!
Something like this seems to at least prevent most of the crashes:
from transformers import BertTokenizer
...

def SentenceDelimiter(x_long):
    seg = pysbd.Segmenter(clean=False)
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    sentences = [a for a in seg.segment(x_long[0]) if len(a) > 0]

    def split_long_sentence(sentence: str) -> list:
        # Re-tokenize the sentence with the BERT tokenizer and, if it is
        # longer than 250 subword tokens, break it into chunks of at most
        # 250 tokens each (words may still be split at chunk boundaries).
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) <= 250:
            return [sentence]
        chunks = []
        current_chunk = []
        current_token_count = 0
        for token in tokens:
            if current_token_count + 1 > 250:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = [token]
                current_token_count = 1
            else:
                current_chunk.append(token)
                current_token_count += 1
        if current_chunk:
            chunks.append(current_chunk)
        return [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

    processed_sentences = []
    for sentence in sentences:
        processed_sentences.extend(split_long_sentence(sentence))
    return tuple(processed_sentences)
I couldn't get it to work with 256 (maybe I'm not using the correct pretrained model or the right parameters); it often still exceeded 256 tokens after all, so I put 250.
This doesn't keep words intact, but at least it prevents most of the crashes for inputs that would otherwise fail and give no results at all.
Ideally, "250" and "bert-base-multilingual-cased" should come as arguments in the SentenceDelimiter function, but I'm not sure how to pass those from the json config file.
Thanks!