Add some symbols to punctuation in strip_punctuation #9
remusao wants to merge 1 commit into JuliaText:master from
Conversation
|
This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do. |
|
I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens containing these symbols. In my case I mainly wanted to avoid tokens like |
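To make the two options in this comment concrete, here is a minimal Python sketch. It is not the package's Julia code: the punctuation set, the function names `strip_everywhere` and `strip_token_edges`, and the reading of "beginning and end of the string" as "beginning and end of each whitespace-separated token" are all assumptions made for illustration.

```python
import re

# Hypothetical punctuation set for illustration only; the regex actually used
# by strip_punctuation! may differ.
PUNCT_EVERYWHERE = re.compile(r"[,;:.!?()'\"]")
QUOTES = "'\""


def strip_everywhere(text: str) -> str:
    """Remove every listed punctuation character, including quotes inside tokens."""
    return PUNCT_EVERYWHERE.sub("", text)


def strip_token_edges(text: str) -> str:
    """Remove quote characters only at the start and end of each token."""
    return " ".join(token.strip(QUOTES) for token in text.split())


sample = "'hello' she said, \"don't\" go"
print(strip_everywhere(sample))   # hello she said dont go
print(strip_token_edges(sample))  # hello she said, don't go
```

The first variant turns `don't` into `dont`, which is the problem raised earlier in the thread. The second keeps the apostrophe inside the token and only drops the surrounding quotes. In this sketch it handles quotes only, so other punctuation such as the comma is left in place.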
|
Let's see what R's tm and Python's NLTK do, then make a decision. |
|
And is it possible to add "[" and "]" to exactly this regex? |
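As a rough illustration of what adding brackets to such a pattern involves, here is a short Python sketch. The character class is hypothetical and is not the regex from the patch. The point it shows is general to most regex dialects: inside a character class, `]` has to be escaped (or placed first) so that it is not read as the end of the class.

```python
import re

# Hypothetical character class for illustration; "[" and "]" are escaped so
# they are treated as literal characters rather than class delimiters.
PUNCT_WITH_BRACKETS = re.compile(r"[,;:.!?()\[\]]")


def strip_with_brackets(text: str) -> str:
    """Remove the listed punctuation, including square brackets."""
    return PUNCT_WITH_BRACKETS.sub("", text)


print(strip_with_brackets("see [1], then (maybe) [2]."))  # see 1 then maybe 2
```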
Hi,
Since `"` and `'` are considered punctuation in English, I thought it would be a good idea to add these characters to the function `strip_punctuation!` in the `preprocessing` module. I don't know if there is a reason for not including them in the regex, but I needed them in a project of mine, so here is a patch in case it could be useful for others too.
Bests,
Remusao