Some characters missing in spa.training_text makes Tesseract fail recognizing them #137
Open
diegodlh wants to merge 1 commit into tesseract-ocr:main from
…pital "É" and "«"
Contributor
Thank you. This training text file is suitable for Tesseract 3.0x (base Tesseract). For 4.0 and LSTM training, please see the langdata_lstm repo.
Author
Indeed, I retried tesstrain.sh with langdata_lstm, and its training_text file is so long that this time unicharset_extractor did not complain about missing characters. Still, since users may still be using langdata to train their Tesseract 3.0x engine (or Tesseract 4.0 with --oem 0, as I understand it), I think it would be useful to merge my commit into the master branch of plain langdata. Thanks!
When running unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É" and "«" are absent from the training text (while their counterparts, "ñ", "é" and "»", are present). As a result, Tesseract fails to recognize these characters with --oem 0 (for example, it recognizes "Ñ" as "NN", and "É" as "EI").
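As a rough illustration of the coverage problem described above, the following Python sketch lists which characters from a required set never appear in a piece of training text. It is not how unicharset_extractor works internally; the function name `missing_characters` and the sample string are hypothetical, and the text is normalized to NFC on the assumption that accented letters should be compared in precomposed form.

```python
import unicodedata


def missing_characters(text: str, required: str) -> list[str]:
    """Return the characters of `required` that never occur in `text`."""
    # Normalize both sides so "É" matches whether it was stored
    # precomposed or as "E" plus a combining accent.
    present = set(unicodedata.normalize("NFC", text))
    wanted = unicodedata.normalize("NFC", required)
    return [ch for ch in wanted if ch not in present]


# Hypothetical sample standing in for the contents of spa.training_text.
sample = "España años también México » ñ é"
print(missing_characters(sample, "ÑñÉé«»"))  # ['Ñ', 'É', '«']
```

In practice the text would be read from the training_text file with `encoding="utf-8"` instead of the inline sample.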
I'm a beginner at Tesseract training and I'm not sure how these training_text files are generated. They seem to me to be a more or less random set of words and short phrases. It occurred to me that I could simply make some replacements to cover the missing characters: España -> ESPAÑA, años -> AÑOS, también -> TAMBIÉN, México -> MÉXICO. I also replaced half of the occurrences of "»" with "«".
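The replacements listed above could be scripted along these lines. This is only a sketch under stated assumptions, not the procedure used for the commit: it uppercases just the first occurrence of each word (so the lowercase forms stay covered), and it interprets "half of the occurrences" as every second "»". The name `patch_training_text` is hypothetical.

```python
import re

# Word replacements taken from the description above.
WORD_REPLACEMENTS = {
    "España": "ESPAÑA",
    "años": "AÑOS",
    "también": "TAMBIÉN",
    "México": "MÉXICO",
}


def patch_training_text(text: str) -> str:
    """Add the missing uppercase letters and "«" to the training text."""
    # Assumption: replace only the first occurrence of each word.
    for old, new in WORD_REPLACEMENTS.items():
        text = text.replace(old, new, 1)

    seen = 0

    def swap_every_second(match: re.Match) -> str:
        # Assumption: "half" means every second "»" becomes "«".
        nonlocal seen
        seen += 1
        return "«" if seen % 2 == 0 else match.group(0)

    return re.sub("»", swap_every_second, text)


example = "España años » uno » dos también México España"
print(patch_training_text(example))
```

Replacing only some occurrences keeps "ñ", "é" and "»" in the text, so the patched file covers both members of each pair.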
If my assumption that this file is mostly random is correct, please consider pulling this commit into master. Thank you.