58.2 MB of sentences from Wikipedia articles (one sentence per line)
| Filename | # paragraphs | # lines | Size (MB) |
|---|---|---|---|
| 002.txt | 25,648 | 120,154 | 13.5 |
| 003.txt | 25,344 | 121,575 | 12.6 |
| 004.txt | 26,119 | 124,285 | 13.4 |
| 005.txt | 7,992 | 164,319 | 18.7 |
Each file contains a number of paragraphs. Each paragraph consists of several sentences that appear consecutively in the source Wikipedia article, written one sentence per line. Paragraphs are separated by the symbol `+++$+++`.
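The layout described above can be read back with a few lines of Python. This is only a minimal sketch: the function names `parse_paragraphs` and `load_paragraphs` are hypothetical, and UTF-8 encoding is an assumption, since the files' encoding is not stated here. The text is split on the separator string itself, so the sketch does not depend on whether `+++$+++` sits on its own line.

```python
from typing import List

SEPARATOR = "+++$+++"  # paragraph delimiter described above


def parse_paragraphs(text: str) -> List[List[str]]:
    """Split raw file content into paragraphs, each a list of sentences."""
    paragraphs = []
    for chunk in text.split(SEPARATOR):
        # One sentence per line; drop blank lines and surrounding whitespace.
        sentences = [line.strip() for line in chunk.splitlines() if line.strip()]
        if sentences:  # skip empty chunks, e.g. around a trailing separator
            paragraphs.append(sentences)
    return paragraphs


def load_paragraphs(path: str, encoding: str = "utf-8") -> List[List[str]]:
    """Read one corpus file (hypothetical helper; UTF-8 is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return parse_paragraphs(handle.read())
```

For example, `load_paragraphs("002.txt")` would return a list with one entry per paragraph, each entry being the list of that paragraph's sentences.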
I have tried to remove as many inconsistencies as I could find. However, the data still contains mistakes. In particular, these files contain non-Latin characters (Japanese, Chinese, Arabic, etc.), which I have not removed.
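If the remaining non-Latin characters are a problem for a given use, sentences containing them can be flagged with a simple heuristic. The sketch below is not part of how the files were produced; `has_non_latin_letters` is a hypothetical helper that treats a letter as Latin when its Unicode character name starts with `LATIN`. Under that assumption it also flags Greek and Cyrillic letters and a few symbols such as the micro sign.

```python
import unicodedata


def has_non_latin_letters(sentence: str) -> bool:
    """Return True if any alphabetic character is outside the Latin script.

    Heuristic: a letter counts as Latin when its Unicode name starts with
    "LATIN". Digits, punctuation, and whitespace are ignored.
    """
    for ch in sentence:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False
```

Accented Latin letters such as "é" pass this check, so French or German names would be kept while sentences containing Japanese, Chinese, or Arabic text would be flagged.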