
Dataset: 58.2 MB of sentences from Wikipedia articles (one sentence per line)

tblondelle/wikipedia_sentences_dataset

wikipedia_sentences_dataset

58.2 MB of sentences from Wikipedia articles (one sentence per line)

Description

| Filename | Paragraphs | Lines   | Size (MB) |
|----------|------------|---------|-----------|
| 002.txt  | 25,648     | 120,154 | 13.5      |
| 003.txt  | 25,344     | 121,575 | 12.6      |
| 004.txt  | 26,119     | 124,285 | 13.4      |
| 005.txt  | 7,992      | 164,319 | 18.7      |

Each file contains a number of paragraphs. Each paragraph consists of several sentences that appear contiguously in the Wikipedia article. Each sentence is written on its own line. Paragraphs are separated by the symbol +++$+++.
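The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the repository: the function names `parse_paragraphs` and `load_file` are hypothetical, UTF-8 encoding is an assumption, and the code does not assume whether the separator sits on its own line.

```python
from pathlib import Path

# Paragraph separator as described in the README.
DELIMITER = "+++$+++"


def parse_paragraphs(text):
    """Split raw file text into paragraphs, each a list of sentences."""
    paragraphs = []
    for chunk in text.split(DELIMITER):
        # One sentence per line; split on "\n" only, so that unusual Unicode
        # line separators inside a sentence do not break it apart.
        sentences = [line.strip() for line in chunk.split("\n") if line.strip()]
        if sentences:  # skip empty chunks around the separator
            paragraphs.append(sentences)
    return paragraphs


def load_file(path):
    """Read one dataset file (UTF-8 is assumed) and parse it."""
    return parse_paragraphs(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    sample = "A first sentence.\nA second sentence.\n+++$+++\nAnother paragraph.\n"
    for paragraph in parse_paragraphs(sample):
        print(paragraph)
```

Calling `load_file("002.txt")` would then return one list of sentences per paragraph, assuming the file is in the working directory.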

Note

I have tried to remove as many inconsistencies as I could find. However, the dataset still contains mistakes. In particular, these files contain non-Latin characters (Japanese, Chinese, Arabic, etc.), which I have not removed.
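If the non-Latin characters are a problem for a given use, sentences containing them can be detected with the standard `unicodedata` module. This is only one possible heuristic, assumed here for illustration: the helper name `has_non_latin_letters` is hypothetical, and it flags any alphabetic character whose Unicode name does not contain "LATIN", which also catches Greek or Cyrillic letters.

```python
import unicodedata


def has_non_latin_letters(sentence):
    """Return True if any alphabetic character is outside the Latin script."""
    for ch in sentence:
        # Digits, punctuation, and spaces are ignored; only letters are checked.
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


if __name__ == "__main__":
    sentences = ["Tokyo (東京) is the capital of Japan.", "Café au lait, 1999."]
    latin_only = [s for s in sentences if not has_non_latin_letters(s)]
    print(latin_only)
```

Accented Latin letters such as "é" are kept, since their Unicode names still contain "LATIN".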
