Negapedia TFIDF Analyzer analyzes Wikipedia dumps and performs statistical analysis on the text of reverts.
The output data can be used to clarify the themes of conflict within a Wikipedia page.
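TF-IDF, the statistic reported for every page, weighs how often a word occurs in a page against how rare that word is across all pages: words that are frequent locally but rare globally score highest. The following is a minimal self-contained sketch of the idea, not the analyzer's actual implementation:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf scores each word of one document against a small corpus:
// tf (relative frequency in the document) times idf (log of the
// inverse fraction of documents containing the word).
func tfidf(doc string, corpus []string) map[string]float64 {
	words := strings.Fields(doc)
	counts := map[string]float64{}
	for _, w := range words {
		counts[w]++
	}
	scores := map[string]float64{}
	for w, count := range counts {
		df := 0
		for _, d := range corpus {
			if strings.Contains(" "+d+" ", " "+w+" ") {
				df++
			}
		}
		idf := math.Log(float64(len(corpus)) / float64(df))
		scores[w] = (count / float64(len(words))) * idf
	}
	return scores
}

func main() {
	corpus := []string{
		"edit war over neutrality",
		"war of words",
		"neutral point of view",
	}
	// "neutrality" and "edit" outrank "war", which appears in two documents.
	fmt.Println(tfidf(corpus[0], corpus))
}
```

A word that appears in every document gets an idf of log(1) = 0, so ubiquitous words are suppressed regardless of how often they occur.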
english, arabic, danish, dutch, finnish, french,
german, greek, hungarian, indonesian, italian,
kazakh, nepali, portuguese, romanian,
russian, spanish, swedish, turkish, armenian,
azerbaijani, basque, bengali, bulgarian, catalan,
chinese, croatian, czech, galician, hebrew, hindi,
irish, japanese, korean, latvian, lithuanian,
marathi, persian, polish, slovak, thai, ukrainian,
urdu, simple-english
These data come from Negapedia/nltk.
english, arabic, danish, dutch, finnish, french,
german, hungarian, italian, portuguese,
spanish, swedish, chinese, czech, hindi, japanese,
korean, persian, polish, thai, simple-english
These data come from Negapedia/badwords.
- GlobalPagesTFIDF.json: for every page, the list of its words, each with its absolute frequency and TF-IDF value;
- GlobalPagesTFIDF_topNwords.json: as GlobalPagesTFIDF.json, but only the N most important words (by TF-IDF value) are reported;
- GlobalWords.json: all of the analyzed wiki's words, each with its absolute frequency;
- GlobalTopic.json: all the words in every topic (using Negapedia topics);
- BadWordsReport.json: for every page that has any, the list of badwords with their absolute frequency.
The minimum requirements needed to run the project in reasonable time are:
- At least 4 cores-8 threads CPU;
- At least 16GB of RAM (required);
- At least 300GB of disk space.
However, the recommended requirements are:
- 32GB of RAM or more (highly recommended).
docker build -t <image_name> .
from the root of the repository directory.
docker run -d -v <path_on_fs_where_to_save_results>:<container_results_path> <image_name>
example:
docker run -d -v /path/2/out/dir:/data my_image
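Since the container runs detached (`-d`), progress can be followed through the standard Docker log commands; the container ID below is whatever `docker ps` reports for the running image:

```shell
# list running containers to find the container ID
docker ps
# follow the analyzer's log output
docker logs -f <container_id>
```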
- -lang: wiki language;
- -d: container results directory;
- -s: starting date of the reverts to consider;
- -e: ending date of the reverts to consider;
- -specialList: special page list to consider;
- -rev: number of reverts to consider;
- -topPages: number of top words per page to consider;
- -topWords: number of top global words to consider;
- -topTopic: number of top words per topic to consider;
- -delete: if true, the results directory is deleted after compression (default: true);
- -test: if true, logs are shown and only a single dump is processed.
example:
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it
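Several of the flags above can be combined in a single run. The dates and limits below are made-up illustrative values, and the date format is assumed (not confirmed by this document) to be YYYY-MM-DD:

```shell
docker run -d -v /path/2/out/dir:/data wikitfidf dothething \
  -lang en \
  -s 2018-01-01 -e 2019-12-31 \
  -rev 10 \
  -topPages 50 -topWords 100 -topTopic 25
```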
The Go package can be installed with:
go get github.com/negapedia/wikitfidf
and the Docker image can be downloaded with:
docker pull negapedia/wikitfidf