Course material for "Traitement automatique de corpus" (STIC-B545) taught at ULB (Université libre de Bruxelles)
Caution: Python 3.6 or higher is required for f-string support (3.7 or 3.8 is preferable)
It is recommended to run this code in a virtual environment:
git clone git@github.com:madewild/tac.git
cd tac
pip install virtualenv
virtualenv venv --python=python3
source venv/bin/activate
which pip
Check that pip now points inside the virtual environment, then install the Python dependencies with pip install -r requirements.txt
You can use either the scripts (*.py) or the Jupyter notebooks (*.ipynb)
s1_sql.py: querying a simple relational database
s2_sparql.py: querying the Wikidata SPARQL endpoint
s3_api.py: playing with OpenStreetMap and EUcountries APIs
s4_scrape.py: scraping the AVB to retrieve 2833 PDF bulletins
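As a minimal sketch of the kind of relational querying s1_sql.py performs — using Python's standard sqlite3 module, with a table schema and rows invented for the example (they are not taken from the course database):

```python
import sqlite3

# Build a small in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bulletins (id INTEGER PRIMARY KEY, year INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO bulletins (year, city) VALUES (?, ?)",
    [(1847, "Bruxelles"), (1900, "Laeken"), (1920, "Bruxelles")],
)

# Query: how many bulletins per city?
rows = conn.execute(
    "SELECT city, COUNT(*) FROM bulletins GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Bruxelles', 2), ('Laeken', 1)]
conn.close()
```

The same pattern (connect, execute, fetch) carries over to a database on disk: just pass a file path instead of ":memory:".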
s1_convert.sh: bash script to convert the PDFs to TXT, move them to a dedicated folder, and aggregate them into a single large text file
s2_explore.py: playing with various categories (city, year, decade, type...)
s3_freq.py: basic frequency analysis, hapaxes, long words...
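The core of a frequency analysis like s3_freq.py can be sketched with the standard library alone — the sample text and the 8-character threshold for "long words" are invented for the example:

```python
from collections import Counter

# Toy corpus (invented sample, for illustration only).
text = ("la ville de bruxelles et la ville de laeken "
        "administration bourgmestre")
tokens = text.split()
freq = Counter(tokens)

# Hapaxes: words that occur exactly once in the corpus.
hapaxes = sorted(w for w, c in freq.items() if c == 1)
# "Long" words: here an arbitrary threshold of 8+ characters.
long_words = sorted(w for w in set(tokens) if len(w) >= 8)

print(hapaxes)     # ['administration', 'bourgmestre', 'bruxelles', 'et', 'laeken']
print(long_words)  # ['administration', 'bourgmestre', 'bruxelles']
```

On the real corpus the tokenization would need to handle punctuation and case, but the Counter-based logic stays the same.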
s1_keyword.py: using YAKE to extract French keywords in each text file
s2_wordcloud.sh: generating a wordcloud for a given year (calling filtering.py in the background)
Install spaCy from the requirements, then run this command to download the French model: python -m spacy download fr_core_news_sm
s3_ner.py: perform NER with the spaCy French model
s4_sentiment.py: analyse positive/negative sentences with TextBlob
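TextBlob scores sentences with a polarity lexicon. The following is a deliberately simplified stdlib stand-in for that idea (it is not TextBlob's actual lexicon or API): a tiny invented word list, naive tokenization, and an average score in [-1, 1].

```python
# Toy polarity lexicon (invented words and weights, for illustration only).
LEXICON = {"excellent": 1.0, "magnifique": 0.8, "mauvais": -0.8, "horrible": -1.0}

def polarity(sentence):
    """Average polarity of the known words in a sentence, in [-1, 1]."""
    # Naive tokenization: lowercase, strip final punctuation, split on spaces.
    words = sentence.lower().strip(".!?").split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for s in ["Un résultat excellent et magnifique.", "Un service horrible."]:
    label = "positive" if polarity(s) > 0 else "negative" if polarity(s) < 0 else "neutral"
    print(f"{label}: {s}")
```

TextBlob's real French support (via the textblob-fr extension) uses a much larger lexicon and proper tokenization, but the classification step — thresholding an aggregate polarity score — is the same.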
classification.py: supervised classification of 20 newsgroups
clustering.py: unsupervised clustering with k-means
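The course script relies on scikit-learn's KMeans; to show what the algorithm actually does, here is a naive pure-Python sketch on one-dimensional data (the data points and starting centroids are invented for the example):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Naive 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        centroids = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centroids)

# Two clearly separated groups of values.
data = [1.0, 2.0, 0.0, 9.0, 11.0, 10.0]
print(kmeans_1d(data, centroids=[0.0, 5.0]))  # [1.0, 10.0]
```

In clustering.py the "points" are high-dimensional document vectors rather than scalars, but the assign-then-update loop is identical.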
sentence_tokenizer.py: split big text into sentences
model_builder.py: train word2vec model on corpus
model_explorer.py: explore similarity between vectors
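The standard measure for comparing word2vec vectors, as model_explorer.py does via gensim, is cosine similarity. A minimal stdlib sketch, with made-up toy "embeddings" (real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (invented values, for illustration only).
ville = [0.9, 0.1, 0.3]
commune = [0.8, 0.2, 0.4]
chien = [0.1, 0.9, 0.0]

# Semantically close words should point in similar directions.
print(cosine_similarity(ville, commune))  # close to 1
print(cosine_similarity(ville, chien))    # much smaller
```

gensim's most_similar() method is essentially this computation ranked over the whole vocabulary.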
language_detection.py: language identification with langid
anonymization.py: de-identification of data with Faker
extraction.py: extract text from various file types
htr.sh: script for handwritten text recognition
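langid identifies languages with a trained character n-gram model; as a rough intuition for the task, here is a deliberately naive stopword-overlap stand-in (the abridged stopword lists are invented for the example and far too small for real use):

```python
# Tiny, abridged stopword lists (illustrative, not exhaustive).
STOPWORDS = {
    "fr": {"le", "la", "les", "de", "et", "un", "une", "est"},
    "en": {"the", "a", "an", "of", "and", "is", "to", "in"},
    "nl": {"de", "het", "een", "en", "van", "is", "te", "dat"},
}

def guess_language(text):
    """Return the language whose stopword list overlaps the text most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(guess_language("la ville de bruxelles est belle"))    # fr
print(guess_language("the city of brussels is beautiful"))  # en
```

langid's actual model is far more robust (it works on short, noisy strings and covers 97 languages), which is why the course uses it instead of a heuristic like this.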