TAC

Course material for "Traitement automatique de corpus" (STIC-B545) taught at ULB

Caution: Python 3.6 or higher is required to handle f-strings (3.7 or 3.8 is preferable)
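For reference, f-strings interpolate expressions directly inside string literals, which is why Python 3.6+ is needed. The variable values below are purely illustrative:

```python
# f-strings (Python >= 3.6) embed expressions inside string literals
course = "STIC-B545"   # course code from this repo
year = 2021            # hypothetical value, for illustration only

print(f"Course {course}, edition {year}")  # → Course STIC-B545, edition 2021
print(f"{year + 1}")                       # expressions are evaluated: 2022
```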

It is recommended to run this code in a virtual environment:

    git clone git@github.com:madewild/tac.git
    cd tac
    pip install virtualenv
    virtualenv venv --python=python3
    source venv/bin/activate
    which pip

Then install Python dependencies with pip install -r requirements.txt

You can use either the scripts (*.py) or the Jupyter notebooks (*.ipynb)

Module 1

s1_sql.py: querying a simple relational database

s2_sparql.py: querying the Wikidata SPARQL endpoint

s3_api.py: playing with OpenStreetMap and EUcountries APIs

s4_scrape.py: scraping the AVB to retrieve 2833 PDF bulletins
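To give the flavour of the database querying in s1_sql.py, here is a minimal sketch using the standard-library sqlite3 module; the bulletins table and its rows are invented for illustration, not the schema the script actually uses:

```python
import sqlite3

# Build a small in-memory database; this schema is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bulletins (city TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO bulletins VALUES (?, ?)",
    [("Bruxelles", 1900), ("Bruxelles", 1901), ("Laeken", 1900)],
)

# A simple aggregate query: number of bulletins per city
rows = list(conn.execute(
    "SELECT city, COUNT(*) FROM bulletins GROUP BY city ORDER BY city"
))
for city, count in rows:
    print(city, count)  # Bruxelles 2 / Laeken 1
conn.close()
```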

Module 2

s1_convert.sh: bash script to convert PDFs to TXT files, move them to a dedicated folder, and aggregate them into a single large text file

s2_explore.py: playing with various categories (city, year, decade, type...)

s3_freq.py: basic frequency analysis, hapaxes, long words...

Module 3

Keyword extraction

s1_keyword.py: using YAKE to extract French keywords in each text file

s2_wordcloud.sh: generating a wordcloud for a given year (calling filtering.py in the background)
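YAKE itself ranks candidate keywords with statistical features (position, capitalization, dispersion...). As a much simpler stand-in that conveys the idea without requiring the yake package, a stopword-filtered frequency ranking works; the stopword list and sentence below are invented:

```python
from collections import Counter
import re

# Minimal French stopword list, for illustration only (not the one YAKE uses)
STOPWORDS = {"la", "le", "de", "des", "est", "une", "un", "et"}

def top_keywords(text, n=3):
    """Rank non-stopword tokens by raw frequency (a crude stand-in for YAKE)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

text = "le conseil de la ville approuve le budget de la ville"
print(top_keywords(text))  # → ['ville', 'conseil', 'approuve']
```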

Named-entity recognition

Install spaCy from the requirements, then run this command to download the French model: python -m spacy download fr_core_news_sm

s3_ner.py: perform NER with the spaCy French model

Sentiment analysis

s4_sentiment.py: analyse positive/negative sentences with TextBlob
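TextBlob scores sentence polarity from a lexicon of word-level scores. The toy version below illustrates that principle with an invented four-word lexicon (not TextBlob's actual one, and not the logic of s4_sentiment.py):

```python
# Invented mini-lexicon mapping words to polarity scores in [-1, 1]
LEXICON = {"excellent": 1.0, "bon": 0.7, "mauvais": -0.7, "horrible": -1.0}

def polarity(sentence):
    """Average the lexicon scores of the words present; 0.0 if none match."""
    words = sentence.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("un excellent bulletin"))  # → 1.0  (positive)
print(polarity("un mauvais rapport"))     # → -0.7 (negative)
```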

Module 4

classification.py: supervised classification of 20 newsgroups

clustering.py: unsupervised clustering with k-means

sentence_tokenizer.py: split big text into sentences

model_builder.py: train word2vec model on corpus

model_explorer.py: explore similarity between vectors
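The similarity explored in model_explorer.py is typically cosine similarity between embedding vectors. A self-contained sketch (the 3-dimensional vectors are toys; real word2vec vectors have 100+ dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, as used to compare embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # → 0.0
```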

Module 5

language_detection: language identification with langid

anonymization.py: de-identification of data with Faker
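Faker generates realistic fake replacement values; a minimal regex-based sketch of the same de-identification idea (the patterns, placeholders, and sample string are assumptions, not what anonymization.py does) looks like this:

```python
import re

# Crude patterns for emails and long digit runs; real de-identification
# needs far more care (names, addresses, edge cases...).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b(?:\d[ .-]?){9,}\d\b")

def anonymize(text):
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact: jean.dupont@example.com ou 02 123 45 67 89"
print(anonymize(sample))  # → Contact: [EMAIL] ou [PHONE]
```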

Module 6

extraction.py: extract text from various file types

htr.sh: script for handwritten text recognition
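Extracting text from multiple file types generally means dispatching on the file extension. The sketch below only handles plain text (PDFs, DOCX, etc. would need dedicated libraries) and is an illustration, not the logic of extraction.py:

```python
from pathlib import Path

def extract_text(path):
    """Return the text of a file, dispatching on its extension.

    Only .txt is handled here; other formats raise, signalling that a
    dedicated extractor library would be needed.
    """
    path = Path(path)
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8")
    raise NotImplementedError(f"no extractor for {path.suffix!r}")
```

Callers can catch NotImplementedError to skip unsupported files rather than crash on a mixed folder.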
