TAC

Course material for "Traitement automatique de corpus" (STIC-B545) taught at ULB

Caution: Python 3.6 or higher is required to handle f-strings (3.7 or 3.8 is preferable)
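For reference, f-strings interpolate expressions directly inside string literals, which is why Python 3.6+ is needed. The variable values below are purely illustrative:

```python
# f-strings (Python >= 3.6) embed expressions inside string literals
course = "STIC-B545"   # course code from this repo
year = 2021            # hypothetical value, for illustration only

print(f"Course {course}, edition {year}")  # → Course STIC-B545, edition 2021
print(f"{year + 1}")                       # expressions are evaluated: 2022
```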

It is recommended to run this code in a virtual environment:

    git clone git@github.com:madewild/tac.git
    cd tac
    pip install virtualenv
    virtualenv venv --python=python3
    source venv/bin/activate
    which pip

Then install Python dependencies with pip install -r requirements.txt

You can use either the scripts (*.py) or the Jupyter notebooks (*.ipynb)

Module 1

s1_sql.py: querying a simple relational database

s2_sparql.py: querying the Wikidata SPARQL endpoint

s3_api.py: playing with OpenStreetMap and EUcountries APIs

s4_scrape.py: scraping the AVB to retrieve 2833 PDF bulletins
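To give the flavour of the database querying in s1_sql.py, here is a minimal sketch using the standard-library sqlite3 module; the bulletins table and its rows are invented for illustration, not the schema the script actually uses:

```python
import sqlite3

# Build a small in-memory database; this schema is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bulletins (city TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO bulletins VALUES (?, ?)",
    [("Bruxelles", 1900), ("Bruxelles", 1901), ("Laeken", 1900)],
)

# A simple aggregate query: number of bulletins per city
rows = list(conn.execute(
    "SELECT city, COUNT(*) FROM bulletins GROUP BY city ORDER BY city"
))
for city, count in rows:
    print(city, count)  # Bruxelles 2 / Laeken 1
conn.close()
```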

Module 2

s1_convert.sh: bash script to convert PDFs to TXT files, move them to a dedicated folder, and aggregate them into a single large text file

s2_explore.py: playing with various categories (city, year, decade, type...)

s3_freq.py: basic frequency analysis, hapaxes, long words...

Module 3

Keyword extraction

s1_keyword.py: using YAKE to extract French keywords in each text file

s2_wordcloud.sh: generating a wordcloud for a given year (calling filtering.py in the background)
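YAKE itself ranks candidate keywords with statistical features (position, capitalization, dispersion...). As a much simpler stand-in that conveys the idea without requiring the yake package, a stopword-filtered frequency ranking works; the stopword list and sentence below are invented:

```python
from collections import Counter
import re

# Minimal French stopword list, for illustration only (not the one YAKE uses)
STOPWORDS = {"la", "le", "de", "des", "est", "une", "un", "et"}

def top_keywords(text, n=3):
    """Rank non-stopword tokens by raw frequency (a crude stand-in for YAKE)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

text = "le conseil de la ville approuve le budget de la ville"
print(top_keywords(text))  # → ['ville', 'conseil', 'approuve']
```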

Named-entity recognition

Install spaCy from the requirements, then run this command to download the French model: python -m spacy download fr_core_news_sm

s3_ner.py: perform NER with the spaCy French model

Sentiment analysis

s4_sentiment.py: analyse positive/negative sentences with TextBlob
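TextBlob scores sentence polarity from a lexicon of word-level scores. The toy version below illustrates that principle with an invented four-word lexicon (not TextBlob's actual one, and not the logic of s4_sentiment.py):

```python
# Invented mini-lexicon mapping words to polarity scores in [-1, 1]
LEXICON = {"excellent": 1.0, "bon": 0.7, "mauvais": -0.7, "horrible": -1.0}

def polarity(sentence):
    """Average the lexicon scores of the words present; 0.0 if none match."""
    words = sentence.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("un excellent bulletin"))  # → 1.0  (positive)
print(polarity("un mauvais rapport"))     # → -0.7 (negative)
```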

Module 4

classification.py: supervised classification of 20 newsgroups

clustering.py: unsupervised clustering with k-means

sentence_tokenizer.py: split big text into sentences

model_builder.py: train word2vec model on corpus

model_explorer.py: explore similarity between vectors
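The similarity explored in model_explorer.py is typically cosine similarity between embedding vectors. A self-contained sketch (the 3-dimensional vectors are toys; real word2vec vectors have 100+ dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, as used to compare embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # → 0.0
```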

Module 5

language_detection: language identification with langid

anonymization.py: de-identification of data with Faker
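Faker generates realistic fake replacement values; a minimal regex-based sketch of the same de-identification idea (the patterns, placeholders, and sample string are assumptions, not what anonymization.py does) looks like this:

```python
import re

# Crude patterns for emails and long digit runs; real de-identification
# needs far more care (names, addresses, edge cases...).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b(?:\d[ .-]?){9,}\d\b")

def anonymize(text):
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact: jean.dupont@example.com ou 02 123 45 67 89"
print(anonymize(sample))  # → Contact: [EMAIL] ou [PHONE]
```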

Module 6

extraction.py: extract text from various file types

htr.sh: script for handwritten text recognition
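Extracting text from multiple file types generally means dispatching on the file extension. The sketch below only handles plain text (PDFs, DOCX, etc. would need dedicated libraries) and is an illustration, not the logic of extraction.py:

```python
from pathlib import Path

def extract_text(path):
    """Return the text of a file, dispatching on its extension.

    Only .txt is handled here; other formats raise, signalling that a
    dedicated extractor library would be needed.
    """
    path = Path(path)
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8")
    raise NotImplementedError(f"no extractor for {path.suffix!r}")
```

Callers can catch NotImplementedError to skip unsupported files rather than crash on a mixed folder.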
