This repository showcases the final coursework for XB_0085 Text Mining for AI, completed at Vrije Universiteit Amsterdam in Spring 2024.
The project explores three core NLP tasks — Sentiment Analysis, Topic Analysis, and Named Entity Recognition (NER) — by applying both classical machine learning and modern transformer-based models.
- Arda Cem Çakmak
- Sinemis Toktaş
- Emre Akça
- Berk Yavaş
Goal:
Classify text data into bipolar (positive/negative) or tripolar (positive/neutral/negative) sentiments.
Methods Used:
- Multinomial Naive Bayes Classifier (MNB): Using `CountVectorizer` and `TfidfTransformer` from `scikit-learn`.
- VADER: Initially tested but omitted from final metrics due to lower comparative performance.
Experiments:
- 5 main experiments:
- 2 bipolar (subjective/objective & positive/negative)
- 3 tripolar (positive/neutral/negative)
- `min_df` tuned from 1–20 for feature selection.
- Best test accuracy: 0.7 on 30,000 balanced tweets (10,000 per sentiment).
- Word clouds visualized frequent tokens for each sentiment.
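A minimal sketch of the pipeline described above, using scikit-learn with toy data in place of the project's tweet corpus (the sentences below are illustrative only):

```python
# Hedged sketch: CountVectorizer -> TfidfTransformer -> MultinomialNB,
# mirroring the setup above; toy data stands in for the real tweet corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["great movie, loved it", "terrible plot, hated it",
         "wonderful acting", "awful and boring"]
labels = ["positive", "negative", "positive", "negative"]

# min_df controls feature selection: tokens appearing in fewer than
# min_df documents are dropped (the project tuned this from 1 to 20).
model = Pipeline([
    ("counts", CountVectorizer(min_df=1)),
    ("tfidf", TfidfTransformer()),
    ("mnb", MultinomialNB()),
])
model.fit(texts, labels)
print(model.predict(["loved the acting"]))  # → ['positive']
```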
Key Observations:
- High precision for negative and neutral predictions.
- False positives mainly occurred due to overlapping tokens across sentiment classes.
- Limited neutral-labeled data restricted model performance.
- Data balance and larger training sets are crucial for higher accuracy.
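The per-class precision figures behind these observations can be computed with scikit-learn's `precision_score`; a small sketch on made-up predictions:

```python
# Illustrative sketch only: the labels below are invented to show how
# per-class precision (fraction of predictions for a class that are
# actually that class) is measured.
from sklearn.metrics import precision_score

y_true = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "neutral", "negative", "neutral", "neutral"]

prec = precision_score(y_true, y_pred, average=None,
                       labels=["positive", "neutral", "negative"])
print(dict(zip(["positive", "neutral", "negative"], prec.round(2))))
# → {'positive': 1.0, 'neutral': 0.67, 'negative': 1.0}
```

Here one true positive was mislabeled neutral, which lowers neutral precision while leaving negative precision intact, the same pattern the observations above describe.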
Datasets:
- Cornell Movie Review Polarity Dataset
- Cornell Subjectivity Dataset
- Twitter Sentiment Dataset (3 Million Tweets)
- Sentiment Analysis Dataset
Goal:
Categorize text data by topics — Books, Movies, Sports.
Methods Used:
- Multinomial Naive Bayes (MNB): Probabilistic, assumes feature independence.
- Support Vector Machine (SVM): Margin-based classifier, robust for high-dimensional data.
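The two classifiers can be compared side by side on the same TF-IDF features; a hedged sketch with invented topic sentences (the project's actual corpora are listed under Datasets):

```python
# Sketch comparing MNB (probabilistic, feature-independence assumption)
# with a linear SVM (margin-based) on toy Books/Movies/Sports data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["the novel's final chapter", "a paperback bestseller",
         "the film's director won an award", "a thrilling movie sequel",
         "the striker scored twice", "the team won the match"]
topics = ["Books", "Books", "Movies", "Movies", "Sports", "Sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

mnb = MultinomialNB().fit(X, topics)
svm = LinearSVC().fit(X, topics)

query = vec.transform(["the team played a great match"])
print("MNB:", mnb.predict(query)[0], "| SVM:", svm.predict(query)[0])
```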
Datasets:
Results:
- MNB: High accuracy for Movies (~99%); performance limited by the smaller Book dataset.
- SVM: Outperformed MNB overall, reaching up to 100% accuracy on test data, with a slight F1-score drop for Books.
- Richer sports datasets improved topic detection for SVM more than MNB.
Challenges:
- Underrepresented Book data hindered overall balance.
- Future improvements: expand Book-related data, adjust class weights, enhance data quality.
Goal:
Extract entities (e.g., PERSON, LOCATION, DATE, WORK_OF_ART) from text.
Methods Used:
- BERT: Fine-tuned for NER on the CoNLL-2003 dataset; 4 entity types (PER, ORG, LOC, MISC).
- Flair: Pre-trained on OntoNotes 5.0; 18 entity types.
Results:
- BERT: Accuracy ~0.87; stronger overall performance, but its limited label set means categories outside CoNLL-2003 get mislabeled as MISC.
- Flair: Broader entity coverage, with accuracy ~0.78; slightly lower due to tokenization edge cases and the absence of IOB tags in its raw output.
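Because Flair returns labeled spans rather than per-token tags, comparing its output against IOB-annotated gold data requires a conversion step. A minimal sketch of such a converter (the function name and span format are our own, not Flair's API):

```python
# Hypothetical helper: turn (start, end, label) token spans into an
# IOB-2 tag sequence so span-based predictions can be scored against
# token-level gold annotations. "end" is exclusive.
def spans_to_iob(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["Ada", "Lovelace", "visited", "London", "in", "1842"]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION"), (5, 6, "DATE")]
print(spans_to_iob(tokens, spans))
# → ['B-PERSON', 'I-PERSON', 'O', 'B-LOCATION', 'O', 'B-DATE']
```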
Key Insights:
- BERT's bidirectional attention mechanism improved results despite label gaps.
- Flair's pre-trained extended label set gave it an edge on uncommon entities (DATE, WORK_OF_ART).
- Better token handling and additional fine-tuning could close performance gaps.
Contributions:
- Arda Cem Çakmak: Sentiment Analysis (Naive Bayes, VADER), analysis, poster design.
- Sinemis Toktaş: NERC (BERT, Flair), analysis, poster design.
- Emre Akça: Topic Analysis (Multinomial Naive Bayes), analysis, poster design.
- Berk Yavaş: Topic Analysis (SVM), analysis, poster design.
📄 Click to view our project poster
- Multinomial Naive Bayes
- SVM Documentation
- OntoNotes 5.0 Corpus
- Flair NER Pre-trained Model
- Cornell Movie Review Data
This project was conducted as part of XB_0085 Text Mining for AI at Vrije Universiteit Amsterdam. We thank our instructor and TAs for their guidance and feedback throughout the course.