This project implements a sentiment analysis model to classify tweets as positive or negative using a Support Vector Machine (SVM). The dataset consists of 10,000 tweets (5,000 positive, 5,000 negative), and the model achieves an impressive 98.8% accuracy on the test set.
The key innovation is a custom preprocessing pipeline that leverages emoticons (e.g., `:)` → "smile", `:(` → "frown", `<3` → "heart") to capture strong sentiment signals, boosting accuracy from an initial 74% to 98.8%.
This project was developed as part of a school assignment, with a focus on building a custom model without pre-trained embeddings or deep learning frameworks.
- Dataset: 10,000 balanced tweets (50% positive, 50% negative)
- Model: Linear SVM with TF-IDF features and custom emoticon handling
- Preprocessing: Custom pipeline including emoticon mapping, emoji processing, lemmatization, and stop word removal
- Accuracy: 98.8% on a 30% test split (3,000 tweets), improved from 74% through iterative feature engineering
- Key Innovation: Mapping emoticons (e.g., `:)` → "smile") to tokens, capturing critical sentiment cues
- Python 3.8 or higher
- pip package manager
- Clone the repository:

  ```bash
  git clone https://github.com/Liburn-Krasniqi/Tweet-Sentiment-Analysis.git
  cd Tweet-Sentiment-Analysis
  ```

- Install required packages:

  ```bash
  pip install numpy pandas scikit-learn nltk emoji matplotlib seaborn
  ```

- Download NLTK data (run once):

  ```python
  import nltk
  nltk.download('wordnet')
  nltk.download('stopwords')
  nltk.download('twitter_samples')
  ```

- Open `Project_B_ICT_AI_v3.ipynb` in Jupyter Notebook or Google Colab
- Run all cells sequentially
- The notebook will:
  - Load and preprocess the dataset
  - Train the SVM model
  - Evaluate performance with confusion matrix and classification report
- Data Loading: Uses NLTK's `twitter_samples` corpus
- Preprocessing: Custom `getCleanedText()` function handles:
  - Lowercasing
  - Username and URL removal
  - Character normalization
  - Emoticon mapping
  - Emoji conversion to text
  - Lemmatization
  - Stop word removal
- Vectorization: TF-IDF with unigram and bigram features (`ngram_range=(1, 2)`)
- Model Training: LinearSVC with GridSearchCV for hyperparameter tuning
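The load → preprocess → vectorize → train flow above can be sketched end to end with a scikit-learn `Pipeline`. This is a minimal illustration on toy stand-in texts, not the notebook's exact code: the real data comes from `twitter_samples` and the real cleaning from `getCleanedText()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for already-cleaned tweets; the notebook loads
# nltk.corpus.twitter_samples and cleans with getCleanedText().
texts = ["smile love this", "frown hate this"] * 5
labels = [1, 0] * 5

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("svm", LinearSVC(class_weight="balanced")),
])
clf.fit(texts, labels)
```

A `Pipeline` guarantees the identical TF-IDF transform is applied at both training and prediction time, which avoids a common train/test mismatch bug.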
- Size: 10,000 tweets (5,000 positive, 5,000 negative)
- Split: 70% training (7,000 tweets), 30% testing (3,000 tweets)
- Source: NLTK Twitter Samples
- Text Normalization:
  - Convert to lowercase
  - Remove usernames (`@username`)
  - Remove URLs
  - Normalize repeated characters (e.g., "loooooove" → "love")
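The normalization steps can be sketched with a few stdlib regexes (a minimal sketch; the notebook's `getCleanedText()` may use different patterns):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip @usernames and URLs, squeeze repeated characters."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)           # remove usernames
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # "loooooove" -> "love"
    return " ".join(text.split())              # collapse leftover whitespace
```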
- Emoticon & Emoji Handling:
  - Map emoticons to words: `:)` → "smile", `:(` → "frown", `<3` → "heart"
  - Convert emojis to text tokens using `emoji.demojize()`
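A dict-based sketch of the emoticon mapping step; the entries shown are a hypothetical subset, and the notebook's actual table may be longer:

```python
# Hypothetical subset of the emoticon table described above.
EMOTICON_MAP = {":)": "smile", ":(": "frown", "<3": "heart"}

def map_emoticons(text: str) -> str:
    """Replace known emoticons with word tokens so TF-IDF can count them."""
    for emoticon, word in EMOTICON_MAP.items():
        text = text.replace(emoticon, f" {word} ")
    return " ".join(text.split())
```

Emojis are handled analogously with `emoji.demojize()`, which turns an emoji into a text token such as `:red_heart:`.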
- Tokenization:
  - Use `TweetTokenizer` for informal language handling
  - Remove stop words
  - Apply lemmatization (preferred over stemming for better word preservation)
- Feature Extraction:
  - TF-IDF vectorization with bigrams
  - Minimum document frequency: 2
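These settings correspond to `TfidfVectorizer(ngram_range=(1, 2), min_df=2)`. A small sketch on made-up documents shows the effect of `min_df=2`, which drops terms appearing in only one document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "smile great day",
    "frown bad day",
    "great day smile",
    "bad frown day",
]
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)
# "great day" appears in two documents and survives min_df=2;
# one-off bigrams like "day smile" are dropped from the vocabulary.
```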
- Algorithm: Linear Support Vector Classifier (LinearSVC)
- Hyperparameter Tuning: GridSearchCV with C values [0.01, 0.1, 1, 10, 100]
- Class Weighting: `class_weight='balanced'`, guarding against any class imbalance (the dataset here is 50/50)
- Cross-Validation: 5-fold CV for hyperparameter selection
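The tuning step corresponds roughly to the following sketch on toy separable data (the C grid and 5-fold CV match the description above; everything else is illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy separable data standing in for the TF-IDF tweet matrix.
X = TfidfVectorizer().fit_transform(
    ["smile great love"] * 6 + ["frown bad awful"] * 6
)
y = np.array([1] * 6 + [0] * 6)

grid = GridSearchCV(
    LinearSVC(class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,  # 5-fold cross-validation, as in the notebook
)
grid.fit(X, y)
```

`grid.best_estimator_` is then refit on the full training split with the C value that scored best across the folds.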
- Accuracy: 98.83%
- Precision: 99% (macro average)
- Recall: 99% (macro average)
- F1-Score: 99% (macro average)
| Model | Accuracy | Notes |
|---|---|---|
| Multinomial Naive Bayes | 74.17% | Baseline with basic preprocessing |
| SVM (CountVectorizer) | 74.47% | Similar performance to Naive Bayes |
| SVM (Lemmatization) | 74.83% | Slight improvement with lemmatization |
| SVM (Emoticon + TF-IDF) | 98.83% | Final optimized model |
The inclusion of emoticons and emojis in the preprocessing pipeline yielded a ~24-percentage-point accuracy improvement (74.83% → 98.83%), demonstrating their critical role in sentiment analysis for social media text.
```
Tweet-Sentiment-Analysis/
├── Project_B_ICT_AI_v3.ipynb   # Main Jupyter notebook
└── README.md                   # This file
```
- Data Quality Explored - Sentiment Analysis
- Emojis Aid Social Media Sentiment Analysis
- Emoticons - Wikipedia
- Lemmatization with NLTK
Liburn Krasniqi
This project is developed for educational purposes as part of a school assignment.