This project implements a sentiment analysis model to classify tweets as positive or negative using a Support Vector Machine (SVM). The dataset consists of 10,000 tweets (5,000 positive, 5,000 negative), and the model achieves an impressive 98.8% accuracy on the test set.
The key innovation is a custom preprocessing pipeline that leverages emoticons (e.g., `:)` → "smile", `:(` → "frown", `<3` → "heart") to capture strong sentiment signals, boosting accuracy from an initial 74% to 98.8%.
This project was developed as part of a school assignment, with a focus on building a custom model without pre-trained embeddings or deep learning frameworks.
- Dataset: 10,000 balanced tweets (50% positive, 50% negative)
- Model: Linear SVM with TF-IDF features and custom emoticon handling
- Preprocessing: Custom pipeline including emoticon mapping, emoji processing, lemmatization, and stop word removal
- Accuracy: 98.8% on a 30% test split (3,000 tweets), improved from 74% through iterative feature engineering
- Key Innovation: Mapping emoticons (e.g., `:)` → "smile") to tokens, capturing critical sentiment cues
- Python 3.8 or higher
- pip package manager
- Clone the repository:

  ```bash
  git clone https://github.com/Liburn-Krasniqi/Tweet-Sentiment-Analysis.git
  cd Tweet-Sentiment-Analysis
  ```

- Install required packages:

  ```bash
  pip install numpy pandas scikit-learn nltk emoji matplotlib seaborn
  ```

- Download NLTK data (run once):

  ```python
  import nltk
  nltk.download('wordnet')
  nltk.download('stopwords')
  nltk.download('twitter_samples')
  ```

- Open `Project_B_ICT_AI_v3.ipynb` in Jupyter Notebook or Google Colab
- Run all cells sequentially
- The notebook will:
  - Load and preprocess the dataset
  - Train the SVM model
  - Evaluate performance with confusion matrix and classification report
- Data Loading: Uses NLTK's `twitter_samples` corpus
- Preprocessing: Custom `getCleanedText()` function handles:
  - Lowercasing
  - Username and URL removal
  - Character normalization
  - Emoticon mapping
  - Emoji conversion to text
  - Lemmatization
  - Stop word removal
- Vectorization: TF-IDF with unigram and bigram features (`ngram_range=(1, 2)`)
- Model Training: LinearSVC with GridSearchCV for hyperparameter tuning
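The load → preprocess → vectorize → train flow above can be sketched end to end with a scikit-learn `Pipeline`. This is a minimal illustration on toy stand-in texts, not the notebook's exact code: the real data comes from `twitter_samples` and the real cleaning from `getCleanedText()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for already-cleaned tweets; the notebook loads
# nltk.corpus.twitter_samples and cleans with getCleanedText().
texts = ["smile love this", "frown hate this"] * 5
labels = [1, 0] * 5

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("svm", LinearSVC(class_weight="balanced")),
])
clf.fit(texts, labels)
```

A `Pipeline` guarantees the identical TF-IDF transform is applied at both training and prediction time, which avoids a common train/test mismatch bug.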
- Size: 10,000 tweets (5,000 positive, 5,000 negative)
- Split: 70% training (7,000 tweets), 30% testing (3,000 tweets)
- Source: NLTK Twitter Samples
- Text Normalization:
  - Convert to lowercase
  - Remove usernames (`@username`)
  - Remove URLs
  - Normalize repeated characters (e.g., "loooooove" → "love")
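The normalization steps can be sketched with a few stdlib regexes (a minimal sketch; the notebook's `getCleanedText()` may use different patterns):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip @usernames and URLs, squeeze repeated characters."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)           # remove usernames
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # "loooooove" -> "love"
    return " ".join(text.split())              # collapse leftover whitespace
```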
- Emoticon & Emoji Handling:
  - Map emoticons to words: `:)` → "smile", `:(` → "frown", `<3` → "heart"
  - Convert emojis to text tokens using `emoji.demojize()`
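A dict-based sketch of the emoticon mapping step; the entries shown are a hypothetical subset, and the notebook's actual table may be longer:

```python
# Hypothetical subset of the emoticon table described above.
EMOTICON_MAP = {":)": "smile", ":(": "frown", "<3": "heart"}

def map_emoticons(text: str) -> str:
    """Replace known emoticons with word tokens so TF-IDF can count them."""
    for emoticon, word in EMOTICON_MAP.items():
        text = text.replace(emoticon, f" {word} ")
    return " ".join(text.split())
```

Emojis are handled analogously with `emoji.demojize()`, which turns an emoji into a text token such as `:red_heart:`.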
- Tokenization:
  - Use `TweetTokenizer` for informal language handling
  - Remove stop words
  - Apply lemmatization (preferred over stemming for better word preservation)
- Feature Extraction:
  - TF-IDF vectorization with bigrams
  - Minimum document frequency: 2
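These settings correspond to `TfidfVectorizer(ngram_range=(1, 2), min_df=2)`. A small sketch on made-up documents shows the effect of `min_df=2`, which drops terms appearing in only one document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "smile great day",
    "frown bad day",
    "great day smile",
    "bad frown day",
]
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)
# "great day" appears in two documents and survives min_df=2;
# one-off bigrams like "day smile" are dropped from the vocabulary.
```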
- Algorithm: Linear Support Vector Classifier (LinearSVC)
- Hyperparameter Tuning: GridSearchCV with C values [0.01, 0.1, 1, 10, 100]
- Class Weighting: `class_weight='balanced'`, guarding against any class imbalance (the dataset here is 50/50)
- Cross-Validation: 5-fold CV for hyperparameter selection
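The tuning step corresponds roughly to the following sketch on toy separable data (the C grid and 5-fold CV match the description above; everything else is illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy separable data standing in for the TF-IDF tweet matrix.
X = TfidfVectorizer().fit_transform(
    ["smile great love"] * 6 + ["frown bad awful"] * 6
)
y = np.array([1] * 6 + [0] * 6)

grid = GridSearchCV(
    LinearSVC(class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,  # 5-fold cross-validation, as in the notebook
)
grid.fit(X, y)
```

`grid.best_estimator_` is then refit on the full training split with the C value that scored best across the folds.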
- Accuracy: 98.83%
- Precision: 99% (macro average)
- Recall: 99% (macro average)
- F1-Score: 99% (macro average)
| Model | Accuracy | Notes |
|---|---|---|
| Multinomial Naive Bayes | 74.17% | Baseline with basic preprocessing |
| SVM (CountVectorizer) | 74.47% | Similar performance to Naive Bayes |
| SVM (Lemmatization) | 74.83% | Slight improvement with lemmatization |
| SVM (Emoticon + TF-IDF) | 98.83% | Final optimized model |
The inclusion of emoticons and emojis in the preprocessing pipeline yielded a ~24-percentage-point accuracy improvement (74.83% → 98.83%), demonstrating their critical role in sentiment analysis for social media text.
```
Tweet-Sentiment-Analysis/
├── Project_B_ICT_AI_v3.ipynb   # Main Jupyter notebook
└── README.md                   # This file
```
- Data Quality Explored - Sentiment Analysis
- Emojis Aid Social Media Sentiment Analysis
- Emoticons - Wikipedia
- Lemmatization with NLTK
Liburn Krasniqi
This project is developed for educational purposes as part of a school assignment.