Tweet Sentiment Analysis with SVM


Overview

This project implements a sentiment analysis model to classify tweets as positive or negative using a Support Vector Machine (SVM). The dataset consists of 10,000 tweets (5,000 positive, 5,000 negative), and the model achieves an impressive 98.8% accuracy on the test set.

The key innovation is a custom preprocessing pipeline that leverages emoticons (e.g., :) → "smile", :( → "frown", <3 → "heart") to capture strong sentiment signals, significantly boosting performance from an initial 74% to 98.8%.

This project was developed as part of a school assignment, with a focus on building a custom model without pre-trained embeddings or deep learning frameworks.

Features

  • Dataset: 10,000 balanced tweets (50% positive, 50% negative)
  • Model: Linear SVM with TF-IDF features and custom emoticon handling
  • Preprocessing: Custom pipeline including emoticon mapping, emoji processing, lemmatization, and stop word removal
  • Accuracy: 98.8% on a 30% test split (3,000 tweets), improved from 74% through iterative feature engineering
  • Key Innovation: Mapping emoticons (e.g., :) to "smile") as tokens, capturing critical sentiment cues

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone the repository:

         git clone https://github.com/Liburn-Krasniqi/Tweet-Sentiment-Analysis.git
         cd Tweet-Sentiment-Analysis

  2. Install the required packages:

         pip install numpy pandas scikit-learn nltk emoji matplotlib seaborn

  3. Download the NLTK data (run once):

         import nltk
         nltk.download('wordnet')
         nltk.download('stopwords')
         nltk.download('twitter_samples')

Usage

Running the Notebook

  1. Open Project_B_ICT_AI_v3.ipynb in Jupyter Notebook or Google Colab
  2. Run all cells sequentially
  3. The notebook will:
    • Load and preprocess the dataset
    • Train the SVM model
    • Evaluate performance with confusion matrix and classification report
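
Under the hood, that flow looks roughly like the following minimal, self-contained sketch. It is not the notebook's exact code: a simple lowercasing step stands in for the full getCleanedText() pipeline, and the variable names, random seed, and C=1 are assumptions.

    # Minimal sketch of the notebook's flow: load, split, vectorize, train, evaluate.
    from nltk.corpus import twitter_samples
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    # Load the 5,000 positive and 5,000 negative tweets from NLTK's twitter_samples
    pos = twitter_samples.strings("positive_tweets.json")
    neg = twitter_samples.strings("negative_tweets.json")
    texts = [t.lower() for t in pos + neg]        # placeholder for getCleanedText()
    labels = [1] * len(pos) + [0] * len(neg)

    # 70/30 train/test split, stratified to keep the classes balanced
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.30, random_state=42, stratify=labels
    )

    # TF-IDF over unigrams and bigrams, then a linear SVM
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    clf = LinearSVC(C=1, class_weight="balanced")
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    # Confusion matrix and classification report on the held-out 30%
    y_pred = clf.predict(vectorizer.transform(X_test))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))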

Key Components

  • Data Loading: Uses NLTK's twitter_samples corpus
  • Preprocessing: Custom getCleanedText() function handles:
    • Lowercasing
    • Username and URL removal
    • Character normalization
    • Emoticon mapping
    • Emoji conversion to text
    • Lemmatization
    • Stop word removal
  • Vectorization: TF-IDF over unigrams and bigrams (ngram_range of (1, 2)); see the toy example after this list
  • Model Training: LinearSVC with GridSearchCV for hyperparameter tuning
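
To make the vectorization step concrete, here is a toy example (not taken from the notebook) showing how unigram + bigram TF-IDF turns cleaned tweets, including emoticon-derived tokens such as "smile" and "frown", into features:

    # Toy illustration of the bigram TF-IDF step; the notebook fits this on the
    # full 10,000-tweet corpus with min_df=2, so the real vocabulary differs.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["great day smile", "bad day frown", "great smile"]  # toy cleaned tweets
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())
    # ['bad' 'bad day' 'day' 'day frown' 'day smile' 'frown' 'great' 'great day'
    #  'great smile' 'smile']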

Methodology

Dataset

  • Size: 10,000 tweets (5,000 positive, 5,000 negative)
  • Split: 70% training (7,000 tweets), 30% testing (3,000 tweets)
  • Source: NLTK Twitter Samples

Preprocessing Pipeline

  1. Text Normalization:

    • Convert to lowercase
    • Remove usernames (@username)
    • Remove URLs
    • Normalize repeated characters (e.g., "loooooove" → "love")
  2. Emoticon & Emoji Handling:

    • Map emoticons to words: :) → "smile", :( → "frown", <3 → "heart"
    • Convert emojis to text tokens using emoji.demojize()
  3. Tokenization:

    • Use TweetTokenizer for informal language handling
    • Remove stop words
    • Apply lemmatization (preferred over stemming for better word preservation)
  4. Feature Extraction:

    • TF-IDF vectorization with bigrams
    • Minimum document frequency: 2
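
The steps above can be sketched as a single cleaning function. This is an illustrative reconstruction, not the notebook's exact getCleanedText(): the emoticon map shows only the three mappings mentioned in this README, and the repeated-character rule is a simple approximation.

    # Illustrative cleaning function; details may differ from the notebook's getCleanedText().
    import re
    import emoji
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import TweetTokenizer

    EMOTICONS = {":)": "smile", ":(": "frown", "<3": "heart"}  # sample mapping only
    tokenizer = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def get_cleaned_text(tweet: str) -> str:
        text = tweet.lower()
        text = re.sub(r"@\w+", "", text)             # remove usernames
        text = re.sub(r"https?://\S+", "", text)     # remove URLs
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # squeeze repeats: "loooooove" -> "loove"
        for emoticon, word in EMOTICONS.items():     # emoticons become sentiment tokens
            text = text.replace(emoticon, f" {word} ")
        # emojis -> text tokens; underscores split so names survive the alpha filter below
        text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")
        tokens = [
            lemmatizer.lemmatize(tok)
            for tok in tokenizer.tokenize(text)
            if tok.isalpha() and tok not in stop_words
        ]
        return " ".join(tokens)

    print(get_cleaned_text("@user I loooooove this :) <3 https://t.co/abc"))
    # -> "loove smile heart"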

Model Architecture

  • Algorithm: Linear Support Vector Classifier (LinearSVC)
  • Hyperparameter Tuning: GridSearchCV with C values [0.01, 0.1, 1, 10, 100]
  • Class Weighting: class_weight set to "balanced" as a safeguard, even though the dataset is already 50/50
  • Cross-Validation: 5-fold CV for hyperparameter selection
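
A sketch of that search, following the README's grid of C values, balanced class weights, and 5-fold CV. The data preparation mirrors the earlier sketch; variable names and the random seed are assumptions.

    # Grid search over C for LinearSVC with 5-fold cross-validation on the training split.
    from nltk.corpus import twitter_samples
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import LinearSVC

    pos = twitter_samples.strings("positive_tweets.json")
    neg = twitter_samples.strings("negative_tweets.json")
    texts = pos + neg
    labels = [1] * len(pos) + [0] * len(neg)
    X_train, _, y_train, _ = train_test_split(
        texts, labels, test_size=0.30, random_state=42, stratify=labels
    )

    X_train_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(X_train)

    grid = GridSearchCV(
        LinearSVC(class_weight="balanced"),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
        cv=5,                     # 5-fold cross-validation
        scoring="accuracy",
    )
    grid.fit(X_train_tfidf, y_train)
    print("Best C:", grid.best_params_["C"], "| CV accuracy:", round(grid.best_score_, 4))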

Results

Performance Metrics

  • Accuracy: 98.83%
  • Precision: 99% (macro average)
  • Recall: 99% (macro average)
  • F1-Score: 99% (macro average)

Model Comparison

| Model | Accuracy | Notes |
| --- | --- | --- |
| Multinomial Naive Bayes | 74.17% | Baseline with basic preprocessing |
| SVM (CountVectorizer) | 74.47% | Similar performance to Naive Bayes |
| SVM (Lemmatization) | 74.83% | Slight improvement with lemmatization |
| SVM (Emoticon + TF-IDF) | 98.83% | Final optimized model |

Key Findings

Including emoticons and emojis in the preprocessing pipeline improved accuracy by roughly 24 percentage points (from 74.83% to 98.83%), demonstrating their critical role in sentiment analysis of social media text.

Project Structure

Tweet-Sentiment-Analysis/
├── Project_B_ICT_AI_v3.ipynb   # Main Jupyter notebook
└── README.md                   # This file

Author

Liburn Krasniqi

License

This project was developed for educational purposes as part of a school assignment.
