Sentiment analysis, also known as opinion mining, is a natural language processing approach that identifies the emotional tone behind a body of text.
This project focuses on sentiment analysis of tweets related to Black Friday shopping events. The dataset is obtained from the Twitter API, stored in CSV format, and loaded into an Amazon S3 bucket via Amazon Kinesis Data Firehose. A machine learning pipeline is established to train a Logistic Regression model for supervised sentiment analysis. The model's accuracy is then calculated, and the prediction data is written to a personal S3 bucket for further analysis.
The dataset consists of tweets about Black Friday shopping events, collected through the Twitter API. It is stored in CSV format, which makes it straightforward to ingest into Amazon S3 through Amazon Kinesis Data Firehose.
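
As a rough illustration of the ingestion step, the sketch below encodes tweet rows as CSV lines and sends them to a Firehose delivery stream in batches. It is not the project's actual loader: `client` is assumed to be a boto3 Firehose client, and the helper names, the stream name, and the two-column row layout are hypothetical.

```python
import csv
import io

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def rows_to_records(rows):
    """Encode each row as one newline-terminated CSV line in a Firehose record."""
    records = []
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(row)
        records.append({"Data": buf.getvalue().encode("utf-8")})
    return records


def send_rows(client, stream_name, rows):
    """Send rows in batches and return how many records Firehose rejected."""
    records = rows_to_records(rows)
    failed = 0
    for start in range(0, len(records), MAX_BATCH):
        batch = records[start:start + MAX_BATCH]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed


# Usage (needs boto3 and AWS credentials; the stream name is hypothetical):
#   import boto3
#   rows = [["1", "Great deals, love it"], ["2", "Worst queue ever"]]
#   failed = send_rows(boto3.client("firehose"), "black-friday-tweets", rows)
```

A non-zero return value means some records were rejected and would need to be retried; the sketch leaves retry handling out.
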
The machine learning pipeline comprises the following steps:
- Data Preprocessing:
  - Removal of irrelevant information (URLs, special characters, emojis).
  - Text normalization, including tokenization, stopword removal, and stemming or lemmatization.
- Feature Extraction:
  - Transformation of the cleaned text into numerical features using techniques such as bag-of-words or TF-IDF vectorization.
- Model Training:
  - Training of a Logistic Regression model on tweets labeled with sentiment (1 for positive, 0 for negative).
- Model Evaluation:
  - Calculation of accuracy to assess the model's performance.
- Visualization:
  - Storage of the sentiment predictions in Amazon S3 and creation of tables over them using Amazon Athena.
  - Visualization of the data in Amazon QuickSight.
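
The preprocessing and feature extraction steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation. The stopword list is a tiny placeholder, stemming and lemmatization are omitted, and the TF-IDF weighting used here, tf * (ln(N / df) + 1), is one common variant chosen as an assumption.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real pipelines use a fuller one).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "this", "it"}


def clean_tweet(text):
    """Lowercase, strip URLs, and keep only ASCII letters and whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits, punctuation, emojis
    return text


def tokenize(text):
    """Split cleaned text on whitespace and drop stopwords."""
    return [tok for tok in clean_tweet(text).split() if tok not in STOPWORDS]


def tfidf(docs):
    """Return (vocabulary, vectors) with tf * (ln(N / df) + 1) weights."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of tweets that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty tweets
        vectors.append(
            [(counts[t] / total) * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        )
    return vocab, vectors


vocab, vectors = tfidf(["Great deals today!", "Terrible deals https://t.co/abc"])
print(vocab)
```

Words that appear in every tweet, such as "deals" here, receive a lower weight than words that distinguish one tweet from another.
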
To run this project, you need:
- Access to the Twitter API to obtain the dataset.
- An Amazon EC2 Instance for project deployment.
- An Amazon S3 bucket for storing the dataset and prediction results.
- Knowledge of machine learning techniques, particularly Logistic Regression.
- Python programming skills for implementing the machine learning pipeline.
To use this project:
- Obtain the Black Friday tweet dataset in CSV format using the Twitter API.
- Load the dataset into an Amazon S3 bucket using Amazon Kinesis Data Firehose.
- Preprocess the dataset by cleaning the text data and transforming it into numerical features.
- Train a Logistic Regression model using the preprocessed data.
- Evaluate the trained model by calculating its accuracy.
- Write the prediction results to a personal S3 bucket for further analysis or visualization.
- Query the results with Amazon Athena and visualize them in Amazon QuickSight.
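
The training and evaluation steps above can be illustrated with a small, dependency-free sketch of logistic regression fitted by batch gradient descent. It is an assumption-laden toy, not the project's code: the feature vectors, the vocabulary, the learning rate, and the epoch count are all hypothetical, and a real pipeline would normally rely on a library implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(X, weights, bias):
    """Label a row 1 (positive) when the predicted probability is at least 0.5."""
    return [
        1 if sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) >= 0.5 else 0
        for xi in X
    ]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy feature vectors over a hypothetical vocabulary [great, love, terrible, awful].
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
y = [1, 1, 0, 0]  # 1 for positive, 0 for negative
weights, bias = train_logistic_regression(X, y)
print(accuracy(y, predict(X, weights, bias)))
```

The sketch scores the model on its own training rows for brevity. Accuracy measured on a held-out test split gives a more honest estimate of performance.
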
Possible improvements:
- Use more machine learning models (random forests, neural networks).
- Draw a flowchart showing the steps from data collection to plot generation.
- Build a reusable pipeline.
- Clean the data further in Athena and create better visualizations in QuickSight that provide more valuable insights.
This project is inspired by the WeCloudData big data course, demonstrating sentiment analysis on tweets using Apache Spark on Databricks.