This project aims to classify product reviews as positive or negative using machine learning algorithms. The classification is based on the sentiment expressed in the review text. We will employ various text vectorization techniques and machine learning models to achieve this classification.
The project follows these main steps:
- Data Collection: Gather product reviews from online shopping websites. Each review consists of text and a corresponding rating indicating user sentiment.
- Data Preprocessing:
  - Remove HTML tags and punctuation from the review text.
  - Filter out unnecessary elements and clean the text data.
  - Ensure the correctness of the helpfulness ratio (numerator should be less than or equal to the denominator).
  - Deduplicate the data based on user ID, profile name, time, and text.
- Text Vectorization:
  - Use Bag of Words (BOW) and TF-IDF to convert text data into numerical vectors.
  - Utilize Word2Vec and TF-IDF Weighted Word2Vec for embedding-based vectorization.
- Classification:
  - Apply the K-nearest Neighbors (KNN) algorithm for classification.
  - Evaluate the performance of the classification model using accuracy, precision, recall, and F1-score metrics.
- Model Evaluation:
  - Analyze the confusion matrix to understand the model's performance.
  - Compute precision, recall, and F1-score for both positive and negative classes.
- Results Analysis:
  - Interpret the model's accuracy and performance metrics.
  - Discuss strengths and weaknesses of the classification model.
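
The preprocessing steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual code: the field names (`user_id`, `profile_name`, `time`, `text`, `helpful_num`, `helpful_den`) and the helpers `clean_text` and `preprocess` are hypothetical, and the notebook itself presumably works on a pandas DataFrame instead of a list of dictionaries.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")         # crude matcher for HTML tags
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything that is not a letter or whitespace


def clean_text(text):
    """Strip HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)       # remove tags such as <br />
    text = html.unescape(text)         # turn entities such as &amp; into characters
    text = NON_ALPHA_RE.sub(" ", text.lower())  # also drops digits (an assumption)
    return " ".join(text.split())


def preprocess(reviews):
    """Drop invalid helpfulness ratios, deduplicate, and clean the review text.

    `reviews` is a list of dicts using the hypothetical field names above.
    """
    seen = set()
    cleaned = []
    for review in reviews:
        # The helpfulness numerator must not exceed the denominator.
        if review["helpful_num"] > review["helpful_den"]:
            continue
        # Deduplicate on user ID, profile name, time, and raw text.
        key = (review["user_id"], review["profile_name"], review["time"], review["text"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**review, "text": clean_text(review["text"])})
    return cleaned
```

Calling `preprocess(rows)` returns a new list and leaves the input untouched. Only the first occurrence of each duplicate is kept.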

To set up and run the project:
- Install Python (version 3.x).
- Install the required libraries by running `pip install -r requirements.txt` in the terminal or command prompt.
- Run the `Amazon FineFood Sentiment KNN.ipynb` Jupyter notebook to see the complete workflow.
- Customize the code for your specific use case or dataset.
- Experiment with different text vectorization techniques and machine learning algorithms.
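
As a starting point for such experiments, the following dependency-free sketch compares Bag of Words and TF-IDF vectors under a cosine-similarity KNN vote and reports precision, recall, and F1. It is an illustration under assumptions: the toy reviews and labels are made up, every function name is hypothetical, and the notebook presumably relies on scikit-learn (for example `CountVectorizer`, `TfidfVectorizer`, and `KNeighborsClassifier`) rather than hand-written code. Cosine similarity is a choice made for this sketch, not necessarily the distance the notebook uses.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Bag of Words: raw term counts per whitespace-tokenized document."""
    return [Counter(doc.split()) for doc in docs]


def fit_idf(train_docs):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1, learned from training docs only."""
    n = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vectors(docs, idf):
    """TF-IDF: term counts scaled by IDF; terms unseen in training are dropped."""
    return [
        {term: count * idf[term] for term, count in Counter(doc.split()).items() if term in idf}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up toy reviews, assumed to be cleaned and lowercased already.
train_docs = [
    "great taste love it",
    "love this great product",
    "terrible taste awful",
    "awful product bad smell",
]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = ["great product love", "bad awful taste"]
test_labels = ["positive", "negative"]

idf = fit_idf(train_docs)
features = {
    "BoW": (bow_vectors(train_docs), bow_vectors(test_docs)),
    "TF-IDF": (tfidf_vectors(train_docs, idf), tfidf_vectors(test_docs, idf)),
}
for name, (train_vecs, test_vecs) in features.items():
    preds = [knn_predict(train_vecs, train_labels, q, k=3) for q in test_vecs]
    p, r, f1 = precision_recall_f1(test_labels, preds, positive="positive")
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Changing `k` or swapping the vectorizer function is enough to try another configuration. Meaningful comparisons need a held-out split of the real dataset, not a handful of toy sentences.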

The repository contains:
- `Amazon FineFood Sentiment KNN.ipynb`: Jupyter notebook containing source code and detailed explanations.
- `data/`: Directory containing input data.
- `README.md`: Project overview and usage guide.
- Programming Language: Python
- Main Libraries: pandas, numpy, scikit-learn, nltk, gensim, matplotlib, seaborn
- BewxSevez
This project is released under the MIT License.