This project presents an experimental approach to discovering abstract topics through text mining and performing sentiment classification with a deep learning framework on the IMDB Movie Review dataset. I used LDA (Latent Dirichlet Allocation) topic modeling to assign each movie review to a topic: the LDA model extracted 10 topics from the IMDB data and allocated the most relevant topic to each review based on its overall subject. In the next phase, I used the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to perform binary classification of IMDB reviews by sentiment polarity. I split the entire dataset in an 80:20 ratio for training and validation, and processed the data in batches to reduce the computational load during training. The model achieved 91.83% training accuracy and 89.94% validation accuracy on sentiment classification.
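The topic-allocation step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four sample reviews, the variable names, and the choice of scikit-learn's `CountVectorizer` and `LatentDirichletAllocation` are assumptions; only the number of topics (10) comes from the description.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for preprocessed IMDB reviews.
reviews = [
    "The acting was superb and the cast delivered a moving performance",
    "Terrible plot, weak script and a boring story from start to finish",
    "Great visual effects and stunning cinematography in every scene",
    "The soundtrack and the score made this film unforgettable",
]

# Vectorization: turn each review into a bag-of-words count vector.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)

# Model training: fit an LDA model with 10 topics.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_reviews, 10)

# Topic allocation: pick the topic with the highest weight for each review.
assigned_topics = doc_topics.argmax(axis=1)
print(assigned_topics)
```

Each row of `doc_topics` is a distribution over the 10 topics, so taking the `argmax` per row is one simple way to allocate "the most relevant topic" to a review.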
The dataset contains the following columns.
- review - IMDB Movie reviews.
- sentiment - Sentiment polarity of each review (Positive or Negative).
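As a rough sketch of the 80:20 training and validation split applied to data with these two columns: the sample rows, the `split_reviews` helper, and the fixed seed below are illustrative assumptions, and the notebook may well use a library routine such as scikit-learn's `train_test_split` instead of this standard-library version.

```python
import random

def split_reviews(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into (train, validation) lists."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows mirroring the two columns: review and sentiment.
rows = [
    {"review": f"sample review {i}",
     "sentiment": "positive" if i % 2 == 0 else "negative"}
    for i in range(10)
]

train_rows, val_rows = split_reviews(rows)
print(len(train_rows), len(val_rows))  # 8 2
```

Fixing the seed keeps the split reproducible between runs, which makes the reported training and validation accuracies easier to compare.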
i. Preprocessing of IMDB Reviews
ii. Popular Token Identification
a. Wordcloud Visualization from Positive reviews
b. Wordcloud Visualization from Negative reviews
iii. n-gram Analysis
a. Bigram Analysis
b. Trigram Analysis
iv. LDA Topic Modeling
a. Data Preparation
b. Vectorization
c. Model Training
d. Topic Allocation
e. Visualization of Topic Popularity
v. Sentiment Classification
a. Prepare Final Data
b. Generate Encoded Training and Validation Data
- Train-Validation Split
c. Implement BERT Model
d. Train Model
vi. Sentiment Prediction
a. Sentiment Prediction on Validation Data
b. Model Performance Analysis
c. Save Model
d. Sentiment Prediction on User-End Reviews (Sample Data)
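To illustrate the n-gram analysis step (iii) in the outline above, here is a minimal sketch of bigram and trigram counting using only the standard library. The `ngram_counts` helper, the crude regex tokenizer, and the two sample reviews are assumptions made for illustration; the notebook's own preprocessing (for example, stop-word removal with nltk) is not reproduced here.

```python
import re
from collections import Counter

def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical sample reviews.
reviews = [
    "One of the best movies I have seen",
    "Not one of the best, but one of the funniest",
]

bigrams = ngram_counts(reviews, 2)
trigrams = ngram_counts(reviews, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(2))
```

Running the same function separately on positive and negative reviews would give the per-class n-gram rankings.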
Install the following packages to run the code.
- pandas==1.3.5
- numpy==1.21.6
- tweet-preprocessor==0.5.0
- seaborn==0.11.2
- matplotlib==3.2.2
- networkx==2.6.3
- wordcloud==1.8.2.2
- nltk==3.7
- scikit-learn==1.0.2
- tqdm==4.64.1
- keras==2.9.0
- tensorflow==2.9.2
- transformers==4.18.0
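One way to confirm that the environment matches these pins is to compare them against the installed versions. The sketch below is only a suggestion: the `check_versions` helper is hypothetical, it checks just a subset of the list, and it relies on `importlib.metadata`, which requires Python 3.8 or newer.

```python
from importlib import metadata

# A few of the pins listed above, copied as-is.
PINNED = {
    "pandas": "1.3.5",
    "numpy": "1.21.6",
    "scikit-learn": "1.0.2",
    "tensorflow": "2.9.2",
    "transformers": "4.18.0",
}

def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If nothing is printed, every checked package matches its pinned version.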
The entire notebook was executed in Google Colaboratory with a GPU runtime enabled. To enable GPU support, follow these steps:
- Go to the "Edit" menu and click the "Notebook settings" option.
- Select "GPU" from the "Hardware accelerator" dropdown menu.
- Click the "Save" button to save the changes.
- Restart the kernel and run the code.
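After restarting, a quick check like the following can confirm that the GPU is visible to TensorFlow. This is a small sketch rather than part of the notebook: the `gpu_available` helper is a hypothetical name, and it simply reports `False` if TensorFlow is not installed.

```python
def gpu_available():
    """Return True if TensorFlow can see at least one GPU, False otherwise."""
    try:
        import tensorflow as tf
    except ImportError:
        return False  # TensorFlow is not installed in this environment
    return len(tf.config.list_physical_devices("GPU")) > 0

print("GPU available:", gpu_available())
```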