This repository contains the code and analysis for the Amazon Books Data Analysis Project, conducted for the course X_400645: Project Big Data at Vrije Universiteit Amsterdam during period P6/2024. The analysis explores trends in books, authors, and publishers, as well as how the behavior of Amazon Books users varies with their age and location.
Team members:
- Sinemis Toktaş
- Arda Cem Çakmak
- Isabelle de Beijer
- Haojia Lu

Course details:
- Course Name: X_400645: Project Big Data
- Period/Year: P6/2024
- Instructor: Dr. Alessandro Zocca
The full report is available in the Amazon_Books_Report.pdf file, detailing the following sections:
- Introduction: Overview of Amazon Books and the objectives of our analysis.
- Dataset Description: Description of the datasets used (books.csv, ratings.csv, and users.csv) and their structure.
- Data Cleaning: Steps taken to clean the datasets, including handling missing values and filtering erroneous data.
- Exploratory Data Analysis (EDA): In-depth analysis of the data to answer specific research questions, including:
  - Best & Worst Authors, Publishers, and Books for All and for Different Age Groups
  - The Effect of Location and Age on User Ratings
  - Gender Bias in Author Ratings
  - Sentiment Analysis of Book Titles
  - Prediction and Recommendation Models
- Conclusion: Summary of findings and implications of the analysis.
- References: Sources and references used in the report.
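As a rough illustration of the Data Cleaning step described above (handling missing values and filtering erroneous data), the following is a minimal sketch using pandas. It is not the notebook's actual code: the column names `user_id`, `age`, and `location`, the function `clean_users`, and the age thresholds are all assumptions chosen for the example, and the sample frame is made-up data standing in for users.csv.

```python
import pandas as pd


def clean_users(users: pd.DataFrame, min_age: int = 5, max_age: int = 100) -> pd.DataFrame:
    """Drop rows whose age is missing or implausible.

    The column name "age" and the default thresholds are illustrative
    assumptions, not values taken from the report.
    """
    # Handle missing values: remove users without a recorded age.
    cleaned = users.dropna(subset=["age"])
    # Filter erroneous data: keep ages inside the inclusive range.
    in_range = cleaned["age"].between(min_age, max_age)
    return cleaned[in_range].reset_index(drop=True)


# Made-up rows standing in for users.csv (hypothetical schema).
sample_users = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "age": [34.0, None, 244.0, 19.0],
        "location": ["amsterdam, nl", "utrecht, nl", "paris, fr", "berlin, de"],
    }
)

cleaned_users = clean_users(sample_users)
print(cleaned_users)
```

In this sketch the user with a missing age and the user with an age of 244 are removed, leaving two rows. The real datasets may call for different rules, for example imputing ages instead of dropping them.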
The code used for the analysis is provided in the Jupyter Notebook Amazon_Books_Analysis.ipynb. The notebook includes:
- Data loading and cleaning
- Exploratory data analysis and visualizations
- Statistical tests and models
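To give a sense of the kind of exploratory step listed above, here is a hedged sketch of comparing ratings across age groups with pandas. The column names (`user_id`, `age`, `rating`), the helper `mean_rating_by_age_group`, the age bins, and the sample rows are assumptions made for this example and are not taken from the notebook.

```python
import pandas as pd


def mean_rating_by_age_group(
    ratings: pd.DataFrame,
    users: pd.DataFrame,
    bins: list,
    labels: list,
) -> pd.Series:
    """Join ratings to users and average the rating within each age bin.

    All column names here are hypothetical placeholders for the real schema.
    """
    # Attach each rating to the age of the user who gave it.
    merged = ratings.merge(users[["user_id", "age"]], on="user_id", how="inner")
    # Bucket ages into labelled, right-inclusive intervals.
    merged["age_group"] = pd.cut(merged["age"], bins=bins, labels=labels)
    # Average rating per age group, skipping groups with no ratings.
    return merged.groupby("age_group", observed=True)["rating"].mean()


# Made-up rows standing in for users.csv and ratings.csv.
example_users = pd.DataFrame({"user_id": [1, 2, 3], "age": [16, 25, 52]})
example_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "rating": [8, 6, 10, 4]}
)

group_means = mean_rating_by_age_group(
    example_ratings,
    example_users,
    bins=[0, 18, 40, 100],
    labels=["0-18", "19-40", "41-100"],
)
print(group_means)
```

The same pattern extends to authors or publishers by adding the relevant column to the `groupby` keys, assuming the books data has been merged in first.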
The project presentation is available in the Amazon_Books_Presentation.pdf file. It reflects the state of the project at its midpoint and covers our interim findings, methodology, and progress at that time.
This project is licensed under the MIT License - see the LICENSE file for details.