A small toolkit and analysis pipeline for collecting, cleaning, feature-building, and modeling product/brand sentiment (and sarcasm) from social sources. This repository contains scripts and notebooks used to clean raw data, extract features, run EDA, train models, and produce per-brand sentiment summaries.
- Data cleaning scripts for brand-specific raw exports
- Exploratory Data Analysis (EDA) scripts and outputs
- Feature engineering pipeline (TF-IDF, scaler, feature matrices)
- Models for sarcasm detection and sentiment classification (scikit-learn jobs)
- Scripts to run the end-to-end analysis and produce brand-level summaries
Top-level files
- `build_feature_matrix.py` — build the TF-IDF / feature matrix used for training and inference
- `train_sentiment_svm.py` — train the sentiment classifier (Linear SVC saved to `models/`)
- `train_sarcasm_detector.py` — train the sarcasm detector (Linear SVC saved to `models/`)
- `analyze_product_sentiment.py` — produce product/brand sentiment summary CSVs
- `process_ndjson_and_features.py` — helper to process NDJSON exports and generate features
- `utils_lexicon.py` — small utility functions for lexicon-based features
- `requirements.txt` — Python package dependencies
- `nb.ipynb` — notebook for exploratory/interactive work
Directories
- `data cleaning/` — brand-specific cleaning scripts (e.g. `data_cleaning_chanel.py`)
- `data extraction/` — raw data extraction/filtering scripts
- `EDA files/` and `eda_outputs_*` — EDA scripts and generated outputs per brand
- `features/` — generated feature artifacts (TF-IDF, scaler, X matrices)
- `models/` — trained model artifacts (joblib files)
- `processed/` — processed CSVs from NDJSON sources
Example data files included
- `chanel_matches.ndjson`, `gucci_matches.ndjson`, `hermes_hits.ndjson` — raw NDJSON exports
- `processed/*.processed.csv` — processed outputs used for modeling/analysis
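Each NDJSON export holds one JSON object per line. A minimal stdlib sketch of parsing such a file — note the `text` field name here is an assumption for illustration; the real exports may use different keys:

```python
import io
import json

# Stand-in for an open NDJSON file such as gucci_matches.ndjson;
# the "text" key is hypothetical and may differ in the real exports.
raw = io.StringIO(
    '{"id": 1, "text": "love this bag"}\n'
    '{"id": 2, "text": "not impressed"}\n'
)

# one JSON object per non-empty line
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]["text"])
```

For real files, replace the `StringIO` with `open("gucci_matches.ndjson", encoding="utf-8")`.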
- Python 3.8+ (a virtual environment is recommended)
- pip
- Create and activate a virtual environment

```
# Windows (cmd.exe)
python -m venv .venv
.venv\Scripts\activate
```

- Install dependencies
```
pip install -r requirements.txt
```

- Clean raw brand NDJSON exports (scripts in `data cleaning/`)
```
python "data cleaning/data_cleaning_gucci.py"
# or for Chanel/Hermes
python "data cleaning/data_cleaning_chanel.py"
python "data cleaning/data_cleaning_hermes.py"
```

- Build feature matrix
```
python build_feature_matrix.py
```

This produces artifacts under `features/` such as `tfidf.joblib`, `scaler.joblib`, and `X_all.npz`.
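For intuition, the TF-IDF weighting that the saved vectorizer encodes can be sketched in plain Python. This is only an illustration of the idea — scikit-learn's actual formula adds smoothing and normalization, so its numbers will differ:

```python
import math
from collections import Counter

docs = [
    "love this bag",
    "love the quality",
    "terrible quality control",
]

# document frequency: in how many docs each term appears
df = Counter(term for doc in docs for term in set(doc.split()))
n_docs = len(docs)

def tfidf(doc):
    counts = Counter(doc.split())
    total = sum(counts.values())
    # term frequency times inverse document frequency
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tfidf(docs[0])
# "love" occurs in two of three docs, so it is down-weighted relative to "bag"
print(weights)
```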
- Train models

```
python train_sentiment_svm.py
python train_sarcasm_detector.py
```

Trained models are saved to `models/` (e.g. `sentiment_linsvc.joblib`).
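The training scripts persist fitted estimators with joblib. A self-contained sketch of that save/load round trip on toy data — the texts, labels, and file name below are made up for illustration and are not the repository's real data or artifact names:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, clearly separable training data (illustrative only)
texts = ["love this bag", "great quality", "awful service", "terrible fit"]
labels = ["pos", "pos", "neg", "neg"]

# Vectorizer + Linear SVC, mirroring the TF-IDF / LinearSVC setup
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Persist and reload, as the training and analysis scripts do
path = os.path.join(tempfile.mkdtemp(), "sentiment_demo.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

print(loaded.predict(["really great bag"])[0])
```

Bundling the vectorizer and classifier in one pipeline keeps inference simple: the loaded object accepts raw text directly.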
- Run analysis / produce summary outputs

```
python analyze_product_sentiment.py
```

Results and EDA outputs are saved in the `analysis/` and `eda_outputs_*` directories.
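A per-brand summary of the kind this step writes can be sketched with the stdlib. The brands, sentiment labels, and column names below are invented for illustration and need not match the script's actual output schema:

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical (brand, predicted sentiment) pairs, as produced upstream
rows = [
    ("gucci", "pos"), ("gucci", "neg"), ("gucci", "pos"),
    ("chanel", "neg"), ("chanel", "neg"),
]

# tally sentiment counts per brand
summary = defaultdict(Counter)
for brand, sentiment in rows:
    summary[brand][sentiment] += 1

# write a brand-level summary CSV
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["brand", "pos", "neg"])
for brand, counts in sorted(summary.items()):
    writer.writerow([brand, counts["pos"], counts["neg"]])

print(buf.getvalue())
```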
Open `nb.ipynb` for the interactive exploration and visualization steps used during EDA.
- `features/meta.csv` contains metadata about generated features
- `features/X_all.npz` is the feature matrix used for training
- `models/` contains model artifacts used for inference
- `eda_outputs_*` directories contain CSVs with EDA results (top words, sentiment counts, etc.)
- If you add a new data source or brand, include a brand-specific cleaning script in `data cleaning/` and add any necessary preprocessing steps to `process_ndjson_and_features.py` or `build_feature_matrix.py`.
- Keep models in `models/` and features in `features/` (do not commit large binary files if they are regenerated by CI or local runs).
This repository does not include an explicit license file. If you plan to share publicly, add a LICENSE file (e.g., MIT) or contact the repo owner for guidance.
If you need help running the pipeline or extending it to a new brand, open an issue or contact the repository owner listed in your project management system.