My work on the Kaggle Predict Future Sales competition. This is a way for me to explore and share the following concepts:
- Application structure:
- Environment
- Configuration
- Functionality
- Parameters
- Data
- Technical Design with Draw.io
- Unit Tests with PyTest
- Docker (TBC)
- Parallel processing with Dask
- Time Series with Facebook Prophet
- Downcasting
- Working in the GCP Cloud Environment
- Using GCP services including Compute Engine, BigQuery, Cloud Storage (GCS)
- Modelling with MLFlow (TBC)
- ML Visualisation with Yellowbrick (TBC)
- Outlier Detection with PyOD
Technical examples of some of the above concepts are available via the Jupyter Notebooks saved in the ./ml_app/examples directory.
Prior to execution, you will need to:
- Install Anaconda
- Clone this repo
git clone https://github.com/Tommo565/kaggle-predict-sales. - Create a GCP project, a GCS bucket and subfolders in the bucket as per the
[config_example.py](./config/config_example.py)file. Ensure this is saved asconfig.py - Create a GCP credentials token with access to GCS, BigQuery & Compute Engine and save this into the
ml_app/configdirectory asgcp_token.json - Download and save the Kaggle Predict Sales Datasets into the appropriate GCP buckets according to the
config.pyfile. - Run the steps in Execution below.
- Add your environment kernel to Jupyter This works for both your local notebook and a cloud based one.
- Optional: Set up a GCP Compute Engine instance with as many cores as you can afford that can run a secure Jupyter notebook. Also add some extra storage. Some instructions & further reading:
cd ml app
conda env create -f environment/environment.yml
conda activate kaggle-predict-sales
python app/main.py runcd ml_app
python -m pytest -vTODO: What this does at a high level.
├── README.md
├── analysis
├── data
├── img
└── ml_app
├── __init__.py
├── analysis
├── app
│ ├── __init__.py
│ ├── feature_engineering
│ ├── import_merge
│ ├── models
│ └── utils
├── config
│ └── __init__.py
├── environment
├── main.py
├── parameters
│ └── __init__.py
└── test
Details of the various processing modules... TBC
- Kaggle Predict Future Sales
- Kaggle Predict Sales Datasets
- Anaconda
- Guide to Software Testing
- The 12 Factor app
- Pytest
- Dask Site
- Dask Tutorial
- Dask API Reference
- Downcasting in Pandas
- Facebook Prophet Quickstart
- MLFlow
- Yellowbrick
- Awesome Machine Learning
- Awesome Production Machine Learning
- PyOD
- PyOD Tutorial
