- Modular Analysis Structure: Separate notebooks for individual analyses with dedicated data storage
- Pipeline Framework: Orchestrated execution of data processing functions
- Environment Management: Virtual environment setup with dependency tracking
- Version Control Ready: Pre-configured .gitignore for data science projects
- Scalable Architecture: Easy addition of new analyses and pipeline functions
- Project Structure
- Prerequisites
- Quick Start
- How to Use the Project
- Adding New Content
- Examples
- Troubleshooting
- Contributing
project/
├── analysis/ # Individual analysis notebooks
│ └── example_analysis.ipynb
├── data/
│ ├── analysis/ # Analysis-specific data
│ │ └── example_analysis/
│ ├── raw/ # Initial, unprocessed data
│ ├── processed/ # Cleaned and processed data
│ └── [pipeline_outputs]/ # Additional folders created by pipeline
├── src/
│ ├── project_execution.ipynb # Main pipeline orchestration
│ └── project_functions.py # Reusable pipeline functions
├── utils/ # Shared utilities and helper functions
├── .gitignore # Excludes data/ and venv/ from version control
├── data_structure.txt # Documentation of data folder structure
├── requirements.txt # Project dependencies
└── README.md # This file
- `analysis/`:
  - Contains Jupyter notebooks for individual analyses. Each notebook has its own folder in `data/analysis/` to store generated or utilized data.
- `data/`:
  - `analysis/`: Analysis-specific data, organized by notebook name.
  - `raw/`: Initial, unprocessed data files.
  - `processed/`: Processed data derived from the `raw/` folder. This folder is typically generated by the first function in the project pipeline.
  - Additional folders may exist to store files generated by the project pipeline. These folders can be organized based on the type of output or the specific functions that create them.
- `src/`:
  - `project_execution.ipynb`: Orchestrates and displays the project pipeline.
  - `project_functions.py`: Contains reusable functions used in the pipeline.
- `utils/`:
  - Useful and reusable Python modules for analyses or other projects.
- `.gitignore`:
  - Excludes the `data/` and `venv/` folders from version control.
- `data_structure.txt`:
  - Documents the structure of the `data/` folder (generated by `tree -asD data > data_structure.txt`).
- `requirements.txt`:
  - Lists all dependencies required for the project.
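The exact contents depend on the project; as a purely illustrative sketch based on the libraries used in the examples below, it might list:

jupyterlab
ipykernel
pandas
scikit-learn
tqdm
joblib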
- Python 3.8+
- Git (for version control)
- Jupyter Notebook/Lab (for running notebooks)
- pip (Python package installer)
# Clone the repository
git clone <repository-url>
cd <project-name>
# Create and activate virtual environment
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Start Jupyter
jupyter lab

Create a Virtual Environment:
python -m venv venv

Activate the Virtual Environment:
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Install Dependencies:
pip install -r requirements.txt

Execute the Project Pipeline:
- Open `src/project_execution.ipynb` in Jupyter Lab/Notebook
- This notebook orchestrates the various functions and processes defined in the project
- Execute cells sequentially to run the complete workflow
Run Individual Analyses:
- Navigate to the `analysis/` folder
- Open any Jupyter notebook (e.g., `example_analysis.ipynb`)
- These notebooks are independent and can be executed separately from the main pipeline
- Place raw data files in `data/raw/`
- Processed data will be automatically saved to `data/processed/`
- Analysis-specific data is stored in `data/analysis/[notebook_name]/`
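For example, from a notebook in `analysis/`, this convention looks roughly like the following (file names are placeholders):

# Hypothetical cells in an analysis notebook, run from the analysis/ folder
import pandas as pd

# Read raw input placed in data/raw/
df = pd.read_csv('../data/raw/dataset.csv')

# ...analysis steps...

# Store notebook-specific outputs under data/analysis/<notebook_name>/
df.to_csv('../data/analysis/example_analysis/results.csv', index=False)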
Create a New Jupyter Notebook:
# Navigate to analysis folder
cd analysis/
# Create new notebook (or use Jupyter interface)
touch new_analysis.ipynb
Create a Corresponding Data Folder:
mkdir data/analysis/new_analysis
Implement Your Analysis:
- Write your analysis code in the new notebook
- Save output data files in `data/analysis/new_analysis/`
- Import utilities from the `utils/` folder as needed
Document Your Analysis:
- Include markdown cells explaining your methodology
- Document key findings and conclusions
- Add comments to complex code sections
Test Your Function:
- Develop and test your function in a separate Jupyter notebook first
- Ensure it handles edge cases and errors appropriately
Define Your Function in `src/project_functions.py`:
def new_function(input_data, output_path, **kwargs):
    """
    Brief description of what the function does.

    Parameters:
    -----------
    input_data : str or pd.DataFrame
        Description of the input parameter
    output_path : str
        Path where output files will be saved
    **kwargs : dict
        Additional parameters for function customization

    Returns:
    --------
    bool or str
        Description of the return value (success status or output path)

    Raises:
    -------
    ValueError
        If input validation fails
    FileNotFoundError
        If required input files don't exist
    """
    try:
        # Function implementation here
        print(f"Processing {input_data}...")

        # Your logic here
        result = process_data(input_data)

        # Save results
        save_results(result, output_path)

        print(f"✅ Function completed successfully. Output saved to {output_path}")
        return output_path

    except Exception as e:
        print(f"❌ Error in new_function: {str(e)}")
        raise
Integrate into the Pipeline:
- Open `src/project_execution.ipynb`
- Add your function to the execution list:
pipeline_steps = [
    {
        'function': pu.existing_function,
        'execute': True,
        'args': ['data/raw/input.csv', 'data/processed/'],
        'description': 'Processes raw data'
    },
    {
        'function': pu.new_function,
        'execute': True,
        'args': ['data/processed/input.csv', 'data/outputs/', {'param1': 'value1'}],
        'description': 'Your new function description'
    }
]
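How this list is consumed depends on the driver cell in `src/project_execution.ipynb`; a minimal sketch of such a loop (the trailing-dict-as-kwargs convention is an assumption, not part of the template) could look like:

# Hypothetical driver cell in src/project_execution.ipynb
for step in pipeline_steps:
    if not step['execute']:
        print(f"Skipping: {step['description']}")
        continue
    print(f"Running: {step['description']}")
    args = list(step['args'])
    # Assumed convention: a trailing dict in 'args' holds keyword arguments
    kwargs = args.pop() if args and isinstance(args[-1], dict) else {}
    step['function'](*args, **kwargs)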
Create the analysis notebook:
# In analysis/data_cleaning.ipynb
import pandas as pd
import sys
sys.path.append('../utils')
from data_helpers import load_data, save_clean_data

# Load data
df = load_data('../data/raw/dataset.csv')

# Perform cleaning
df_clean = df.dropna().reset_index(drop=True)

# Save results
save_clean_data(df_clean, '../data/analysis/data_cleaning/cleaned_dataset.csv')
Data folder structure:
data/analysis/data_cleaning/
├── cleaned_dataset.csv
├── cleaning_report.html
└── data_quality_plots.png
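The notebook above imports `load_data` and `save_clean_data` from a `data_helpers` module. If `utils/` does not already provide one, a minimal sketch of a hypothetical `utils/data_helpers.py` could be:

# utils/data_helpers.py (hypothetical helper module used by the example above)
import os
import pandas as pd

def load_data(path):
    """Read a CSV file into a DataFrame."""
    return pd.read_csv(path)

def save_clean_data(df, path):
    """Save a cleaned DataFrame, creating the target folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)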
# In src/project_functions.py
def feature_engineering(input_path, output_path, feature_config=None):
"""
Creates engineered features from processed data.
Parameters:
-----------
input_path : str
Path to the processed data file
output_path : str
Path to save engineered features
feature_config : dict, optional
Configuration for feature creation
"""
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load data
df = pd.read_csv(input_path)
# Create features
df['feature_1'] = df['column_a'] * df['column_b']
df['feature_2'] = df['column_c'].rolling(window=3).mean()
# Scale features if requested
if feature_config and feature_config.get('scale_features', False):
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
# Save results
df.to_csv(output_path, index=False)
print(f"✅ Feature engineering completed. Features saved to {output_path}")
    return output_path
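As recommended above, such a function can be tried on its own in a notebook before wiring it into pipeline_steps; a rough sketch (paths are placeholders, and the import assumes the notebook can see `src/`):

# Hypothetical quick check in a notebook, run before adding the pipeline step
import sys
sys.path.append('../src')
from project_functions import feature_engineering

feature_engineering(
    '../data/processed/input.csv',
    '../data/processed/features.csv',
    feature_config={'scale_features': True},
)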
Virtual Environment Issues:
# If activation fails, try:
python -m pip install virtualenv
python -m virtualenv venv

Jupyter Kernel Issues:
# Install ipykernel in your virtual environment
pip install ipykernel
python -m ipykernel install --user --name=venv

Import Errors:
- Ensure your virtual environment is activated
- Check that all dependencies are installed: `pip install -r requirements.txt`
- Verify that `utils/` modules are importable by adding `sys.path.append('../utils')` in notebooks
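A quick way to confirm which interpreter a notebook is actually using, and to make `utils/` importable (sketch):

# Run in a notebook cell to diagnose import problems
import sys
print(sys.executable)        # should point inside venv/ when the right kernel is active
sys.path.append('../utils')  # make shared modules in utils/ importable from analysis/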
Data Path Issues:
- Use relative paths from the notebook's location
- Ensure data folders exist before running functions
- Check file permissions for read/write access
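A small defensive pattern along these lines (folder names assumed from the project structure above):

# Build paths relative to the notebook and create missing output folders
from pathlib import Path

raw_dir = Path('../data/raw')
output_dir = Path('../data/analysis/example_analysis')
output_dir.mkdir(parents=True, exist_ok=True)  # ensure the folder exists before writing
assert raw_dir.exists(), f"Expected raw data folder at {raw_dir.resolve()}"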
- Use `pandas.read_csv(chunksize=1000)` for large datasets
- Implement progress bars with `tqdm` for long-running processes
- Use `pickle` or `joblib` to cache intermediate results
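Combined, these tips look roughly like the following (file names and the per-chunk filter are placeholders):

# Read a large CSV in chunks with a progress bar, then cache the result
import pandas as pd
import joblib
from tqdm import tqdm

chunks = []
for chunk in tqdm(pd.read_csv('../data/raw/large_dataset.csv', chunksize=1000)):
    chunks.append(chunk[chunk['value'] > 0])  # placeholder per-chunk processing
df = pd.concat(chunks, ignore_index=True)

# Cache the intermediate result so later runs can skip the expensive loop
joblib.dump(df, '../data/processed/large_dataset_cache.joblib')
# df = joblib.load('../data/processed/large_dataset_cache.joblib')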
- Follow PEP 8 for Python code formatting
- Use descriptive variable and function names
- Include docstrings for all functions
- Add type hints where appropriate
- Test new functions in isolation before adding to pipeline
- Include error handling and input validation
- Document expected input/output formats
- Update this README when adding major features
- Include inline comments for complex logic
- Create examples for new functionality
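A short function in the spirit of these guidelines (the names and the expected 'value' column are illustrative, not part of the template):

import os
import pandas as pd

def filter_rows(input_path: str, output_path: str, min_value: float = 0.0) -> str:
    """Keep rows of input_path whose 'value' column is at least min_value.

    Returns the path of the written CSV.
    """
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Input file not found: {input_path}")

    df = pd.read_csv(input_path)
    if 'value' not in df.columns:
        raise ValueError("Expected a 'value' column in the input data")

    df[df['value'] >= min_value].to_csv(output_path, index=False)
    return output_path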
- Create a feature branch from main
- Test your changes thoroughly
- Update documentation as needed
- Submit a pull request with clear description
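In practice, that workflow amounts to something like the following (branch and commit names are examples):

git checkout -b feature/my-analysis
# ...make and test your changes...
git add .
git commit -m "Add my-analysis notebook and pipeline step"
git push origin feature/my-analysis
# then open a pull request against main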
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or suggestions, please open an issue.