Research Project Template

A structured template for data science and research projects that provides a clear separation between exploratory analysis and production pipelines. This template promotes reproducibility, organization, and collaborative development.

Key Features

  • Modular Analysis Structure: Separate notebooks for individual analyses with dedicated data storage
  • Pipeline Framework: Orchestrated execution of data processing functions
  • Environment Management: Virtual environment setup with dependency tracking
  • Version Control Ready: Pre-configured .gitignore for data science projects
  • Scalable Architecture: Easy addition of new analyses and pipeline functions

Table of Contents

  1. Project Structure
  2. Prerequisites
  3. Quick Start
  4. How to Use the Project
  5. Adding New Content
  6. Examples
  7. Troubleshooting
  8. Contributing

Project Structure

project/
├── analysis/                    # Individual analysis notebooks
│   └── example_analysis.ipynb
├── data/
│   ├── analysis/               # Analysis-specific data
│   │   └── example_analysis/
│   ├── raw/                    # Initial, unprocessed data
│   ├── processed/              # Cleaned and processed data
│   └── [pipeline_outputs]/     # Additional folders created by pipeline
├── src/
│   ├── project_execution.ipynb # Main pipeline orchestration
│   └── project_functions.py    # Reusable pipeline functions
├── utils/                      # Shared utilities and helper functions
├── .gitignore                  # Excludes data/ and venv/ from version control
├── data_structure.txt          # Documentation of data folder structure
├── requirements.txt            # Project dependencies
└── README.md                   # This file

Description of Folders and Files

  • analysis/:

    • Contains Jupyter notebooks for individual analyses. Each notebook has its own folder in data/analysis/ to store the data it generates or uses.
  • data/:

    • analysis/: Analysis-specific data, organized by notebook name.
    • raw/: Initial, unprocessed data files.
    • processed/: Processed data derived from the raw/ folder, typically generated by the first function in the project pipeline.
    • Additional folders may exist to store files generated by the project pipeline. These folders can be organized based on the type of output or the specific functions that create them.
  • src/:

    • project_execution.ipynb: Orchestrates the project pipeline and displays its progress and outputs.
    • project_functions.py: Contains reusable functions used in the pipeline.
  • utils/:

    • Reusable Python modules and helper functions for analyses in this or other projects (a sketch of a typical helper module appears at the end of this list).
  • .gitignore:

    • Excludes the data/ and venv/ folders from version control.
  • data_structure.txt:

    • Documents the structure of the data/ folder (generated by tree -asD data > data_structure.txt).
  • requirements.txt:

    • Lists all dependencies required for the project.
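
The analysis example later in this README imports load_data and save_clean_data from a data_helpers module. A minimal sketch of what such a module in utils/ could look like (the module and its functions are illustrative, not shipped with the template):

# utils/data_helpers.py -- illustrative sketch; adapt to your project's needs
from pathlib import Path

import pandas as pd


def load_data(path: str) -> pd.DataFrame:
    """Load a CSV file into a DataFrame, failing early if the file is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    return pd.read_csv(path)


def save_clean_data(df: pd.DataFrame, path: str) -> str:
    """Save a cleaned DataFrame to CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return str(path)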

Prerequisites

  • Python 3.8+
  • Git (for version control)
  • Jupyter Notebook/Lab (for running notebooks)
  • pip (Python package installer)

Quick Start

# Clone the repository
git clone <repository-url>
cd <project-name>

# Create and activate virtual environment
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start Jupyter
jupyter lab

How to Use the Project

1. Environment Setup

Create a Virtual Environment:

python -m venv venv

Activate the Virtual Environment:

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

Install Dependencies:

pip install -r requirements.txt

2. Running the Project

Execute the Project Pipeline:

  • Open src/project_execution.ipynb in Jupyter Lab/Notebook
  • This notebook orchestrates the functions defined in src/project_functions.py (a minimal sketch of a typical first cell is shown after this list)
  • Execute cells sequentially to run the complete workflow
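
A minimal sketch of what the first cell of src/project_execution.ipynb might contain (the pu alias matches the pipeline examples later in this README):

# First cell of src/project_execution.ipynb (illustrative)
import sys

sys.path.append('../utils')      # make shared utilities importable

import project_functions as pu   # reusable pipeline functions, aliased as pu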

Run Individual Analyses:

  • Navigate to the analysis/ folder
  • Open any Jupyter notebook (e.g., example_analysis.ipynb)
  • These notebooks are independent and can be executed separately from the main pipeline

3. Data Organization

  • Place raw data files in data/raw/
  • Processed data is written to data/processed/ by the pipeline (typically by its first function)
  • Analysis-specific data is stored in data/analysis/[notebook_name]/ (a pathlib sketch for keeping these paths consistent follows this list)
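
Inside a notebook, one way to keep these paths consistent is to define them once with pathlib (a sketch; the prefix assumes the notebook lives in analysis/, and example_analysis is the notebook's name):

from pathlib import Path

# Paths relative to a notebook in analysis/ (adjust the prefix for src/)
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
ANALYSIS_DIR = DATA_DIR / 'analysis' / 'example_analysis'

# Create the analysis-specific folder if it does not exist yet
ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)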

Adding New Content

Adding a New Analysis

  1. Create a New Jupyter Notebook:

    # Navigate to analysis folder
    cd analysis/
    # Create new notebook (or use Jupyter interface)
    touch new_analysis.ipynb
  2. Create a Corresponding Data Folder:

    mkdir data/analysis/new_analysis
  3. Implement Your Analysis:

    • Write your analysis code in the new notebook
    • Save output data files in data/analysis/new_analysis/
    • Import utilities from the utils/ folder as needed
  4. Document Your Analysis:

    • Include markdown cells explaining your methodology
    • Document key findings and conclusions
    • Add comments to complex code sections

Adding a Function to the Project Pipeline

  1. Test Your Function:

    • Develop and test your function in a separate Jupyter notebook first
    • Ensure it handles edge cases and errors appropriately
  2. Define Your Function in src/project_functions.py:

    def new_function(input_data, output_path, **kwargs):
        """
        Brief description of what the function does.
        
        Parameters:
        -----------
        input_data : str or pd.DataFrame
            Description of the input parameter
        output_path : str
            Path where output files will be saved
        **kwargs : dict
            Additional parameters for function customization
        
        Returns:
        --------
        bool or str
            Description of the return value (success status or output path)
        
        Raises:
        -------
        ValueError
            If input validation fails
        FileNotFoundError
            If required input files don't exist
        """
        try:
            # Function implementation here
            print(f"Processing {input_data}...")
            
            # Your logic here
            result = process_data(input_data)
            
            # Save results
            save_results(result, output_path)
            
            print(f"✅ Function completed successfully. Output saved to {output_path}")
            return output_path
            
        except Exception as e:
            print(f"❌ Error in new_function: {str(e)}")
            raise
  3. Integrate into the Pipeline:

    • Open src/project_execution.ipynb
    • Add your function to the execution list (a sketch of how such a list might be driven follows the example):
    pipeline_steps = [
        {
            'function': pu.existing_function,
            'execute': True,
            'args': ['data/raw/input.csv', 'data/processed/'],
            'description': 'Processes raw data'
        },
        {
            'function': pu.new_function,
            'execute': True,
            'args': ['data/processed/input.csv', 'data/outputs/', {'param1': 'value1'}],
            'description': 'Your new function description'
        }
    ]
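
The template leaves the execution mechanism up to you; a minimal sketch of a runner cell that walks pipeline_steps could look like this (it treats a trailing dict in args as keyword arguments, matching the new_function entry above):

# Illustrative runner cell in src/project_execution.ipynb
for step in pipeline_steps:
    if not step['execute']:
        print(f"Skipping: {step['description']}")
        continue

    args = list(step['args'])
    # Treat a trailing dict as keyword arguments (as in the new_function entry)
    kwargs = args.pop() if args and isinstance(args[-1], dict) else {}

    print(f"Running: {step['description']}")
    step['function'](*args, **kwargs)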

Examples

Example: Adding a Data Cleaning Analysis

  1. Create the analysis notebook:

    # In analysis/data_cleaning.ipynb
    
    import pandas as pd
    import sys
    sys.path.append('../utils')
    from data_helpers import load_data, save_clean_data
    
    # Load data
    df = load_data('../data/raw/dataset.csv')
    
    # Perform cleaning
    df_clean = df.dropna().reset_index(drop=True)
    
    # Save results
    save_clean_data(df_clean, '../data/analysis/data_cleaning/cleaned_dataset.csv')
  2. Data folder structure:

    data/analysis/data_cleaning/
    ├── cleaned_dataset.csv
    ├── cleaning_report.html
    └── data_quality_plots.png
    

Example: Adding a Pipeline Function

# In src/project_functions.py

def feature_engineering(input_path, output_path, feature_config=None):
    """
    Creates engineered features from processed data.
    
    Parameters:
    -----------
    input_path : str
        Path to the processed data file
    output_path : str  
        Path to save engineered features
    feature_config : dict, optional
        Configuration for feature creation (e.g., {'scale_features': True})

    Returns:
    --------
    str
        Path to the saved feature file
    """
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    # Load data
    df = pd.read_csv(input_path)
    
    # Create features
    df['feature_1'] = df['column_a'] * df['column_b']
    df['feature_2'] = df['column_c'].rolling(window=3).mean()
    
    # Scale features if requested
    if feature_config and feature_config.get('scale_features', False):
        scaler = StandardScaler()
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    # Save results
    df.to_csv(output_path, index=False)
    print(f"✅ Feature engineering completed. Features saved to {output_path}")
    
    return output_path
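
Once defined, the function can be called directly for a quick check, or appended to pipeline_steps like any other step (the paths and the scale_features flag below are illustrative):

# Quick standalone check (illustrative paths)
feature_engineering(
    'data/processed/dataset.csv',
    'data/processed/features.csv',
    feature_config={'scale_features': True},
)

# Or as an entry in pipeline_steps (src/project_execution.ipynb), using the
# trailing-dict-as-kwargs convention from the runner sketch above
pipeline_steps.append({
    'function': pu.feature_engineering,
    'execute': True,
    'args': ['data/processed/dataset.csv', 'data/processed/features.csv',
             {'feature_config': {'scale_features': True}}],
    'description': 'Creates engineered features from processed data',
})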

Troubleshooting

Common Issues

Virtual Environment Issues:

# If activation fails, try:
python -m pip install virtualenv
python -m virtualenv venv

Jupyter Kernel Issues:

# Install ipykernel in your virtual environment
pip install ipykernel
python -m ipykernel install --user --name=venv

Import Errors:

  • Ensure your virtual environment is activated
  • Check that all dependencies are installed: pip install -r requirements.txt
  • Verify that utils/ modules are importable by adding sys.path.append('../utils') in notebooks

Data Path Issues:

  • Use relative paths from the notebook's location
  • Ensure data folders exist before running functions
  • Check file permissions for read/write access

Performance Tips

  • Pass chunksize to pandas.read_csv (e.g., pd.read_csv(path, chunksize=100_000)) to process large datasets in chunks
  • Implement progress bars with tqdm for long-running processes
  • Use pickle or joblib to cache intermediate results (a combined sketch follows this list)
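
A small sketch combining these three tips (the file names, column name, and filter are placeholders):

import joblib
import pandas as pd
from tqdm import tqdm

# Read a large CSV in chunks instead of loading it all at once
chunks = []
for chunk in tqdm(pd.read_csv('data/raw/large_dataset.csv', chunksize=100_000)):
    chunks.append(chunk[chunk['value'] > 0])   # example per-chunk filtering
df = pd.concat(chunks, ignore_index=True)

# Cache the intermediate result so later runs can skip the expensive step
joblib.dump(df, 'data/processed/filtered.joblib')
df = joblib.load('data/processed/filtered.joblib')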

Contributing

Code Style

  • Follow PEP 8 for Python code formatting
  • Use descriptive variable and function names
  • Include docstrings for all functions
  • Add type hints where appropriate

Testing

  • Test new functions in isolation before adding them to the pipeline (a minimal pytest sketch follows this list)
  • Include error handling and input validation
  • Document expected input/output formats
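
A minimal pytest sketch for testing a pipeline function in isolation, using the feature_engineering example above (the tests/ folder and pytest itself are not part of the template; add pytest to requirements.txt and run it from the project root if you adopt this):

# tests/test_project_functions.py (illustrative)
import sys

import pandas as pd

sys.path.append('src')   # assumes pytest runs from the project root
from project_functions import feature_engineering


def test_feature_engineering_creates_output(tmp_path):
    # Build a tiny input file with the columns the function expects
    input_path = tmp_path / 'input.csv'
    output_path = tmp_path / 'features.csv'
    pd.DataFrame({
        'column_a': [1, 2, 3],
        'column_b': [4, 5, 6],
        'column_c': [7.0, 8.0, 9.0],
    }).to_csv(input_path, index=False)

    result = feature_engineering(str(input_path), str(output_path))

    assert result == str(output_path)
    out = pd.read_csv(output_path)
    assert 'feature_1' in out.columns and 'feature_2' in out.columns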

Documentation

  • Update this README when adding major features
  • Include inline comments for complex logic
  • Create examples for new functionality

Pull Request Process

  1. Create a feature branch from main
  2. Test your changes thoroughly
  3. Update documentation as needed
  4. Submit a pull request with clear description

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or suggestions, please open an issue.
