A quantitative analysis of the OSMI 2016 Survey that applies unsupervised machine learning to responses from over 1,400 technology professionals, identifies mental health profile clusters among employees, and proposes a targeted HR intervention for each cluster.
Problem Context: The intersection of corporate technology environments and mental health has become a focal point for organizational development. This case study moves beyond reactive crisis management to a pre-emptive mitigation program supported by quantitative analysis.
Proposed Solution: Categorize employees into distinct personas based on their mental health status, workplace sentiment, and demographic traits, so that Human Resources can tailor support programs rather than apply a generic approach.
Link to the full Case Study report
The analysis identified three distinct clusters (k=3), validated via stability analysis (Adjusted Rand Index, ARI = 0.90) and hierarchical clustering. The results suggest that mental health is a spectrum and that management practices, rather than medical severity itself, dictate the employee experience.
| Cluster | Persona | Profile Summary | Intervention |
|---|---|---|---|
| 0 | Healthy Baseline | Low Risk / Disengaged. (~36% of workforce). Predominantly without disorders, but scores highest for unawareness of benefits. | Preventative education: Shift from crisis response to proactive benefit education during onboarding. |
| 1 | Supported Risk | High Risk / Positive Experience. (~32%). High prevalence of disorders, yet reports positive workplace support and mental health management. | Reinforce safety: Maintain trust via strict anonymity protocols to prevent regression. |
| 2 | Unsupported Risk | High Risk / Negative Experience. (~32%). High prevalence of disorders, and reports high structural barriers and poor workplace support. | Mitigate structural barriers: Audit administrative workflows to ensure seamless access to support and resources. |
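A stability check of this kind usually compares the cluster labels produced by repeated runs (for example, on resampled data) using the Adjusted Rand Index. The sketch below shows only the metric itself, in dependency-free Python; the function name `adjusted_rand_index` and the toy label lists are illustrative assumptions, not code or data from the project. In practice, `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")

    # Count item pairs that share a cluster in both labelings, in A, and in B.
    pairs_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    pairs_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (pairs_both - expected) / (maximum - expected)


# Toy labels (made up for illustration): run_2 is run_1 with cluster ids
# permuted, run_3 moves one item to a different cluster.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]
run_3 = [0, 0, 1, 1, 2, 0]

print(adjusted_rand_index(run_1, run_2))
print(adjusted_rand_index(run_1, run_3))
```

Because the index is invariant to how cluster ids are numbered, two runs that find the same grouping score 1.0 even if one calls a cluster 0 and the other calls it 2. Values near 1 across many runs indicate that the partition is reproducible.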
This project utilizes a pipeline designed for high-dimensional mixed data:
- Data Preprocessing:
  - Handling structural missingness (Skip-Logic) via explicit categorization (encoded as -1).
  - Exclusion of self-employed respondents to focus on corporate HR strategy.
- Dimensionality Reduction:
  - FAMD (Factor Analysis of Mixed Data): Used to diagnose variance domination caused by employment history features.
  - t-SNE (Gower Distance): Used to visualize the local manifold, confirming the continuous spectrum of mental health.
- Clustering:
  - K-Prototypes: The primary algorithm, extending K-Means to handle both numerical and categorical features.
  - Hierarchical Clustering: Used for structural validation of the k=3 solution.
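The two preprocessing steps listed above can be sketched as follows. This is a minimal, dependency-free illustration, not the notebook's code: the column names (`self_employed`, `prev_benefits`, `prev_anonymity`), the `SKIP_LOGIC_COLUMNS` tuple, and the sample rows are hypothetical, and the real survey columns are long question strings. With pandas, the same logic would typically be a boolean filter followed by `fillna`.

```python
NOT_ASKED = -1  # explicit category for questions hidden by survey skip-logic

# Hypothetical columns that are only shown to some respondents.
SKIP_LOGIC_COLUMNS = ("prev_benefits", "prev_anonymity")


def preprocess(rows):
    """Drop self-employed respondents and label skipped questions as NOT_ASKED."""
    cleaned = []
    for row in rows:
        if row.get("self_employed") == 1:
            continue  # out of scope: the analysis targets corporate HR strategy
        record = dict(row)  # copy so the caller's data is left untouched
        for column in SKIP_LOGIC_COLUMNS:
            if record.get(column) is None:
                record[column] = NOT_ASKED
        cleaned.append(record)
    return cleaned


# Made-up sample rows for illustration only.
rows = [
    {"self_employed": 0, "prev_benefits": "Yes", "prev_anonymity": None},
    {"self_employed": 1, "prev_benefits": None, "prev_anonymity": None},
    {"self_employed": 0, "prev_benefits": None, "prev_anonymity": "No"},
]
cleaned = preprocess(rows)
print(cleaned)
```

Encoding a skipped question as its own category keeps "was never asked" distinct from a real answer, which imputing a mode or dropping the row would not.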
Tech Stack: Python 3.11, pandas, scikit-learn, kmodes, prince (FAMD), gower, seaborn, missingno.
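To make the K-Prototypes idea concrete, here is a minimal sketch of the mixed dissimilarity it is built on: squared Euclidean distance on the numerical features plus a weight `gamma` times the number of mismatching categorical features. The function names, the prototypes, and the value of `gamma` are assumptions for illustration; the project itself relies on the `kmodes` package, and this is not that implementation.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numbers plus gamma times categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part


def nearest_prototype(point_num, point_cat, prototypes, gamma=1.0):
    """Return the index of the prototype closest to the given point."""
    return min(
        range(len(prototypes)),
        key=lambda k: mixed_dissimilarity(
            point_num, point_cat, prototypes[k][0], prototypes[k][1], gamma
        ),
    )


# Hypothetical prototypes: (numerical features, categorical features).
# The value -1 is treated as an ordinary category, matching the skip-logic encoding.
prototypes = [
    ((0.0,), ("No", "Yes")),
    ((1.0,), ("Yes", -1)),
]

cluster = nearest_prototype((0.9,), ("Yes", -1), prototypes, gamma=1.0)
print(cluster)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numerical distance, so it matters most when numerical features are on very different scales from the 0/1 mismatch counts.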
├── data/
│ ├── raw/ # Original OSMI 2016 dataset (Not included in repo, see Setup)
│ └── processed/ # Cleaned and encoded dataframes and cluster profiles (Run mental-health-tech.ipynb)
├── notebooks/
│ ├── mental-health-tech.ipynb # Primary analysis (ETL, Preprocessing, and Clustering pipeline)
│ └── figures.ipynb # Generation of interpretable visualizations for the report
├── reports/
│ └── figures/ # Exported PNGs used in the report (Run figures.ipynb)
├── environment.yml # Conda environment configuration
├── README.md
└── .gitignore
This project requires Conda (Anaconda) to manage mixed dependencies (pip + conda).
git clone https://github.com/mardelpozo/mental-health-tech.git
cd mental-health-tech
conda env create -f environment.yml
conda activate mental-health-tech

Due to licensing and file size, the raw dataset is not hosted directly in this repository.
- Navigate to the OSMI Mental Health in Tech Survey 2016 on Kaggle.
- Download the file mental-heath-in-tech-2016_20161114.csv.
- Place the downloaded file in the data/raw/ directory.
Note: The notebooks are configured to look for the data in this specific relative path.
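Before launching the notebooks, a quick check like the one below can confirm that the file landed in the expected place. It is a small sketch that assumes the path is resolved from the repository root; the helper names are hypothetical and are not part of the project.

```python
from pathlib import Path

# File name as given in the download step above.
RAW_FILENAME = "mental-heath-in-tech-2016_20161114.csv"


def raw_data_path(repo_root="."):
    """Build the expected location of the raw survey file under data/raw/."""
    return Path(repo_root) / "data" / "raw" / RAW_FILENAME


def raw_data_available(repo_root="."):
    """Return True if the raw survey file exists at the expected location."""
    return raw_data_path(repo_root).is_file()


if not raw_data_available():
    print(f"Raw dataset not found; expected it at {raw_data_path()}")
```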
jupyter lab

- Pipeline (notebooks/mental-health-tech.ipynb): Contains the complete end-to-end workflow, including data cleaning, feature engineering, clustering, and validation metrics.
- Visualizations (notebooks/figures.ipynb): Generates the interpretable visualizations (FAMD projections, t-SNE plots, radar charts) used in the final report.
For a complete list of references, please consult the Bibliography section of the corresponding case study report.
This project was conducted as part of the Machine Learning: Unsupervised Learning and Feature Engineering course at IU International University of Applied Sciences.
Author: Mariana Del Pozo Patrón