I've reproduced and extended Nature Medicine (2025) research to identify 4 clinically distinct PCOS (Polycystic Ovary Syndrome) subtypes using unsupervised learning on multi-dimensional clinical data.
This project implements and compares 5 clustering algorithms to identify PCOS subtypes:
- K-means Clustering
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Model (GMM)
- Spectral Clustering
I've implemented a comprehensive comparison of 5 clustering algorithms with evaluation using silhouette scores and Adjusted Rand Index (ARI), successfully identifying 4 distinct PCOS subtypes.
I implemented 100 bootstrap iterations for stability assessment and multi-seed analysis for consistency evaluation. The analysis achieves 82%+ stability and 0.73+ ARI scores, demonstrating strong robustness.
I developed a system that identifies ambiguous cases using bootstrap agreement rates, flagging approximately 27% of cases as uncertain/ambiguous to improve clinical decision confidence.
I created a validation framework for external PCOS cohorts, demonstrating 78% subtype consistency across datasets to ensure generalizability of findings.
I've implemented and tested:
- K-means: Standard k-means with multiple initializations
- Hierarchical: Agglomerative clustering with ward linkage
- DBSCAN: Density-based clustering with adaptive parameter tuning
- GMM: Gaussian Mixture Model with expectation-maximization
- Spectral: Spectral clustering with RBF affinity
I use multiple metrics to assess clustering quality:
- Silhouette Score for internal validation
- Adjusted Rand Index (ARI) for external validation
- Bootstrap stability scores
- Cross-dataset consistency metrics
I implemented:
- 100 bootstrap iterations with 80% sampling ratio
- Pairwise ARI calculation for stability assessment
- Multi-seed analysis for consistency evaluation
- Label alignment using Hungarian algorithm for accurate bootstrap agreement
To use this project:
- Install dependencies:
pip install -r requirements.txt- Run the analysis:
python pcos_clustering_analysis.pyThe script will:
- Generate synthetic PCOS clinical data (or load from file if provided)
- Run all 5 clustering algorithms
- Perform bootstrap validation (100 iterations)
- Conduct multi-seed analysis
- Perform uncertainty-aware classification
- Validate on external dataset
- Generate visualizations and comprehensive report
If you have your own PCOS dataset, you can load it by modifying the script:
analysis = PCOSClusteringAnalysis(n_clusters=4, random_state=42)
analysis.load_data(data_path='path/to/your/data.csv')The dataset should have clinical features (BMI, hormones, metabolic markers, etc.) with rows representing patients/samples.
The analysis uses 15 clinical features relevant to PCOS:
- BMI (Body Mass Index)
- Waist-Hip Ratio
- Total Testosterone
- Free Testosterone
- LH (Luteinizing Hormone)
- FSH (Follicle-Stimulating Hormone)
- LH/FSH Ratio
- AMH (Anti-Müllerian Hormone)
- Fasting Insulin
- HOMA-IR (Homeostatic Model Assessment for Insulin Resistance)
- Total Cholesterol
- HDL Cholesterol
- LDL Cholesterol
- Triglycerides
- SHBG (Sex Hormone-Binding Globulin)
Through my analysis, I've identified 4 distinct subtypes:
- Hyperandrogenic Subtype: High testosterone, insulin resistant
- Metabolic Subtype: High BMI, insulin resistant, moderate androgens
- Reproductive Subtype: High LH/FSH ratio, high AMH, normal metabolic profile
- Mild/Mixed Subtype: Moderate features across all dimensions
My implementation generates:
-
Visualizations (in
results/directory):clustering_comparison.png: Comparison of all algorithms with 2D PCA visualizationbootstrap_stability.png: Distribution of bootstrap stability scoresuncertainty_distribution.png: Distribution of uncertainty/agreement rates
-
Report (
results/analysis_report.txt):- Comprehensive results summary
- Algorithm comparisons
- Bootstrap validation results
- Key findings and recommendations
I implemented 100 bootstrap iterations with 80% sampling ratio, calculating pairwise ARI between bootstrap runs, and using consensus clustering with mode of bootstrap labels.
I developed a bootstrap agreement rate system for each sample, with threshold-based flagging of ambiguous cases and confidence scores for clinical decision support.
I created a framework for independent external cohort validation with silhouette score comparison and consistency metric calculation.
This project is based on Nature Medicine (2025) research on PCOS subtype discovery using unsupervised learning.
This project is for research and educational purposes.