This study employed a three-phase analytical pipeline to investigate molecular pathway heterogeneity in amyotrophic lateral sclerosis (ALS) using synthetic patient populations derived from clinically validated pathogenic variants (Figure 1). The methodology comprised: (1) synthetic patient generation with federated clustering to identify disease severity strata, (2) mechanistic pathway annotation based on curated gene-pathway relationships, and (3) comprehensive pathway co-occurrence and correlation analysis stratified by cluster assignment.
Pathogenic ALS variants were extracted from the National Center for Biotechnology Information (NCBI) ClinVar database. The curated dataset (clinvar.cleaned.csv) was filtered using the following inclusion criteria:
Inclusion Criteria:
- Clinical significance classified as "Pathogenic" or "Likely Pathogenic" per American College of Medical Genetics and Genomics (ACMG) guidelines
- Disease association with amyotrophic lateral sclerosis or related motor neuron disease phenotypes
- Minimum review status of "criteria provided, single submitter"
Resulting Dataset Characteristics:
| Parameter | Value |
|---|---|
| Total pathogenic variants | ~450 |
| Unique genes represented | 34 |
| Variant types included | SNV, indel, repeat expansion |
Each ClinVar record provided the following biological ground truth for simulation:
| Field | Description | Example |
|---|---|---|
| Gene Association | HGNC-approved gene symbol | SOD1, TARDBP, C9orf72 |
| Clinical Significance | ACMG/AMP pathogenicity classification | Pathogenic, Likely pathogenic |
| Molecular Consequence | Predicted functional impact | Missense, nonsense, frameshift |
| Genomic Coordinates | GRCh38 chromosome position | chr21:31659666 |
Synthetic patient cohorts were generated to represent five continental superpopulations, following the 1000 Genomes Project nomenclature:
| Code | Population | Simulated n |
|---|---|---|
| AFR | African/African American | 3,000 |
| AMR | Admixed American | 3,000 |
| EAS | East Asian | 3,000 |
| EUR | European | 3,000 |
| SAS | South Asian | 3,000 |
| Total | 15,000 |
Pathogenic variants were assigned to synthetic patients using a stochastic model incorporating population-specific allele frequencies and gene-level penetrance estimates. For each patient i and variant v:
Where:
-
$f_{v,pop}$ = population-specific allele frequency from gnomAD v3.1 -
$\pi_{gene(v)}$ = penetrance coefficient for the variant's associated gene
Each variant carrier was assigned a composite severity score reflecting the cumulative pathogenic burden. The severity score for patient i carrying variants
Component Weights:
Gene Weight (
| Gene | Weight | Justification |
|---|---|---|
| SOD1 | 1.0 | High penetrance, well-characterized |
| C9orf72 | 1.0 | Most common familial ALS cause |
| TARDBP | 0.9 | High penetrance |
| FUS | 0.9 | Aggressive juvenile-onset forms |
| Other genes | 0.5–0.8 | Variable penetrance |
Molecular Consequence Weight (
Pathogenicity Confidence Weight (
Continuous severity scores were discretized into clinical severity categories:
| Category | Severity Score Range | Clinical Interpretation |
|---|---|---|
| Healthy | No pathogenic variants detected | |
| Mild | Single low-penetrance variant | |
| Moderate | Multiple variants or single high-impact | |
| Severe | Multiple high-impact variants |
Disease progression rate was modeled as a probabilistic function of severity score:
Where
Federated learning was employed to simulate a privacy-preserving multi-institutional collaboration where patient-level data remains decentralized at each population node. This architecture enables identification of disease subgroups without sharing raw genetic data.
For each carrier patient, a feature vector
| Feature Set | Dimension | Description |
|---|---|---|
| 1 | Total pathogenic variant count | |
| 1 | Composite severity score | |
| 34 | Binary gene indicators (1 if patient carries variant in gene) | |
| 4 | Consequence counts (missense, nonsense, frameshift, splice) | |
| Total | 40 |
The clustering algorithm proceeded iteratively across
Algorithm: Federated K-Means
Input: K (number of clusters), ε (convergence threshold), T_max (max iterations)
Initialize: Global centroids μ₁⁽⁰⁾, μ₂⁽⁰⁾, ..., μ_K⁽⁰⁾ randomly
For t = 1 to T_max:
For each population node p = 1 to P:
# LOCAL STEP 1: Assign patients to nearest centroid
For each patient i in population p:
z_i = argmin_k ||x_i - μ_k⁽ᵗ⁻¹⁾||²
# LOCAL STEP 2: Compute local cluster statistics
For k = 1 to K:
n_k⁽ᵖ⁾ = |{i : z_i = k}| # Local cluster size
μ_k⁽ᵖ⁾ = (1/n_k⁽ᵖ⁾) Σ_{z_i=k} x_i # Local centroid
Send {n_k⁽ᵖ⁾, μ_k⁽ᵖ⁾} to central coordinator
# GLOBAL AGGREGATION
For k = 1 to K:
μ_k⁽ᵗ⁾ = Σ_p (n_k⁽ᵖ⁾ × μ_k⁽ᵖ⁾) / Σ_p n_k⁽ᵖ⁾ # Weighted average
# CONVERGENCE CHECK
Δ = Σ_k ||μ_k⁽ᵗ⁾ - μ_k⁽ᵗ⁻¹⁾||²
If Δ < ε: break
Output: Cluster assignments {z_i} and final centroids {μ_k}
Hyperparameters:
-
$K = 5$ (determined by elbow method and silhouette analysis) $\epsilon = 10^{-6}$ $T_{max} = 100$ - Distance metric: Euclidean
The optimal cluster count was determined using two complementary methods:
Elbow Method (Within-Cluster Sum of Squares):
Silhouette Score:
For each patient i: $$ s(i) = \frac{b(i) - a(i)}{\max{a(i), b(i)}} $$
Where:
-
$a(i)$ = mean distance to other patients in same cluster -
$b(i)$ = mean distance to patients in nearest other cluster
Mean silhouette score across patients indicated optimal separation at
The resulting five clusters exhibited distinct severity and progression profiles:
| Cluster | n | Mean Severity Score | Modal Progression | % Modal | Interpretation |
|---|---|---|---|---|---|
| C0 | 287 | 5.00 | Slow | 74.2% (213/287) | Mild |
| C1 | 3,287 | 7.27 | Moderate | 92.9% (3,055/3,287) | Moderate-A |
| C2 | 1,999 | 9.30 | Fast | 96.5% (1,930/1,999) | Severe |
| C3 | 8,957 | 0.00 | N/A | 100% | Control (Healthy) |
| C4 | 845 | 6.32 | Moderate | 91.2% (771/845) | Moderate-B |
Key Finding: Clusters C1 and C4, while both classified as "Moderate" severity, emerged as distinct molecular subtypes, providing evidence for severity-stratified molecular heterogeneity within the moderate disease category.
Severity Gradient Validation:
The cluster severity ordering (C0 < C4 < C1 < C2) was validated by:
- Monotonic increase in mean severity scores
- Concordant shift in predicted progression rates
- Distinct gene enrichment patterns (detailed in Section 2.6)
Seven canonical ALS-associated molecular pathways were defined based on comprehensive literature review of motor neuron degeneration mechanisms:
| Pathway | Code | Primary Biological Process |
|---|---|---|
| Proteostasis | PROT | Protein folding quality control, autophagy, ubiquitin-proteasome system |
| RNA Metabolism | RNA | Pre-mRNA splicing, stress granule dynamics, nucleocytoplasmic transport |
| Cytoskeletal/Axonal Transport | CYTO | Microtubule dynamics, motor protein function, neurofilament organization |
| Mitochondrial Dysfunction | MITO | Oxidative phosphorylation, reactive oxygen species homeostasis, apoptosis |
| Excitotoxicity | EXCITO | Glutamatergic neurotransmission, calcium homeostasis, AMPA/NMDA receptor function |
| Vesicle Trafficking | VES | Endosomal sorting, autophagosome-lysosome fusion, synaptic vesicle cycling |
| DNA Damage Response | DNA | Genome stability, DNA repair, R-loop resolution |
Each of the 34 ALS-associated genes was mapped to one or more pathways based on established molecular mechanisms. The mapping schema permitted pleiotropy, reflecting biological reality where single genes participate in multiple cellular processes.
Complete Gene-Pathway Assignment:
| Gene | Pathway(s) | Mechanistic Basis |
|---|---|---|
| SOD1 | PROT, MITO, EXCITO | Misfolded aggregates disrupt proteostasis; mitochondrial localization impairs respiration; EAAT2 cleavage reduces glutamate clearance |
| C9orf72 | PROT, RNA, EXCITO | Dipeptide repeat proteins inhibit proteasome; RNA foci sequester splicing factors; AMPA receptor upregulation |
| TARDBP | RNA, EXCITO | TDP-43 mislocalization disrupts splicing; ADAR2 downregulation increases Ca²⁺-permeable AMPA receptors |
| FUS | RNA, MITO | Nuclear export defects; interaction with ATP synthase impairs mitochondrial function |
| VCP | PROT | Autophagosome maturation failure |
| UBQLN2 | PROT | Ubiquitin-proteasome dysfunction |
| OPTN | PROT | Autophagy receptor dysfunction |
| SQSTM1 | PROT | p62-mediated selective autophagy impairment |
| TBK1 | PROT | Autophagy initiation kinase deficiency |
| CCNF | PROT | E3 ubiquitin ligase dysfunction |
| MATR3 | RNA | mRNA nuclear export block |
| HNRNPA1 | RNA | Stress granule dysregulation |
| HNRNPA2B1 | RNA | Stress granule dysregulation |
| ANG | RNA | Ribosomal biogenesis impairment |
| ELP3 | RNA | tRNA modification defects |
| TUBA4A | CYTO | Microtubule destabilization |
| PFN1 | CYTO | Actin polymerization failure |
| NEFH | CYTO | Neurofilament accumulation |
| DCTN1 | CYTO | Retrograde axonal transport failure |
| KIF5A | CYTO | Anterograde axonal transport failure |
| CHCHD10 | MITO | Cristae disruption, mtDNA instability |
| SIGMAR1 | MITO | MAM calcium dysregulation |
| ATXN2 | MITO | NADPH oxidase activation, ROS surge |
| C19orf12 | MITO | Mitochondrial iron dysregulation |
| ALS2 | VES | Endosome-lysosome fusion failure |
| CHMP2B | VES | ESCRT-III dysfunction |
| VAPB | VES | ER-Golgi transport defects |
| FIG4 | VES | Lysosomal biogenesis failure |
| SPG11 | VES, DNA | Lysosome reformation failure; DNA repair defects |
| NEK1 | DNA | DNA damage checkpoint failure |
| C21orf2 | DNA | DDR defects via NEK1 interaction |
| SETX | DNA | R-loop accumulation, genomic instability |
| UNC13A | EXCITO | Synaptic vesicle release defects |
| DAO | EXCITO | D-serine metabolism, NMDA modulation |
For each patient i, two complementary pathway metrics were computed:
Binary Pathway Indicator:
Where
Pathway Burden Score:
Where
This scoring scheme quantifies the breadth of pathway disruption by counting unique genes affected, rather than total variant count, thereby avoiding inflation from multiple variants in a single gene.
Derived Variables per Patient:
| Variable | Computation | Interpretation |
|---|---|---|
n_pathways_affected |
Number of distinct pathways with ≥1 affected gene | |
primary_pathway |
Pathway with highest gene burden | |
pathway_X_genes |
Concatenated gene list | Genes contributing to pathway X |
Cohort-Level Summary:
| Metric | Value |
|---|---|
| Total patients | 15,000 |
| Carriers (≥1 pathway affected) | 6,043 (40.3%) |
| Mean pathways per carrier | 2.1 ± 1.3 |
| Patients with ≥3 pathways | 1,847 (30.6% of carriers) |
Pathway prevalence was computed within each non-control cluster:
Where
For each cluster, a symmetric
Diagonal elements represent within-pathway counts:
To normalize for pathway prevalence asymmetry, co-occurrence percentage was computed relative to the smaller pathway:
This formulation ensures that if all patients with the less common pathway also have the more common pathway, Co-occ% = 100%.
For each pathway pair
| Pathway |
Pathway |
Total | |
|---|---|---|---|
| Pathway |
|||
| Pathway |
|||
| Total |
Odds Ratio: $$ \text{OR}_{p,q} = \frac{a \times d}{b \times c} $$
95% Confidence Interval (Woolf logit method): $$ \ln(\text{OR}) \pm 1.96 \times \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} $$
Effect Size Interpretation:
| Odds Ratio | Effect Size Classification |
|---|---|
| OR > 3.0 | Large positive association |
| 1.5 < OR ≤ 3.0 | Medium positive association |
| 0.67 ≤ OR ≤ 1.5 | Negligible association |
| 0.33 ≤ OR < 0.67 | Medium negative association |
| OR < 0.33 | Large negative association |
To quantify pathway overlap independent of marginal frequencies:
Where
Jaccard index ranges from 0 (no overlap) to 1 (complete overlap).
Spearman's ρ was computed between pathway burden scores to assess monotonic (potentially non-linear) relationships:
Where $d_i = R(\text{Score}{i,p}) - R(\text{Score}{i,q})$ is the difference in ranks for patient
Pearson's
Rationale for Dual Analysis:
| Method | Detects | Use Case |
|---|---|---|
| Spearman ρ | Monotonic relationships | Robust to outliers, non-normal distributions |
| Pearson |
Linear relationships | Tests dose-dependent biological hypothesis |
Linearity Assessment:
The difference
Observed in overall cohort:
- Maximum
$|\rho - r|$ = 0.034 - Mean
$|\rho - r|$ = 0.009
The close agreement (Δ < 0.05 for all pairs) confirmed that pathway relationships are approximately linear, supporting Pearson correlation for dose-dependent inference.
| Correlation | Strength | Biological Interpretation |
|---|---|---|
| $ | r | \geq 0.7$ |
| $0.3 \leq | r | < 0.7$ |
| $ | r | < 0.3$ |
Pathway pairs were categorized into biological relationship types based on the joint distribution of co-occurrence frequency and correlation strength:
Classification Criteria:
| Category | Criteria | Biological Interpretation |
|---|---|---|
| Dose-dependent | Co-occ% > 50% AND |
Cascading failure: when one pathway intensifies, the other scales proportionally |
| Threshold effect | Co-occ% > 50% AND $ | r |
| Distinct subtypes | Co-occ% < 30% AND |
Mutually exclusive: suggests separate molecular subtypes |
| Other | Not meeting above criteria | Mixed or transitional patterns |
Exemplar Classifications (Overall Cohort):
| Pathway Pair | Co-occ% | Pearson |
Classification |
|---|---|---|---|
| Mitochondrial × Excitotoxicity | 86.5% | 0.687 | Dose-dependent |
| RNA Metabolism × Mitochondrial | 60.4% | 0.214 | Threshold effect |
| Proteostasis × Vesicle Trafficking | 24.4% | −0.408 | Distinct subtypes |
For each pathway, a
Where:
-
$O_{ij}$ = observed count in cell$(i,j)$ -
$E_{ij} = \frac{R_i \times C_j}{N}$ = expected count under independence - Degrees of freedom:
$df = (4-1)(2-1) = 3$
For pathway burden scores across clusters:
Where:
-
$N$ = total sample size -
$K$ = number of clusters -
$R_k$ = sum of ranks in cluster$k$ -
$n_k$ = sample size of cluster$k$
Under
Given the large sample size (
Epsilon-Squared (
| Effect Size | |
|---|---|
| ≥ 0.14 | Large |
| 0.06 – 0.14 | Medium |
| 0.01 – 0.06 | Small |
| < 0.01 | Negligible |
Cohen's
Where
|
For the correlation matrix (21 unique pairwise comparisons among 7 pathways), Bonferroni correction was applied:
Correlations with
Pathway relationships were represented as an undirected weighted graph
-
Vertices (
$V$ ): Seven molecular pathways -
Edges (
$E$ ): Pathway pairs with co-occurrence count > 0 -
Weights (
$W$ ): Co-occurrence counts
| Element | Encoding | Formula |
|---|---|---|
| Node size | Pathway prevalence |
|
| Node color | Pathway identity | Fixed color palette |
| Edge width | Co-occurrence strength | |
| Edge color | Effect size | Red (large), Orange (medium), Gray (negligible) |
Nodes were arranged in circular layout with angular position:
Coordinates: $$ x_p = x_{center} + R \cos(\theta_p) $$ $$ y_p = y_{center} + R \sin(\theta_p) $$
Where
| Component | Version | Purpose |
|---|---|---|
| Python | 3.10.12 | Primary analysis environment |
| pandas | 2.0.3 | Data manipulation |
| NumPy | 1.24.3 | Numerical computation |
| SciPy | 1.11.1 | Statistical tests |
| scikit-learn | 1.3.0 | K-means clustering |
| D3.js | 7.8.5 | Interactive visualization |
The complete pipeline comprised three executable modules:
phase1_federated_clustering.py— Federated K-means implementation with cross-population aggregationphase2_pathway_annotation_all.py— Gene-pathway mapping and patient scoringstage4_pathway_analysis_clusters.py— Statistical analysis, correlation matrices, and network generation
All analysis scripts and intermediate datasets are available at [repository URL to be inserted]. The ClinVar source data is publicly accessible through NCBI (https://www.ncbi.nlm.nih.gov/clinvar/).
With
- Small correlations (
$r = 0.10$ ) at$\alpha = 0.05$ - Small prevalence differences (Cohen's
$h = 0.20$ ) between clusters
Consequently, effect sizes rather than
-
Synthetic Data: Patients were computationally simulated based on variant-level characteristics; clinical phenotype heterogeneity and environmental factors were not modeled.
-
Pathway Boundary Definitions: Gene-pathway assignments were based on primary literature and may not capture context-dependent or tissue-specific pathway involvement.
-
Epistatic Effects: Complex genetic interactions between variants were not explicitly modeled in the severity score.
-
Population Structure: While five superpopulations were represented, within-population stratification (e.g., ancestry-informative markers) was not addressed.
-
Temporal Dynamics: The analysis represents a cross-sectional snapshot; longitudinal pathway evolution was not modeled.
This study utilized exclusively synthetic data generated from publicly available ClinVar variant annotations. No human subjects were enrolled, and no identifiable patient information was used. Institutional review board approval was not required.
Statistical analyses were performed in Python 3.10 using scipy.stats for inferential tests. Interactive visualizations were developed using D3.js v7.8.5. All analyses were conducted between January and February 2026.