Statistics for E. coli mutants on COVID patient classification

brsynth/mutant-covid

Statistical curve analysis using generalized additive mixed models

For this part, we test whether the groups are biologically different. The questions we seek to answer are: Do group trajectories differ over time? Is there a statistically significant group-by-time effect? Are the differences robust after accounting for repeated measures, patient-to-patient variation, autocorrelation, etc.?

Optical density (OD) growth measurements were analysed using generalized additive mixed models (GAMMs) implemented in the mgcv package (v1.8). To ensure comparability across individuals, all trajectories were truncated to a common endpoint, defined as the minimum of the per-series maximum recorded times across patient–replicate series. OD values were modelled as a smooth function of time with group-specific deviations by including both a global smooth term s(time) and interaction smooths s(time, by = Group). Group was included as a fixed effect, allowing baseline differences in OD between experimental conditions. A mixed-effects structure was incorporated to account for repeated measurements within patients and replicates: random intercepts and subject-specific smooth terms were included using penalized regression splines. Temporal autocorrelation within patients was further addressed using an autoregressive moving-average correlation structure.

The GAMM can be viewed as a constrained curve fit, which allows us to perform a likelihood ratio test. We compare two nested models: one fits a single shared curve, the other fits group-specific curves. If they yield significantly different residual sums of squares (RSS) after accounting for the degrees of freedom (model complexity), we have structural evidence that the Group variable is an essential feature, not noise.
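The nested-model comparison above can be illustrated with a minimal numpy sketch. This is not the actual mgcv analysis (which uses penalized splines, random effects, and an ARMA correlation structure in R); here a polynomial basis stands in for the spline basis, and an F-test on the two RSS values plays the role of the likelihood ratio test. All data and parameter names below are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated OD trajectories: two groups with slightly different curves.
t = np.tile(np.linspace(0, 24, 50), 2)
group = np.repeat([0, 1], 50)
od = 1 / (1 + np.exp(-(t - 10) / 2)) + 0.05 * group * t / 24 \
    + rng.normal(0, 0.02, 100)

def design(time, degree=3):
    # Polynomial basis as a simple stand-in for a spline basis.
    return np.vander(time, degree + 1)

def rss(X, y):
    # Residual sum of squares of an ordinary least-squares fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), X.shape[1]

# Constrained model: one shared curve for both groups.
rss0, p0 = rss(design(t), od)
# Full model: an extra group-by-time deviation curve.
X1 = np.hstack([design(t), design(t) * group[:, None]])
rss1, p1 = rss(X1, od)

# F-test comparing the nested models, accounting for degrees of freedom.
n = len(od)
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))
p_value = stats.f.sf(F, p1 - p0, n - p1)
print(F, p_value)
```

Because the constrained model's column space is a subspace of the full model's, the full model's RSS can only be equal or smaller; the F-test asks whether the drop is larger than its extra parameters would explain by chance.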

Classification:

To prevent information leakage across repeated measurements, model evaluation was performed using patient-level train/test splitting. For each run, patients were randomly partitioned into training (80%) and testing (20%) sets, ensuring that all replicates from the same patient were assigned exclusively to one split. Model hyperparameters were tuned by cross-validation on the training set before models were evaluated on the held-out test set. Performance was quantified using balanced accuracy and macro-averaged precision to prevent overly optimistic results, especially because our dataset has very high intrinsic variance and subtle class differences. Final results are reported as mean ± standard deviation across runs.
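The patient-level splitting idea can be sketched as follows. This is a minimal stand-alone version (equivalent in spirit to scikit-learn's GroupShuffleSplit); the function name and toy data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def patient_level_split(patient_ids, test_frac=0.2, rng=rng):
    """Partition patients (not rows) into train/test, so that every
    replicate of a patient lands in exactly one split."""
    patients = np.unique(patient_ids)
    shuffled = rng.permutation(patients)
    n_test = max(1, int(round(test_frac * len(patients))))
    test_patients = set(shuffled[:n_test])
    test_mask = np.array([p in test_patients for p in patient_ids])
    return ~test_mask, test_mask

# Toy example: 5 patients with 3 replicates each.
patient_ids = np.repeat(np.arange(5), 3)
train_mask, test_mask = patient_level_split(patient_ids)
# No patient appears in both splits.
print(set(patient_ids[train_mask]) & set(patient_ids[test_mask]))  # → set()
```

Splitting by patient rather than by row is what prevents replicates of a test patient from leaking into training.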

Whole time series: Representation Learning

To assess whether OD growth trajectories could be used to discriminate between experimental groups, we implemented a deep learning classification framework based on raw time-series measurements. Time-series OD values were used directly as model input without manual feature engineering. Each trajectory was represented as a one-dimensional sequence with an added channel dimension, enabling convolutional processing. Three neural network architectures that are suitable for time series were evaluated: a one-dimensional convolutional neural network (CNN1D), a long short-term memory network (LSTM), and a temporal convolutional network (TCN) with dilated convolutions. All models produced class predictions via a final fully connected layer and were trained using cross-entropy loss optimized with the Adam optimizer.
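To make the input representation concrete, here is a minimal numpy sketch of the shared building block of the CNN1D and TCN: a trajectory treated as a one-channel 1-D sequence passed through a dilated convolution, ReLU, and global average pooling. The actual models were trained end-to-end with cross-entropy and Adam (not shown); shapes and weights below are invented for illustration.

```python
import numpy as np

def conv1d(x, kernels, dilation=1):
    """Valid-mode dilated 1-D convolution.
    x: (in_channels, length); kernels: (out_channels, in_channels, k)."""
    out_ch, in_ch, k = kernels.shape
    span = (k - 1) * dilation
    out_len = x.shape[1] - span
    out = np.zeros((out_ch, out_len))
    for o in range(out_ch):
        for i in range(in_ch):
            for j in range(k):
                out[o] += kernels[o, i, j] * x[i, j * dilation : j * dilation + out_len]
    return out

# One OD trajectory as a single-channel sequence of 100 time points.
rng = np.random.default_rng(0)
x = rng.random((1, 100))
h = np.maximum(conv1d(x, rng.normal(size=(8, 1, 3)), dilation=2), 0)  # ReLU
pooled = h.mean(axis=1)  # global average pooling -> one value per filter
print(h.shape)  # → (8, 96)
```

In the real networks, the pooled representation feeds a final fully connected layer that produces class logits; dilation is what gives the TCN its long temporal receptive field.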

Classification of parameters: Feature Engineering

To assess whether traditional curve parameters could be used to discriminate between experimental groups, we calculated 7 (or 8?) features (including…?) with biological meaning for each individual curve and used them as input for classification models. Three supervised classifiers were assessed: support vector machine (RBF kernel), logistic regression, and extreme gradient boosting (XGBoost).
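The README does not list the exact feature set, so the sketch below computes four common, biologically interpretable growth-curve parameters (maximum OD, maximum growth rate, time of maximum rate, area under the curve) purely as hypothetical examples of what such features might look like.

```python
import numpy as np

def curve_features(t, od):
    """Hypothetical examples of biologically meaningful curve parameters;
    the actual feature set used in the analysis is not listed here."""
    d_od = np.gradient(od, t)  # instantaneous growth rate
    auc = np.sum(np.diff(t) * (od[:-1] + od[1:]) / 2)  # trapezoidal area
    return {
        "max_od": od.max(),              # carrying-capacity proxy
        "max_rate": d_od.max(),          # steepest growth
        "t_max_rate": t[d_od.argmax()],  # inflection-time proxy
        "auc": auc,                      # overall growth burden
    }

# Toy logistic growth curve with inflection at t = 10.
t = np.linspace(0, 24, 200)
od = 1 / (1 + np.exp(-(t - 10) / 1.5))
feats = curve_features(t, od)
print(feats)
```

Each curve then becomes one fixed-length feature vector, which is what allows standard tabular classifiers like SVM, logistic regression, and XGBoost to be applied.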

Multiple mutants classification

To address whether combining several mutants can improve patient classification, we combined data from the two best mutants (A15, A1) and the three best mutants (A15, A1, A5) for each patient. Parameters are stacked by column and cross-combined across repetitions (for the same patient, a single row may combine, e.g., replicate 1 of mutant A with replicate 2 of mutant B). In addition to the individual classifiers, a soft-voting ensemble was constructed by combining the tuned SVM, logistic regression, and XGBoost models using their predicted class probabilities. Ensemble performance was evaluated under the same repeated grouped-splitting framework. We also conducted feature selection to identify which parameters from which mutant are important to the models.
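The column-stacking and replicate cross-combination can be sketched for one patient as follows; the replicate counts, feature dimension, and variable names are invented for illustration.

```python
import itertools
import numpy as np

# Hypothetical per-mutant feature tables for one patient:
# rows = replicates, columns = curve parameters.
rng = np.random.default_rng(1)
feats_A15 = rng.random((3, 4))  # 3 replicates, 4 parameters
feats_A1 = rng.random((2, 4))   # 2 replicates, 4 parameters

# Cross-combine replicates across mutants, stacking parameters by column:
# each output row pairs one A15 replicate with one A1 replicate.
combined = np.array([
    np.concatenate([feats_A15[i], feats_A1[j]])
    for i, j in itertools.product(range(3), range(2))
])
print(combined.shape)  # → (6, 8)
```

The Cartesian product over replicates both widens the feature vector (columns from every mutant) and multiplies the number of training rows per patient, at the cost of making patient-level splitting even more important.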
