The R package case-based-reasoning provides an R interface case-based reasoning using machine learning methods.
Case-Based Reasoning (CBR) is an artificial intelligence (AI) and problem-solving methodology that leverages the knowledge and experience gained from previously encountered situations, known as cases, to address new and complex problems. CBR relies on the principle that similar problems often have similar solutions, and it focuses on identifying, adapting, and reusing those solutions to solve new problems.
The CBR process consists of four main steps:
-
Retrieve: In this step, the system searches its case database to identify the most similar cases to the current problem. It uses similarity measures and pattern matching techniques to compare the features of the new problem with the existing cases.
-
Reuse: Once the relevant cases are retrieved, the system adapts the solutions from those cases to fit the new problem. This may involve combining multiple solutions, adjusting parameters, or modifying the solution to accommodate any differences between the old and new cases.
-
Revise: After the adapted solution has been applied to the new problem, the system evaluates the outcome to determine its effectiveness. If necessary, the solution is further revised and optimized to better suit the specific context of the problem.
-
Retain: Finally, the new problem and its corresponding solution are added to the case database for future reference. This step enhances the system's knowledge base and improves its ability to solve similar problems in the future.
CBR has been successfully applied in various domains, including medical diagnosis, legal reasoning, customer support, and design optimization. Its ability to learn from experience and adapt to new situations makes it a valuable approach in fields where expertise and problem-solving skills are crucial.
The central challenge in CBR is defining what "similar" means. This package solves the Retrieve step by learning a distance function from data rather than relying on ad-hoc similarity measures.
The workflow follows three stages:
- Fit a statistical model to the training data. The model captures which variables matter and how strongly they influence the outcome.
- Derive a distance function from the fitted model. The model's structure determines how distance between cases is measured.
- Retrieve the k nearest cases for each new query observation using the learned distance.
The package offers two families of models for this:
A regression model (Cox proportional hazards, OLS, or logistic regression via the rms package) is fitted on the training data. The estimated regression coefficients are then used as weights in a weighted absolute distance function. Variables with larger coefficients contribute more to the distance, while variables the model deems irrelevant contribute little. This follows the approach of Dippon et al. (2002).
A random forest (via the ranger package) is fitted on the training data. Two distance measures can be extracted:
- Proximity distance: Two observations are similar if they frequently land in the same terminal node across trees.
- Depth distance: The distance between two observations is the sum of edges between their terminal nodes across all trees in the forest, following Englund and Verikas.
Both approaches produce an (n x m) distance matrix, where n is the number of training observations and m the number of query observations. This matrix can be used directly for clustering or visualization, or passed to the built-in k-nearest-neighbor search to retrieve the most similar cases.
install.packages("CaseBasedReasoning")
install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")
This R package provides two methods case-based reasoning by using an endpoint:
-
Linear, logistic, and CPH Regression
-
Proximity and Depth Measure extracted from a fitted random forest (ranger package)
Besides the functionality of searching for similar cases, we added some additional features:
-
Automatic validation of the critical variables between the query and similar cases dataset
-
Checking proportional hazard assumption for the Cox Model
-
C++-functions for distance calculation
"Warning: Cases with missing values in the dependent variable (Y) or predictor variables (X) have been dropped from the analysis. This may lead to a reduced dataset and potential loss of information. Please review your data and consider appropriate missing value imputation techniques to mitigate these issues."
In the first example, we use the CPH model and the ovarian data set from the survival package. In the first step, we initialize the R6 data object.
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
cph_model <- CoxModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = ovarian)
After the initialization, we may want to get for each case in the query data the most similar case from the learning data.
n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), FALSE)
testID <- (1:n)[-trainID]
cph_model <- CoxModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = ovarian[trainID, ])
# fit model
cph_model$fit()
# get similar cases
matched_tbl <- cph_model$get_similar_cases(query = ovarian[testID, ], k = 3)
# or use the standard S3 predict interface
matched_tbl <- predict(cph_model, newdata = ovarian[testID, ], k = 3)To analyze the results, you can extract the similar cases and training data and combine them:
-
Note 1: During the initialization step, all cases with missing values in the data and endpoint variables were removed. Be sure to conduct a missing value analysis beforehand.
-
Note 2: The data.frame returned from
cph_model$get_similar_cases()contains four columns that help identify the query cases, their matches, and the distances between them:-
caseId: This column allows you to map the similar cases to cases in the data. For example, if you chose k = 3, the first three elements in the caseId column will be 1 (followed by three 2s, and so on). These three cases are the three most similar cases to case 0 in the verum data.
-
scDist: The calculated distance between the cases.
-
scCaseId: Grouping number of the query case with its matched data.
-
group: Grouping indicator for matched or query data.
-
These columns help organize and interpret the results, ensuring a clear understanding of the most similar cases and their corresponding query cases.
The distance matrix is a square matrix that represents the pairwise distances between a set of data points. In the context of Case-Based Reasoning (CBR), the distance matrix captures the dissimilarities between cases in the training and test (or query) datasets, based on the fitted model and the values of the predictor variables.
The distance matrix can be helpful in various situations:
-
Identifying Similar Cases: By examining the distance matrix, you can identify the most similar cases to a given query case. Smaller distances indicate higher similarity, enabling the retrieval of relevant cases for CBR.
-
Clustering and Grouping: The distance matrix can be used as input for clustering algorithms, such as hierarchical clustering or k-means clustering, to group cases with similar characteristics. This can provide insights into the structure and patterns within the data.
-
Visualizing Relationships: By creating a heatmap of the distance matrix, you can visualize the relationships between cases. This representation can help identify trends and anomalies in the data, guiding further analysis and decision-making.
-
Model Validation: The distance matrix can be used to assess the performance of the CPH regression model or other statistical models employed in the CBR process. Comparing the distance matrix for different models can help determine which model better captures the relationships between cases.
In summary, a distance matrix can provide valuable insights into the relationships between cases, facilitate the identification of similar cases for CBR, and aid in the validation of the chosen statistical models.
distance_matrix <- cph_model$calc_distance_matrix()cph_model$calc_distance_matrix() calculates the distance matrix between train and test data. When test data is omitted, the distances between observations in the training data are calculated. Rows are observations in train and columns observations of test. The distance matrix is saved internally in the CoxModel object: cph_model$dist_matrix.
-
PD Dr. Jürgen Dippon, Institut für Stochastik und Anwendungen, Universität Stuttgart
-
Dr. Simon Müller, DataZoo GmbH
-
Dr. Peter Fritz
-
Professor Dr. Friedel
The Robert Bosch Foundation funded this work. Special thanks go to Professor Dr. Friedel (Thoraxchirugie - Klinik Schillerhöhe).
-
Dippon et al. A statistical approach to case based reasoning, with application to breast cancer data (2002),
-
Friedel et al. Postoperative Survival of Lung Cancer Patients: Are There Predictors beyond TNM? (2012).
-
Englund and Verikas A novel approach to estimate proximity in a random forest: An exploratory study
-
Stuart, E. et al. Matching methods for causal inference: Designing observational studies
-
Defossez et al. Temporal representation of care trajectories of cancer patients using data from a regional information system: an application in breast cancer