This codebase contains an end-to-end pipeline for building patient cohorts and for calculating and modeling the impact of colonization pressure (defined below) on nosocomial pathogen acquisition using electronic health record data. It accompanies the following research manuscript:
"A proof-of-concept study for colonization pressure as a real-time machine-learning risk metric for nosocomial acquisition" by Sagers et al. (Nature Communications, 2026)
Hospitalized patients are at risk for developing hospital-acquired infections (HAI). A primary mechanism for HAI begins when a patient who is colonized with a potential pathogen is admitted to the hospital. That individual becomes a reservoir from which the hands and clothing of healthcare workers, hospital equipment and room surfaces are contaminated. Contact with these contaminated surfaces results in transmission to and colonization of a new vulnerable host. Colonization is a strong predictor of future clinical infection. The cycle repeats when the second patient becomes a new reservoir for onward nosocomial transmission.
Active surveillance for colonization of asymptomatic individuals is a key part of infection control, but it requires significant investment in infrastructure and human resources. For this reason, it is typically limited to intensive care units (ICUs) and other high-risk units, and to a few drug-resistant or high-virulence organisms, such as methicillin-resistant Staphylococcus aureus (MRSA) and vancomycin-resistant Enterococcus species (VRE). Colonization pressure (CP), defined as the prevalence of an organism among patients in the ward into which a patient enters, has the potential to augment active surveillance efforts because estimating CP requires no new data collection and relies instead on information already present in the electronic health record (EHR). Furthermore, CP can be calculated for any number of organisms in any area of the hospital using routine EHR data, whereas expanding active surveillance can be costly and disruptive to care.
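The definition of CP above can be illustrated with a short sketch. This is a Python illustration only, not the repository's R implementation. It assumes point prevalence at the moment of ward entry and a fixed set of known-colonized patients. The function name, record layout, and toy data are all hypothetical.

```python
from datetime import datetime

def colonization_pressure(stays, colonized, ward, entry_time, entering_patient):
    """Fraction of patients already on `ward` at `entry_time` who are colonized.

    Assumption: a ward with no other patients yields a CP of 0.0.
    """
    present = {
        s["patient"]
        for s in stays
        if s["ward"] == ward
        and s["start"] <= entry_time < s["end"]
        and s["patient"] != entering_patient  # exclude the entering patient
    }
    if not present:
        return 0.0
    return len(present & colonized) / len(present)

# Toy ward stays (hypothetical data).
stays = [
    {"patient": "A", "ward": "W1", "start": datetime(2024, 1, 1), "end": datetime(2024, 1, 10)},
    {"patient": "B", "ward": "W1", "start": datetime(2024, 1, 3), "end": datetime(2024, 1, 8)},
    {"patient": "C", "ward": "W1", "start": datetime(2024, 1, 4), "end": datetime(2024, 1, 6)},
    {"patient": "D", "ward": "W2", "start": datetime(2024, 1, 1), "end": datetime(2024, 1, 9)},
    {"patient": "E", "ward": "W1", "start": datetime(2024, 1, 5), "end": datetime(2024, 1, 12)},
]
colonized = {"A", "D"}  # patients with a positive result for the organism

# Patient E enters ward W1 on 5 January; A, B, and C are present, and A is colonized.
cp = colonization_pressure(stays, colonized, "W1", datetime(2024, 1, 5), "E")
print(round(cp, 3))
```

In practice, colonization status would likely depend on the timing of microbiology results rather than a static set; the sketch leaves that out for brevity.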
Prior studies have shown a direct association between CP and HAI, suggesting its potential role as a risk assessment tool. However, analyses were limited to known drug-resistant nosocomial pathogens and to ICU settings. Whether the same relationship applies to drug-susceptible organisms, which are responsible for a majority of infections in hospitals in the United States, and to non-ICU settings, remains unknown.
Our goal with this codebase is to publish a prototype of an infection control informatics tool that can construct ward-level CP across a variety of drug-susceptible and drug-resistant organisms using EHR data. Using this prototype, we tested the hypothesis that CP is associated with nosocomial acquisition of those organisms by applying it to a cohort of HAI cases matched to controls by demographics, surgery, and fine-grained antibiotic exposures.
- HO_infxn_functions.R
  Description: Functions for pre-processing and cohort building for environmental and patient analyses
  Input Files
  - None
  Output Files
  - None
- HO_infxn_C_Diff_micro_query.R
  Description: Script to pull all patients who had C. difficile testing performed (only necessary if these data are stored separately from other microbiology data)
  Input Files
  - edw_cdiff.csv
  - EDW_Cdiff_results_map.csv
  Output Files
  - Cdiff.csv
- HO_infxn_micro_prep.R
  Description: Script to process raw microbiology data
  Input Files
  - micro_raw.csv
  - Cdiff.csv
  Output Files
  - micro_ground_truth.csv
- HO_infxn_input_data_preprocessing.R
  Description: Raw clinical metadata pre-processing
  Dependent Scripts
  - HO_infxn_functions.R
  Input Files
  - micro_ground_truth.csv
  - ADT.csv
  - abx.csv
  - abx_map.csv
  - IP_ED_encounters.csv
  - encounters.csv
  Output Files
  - ADT_clean.csv
  - micro_dedup.csv
  - abx_prelim_clean.csv
  - abx_courses.csv
  - enc_clean.csv
  - admt_clean.csv
- HO_infxn_build_unmatched_cohorts.R
  Description: Build unmatched target pathogen cohorts
  Dependent Scripts
  - HO_infxn_functions.R
  Input Files
  - ADT_clean.csv
  - micro_dedup.csv
  - abx_courses.csv
  - enc_clean.csv
  - micro_ground_truth.csv
  - abx_prelim_clean.csv
  Output Files
  - adt_micro_raw.csv
  - path_cat_table.csv
  - unmatched_case_controls_no_features.csv
- HO_infxn_add_features.R
  Description: Add features to each target pathogen cohort
  Dependent Scripts
  - HO_infxn_functions.R
  Input Files
  - unmatched_case_controls_no_features.csv
  - adt_micro_raw.csv
  - demographics.csv
  - abx_courses.csv
  - elixhauser.csv
  - procedures.csv
  - admt_clean.csv
  - department_mapping.csv
  - enc_clean.csv
  - ADT.csv
  - micro_ground_truth.csv
  - path_cat_table.csv
  Output Files
  - dems_clean.csv
  - abx_clean.csv
  - elix_clean.csv
  - cpt_clean.csv
  - admt_clean.csv
  - col_pressure.csv
  - path_cat_table_matching.csv
  - unmatched_case_controls_features.csv
- HO_infxn_build_matched_cohorts.R
  Description: Match cases to controls for each target pathogen cohort for environmental and patient analyses
  Dependent Scripts
  - HO_infxn_functions.R
  Input Files
  - path_cat_table_matching.csv
  - unmatched_case_controls_features.csv
  Output Files
  - final_cohort.csv
  - final_dataset_for_models.csv
- HO_infxn_models.R
  Description: Models for estimating the impact of colonization pressure and prior antibiotic exposure on in-hospital acquisition of the target pathogen
  Input Files
  - final_dataset_for_models.csv
  Output Files
  - clogit_coefficients.csv
  - xgboost_feature_importance.csv
- HO_infxn_tables_figures.R
  Description: Data visualizations
  Input Files
  - final_dataset_for_models.csv
  - clogit_coefficients.csv
  Output Files
  - None
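The input and output files listed above imply an order in which the scripts must run. As a rough aid, the Python sketch below (not part of the repository) transcribes that listing and derives one valid run order with a topological sort. It also lists the raw files that no script produces and that a user would therefore need to supply. The `PIPELINE` dictionary and variable names are illustrative assumptions. HO_infxn_functions.R is left out because it is sourced by other scripts rather than run on files.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# script -> (input files, output files), transcribed from the listing above.
PIPELINE = {
    "HO_infxn_C_Diff_micro_query.R": (
        ["edw_cdiff.csv", "EDW_Cdiff_results_map.csv"],
        ["Cdiff.csv"],
    ),
    "HO_infxn_micro_prep.R": (
        ["micro_raw.csv", "Cdiff.csv"],
        ["micro_ground_truth.csv"],
    ),
    "HO_infxn_input_data_preprocessing.R": (
        ["micro_ground_truth.csv", "ADT.csv", "abx.csv", "abx_map.csv",
         "IP_ED_encounters.csv", "encounters.csv"],
        ["ADT_clean.csv", "micro_dedup.csv", "abx_prelim_clean.csv",
         "abx_courses.csv", "enc_clean.csv", "admt_clean.csv"],
    ),
    "HO_infxn_build_unmatched_cohorts.R": (
        ["ADT_clean.csv", "micro_dedup.csv", "abx_courses.csv", "enc_clean.csv",
         "micro_ground_truth.csv", "abx_prelim_clean.csv"],
        ["adt_micro_raw.csv", "path_cat_table.csv",
         "unmatched_case_controls_no_features.csv"],
    ),
    "HO_infxn_add_features.R": (
        ["unmatched_case_controls_no_features.csv", "adt_micro_raw.csv",
         "demographics.csv", "abx_courses.csv", "elixhauser.csv",
         "procedures.csv", "admt_clean.csv", "department_mapping.csv",
         "enc_clean.csv", "ADT.csv", "micro_ground_truth.csv",
         "path_cat_table.csv"],
        ["dems_clean.csv", "abx_clean.csv", "elix_clean.csv", "cpt_clean.csv",
         "admt_clean.csv", "col_pressure.csv", "path_cat_table_matching.csv",
         "unmatched_case_controls_features.csv"],
    ),
    "HO_infxn_build_matched_cohorts.R": (
        ["path_cat_table_matching.csv", "unmatched_case_controls_features.csv"],
        ["final_cohort.csv", "final_dataset_for_models.csv"],
    ),
    "HO_infxn_models.R": (
        ["final_dataset_for_models.csv"],
        ["clogit_coefficients.csv", "xgboost_feature_importance.csv"],
    ),
    "HO_infxn_tables_figures.R": (
        ["final_dataset_for_models.csv", "clogit_coefficients.csv"],
        [],
    ),
}

# Map each produced file to the script(s) that write it.
producers = {}
for script, (_, outputs) in PIPELINE.items():
    for name in outputs:
        producers.setdefault(name, set()).add(script)

# A script depends on every *other* script that writes one of its inputs.
deps = {
    script: {p for name in inputs for p in producers.get(name, ()) if p != script}
    for script, (inputs, _) in PIPELINE.items()
}
order = list(TopologicalSorter(deps).static_order())

# Files consumed somewhere but produced by no script must be supplied by the user.
raw_inputs = sorted(
    {name for inputs, _ in PIPELINE.values() for name in inputs} - set(producers)
)

for step, script in enumerate(order, start=1):
    print(step, script)
print("Raw inputs:", ", ".join(raw_inputs))
```

Per the description above, the C. difficile query step is only needed when those results are stored separately from other microbiology data.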
Publicly released data files (located in the Physionet repository: Predictors of Hospital Onset Infection: A Matched Retrospective Cohort Dataset)
- final_dataset_for_models.csv
  Description: The final data, ready for running conditional logistic regression (CLR) and XGBoost models for environmental and patient analyses
  Key Columns:
  - match: Type of matching algorithm (environmental vs. patient analysis)
  - run: Name of the target pathogen
  - group: Flag for control or case
  - [abx]_0_60: Number of antibiotic courses in the prior 60 days
  - [organism]_cp: Calculated colonization pressure of the organism for the sample
  - elix_index_mortality: Calculated Elixhauser index for the sample
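As a quick orientation to these columns, the Python sketch below filters rows by `match` and `run` and averages one CP column per `group`. It is an illustration only. The column names `match`, `run`, and `group` come from the description above; the values in the sample (`environmental`, `MRSA`, `case`, `control`) and the column name `mrsa_cp` are hypothetical stand-ins, so check the actual file for the real encodings.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Tiny in-memory stand-in for final_dataset_for_models.csv (hypothetical values).
SAMPLE = """match,run,group,mrsa_cp
environmental,MRSA,case,0.20
environmental,MRSA,control,0.10
environmental,MRSA,control,0.05
patient,MRSA,case,0.30
"""

def mean_cp_by_group(handle, match, run, cp_column):
    """Average `cp_column` per value of `group` for one analysis and pathogen."""
    values = defaultdict(list)
    for row in csv.DictReader(handle):
        if row["match"] == match and row["run"] == run:
            values[row["group"]].append(float(row[cp_column]))
    return {group: mean(cps) for group, cps in values.items()}

result = mean_cp_by_group(io.StringIO(SAMPLE), "environmental", "MRSA", "mrsa_cp")
print(result)
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("final_dataset_for_models.csv", newline="")`.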
To ensure compliance with HIPAA guidelines and protect patient privacy:
- Patient IDs are anonymized, and all identifiable information is removed.
- Patients older than 90 years are removed from the analysis to ensure patient privacy.
- Hospital units are assigned anonymous random numbers.
- Hospital admission and discharge dates and times are replaced with random values that preserve relative timing within the same patient.
- This dataset is publicly available on Physionet (noted above) for research purposes.
- Users are encouraged to contact the codebase/dataset creators for support or further clarification if needed.
- R (v 4.4.0) was used to construct the dataset and run the models.
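The date anonymization noted in the list above preserves relative timing within a patient. One common way to get that property is to shift all of a patient's timestamps by a single random offset. The Python sketch below illustrates that idea only; it is an assumption about the general technique and not a description of how this dataset was actually de-identified. The function name, the `max_days` window, and the toy events are hypothetical.

```python
import random
from datetime import datetime, timedelta

def shift_dates(events, max_days=365, seed=None):
    """Shift each patient's timestamps by one random offset per patient.

    `events` is a list of (patient_id, timestamp) pairs. Because every
    timestamp of a given patient gets the same offset, the intervals between
    that patient's events are unchanged.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for patient, timestamp in events:
        if patient not in offsets:
            offsets[patient] = timedelta(
                days=rng.randint(-max_days, max_days),
                minutes=rng.randint(0, 1439),
            )
        shifted.append((patient, timestamp + offsets[patient]))
    return shifted

# Toy admission and discharge times (hypothetical data).
events = [
    ("P1", datetime(2024, 1, 1, 8, 0)),    # P1 admission
    ("P1", datetime(2024, 1, 5, 12, 30)),  # P1 discharge
    ("P2", datetime(2024, 2, 10, 9, 15)),  # P2 admission
]
shifted = shift_dates(events, seed=0)

# The length of stay for P1 is the same before and after shifting.
print(events[1][1] - events[0][1], shifted[1][1] - shifted[0][1])
```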
This study was approved by the Institutional Review Board (IRB) of Massachusetts General Brigham health system with a waived requirement for informed consent.
When using the code and dataset, please cite: Sagers L, Wei Z, McKenna C, Chan C, Agan AA, Pak TR, Rhee C, Klompas M, Kanjilal S. A proof-of-concept study for colonization pressure as a real-time machine-learning risk metric for nosocomial acquisition. Nature Communications 2026.
For questions, clarifications, or further support, please contact:
Sanjat Kanjilal, MD, MPH
Department of Medical Microbiology and Infection Prevention
Amsterdam University Medical Center
s.kanjilal@amsterdamumc.nl

