MySurgeryRisk Model Card

Model Details

Overview

MySurgeryRisk is a predictive model that estimates the likelihood that a patient will require prolonged mechanical ventilation (MV) following a major surgical procedure. Specifically, it forecasts the risk of a patient needing mechanical ventilation for more than 48 hours post-surgery.

**Figure 1. Temporal Associations Between Automated Real-Time Data Inputs and Outcome Prediction Windows**

Owners

University of Florida Intelligent Clinical Care Center (ic3-center@ufl.edu)

Version

v1.0, Dec 5, 2024

License

CC BY-NC 4.0

Model Sources

Reference Papers

Yuanfang Ren, Tyler J Loftus, Shounak Datta, Matthew M Ruppert, Ziyuan Guan, Shunshun Miao, Benjamin Shickel, Zheng Feng, Chris Giordano, Gilbert R Upchurch, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac. "Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Predict Postoperative Complications and Report on a Mobile Platform". JAMA Network Open, 2022. DOI: 10.1001/jamanetworkopen.2022.11973
Azra Bihorac, Tezcan Ozrazgat-Baslanti, Ashkan Ebadi, Amir Motaei, Mohcine Madkour, Panagote M Pardalos, Gloria Lipori, William R Hogan, Philip A Efron, Frederick Moore, Lyle L Moldawer, Daisy Zhe Wang, Charles E Hobson, Parisa Rashidi, Xiaolin Li, Petar Momcilovic. "MySurgeryRisk: Development and Validation of a Machine-learning Risk Algorithm for Major Complications and Death After Surgery". Annals of Surgery, 2019. DOI: 10.1097/SLA.0000000000002706

Model Parameters

Architecture

The model is a random forest classifier.

Input

Tabular data with 78 features, including: 1) Socio-demographics (e.g., age, sex, race, ethnicity, language, area median income); 2) Admission information (e.g., emergent admission, admission source, night admission); 3) Comorbidities (e.g., diabetes, hypertension, cancer); 4) Scheduled procedure information (e.g., procedure code, surgeons, anesthesia type); 5) Historical medications (e.g., vancomycin, aspirin, beta-blockers); 6) Preoperative laboratory results (e.g., serum creatinine, hemoglobin, serum anion gap).

Additional Document

Feature List - See Supplementary Table 1

Output

The model outputs a probability score, ranging from 0 to 1, indicating the likelihood of a patient requiring prolonged MV post-surgery.

Training Datasets

UFH Gainesville training dataset: The dataset included all patients 18 years or older who were admitted to University of Florida Health (UFH) Gainesville for any type of inpatient surgical procedure. The final cohort consisted of 41,812 patients who received 52,117 procedures between June 1, 2014 and November 27, 2018. Each patient's medical record contained heterogeneous variables (eg, demographic characteristics and medical history, diagnoses and procedures, medications, laboratory results, and vital signs).

Labeling: The use of mechanical ventilation was identified using EHR data representing respiratory devices, ventilation modes, and measured values for respiratory vitals, including oxygen flow rate, tidal volume, and positive end-expiratory pressure. The detailed logic for mechanical ventilation identification is illustrated in Figure 2. The outcome distribution is presented in Figure 3.

**Figure 2. The logic for the identification of mechanical ventilation use**

**Figure 3. The outcome distribution of training dataset across all patients and subgroups stratified by sex, race and age**

Evaluation Datasets

UFH Gainesville evaluation dataset: The dataset included all patients 18 years or older who were admitted to University of Florida Health (UFH) Gainesville for any type of inpatient surgical procedure. The final cohort consisted of 19,132 patients who received 22,300 procedures between November 28, 2018 and September 20, 2020. We present the outcome distribution in Figure 4.

**Figure 4. The outcome distribution of the evaluation dataset across all patients and subgroups stratified by sex, race and age**

Training Details

Hyperparameters were selected using 5-fold cross-validation, and the final model was then trained on the entire training dataset with the selected values.

Training Hyperparameters

min_samples_leaf	10
n_estimators	1500
max_features	10
class_weight	balanced
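
The hyperparameter names above match the arguments of scikit-learn's RandomForestClassifier, although this card does not name the library that was used. Assuming scikit-learn, a minimal sketch of a classifier configured with the listed values could look like the following. The build_model helper and the random_state value are illustrative assumptions and are not part of the published model.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters exactly as listed in the table above.
HYPERPARAMETERS = {
    "min_samples_leaf": 10,
    "n_estimators": 1500,
    "max_features": 10,
    "class_weight": "balanced",
}


def build_model(random_state=0, **overrides):
    """Return an unfitted random forest configured with the listed values.

    random_state is an assumption added for reproducibility; the model
    card does not report a seed. Keyword overrides replace listed values,
    which is handy for quick experiments with fewer trees.
    """
    params = {**HYPERPARAMETERS, **overrides}
    return RandomForestClassifier(random_state=random_state, **params)


# Unfitted estimator; call model.fit(X, y) on the 78-feature table to train.
model = build_model()
```

After fitting, model.predict_proba(X)[:, 1] would give the probability of the positive class, which corresponds to the 0 to 1 risk score described under Output.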

Quantitative Analysis

Metrics

Model performance was evaluated using several metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). 95% confidence intervals (CI) for all performance measures were calculated using bootstrap sampling and nonparametric methods. Detailed evaluation results are presented in the 'Evaluation Results' section.
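
As an illustration of the threshold-based metrics and their confidence intervals, the sketch below computes sensitivity, specificity, PPV, and NPV from binary labels and predictions, then estimates a 95% interval with a percentile bootstrap. The percentile method, the number of resamples, the seed, and the toy data are assumptions; the card does not specify the exact bootstrap procedure or the probability threshold used to binarize predictions. AUROC and AUPRC are not covered here.

```python
import random


def threshold_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one metric (assumed procedure)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        value = threshold_metrics(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )[metric]
        if value == value:  # drop NaN from resamples where the metric is undefined
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy data for illustration only; not drawn from the evaluation dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = threshold_metrics(y_true, y_pred)
spec_ci = bootstrap_ci(y_true, y_pred, "specificity")
```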

Evaluation Results

UFH Gainesville evaluation dataset
AUROC	0.91 (0.9-0.91)
AUPRC	0.45 (0.41-0.48)
NPV	0.99 (0.99-0.99)
PPV	0.21 (0.2-0.24)
Sensitivity	0.85 (0.82-0.87)
Specificity	0.82 (0.8-0.84)

**Figure 5. AUROC curve for prolonged mechanical ventilation complication**

**Figure 6. AUPRC curve for prolonged mechanical ventilation complication**

Explainability

Utilizing SHapley Additive exPlanations (SHAP) on the evaluation dataset, we identified the key features contributing to prolonged MV risk prediction, as illustrated in Figure 7. The primary procedure code emerged as the most significant feature. Other top contributors included the attending surgeon, preoperative serum calcium and glucose levels, and surgery type, all ranking among the five most influential features.

**Figure 7. Ten important features contributing to the prolonged MV risk prediction**

Bias and Fairness

We evaluated bias in the dataset and in the prediction model across three sensitive attributes: sex, race, and age. The evaluation results are shown in Figure 8 and Figure 9. The Four-Fifths Rule (80% rule) was applied to determine whether bias is present. We observed that while our prediction model and dataset satisfy several important fairness criteria, such as statistical parity, average odds, and equal opportunity, the disparate impact metric indicates potential unfairness in selection rates across groups defined by sex and race. This suggests that while the model maintains overall balance in its predictions, there may be subtle distributional differences that disproportionately affect certain groups (Figures 3 and 4).

Metrics

Disparate Impact (DI): DI compares the proportion of individuals that receive a favorable outcome for two groups, a protected group and a reference group. DI=P(outcome|protected group)/P(outcome|reference group).
Statistical Parity Difference (SPD): SPD measures the difference between the proportions of the protected and reference groups that receive a favorable outcome. SPD=P(outcome|protected group)-P(outcome|reference group).
Equal Opportunity Difference (EOD): EOD measures the difference in true positive rates (TPR) between the protected group and the reference group. EOD=TPR(protected group)-TPR(reference group).
Average Odds Difference (AOD): AOD measures the average of two differences: 1) The difference in false positive rates (FPR) between groups; 2) The difference in true positive rates (TPR) between groups. AOD = 0.5 * [(FPR(protected group)-FPR(reference group)) + (TPR(protected group)-TPR(reference group))].
Theil Index (TI): TI measures the inequality in the distribution of outcomes across different groups. Lower values indicate more equality among groups. T=(1/n) * Σ [(yi/μ) * ln(yi/μ)], where n is the number of groups, yi is the outcome for group i, μ is the mean outcome across all groups and ln is the natural logarithm.
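
The definitions above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used to produce Figures 8 and 9: "outcome" is taken to be a positive prediction, groups are passed as pairs of 0/1 lists, the Four-Fifths Rule is checked as a symmetric band of 0.8 to 1.25 on DI, and groups with a zero outcome contribute zero to the Theil sum. The sketch assumes each group contains both positive and negative cases and that the reference selection rate is non-zero.

```python
import math


def _rate(values):
    """Mean of a list of 0/1 values."""
    return sum(values) / len(values)


def group_rates(y_true, y_pred):
    """Selection rate, TPR and FPR for one group."""
    selection = _rate(y_pred)
    tpr = _rate([p for t, p in zip(y_true, y_pred) if t == 1])
    fpr = _rate([p for t, p in zip(y_true, y_pred) if t == 0])
    return selection, tpr, fpr


def fairness_metrics(protected, reference):
    """Compute DI, SPD, EOD and AOD; each argument is (y_true, y_pred)."""
    sel_p, tpr_p, fpr_p = group_rates(*protected)
    sel_r, tpr_r, fpr_r = group_rates(*reference)
    di = sel_p / sel_r
    return {
        "DI": di,
        "SPD": sel_p - sel_r,
        "EOD": tpr_p - tpr_r,
        "AOD": 0.5 * ((fpr_p - fpr_r) + (tpr_p - tpr_r)),
        # Assumed reading of the Four-Fifths Rule as a symmetric band.
        "passes_four_fifths": 0.8 <= di <= 1.25,
    }


def theil_index(outcomes):
    """T = (1/n) * sum((y_i / mu) * ln(y_i / mu)) over group outcomes y_i."""
    mu = sum(outcomes) / len(outcomes)
    return sum((y / mu) * math.log(y / mu) for y in outcomes if y > 0) / len(outcomes)


# Toy groups for illustration only.
protected = ([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
reference = ([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
result = fairness_metrics(protected, reference)
```

In this toy case the protected group's selection rate is 0.4 against 0.6 for the reference group, so DI is about 0.67 and the Four-Fifths check fails even though the false positive rates are equal.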

**Figure 8. The summary of dataset bias across subgroups stratified by sex, race and age**

**Figure 9. The summary of model bias across subgroups stratified by sex, race and age**

Considerations

Primary Use Cases

Predict the postoperative complication of prolonged MV in patients undergoing major surgery.

Intended Users

Surgeons and anesthesiologists: To aid in preoperative risk assessment and surgical planning
Intensive care unit (ICU) physicians: To assist in postoperative patient management and resource allocation
Nursing staff: To help in monitoring patients and early identification of potential complications
Machine learning researchers: To serve as a benchmark for developing and comparing new algorithms in healthcare prediction tasks

Out of Scope Use Cases

Predict complications for non-surgical patients or patients undergoing minor procedures
Apply the model to pediatric populations or in healthcare settings significantly different from where it was developed and validated
Use the model as a sole determinant for patient care decisions without clinical judgment

Limitations

The model's performance may degrade when encountering test data with a distribution that significantly differs from the training dataset.
Distribution Shift: If the test data exhibits characteristics or patterns that were not well-represented in the training data, the model's accuracy and reliability may be compromised.
Missing Data: The model's performance is particularly sensitive to variations in missing data rates. For instance, if the primary procedure code is missing at a substantially higher rate in the test data compared to the training dataset, this discrepancy could adversely affect the model's predictions.
Generalizability: As the training dataset is derived from a single center, the model may not generalize well to other healthcare facilities or geographical regions. Different centers may have varying patient demographics, clinical practices, coding systems, or data collection methods, which could impact the model's performance when applied in new settings.
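
One way to monitor the missing-data concern described above is to compare per-feature missing rates between the training data and new data before scoring. The sketch below is a hypothetical check, not part of the published pipeline: the row format (dictionaries with None for missing values), the feature names, and the 0.05 tolerance are all assumptions chosen for illustration.

```python
def missing_rates(rows, features):
    """Fraction of rows in which each feature is missing (None or absent)."""
    n = len(rows)
    return {f: sum(1 for r in rows if r.get(f) is None) / n for f in features}


def flag_missingness_shift(train_rows, test_rows, features, max_increase=0.05):
    """Return features whose missing rate rose by more than max_increase.

    max_increase is an illustrative tolerance, not a validated threshold.
    The result maps each flagged feature to (train_rate, test_rate).
    """
    train = missing_rates(train_rows, features)
    test = missing_rates(test_rows, features)
    return {
        f: (train[f], test[f])
        for f in features
        if test[f] - train[f] > max_increase
    }


# Hypothetical feature names and toy rows for illustration only.
features = ["primary_procedure_code", "serum_creatinine"]
train_rows = [
    {"primary_procedure_code": "A", "serum_creatinine": 0.9},
    {"primary_procedure_code": "B", "serum_creatinine": 1.1},
    {"primary_procedure_code": "A", "serum_creatinine": None},
    {"primary_procedure_code": "C", "serum_creatinine": 1.0},
]
test_rows = [
    {"primary_procedure_code": None, "serum_creatinine": 0.8},
    {"primary_procedure_code": "B", "serum_creatinine": None},
    {"primary_procedure_code": None, "serum_creatinine": 1.2},
    {"primary_procedure_code": "A", "serum_creatinine": 1.0},
]
flagged = flag_missingness_shift(train_rows, test_rows, features)
```

Here only the procedure code is flagged, because its missing rate rises from 0 to 0.5 while the creatinine rate stays at 0.25 in both sets.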

Ethical Considerations

Risk of perpetuating or amplifying existing biases in healthcare due to the use of socio-demographic features (age, sex, race, ethnicity, language, area median income)
Potential for unfair predictions resulting from manual feature selection bias.
