Introduction

In patients admitted to the emergency department (ED), survival is mainly determined by the severity of acute illness on admission [1, 2] and the quality of care provided throughout the entire treatment process [3]. The patient’s deterioration or unexpected death is often preceded by abnormalities in vital signs [4, 5]. The causes leading to deterioration are potentially treatable in a majority of patients and it is reasonable to assume that any intervention will be more efficient if initiated early [6, 7]. Less than optimal care in this setting is due to a failure to recognise signs of organ dysfunction, delayed alarm and subsequent failure to intervene rapidly and adequately [8–10]. As large EDs admit high numbers of patients, it is unrealistic to expect that all patients can be immediately assessed by an expert physician. In most EDs, patients are first assessed by ED nursing staff [11, 12]. Their subjective impression is used to integrate the available patient history and the first physical examination, and this serves to trigger an expedited evaluation by a physician [13, 14]. However, nurses and physicians agree only moderately when assessing the severity of a patient’s condition [15, 16].

Various scoring systems have been proposed for more systematic initial assessment and simplified identification of patients at risk [17–21]. The discriminant value of these scores is only moderate and varies between studies [22]. The suboptimal performance of scoring systems is multifactorial. The choice of variables and their weight may be based on clinical experience and intuition [23]. To make scores easier to calculate, simplified correlations between pathological vital signs and outcome are used. Continuous variables, such as blood pressure or heart rate, are stratified into pathological and non-pathological ranges, leading to the loss of some discriminative power. It may be wrongly assumed that there is a linear relationship between abnormal values and the risk of death. Additionally, ED scores usually do not account for the fact that vital signs have already potentially been influenced by supportive treatment (such as vasopressors or supplemental oxygen) at the time of the initial assessment.

The aim of this study was to methodically construct and validate a systematic prognostic model based on all information available at the time of arrival in the ED and to compare the discriminative performance of this newly created prediction model with a subjective risk estimate by the first ED health-care provider triaging the patient. We hypothesised that the model-based risk evaluation would allow the reliable identification of patients at greater risk of mortality and would be superior to a non-systematic, experience-based assessment.

Methods

Study design

Prospective cohort study for the development and validation of a clinical prediction model.

Setting

The study was performed in the Departments of Intensive Care Medicine and Emergency Medicine of the Bern University Hospital, a 960-bed tertiary care centre. The ED provides initial treatment of all adult emergency patients. All ED staff hold a Bachelor of Science in Nursing and around 45 % have completed postgraduate training in emergency care nursing.

Patients

This study includes all patients admitted to the hospital via the ED during 1 year starting from October 2009. Patients remaining in the ED for more than 24 h before being discharged home and patients who died in ED were rated as hospital admissions. Patients treated on an outpatient basis were excluded retrospectively.

Variables and data sources

Study data was collected by triage ED staff from paper medical records and was entered into a study database by research staff. Data included age, gender, time of admission, cause of admission (APACHE II diagnostic category [24]) as established by the ED nurse, vital sign measurements for respiratory rate (current impedance pneumography), oxygen saturation (pulse oximetry), systolic and diastolic blood pressure (sphygmomanometry), heart rate (electrocardiogram), Glasgow coma score (GCS), body temperature and peripheral perfusion (capillary refill time over/under 3 s) and the presence of a threatened airway. The ED staff noted the following interventions that had occurred before the time of ED arrival: endotracheal intubation, mechanical ventilation, ongoing cardiopulmonary resuscitation (CPR), continuous infusion of vasopressors or inotropes, fraction of inspired oxygen (FiO2) or litres per minute of supplemental oxygen and use of sedative drugs. In sedated patients, the last GCS before the start of sedation was included. For comparison purposes (nurse risk estimate), the ED nurse recorded their personal subjective estimate of the likelihood that the patient would not survive this hospital stay, on the basis of all available information on arrival, and employing a scale from 1 (very low risk of death) to 10 (very high risk of death).

Statistical methods

Several variables were transformed before inclusion in the analysis. Saturation index [25] was calculated as the oxygen saturation (SpO2) divided by FiO2. The saturation index was categorised as being below or above the median of all observations (4.2 arbitrary units). The transformation of litres of oxygen per minute to FiO2 was based on current guidelines [26, 27]. Respiratory rate was categorised according to the cut-offs of the APACHE II scoring system. Age and mean blood pressure were included in the model, together with their quadratic terms. Chained equations were used to generate a single stochastically imputed data set for missing observations; the imputation equations included all potential predictors and the outcome.
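The saturation index transformation can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, SpO2 and FiO2 are both assumed to be expressed in percent, values exactly equal to the cut-off are assumed to count as "not below", and the example values are invented rather than taken from the study data.

```python
def saturation_index(spo2_percent: float, fio2_percent: float) -> float:
    """Ratio of oxygen saturation to inspired oxygen fraction (both in percent)."""
    if fio2_percent <= 0:
        raise ValueError("FiO2 must be positive")
    return spo2_percent / fio2_percent


def below_cutoff(index: float, cutoff: float = 4.2) -> bool:
    """True if the index is below the cut-off (4.2 is the cohort median in the text)."""
    return index < cutoff


# Hypothetical example values, not study data
room_air = saturation_index(97.0, 21.0)   # same SpO2 on room air
on_oxygen = saturation_index(97.0, 40.0)  # same SpO2 with supplemental oxygen

print(round(room_air, 2), below_cutoff(room_air))
print(round(on_oxygen, 2), below_cutoff(on_oxygen))
```

In this sketch the same SpO2 reading falls on different sides of the cut-off depending on the oxygen supplied, which is the reason for dividing by FiO2.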

The model was based on a multivariable logistic regression for the probability of death during the hospital stay. Since the data included multiple admissions for some of the patients (9.3 %), 500 distinct data sets of single patient observations were generated (permuted data sets) by randomly choosing one observation for each patient with multiple admissions. A logistic regression including all the potential predictors was fitted to each permuted data set. Stepwise backward elimination was performed, with retention of predictors with p < 0.1 in the Wald test. Predictors maintained in over 67 % of the resulting models were included in the final prediction model. Nagelkerke’s R² and the Brier score were used as measures of overall model performance [28, 29]. Model discrimination was measured as the area under the receiver operating characteristic curve (AUROC) and model calibration as the calibration slope for each permuted model; calibration was also assessed visually. Measures are reported for the final model as the average over all permutations. Bootstrapping was used as the validation method, since no external data source was available. The validation procedure was performed within each permuted data set by drawing, with replacement, 300 samples of the same size as the original data set [30, 31]. To assess overfitting, model optimism was estimated by bootstrapping that included the variable selection process. As an alternative method for internal validation, a split-sample approach was used for comparing sensitivity (Electronic Supplementary Material).
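The two bookkeeping steps described above, drawing one admission per patient and keeping predictors selected in more than 67 % of the fitted models, might look like the sketch below. The record layout, function names and toy data are assumptions made for illustration. The logistic regression and backward elimination themselves are not reproduced; the per-model selection results are supplied as a hypothetical list.

```python
import random
from collections import defaultdict


def draw_permuted_dataset(admissions, rng):
    """Keep one randomly chosen admission per patient."""
    by_patient = defaultdict(list)
    for adm in admissions:
        by_patient[adm["patient_id"]].append(adm)
    return [rng.choice(rows) for rows in by_patient.values()]


def retained_predictors(selected_per_model, threshold=0.67):
    """Predictors selected in more than `threshold` of the fitted models."""
    n_models = len(selected_per_model)
    counts = defaultdict(int)
    for selected in selected_per_model:
        for name in set(selected):
            counts[name] += 1
    return sorted(n for n, c in counts.items() if c / n_models > threshold)


# Hypothetical toy admissions: patients 1 and 3 were admitted repeatedly
admissions = [
    {"patient_id": 1, "admission": 1},
    {"patient_id": 1, "admission": 2},
    {"patient_id": 2, "admission": 1},
    {"patient_id": 3, "admission": 1},
    {"patient_id": 3, "admission": 2},
    {"patient_id": 3, "admission": 3},
]
rng = random.Random(0)  # fixed seed so the draw is reproducible
permuted = [draw_permuted_dataset(admissions, rng) for _ in range(500)]

# Hypothetical selection results from four backward-elimination runs
selections = [
    ["age", "gcs"],
    ["age", "gcs", "heart_rate"],
    ["age", "gcs"],
    ["age"],
]
final = retained_predictors(selections)
print(final)
```

Here `age` is kept in all four runs and `gcs` in three of four (75 %), so both pass the threshold, while `heart_rate` (25 %) does not.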

The performance of the developed model was compared to the performance of the nurse risk estimate as the sole predictor. Likelihood ratio tests were used to compare the fit of a model including the selected variables and the nurse estimation to that of a model with the nurse estimate alone.
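For nested logistic models such as these, the likelihood ratio statistic takes the standard form sketched below. The symbols are introduced here for illustration and do not appear in the source: $\ell_{\text{nurse}}$ and $\ell_{\text{full}}$ denote the maximised log-likelihoods of the model with the nurse estimate alone and of the model with the nurse estimate plus the selected variables, and $k$ is the number of additional parameters in the larger model.

```latex
\Lambda = -2\left(\ell_{\text{nurse}} - \ell_{\text{full}}\right),
\qquad
\Lambda \sim \chi^2_{k} \quad \text{under the null hypothesis that the additional coefficients are zero.}
```

A large value of $\Lambda$ relative to the $\chi^2_{k}$ distribution indicates that the selected variables improve the fit beyond the nurse estimate alone.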

The variable selection process was repeated, with the inclusion of the nurse estimate as a potential predictor to test whether it would be selected as being important. Additionally, the DeLong test was used to compare the developed model’s AUROC to that of a model based on the nurse estimate alone.

Continuous variables are summarised as mean ± standard deviation or median and lower (Q1) and upper (Q3) limits of interquartile range for parametric and non-parametric data, respectively. Analyses were performed using Stata 13.1 (Stata Corporation, Texas).

Results

During the study period, there were 8,607 emergency hospital admissions of 7,680 patients (38 % women). A total of 6,965 patients were admitted once and 715 patients were admitted between two and seven times. The most frequent APACHE II diagnostic categories at the time of admission were neurological (2,052, 24 %), trauma (1,522, 18 %), infection categories [1,328, 15 %; including sepsis (357, 4.1 %), severe sepsis (249, 2.9 %), septic shock (27, 0.3 %)], cardiovascular (1,022, 12 %), gastrointestinal (848, 10 %) and respiratory (449, 5 %). Few data were missing apart from respiratory rate (16 % of observations). Follow-up to hospital discharge was complete in all patients. Mean patient age was 58 ± 20 years (range 16 to 100 years). In 398 (4.6 %) of all admissions, the patient died during the subsequent hospital stay (Table 1). The median length of hospital stay (LOS) of all admissions was 5.67 days (Q1–Q3, 2.74–10.51). The median LOS of hospital survivors and non-survivors was 5.74 (Q1–Q3, 2.83–10.57) and 2.77 (Q1–Q3, 0.73–8.95) days, respectively.

Table 1 Parameters determined by patient history, vital sign assessment and treatment at the time of ED admission, stratified by hospital survival

Among the vital signs, low blood pressure, reduced peripheral circulation, a decreased GCS and a saturation index below 4.2 indicated an increased risk of death, as did higher age, mechanical ventilation on ED arrival and a history of an ED visit in the preceding 12 months. Quadratic terms indicating a non-linear association of age and mean blood pressure with the probability of death were retained in the model. In comparison with patients with respiratory problems, patients admitted for cardiovascular or neurological problems or trauma had a lower risk of death and patients with gastrointestinal or non-specified problems had a higher risk of death. Patient gender, intubation, defibrillation and CPR before or at arrival to ED, heart rate, temperature, low or high BP, and threatened airway were not retained in the final model. Table 2 shows the parameters used in the final model.
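The role of the quadratic terms can be made explicit by writing out the general shape of such a model. The notation below is illustrative only: the $\beta$ and $\gamma$ symbols are placeholders rather than the fitted values in Table 2, MBP denotes mean blood pressure, and $x_j$ stands for the remaining retained predictors.

```latex
\operatorname{logit}(p)
  = \ln\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\text{age} + \beta_2\,\text{age}^2
  + \beta_3\,\text{MBP} + \beta_4\,\text{MBP}^2
  + \sum_j \gamma_j x_j
```

With a non-zero $\beta_2$ or $\beta_4$, the change in log-odds of death per unit of age or MBP depends on the current value of that variable, which is what a non-linear association means in this setting.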

Table 2 Coefficients of the final model

Model performance and validation

The model appeared well calibrated by all indicators (Fig. 1a, b). The model explained approximately half of the observed variation with optimism-corrected mean Nagelkerke’s R² at 0.48 (reduced from 0.50, 95 % range 0.494–0.509). The overall optimism-corrected Brier score was 0.028 (corrected from sample’s 0.027, 95 % range 0.026–0.027), suggesting good calibration and high validity. The AUROC of the final model showed good discriminative ability and good internal validity at 0.920 (optimism corrected from 0.922, 95 % range 0.916–0.927; Fig. 2).
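For readers less familiar with these measures, the Brier score and the AUROC can be computed from predicted probabilities and observed outcomes as in the sketch below. The function names and the five toy observations are hypothetical, and the optimism correction by bootstrapping is not reproduced here.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auroc(probs, outcomes):
    """Share of (death, survivor) pairs in which the death has the higher
    predicted risk; tied predictions count as half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions and outcomes (1 = died in hospital)
probs = [0.05, 0.10, 0.20, 0.70, 0.90]
outcomes = [0, 0, 1, 0, 1]

print(brier_score(probs, outcomes))
print(auroc(probs, outcomes))
```

A lower Brier score and a higher AUROC both indicate better performance; an AUROC of 0.5 corresponds to chance-level discrimination.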

Fig. 1

Calibration plot showing (a) all deciles and (b) the lower nine deciles of predictions. Predicted probabilities using the model’s coefficients are aggregated to deciles of patients. The mean observed death rate in each decile is the percentage (with 95 % confidence interval) of observed deaths among all observations in this decile, marked by black triangles and lines. The dashed line indicates the optimal 1:1 fit. The dotted line represents the locally weighted scatterplot smoothing (LOWESS) smoother of the predictions. The calibration slope was 0.95, indicating good calibration
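The decile aggregation behind such a calibration plot can be sketched as follows. This is an illustration under stated assumptions: the function name and toy data are hypothetical, groups are formed by sorting on predicted risk and cutting into equal-sized blocks, and the confidence intervals and LOWESS smoother mentioned in the caption are omitted.

```python
def calibration_by_decile(probs, outcomes, n_groups=10):
    """Sort by predicted risk, split into equal-sized groups and return
    (mean predicted risk, observed death rate) for each group."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # fewer observations than groups
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate))
    return rows


# Hypothetical data: 20 patients with increasing predicted risk
probs = [(i + 1) / 40 for i in range(20)]
outcomes = [0] * 16 + [1, 0, 1, 1]

rows = calibration_by_decile(probs, outcomes)
for mean_pred, obs_rate in rows:
    print(f"{mean_pred:.4f}  {obs_rate:.2f}")
```

Plotting the second column against the first gives the points of the calibration plot; points on the 1:1 line indicate agreement between predicted and observed risk.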

Fig. 2

ROC curve of the final model. Area under the curve (AUROC) was 0.922 (95 % range 0.916–0.927), indicating good internal validity

The nurse risk estimate

The nurse risk estimate model showed inferior performance to the developed model and was by itself not a significant predictor. The nurse risk estimate exhibited low explained variability (mean Nagelkerke’s R² = 0.17). The overall calibration and discriminative ability were good (Brier score 0.040, mean AUROC 0.78). The optimism for all measures (AUROC, Brier score and Nagelkerke’s R²) was negligible (<0.001), and the shrinkage factor calculated from the bootstrap validation was close to 1 (0.997). Likelihood ratio tests comparing the developed model with the model of the nurse estimate alone were always highly significant in favour of the developed model (all p < 0.001). A DeLong test comparing the AUROC of the nurse estimate model with that of the developed model was also highly significant in favour of the developed model (p < 0.001). The correlation between predictions of the developed model and the risk estimate model was relatively low (Spearman’s rho = 0.38, p < 0.001). The nurse risk estimate was not selected as a predictor for the final model when included in the variable selection process for model development. Compared to the developed model, the nurse risk estimate predicted a higher risk of death with a higher degree of variability in patients with mortality below 5 % and a lower risk in patients with the highest mortality risk (Fig. 3).

Fig. 3

Comparison of the risk of death predicted by the nurse risk estimate model and by the developed prediction model in deciles of patients according to the calibration plot. In the eight deciles of patients with lower mortality risk, the nurse risk estimate consistently predicted a higher risk of death than the developed model, with considerably higher variance in prediction. In the 20 % of patients with the highest risk of death, higher mortality risk was predicted by the model than by the nurse risk estimate. N, nurse risk estimate model; M, prediction model

Discussion

In this prospective cohort study, we found that increased risk of hospital mortality could be predicted in the initial phase after ED arrival on the basis of routinely collected and easily available clinical information. These included patient history, preliminary diagnosis, findings of the first clinical examination and the treatment interventions implemented at the time of ED arrival. The model showed overall good discrimination in identifying patients at risk; it was superior to the subjective risk estimate of the first health-care provider triaging the patient.

The main strength of our study is the use of readily available parameters to assess the mortality risk. The analysed patients represent a cohort originating from a population covering the whole spectrum of emergencies. The sample size was large enough to offer sufficient statistical power to test all examined variables and the prospective nature of the study ensured high data quality and reduced the risk of bias due to confounders.

The main limitations of our study are related to the single-centre design and the lack of an external validation sample. This reduces the generalizability of our model and creates a risk of overfitting the model to the data at hand. In this setting, split-sample or cross validation is sometimes applied. As a result of the inherent limitations of these methods [32], we chose bootstrapping for internal validation. In a split-sample approach used for comparing sensitivity, relevant variables were very similar (Electronic Supplementary Material). A limitation of the bootstrapping method is that model development processes, such as variable selection, must occur in a programmable manner in order to run them repeatedly. Therefore, the necessary categorisation of variables can affect estimation, depending on apparently arbitrary cut-off values. To minimise these effects, we based our categorisation decisions on clinical rationales or on the analysis of the frequency distribution of the variables. We used single and not multiple imputation to account for missing observations, mainly because of the already high complexity of the computational process in creating the model. However, respiratory rate was the only predictor with more than 10 % missing values and was not included in the final model. Hospital mortality was used as the outcome parameter for model development. The use of an alternative predefined outcome assessment period such as 7 or 28 days would have provided additional relevant information. In contrast to many other European countries, such outcome data is not readily available in Switzerland for data protection reasons, and the follow-up was therefore limited to the hospital stay. The complex nature of the model mandates the use of a calculation program which can be obtained from the authors.

Several risk scoring systems for triaging of unselected emergency patients have been described [21, 33–39], but the discriminative power of the presented model appears to be superior. In contrast to previous models, we solely used information available on admission (i.e. no laboratory tests) and also integrated patient history, preliminary diagnosis and treatments on ED admission. The discriminative power of scoring systems to identify patients with higher mortality reported as AUROC ranges from 0.65 to 0.90 in the validation samples, in comparison to the AUROC of 0.92 in our model [21, 33–36, 38, 40]. A comparison of AUROC values of previously published scores calculated with our patient data is provided in the Electronic Supplementary Material. Seymour et al. [39] evaluated a model based on out-of-hospital vital sign measurements to predict a combined endpoint of death and occurrence of critical illness during hospitalization. In this model the AUROC was 0.77. Goodacre et al. [21] derived several models predicting 7-day mortality in emergency medical admissions by stepwise inclusion of predictors. The AUROC increased from 0.80 for a model based on age and diagnostic code alone to 0.87 by inclusion of physiological variables and 0.90 by inclusion of blood test results. In both models non-linear associations between predictors and outcome were accounted for, either by categorizations using a priori determined cut-off points based on natural distributions or after plotting mortality against deciles of each variable. It seems conceivable that optimal modelling requires accounting for non-linear associations between predictors and outcome and, as our results show, integrating effects of treatment modalities on predictors.

In the present study, the predicted and observed risk of death was lower than 5 % in more than 80 % of patients. In these patients, the mean predicted and observed probabilities exhibit partially overlapping confidence intervals. However, model discrimination is substantial when comparing predicted and observed probabilities in patients with a mortality risk above 5 %. From a clinical point of view, it is more important to recognise patients with an increased risk of death than to exactly predict a low risk of death. Mechanical ventilation at the time of ED arrival and low saturation index, but not low SpO2 per se, were associated with higher mortality. This indicates that support by mechanical ventilation and increased FiO2 has to be taken into account when interpreting SpO2 levels in a patient with respiratory failure. A similar effect was expected but not observed for the predictive value of blood pressure measurements and the use of continuous infusions of vasopressors and/or inotropes at the time of ED arrival. Despite the high mortality (42 %) of patients receiving vasoactive drugs, this was not statistically significant. This may be related to the very small number of patients (ca. 0.5 %) receiving vasoactive drugs at the time of ED admission.

The prognostic performance of the subjective risk estimation by the triaging health-care provider was inferior to the developed model. It seems that higher mortality was predicted by triage nurses in the low-risk patients, and lower mortality in high-risk patients. This could potentially lead to misallocation of resources in low-risk situations and to delayed intervention in high-risk patients.

In conclusion, in this prospective cohort study, we found that a prediction model derived from readily available patient characteristics, physiological parameters and established treatment interventions registered during the first minutes after ED arrival was a strong indicator for the subsequent risk of death. Our data indicate that the use of the prediction model instead of an assessment based only on subjective evaluation can facilitate the identification of patients with higher mortality risk. We do not suggest that health-care providers ignore their personal experience when assessing patients and instead base their decisions solely on a scoring system. However, the model might provide a more systematic approach to the integration of all available information and facilitate the identification of patients with higher mortality risk who might be missed by an assessment based on subjective evaluation alone. This might be used to design interventions to ensure rapid commencement of treatment in patients at risk of an unfavourable outcome. The model might also serve to identify patients with a high risk of mortality in whom full treatment is not warranted. Early prediction of mortality in an end-of-life setting might prevent the use of unnecessarily aggressive treatment modalities.

Future research should be directed at external validation of this or similar prediction models using a multicentre approach and at ascertaining their effectiveness in clinical impact studies.