How do dialysis nurses and AI reason clinically? A scenario-based comparative study
- Open Access
- 31.01.2026
- Research
Abstract
Introduction
Nurses working in dialysis units are required to make frequent, high-stakes clinical decisions under conditions of uncertainty. During routine shifts, they must rapidly assess patient stability, interpret subtle changes in vital signs and symptoms, anticipate complications related to fluid balance or vascular access, and decide when escalation of care is required. These judgments are often made with limited physician availability and are shaped by both formal protocols and accumulated clinical experience [1].
Clinical reasoning in nursing is not a linear or purely algorithmic process. It involves the integration of technical knowledge, pattern recognition, contextual awareness, and professional judgment developed through repeated exposure to complex clinical situations. In dialysis care, this reasoning is particularly critical, as patients with end-stage kidney disease often present with overlapping symptoms, chronic instability, and competing clinical priorities [2‐4].
Technology has been increasingly recognized as a core element of the nursing metaparadigm, shaping professional knowledge, practice, and care. This framing supports examining artificial intelligence as a supportive resource for clinical reasoning that complements—rather than displaces—the humanistic and contextual foundations of nursing judgment [5‐7].
Despite its central role in patient safety, nursing clinical reasoning remains difficult to systematically evaluate, support, or teach at scale [8]. Most existing assessment methods rely on self-report, simulated checklists, or retrospective outcome measures, which fail to capture how nurses actually reason through complex, evolving situations in real time [9]. Recent advances in artificial intelligence, particularly large language models (LLMs), have prompted interest in whether such tools could support clinical decision-making [10]. While general-purpose models can generate structured clinical explanations, concerns remain regarding their transparency, contextual sensitivity, and alignment with nursing judgment. Importantly, most evaluations of AI systems have not benchmarked their reasoning directly against frontline nursing practice [11, 12].
To address this gap, the present study conducts a comparative examination of clinical reasoning among experienced dialysis nurses and artificial intelligence–based systems using standardized nephrology scenarios. The analysis focuses on similarities and differences in reasoning approaches in order to clarify how AI tools may be positioned as decision-support resources within nursing practice [13, 14].
Methods
Study design
This study employed a prospective, scenario-based comparative design to evaluate clinical reasoning in dialysis care. A scenario-based questionnaire was developed specifically for this study and comprised four standardized hemodialysis clinical vignettes, each accompanied by six open-ended questions designed to elicit key components of clinical reasoning (Supp. 2). Clinical decision-making performance was compared across three groups—experienced dialysis nurses, a general-purpose large language model, and an agent-based AI system—using these standardized nephrology scenarios. The overall study workflow and comparison framework are illustrated in Fig. 1.
Participants
A total of 110 licensed dialysis nurses participated in the study. Nurses were eligible if they had at least one year of clinical experience in dialysis care. Participants were recruited from dialysis units and completed the study assessment independently and anonymously via an electronic questionnaire. Participation was voluntary, and no financial or other incentives were provided. Demographic and professional characteristics of the participants are presented in Table 1.
Table 1
Characteristics of the nurse participants (N = 110)
| Variable | Value | N/Mean | %/SD |
|---|---|---|---|
| Age (years) | | 46.4 | 10.1 |
| Years of Experience | | 19.5 | 11.0 |
| Years in Dialysis | | 14.6 | 9.5 |
| Gender | Female | 67 | 67.7 |
| Gender | Male | 32 | 32.3 |
| Gender | Missing | 11 | 10.0 |
| Education | BA | 51 | 53.7 |
| Education | MA | 41 | 43.2 |
| Education | Diploma | 3 | 3.2 |
| Education | Missing | 15 | 13.6 |
| Role | Staff Nurse | 26 | 35.1 |
| Role | Supervisor | 21 | 28.4 |
| Role | Head Nurse | 21 | 28.4 |
| Role | Other | 6 | 8.1 |
| Role | Missing | 36 | 32.7 |
Clinical scenarios
Four standardized clinical scenarios representing common decision-making challenges in hemodialysis care were used in this study. The scenarios reflected real-world situations routinely encountered by dialysis nurses, including patient instability, assessment of clinical urgency, identification of potential complications, diagnostic reasoning, and treatment planning. Each scenario consisted of a brief clinical vignette followed by six open-ended questions designed to elicit key components of clinical reasoning: urgency assessment, identification of red flags, hypothesis generation (differential diagnosis), selection of diagnostic tests, treatment decisions, and clinical justification. The scenarios were developed by senior dialysis nurses with more than 15 years of clinical experience and were reviewed and validated by nephrology physicians to ensure clinical relevance and authenticity. Full descriptions of the clinical scenarios and associated questions are provided in Supp 2.
Scoring protocol
A structured scoring rubric was used to evaluate all responses on a 0–100 scale. The rubric was designed to reflect clinically relevant dimensions of nursing decision-making, including diagnostic accuracy, clinical appropriateness, clarity of clinical reasoning, and attention to patient safety and resource stewardship. To ensure methodological independence, the principal investigator managed anonymization, randomization, and data handling procedures but was not involved in scoring or adjudication.
AI model configurations
ChatGPT-4 (Single-agent model)
ChatGPT-4 (OpenAI, March 2024 version) was evaluated as a general-purpose, single-agent large language model. The model was prompted using a structured, stepwise format designed to elicit clinical reasoning across predefined domains, including hypothesis generation, diagnostic testing considerations, and treatment planning. All prompts were standardized across scenarios to ensure consistency of model input and output.
MAI-DxO (Multi-agent simulation)
MAI-DxO is an agent-based large language model framework adapted from prior work and evaluated in this study as a comparative AI system [15]. The framework simulates collaborative clinical reasoning by generating parallel, role-specific reasoning components that address complementary aspects of decision-making, such as differential diagnosis, test prioritization, and patient-centered considerations. These components are integrated into a single structured response through an internal aggregation process. Detailed descriptions of the model structure and prompting sequence are provided in the Appendix.
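The role-parallel structure described above can be sketched in a few lines. This is a minimal illustrative skeleton, not the actual MAI-DxO implementation (which is detailed in the Appendix); the role names and the concatenation-style aggregation are assumptions standing in for the framework's real prompting sequence and internal aggregation process.

```python
from dataclasses import dataclass

# Hypothetical role names mirroring the role-specific reasoning
# components described in the text; the real sequence is in the Appendix.
ROLES = ["hypothesis", "test_chooser", "challenger", "stewardship", "checklist"]

@dataclass
class RoleOutput:
    role: str
    content: str

def run_agents(vignette: str, generate) -> list:
    """Run each role-specific reasoning component over the same vignette.

    `generate` is any callable (role, vignette) -> str, e.g. a wrapper
    around an LLM call with a role-specific system prompt.
    """
    return [RoleOutput(role, generate(role, vignette)) for role in ROLES]

def aggregate(outputs: list) -> str:
    """Merge the parallel role outputs into one structured response.

    Simple labeled concatenation stands in here for the framework's
    internal aggregation step.
    """
    return "\n".join(f"[{o.role}] {o.content}" for o in outputs)
```

In practice each role would carry its own system prompt (differential diagnosis, test prioritization, patient-centered considerations), and the aggregation step would resolve conflicts between roles rather than merely concatenating them.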
Scoring process and inter-rater agreement
All responses, including those generated by nurses and AI systems, were independently evaluated by two senior dialysis nursing specialists using a structured scoring rubric. In addition, overall scenario-level decisions were reviewed by a senior nephrologist to ensure clinical appropriateness. Each response to the six questions within each scenario was scored on a 0–100 scale.
Scoring criteria included:
- Diagnostic accuracy
- Clinical appropriateness
- Clarity and completeness of clinical reasoning
- Attention to patient safety and resource stewardship
Raters evaluated all responses independently and were blinded to participant identity and group assignment. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC; two-way random effects model, absolute agreement). The mean ICC across all scenario–question pairs exceeded 0.88, indicating high inter-rater consistency. When discrepancies between raters exceeded 15 points, a third nephrologist adjudicated the response. Final scores reflected the mean of the two ratings or the adjudicated value, when applicable.
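The final-score rule described above (mean of two ratings unless the raters disagree by more than 15 points, in which case a third rater adjudicates) can be expressed as a small sketch. The function name and signature are illustrative assumptions, not code from the study.

```python
def final_score(r1: float, r2: float, adjudicate, threshold: float = 15.0) -> float:
    """Combine two rater scores on a 0-100 scale.

    If the absolute discrepancy exceeds `threshold` points, defer to an
    adjudicator (here any callable (r1, r2) -> float, standing in for
    the third nephrologist); otherwise return the mean of the two ratings.
    """
    if abs(r1 - r2) > threshold:
        return adjudicate(r1, r2)
    return (r1 + r2) / 2.0
```

Note that a discrepancy of exactly 15 points does not trigger adjudication under a strict "exceeded 15 points" reading.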
Outcome measures
The primary outcome was the mean total clinical decision-making score per scenario. Secondary outcomes included question-level scores (Q1–Q6) and domain-specific comparisons across the three groups: dialysis nurses, the single-agent AI model, and the multi-agent AI framework.
Statistical analysis
Descriptive statistics were used to summarize participant characteristics, including age, gender, years of clinical experience, and professional role. Continuous variables are reported as means and standard deviations, and categorical variables as counts and percentages. Missing demographic data were handled using pairwise deletion. Missing values in clinical response scores were addressed using multiple imputation by chained equations with five iterations.
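Multiple imputation by chained equations on the score matrix can be sketched with scikit-learn's `IterativeImputer`, which implements a MICE-style procedure. This is an assumed reconstruction, not the study's actual code; the exact libraries and settings used by the authors (beyond "five iterations") are not specified.

```python
import numpy as np
# enable_iterative_imputer must be imported to expose IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_scores(scores: np.ndarray, n_iter: int = 5, seed: int = 0) -> np.ndarray:
    """Impute missing clinical response scores (NaN entries) with a
    chained-equations procedure.

    `scores` is a participants x items matrix (here 24 items per nurse:
    4 scenarios x 6 questions), with missing responses coded as NaN.
    """
    imputer = IterativeImputer(max_iter=n_iter, random_state=seed)
    return imputer.fit_transform(scores)
```

A fixed `random_state` keeps the imputation reproducible across reruns of the analysis.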
To compare clinical decision-making performance, mean total scores and component scores were calculated for each clinical scenario (Cases 1–4) across the three study groups: dialysis nurses, the single-agent AI model, and the multi-agent AI framework. Group-level comparisons were summarized descriptively and visualized using score distribution plots.
To explore patterns of clinical reasoning among nurses, multivariate analyses were conducted on individual response profiles (24 scores per participant: 4 scenarios × 6 questions). Dimensionality reduction techniques were used to visualize variability in reasoning styles, and hierarchical clustering was applied to identify subgroups of nurses with similar decision-making profiles. Cluster validity was evaluated using standard internal validation metrics.
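The profile-clustering step can be sketched as follows: Ward-linkage hierarchical clustering over the standardized 24-score profiles, with silhouette score as one standard internal validation metric. The choice of Ward linkage and the silhouette metric are assumptions; the Methods section does not specify which linkage or validity indices were used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def cluster_profiles(X: np.ndarray, k: int = 3):
    """Hierarchically cluster nurse response profiles.

    `X` is a participants x 24 matrix of standardized scores
    (4 scenarios x 6 questions per participant). Returns the cluster
    labels for a k-cluster cut and the silhouette score as an internal
    validity check.
    """
    Z = linkage(X, method="ward")          # agglomerative, Ward linkage
    labels = fcluster(Z, t=k, criterion="maxclust")
    return labels, silhouette_score(X, labels)
```

The same label assignments could then feed the t-SNE visualization of reasoning profiles (Fig. 4).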
Agreement between AI-generated responses and expert reference scores was examined using complementary quantitative approaches, including similarity measures and agreement analyses, to assess alignment beyond mean score comparisons. Where appropriate, Bland–Altman plots were used to explore agreement between AI systems and human nurse scores.
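The Bland–Altman statistics underlying such agreement plots reduce to the mean paired difference (bias) and its 95% limits of agreement. The sketch below shows the standard computation; it is illustrative and not taken from the study's analysis code.

```python
import numpy as np

def bland_altman(a: np.ndarray, b: np.ndarray):
    """Bland-Altman statistics for two paired score series.

    Returns the bias (mean of a - b) and the 95% limits of agreement
    (bias +/- 1.96 * SD of the paired differences).
    """
    diff = a - b
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

In the plot, each pair is placed at ((a+b)/2, a-b), with horizontal lines at the bias and the two limits; narrower limits indicate closer agreement between an AI system and the expert reference.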
All analyses were performed using Python (version 3.11). Standard statistical and data science libraries were employed for data preprocessing, clustering, and visualization. Analytical procedures were reviewed internally prior to application to the study dataset.
Ethics approval
The study was approved by the institutional ethics committee of Tel Aviv University (Approval No. 0006223-2). All participants provided written informed consent after receiving detailed information about the research and assurance of confidentiality. This study did not involve a clinical trial. Clinical trial number: not applicable.
Results
Participant characteristics
A total of 110 dialysis nurses participated in the study. The mean age of participants was 46.4 years (standard deviation [SD], 10.1), with an average of 19.5 years (SD, 11.0) of general clinical experience and 14.6 years (SD, 9.5) specifically in dialysis care. Most participants identified as female (67.7%), with males comprising 32.3%. Educational backgrounds included bachelor’s degrees for 53.7% of participants, master’s degrees for 43.2%, and diploma-level training for 3.2%. Clinical roles were diverse: 35.1% were staff nurses, 28.4% held supervisory positions, 28.4% were head nurses, and 8.1% occupied other roles (e.g., educators or coordinators). Demographic characteristics are summarized in Table 1.
Scenario scoring performance
Each nurse responded to four validated nephrology scenarios, each comprising six structured, open-ended clinical questions (24 responses per participant). Responses were independently evaluated by two senior nephrology nursing specialists using a standardized 0–100 scoring rubric assessing diagnostic accuracy, clinical appropriateness, clarity of clinical reasoning, and attention to patient safety and resource stewardship.
Inter-rater agreement was high, with a mean ICC exceeding 0.85 across all scenario–question pairs. In cases where score differences exceeded 15 points (9.4% of responses), a third nephrologist adjudicated the response. Final scores reflected either the mean of the two raters or the adjudicated value, when applicable. Individual question scores were averaged to generate total scenario-level scores, allowing for comparisons across cases and groups.
Comparative performance across groups
Across all four scenarios, the MAI-DxO model achieved higher mean total scores than both ChatGPT-4 and the nurse group, with scenario-level scores ranging from 84.5 to 91.2 (on a 0–100 scale). ChatGPT-4 scores ranged from 75.4 to 80.8, while nurse scores demonstrated greater variability, ranging from 63.1 to 76.4. These patterns are illustrated in Fig. 2.
Performance differences were most pronounced in questions requiring integrative clinical judgment and explicit justification. For example, in hypothesis generation (Q3), MAI-DxO consistently produced structured differential diagnoses that integrated multiple clinical cues, whereas nurse responses often emphasized situational factors such as recent dialysis trends or patient-reported symptoms. Similarly, in clinical justification (Q6), MAI-DxO responses followed a systematic reasoning structure aligned with guideline-based priorities, while nurse justifications varied in depth and narrative style. ChatGPT-4 demonstrated relatively strong performance in structured tasks such as urgency assessment (Q1) and test selection (Q4); however, its explanations were generally less contextualized and less consistent in justification. Nurse responses exhibited substantial heterogeneity, with high-performing individuals approaching or exceeding ChatGPT-4 scores in certain scenarios, despite lower group-level averages (Fig. 3).
Representative examples of clinical reasoning across nurses and AI systems are provided in Supplementary Table S1.
Cognitive archetypes and semantic alignment with experts
Hierarchical clustering analysis identified three recurring cognitive archetypes among nurses: protocol-driven responders, holistic explainers, and minimalist responders. Protocol-driven nurses focused on immediate actions and red flags, often with limited elaboration. Holistic explainers provided extended reasoning that integrated patient history, dialysis parameters, and anticipated complications, whereas minimalist responders offered concise but technically correct answers with minimal justification. These archetypes are illustrated in Fig. 4.
Alignment between AI-generated responses and expert reference scores was examined using similarity and agreement analyses, as described in the Methods section. MAI-DxO demonstrated the highest alignment with expert evaluations conducted by senior nephrology nursing specialists, followed by ChatGPT-4 and the nurse group average. This finding suggests that MAI-DxO responses more closely mirrored the structure and prioritization patterns used by expert evaluators. Bland–Altman analyses (not shown) further indicated narrower agreement bands between MAI-DxO and expert scores compared with other groups.
Discussion
This study examined clinical reasoning across experienced dialysis nurses and two AI–based systems using standardized nephrology scenarios. The findings demonstrate meaningful differences not only in overall performance but also in the structure, depth, and contextual grounding of clinical reasoning. While the agent-based AI system consistently produced more structured and guideline-aligned responses, nurses demonstrated strengths in contextual interpretation, experiential judgment, and sensitivity to real-world dialysis practice. Together, these results highlight complementary reasoning approaches rather than a unidirectional advantage of AI [4, 10, 16].
Clinical reasoning in dialysis nursing is shaped by repeated exposure to chronic instability, familiarity with patient trajectories, and the need to balance protocol-based actions with individualized care. The variability observed among nurses in this study reflects well-described differences in professional judgment, reasoning style, and comfort with ambiguity. The identification of distinct reasoning archetypes—protocol-driven, holistic explanatory, and minimalist—underscores that effective nursing decision-making does not follow a single optimal pathway but rather adapts to clinical context and professional experience.
Our results highlight both the potential and the limitations of artificial intelligence–based reasoning when applied to clinical decision-making in dialysis care. Across the evaluated scenarios, the agent-based AI system consistently generated more structured and guideline-aligned responses than the general-purpose AI model [17, 18]. This consistency reflects the ability of AI systems to organize clinical information, articulate differential diagnoses, and present explicit justifications in a transparent manner [19, 20].
At the same time, this structured reasoning contrasted with the more variable, context-sensitive approaches observed among human nurses. In scenarios involving common or high-risk dialysis-related conditions, such as electrolyte disturbances or infection, experienced nurses often demonstrated strengths that extended beyond formal structure, including rapid pattern recognition, prioritization based on patient history, and practical judgment shaped by prior clinical encounters. These findings underscore the importance of tacit knowledge and experiential learning in nursing decision-making.
The observed variability among nurses was not random but followed discernible patterns. The identification of protocol-driven, holistic-explanatory, and minimalist reasoning styles aligns with previous literature describing diversity in clinical judgment among nursing professionals. Such variation reflects differences in professional role, experience, and comfort with clinical uncertainty, rather than deficiencies in competence [21]. Importantly, each reasoning style may offer distinct advantages depending on the clinical context.
Consistent with real-world dialysis practice, nurses in this study frequently extended their reasoning beyond narrowly defined task boundaries, for example by anticipating diagnostic needs or suggesting early management steps prior to physician involvement. This behavior reflects the realities of chronic care environments, where nurses often serve as the first point of clinical assessment and operate with substantial professional autonomy. Decision-making in such settings is shaped not only by clinical guidelines but also by organizational structures, workflow constraints, and familiarity with individual patients [22‐24].
In contrast, AI-generated responses, while often theoretically sound, occasionally lacked sensitivity to practical considerations such as resource stewardship, redundancy of investigations, or local clinical norms. These limitations highlight an important distinction between formal clinical reasoning and applied clinical judgment, the latter remaining a core strength of experienced nursing practice [25].
Taken together, these findings support a hybrid model for the integration of artificial intelligence into nephrology and chronic care. AI systems may serve as valuable tools for supporting structured reasoning, standardization, and educational scaffolding, particularly in training or decision-support contexts. However, they cannot replace the contextual awareness, adaptability, and ethical judgment provided by human clinicians. Effective implementation of AI in nursing practice will therefore depend on aligning technological capabilities with the lived realities of frontline care.
Limitations
This study has several limitations. First, although the scenarios were based on real-world nephrology cases and validated by clinical experts, they do not fully capture the dynamic and interpersonal aspects of real-time patient care. Second, participating nurses were recruited from a single national health system, which may limit the generalizability of findings to international settings with different clinical protocols, training standards, or scopes of nursing practice. Third, both AI model outputs and nurse responses were generated in a non-time-pressured environment, thereby potentially overestimating real-world performance for all groups. Additionally, the rubric-based scoring system, while standardized, may not fully reflect nuanced clinical decision-making. Finally, although MAI-DxO demonstrated strong alignment with expert judgments, its responses were not prospectively tested for safety or patient outcomes in live clinical workflows.
Conclusion
In this scenario-based evaluation of clinical decision-making in nephrology, the agent-based large language model (MAI-DxO) consistently outperformed both a general-purpose LLM (ChatGPT-4) and experienced dialysis nurses in tasks demanding structured reasoning, differential diagnosis, and guideline-informed justification. MAI-DxO’s modular, agent-driven architecture enabled transparent and reproducible responses across complex cases, demonstrating strong alignment with expert benchmarks. However, the strengths of the human clinicians, particularly the nurses, highlight the indispensable role of lived experience in patient care. Nurses outperformed AI in areas where contextual sensitivity, rapid pattern recognition, and practical decision-making were essential, such as recognizing common dialysis complications and considering real-world constraints like scope of practice and resource availability. These abilities, grounded in clinical intuition and tacit knowledge, remain beyond the reach of current AI systems. Together, these findings underscore the promise of agent-based LLMs not as replacements but as collaborative tools that can enhance decision-making in chronic care. Successful integration will depend on aligning AI capabilities with frontline realities, amplifying the strengths of both machines and clinicians to improve patient outcomes.
Fig. 1
Study design and comparison framework for nephrology scenario assessment. This flowchart illustrates the study design for evaluating clinical decision-making across three groups: human nurses, ChatGPT-4, and a MAI-DxO–based multi-agent simulation. Four nephrology clinical scenarios were co-developed and validated by expert nephrologists. Nurses from dialysis units independently solved the scenarios. ChatGPT-4 was prompted using a structured, role-based sequence emulating diagnostic steps (Dr. Hypothesis → Dr. Test-Chooser → Dr. Challenger → Dr. Stewardship → Dr. Checklist). The MAI-DxO simulation applied the same roles within an interactive multi-agent team format. All responses underwent comparative analysis based on diagnostic accuracy, reasoning style, and resource cost
Fig. 2
Average Scores per decision-making component across four clinical scenarios. This figure illustrates the mean scores assigned to each clinical reasoning component (Q1–Q6) within the four validated nephrology scenarios (C1–C4), based on expert evaluations of responses by human nurses. Each component represents a key decision-making step, including urgency assessment, identification of red flags, diagnostic hypothesis formation, test selection, treatment decision, and clinical justification. The visualization highlights variability in nursing performance across components and scenarios, with the highest scores observed in hypothesis generation and the lowest in cost-aware decision-making
Fig. 3
Performance comparison across clinical scenarios by decision-making group. This figure displays the average decision-making scores for each of the four validated nephrology scenarios (C1–C4) as solved by three distinct groups: human nurses, ChatGPT-4 using a structured single-agent process, and a MAI-DxO–inspired multi-agent simulation. Scores reflect composite expert evaluations of diagnostic accuracy, clinical reasoning, and appropriateness of action. MAI-DxO consistently outperformed other groups across all scenarios, followed by an intermediate performance of ChatGPT-4, and then human nurses showing greater variability across cases
Fig. 4
t-SNE visualization of nurse decision-making clusters. This scatter plot illustrates the distribution of 101 nurses across three decision-making clusters based on their graded responses to four nephrology clinical scenarios. Each point represents an individual nurse, positioned according to a t-distributed stochastic neighbor embedding (t-SNE) projection of their 24 decision scores (6 per case). Cluster assignment was derived using KMeans (k = 3) following imputation and standardization. The visualization shows clear separation between clusters, suggesting distinct reasoning profiles among participants
Acknowledgements
The authors thank Angam Kittany for her contributions and support throughout the study.
Declarations
Ethics approval and consent to participate
The study was approved by the Institutional Ethics Committee of Tel Aviv University, Gray Faculty of Medical & Health Sciences (Approval No. 0006223-2). All participants provided written informed consent prior to participation. All study procedures were conducted in accordance with the principles of the Declaration of Helsinki.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.