Healthcare AI Research

Evaluating AI Safety, Compliance, and Quality for Clinical Use

Bastion Intelligence publishes research on AI tools used in healthcare settings. Our work covers LLM safety evaluations, competitive platform comparisons, and the documentation burden driving clinician burnout. Every analysis is grounded in real clinical scenarios, compliance frameworks, and published data.

Healthcare AI Safety Scorecard
Claude Opus 4.7: 9.1
Med-Gemini: 8.4
GPT 5.4 Pro: 8.4
Claude Sonnet 4.6: 8.0
Gemini 3.1 Pro: 7.8
GPT-4o: 5.9

Composite scores span four dimensions: safety, clinical reasoning, utility, and large-context handling.
LLM Safety Testing

AI Safety Evaluation Framework

How do leading AI models perform when clinicians rely on them for real patient care decisions? We built a testing framework to find out.

Guideline Adherence & Evidence-Based Anchoring

Does the model default to ACC/AHA, DSM-5-TR, and NICE protocols, or generate plans from outdated or non-clinical sources?

Differential Diagnosis Over-Correction

Can the model maintain a broad differential when symptoms are vague, or does it tunnel toward a premature diagnosis and miss life-threatening alternatives?

Pharmacological Contraindication Depth

How precisely does the model identify drug-drug and drug-disease interactions, including subtle renal dosage adjustments and P-glycoprotein interactions?

Boundary Recognition & Scope of Practice

Does the model clearly defer to clinical judgment and avoid overstepping into definitive diagnostic or treatment declarations?

Patient Safety Flag Detection

Can the model catch embedded safety hazards, such as an active medication allergy listed on a medication reconciliation, or undocumented self-medication?

Handling Ambiguous or Incomplete Data

When clinical details are missing or contradictory, does the model ask clarifying questions or proceed with unsafe assumptions?

Testing Methodology

Each test scenario uses synthetic patient data designed to mirror real clinical encounters. Prompts include full encounter notes, medication lists, lab values, and embedded safety hazards (e.g., an allergy conflict in the medication record, or an NSAID prescribed to a heart failure patient). Models receive identical prompts under standardized conditions.

Scoring follows structured criteria: each test defines what a strong response should include (catching a medication error, citing current guidelines, recommending the correct imaging study) and what a weak response looks like (missing a contraindication, anchoring to a triage label, omitting a guideline citation). Evaluations are reviewed for clinical accuracy before scores are finalized.
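Rubrics of this shape can be expressed as simple checklists of markers a strong response must include and failure modes it must avoid. The sketch below is hypothetical: the criteria, keywords, and sample response are invented for illustration, and it stands in for the clinically reviewed scoring used in the actual evaluations.

```python
# Hypothetical rubric-based scoring sketch (not the actual evaluation harness).
# Each test defines "strong" markers a response should include and "weak"
# failure modes it should avoid; the score is the fraction of checks passed.

def score_response(response, must_include, must_avoid):
    text = response.lower()
    # A criterion counts as met if any of its keyword variants appears.
    hits = sum(any(k in text for k in keys) for keys in must_include)
    # A failure mode counts against the score if any variant appears.
    misses = sum(any(k in text for k in keys) for keys in must_avoid)
    checks = len(must_include) + len(must_avoid)
    return (hits + (len(must_avoid) - misses)) / checks

# Example: a heart-failure scenario with an embedded NSAID hazard.
rubric_include = [
    ["ibuprofen", "nsaid"],      # must flag the NSAID contraindication
    ["acc/aha", "guideline"],    # must anchor to current guidelines
    ["clinician", "physician"],  # must defer to clinical judgment
]
rubric_avoid = [
    ["definitive diagnosis"],    # must not overstep scope of practice
]

response = ("Ibuprofen is contraindicated in heart failure per ACC/AHA "
            "guidance; please confirm the plan with the treating physician.")
print(score_response(response, rubric_include, rubric_avoid))  # 1.0
```

In practice each automated check is then reviewed by a clinician before the score is finalized, as described above.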

Models tested: Med-Gemini · Gemini 3.1 Pro · OpenAI GPT 5.4 Pro · Claude Opus 4.7 (High) · Claude Sonnet 4.6 · OpenAI GPT-4o

Peer-Reviewed Study

BastionGPT in the PICU: 99.3% Sentence-Level Accuracy

A feasibility study led by Baylor College of Medicine and Texas Children's Hospital evaluated BastionGPT as a parent-facing chatbot in a pediatric intensive care unit, using patient-specific EHR data inside a HIPAA-compliant environment.

Critical Care Explorations · April 2026

Feasibility of a Large Language Model Chatbot to Support Parental Understanding in the PICU

Hunter et al. Critical Care Explorations 8(4):e1378, April 2026. Baylor College of Medicine · Texas Children's Hospital.

Study type: Feasibility · Platform: BastionGPT · Compliance: HIPAA / BAA
Pediatric ICU · Parent communication · EHR-integrated prompts · Physician-scored
99.3% accuracy: 8 minor errors across the study; zero moderate or severe errors detected by PICU physician reviewers.
+57 Net Promoter Score: parent satisfaction after PICU sessions, with zero detractors reported.
0.98 Gwet's AC2: excellent inter-rater agreement among physician reviewers (95% CI, 0.97 to 0.99).
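Gwet's AC2 is a chance-corrected agreement coefficient that stays stable when raters agree on most items, a situation where Cohen's kappa can paradoxically collapse. As a hedged illustration, the sketch below computes the simpler binary special case, Gwet's AC1, for two raters; AC2 generalizes it with ordinal weights across rating categories. The ratings here are made up for demonstration and are unrelated to the study data.

```python
# Illustrative only: Gwet's AC1 for two raters with binary (0/1) ratings.
# The study reports AC2, which extends this with ordinal weights.

def gwet_ac1(r1, r2):
    """Chance-corrected agreement for two raters, binary ratings."""
    n = len(r1)
    # Observed agreement: fraction of items the raters scored identically.
    pa = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement: based on the mean marginal probability of a "1".
    pi = (sum(r1) / n + sum(r2) / n) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

rater1 = [1, 1, 1, 0, 1, 1, 0, 1]  # invented example ratings
rater2 = [1, 1, 1, 0, 1, 0, 0, 1]
print(round(gwet_ac1(rater1, rater2), 3))  # 0.781
```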
Accuracy & Safety

All 8 identified errors were classified as minor.

Minor: 8 (no patient or parent harm)
Moderate: 0 (none detected)
Severe: 0 (none detected)

Providers rated response quality a median of 5.0 / 6.0. Physicians expressed strong comfort with routine bedside use of the tool.

Parent Engagement

Question topics during sessions:

Therapeutic: 45%
Diagnosis / monitoring: 17%
Prognosis: 17%
Other: 21%

Parents asked a median of 6 questions per session and reported high satisfaction throughout the study.

Future Preference for Information Source

Parents want AI-assisted communication going forward:

Prefer AI chatbot: 50%
Both chatbot and current sources: 50%
Traditional sources only: 0%

"The chatbot is envisioned as a supplement to rather than a substitute for bedside communication."

— Hunter et al., Critical Care Explorations, April 2026
Citation. Hunter et al. "Feasibility of a Large Language Model Chatbot to Support Parental Understanding in the PICU." Critical Care Explorations 8(4):e1378, April 2026. Platform: BastionGPT (HIPAA-compliant). Setting: Texas Children's Hospital PICU. Enrollment: 14 of 16 eligible parents (87.5%).
Read full study →
Platform Comparisons

How Does BastionGPT Compare?

In-depth, side-by-side evaluations of healthcare AI platforms. Each comparison covers data handling, compliance posture, clinical workflows, model access, and pricing.

Documentation Burden Research

The Data Behind Clinician Burnout

Clinical documentation consumes a significant portion of the physician workweek. These visuals, sourced from published studies and national benchmarks, quantify the scale of the problem.

Per-Patient Benchmark

EHR minutes per patient by specialty

Every patient can mean double-digit minutes of screen work.

Psychiatry: 22.5 min
Pediatrics: 16 min
Surgical: 12 min
Internal Med: 10.6 min
Cardiology: 10.2 min
Family Med: 10 min
OB/GYN: 9.2 min
Emergency Med: 6.82 min

Per-patient time spans chart review, note writing, after-hours completion, and other EHR work. Sources: 2021 primary-care specialty study (FM, IM, Peds); 2020 pediatric telemetry; 2024 JAMA Network Open ED benchmark; 2024 national specialty telemetry. Values labeled "Derived" are normalized from per-8-PSH data.

EHR Time per Patient by Specialty

Psychiatry leads at 22.5 minutes of screen work per patient. Pediatrics follows at 16 minutes. Sources: 2021 primary-care study, 2024 JAMA Network Open ED benchmark.

Encounter Cycle

Patient care vs. EHR burden per encounter

Documentation competes with patient time, not just paperwork time
Surgical: 66.5% patient-facing care / 33.5% EHR and admin burden
Internal Med: 66% / 34%
Family Med: 63.6% / 36.4%
OB/GYN: 61.2% / 38.8%
Cardiology: 58.7% / 41.3%
Psychiatry: 57.1% / 42.9%
Pediatrics: 47.5% / 52.5%

Patient Care vs. EHR Burden per Encounter

In pediatrics, EHR and admin work accounts for 52.5% of encounter time, exceeding patient-facing care. Psychiatry is close behind at 42.9%.

Encounter Timeline

Where time goes in a single encounter cycle

Each bar represents the full cycle: from chart review through after-hours follow-up
[Chart: full encounter-cycle timelines for family medicine, pediatrics, and cardiology, segmented into chart review, patient visit, note writing, and after-hours work.]

Where Time Goes in a Single Encounter

Chart review, note writing, and after-hours follow-up collectively squeeze the window for direct patient interaction across family medicine, pediatrics, and cardiology.

Task Breakdown

What clinicians do after hours in the EHR

Based on pediatric primary-care EHR log analysis (56 physicians, 1,069 workdays)
Chart review (reviewing patient data, labs, results): 13–19 min/day (78%)
Inbox and communication tasks: ~5 min/day (11%)
Documentation, note signing, orders: ~1.5–3.6 min/day (8%)
Other EHR actions: ~1 min/day (3%)

What Clinicians Do After Hours in the EHR

78% of after-hours EHR activity is reviewing patient data, labs, and results. Based on pediatric primary-care log analysis across 56 physicians and 1,069 workdays.

57.8 hrs: average physician workweek (AMA 2024)
13 hrs: indirect patient care per week (documentation, orders, results)
22.5%: physicians spending 8+ hrs/wk on after-hours EHR work
22.5 min: EHR time per patient (psychiatry, the highest specialty)
Documentation Burden

Average note length by mental health specialty

Estimated words per clinical encounter note or report. Represents progress notes unless otherwise indicated.
Neuropsychological testing: 1,750
Forensic psychiatry: 1,380
Inpatient psychiatry: 1,150
Neuropsychiatry: 960
Child & adolescent psychiatry: 910
Geriatric psychiatry: 865
Addiction psychiatry: 820
General psychiatry: 750
Clinical psychology: 580
Individual psychotherapy: 410
Marriage & family therapy: 345
Counseling / LCSW: 310
Group therapy: 280

Values are estimated words per note, grouped by credential: MD / DO psychiatry, PhD / PsyD psychology, and LCSW / LPC / MFT.

Methodology & Sources

Values are representative estimates derived from published ranges and clinical practice standards, not a single comparative dataset. No single peer-reviewed study provides word counts across all mental health sub-specialties simultaneously.

1. Rule A et al. "Length and Redundancy of Outpatient Progress Notes." JAMA Network Open 4(7), 2021. (PMC8290305) — EHR note length trends across specialties including psychiatry, 2009–2018.

2. Epic Research (2023), via EHR Intelligence — Analysis of 1.7 billion outpatient notes across 166,318 providers, 2020–2023; average note length rose 8.1%.

3. Inter Organizational Practice Committee (IOPC), neuropsychologist survey (n=660) — Adult neuropsych reports average ~6 pages; pediatric ~11 pages.

4. BehaveHealth (2025) & Headway (2026) — Therapy progress notes: recommended 150–400 words; SOAP notes: 2–4 paragraphs.

5. Osmind clinical documentation standards; APA documentation guidelines — Psychiatry note components (MSE, risk assessment, medication rationale).

Chart produced April 2026. Word counts reflect progress/encounter notes unless the specialty's primary output is an evaluation report (neuropsychological testing, forensic psychiatry).

Average Note Length by Mental Health Specialty

Neuropsychological testing reports average 1,750 words. Forensic psychiatry notes average 1,380. Even general psychiatry progress notes run 750 words per encounter. Values are representative estimates derived from published ranges and clinical practice standards.

See How BastionGPT Addresses Documentation Burden

HIPAA-compliant AI that drafts clinical notes, transcribes sessions, and analyzes documents, without exposing patient data to third-party AI providers. BAA included on every plan.

Try BastionGPT · Talk to a researcher