Healthcare AI Research

Evaluating AI Safety, Compliance, and Quality for Clinical Use

Bastion Intelligence publishes research on AI tools used in healthcare settings. Our work covers LLM safety evaluations, competitive platform comparisons, and the documentation burden driving clinician burnout. Every analysis is grounded in real clinical scenarios, compliance frameworks, and published data.

Healthcare AI Safety Scorecard
Claude Opus 4.7: 9.1
Med-Gemini: 8.4
GPT 5.4 Pro: 8.4
Claude Sonnet 4.6: 8.0
Gemini 3.1 Pro: 7.8
GPT-4o: 5.9

Composite scores span four dimensions: safety, clinical reasoning, utility, and large-context handling.
LLM Safety Testing

AI Safety Evaluation Framework

How do leading AI models perform when clinicians rely on them for real patient care decisions? We built a testing framework to find out.

Guideline Adherence & Evidence-Based Anchoring

Does the model default to ACC/AHA, DSM-5-TR, and NICE protocols, or generate plans from outdated or non-clinical sources?

Differential Diagnosis Over-Correction

Can the model maintain a broad differential when symptoms are vague, or does it tunnel toward a premature diagnosis and miss life-threatening alternatives?

Pharmacological Contraindication Depth

How precisely does the model identify drug-drug and drug-disease interactions, including subtle renal dosage adjustments and P-glycoprotein interactions?

Boundary Recognition & Scope of Practice

Does the model clearly defer to clinical judgment and avoid overstepping into definitive diagnostic or treatment declarations?

Patient Safety Flag Detection

Can the model catch embedded safety hazards, such as an active medication allergy listed on a medication reconciliation, or undocumented self-medication?

Handling Ambiguous or Incomplete Data

When clinical details are missing or contradictory, does the model ask clarifying questions or proceed with unsafe assumptions?

Testing Methodology

Each test scenario uses synthetic patient data designed to mirror real clinical encounters. Prompts include full encounter notes, medication lists, lab values, and embedded safety hazards (e.g., an allergy conflict in the medication record, or an NSAID prescribed to a heart failure patient). Models receive identical prompts under standardized conditions.

Scoring follows structured criteria: each test defines what a strong response should include (catching a medication error, citing current guidelines, recommending the correct imaging study) and what a weak response looks like (missing a contraindication, anchoring to a triage label, omitting a guideline citation). Evaluations are reviewed for clinical accuracy before scores are finalized.
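Rubrics of this shape can be expressed as simple checklists of markers a strong response must include and failure modes it must avoid. The sketch below is hypothetical: the criteria, keywords, and sample response are invented for illustration, and it stands in for the clinically reviewed scoring used in the actual evaluations.

```python
# Hypothetical rubric-based scoring sketch (not the actual evaluation harness).
# Each test defines "strong" markers a response should include and "weak"
# failure modes it should avoid; the score is the fraction of checks passed.

def score_response(response, must_include, must_avoid):
    text = response.lower()
    # A criterion counts as met if any of its keyword variants appears.
    hits = sum(any(k in text for k in keys) for keys in must_include)
    # A failure mode counts against the score if any variant appears.
    misses = sum(any(k in text for k in keys) for keys in must_avoid)
    checks = len(must_include) + len(must_avoid)
    return (hits + (len(must_avoid) - misses)) / checks

# Example: a heart-failure scenario with an embedded NSAID hazard.
rubric_include = [
    ["ibuprofen", "nsaid"],      # must flag the NSAID contraindication
    ["acc/aha", "guideline"],    # must anchor to current guidelines
    ["clinician", "physician"],  # must defer to clinical judgment
]
rubric_avoid = [
    ["definitive diagnosis"],    # must not overstep scope of practice
]

response = ("Ibuprofen is contraindicated in heart failure per ACC/AHA "
            "guidance; please confirm the plan with the treating physician.")
print(score_response(response, rubric_include, rubric_avoid))  # 1.0
```

In practice each automated check is then reviewed by a clinician before the score is finalized, as described above.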

Models tested: Med-Gemini · Gemini 3.1 Pro · OpenAI GPT 5.4 Pro · Claude Opus 4.7 (High) · Claude Sonnet 4.6 · OpenAI GPT-4o

Peer-Reviewed Study

BastionGPT in the PICU: 99.3% Sentence-Level Accuracy

A feasibility study led by Baylor College of Medicine and Texas Children's Hospital evaluated BastionGPT as a parent-facing chatbot in a pediatric intensive care unit, using patient-specific EHR data inside a HIPAA-compliant environment.

Critical Care Explorations · April 2026

Feasibility of a Large Language Model Chatbot to Support Parental Understanding in the PICU

Hunter et al. Critical Care Explorations 8(4):e1378, April 2026. Baylor College of Medicine · Texas Children's Hospital.

Study type: Feasibility · Platform: BastionGPT · Compliance: HIPAA / BAA
Pediatric ICU · Parent communication · EHR-integrated prompts · Physician-scored
99.3% accuracy: 8 minor errors across the study; zero moderate or severe errors detected by PICU physician reviewers.
+57 Net Promoter Score: parent satisfaction after PICU sessions, with zero detractors reported.
0.98 Gwet's AC2: excellent inter-rater agreement among physician reviewers (95% CI, 0.97 to 0.99).
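Gwet's AC2 is a chance-corrected agreement coefficient that stays stable when raters agree on most items, a situation where Cohen's kappa can paradoxically collapse. As a hedged illustration, the sketch below computes the simpler binary special case, Gwet's AC1, for two raters; AC2 generalizes it with ordinal weights across rating categories. The ratings here are made up for demonstration and are unrelated to the study data.

```python
# Illustrative only: Gwet's AC1 for two raters with binary (0/1) ratings.
# The study reports AC2, which extends this with ordinal weights.

def gwet_ac1(r1, r2):
    """Chance-corrected agreement for two raters, binary ratings."""
    n = len(r1)
    # Observed agreement: fraction of items the raters scored identically.
    pa = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement: based on the mean marginal probability of a "1".
    pi = (sum(r1) / n + sum(r2) / n) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

rater1 = [1, 1, 1, 0, 1, 1, 0, 1]  # invented example ratings
rater2 = [1, 1, 1, 0, 1, 0, 0, 1]
print(round(gwet_ac1(rater1, rater2), 3))  # 0.781
```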
Accuracy & Safety

All 8 identified errors were classified as minor.

Minor: 8 (no patient or parent harm)
Moderate: 0 (none detected)
Severe: 0 (none detected)

Providers rated response quality a median of 5.0 / 6.0. Physicians expressed strong comfort with routine bedside use of the tool.

Parent Engagement

Question topics during sessions:

Therapeutic: 45%
Diagnosis / monitoring: 17%
Prognosis: 17%
Other: 21%

Parents asked a median of 6 questions per session and reported high satisfaction throughout the study.

Future Preference for Information Source

Parents want AI-assisted communication going forward:

Prefer AI chatbot: 50%
Both chatbot and current sources: 50%
Traditional sources only: 0%

"The chatbot is envisioned as a supplement to rather than a substitute for bedside communication."

— Hunter et al., Critical Care Explorations, April 2026
Citation. Hunter et al. "Feasibility of a Large Language Model Chatbot to Support Parental Understanding in the PICU." Critical Care Explorations 8(4):e1378, April 2026. Platform: BastionGPT (HIPAA-compliant). Setting: Texas Children's Hospital PICU. Enrollment: 14 of 16 eligible parents (87.5%).
Read full study →
Platform Comparisons

How Does BastionGPT Compare?

In-depth, side-by-side evaluations of healthcare AI platforms. Each comparison covers data handling, compliance posture, clinical workflows, model access, and pricing.

Documentation Burden Research

The Data Behind Clinician Burnout

Clinical documentation consumes a significant portion of the physician workweek. These visuals, sourced from published studies and national benchmarks, quantify the scale of the problem.

Per-Patient Benchmark

EHR minutes per patient by specialty

Every patient can mean double-digit minutes of screen work.

Psychiatry: 22.5 min
Pediatrics: 16 min
Surgical: 12 min
Internal Med: 10.6 min
Cardiology: 10.2 min
Family Med: 10 min
OB/GYN: 9.2 min
Emergency Med: 6.82 min

Per-patient time spans chart review, note writing, after-hours completion, and other EHR work. Sources: 2021 primary-care specialty study (FM, IM, Peds); 2020 pediatric telemetry; 2024 JAMA Network Open ED benchmark; 2024 national specialty telemetry. Values labeled "Derived" are normalized from per-8-PSH data.

EHR Time per Patient by Specialty

Psychiatry leads at 22.5 minutes of screen work per patient. Pediatrics follows at 16 minutes. Sources: 2021 primary-care study, 2024 JAMA Network Open ED benchmark.

Encounter Cycle

Patient care vs. EHR burden per encounter

Documentation competes with patient time, not just paperwork time
Surgical: 66.5% patient-facing care / 33.5% EHR and admin burden
Internal Med: 66% / 34%
Family Med: 63.6% / 36.4%
OB/GYN: 61.2% / 38.8%
Cardiology: 58.7% / 41.3%
Psychiatry: 57.1% / 42.9%
Pediatrics: 47.5% / 52.5%

Patient Care vs. EHR Burden per Encounter

In pediatrics, EHR and admin work accounts for 52.5% of encounter time, exceeding patient-facing care. Psychiatry is close behind at 42.9%.

Encounter Timeline

Where time goes in a single encounter cycle

Each bar represents the full cycle: from chart review through after-hours follow-up
[Chart: full encounter-cycle timelines for family medicine, pediatrics, and cardiology, segmented into chart review, patient visit, note writing, and after-hours work.]

Where Time Goes in a Single Encounter

Chart review, note writing, and after-hours follow-up collectively squeeze the window for direct patient interaction across family medicine, pediatrics, and cardiology.

Task Breakdown

What clinicians do after hours in the EHR

Based on pediatric primary-care EHR log analysis (56 physicians, 1,069 workdays)
Chart review (reviewing patient data, labs, results): 13–19 min/day (78%)
Inbox and communication tasks: ~5 min/day (11%)
Documentation, note signing, orders: ~1.5–3.6 min/day (8%)
Other EHR actions: ~1 min/day (3%)

What Clinicians Do After Hours in the EHR

78% of after-hours EHR activity is reviewing patient data, labs, and results. Based on pediatric primary-care log analysis across 56 physicians and 1,069 workdays.

57.8 hrs: average physician workweek (AMA 2024)
13 hrs: indirect patient care per week (documentation, orders, results)
22.5%: physicians spending 8+ hrs/wk on after-hours EHR work
22.5 min: EHR time per patient (psychiatry, the highest specialty)
Documentation Burden

Average note length by mental health specialty

Estimated words per clinical encounter note or report. Represents progress notes unless otherwise indicated.
Neuropsychological testing: 1,750
Forensic psychiatry: 1,380
Inpatient psychiatry: 1,150
Neuropsychiatry: 960
Child & adolescent psychiatry: 910
Geriatric psychiatry: 865
Addiction psychiatry: 820
General psychiatry: 750
Clinical psychology: 580
Individual psychotherapy: 410
Marriage & family therapy: 345
Counseling / LCSW: 310
Group therapy: 280

Values are estimated words per note, grouped by credential: MD / DO psychiatry, PhD / PsyD psychology, and LCSW / LPC / MFT.

Methodology & Sources

Values are representative estimates derived from published ranges and clinical practice standards, not a single comparative dataset. No single peer-reviewed study provides word counts across all mental health sub-specialties simultaneously.

1. Rule A et al. "Length and Redundancy of Outpatient Progress Notes." JAMA Network Open 4(7), 2021. (PMC8290305) — EHR note length trends across specialties including psychiatry, 2009–2018.

2. Epic Research (2023), via EHR Intelligence — Analysis of 1.7 billion outpatient notes across 166,318 providers, 2020–2023; average note length rose 8.1%.

3. Inter Organizational Practice Committee (IOPC), neuropsychologist survey (n=660) — Adult neuropsych reports average ~6 pages; pediatric ~11 pages.

4. BehaveHealth (2025) & Headway (2026) — Therapy progress notes: recommended 150–400 words; SOAP notes: 2–4 paragraphs.

5. Osmind clinical documentation standards; APA documentation guidelines — Psychiatry note components (MSE, risk assessment, medication rationale).

Chart produced April 2026. Word counts reflect progress/encounter notes unless the specialty's primary output is an evaluation report (neuropsychological testing, forensic psychiatry).

Average Note Length by Mental Health Specialty

Neuropsychological testing reports average 1,750 words. Forensic psychiatry notes average 1,380. Even general psychiatry progress notes run 750 words per encounter. Values are representative estimates derived from published ranges and clinical practice standards.

See How BastionGPT Addresses Documentation Burden

HIPAA-compliant AI that drafts clinical notes, transcribes sessions, and analyzes documents, without exposing patient data to third-party AI providers. BAA included on every plan.

Try BastionGPT · Talk to a researcher