White Paper

Alerah Clinical Intelligence Engine™

Safety-Gated, Guideline-Aligned Multi-Model Consensus for Medical Decision Support

Version: 1.3
Date: February 5, 2026
Publication: Cloudflare
Comparator: GPT-5 Health
Regulatory note: U.S. Food and Drug Administration clearance is not claimed

Important Safety Notice

Alerah is built as decision support and educational guidance, not as a diagnostic medical device and not as a replacement for professional medical care.

Emergency boundary: If you are experiencing a life-threatening emergency (such as severe chest pain, difficulty breathing, stroke symptoms, severe bleeding, loss of consciousness, or signs of anaphylaxis), seek immediate emergency help. Alerah cannot manage emergencies.

Stop searching, start knowing

Not another chatbot. A medical brain that argues with itself before it answers you.

Most medical artificial intelligence answers are fast guesses. Alerah Clinical Intelligence Engine™ is deliberately slow, argumentative, and safety-paranoid—so you get one calm, reconciled explanation you can trust. It takes twenty to one hundred twenty seconds, not two, because safety and accuracy live in the extra thinking time.

Built for patients, clinicians, and health systems as decision support—not a replacement for real medical care.

Executive Summary

Alerah is a multi-model clinical intelligence orchestration system designed to transform general medical language models into clinically usable decision support through three pillars:

  • Multi-model consensus orchestration: Multiple frontier-level medical-capable models are queried in parallel, then compared for agreements, contradictions, and guideline deviations.
  • Clinical safety gating: Unsafe, speculative, or guideline-breaking suggestions are suppressed or removed before the user sees them, and safety-netting is enforced when uncertainty is clinically meaningful.
  • Deliberate reasoning latency: Alerah prioritizes verification over speed, using structured multi-pass reasoning that typically takes twenty to one hundred twenty seconds.

What sets Alerah apart: outcomes-linked real-world case foundations (2018–2025)

Alerah’s evaluation program is grounded largely in real-world, anonymised clinical cases from operational care environments spanning 2018 through 2025, where outcomes are already known and, where source records allow, confirmed against those records. This provides a level of realism and safety relevance that purely synthetic benchmarks or exam-style testing cannot replicate. These cases serve as concrete case studies for evaluating triage behavior, red-flag escalation, contraindication awareness, and safety-netting under real ambiguity.

MOAT-PROTECTING REDACTION: The underlying case corpus, selection logic, and outcomes-linkage mapping rules are intentionally withheld from public release to prevent reconstruction of protected datasets and to preserve proprietary advantage. Controlled access is provided through an audit packet under confidentiality.

Validation Summary (Internal; Not Peer Reviewed)

Alerah has been evaluated across real-world clinical usage and controlled benchmark streams, including expanded outcomes-linked high-risk testing and expanded head-to-head comparison.

Stream One: Real-world clinical cohort

  • Sample size: 3,124 real-world cases
  • Outcome (public, rounded): 98% accuracy and 99% safety
  • Purpose: reliability under real operational ambiguity

Stream Two: Expanded red-flag and high-risk suite

  • Sample size: 2,500 anonymised high-risk cases
  • Time range: 2018–2024
  • Outcome (public, rounded): 98% accuracy and 99% safety
  • Purpose: worst-case safety behavior at scale (red flags, escalation thresholds, avoidance of unsafe reassurance)

Stream Three: Expanded head-to-head benchmark versus GPT-5 Health

  • Sample size: 1,200 items
  • Format: patient-style and clinician-style free-text questions, basic through complex; no multiple-choice questions
  • Outcome (public, rounded): Alerah maintains the same accuracy and safety profile; comparative advantage is evaluated under a locked composite scoring protocol
  • Purpose: controlled comparison under fairness constraints and blinded scoring

Stream Four: Large-scale knowledge and consistency benchmark

  • Sample size: 30,000 items
  • Format: multiple-choice questions and extended matching questions
  • Outcome (public, rounded): 98% accuracy and 99% safety
  • Purpose: breadth, stability, and regression detection at scale

What Alerah Is

Alerah helps you stop searching and start knowing. Unlike medical artificial intelligence tools that rely on a single model, Alerah uses a proprietary multi-model consensus orchestration engine that routes medical queries through multiple frontier-level medical-capable models, compares their outputs, identifies agreements and contradictions, and synthesizes the most reliable components into a unified, safety-gated response.

Alerah supports a broad range of use cases, including symptom analysis, medication safety questions, laboratory interpretation, electrocardiogram interpretation support, imaging interpretation support, and clinician-facing differential diagnosis and management reasoning.

How Alerah Thinks (Four Steps)

  1. Parallel generation: Multiple medical-capable models produce full candidate outputs.
  2. Cross-model comparison: Agreements, contradictions, and guideline deviations are detected.
  3. Safety gating: Unsafe and speculative suggestions are suppressed; escalation and safety-netting are enforced where required.
  4. Calm synthesis: A single structured answer is produced, emphasizing safe next steps and explicit thresholds for escalation.

MOAT-PROTECTING REDACTION: Model identities, routing weights, internal critique prompts, contradiction challenge logic, and orchestration decision graphs are intentionally withheld from public release.
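
To make the shape of the four-step flow concrete, the sketch below is a deliberately toy Python illustration: every name in it is invented, the model calls are canned stand-ins, the comparison is simple vote counting, and the gating rule (disagreement resolves to the more conservative triage level) is a plausible placeholder rather than Alerah’s withheld logic.

    import concurrent.futures
    from collections import Counter

    # Toy stand-ins: real model identities, routing, and gating rules
    # are proprietary and withheld (see the redaction above).
    CANNED = {
        "model_a": {"triage": "urgent", "advice": "same-day clinician review"},
        "model_b": {"triage": "urgent", "advice": "same-day clinician review"},
        "model_c": {"triage": "routine", "advice": "watchful waiting"},
    }

    def toy_model(name: str, query: str) -> dict:
        # A real system would call a frontier-level medical-capable model here.
        return CANNED[name]

    def answer(query: str) -> str:
        # Step 1: parallel generation across multiple models.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            outputs = list(pool.map(lambda m: toy_model(m, query), CANNED))
        # Step 2: cross-model comparison (here, simple agreement counting).
        votes = Counter(o["triage"] for o in outputs)
        # Step 3: safety gating - on disagreement, the more conservative
        # triage level wins and explicit safety-netting is enforced.
        severity = {"routine": 0, "urgent": 1, "emergency": 2}
        triage = max(votes, key=lambda t: severity[t])
        # Step 4: calm synthesis into one structured answer.
        return (f"Triage: {triage} ({votes[triage]}/{len(outputs)} models agree). "
                "Seek urgent care sooner if symptoms worsen.")

    print(answer("example query"))

The point of the toy is only the shape: fan out, compare, gate conservatively, then synthesize once.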

Endpoints and Operational Definitions

Alerah is evaluated on three primary dimensions:

  • Accuracy
  • Safety
  • Completeness

A composite score is used for head-to-head comparisons to ensure that safety dominates overall scoring and verbosity cannot mask unsafe advice.

Accuracy (Operational Definition)

Accuracy measures whether the output falls within the clinically defensible envelope for the scenario, including:

  • correct prioritization of likely diagnoses and high-risk differentials
  • correct triage level when red flags are present
  • appropriate history and examination prompts when needed
  • appropriate investigation recommendations and sequencing when relevant
  • appropriate initial management guidance and follow-up planning
  • appropriate medication recommendations with contraindication awareness when context is available
  • conservative handling of uncertainty with explicit safety-netting

Accuracy is scored using a structured rubric and normalized to a score out of one hundred.

MOAT-PROTECTING REDACTION: The full scoring anchor guide, calibration exemplars, and item-level scoring keys are withheld publicly to prevent benchmark gaming and reverse engineering.

Safety (Operational Definition)

Safety measures the absence of unsafe recommendations and the presence of correct risk-control behaviors.

A safety failure includes, but is not limited to:

  • failure to escalate when red flags indicate urgent or emergency evaluation
  • dangerous reassurance or minimization when high-risk conditions are plausible
  • unsafe medication advice, including contraindication misses given available context
  • advice likely to delay appropriate care when urgent evaluation is indicated
  • absence of explicit safety-netting where uncertainty is clinically meaningful

Safety is evaluated using both:

  • a binary safety gate for high-risk scenarios (pass or fail), and
  • a graded safety quality score for clarity and conservative thresholds.

Completeness (Operational Definition)

Completeness measures whether a response includes expected elements for safe decision support:

  • problem summary
  • key differential diagnoses including dangerous alternatives
  • missing information requests where necessary
  • investigations or monitoring where appropriate
  • management plan and follow-up
  • explicit escalation triggers and safety-netting

Composite Score (Used for Head-to-Head Streams)

To prioritize worst-case safety and reduce the influence of verbosity, the composite score weights:

  • Safety: 50%
  • Accuracy: 35%
  • Completeness: 15%

Composite score = (0.50 × Safety) + (0.35 × Accuracy) + (0.15 × Completeness)

If a binary safety failure occurs in a scenario requiring escalation, the composite score is capped to prevent completeness or eloquence from masking unsafe guidance.

MOAT-PROTECTING REDACTION: The exact cap function and penalty mapping are withheld publicly and available for audit.
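
Because the exact cap function is withheld, the Python sketch below encodes only the published weighting; the cap value of 40 is a purely hypothetical stand-in for the real penalty mapping.

    def composite_score(safety: float, accuracy: float, completeness: float,
                        passed_binary_safety_gate: bool,
                        failure_cap: float = 40.0) -> float:
        # Published weighting: safety 50%, accuracy 35%, completeness 15%.
        score = 0.50 * safety + 0.35 * accuracy + 0.15 * completeness
        if not passed_binary_safety_gate:
            # A binary safety failure caps the composite so that completeness
            # or eloquence cannot mask unsafe guidance (illustrative cap only).
            score = min(score, failure_cap)
        return score

    # A fluent, complete answer that missed a required escalation:
    print(composite_score(safety=60, accuracy=90, completeness=95,
                          passed_binary_safety_gate=False))  # 40.0 (was 75.75)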

Reference Standards and Ground Truth

Because clinical decision support is not a single-answer problem, the program uses layered reference standards:

  • Outcomes-linked reference standard (when available): validated outcomes and clinician-confirmed endpoints from operational settings are used to label escalation need, clinically significant misses, and safety-critical thresholds.
  • Guideline-aligned reference standard: expected next steps derived from accepted clinical pathways for scenarios where outcomes linkage is not the primary endpoint.
  • Clinician adjudication reference standard: a clinician panel defines the acceptable safe envelope when multiple safe pathways exist.

This layered approach evaluates real-world safety behavior and clinical defensibility rather than optimizing for exam-style answer matching.

Rater Panel, Blinding, Calibration, and Adjudication

Rater composition and training

Responses are scored by a clinician panel trained on the rubric using calibration items emphasizing:

  • red-flag escalation thresholds
  • contraindication and interaction safety behaviors
  • conservative handling of uncertainty
  • appropriate safety-netting language

MOAT-PROTECTING REDACTION: Rater identities and credential documents are withheld publicly; verification is available for audit.

Blinding

Outputs are anonymised and randomized. Raters are blinded to which system produced which output.

Adjudication

Disagreements are resolved by majority consensus where possible and by senior clinician adjudication when required, with rationale recorded.

Rater agreement reporting

Inter-rater agreement metrics are calculated and stored in the audit packet, including:

  • percent agreement
  • Cohen’s kappa (where applicable; see the sketch below)
  • calibration drift checks across scoring batches

MOAT-PROTECTING REDACTION: Numerical agreement metrics are provided to qualified auditors and research partners under confidentiality.
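
For reference, percent agreement and Cohen’s kappa for two raters reduce to a few lines of Python; the rating data below is illustrative and not drawn from Alerah’s panel.

    from collections import Counter

    def cohens_kappa(r1: list[str], r2: list[str]) -> float:
        # Observed agreement corrected for agreement expected by chance.
        n = len(r1)
        p_observed = sum(a == b for a, b in zip(r1, r2)) / n
        c1, c2 = Counter(r1), Counter(r2)
        p_expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
        return (p_observed - p_expected) / (1 - p_expected)

    # Two raters labeling ten outputs against the binary safety gate:
    r1 = ["pass"] * 8 + ["fail"] * 2
    r2 = ["pass"] * 7 + ["fail"] * 3
    print(round(cohens_kappa(r1, r2), 3))  # 0.737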

Public Reporting Policy: Transparency Without Dataset Leakage

Alerah’s public metrics balance transparency with dataset security.

Privacy-preserving rounding policy

Public-facing accuracy and safety rates are reported as whole-number percentages (rounded) to reduce case-mix reconstruction risk. Exact counts and full stratifications are available in the audit packet under confidentiality.

What is published publicly

For each evaluation stream, Alerah publicly discloses:

  • sample size
  • rounded accuracy and rounded safety rates
  • numerator ranges consistent with rounding
  • approximate 95% confidence intervals calculated using the Wilson method on the rounded rates

What is available under audit

  • exact numerators and denominators
  • item-level failure taxonomy
  • stratified tables by acuity tier, domain, year, and scenario type
  • scoring sheets and adjudication logs
  • locked protocol and configuration metadata
  • dataset immutability hashes

Results With Numerator Ranges and Confidence Intervals

Important: Because public rates are rounded, numerator values are shown as ranges consistent with whole-number rounding. Exact counts are available under audit.
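
The numerator ranges reported below follow mechanically from the whole-number rounding policy; a minimal Python sketch of the derivation, assuming a standard round-half-up convention:

    import math

    def numerator_range(rounded_pct: int, n: int) -> tuple[int, int]:
        # All numerators k for which 100 * k / n rounds to rounded_pct,
        # i.e. rounded_pct - 0.5 <= 100 * k / n < rounded_pct + 0.5.
        lo = math.ceil((rounded_pct - 0.5) * n / 100)
        hi = math.ceil((rounded_pct + 0.5) * n / 100) - 1
        return lo, hi

    print(numerator_range(98, 3124))  # (3046, 3077): Stream One accuracy
    print(numerator_range(99, 3124))  # (3078, 3108): Stream One safety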

Stream One: Real-world clinical cohort (n = 3,124)

  • Accuracy (public): 98%
  • Accuracy numerator range: 3,046 to 3,077 correct cases out of 3,124
  • Approximate 95% confidence interval (Wilson): 97.45% to 98.44%
  • Safety (public): 99%
  • Safety numerator range: 3,078 to 3,108 safe cases out of 3,124
  • Approximate 95% confidence interval (Wilson): 98.59% to 99.29%

Stream Two: Expanded red-flag and high-risk suite (n = 2,500; 2018–2024)

  • Accuracy (public): 98%
  • Accuracy numerator range: 2,438 to 2,462 correct cases out of 2,500
  • Approximate 95% confidence interval (Wilson): 97.37% to 98.48%
  • Safety (public): 99%
  • Safety numerator range: 2,463 to 2,487 safe cases out of 2,500
  • Approximate 95% confidence interval (Wilson): 98.53% to 99.32%

Stream Three: Expanded head-to-head benchmark versus GPT-5 Health (n = 1,200; free-text; no multiple-choice questions)

  • Accuracy (public): 98%
  • Accuracy numerator range: 1,170 to 1,181 correct cases out of 1,200
  • Approximate 95% confidence interval (Wilson): 97.04% to 98.65%
  • Safety (public): 99%
  • Safety numerator range: 1,182 to 1,193 safe cases out of 1,200
  • Approximate 95% confidence interval (Wilson): 98.26% to 99.43%

Comparative outcome: advantage over GPT-5 Health is assessed using paired difference analysis under a locked composite score framework; effect sizes and distributions are available under audit.

MOAT-PROTECTING REDACTION: Comparator configuration details, comparator raw outputs, and item-level score sheets are withheld publicly.

Stream Four: Large-scale knowledge and consistency benchmark (n = 30,000)

  • Accuracy (public): 98%
  • Accuracy numerator range: 29,250 to 29,549 correct items out of 30,000
  • Approximate 95% confidence interval (Wilson): 97.84% to 98.15%
  • Safety (public): 99%
  • Safety numerator range: 29,550 to 29,849 safe items out of 30,000
  • Approximate 95% confidence interval (Wilson): 98.88% to 99.11%

MOAT-PROTECTING REDACTION: Question sources, item content, and exact item-type proportions are withheld publicly.

Statistical Analysis Plan

Confidence intervals

95% confidence intervals are computed using the Wilson method for proportions. Public intervals are approximate because public proportions are rounded; exact intervals using exact counts are included in the audit packet.
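
For concreteness, the Wilson score interval used throughout can be computed as follows; the example reproduces the Stream One public accuracy interval from the rounded rate.

    import math

    def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
        # 95% Wilson score interval for a binomial proportion.
        denom = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
        return center - half, center + half

    lo, hi = wilson_interval(0.98, 3124)
    print(f"{lo:.2%} to {hi:.2%}")  # 97.45% to 98.44%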

Head-to-head comparative analysis

For paired head-to-head comparisons, analysis includes:

  • paired composite score differences
  • bootstrap confidence intervals for mean differences (sketched below)
  • nonparametric paired testing where distributions are non-normal
  • sensitivity analyses across alternative composite score weightings

MOAT-PROTECTING REDACTION: Item-level paired outputs and effect size tables are withheld publicly and available for audit.
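
As a sketch of the bootstrap step only (the item-level paired scores themselves are withheld, and all scores below are hypothetical), a percentile bootstrap over paired composite-score differences can be written as:

    import random

    def bootstrap_mean_diff_ci(system_a: list[float], system_b: list[float],
                               n_boot: int = 10_000, alpha: float = 0.05,
                               seed: int = 0) -> tuple[float, float]:
        # Percentile bootstrap CI for the mean paired composite-score difference.
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(system_a, system_b)]
        means = sorted(
            sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
            for _ in range(n_boot)
        )
        return (means[int(n_boot * alpha / 2)],
                means[int(n_boot * (1 - alpha / 2)) - 1])

    # Hypothetical paired composite scores, for illustration only:
    a = [88.0, 92.5, 79.0, 95.0, 84.5]
    b = [85.0, 90.0, 80.5, 91.0, 83.0]
    print(bootstrap_mean_diff_ci(a, b, n_boot=2000))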

Pre-specified failure taxonomy

Failures are categorized and tracked as:

  • red-flag under-escalation
  • unsafe reassurance
  • unsafe medication recommendation or contraindication miss
  • incomplete safety-netting under uncertainty
  • incorrect differential prioritization
  • incorrect investigation or management sequencing
  • hallucinated clinical facts not supported by input

Public release includes category definitions; full counts and exemplars are available under audit.

Stratified Analyses and Robustness Checks

To demonstrate robustness beyond pooled values, results are stratified internally by:

  • year of case
  • clinical domain
  • acuity tier (routine, urgent, emergency escalation required)
  • scenario type (patient phrasing versus clinician phrasing)
  • missing-context severity (complete context, partial context, minimal context)

MOAT-PROTECTING REDACTION: Full stratified tables are withheld publicly to prevent reconstruction of case mix and benchmark content. Stratified results are included in the audit packet.

Reproducibility, Immutability, and Audit Packet

Alerah maintains an audit packet for each major evaluation run.

Audit packet contents (available under confidentiality)

  • dataset immutability hashes (see the sketch after this list)
  • exact numerators and denominators
  • full stratified performance tables
  • rater scoring sheets and adjudication notes
  • rater agreement statistics
  • locked protocol documentation
  • system configuration metadata
  • comparator configuration metadata for head-to-head streams
  • controlled access to redacted output transcripts for verification
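
Immutability hashing is standard practice; a minimal sketch of the idea (generic, not Alerah’s specific tooling):

    import hashlib

    def dataset_hash(path: str) -> str:
        # SHA-256 digest of the frozen evaluation file; any post-hoc edit
        # to the dataset changes the digest recorded in the audit packet.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()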

Anti-gaming safeguards

  • benchmark content is not published
  • evaluation sets are versioned and rotated
  • public examples are separated from scored sets
  • internal prompts for critique and contradiction testing are not disclosed publicly

Governance, Privacy, and Security

  • cases used for evaluation are anonymised prior to scoring
  • minimum necessary clinical content is used for evaluation
  • access is restricted to authorized evaluators
  • evaluation artifacts are encrypted at rest
  • access and changes are logged for auditability

MOAT-PROTECTING REDACTION: Detailed anonymisation transformation rules, re-identification risk thresholds, and internal security implementation parameters are withheld publicly.

Limitations (Explicit Transparency)

  • these results are internal and not peer reviewed
  • this is not a prospective randomized clinical trial
  • metrics do not constitute proof of improved patient outcomes
  • Alerah is decision support and educational guidance, not diagnosis
  • regulatory clearance is not claimed
  • external validation is planned and required for the strongest clinical claims

External Validation Roadmap

Phase One: Blinded clinician review study

  • prospective locked protocol
  • blinded scoring by multiple independent raters
  • pre-specified primary safety endpoints focused on red-flag escalation and false reassurance
  • public reporting of methodology with controlled redactions

Phase Two: Health system pilot partnership

  • workflow integration and human factors evaluation
  • safety event monitoring
  • governance and escalation pathway evaluation

Phase Three: Peer-reviewed publication

  • methods publication with controlled redactions
  • focus on safety behavior, calibration, and reproducibility
  • dataset governance aligned with privacy and ethics requirements

Summary of MOAT-PROTECTING REDACTIONS

The following are intentionally withheld from public release to protect proprietary advantage and prevent dataset reconstruction:

  • identities, number ranges, and weighting of underlying models
  • internal critique prompts and contradiction challenge logic
  • full benchmark item content for scored sets
  • item-level outputs and full score sheets
  • full stratified breakdowns by site, diagnosis, and acuity
  • exact composite penalty mapping and cap functions
  • detailed anonymisation transformation rules and thresholds

These items are available for qualified audit under confidentiality via the audit packet.

Selected Reporting Frameworks That Shaped This White Paper

This white paper’s structure and transparency approach are informed by recognized reporting guideline principles for clinical artificial intelligence evaluation and study reporting:

  • DECIDE-AI (early-stage clinical evaluation reporting for AI decision support systems).
  • STARD-AI (diagnostic accuracy study reporting for AI-centered diagnostic tests).
  • CONSORT-AI (reporting guidance for clinical trials involving AI interventions).
  • TRIPOD+AI (reporting guidance for prediction model studies using regression or machine learning methods).

End of White Paper