Current AI solutions for chest X-ray interpretation suffer from fragmentation: specialized models for segmentation, classification, and report generation operate in isolation, limiting their clinical utility. General foundation models like GPT-4, meanwhile, often hallucinate or struggle with the multi-step reasoning that accurate medical diagnosis demands.

MedRAX bridges this gap. It is the first AI agent framework designed to integrate state-of-the-art CXR analysis tools and large language models into a unified reasoning system, with no additional training required.

Architecture

MedRAX operates on a ReAct (Reasoning and Acting) loop driven by GPT-4o:

  1. Observation: Analyzing the current state and user input
  2. Thought: Determining what actions are necessary
  3. Action: Executing specialized tools and integrating findings into memory
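The loop above can be sketched in a few lines of plain Python. This is an illustrative toy, not MedRAX's actual code: `llm_decide` stands in for the GPT-4o call, and the single `classify` tool and its output are hypothetical.

```python
# Minimal ReAct-style loop (illustrative sketch; all names are stand-ins).

def llm_decide(memory):
    # Stand-in for the GPT-4o call: inspect memory, pick the next action.
    if not any(step == "classify" for step, _ in memory):
        return ("classify", "frontal_cxr.png")
    return ("finish", None)

def classify(image):
    # Stand-in tool: a real system would run a pathology classifier here.
    return {"cardiomegaly": 0.91}

TOOLS = {"classify": classify}

def react_loop(user_query, max_steps=5):
    memory = [("observation", user_query)]        # 1. Observation
    for _ in range(max_steps):
        action, arg = llm_decide(memory)          # 2. Thought
        if action == "finish":
            break
        result = TOOLS[action](arg)               # 3. Action
        memory.append((action, result))           # integrate findings into memory
    return memory
```

Each iteration folds the tool's output back into memory, so the next "thought" step reasons over everything gathered so far.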

Integrated Tools

The framework orchestrates purpose-built models for distinct clinical tasks:

  • Visual QA: CheXagent and LLaVA-Med
  • Segmentation: MedSAM and ChestX-Det
  • Grounding: Maira-2 (localizing regions from text descriptions)
  • Classification: TorchXRayVision (detection of 18 pathologies)
  • Report Generation: Model trained on CheXpert Plus

Built on LangChain/LangGraph, the system allows flexible deployment and easy tool replacement without retraining the core agent.
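The plug-in design can be illustrated with a plain tool registry. This is a sketch of the idea, not the LangChain/LangGraph API; the class and task names are hypothetical.

```python
# Sketch of swappable tool registration. Because the agent looks tools up
# by task name, re-registering a task swaps the backing model without
# retraining the core agent.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, task, fn):
        # Later registrations replace earlier ones for the same task.
        self._tools[task] = fn

    def run(self, task, *args):
        return self._tools[task](*args)

registry = ToolRegistry()
registry.register("segmentation", lambda img: f"MedSAM({img})")
registry.register("segmentation", lambda img: f"ChestX-Det({img})")  # swap in place
```

In LangGraph terms, the same effect comes from rebuilding the graph with a different tool list; the agent's reasoning policy is untouched either way.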

ChestAgentBench

To rigorously evaluate multi-step reasoning, we introduced ChestAgentBench, a benchmark significantly more demanding than existing single-step VQA datasets:

  • 2,500 complex queries derived from 675 expert-curated clinical cases (Eurorad)
  • Six-choice format requiring multi-step reasoning
  • Seven core competencies, including detection, localization, diagnosis, and reasoning
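Scoring a six-choice benchmark like this reduces to exact-match accuracy over the 2,500 queries. A minimal sketch, assuming predictions and gold answers are letters A-F:

```python
def six_choice_accuracy(predictions, answers):
    """Exact-match accuracy for a multiple-choice (A-F) benchmark."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Per-competency breakdowns follow the same pattern, just filtered to the subset of queries tagged with each competency.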

Results

  • 63.1% overall accuracy on ChestAgentBench (SOTA), vs. 56.4% for GPT-4o and 39.5% for CheXagent
  • 90.35% on SLAKE VQA, surpassing the previous best of 85.1%
  • 79.1% micro-F1 on MIMIC-CXR report generation, vs. 60.6% for M4CXR
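The micro-F1 used for report generation pools true positives, false positives, and false negatives across every pathology label before computing F1, so common findings weigh more than rare ones. A minimal sketch, assuming binary label matrices (rows = reports, columns = findings):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary label matrices (lists of 0/1 rows)."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t and p)          # pooled true positives
    fp = sum(1 for t, p in pairs if not t and p)      # pooled false positives
    fn = sum(1 for t, p in pairs if t and not p)      # pooled false negatives
    return 2 * tp / (2 * tp + fp + fn)
```

Macro-F1, by contrast, would average per-label F1 scores equally, which penalizes mistakes on rare pathologies much more heavily.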

A key finding: general-purpose LLMs outperformed specialized biomedical models on reasoning tasks, yet neither alone matched MedRAX, which pairs generalist reasoning with specialist tools.

In qualitative analysis, MedRAX resolved cases where GPT-4o hallucinated. In one example, MedRAX correctly identified a chest tube by synthesizing report data with visual QA, whereas GPT-4o inferred an endotracheal tube from positioning alone.

Authors

Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang