Current AI solutions for chest X-ray interpretation suffer from fragmentation: specialized models for segmentation, classification, and report generation operate in isolation, limiting their clinical utility. General foundation models like GPT-4, meanwhile, often hallucinate or struggle with the multi-step reasoning that accurate medical diagnosis demands.
MedRAX bridges this gap: it is the first AI agent framework designed to seamlessly integrate state-of-the-art CXR analysis tools with large language models into a unified reasoning system, without requiring additional training.
Architecture
MedRAX operates on a ReAct (Reasoning and Acting) loop driven by GPT-4o:
- Observation: Analyzing the current state and user input
- Thought: Determining what actions are necessary
- Action: Executing specialized tools and integrating findings into memory
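The loop above can be sketched in a few lines of Python. This is a minimal illustration only, with stand-in names (`fake_llm`, `classify`) that are not MedRAX's actual implementation; in the real system GPT-4o plays the reasoning role and specialized models play the tool roles.

```python
# Minimal sketch of a ReAct-style loop. fake_llm and classify are
# illustrative stand-ins, not MedRAX's actual components.

def classify(image):
    """Stand-in for a pathology classifier such as TorchXRayVision."""
    return {"Pneumothorax": 0.91}

TOOLS = {"classify": classify}

def fake_llm(state):
    """Stand-in for GPT-4o: inspects memory and decides the next action."""
    if "classify" not in state["memory"]:
        return {"thought": "Need pathology scores first.",
                "action": ("classify", state["image"])}
    return {"thought": "Enough evidence gathered.", "action": None}

def react_loop(query, image, max_steps=5):
    state = {"query": query, "image": image, "memory": {}}
    for _ in range(max_steps):
        step = fake_llm(state)          # Thought: decide what is needed
        if step["action"] is None:      # agent judges it can answer
            break
        name, arg = step["action"]
        result = TOOLS[name](arg)       # Action: run the chosen tool
        state["memory"][name] = result  # Observation: fold result into memory
    return state["memory"]

print(react_loop("Is there a pneumothorax?", "cxr_001.png"))
# → {'classify': {'Pneumothorax': 0.91}}
```

Each iteration folds tool output back into the agent's memory, so later reasoning steps can build on earlier findings.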
Integrated Tools
The framework orchestrates purpose-built models for distinct clinical tasks:
- Visual QA: CheXagent and LLaVA-Med
- Segmentation: MedSAM and ChestX-Det
- Grounding: Maira-2 (localizing regions from text descriptions)
- Classification: TorchXRayVision (detection of 18 pathologies)
- Report Generation: Model trained on CheXpert Plus
Built on LangChain/LangGraph, the system allows flexible deployment and easy tool replacement without retraining the core agent.
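The "easy tool replacement" property follows from dispatching tools by task name rather than hard-wiring them into the agent. Below is a hypothetical sketch of that idea; real MedRAX uses LangChain/LangGraph tool abstractions, which this simplifies to a plain registry.

```python
# Hypothetical tool registry illustrating hot-swappable tools.
# MedRAX itself builds on LangChain/LangGraph; this is a simplification.
from typing import Callable, Dict

class ToolRegistry:
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, task: str, fn: Callable) -> None:
        """Add or replace the tool for a clinical task. The agent
        dispatches by task name, so no retraining is needed."""
        self._tools[task] = fn

    def run(self, task: str, *args):
        return self._tools[task](*args)

registry = ToolRegistry()
registry.register("segmentation", lambda img: f"MedSAM mask for {img}")
# Swap in a different model later without touching the agent:
registry.register("segmentation", lambda img: f"ChestX-Det mask for {img}")
print(registry.run("segmentation", "cxr_001.png"))
# → ChestX-Det mask for cxr_001.png
```

Because the agent only sees task names and results, upgrading a segmentation or classification model is a one-line change.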
ChestAgentBench
To rigorously evaluate multi-step reasoning, we introduced ChestAgentBench, a benchmark significantly more demanding than existing single-step VQA datasets:
- 2,500 complex queries derived from 675 expert-curated clinical cases (Eurorad)
- Six-choice format requiring multi-step reasoning
- Seven core competencies, including detection, localization, diagnosis, and reasoning
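Scoring such a benchmark is simple exact-match accuracy over the six answer choices; the field layout below is an assumption, not the released schema. Note that random guessing on six choices yields only about 16.7%, well below all reported model scores.

```python
# Illustrative scoring for a six-choice benchmark like ChestAgentBench.
# The data layout here is assumed, not the benchmark's actual format.

def accuracy(predictions, answers):
    """Exact-match accuracy over multiple-choice answers (A-F)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

preds   = ["A", "C", "F", "B"]
answers = ["A", "C", "E", "B"]
print(f"{accuracy(preds, answers):.1%}")  # → 75.0%
```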
Results
- 63.1% overall accuracy on ChestAgentBench (SOTA), vs. 56.4% for GPT-4o and 39.5% for CheXagent
- 90.35% on SLAKE VQA, surpassing the previous best of 85.1%
- 79.1% micro-F1 on MIMIC-CXR report generation, vs. 60.6% for M4CXR
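The micro-F1 figure above pools true positives, false positives, and false negatives across all reports and labels before computing F1, which weights frequent findings more heavily than macro-F1. A small self-contained sketch (with made-up labels, not MIMIC-CXR data):

```python
# Micro-averaged F1 for multi-label CXR findings.
# The example labels are illustrative, not MIMIC-CXR annotations.

def micro_f1(preds, golds):
    """preds/golds: lists of label sets, one per report."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))  # pooled true positives
    fp = sum(len(p - g) for p, g in zip(preds, golds))  # pooled false positives
    fn = sum(len(g - p) for p, g in zip(preds, golds))  # pooled false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = [{"Edema", "Cardiomegaly"}, {"Pneumonia"}]
golds = [{"Edema"}, {"Pneumonia", "Effusion"}]
print(round(micro_f1(preds, golds), 3))  # → 0.667
```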
A key finding: general-purpose LLMs outperformed specialized biomedical models on reasoning tasks, but MedRAX bridged the gap by combining generalist reasoning with specialist tools.
In qualitative analysis, MedRAX resolved cases where GPT-4o hallucinated: in one example, it correctly identified a chest tube by synthesizing report data with visual QA output, whereas GPT-4o inferred an endotracheal tube from positioning alone.
Links
- Paper: arxiv.org/abs/2502.02673
Authors
Adib Fallahpour, Jichen Ma, Alif Munim, Hanlin Lyu, Bo Wang