Current AI solutions for chest X-ray interpretation suffer from fragmentation: specialized models for segmentation, classification, and report generation operate in isolation, limiting their clinical utility. General foundation models like GPT-4, meanwhile, often hallucinate or struggle with the multi-step reasoning that accurate medical diagnosis demands.

MedRAX bridges this gap. It is the first AI agent framework designed to integrate state-of-the-art CXR analysis tools and large language models into a unified reasoning system, with no additional training required.

Architecture

MedRAX operates on a ReAct (Reasoning and Acting) loop driven by GPT-4o:

  1. Observation: Analyzing the current state and user input
  2. Thought: Determining what actions are necessary
  3. Action: Executing specialized tools and integrating findings into memory
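The loop above can be sketched in a few lines of plain Python. This is an illustrative toy, not MedRAX's actual code: `llm_decide` stands in for the GPT-4o call, and the single `classify` tool and its output are hypothetical.

```python
# Minimal ReAct-style loop (illustrative sketch; all names are stand-ins).

def llm_decide(memory):
    # Stand-in for the GPT-4o call: inspect memory, pick the next action.
    if not any(step == "classify" for step, _ in memory):
        return ("classify", "frontal_cxr.png")
    return ("finish", None)

def classify(image):
    # Stand-in tool: a real system would run a pathology classifier here.
    return {"cardiomegaly": 0.91}

TOOLS = {"classify": classify}

def react_loop(user_query, max_steps=5):
    memory = [("observation", user_query)]        # 1. Observation
    for _ in range(max_steps):
        action, arg = llm_decide(memory)          # 2. Thought
        if action == "finish":
            break
        result = TOOLS[action](arg)               # 3. Action
        memory.append((action, result))           # integrate findings into memory
    return memory
```

Each iteration folds the tool's output back into memory, so the next "thought" step reasons over everything gathered so far.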

Integrated Tools

The framework orchestrates purpose-built models for distinct clinical tasks:

  • Visual QA: CheXagent and LLaVA-Med
  • Segmentation: MedSAM and ChestX-Det
  • Grounding: Maira-2 (localizing regions from text descriptions)
  • Classification: TorchXRayVision (detection of 18 pathologies)
  • Report Generation: Model trained on CheXpert Plus

Built on LangChain/LangGraph, the system allows flexible deployment and easy tool replacement without retraining the core agent.
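The plug-in design can be illustrated with a plain tool registry. This is a sketch of the idea, not the LangChain/LangGraph API; the class and task names are hypothetical.

```python
# Sketch of swappable tool registration. Because the agent looks tools up
# by task name, re-registering a task swaps the backing model without
# retraining the core agent.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, task, fn):
        # Later registrations replace earlier ones for the same task.
        self._tools[task] = fn

    def run(self, task, *args):
        return self._tools[task](*args)

registry = ToolRegistry()
registry.register("segmentation", lambda img: f"MedSAM({img})")
registry.register("segmentation", lambda img: f"ChestX-Det({img})")  # swap in place
```

In LangGraph terms, the same effect comes from rebuilding the graph with a different tool list; the agent's reasoning policy is untouched either way.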

ChestAgentBench

To rigorously evaluate multi-step reasoning, we introduced ChestAgentBench, a benchmark significantly more demanding than existing single-step VQA datasets:

  • 2,500 complex queries derived from 675 expert-curated clinical cases (Eurorad)
  • Six-choice format requiring multi-step reasoning
  • Seven core competencies, including detection, localization, diagnosis, and reasoning
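Scoring a six-choice benchmark like this reduces to exact-match accuracy over the 2,500 queries. A minimal sketch, assuming predictions and gold answers are letters A-F:

```python
def six_choice_accuracy(predictions, answers):
    """Exact-match accuracy for a multiple-choice (A-F) benchmark."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Per-competency breakdowns follow the same pattern, just filtered to the subset of queries tagged with each competency.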

Results

  • 63.1% overall accuracy on ChestAgentBench (SOTA), vs. 56.4% for GPT-4o and 39.5% for CheXagent
  • 90.35% on SLAKE VQA, surpassing the previous best of 85.1%
  • 79.1% micro-F1 on MIMIC-CXR report generation, vs. 60.6% for M4CXR
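The micro-F1 used for report generation pools true positives, false positives, and false negatives across every pathology label before computing F1, so common findings weigh more than rare ones. A minimal sketch, assuming binary label matrices (rows = reports, columns = findings):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary label matrices (lists of 0/1 rows)."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t and p)          # pooled true positives
    fp = sum(1 for t, p in pairs if not t and p)      # pooled false positives
    fn = sum(1 for t, p in pairs if t and not p)      # pooled false negatives
    return 2 * tp / (2 * tp + fp + fn)
```

Macro-F1, by contrast, would average per-label F1 scores equally, which penalizes mistakes on rare pathologies much more heavily.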

A key finding: general-purpose LLMs outperformed specialized biomedical models on reasoning tasks, yet neither alone matched MedRAX, which pairs generalist reasoning with specialist tools.

In qualitative analysis, MedRAX resolved cases where GPT-4o hallucinated. In one example, MedRAX correctly identified a chest tube by synthesizing report data with visual QA, whereas GPT-4o inferred an endotracheal tube from positioning alone.

Authors

Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang