Large language models have shown great promise in healthcare, but deploying proprietary cloud-based systems (like GPT-4) in clinical settings faces significant hurdles: data privacy, cost, and infrastructure dependence. Open-source models exist but often require massive compute (e.g., DeepSeek-R1 at 671B parameters), making them impractical for local deployment.

This work investigates whether on-device LLMs—models designed to run on consumer-grade hardware—can match cloud-based performance for clinical decision support while preserving patient privacy.

Methodology

We benchmarked two on-device models (gpt-oss-20b and gpt-oss-120b) against state-of-the-art proprietary models (GPT-5, o4-mini) and DeepSeek-R1, across three clinical tasks:

LLM-as-a-Generalist

General disease diagnosis from 207 radiology case reports (Eurorad) based on patient history and imaging findings.

LLM-as-a-Specialist

Complex ophthalmology diagnosis and management questions requiring specialty-specific knowledge.

LLM-as-a-Clinical-Judge

Simulating human expert grading of clinical decisions across five specialties: internal medicine, neurology, cardiology, pulmonology, and gastroenterology.

Key Results

Zero-shot Performance

The on-device gpt-oss-120b effectively matched the proprietary o4-mini and surpassed DeepSeek-R1 (671B parameters) in general diagnosis. In ophthalmology, gpt-oss-120b outperformed both GPT-5 and o4-mini.

Fine-tuning

Fine-tuning the smaller gpt-oss-20b on general radiology data produced a remarkable accuracy jump:

  • 77.3% → 86.5% diagnostic accuracy
  • Outperformed DeepSeek-R1 and o4-mini
  • Approached GPT-5 performance (88.9%)

This is significant: a 20B parameter model, runnable on a single consumer GPU, achieving near-SOTA clinical performance after domain-specific fine-tuning.

Clinical Judgment

On-device models proved to be stable and reliable clinical judges, aligning closely with human expert scores and exhibiting greater stability than proprietary alternatives.

Implications

On-device LLMs represent a practical path to deploying AI in clinical settings where:

  • Patient data cannot leave the institution (privacy regulations, institutional policy)
  • Internet connectivity is unreliable (rural clinics, disaster response)
  • Cost constraints prohibit ongoing API fees at scale
  • Latency requirements demand immediate responses

The fine-tuning results suggest that relatively small, efficient models can achieve clinical-grade performance when adapted to specific medical domains.

Authors

Alif Munim, Jichen Ma, Omar Ibrahim, Ahmed Abdalla, Shuo Yin, Leo Chen, Bo Wang