Current foundation models for echocardiography face a fundamental challenge rooted in the physics of ultrasound: stochastic noise. Ultrasound video is dominated by speckle patterns and acquisition artifacts (shadows, attenuation) that vary from scan to scan and carry no anatomical information.

This creates a critical problem for existing approaches:

  • Masked Autoencoders (MAE) reconstruct raw pixels. Because speckle is high-frequency noise, the reconstruction loss forces the model to spend capacity memorizing that noise instead of learning anatomy.
  • Contrastive models align images with text reports, which tend to emphasize diagnostic language rather than fine-grained visual anatomy.

The JEPA Hypothesis

EchoJEPA shifts the training objective from pixel reconstruction to latent prediction. Built on JEPA—Yann LeCun’s architecture for intelligent machines with internal world models—the model predicts the meaning of masked video regions in an abstract space where noise can be discarded.

The architecture adapts V-JEPA2 for medical imaging. The input video is split into masked and unmasked tubelets. A Context Encoder processes the unmasked tubelets, and a predictor infers the embeddings of the masked regions. Crucially, instead of predicting pixels, the model predicts the output of a Target Encoder (an Exponential Moving Average of the Context Encoder). Because this target evolves slowly, it naturally suppresses unpredictable stochastic noise while reinforcing stable structures such as chamber walls.
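
The core training step can be sketched in a few lines of PyTorch. This is a minimal illustration of the EMA target and latent-space loss described above, not the actual EchoJEPA code: the function names, tensor shapes, and masking scheme are simplified assumptions.

```python
import copy
import torch
import torch.nn as nn

def ema_update(target: nn.Module, context: nn.Module, momentum: float = 0.999):
    """Move target-encoder weights toward the context encoder (EMA)."""
    with torch.no_grad():
        for p_t, p_c in zip(target.parameters(), context.parameters()):
            p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

def jepa_step(context_enc, target_enc, predictor, video, mask):
    """One latent-prediction step: predict target embeddings of masked tubelets.

    video: (B, T, D) flattened tubelet features; mask: (B, T) bool, True = masked.
    """
    with torch.no_grad():
        targets = target_enc(video)          # embeddings of the full clip, no gradient
    visible = video.masked_fill(mask.unsqueeze(-1), 0.0)  # crude masking for the sketch
    ctx = context_enc(visible)               # encode only the unmasked context
    pred = predictor(ctx)                    # infer embeddings of the masked regions
    # loss is computed on masked positions in latent space: no pixel reconstruction
    loss = ((pred - targets) ** 2)[mask].mean()
    return loss
```

Because `targets` are produced without gradients and the target encoder only drifts via `ema_update`, high-frequency speckle that the context cannot predict contributes little stable signal to the loss.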

Domain Adaptations

  • Temporal resolution increased to 24 fps to capture rapid cardiac dynamics
  • Augmentation restricted to preserve the clinically relevant fan-shaped sector geometry
  • EchoJEPA-G: ViT-Giant (1.1B parameters) trained on 18M proprietary videos
  • EchoJEPA-L: ViT-Large (300M parameters) trained on public MIMIC-IV-Echo (525K videos) for reproducibility
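
The restricted-augmentation idea can be illustrated with a small NumPy sketch. This is an assumption of what "geometry-preserving" might look like in practice (intensity jitter plus temporal cropping, no flips or rotations); the exact augmentation set used by EchoJEPA is not specified here.

```python
import numpy as np

def augment_clip(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Geometry-preserving augmentation for a (T, H, W) ultrasound clip in [0, 1].

    Only intensity and temporal perturbations are applied; spatial flips,
    rotations, and aggressive crops are avoided so the fan-shaped sector
    keeps its clinically expected orientation.
    """
    out = clip.astype(np.float32)
    # gamma jitter: mild brightness/contrast change that leaves geometry intact
    gamma = rng.uniform(0.8, 1.2)
    out = np.clip(out, 0.0, 1.0) ** gamma
    # temporal jitter: pick a random 16-frame window instead of spatial cropping
    if out.shape[0] > 16:
        start = rng.integers(0, out.shape[0] - 16 + 1)
        out = out[start:start + 16]
    return out
```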

Evaluation Framework

To ensure fair comparison, we developed a Unified Evaluation Protocol:

  • Frozen backbones: pretrained weights are frozen; only lightweight probes are trained
  • Multi-view probing: a novel framework integrating information across multiple cardiac views using factorized stream embeddings
  • Physics-informed robustness: benchmarks simulating real ultrasound degradation—depth attenuation and acoustic shadows
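
The frozen-backbone protocol amounts to training only a small head on top of fixed features. A minimal PyTorch sketch, assuming a single linear probe over pooled embeddings (the actual probes and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

def build_probe(backbone: nn.Module, embed_dim: int, num_classes: int):
    """Freeze the pretrained backbone; only a lightweight linear probe is trained."""
    for p in backbone.parameters():
        p.requires_grad = False            # frozen: the backbone is never updated
    backbone.eval()
    probe = nn.Linear(embed_dim, num_classes)
    # only the probe's parameters are handed to the optimizer
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    return probe, opt

def probe_step(backbone, probe, opt, x, y):
    """One supervised step on frozen features."""
    with torch.no_grad():
        feats = backbone(x)                # (B, embed_dim) frozen features
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Keeping the backbone frozen ensures that differences between pretrained models, not probe capacity, drive the benchmark results.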

Results

  • 20% lower error on cardiac function estimation (left ventricular ejection fraction, LVEF) vs. the best existing foundation model
  • 17% improvement on Right Ventricular Systolic Pressure estimation
  • 79% view classification accuracy with 1% of labels, vs. 42% for the best baseline using 100% of labels
  • Only 2.3% degradation under simulated poor acoustic windows vs. 16.8% for competitors
  • Zero-shot pediatric transfer outperforms all baselines, even after those baselines are fine-tuned

When architecture and compute are held constant, EchoJEPA (latent prediction) reduces LVEF error by 26.7% compared to EchoMAE (pixel reconstruction).

Acknowledgments

This work was a collaboration with Adib Fallahpour, Ahmadreza Attarpour, Teodora Szasz, River Jiang, Brana Soori, Maala Sooriyakanthan, Heather Whitney, and Jeremy Slivnick, with mentorship from Quentin Garrido and Koustuv Sinha (Meta AI, V-JEPA) and cross-institutional support from AWS, the University of Toronto, the University of Chicago, and Philips. The project was supervised by Dr. Bo Wang, with clinical lead Dr. Wendy Tsang and medical director Dr. Barry Rubin.