ML4H: Reproducibility, Foundation Models, and the Future of Health AI

This blog post is written by AI CDT student Phillip Sloan.

I recently attended ML4H (Machine Learning for Health), a conference bringing together researchers, clinicians, and industry leaders at the intersection of machine learning and healthcare. Held at The Westin in San Diego’s Gaslamp Quarter, the event spanned two packed days and showcased cutting-edge research, practical deployments, and thoughtful discussion on how AI can transform medicine.

ML4H 2025 opened with a clear message: breakthrough research in health AI will mean very little if it cannot be reproduced. Matthew McDermott, Assistant Professor at Columbia’s Department of Biomedical Informatics, set the tone by confronting a fundamental issue in medical machine learning: too much published research cannot be reliably reproduced because clinical data are structured inconsistently across institutions and studies. The MEDS (Medical Event Data Standard) ecosystem offers a path forward: a shared, open data standard for longitudinal medical records designed to make experiments repeatable, results comparable, and collaboration easier. Following on from this, Paul Liang, Assistant Professor at the MIT Media Lab and MIT EECS, introduced CLIMB and QoQ-Med, both of which aim to improve the training of multimodal clinical foundation models that can reason across text, images, and physiological signals, using domain-aware reinforcement learning strategies that emphasise harder medical reasoning tasks.

After these two technical talks, Gabriel Brat, Assistant Professor of Surgery and Biomedical Informatics at Harvard, offered his insights into the adoption of AI by clinicians. He outlined a current paradox: while a clinician working with AI should, in principle, outperform a clinician working alone, real-world studies show that radiologists often ignore or resist AI recommendations. Senior clinicians may also lose confidence when AI disagrees with them, even when models are explainable. His talk raised a crucial question: how do we design systems that physicians trust and can collaborate with effectively?

Serena Yeung-Levy, Assistant Professor of Biomedical Data Science at Stanford University, shifted the focus towards computer vision in healthcare, discussing benchmarks spanning microscopy and wider clinical settings. She introduced MicroVQA, a multimodal reasoning benchmark for microscopy-based scientific research, and BIOMEDICA, an open biomedical image-caption archive derived from scientific literature. Her talk highlighted that progress in medical AI depends as much on better benchmarks as it does on better models.

The afternoon session began with spotlight talks, followed by a poster session highlighting the diversity of work at ML4H. I presented our own research on clinically aligned multimodal chest X-ray classification, which led to thoughtful discussions about how such systems might integrate into real radiology workflows. The day concluded with a research roundtable, where I spent most of my time at the drug discovery table. Interestingly, conversations about persuading biologists to adopt machine learning closely mirrored Gabriel Brat’s earlier talk on radiologists’ scepticism towards AI. This reinforced the idea that technological progress must be matched by cultural change. The evening ended with a social event, providing space for informal conversations that carried many of these themes well beyond the conference halls.

Day two began with a fireside career chat featuring Tara Taghavi, Senior Director of AI at Oracle Health; Stephen Pfohl, Senior Research Scientist at Google; and Irene Chen, Assistant Professor in Computational Precision Health at UC Berkeley and UCSF. The panellists reflected on their paths through academia and industry and offered candid advice on navigating interdisciplinary careers in health AI. This was followed by a series of live demonstrations that brought research into the realm of practice: an on-device ECG interpretation app enabling privacy-preserving, real-time analysis; a cancer case-prioritisation system designed to ensure urgent cases reach clinicians faster; a medical coding platform for home health agencies; and RadGame, an AI-powered radiology education platform. Given my background in radiology, I was particularly interested in RadGame, which incorporated elements of my research area, automated radiology report generation, into its pipeline.

The remainder of the day turned towards deployment and impact. Julia Adler-Milstein, Professor of Medicine at UCSF, delivered a keynote on health AI delivery science, addressing how to build AI-ready clinical infrastructures, sustain ongoing model monitoring, and train a workforce capable of safely using these tools. Sherri Rose, Professor of Health Policy and Computer Science at Stanford, followed with a perspective on social drivers of health, emphasising that without careful design, machine learning risks amplifying existing health disparities. After lunch, an informative panel on clinical implementation tackled questions such as what makes a model ‘deployable’ in medicine, after which spotlight talks showcased agentic approaches to rare disease phenotyping, diffusion-based MRI reconstruction, and large-scale clinical data standardisation pipelines.

After this came the second poster session, where I spent most of my time at two posters. The first, “Multimodal Cancer Modeling in the Age of Foundation Models,” presented a framework for predicting cancer survival using zero-shot foundation model embeddings across pathology reports, histology images, and gene expression data, showing that simple multimodal fusion improves prognostic performance. The second, “RadGame: An AI-Powered Platform for Radiology Education,” introduced a gamified AI platform that gives medical students automated feedback as they practise localisation and reporting skills on chest radiographs. It was fascinating to discuss how my research interests could extend not only into clinical applications, but also into medical education.

The final perspective talk was delivered by Allison Koenecke, Assistant Professor of Information Science at Cornell Tech, who spoke on listening to users when auditing medical AI scribes. She highlighted the challenges of automated speech recognition in healthcare, particularly for languages other than English and for patients with speech impairments such as dysphasia and aphasia, underscoring the importance of inclusive design when deploying clinical transcription systems. The conference closed with the final keynote by Emily Fox, Professor of Statistics and Computer Science at Stanford, who explored how causal machine learning methods can help unravel disease mechanisms and advance drug discovery. Together, these talks provided a fitting conclusion, emphasising that the future of health AI must be not only powerful, but also equitable and grounded in deeper scientific understanding.

I thoroughly enjoyed my time at ML4H 2025. Beyond the exciting research and technical discussions, the conference offered a valuable opportunity to connect with researchers and clinicians tackling many of the same challenges I encounter in my own work. I left San Diego with new ideas, fresh perspectives, and a deeper appreciation for the collaborative effort required to bring AI meaningfully into healthcare. I am already looking forward to seeing how these conversations and innovations evolve in the years ahead.