ML4H: Reproducibility, Foundation Models, and the Future of Health AI

This blog post is written by AI CDT student, Phillip Sloan

I recently attended ML4H (Machine Learning for Health), a conference bringing together researchers, clinicians, and industry leaders at the intersection of machine learning and healthcare. Held at The Westin in San Diego’s Gaslamp Quarter, the event spanned two packed days and showcased cutting-edge research, practical deployments, and thoughtful discussion on how AI can transform medicine.

ML4H 2025 opened with a clear message: breakthrough research in health AI will mean very little if it cannot be reproduced. Matthew McDermott, Assistant Professor at Columbia’s Department of Biomedical Informatics, set the tone by confronting a fundamental issue in medical machine learning: too much published research cannot be reliably recreated because clinical data are structured inconsistently across institutions and studies. The MEDS ecosystem offers a path forward: a shared, open data standard for longitudinal medical records designed to make experiments repeatable, results comparable, and collaboration frictionless. From this foundation, Paul Liang, Assistant Professor at the MIT Media Lab and MIT EECS, introduced CLIMB and QoQ-Med, both of which seek to improve the training of multimodal clinical foundation models that can reason across text, images, and physiological signals, using domain-aware reinforcement strategies that emphasise harder medical reasoning tasks.

After these two technical talks, Gabriel Brat, Assistant Professor of Surgery and Biomedical Informatics at Harvard, offered his insights into the adoption of AI by clinicians. He outlined a current paradox: while a human with AI should, in principle, outperform humans alone, real-world studies show that radiologists often neglect or resist AI recommendations. Senior clinicians may lose confidence when AI disagrees with them, even when models are explainable. His talk raised a crucial question: how do we design systems that physicians trust and collaborate with effectively?

Serena Yeung-Levy, Assistant Professor of Biomedical Data Science at Stanford University, shifted the focus towards computer vision in healthcare, discussing benchmarks spanning microscopy and wider clinical settings. She introduced MicroVQA, a multimodal reasoning benchmark for microscopy-based scientific research, and BIOMEDICA, an open biomedical image-caption archive derived from scientific literature. Her talk highlighted that progress in medical AI depends as much on better benchmarks as it does on better models.

The afternoon session began with spotlight talks, followed by a poster session highlighting the diversity of work at ML4H. I presented our own research on clinically aligned multimodal chest X-ray classification, which led to thoughtful discussions about how such systems might integrate into real radiology workflows. The day concluded with a research roundtable, where I spent most of my time at the drug discovery table. Interestingly, conversations about persuading biologists to adopt machine learning closely mirrored Gabriel Brat’s earlier talk on radiologists’ scepticism towards AI. This reinforced the idea that technological progress must be matched by cultural change. The evening ended with a social event, providing space for informal conversations that carried many of these themes well beyond the conference halls.

Day two began with a fireside career chat featuring Tara Taghavi, Senior Director of AI at Oracle Health; Stephen Pfohl, Senior Research Scientist at Google; and Irene Chen, Assistant Professor in Computational Precision Health at UC Berkeley and UCSF, who reflected on their paths through academia and industry and offered candid advice on navigating interdisciplinary careers in health AI. This was followed by a series of live demonstrations that brought research into the realm of practice: an on-device ECG interpretation app enabling privacy-preserving, real-time analysis; a cancer case-prioritisation system designed to ensure urgent cases reach clinicians faster; a medical coding platform for home health agencies; and RadGame, an AI-powered radiology education platform. Given my background in radiology, I was particularly interested in RadGame, which incorporated elements of my research area, automated radiology report generation, into its pipeline.

The remainder of the day turned towards deployment and impact. Julia Adler-Milstein, Professor of Medicine at UCSF, delivered a keynote on health AI delivery science, addressing how to build AI-ready clinical infrastructures, sustain ongoing model monitoring, and train a workforce capable of safely using these tools. Sherri Rose, Professor of Health Policy and Computer Science at Stanford, followed with a perspective on social drivers of health, emphasising that without careful design, machine learning risks amplifying existing health disparities. After lunch, an informative panel on clinical implementation tackled questions such as what makes a model ‘deployable’ in medicine, after which spotlight talks showcased agentic approaches to rare disease phenotyping, diffusion-based MRI reconstruction, and large-scale clinical data standardisation pipelines.

After this came the second poster session, where I spent most of my time at two posters: “Multimodal Cancer Modeling in the Age of Foundation Models,” which presented a framework for predicting cancer survival using zero-shot foundation model embeddings across pathology reports, histology images, and gene expression data, showing that simple multimodal fusion improves prognostic performance; and “RadGame: An AI-Powered Platform for Radiology Education,” which introduced a gamified AI platform to provide automated feedback to train medical students in localisation and reporting skills for chest radiographs. It was fascinating to discuss how my research interests could extend not only into clinical applications, but also into medical education.

The final perspective talk was delivered by Allison Koenecke, Assistant Professor of Information Science at Cornell Tech, who spoke on listening to users when auditing medical AI scribes. She highlighted the challenges of automated speech recognition in healthcare, particularly for languages other than English and for patients with speech impairments such as dysphasia and aphasia, underscoring the importance of inclusive design when deploying clinical transcription systems. The conference closed with the final keynote by Emily Fox, Professor of Statistics and Computer Science at Stanford, who explored how causal machine learning methods can help unravel disease mechanisms and advance drug discovery. Together, these talks provided a fitting conclusion, emphasising that the future of health AI must be not only powerful, but also equitable and grounded in deeper scientific understanding.

I thoroughly enjoyed my time at ML4H 2025. Beyond the exciting research and technical discussions, the conference offered a valuable opportunity to connect with researchers and clinicians tackling many of the same challenges I encounter in my own work. I left San Diego with new ideas, fresh perspectives, and a deeper appreciation for the collaborative effort required to bring AI meaningfully into healthcare. I am already looking forward to seeing how these conversations and innovations evolve in the years ahead.

May 25, 2026
Spring Research Conference Day 1 – Isabel Potter “Artists are not Technologists – AI for Scenography”

This blog post is written by AI CDT student, Lucy Farnik

Isabel Potter gave a talk at this year’s Spring Research Conference about their work on applying AI to scenography, which they are exploring in their PhD. They chose this research area partly because of their extensive experience in the creative arts, having been involved with theatre since age 14. They have also founded their own company in this space and are taking on various freelance projects in theatre alongside their PhD.

Isabel’s talk was built on one central theme: artists are not technicians. At the moment, generative AI is getting closer to being able to automate parts of scenography, from creating background music to staging. However, many of these tools are made for people with a STEM background and use terminology to match. For example, tools which can be used for immersive technology in the arts include Unreal Engine, which uses many computer vision and mathematics terms. Contrast this with tools like Adobe Photoshop, which uses terms such as “paintbrush tool” that come from the vocabulary artists use on a daily basis.

Isabel is trying to reduce the barrier to entry for artists. They are specifically focusing on lighting design, as this is the most under-explored area of immersive technology for scenography and is also the area that they have the most experience working in. At the moment, prompting large language models to create diagrams such as lighting plots leads to results which are not yet usable, but the step of translating lighting ideas into programs which can be loaded into a lighting desk is already somewhat doable by existing foundation models. They are currently exploring this as a starting point while optimizing for ease of use by a non-technical audience.

May 25, 2026
Spring Research Conference Day 1 – Professor Seth Bullock “AI for Collective Intelligence (AI4CI) Research Hub”

This blog post is written by AI CDT student, Fahd Abdelazim

Recent Artificial Intelligence (AI) advances have shown that the applications of AI extend far beyond increasing efficiency or convenience. It is now possible to use AI to tackle some of humanity’s most pressing challenges, from minimizing pandemics to managing extreme weather events and guiding sustainable urban development. However, addressing these issues requires specialized systems and skilled researchers to lead these innovations.

Recognizing the importance of tackling these challenges, the University of Bristol established the AI for Collective Intelligence (AI4CI) research hub, which will serve as the cornerstone for interdisciplinary collaboration, bringing together expert partners from across academia, government, charities and industry to harness the power of AI to address the complex challenges which lie at the intersection of humans and AI.

For example, the hub will work on personalizing treatment for diabetes patients and on using data to enhance NHS policies for patients. The hub will also work on enhancing pandemic prediction and response by analysing previous pandemics and exploring how AI can help policy makers and healthcare professionals make swift and informed decisions in the future.

Climate change is another pressing issue which increases the frequency and intensity of extreme weather events. AI can play a pivotal role in disaster management and mitigation by analysing real-time meteorological data to predict extreme weather events and provide early warnings. In the area of urban development, AI can help create smarter and more resilient cities. Analysing population density, transportation routes and energy consumption can allow for optimized infrastructure and improved public services.

It is clear that AI will play a pivotal role in building a better future, and it is necessary to fully capitalize on the potential of this technology. Through initiatives like the AI4CI research hub we can harness the power of AI to address the future challenges that humanity will face and create a better and more sustainable world for future generations.

May 25, 2026
ELISE Wrap up Event
This blog post is written by AI CDT student, Jonathan Erskine

I recently attended the ELISE Wrap up Event in Helsinki, marking the end of just one of many programs of research conducted under the ELLIS society, which “aims to strengthen Europe’s sovereignty in modern AI research by establishing a multi-centric AI research laboratory consisting of units and institutes distributed across Europe and Israel”.

This page does a good job of explaining ELISE and ELLIS if you want more information.

Here I summarise some of the talks from the two-day event (in varying detail). I also provide some useful contacts and potential sources of funding (you can skip to the bottom for these).

Robust ML Workshop

Peter Grünwald: ‘e’ is the new ‘p’

P-values are an important indicator of statistical significance when testing a hypothesis, whereby a calculated p-value must be smaller than some predefined value, typically $\alpha = 0.05$. This guarantees that the probability of a Type 1 Error (falsely rejecting a true null hypothesis) is at most 5%.

“p-hacking” is a malicious practice where statistical significance can be manufactured by, for example:
- stopping the collection of data once you get a P<0.05
- analyzing many outcomes, but only reporting those with P<0.05
- using covariates
- excluding participants
- etc.
Sometimes this is morally ambiguous. For example, imagine a medical trial where a new drug shows promising, but not statistically significant, results. Should a significance test fail, you can simply repeat the trial, sweep the new data into the old and repeat until you achieve the desired p-value, but this can be prohibitively expensive, and it is hard to know whether you are p-hacking or simply haven’t tested enough people to prove your hypothesis. This approach, called “optional stopping”, can violate Type 1 Error guarantees, i.e. it is hard to have faith in your threshold $\alpha$ because every additional look at the data is another chance of a false positive, so the cumulative probability of a false rejection grows beyond $\alpha$.
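To make the optional stopping problem concrete, here is a minimal simulation sketch. It is not from the talk: the z-test with known unit variance, the batch size of 20, and the limit of ten looks are all assumptions chosen for illustration. Data are generated under the null hypothesis, so every rejection is a false positive.

```python
import math
import random

def z_test_p_value(sample_sum, n):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = sample_sum / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def rejects_with_optional_stopping(rng, looks, batch_size, alpha):
    """Add a batch of data, test, and stop as soon as p < alpha."""
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch_size):
            total += rng.gauss(0.0, 1.0)  # the null hypothesis is true
        n += batch_size
        if z_test_p_value(total, n) < alpha:
            return True
    return False

def false_positive_rate(looks, trials=2000, batch_size=20, alpha=0.05, seed=0):
    """Fraction of simulated studies that (wrongly) reject the null."""
    rng = random.Random(seed)
    hits = sum(
        rejects_with_optional_stopping(rng, looks, batch_size, alpha)
        for _ in range(trials)
    )
    return hits / trials

single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"one look: {single_look:.3f}, up to ten looks: {ten_looks:.3f}")
```

With a single look the false positive rate should sit close to the nominal 5%, while testing after every batch pushes it well above that, which is the loss of the Type 1 Error guarantee described above.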

Peter described the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for “effortlessly combining results from several studies in the common scenario where the decision to perform a new study may depend on previous outcomes.”

Unlike with the p-value, this proposed method is “safe under optional continuation with respect to Type 1 error”; no matter when the data collection and combination process is stopped, the Type 1 error probability is preserved. For singleton nulls, e-values coincide with Bayes factors.
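As a rough illustration of why stopping at any time is safe, the sketch below monitors a likelihood-ratio e-value after every observation and rejects as soon as it reaches $1/\alpha$. This is a simplified assumption-laden example rather than Peter's method in full: the null is the single distribution N(0, 1), the alternative is a hypothetical point alternative N(0.5, 1), and the guarantee used is Ville's inequality, which bounds the chance that a non-negative martingale with starting value 1 ever reaches $1/\alpha$ by $\alpha$.

```python
import math
import random

def rejects_anytime(rng, true_mean, alt_mean=0.5, max_n=200, alpha=0.05):
    """Reject H0: mean = 0 the first time the running e-value reaches 1/alpha."""
    threshold = math.log(1.0 / alpha)
    log_e = 0.0  # log of the running product of likelihood ratios
    for _ in range(max_n):
        x = rng.gauss(true_mean, 1.0)
        # log of N(alt_mean, 1) density over N(0, 1) density at x
        log_e += alt_mean * x - alt_mean ** 2 / 2.0
        if log_e >= threshold:
            return True
    return False

def rejection_rate(true_mean, trials=2000, seed=0):
    """Fraction of simulated studies that reject the null."""
    rng = random.Random(seed)
    return sum(rejects_anytime(rng, true_mean) for _ in range(trials)) / trials

null_rate = rejection_rate(true_mean=0.0)  # false positives
alt_rate = rejection_rate(true_mean=0.5)   # power when the effect is real
print(f"false positive rate: {null_rate:.3f}, power: {alt_rate:.3f}")
```

Even though the test is checked after every single observation, the false positive rate should stay at or below 5%, in contrast to repeatedly checking a p-value.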

In any case, general e-values can be used to construct Anytime-Valid Confidence Intervals (AVCIs), which are useful for A/B testing as “with a bad prior, AVCIs become wide rather than wrong”.

Compared with classical approaches, e-values and AVCIs require you to plan for more data, with the benefit that optional stopping no longer inflates the Type 1 error rate. In the worst case you need more data, but on average you can stop sooner.

This is being adopted for online A/B testing but is more challenging for expensive domains, such as medical trials; you need to reserve more patients for your trial, but you won’t need them all. This is a challenging sell, but probability indicates that you should save time and effort in the majority of cases.

Other relevant literature pioneering this approach to significance testing includes Waudby-Smith and Ramdas, JRSS B, 2024.

There is an R package here for anyone who wants to play with Safe Anytime-Valid Inference.

Watch the full seminar here:

https://www.youtube.com/watch?v=PFLBWTeW0II

Tamara Broderick: Can dropping a little data change your conclusions – A robustness metric


Tamara advocated the value of economics datasets as rich test beds for machine learning, highlighting that one can examine the data produced from economic trials with respect to robustness metrics and can come to vastly different conclusions than those published in the original papers.

Focusing in, she described a micro-credit experiment where economists ran randomised controlled trials on small communities, taking approximately 16,500 data points with the assumption that their findings would generalise to larger communities. But is this true?

When can I trust decisions made from data?

In a typical setup, you (1) run an analysis on a series of data, (2) come to some conclusion on that data, and (3) ultimately apply those decisions to downstream data which you hope is not so far out-of-distribution that your conclusions no longer apply.

Why do we care about dropping data?

Useful data analysis must be sensitive to some change in data – but certain types of sensitivity are concerning to us, for example, if removing some small fraction of the data $\alpha$ were to:
- Change the sign of an effect
- Change the significance of an effect
- Generate a significant result of the opposite sign
Robustness metrics aim to give higher or lower confidence in our ability to generalise. In the case described, this implies a low signal-to-noise ratio, which is where Tamara introduces her novel metric (Approximate Maximum Influence Perturbation), which should help to quantify this vulnerability to noise.

Can we drop one data point to flip the sign of our answer?

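In reality, this is very expensive to test for any dataset where the sample size N is large: dropping a single point already means creating N leave-one-out datasets and re-running your analysis on each, and the count grows combinatorially once more points can be dropped. Instead, we need an approximation.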

Let the Maximum Influence Perturbation be the largest possible change induced in the quantity of interest by dropping no more than 100α% of the data.

From the paper:

We will often be interested in the set that achieves the Maximum Influence Perturbation, so we call it the Most Influential Set.

And we will be interested in the minimum data proportion α ∈ [0,1] required to achieve a change of some size ∆ in the quantity of interest, so we call that α the Perturbation-Inducing Proportion. We report NA if no such α exists.

In general, to compute the Maximum Influence Perturbation for some α, we would need to enumerate every data subset that drops no more than 100α% of the original data. And, for each such subset, we would need to re-run our entire data analysis. If m is the greatest integer smaller than 100α, then the number of such subsets is larger than $\binom{N}{m}$. For N = 400 and m = 4, $\binom{N}{m} = 1.05 \times 10^9$. So computing the Maximum Influence Perturbation in even this simple case requires re-running our data analysis over 1 billion times. If each data analysis took 1 second, computing the Maximum Influence Perturbation would take over 33 years to compute. Indeed, the Maximum Influence Perturbation, Most Influential Set, and Perturbation-Inducing Proportion may all be computationally prohibitive even for relatively small analyses.
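The arithmetic in the quoted passage is easy to check. The short sketch below recomputes the number of size-4 subsets of 400 points and the running time at one second per analysis; the variable names are my own.

```python
import math

N, m = 400, 4
subsets = math.comb(N, m)             # ways to choose which m points to drop
seconds_per_year = 365 * 24 * 60 * 60
years = subsets / seconds_per_year    # one re-run of the analysis per second

print(f"{subsets:,} subsets, about {years:.1f} years at one second each")
```

This reproduces the roughly 1.05 billion re-runs and the figure of just over 33 years quoted from the paper.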

Further definitions are described better in the paper, but suffice to say the approximation succeeds in identifying where analyses can be significantly affected by a minimal proportion of the data. For example, in the Oregon Medicaid study (Finkelstein et al., 2012), they identify a subset containing less than 1% of the original data that controls the sign of the effects of Medicaid on certain health outcomes. Dropping 10 data points takes the result from significant to non-significant.
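To give a feel for the idea behind the approximation, here is a toy sketch for the simplest possible quantity of interest, a sample mean. It is not the paper's implementation (see the linked repositories for that): it only assumes the general recipe of scoring each point by the first-order effect of giving it zero weight, sorting by that score, and summing the scores until the sign of the estimate flips. The function names and the data are made up for illustration.

```python
def most_influential_set(xs):
    """Indices whose removal is predicted (to first order) to flip the sign
    of the mean, or None if the linear approximation never flips it."""
    n = len(xs)
    mean = sum(xs) / n
    sign = 1.0 if mean > 0 else -1.0
    # Linearised change in the mean from giving point i zero weight.
    influence = [-(x - mean) / n for x in xs]
    # Points that push hardest towards a sign flip come first.
    order = sorted(range(n), key=lambda i: sign * influence[i])
    approx, dropped = mean, []
    for i in order:
        if sign * influence[i] >= 0:
            break  # remaining points would not help flip the sign
        approx += influence[i]
        dropped.append(i)
        if sign * approx < 0:
            return dropped
    return None

def mean_without(xs, dropped):
    """Exact re-analysis: recompute the mean with the chosen points removed."""
    kept = [x for i, x in enumerate(xs) if i not in set(dropped)]
    return sum(kept) / len(kept)

# Hypothetical data: a small positive average effect driven by three outliers.
data = [-0.1] * 97 + [5.0] * 3
dropped = most_influential_set(data)
proportion = len(dropped) / len(data)
exact_mean = mean_without(data, dropped)
print(f"drop {len(dropped)} points ({proportion:.0%}); exact mean is now {exact_mean:.3f}")
```

The expensive step (re-running the analysis) happens only once, to confirm the prediction, rather than once per candidate subset.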

Code for the paper is available at:

https://github.com/rgiordan/AMIPPaper/blob/main/README.md

An R version of the AMIP metric is available:

https://github.com/maswiebe/metrics.git

Watch a version of this talk here:

https://www.youtube.com/watch?v=7eUrrQRpz2w

Cedric Archambeau: Beyond SHAP – Explaining probabilistic models with distributional values

Abstract from the paper:

A large branch of explainable machine learning is grounded in cooperative game theory. However, research indicates that game-theoretic explanations may mislead or be hard to interpret. We argue that often there is a critical mismatch between what one wishes to explain (e.g. the output of a classifier) and what current methods such as SHAP explain (e.g. the scalar probability of a class). This paper addresses such gap for probabilistic models by generalising cooperative games and value operators. We introduce the distributional values, random variables that track changes in the model output (e.g. flipping of the predicted class) and derive their analytic expressions for games with Gaussian, Bernoulli and categorical payoffs. We further establish several characterising properties, and show that our framework provides fine-grained and insightful explanations with case studies on vision and language models.

Cedric described how SHAP values can be reformulated as random variables on a simplex, shifting from the weight of individual players to a distribution of transition probabilities. Following this insight, they generate explanations on transition probabilities instead of individual classes, demonstrating their approach on several interesting case studies. This work is in its infancy and has plenty of opportunity for further investigation.
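For readers unfamiliar with the game-theoretic starting point, the sketch below computes classic Shapley values exactly for a three-player toy game. It does not implement the distributional values from the paper; it only shows the scalar attribution that SHAP-style methods build on. The `toy_score` function is a made-up stand-in for "the scalar probability of a class" when only some features are present.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's weighted average marginal contribution."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

def toy_score(coalition):
    """Hypothetical class probability given the set of features present."""
    score = 0.5
    if "a" in coalition:
        score += 0.2
    if "b" in coalition:
        score += 0.1
    if "a" in coalition and "b" in coalition:
        score += 0.1  # interaction between a and b
    return score

phi = shapley_values(["a", "b", "c"], toy_score)
print({player: round(v, 3) for player, v in phi.items()})
```

Each feature receives a single number, and the numbers sum to the difference between the full-model score and the baseline. The talk's point is that such a scalar summary says nothing directly about events like the predicted class flipping.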

Semantic, Symbolic and Interpretable Machine Learning Workshop

Nada Lavrač: Learning representations for relational learning and literature-based discovery

This was a survey of types of representation learning, focusing on Nada’s area of expertise in propositionalisation and relational data, Bisociative Literature-Based Discovery, and interesting avenues of research in this direction.

Representation Learning

Deep learning, while powerful (accurate), raises concerns over interpretability. Nada takes a step back to survey different forms of representation learning.

Sparse, Symbolic, Propositionalisation:
- These methods tend to be less accurate but are more interpretable.
- Examples include propositionalisation techniques that transform relational data into a propositional (flat) format.
Dense, Embeddings:
- These methods involve creating dense vector representations, such as word embeddings, which are highly accurate but less interpretable.
Recent work focuses on unifying approaches that combine the strengths of both.

Hybrid Methods:
- Incorporate Sparse and Deep methods
- DeepProp, PropDRM, propStar(?) – Methods discussed in their paper.
Representation learning for relational data can be achieved by:
- Propositionalisation – transforming a relational database into a single-table representation. Example: Wordification
- Inductive logic programming
- Semantic relational learning
- Relational subgroup discovery (written by Nada and our own P. Flach)
- Semantic subgroup discovery system “Hedwig”, which takes as input the training examples encoded in RDF and constructs relational rules by effective top-down search of ontologies, also encoded as RDF triples.
- Graph-based machine learning
  
  data and ontologies are mapped to nodes and edges
  
  In this example, gene ontologies are used as background knowledge for improving quality assurance of literature-based Gene Ontology Annotation
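To make the propositionalisation idea from the list above concrete, here is a rough sketch in the spirit of Wordification: each row of a main table becomes a bag of "words" of the form table_attribute_value, collected from the row itself and from related rows that share its key, and the bags are then laid out as one flat table. The table names, attributes and data are hypothetical, and refinements of the real method (such as TF-IDF weighting) are left out.

```python
from collections import Counter

def wordify(main_rows, related_tables, key):
    """Turn each main-table row into a bag of table_attribute_value words."""
    documents = {}
    for row in main_rows:
        words = Counter()
        for attr, val in row.items():
            if attr != key:
                words[f"main_{attr}_{val}"] += 1
        for table_name, rows in related_tables.items():
            for related in rows:
                if related[key] == row[key]:  # follow the foreign key
                    for attr, val in related.items():
                        if attr != key:
                            words[f"{table_name}_{attr}_{val}"] += 1
        documents[row[key]] = words
    return documents

def to_single_table(documents):
    """Lay the bags out as one flat table with a column per word."""
    vocabulary = sorted({w for bag in documents.values() for w in bag})
    table = {doc_id: [bag[w] for w in vocabulary] for doc_id, bag in documents.items()}
    return vocabulary, table

# Hypothetical relational data: trains, each with several cars.
trains = [{"train_id": 1, "load": "heavy"}, {"train_id": 2, "load": "light"}]
cars = [
    {"train_id": 1, "shape": "rectangle", "roof": "none"},
    {"train_id": 1, "shape": "rectangle", "roof": "peaked"},
    {"train_id": 2, "shape": "oval", "roof": "none"},
]

documents = wordify(trains, {"cars": cars}, key="train_id")
vocabulary, table = to_single_table(documents)
print(vocabulary)
for train_id, counts in table.items():
    print(train_id, counts)
```

The resulting single table can be fed to any standard propositional learner, which is what makes this family of methods attractive for relational data.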
These slides, although a little out of date, talk about a lot of what I have noted here, plus a few other interesting methodologies.

The GitHub repo for their book contains lots of Jupyter notebook examples.

https://github.com/vpodpecan/representation_learning.git

Marco Gori: Unified approach to learning over time and logic reasoning

I unfortunately found this very difficult to follow, largely due to my lack of subject knowledge. I do think what Marco is proposing requires an open mind as he re-imagines learning systems which do not need to store data to learn, and presents time as an essential component of learning for truly intelligent “Collectionless AI”.

I won’t try to rewrite his talk here, but he has a full classroom series available on Google, which he might give you access to if you email him.

Conclusions:
- Emphasising environmental interactions – collectionless AI which doesn’t record data
- Time is the protagonist: higher degree of autonomy, focus of attention and consciousness
- Learning theory inspired by theoretical physics & optimal control: Hamiltonian learning
- Neuro-symbolic learning and reasoning over time: semantic latent fields and explicit semantics
- Developmental stages and gradual knowledge acquisition
Contacts & Funding Sources

For Robust ML:

e-values, AVCIs:

Aaditya Ramdas at CMU

Peter Grünwald Hiring

For anyone who wants to do a Robust ML PhD, apply to work with Ayush Bharti: https://aalto.wd3.myworkdayjobs.com/aalto/job/Otaniemi-Espoo-Finland/Doctoral-Researcher-in-Statistical-Machine-Learning_R40167

If you know anyone working in edge computing who would like 60K to develop an enterprise solution, here is a link to the funding call: https://daiedge-1oc.fundingbox.com/ The open call starts on 29 August 2024.

If you’d like to receive monthly updates with new funding opportunities from Fundingbox, you can subscribe to their newsletter: https://share-eu1.hsforms.com/1RXq3TNh2Qce_utwh0gnT0wfegdm

Yoshua Bengio said he had fellowship funding but didn’t give out specific details, or I forgot to write them down… perhaps you can send him an email.
May 25, 2026