Understanding Clinical AI Research with Ethan Goh
This is part of our weekly executive insights series where Elion CEO Bobby Guelich speaks with healthcare leaders about their tech priorities and learnings. For more, become a member and sign up for our email here.
Role: Stanford Clinical Excellence Research Center Fellow, focused on evaluating large language models and generative AI applications for healthcare
Organization: Stanford University
You recently published a paper on the influence of AI on doctors’ diagnostic reasoning that got a lot of coverage. Can you give a quick overview of the study you ran, how it was designed, and its goal?
We wanted to evaluate how doctors would use AI tools on challenging diagnostic vignettes compared to conventional tools like UpToDate or internet searches. We expected better performance with AI, but that wasn’t the case. We had run a smaller pilot beforehand to establish the scope of the study, including how many doctors and cases we would need, and that pilot suggested AI would help, so the results of the larger study were particularly surprising.
Given how close you were to the study, what were your takeaways? Do you think the findings reflect real challenges with the tools, or was it more about how doctors were using them?
Great question. I think it highlights two things:
Without sufficient training, doctors may not trust or use these tools effectively.
While these tools are powerful, we still need to figure out how to integrate them into workflows and specific use cases so that clinicians can apply them effectively.
This study leveraged GPT-4. What were your observations on how physicians used it?
Most doctors found it intuitive, which was great—it didn’t require heavy instructions. In fact, its ease of use made the administrative aspects of the study much easier, because for so many tools that doctors use, you need an operations or administrative person to set them up and train the physicians side by side.
We also have transcripts of how they used it. A common approach was treating it like a Google search, asking for specific information like drug dosages. Some realized they could paste the entire vignette, which turned out to be one of the more effective strategies. However, many didn’t know to do that, which limited their results.
The AI alone outperformed the other cohorts. Was that from simply pasting the vignettes into GPT-4?
Exactly. We pasted the entire vignette into ChatGPT and gave it a direct prompt to solve the case. We did not do any sophisticated prompting; we gave it the exact same case details and questions that we gave the doctors. We then gave all of the responses, from both the doctors and the AI, to blinded graders for review.
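For readers curious what that looks like mechanically, here is a minimal sketch of the "paste the full vignette with a direct prompt" approach described above. It assumes the OpenAI Python SDK; the model name, prompt wording, and placeholder vignette are illustrative, not the study's exact materials.

```python
# Minimal sketch: paste an entire clinical vignette into the model with a
# direct instruction to solve the case. Model name and prompt wording are
# illustrative assumptions, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

vignette = """<full clinical vignette text goes here>"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "Read the following clinical vignette and give your most likely "
                "diagnosis, a ranked differential, and the next diagnostic steps.\n\n"
                + vignette
            ),
        }
    ],
)

print(response.choices[0].message.content)
```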
Any other takeaways for our audience of healthcare professionals, vendors, and investors?
Yes, several:
First, these tools are powerful across domains, but we have more work to do to better understand their specific strengths and limitations.
Second, as powerful as these tools are, training and trust are crucial—simply giving a tool to clinicians won’t work without thoughtful integration and education.
Finally, while vendors will likely use studies like this to highlight AI’s potential, additional robust studies are needed to help healthcare leaders make educated decisions and address liability and regulatory concerns.
What future research are you excited about?
We’ve built the ARiSE network with institutions like Stanford, Beth Israel, Virginia, and Minnesota to expand these types of studies. I’m particularly interested in combining generative AI with traditional predictive models. For instance, pairing AI that predicts sepsis with a generative model to explain the prediction could make the tool more transparent and actionable for clinicians.
Another focus is moving beyond simulated vignettes to real-world settings, incorporating real patient data and two-way conversations. Hospital leaders will need this level of evidence to guide adoption while managing liability and regulatory challenges.