Executive Insights
April 11, 2025

Understanding AI Evaluation and Governance with Dr. Sarah Gebauer

Bobby Guelich
CEO, Elion

This is part of our executive insights series where Elion CEO Bobby Guelich speaks with healthcare leaders about their tech priorities and learnings. For more, become a member and sign up for our email list here.

Role: Physician / Clinical AI Consultant / Researcher
Organization: Validara Health, RAND Corporation, ML for MDs, Healthcare AI consultant 

Tell us about your background and your work in healthcare AI.

I’m a physician board-certified in anesthesiology and palliative care with a clinical informatics degree from OHSU. My career has spanned clinical leadership roles, risk management committees, and hospital executive committees. Before medical school, I worked as a consultant at Bain and have continued healthcare consulting for the past five years, primarily focused on AI and general IT issues, including clinical workflow and physician metrics.

I became interested in AI—specifically natural language processing—before ChatGPT’s rise and started engaging with other physicians to discuss its impact. That led me to create a Slack community, which has grown to over 500 physicians discussing AI in medicine, and a newsletter to help educate both myself and others about different healthcare AI issues.

About two years ago, I joined the RAND Corporation, a think tank, where I work on AI system evaluations for national security. That field is still evolving, but the methods we’re developing to evaluate AI risk and performance are widely applicable, including to healthcare.

AI evaluation is a major challenge in healthcare today. Given your perspective, what are the key issues you see?

Healthcare IT has a history of over-promising and under-delivering. I remember when EHRs were introduced—doctors were told they’d save time and streamline everything. While some things improved (like record-sharing across hospitals), documentation burden skyrocketed. Physicians are understandably skeptical of new technology that promises to make their lives easier.

That’s why AI evaluation needs to be rigorous. For example, we need:

  • Clear validation metrics: AI vendors often create their own benchmarks, making it hard to compare solutions. Health systems need standardized ways to evaluate AI across their own priorities, like clinical risk reduction, revenue impact, workflow fit, and patient experience. It’s also important to acknowledge that solutions will perform differently across different contexts or specialties.

  • Contextualized results: How does the AI solution compare to what we’re doing now? Physicians will be much more willing to accept products that aren’t perfect if the current state is poor. That’s one reason physicians, who are often resistant to new technology, have been so willing to adopt ambient scribes: the documentation burden is so high, and medical notes contain incorrect facts about half the time, that almost any improvement is acceptable.

  • Ongoing monitoring: I think contracts will start to require that vendors show their product performs as well six months after deployment as it did in initial tests. Also, most hospitals are layering multiple AI solutions, but we rarely test how they interact; in many cases, AI models working together perform worse than they do alone.

These are still open challenges, but solving them is critical for ensuring AI meaningfully improves care.

Practically speaking, how do we begin to address some of these challenges, given the associated cost and the pace at which this technology is evolving—and particularly as solutions are increasingly agentic in nature?

I’m hopeful that some of the agentic benchmarking work being done outside healthcare will be helpful in the healthcare setting as well. Banking, education, and other regulated industries face similar AI challenges, and AI researchers are working on standardized benchmarks for agentic AI. Healthcare can build on that work rather than reinventing the wheel.

Public AI benchmarks are also gaining traction. In other industries, it’s standard practice for companies to test their AI on shared benchmarks as a credibility signal. Healthcare should move in that direction—allowing hospitals to compare AI solutions apples-to-apples.

You’ve been on a lot of hospital executive committees. What’s your perspective on how AI governance efforts are shaping up?

Health systems’ processes for approving these AI tools are highly disparate right now. Some hospitals approve AI tools after a single test, while others require three or four rounds of committee meetings and sign-offs at multiple levels. There’s more anxiety about approving AI and getting it wrong than there was for previous health IT tools.

Some key governance challenges:

  • One-size-fits-all approval processes: AI governance committees often apply the same scrutiny to low-risk tools (e.g., pre-authorization automation) as they do to high-risk tools (e.g., clinical decision support). 

  • Unrealistic transparency expectations: Some hospitals want AI vendors to fully explain how a model makes decisions, but that’s often impossible, especially for tools built on proprietary LLMs. Governance needs to focus on validating outcomes rather than demanding perfect explainability.

  • Endless testing loops: Many AI vendors get stuck in cycles of repeated testing because hospitals keep asking for “just one more validation.” This slows adoption without necessarily improving patient safety.

What advice would you give health systems on improving AI governance?

  1. Create tiered review processes: Apply different levels of review scrutiny to high-risk versus low-risk AI applications.

  2. Follow evolving data on AI testing: Research is emerging on how many test cases are needed to detect rare AI failures (a rough illustration of the numbers involved follows below). Governance should be based on evidence, not arbitrary demands.

  3. Support innovation-friendly cultures: AI will never be perfect. If hospital leadership treats every unexpected AI issue as a “fireable offense,” no one will take risks. Creating a culture where decision-makers feel safe experimenting is key to adoption.
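As a rough illustration of the sample-size question behind the second recommendation: if a failure occurs in a fraction p of cases, a set of n independent test cases will surface at least one example with probability 1 − (1 − p)^n. Pushing that probability above 95% requires n ≥ ln(0.05) / ln(1 − p), which works out to roughly 300 cases for a 1% failure rate and roughly 3,000 cases for a 0.1% rate; the rarer the failure mode, the larger the validation set needed to catch it.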

Dr. Gebauer’s views are her own and not those of the RAND Corporation.