Mapping Markets
September 25, 2024

Large Language Models Market Map: Unlocking the clinical use case

Patrick Wingo
Head of Research, Elion

This is part of Elion’s weekly market map series, where we break down critical vendor categories and the key players in them. For more, become a member and sign up for our email here.

You can’t bake a cake without flour, sugar, and eggs, and you can’t build a GenAI application without an LLM underpinning it. While many AI vendors are developing their own models or using off-the-shelf models, a growing number of innovative health systems and providers are turning to LLMs to build critical, custom applications.

LLMs are large-scale deep learning or machine learning models trained on vast datasets to perform tasks like natural language understanding, text summarization, translation, question answering, and conversational interaction. They’re used in AI applications to ingest large amounts of data—such as medical datasets, research findings, or patient records—to generate summaries, recommendations, and responses. Examples include ambient scribes, clinical decision support, intelligent contact center agents, and more.

To be considered for our large language model category, vendors need to produce a specific model or mixture of models that’s available for direct consumption by other developers, as opposed to building a model used only within the confines of their application.

Are LLMs Ready for Clinical Use?

While LLMs are already being used extensively in administrative functions like revenue cycle management, the next frontier is their use in clinical settings. The key question is: When will LLMs be accurate and safe enough for clinical use, and what safeguards need to be in place?

While we’ve seen dramatic progress in performance on tests like USMLE and MedQA, indicating improved clinical reasoning on individual tasks, only 5% of studies in a systematic review of LLM performance evaluated the models on real data from patient care.

Some of the major challenges in applying these models to clinical scenarios include:

  • Accuracy and safety: The models must handle rare clinical cases, dosage calculations, and complex drug names without making errors or producing hallucinations.

  • Regulatory compliance: They must ensure HIPAA-compliant infrastructure, accurate patient identification, and robust privacy protections.

  • Bias and consistency: These models need to incorporate demographic data in decision-making without introducing biases, and their outputs must be consistent.

  • Patient interaction: The communication must be clear, non-leading, and adaptable to patient variability.

  • Workflow integration: LLMs should integrate smoothly into clinical workflows, supporting clinician judgment rather than replacing it.

Breaking Down the Large Language Model Market

Given the rate of progress with LLMs, we’re seeing a few different approaches in how vendors are producing and making these models consumable.

  • Frontier models: AI research labs and hyperscalers such as OpenAI, Meta (Llama), Google (Gemini), and Anthropic focus on building the most advanced general models and are investing heavily in algorithmic research and infrastructure. These models, however, are geared toward broad use cases rather than healthcare-specific needs.

  • Fine-tuned models: John Snow Labs’ MedLlama3 and Gradient’s Nightingale, along with models from ScienceIO and Insights AI, are examples of frontier models fine-tuned for healthcare. These specialized models often incorporate compliance features for handling patient data, and they need to be retrained as the underlying frontier models evolve in order to maintain their functionality.

  • Foundational models: Some use-case-specific models, such as those from GenHealth.ai (focused on population health) and Harrison.ai (multimodal models for radiology), are built independently from frontier models to meet highly specialized needs.

  • Mixture of agents: One of the most promising techniques for LLMs seems to be the use of multiple agents (models with specific instructions, context, and memory), each prompted or fine-tuned for specific tasks, where the output of one model can be processed and used as the input of other models. This approach has shown success in products like Hippocratic AI and MedGemini.
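To make the mixture-of-agents idea concrete, here is a minimal sketch in Python of agents chained so that each one's output becomes the next one's input. Everything in it is an assumption for illustration: the `Agent` class, the `run_pipeline` helper, and the two stub "models" are hypothetical stand-ins for real LLM calls, and the sketch does not reflect how any vendor named above actually builds its product.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" is any callable that maps a prompt string to a response string.
# In a real system this would wrap a call to a hosted or local LLM (assumed).
Model = Callable[[str], str]


@dataclass
class Agent:
    name: str
    instructions: str  # task-specific prompt for this agent
    model: Model

    def run(self, text: str) -> str:
        # Prepend the agent's instructions to the incoming text.
        return self.model(f"{self.instructions}\n\n{text}")


def run_pipeline(agents: List[Agent], text: str) -> str:
    # Feed the output of each agent into the next one, in order.
    for agent in agents:
        text = agent.run(text)
    return text


# Hypothetical stub models so the sketch runs without any LLM access.
def stub_summarizer(prompt: str) -> str:
    body = prompt.split("\n\n", 1)[1]
    return body.split(".")[0].strip() + "."  # keep only the first sentence


def stub_reviewer(prompt: str) -> str:
    body = prompt.split("\n\n", 1)[1]
    return "REVIEWED: " + body  # a real reviewer would check the content


agents = [
    Agent("summarizer", "Summarize the note.", stub_summarizer),
    Agent("reviewer", "Check the summary for unsupported claims.", stub_reviewer),
]

result = run_pipeline(
    agents, "Patient reports mild headache. No fever. Follow up in two weeks."
)
print(result)
```

In this sketch the agents run in a fixed sequence. Real systems may also route between agents, run them in parallel, or give them memory, none of which is shown here.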

Our bet is that progress in frontier models and in novel architectures like mixture of agents will continue to drive improved performance on benchmarks like the USMLE, MedQA, and other field-specific tests. However, determining whether a model is safe, accurate, consistent, and free of bias in real-world clinical settings remains challenging without rigorous testing on actual patient data.

When developing workflows, it’s crucial to evaluate several leading models side by side to compare their performance. It’s also wise to anticipate upgrading or replacing models within a year as they evolve and become more refined, potentially yielding even better results.
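A side-by-side comparison can be as simple as running every candidate model against the same set of test cases and ranking the scores. The Python sketch below is one way to do that under loud assumptions: the model names, the placeholder cases, and the exact-match metric are all hypothetical, and a real clinical evaluation would need far richer test data and clinician review.

```python
from typing import Callable, Dict, List, Tuple

# A "model" is any callable that maps a prompt to a response (assumed wrapper).
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def exact_match_accuracy(model: Model, cases: List[Case]) -> float:
    # Fraction of cases where the response matches the expected answer,
    # ignoring case and surrounding whitespace.
    correct = sum(
        1
        for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def compare_models(
    models: Dict[str, Model], cases: List[Case]
) -> List[Tuple[str, float]]:
    # Score every model on the same cases and sort best-first.
    scores = [(name, exact_match_accuracy(m, cases)) for name, m in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Placeholder cases; real ones would come from your own workflow.
cases: List[Case] = [("case-1", "yes"), ("case-2", "no"), ("case-3", "yes")]

# Hypothetical stub models standing in for real LLM calls.
answers = dict(cases)
models: Dict[str, Model] = {
    "model_a": lambda prompt: "yes",            # always answers "yes"
    "model_b": lambda prompt: answers[prompt],  # answers every case correctly
}

ranking = compare_models(models, cases)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because each model sits behind the same callable interface, swapping in a newer model later means changing one entry in the `models` dictionary and rerunning the same cases.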

Better LLMs = Better for Everybody

This is arguably one of the most exciting areas in healthcare technology, as a rising tide lifts all boats. Advancements at any level, whether in frontier models, fine-tuning, or architecture, can lead to significant improvements in model performance, unlocking more clinical use cases and driving broader adoption.