High hopes and hard truths about AI in healthcare

The rise of generative AI and large language models (LLMs), such as ChatGPT, has generated considerable excitement and investment in their potential medical applications. Despite the hype, this enthusiasm has yet to be matched by the level of scientific evidence needed to ensure these technologies are deployed safely in healthcare.

The promise of generative AI in medicine

We’ve all heard that generative AI holds significant potential for transforming various aspects of medical practice. One potential application is in clinical note-taking and documentation. Automating this seemingly harmless routine task is heralded as a way to lighten the administrative load on healthcare professionals, giving them more time to focus on patient care. Additionally, generative AI has the potential to improve diagnostic processes and provide personalised treatment recommendations by processing vast amounts of medical data.

For instance, Epic Systems, the healthcare software provider whose platform organises, stores and shares electronic medical records, has integrated GPT-4 to help manage electronic patient messaging. This aims to streamline workflows by drafting responses to patient inquiries, which clinicians can then review and modify. The goal is to reduce the growing administrative burden on healthcare providers, which has increased with the expansion of electronic health record (EHR) systems.
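The details of Epic's integration are not public, but the pattern it describes is a simple human-in-the-loop workflow: the model drafts, the clinician reviews, and only approved text is sent. The sketch below is a minimal, hypothetical illustration of that pattern using the OpenAI Python client; the model name, prompt and review step are assumptions for illustration, not Epic's implementation.

```python
# Hypothetical sketch of the "draft, then clinician review" pattern described above.
# This is NOT Epic's implementation; the model name, prompt and review step are assumptions.
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def draft_patient_reply(patient_message: str) -> str:
    """Ask the model for a draft reply; the draft is never sent automatically."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Draft a polite, factual reply to a patient portal message. "
                        "Do not give new diagnoses or change any medication."},
            {"role": "user", "content": patient_message},
        ],
    )
    return response.choices[0].message.content


def clinician_review(draft: str) -> str:
    """Placeholder for the mandatory human-in-the-loop step: the clinician
    edits or rejects the draft before anything reaches the patient."""
    print("--- DRAFT FOR REVIEW ---\n" + draft)
    return input("Edit or approve the reply before sending: ")


if __name__ == "__main__":
    draft = draft_patient_reply("Can I take ibuprofen with my current prescription?")
    approved_reply = clinician_review(draft)
    # Only the clinician-approved text would ever be sent back to the patient.
```

The essential design choice is that the model output is only ever a draft: nothing reaches the patient without a clinician signing it off.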

However, we must temper our enthusiasm with caution. These technologies are still in their infancy and need thorough evaluation to confirm their effectiveness and safety, even when they are not designed specifically for a medical purpose.

What does the evidence say?

Recent evaluations of LLMs in clinical oncology have shown significant variability in their performance, raising concerns about their reliability in medical applications. For example, a study published in NEJM AI tested five publicly available LLMs, including GPT-4, on over 2,000 oncology questions. The study found substantial differences in performance among the models, with GPT-4 being the only one to perform above the 50th percentile when benchmarked against human oncologists. Yet even GPT-4 had clinically significant error rates, including overconfidence and inaccuracies.
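The mechanics of such a benchmark are worth making concrete: each model answers the same question set, the answers are scored against a reference key, and accuracy is compared across models (and against the human baseline). The toy harness below illustrates only that basic scoring step; the question format, model names and exact-match scoring rule are assumptions, not the NEJM AI study's protocol.

```python
# Illustrative toy benchmark harness: NOT the NEJM AI study's protocol.
# The question set, model answers and exact-match scoring rule are assumptions.

# Reference answers keyed by question id (hypothetical format).
answer_key = {"q1": "B", "q2": "D", "q3": "A"}

# Each model's submitted answers; in a real benchmark these would come from
# prompting each model with the same question text and parsing its chosen option.
model_answers = {
    "model_a": {"q1": "B", "q2": "C", "q3": "A"},
    "model_b": {"q1": "B", "q2": "D", "q3": "A"},
}


def accuracy(answers: dict, key: dict) -> float:
    """Fraction of questions answered exactly as in the reference key."""
    correct = sum(1 for q, ref in key.items() if answers.get(q) == ref)
    return correct / len(key)


for model, answers in model_answers.items():
    print(f"{model}: {accuracy(answers, answer_key):.0%} correct")
```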

These inconsistencies show that while generative AI can boost efficiency, it also poses risks such as hallucinations and biased outputs, potentially leading to serious misinformation. For instance, allowing note-taking apps to streamline patient information risks granting AI and LLMs the power to decide which information matters, potentially leading to missed diagnoses, overlooked drug interactions and even undocumented life-threatening allergies. Errors of commission (hallucinations) are just as risky as errors of omission.

Academic studies have further underscored the disappointing results of AI in clinical settings, emphasising the need for rigorous empirical tests and standardised evaluations. A study in JAMA Network Open looked at AI-generated draft replies in an EHR system and found that while the drafts were longer and more informative, they also increased the time spent reading messages. Some drafts even posed severe risks when not properly edited by clinicians.

Regulatory landscape for AI in medicine

Navigating the regulatory landscape is a major challenge for AI technologies aiming to be used in medicine. Regulatory routes such as FDA clearance, CE marking and UKCA marking impose stringent requirements on Software as a Medical Device (SaMD) and AI as a Medical Device (AIaMD) to ensure the safety and efficacy of these devices. Each of these routes mandates a detailed process for defining the medical problem the AI technology aims to solve and establishing a solid evidence base. This involves clearly specifying the intended use, determining the risk classification, and outlining the necessary performance and safety requirements.

For instance, the FDA requires detailed documentation of the device's intended use to determine its risk class and the corresponding regulatory controls. Similarly, CE marking in the EU and UKCA marking in the UK require compliance with specific directives and regulations, including the Medical Device Regulation (MDR) and In Vitro Diagnostic Regulation (IVDR) in Europe, which mandate rigorous testing and validation processes. In addition, the EU AI Act will come into full force for AI-driven medical devices within the next couple of years, yet it seems no-one is prepared for this.

Compliance with data protection laws such as GDPR and HIPAA adds another layer of complexity. These laws require stringent data handling and cybersecurity measures to protect patient information, which is challenging given the often black-box nature of AI model training and validation processes. The lack of transparency in how AI models are built complicates quality assurance efforts, making regulatory approval a daunting task.

Moreover, the regulatory process demands continuous post-market monitoring of cleared devices, and any updates to AI models must also be reviewed to ensure they remain effective and safe over time. This includes addressing cybersecurity risks and ensuring ongoing compliance with evolving regulations. While the path to regulatory approval is undeniably complex, it's essential to ensure that AI technologies in healthcare are both safe and reliable for clinical use. We've already laid out the theoretical framework for how an LLM could gain regulatory approval, but to date no-one has achieved this.
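In practice, continuous monitoring often boils down to routinely auditing a sample of the deployed model's outputs and escalating for review when the error rate drifts past a pre-agreed threshold. The sketch below is a minimal, hypothetical illustration of that idea; the 2% threshold, the quarterly audit window and the print-based alert are assumptions, not a prescribed regulatory method.

```python
# Hypothetical post-market monitoring check: not a prescribed regulatory method.
# The threshold, audit window and alert mechanism are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class MonitoringWindow:
    period: str          # e.g. "2024-Q3"
    cases_reviewed: int  # outputs audited by clinicians in this window
    errors_found: int    # outputs judged clinically incorrect

ERROR_RATE_THRESHOLD = 0.02  # assumed acceptance criterion (2%)


def check_window(window: MonitoringWindow) -> None:
    """Compare the audited error rate against the agreed threshold."""
    rate = window.errors_found / window.cases_reviewed
    if rate > ERROR_RATE_THRESHOLD:
        # In a real quality management system this would trigger a formal review, not a print.
        print(f"[ALERT] {window.period}: error rate {rate:.1%} exceeds threshold")
    else:
        print(f"[OK] {window.period}: error rate {rate:.1%} within threshold")


if __name__ == "__main__":
    for w in [MonitoringWindow("2024-Q2", 500, 6),
              MonitoringWindow("2024-Q3", 500, 14)]:
        check_window(w)
```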

What does the future of generative AI and LLMs in medicine look like?

Many argue that regulations 'stifle innovation'. They do not; they are an enabler of safe and effective innovation. Simply put, real improvement in AI architectures is needed to tackle the inaccuracies of generative AI and LLMs, and some experts even believe the hallucination problem is impossible to solve, which, left unchecked, could progressively undermine the sanctity and fidelity of the electronic health record.

This means that even if an LLM designed specifically for healthcare gets to market, constant collaboration between AI developers, regulators and medical professionals is paramount. These tools need to be monitored and quality assured continuously, just as any other medical equipment is. To even attempt to meet healthcare needs, developers must ensure practical and reliable outputs that, most importantly, meet clinical standards. Adapting regulations to keep pace with these technological advancements is challenging but essential for maintaining trust and safety in healthcare applications.

Responsible innovation, at any level, must prioritise patient safety and care quality, involving rigorous testing, transparent reporting and ongoing evaluation. This approach could, perhaps, bring AI and LLMs somewhere near the line of approval, meeting both regulatory and ethical standards.

At Hardian we specialise in the regulations and standards for all software medical devices, including AI-driven clinical tools. From defining Intended Use Statements and developing clear regulatory strategies to building robust Quality Management Systems and guiding you through ISO accreditations and CE / UKCA / FDA certifications, we are committed to leveraging our extensive knowledge of technical and clinical requirements and deep understanding of the regulatory landscape to help your business succeed.

Hardian Health is a clinical digital consultancy focused on leveraging technology into healthcare markets through clinical strategy, scientific validation, regulation, health economics and intellectual property.

By Dr Hugh Harvey, Managing Director
