How to get ChatGPT regulatory approved as a medical device

The advent of ChatGPT and similar large language models (LLMs) has created unprecedented excitement about their application in medicine. Advocates of the technology are imagining a wide range of clinical uses, from clinical note-taking to diagnostic tools and beyond. During the initial surge of hype it is easy to get carried away with futuristic thinking, while pragmatists and sceptics will feel increasingly vindicated as the limitations and risks of such models become more widely acknowledged and addressed. Eventually, however, a path will emerge that brings us closer to successfully applying this ground-breaking technology to medicine, and that's what we're going to explore in this blog.

Is it even possible?

Let's begin with the bad news. ChatGPT (and other models like it) cannot be used safely in medical practice in their current form. They are prone to hallucination and bias, and can produce extremely plausible-sounding misinformation; as such, large language models are far better suited to generating creative rather than factual output. They are also not necessarily compliant with data protection laws such as GDPR and HIPAA, they present a number of cybersecurity risks, and there is little to no information publicly available on how they were built, trained and validated, so no effective quality assurance can be conducted. To rely on them in medicine would likely be illegal in most jurisdictions, violating professional standards, clinician codes of conduct, medical device regulations and patient data protection laws. Even allowing them to be used for general medical search and queries could land their developers in trouble: according to Haupt et al., they could be liable for medical misinformation, and clinical end users who rely on plausible-sounding, non-standard misinformation to make clinical decisions face litigation risks of their own.

You might be thinking: what about Google search? Yes, doctors use the internet, of course, but general search systems are protected under laws such as Section 230 of the 1996 Communications Decency Act, and so carry no regulatory burden when it comes to processing medical queries, since the content they surface is provided by third parties (each with their own liability). ChatGPT and the like are not protected, since they do not disclose the sources of the third-party information ingested into their training, and are self-contained systems which act as far more than "a passive transmitter of information provided by others".

Where to start

So, how could you go about demonstrating compliance for a medical large language model? It certainly won't be easy, and today it is most likely impossible, but we can at least explore what it might look like in the future by learning from current regulatory frameworks and ongoing research into AI safety.

Define the problem

Before even starting to think about approving an LLM for medical use, take a step back and define the problem you are trying to solve. Blindly adopting the newest technology for its own sake does not equate to creating an economically valuable solution. Taking radiology AI from the 2010s as an example, there is still uncertainty about how the use of these computer-vision tools translates into economic value for healthcare providers and systems.

Defining your unmet clinical need and the business case behind it is more important than ever in the turbulent economic climate of 2023. With funding continuing to slow and investors prioritising near-term profitability over the promise of longer-term potential, it's crucial to demonstrate how generative LLMs will solve the problem in an economically viable way. Once you've accomplished this, you're ready to start the product development and regulatory journey.

Intended Use

The starting point for any new technology in medicine is to define its intended use, in order to a) decide whether it is a medical device, in which case it requires regulatory approval in the form of FDA clearance or approval, or a CE or UKCA mark as Software as a Medical Device (SaMD), or more specifically AI as a Medical Device (AIaMD); and b) determine what risk class and special controls apply. The intended use statement informs performance and safety requirements, and defines end users, clinical indications, clinical and operational context and, importantly, reasonably foreseeable misuse. Let's pick an example that has been widely discussed: providing a differential diagnosis to a doctor based on a patient consultation.

As part of the overall intended use for such a device, indications for use may be written as:

“MedGPT is intended for use by qualified medical practitioners in the context of outpatient clinical consultations with patients aged over 18 for any clinical condition that is not immediately life threatening or critical. MedGPT provides a top three differential diagnosis based on clinician-derived prompt inputs, basing its outputs on a curated general medical knowledge database, real-time structured data from the patient record, and consultation transcriptions. MedGPT is not intended to be used in emergency care, paediatrics, obstetrics or psychiatry, and its outputs should assist with clinical decision making only, not drive management or be relied upon for formal diagnosis.”

This statement clearly sets limits on who can use the device, in what context, and for what clinical conditions in a limited adult population. The full intended use document would need to be much more detailed, but this is a good starting point for our theoretical example. You can already see that we have had to significantly limit the potential of the language model to a predefined target use case of clinical decision support for a given clinical population with limited scope and severity of conditions. This example is indeed a medical device, and will require regulatory approval.

Risk Classification


Next, we have to determine the risk classification of this device. Based on the MDCG risk classification guidance, it would likely be Class IIa in the UK and EU, since it is intended to inform clinical decision making. The UK regulator, the MHRA, is clear on this point:

“A device is considered to ‘allow direct diagnosis’ when it provides the diagnosis of the disease or condition by itself, it provides decisive information for making a diagnosis, or claims are made that it can perform as, or support the function of, a clinician in performing diagnostic tasks.”

Indeed, predicate devices such as Ada Assess are CE marked as Class IIa. 

|                                | Critical risk for patient | Serious risk for patient | Non-serious risk for patient |
|--------------------------------|---------------------------|--------------------------|------------------------------|
| Treating or diagnosing         | EU Class III              | EU Class IIb             | EU Class IIa                 |
| Driving clinical management    | EU Class IIb              | EU Class IIa             | EU Class IIa                 |
| Informing clinical management  | EU Class IIa              | EU Class IIa             | EU Class IIa                 |

Note that if further claims are made, such as treating or diagnosing, then the risk classification can easily rise to Class IIb or III, which carry a higher burden of regulatory oversight. In the United States, CADx (computer-aided diagnosis) systems are currently likely to be Class III, so it is important that we temper our intended use to make clear it is for clinical decision support only. Additionally, since this is novel technology, the FDA may require a De Novo submission, which has a longer timeline and greater regulatory scrutiny than a standard 510(k) submission, the route used when a substantially equivalent predicate device is already on the US market.
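
For readers who prefer code to tables, here is a minimal sketch of the mapping above expressed as a simple lookup. The keys and function name are our own; the actual classification decision of course rests with your regulatory team and notified or approved body, not a dictionary.

```python
# Sketch of the Rule 11 style mapping shown in the table above.
# Purely illustrative: not a substitute for a formal classification rationale.

RULE_11_TABLE = {
    # (significance of information, healthcare situation) -> indicative EU class
    ("treat_or_diagnose", "critical"): "III",
    ("treat_or_diagnose", "serious"): "IIb",
    ("treat_or_diagnose", "non-serious"): "IIa",
    ("drive_management", "critical"): "IIb",
    ("drive_management", "serious"): "IIa",
    ("drive_management", "non-serious"): "IIa",
    ("inform_management", "critical"): "IIa",
    ("inform_management", "serious"): "IIa",
    ("inform_management", "non-serious"): "IIa",
}

def eu_risk_class(significance: str, situation: str) -> str:
    """Return the indicative EU MDR class for a software device."""
    return RULE_11_TABLE[(significance, situation)]

# MedGPT as scoped above: informs clinical management for non-critical adult outpatients.
print(eu_risk_class("inform_management", "non-serious"))  # -> IIa
```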

Defining our requirements

Next, we'll need to define the requirements for our medical large language model: the parameters within which we want it to operate, the claimed benefits, and the performance benchmarks and safety requirements. In medical device regulation there are essential requirements, and then product-specific requirements. The essential requirements (also known as general safety and performance requirements) include eliminating risk to the maximum extent possible for the lifetime of the device, appropriate design of programmable electronic systems, and general requirements for labelling and product information (think instruction manuals and user information). Performance requirements depend on the claims being made, so we would need to demonstrate a valid clinical association, i.e. that a medical LLM system can provide a factually correct diagnostic differential (which is currently hard to prove). We also need to decide on appropriate, clinically measurable metrics, often based on the results of a systematic literature review conducted specifically to establish the current state of the art, as well as technological benchmarks; our literature review should be reported to PRISMA standards. All of this feeds into a Benefit Risk Analysis, which in turn feeds back into your requirements stack, alongside a Product Requirements document and Risk Register managed within an ISO 13485 certified Quality Management System. We'll also need to consider operating environment requirements, systems connectivity, architecture and data security. There's a lot to do!
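
To make the idea of a traceable requirements stack concrete, here is a purely illustrative sketch of how a single product requirement might be linked to the claim it supports, the risks it controls and the evidence that verifies it. The field names and identifiers are our own invention, not a prescribed ISO 13485 or ISO 14971 format.

```python
from dataclasses import dataclass, field

# Hypothetical traceability record: one requirement, linked to risk register
# entries and verification evidence. Illustrative only.

@dataclass
class Requirement:
    req_id: str
    statement: str
    claim: str                                      # the benefit/claim it supports
    risk_ids: list = field(default_factory=list)    # linked risk register entries
    acceptance_criteria: str = ""
    verification_evidence: str = ""                 # test report, study, inspection record

diff_dx_req = Requirement(
    req_id="REQ-014",
    statement="Top-3 differential shall contain the reference diagnosis in >= X% of cases",
    claim="Non-inferior diagnostic support vs. clinician-only differential",
    risk_ids=["RISK-021 (plausible but wrong differential)", "RISK-034 (automation bias)"],
    acceptance_criteria="Lower bound of 95% CI above the predefined benchmark",
    verification_evidence="Clinical investigation report CIR-001 (to ISO 14155)",
)
```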

A theoretical system

Let's look at architecture next, since we can't test something until we actually have something to test. Research in this space is moving fast, with many groups working on improving the reliability and safety of large language model outputs. Taking a cue from Microsoft's work on building a fact-checking system incorporating an LLM, external knowledge and automated feedback, a sensible starting point might look something like this:

Here we have a third-party LLM connected by API, which receives prompts guided by a custom prompt engine linked to a data retriever with access to a curated medical knowledge base. Some form of rules-based module controls when the system prompts, receives feedback, stores or reads memory, or retrieves data. The system is designed to automate feedback so that LLM responses can be fact-checked against the medical knowledge base and sent back to be refined if they are not deemed factual when compared with the knowledge extracted from it. The memory module ensures that all data in the flow is stored and informs the system, so that response quality improves rather than deteriorates. The rules-based module, with a series of pre-programmed IF>THEN rules, could disallow queries relating to our exclusions (e.g. obstetrics, critical care), as well as acting as a guard rail to ensure LLM responses are returned in a specific format.
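
To make the flow concrete, here is a heavily simplified, hypothetical sketch of the control loop described above. The LLM call is stubbed out (we are not reproducing any vendor's actual API), the retrieval and fact-checking logic is deliberately naive, and the memory module is omitted for brevity.

```python
# Simplified control loop: rules-based screening, retrieval from a curated
# knowledge base, a stubbed LLM call, and a fact-check before release or retry.

EXCLUDED_TOPICS = {"obstetrics", "paediatrics", "psychiatry", "critical care"}

def rules_check(prompt: str) -> bool:
    """Pre-programmed IF/THEN guard rail: reject out-of-scope queries."""
    return not any(topic in prompt.lower() for topic in EXCLUDED_TOPICS)

def retrieve(prompt: str, knowledge_base: dict) -> list[str]:
    """Naive retriever: return knowledge-base snippets whose key appears in the prompt."""
    words = set(prompt.lower().split())
    return [text for key, text in knowledge_base.items() if key in words]

def call_llm(prompt: str, context: list[str]) -> str:
    """Placeholder for the third-party LLM API call (SOUP, in regulatory terms)."""
    return "1. Migraine  2. Tension headache  3. Sinusitis"   # canned response

def fact_check(answer: str, context: list[str]) -> bool:
    """Toy check: every listed diagnosis must appear somewhere in the retrieved knowledge."""
    evidence = " ".join(context).lower()
    diagnoses = [d.strip().lower() for d in answer.split("  ") if d.strip()]
    return all(any(word in evidence for word in d.split()[1:]) for d in diagnoses)

def medgpt(prompt: str, knowledge_base: dict, max_retries: int = 2) -> str:
    if not rules_check(prompt):
        return "Query out of intended use - refer to clinical judgement."
    context = retrieve(prompt, knowledge_base)
    for _ in range(max_retries + 1):
        answer = call_llm(prompt, context)
        if fact_check(answer, context):
            return answer          # released to the clinician
    return "No verifiable differential produced - refer to clinical judgement."

kb = {"headache": "Common causes of headache include migraine, tension headache and sinusitis."}
print(medgpt("adult patient with recurrent unilateral headache and photophobia", kb))
```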

Let's assume this setup works in a test environment (a feasibility study) and produces more verifiably factually correct outputs than current raw LLM outputs. The entire system architecture will have to be designed and documented within an appropriate software development lifecycle to IEC 62304 and IEC 82304-1 standards, and verified as cybersecure to at least ISO 27001, while also being GDPR and HIPAA compliant. That's a lot of technical documentation to go into your Medical Device File.

3 major hurdles

Our challenge will now be to verify and validate each component of our system, and to demonstrate that it works in the real world. Software verification and validation may be relatively straightforward under current regulatory frameworks for most of our system modules, except for the medical knowledge base and, of course, the LLM itself.

Validating a curated medical knowledge base 

The knowledge base would by necessity need to be curated and validated, and this is where the first major hurdle lies. Even if we could ingest the entirety of the medical literature, not all medical information is up to date, accurate or relevant to all locations. Papers can be biased, results can become outdated, and guidelines can be broken in a myriad of ways. Disease prevalence, population demographics and best-practice guidance all differ across the world, so curating a database that is fit for purpose within our intended use will be extremely challenging. Additionally, access to all this information will be expensive if we aim to be comprehensive, and not all of it is machine-readable. However, let's assume it is somehow possible to curate, vet and validate such a large database, and move on to consider the third-party LLM.
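
As a small illustration of what "curated and validated" implies in practice, here is a hypothetical schema for a single knowledge-base entry. The point is the provenance metadata (source, date, region, reviewer); every field name here is an assumption of ours rather than any standard.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative schema for one curated knowledge-base entry. Without provenance
# metadata attached to every item, the knowledge base cannot be meaningfully
# validated or kept up to date.

@dataclass
class KnowledgeEntry:
    entry_id: str
    text: str                 # the clinical statement itself
    source: str               # citation or guideline reference
    publication_date: date
    applicable_regions: list  # where this guidance is considered valid
    last_reviewed: date
    reviewer: str             # accountable clinical reviewer
    review_outcome: str       # e.g. "accepted", "superseded", "flagged"

entry = KnowledgeEntry(
    entry_id="KB-000123",
    text="First-line management of uncomplicated migraine in adults is ...",
    source="Hypothetical national headache guideline, v4",
    publication_date=date(2021, 6, 1),
    applicable_regions=["UK"],
    last_reviewed=date(2023, 1, 15),
    reviewer="Dr A. Example",
    review_outcome="accepted",
)
```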

Software of Unknown Provenance

Currently, access to LLMs such as ChatGPT is by API only. The developers have not made public any documentation on how they were built, trained or maintained, as these details remain trade secrets. This is our second major hurdle: such a model is SOUP, or Software of Unknown Provenance. If we cannot verify or validate a piece of third-party software according to IEC 62304, we cannot claim to have mitigated all risks, since it could, for instance, be changed without warning (the current status quo will not stay still for long!) or be withdrawn from the market, leaving our system unable to function. The UK regulators are clear on this position, but that doesn't mean all hope is lost. As LLM technology becomes more accessible, developers will start building their own versions, and in due course someone may produce one with the required documentation (perhaps something like Med-PaLM). Until then, we won't see a regulatory approved system that uses an LLM at its core.
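
One partial risk control worth illustrating is pinning the exact third-party model version that was verified and validated, and failing safe if it ever changes. Here is a hypothetical sketch; the version string and the way it is obtained are assumptions, and real vendor APIs differ.

```python
# Hypothetical SOUP risk control: the deployed model must match the version
# recorded in the Medical Device File, otherwise the system refuses to run.

VALIDATED_MODEL_VERSION = "vendor-llm-2023-03-01"   # as verified and validated

class SoupVersionError(RuntimeError):
    """Raised when the deployed LLM no longer matches the validated version."""

def check_soup_version(reported_version: str) -> None:
    if reported_version != VALIDATED_MODEL_VERSION:
        # Fail safe: never silently run on an unvalidated model.
        raise SoupVersionError(
            f"Deployed model '{reported_version}' does not match validated "
            f"version '{VALIDATED_MODEL_VERSION}'. Halting clinical use."
        )

check_soup_version("vendor-llm-2023-03-01")   # passes silently
# check_soup_version("vendor-llm-2023-06-01") # would raise SoupVersionError
```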

Analytical and Clinical Validation

Probably the largest hurdle will be demonstrating the clinical evidence required to prove that the system is safe and effective for all cases within our intended use. Assuming we set a benchmark of non-inferiority to clinicians' diagnostic differential performance, we would need to run a clinical investigation to the ISO 14155 standard, with ethical approval, appropriately powered to achieve statistically meaningful performance metrics, with enough room left over to analyse the almost infinite sub-stratifications of cases, all reported to the STARD-AI criteria. We're talking about the mother of all clinical investigations.
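
To give a feel for the scale, here is a back-of-the-envelope, illustrative non-inferiority sample-size calculation for a proportion such as "the reference diagnosis appears in the top-three differential". All of the numbers are assumptions for illustration, not benchmarks from any real study; a trial statistician would do this properly.

```python
from math import ceil
from statistics import NormalDist

# Approximate per-arm sample size for a one-sided non-inferiority comparison of
# two proportions. Illustrative only.

def non_inferiority_n_per_arm(p_control: float, p_test: float,
                              margin: float, alpha: float = 0.025,
                              power: float = 0.80) -> int:
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha)
    z_b = z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    effect = p_test - p_control + margin          # distance from the non-inferiority boundary
    return ceil((z_a + z_b) ** 2 * variance / effect ** 2)

# Assume clinicians hit the reference diagnosis in their top three about 80% of
# the time, the system performs the same, and we accept a 5-point margin:
print(non_inferiority_n_per_arm(p_control=0.80, p_test=0.80, margin=0.05))
# -> roughly 1005 cases per arm, before any subgroup analyses inflate the number.
```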

Ground-truthing the sheer number of potential cases for the clinical investigation will be a challenge in itself, likely requiring independent, panel-approved vetting of clinical input scenarios and data, expert group opinion on acceptable outputs, and a robust system for fact-checking, ranking or rating the final system outputs. Checks would need to be in place for ‘red flag’ cases, and, to make it even more difficult, our system ideally should not be changed or tweaked for the duration of the investigation.

If we wanted to claim that our system actually helps clinicians reduce the time spent making decisions, we would also need to run an investigation to prove it, ethically approved and appropriately registered, comparing current clinical practice without AI to a new pathway with AI, and measuring the differences. We should follow best-practice guidance such as SPIRIT-AI and TRIPOD-AI, depending on our claims and intentions. This would not necessarily need to be a randomised controlled trial, but could require investigating two matched groups of clinicians with matched cases across two powered cohorts (again, no mean feat!).

Technically, none of this is impossible, but it will require a significant amount of time and expertise to pull off. Ultimately, all of our clinical evidence, from the literature review and feasibility studies to the clinical evaluation plan (CEP), clinical investigation plans (CIPs) and reports (CIRs), will need to be compiled into a regulatory compliant Clinical Evaluation Report (CER). Of course, regulatory approval will depend entirely on our investigations actually showing positive results, so fingers crossed it all works as planned.

Putting it all together

Let's assume that we have managed to overcome all of the above hurdles, and have ended up with a fully documented software development lifecycle, a Clinical Evaluation Report and a compiled Medical Device File. The process, simplified to its core components, looks like this:

Ongoing monitoring

You'll note we haven't covered everything in the above diagram for the sake of brevity, but one essential component is post-market follow-up. It's important to acknowledge that once our device is 'on market' (i.e. being made available for use as per its intended use), there is a legal requirement for us to monitor its performance and safety for as long as it remains on the market. This is done both proactively and reactively through two components known as Post Market Surveillance and Post Market Clinical Follow Up. At the simplest level, we will need to predefine our ongoing surveillance, including the handling of all complaints and feedback, as well as declaring our methodology for ongoing clinical assurance, which could be further clinical investigations and powered studies (to demonstrate performance across subgroups where we haven't fully been able to demonstrate safety pre-market) or ongoing sampled audits. Do not underestimate the magnitude of this challenge: imagine having to audit potentially billions of input/output pairs forever and act accordingly on every error and adverse event. The results of post-market surveillance must be incorporated back into our Clinical Evaluation Report and reported annually to the regulators. We'll also need to update our literature review annually to check for any studies on our device, and to assess performance and safety issues of other similar devices. If we don't maintain these processes, our device could be removed from the market.
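
As a toy illustration of what proactive surveillance might look like at the data level, here is a hypothetical sketch of sampling logged prompt/response pairs for clinical review, while pulling out anything already flagged as a complaint or adverse event. The log format and field names are invented.

```python
import random

# Toy post-market surveillance activity: draw a reproducible random sample of
# logged interactions for independent clinical review, and collect everything
# already flagged as a complaint or potential adverse event for full review.

def sample_for_audit(interaction_log: list[dict], sample_size: int, seed: int = 0):
    rng = random.Random(seed)                      # reproducible audit sample
    sample = rng.sample(interaction_log, min(sample_size, len(interaction_log)))
    flagged = [i for i in interaction_log if i.get("complaint") or i.get("adverse_event")]
    return sample, flagged

log = [
    {"id": 1, "prompt": "...", "response": "...", "complaint": False, "adverse_event": False},
    {"id": 2, "prompt": "...", "response": "...", "complaint": True,  "adverse_event": False},
    {"id": 3, "prompt": "...", "response": "...", "complaint": False, "adverse_event": False},
]
sampled, flagged = sample_for_audit(log, sample_size=2)
print([i["id"] for i in sampled], [i["id"] for i in flagged])
```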

Getting regulatory approval

Now the fun begins. Hopefully, way back at the start, we engaged with a regulatory body that is going to audit our work and certify us. This could be the FDA directly, a UK Approved Body, an EU Notified Body, or any number of country-specific competent authorities, depending on the geographies in which we want to deploy our system into clinical care. Each audit will come with fees (think tens of thousands) plus a long delay (think months to years) before it can actually be performed. Assuming we pass the audit, we receive our market authorisation, and we are almost there. We just need to appoint an appropriately qualified Person Responsible for Regulatory Compliance, register our device, produce appropriate labelling and instructions for use, and then start selling it! If we want our system to be used in multiple countries, we might also consider upgrading our Quality Management System to meet MDSAP requirements, giving us entry into multiple markets. A note of caution here, though: we will need to prove it works in multiple languages, and we will additionally need to re-run our clinical investigations in each of our target countries to demonstrate that it works on different populations, with different disease prevalences, clinical guidelines and benchmark performances.

Updating our system

One tiny detail we haven't yet mentioned is that current regulatory frameworks do not allow for continuous updating of software and AI-based medical devices. That's going to be a problem… not only will our LLM be changing regularly, but our system feeds back on itself to improve factual accuracy. None of that is straightforward under current regulatory frameworks, but hope is on the horizon. The FDA, Health Canada and the UK MHRA are all working on Predetermined Change Control Plans (PCCPs) to allow safe, quality-assured, ongoing updates of software devices, as long as they remain within the confines of their regulatory-cleared intended use. We don't yet know when these frameworks will be finalised, but at the time of writing there is a public FDA consultation on the subject. In essence, developers will be able to submit plans to update software over time, but they must stick to those plans and not deviate from them or change their intended use, otherwise a new regulatory submission may be required.
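
As a thought experiment only, a PCCP could be imagined as an "allowed change envelope" that every proposed update is checked against before deployment. The fields and limits below are invented for illustration; a real PCCP is a regulatory document, not code.

```python
# Hypothetical "change envelope" derived from a Predetermined Change Control Plan.
# An update is only deployed if it stays within the pre-agreed bounds and does
# not touch the intended use.

PCCP_ENVELOPE = {
    "allowed_change_types": {"knowledge_base_refresh", "prompt_template_tuning"},
    "min_top3_accuracy": 0.80,        # performance must not drop below the validated level
    "intended_use_hash": "a1b2c3",    # intended use statement must be unchanged
}

def update_permitted(change: dict) -> bool:
    """Return True only if the proposed change stays inside the PCCP envelope."""
    return (
        change["change_type"] in PCCP_ENVELOPE["allowed_change_types"]
        and change["verified_top3_accuracy"] >= PCCP_ENVELOPE["min_top3_accuracy"]
        and change["intended_use_hash"] == PCCP_ENVELOPE["intended_use_hash"]
    )

proposed = {
    "change_type": "knowledge_base_refresh",
    "verified_top3_accuracy": 0.83,
    "intended_use_hash": "a1b2c3",
}
print(update_permitted(proposed))   # True - within the plan
```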

Other considerations


We haven't covered many other important aspects of the regulatory journey in great detail here, as our hope is simply to give the interested reader a sense of the general procedures and frameworks involved in AI as a Medical Device. We should, however, make note of the following topics, some of which are still in flux and subject to change:

Conclusion

So there we have it: a roadmap for how to get a medical large language model-based system regulatory cleared to produce a differential diagnosis. It won't be easy, or for the faint-hearted, and it will take millions in capital and several years to build, test and validate appropriately, but it is certainly not outside the realms of future possibility.

To put it all in context, we vividly remember deep learning first exploding around 2012; it took approximately five years before the first regulatory approved AI-driven device came on the market (one of Dr Harvey's, a CE marked Class IIa decision support system for breast mammography). Now there are over 500 AI-enabled devices with regulatory approval!

There is one big BUT in all this that we feel compelled to mention. Given the lengthy time to build, test, validate and gain regulatory approval, it is entirely possible that LLM technology will have moved on significantly by then, if the current pace of innovation is anything to go by. This ultimately raises the question: is it even worth it, if we are at risk of developing a redundant technology? Indeed, is providing a differential diagnosis to a clinician who will already have a good idea (and who has multiple other free resources available) even a good business case?

In reality, these risks are simply a fact of life for all medical devices, as innovation always moves forward, and they need to be weighed against the potential benefits of improving patient care in the near term. We are excited to see where this goes, and of course our team at Hardian is ready, willing and able to help anyone who dares go on this adventure into the unknown. Consider yourself warned. You know where to find us.


Hardian Health is a clinical digital consultancy focused on leveraging technology into healthcare markets through clinical evidence, market strategy, scientific validation, regulation, health economics and intellectual property.


by Dr Hugh Harvey and Mike Pogose
