Are LLM-based ambient scribes and clinical summarisers medical devices?
Since the global release of publicly available Large Language Models (LLMs), many companies have been building on their broad general functionality in the medical domain. Current hot topics in clinical AI include clinical summarisers and ambient scribes. Poised to revolutionise the practice of clinical documentation, these technologies hold significant potential benefits, but also risks. This raises the question we aim to address in this article: do they count as medical devices, and should they be regulated as such?
Regulations always lag behind the technological cutting edge. In the grey zone between what is known and what is not lie the unanswered questions around novel technologies, which can stretch the very definitions and assumptions that existing legislation is based upon. LLMs are the current AI technology du jour, and they are certainly creating headaches for regulators, potential purchasers and developers in terms of where they may sit in the current regulatory frameworks. Opinions are divided - with some suggesting they are low risk and just general purpose "utility software" (like MS Word, for instance), while others are convinced that all AI in healthcare should be strictly regulated.
In this article, we will attempt to answer three questions:
What are the purposes and risks of the clinical documentation processes that are being automated?
Why could such automation by LLMs - at least in the UK, EU and USA jurisdictions - present as medical devices?
How could medical device certification be approached for LLMs?
What are the purposes and risks of the clinical documentation processes that are being automated?
Various companies have come to market offering a variety of clinical documentation automations using LLMs. Without being exhaustive, the broad categories include:
Ambient scribes - software that listens to a clinical consultation and produces a draft structured summary (or SOAP note) of what was discussed between a patient and their clinician or nurse.
Discharge summarisation - software that searches and filters the electronic health record and provides a draft discharge summary covering the patient's hospital journey and recommended further follow-up.
Radiology report summarisation - software that transcribes a radiologistâs dictated report and then provides a draft report summary of the findings.
Of note, all of these offerings provide draft summaries, with the intention that a qualified professional then reviews and/or edits the output before signing off. The intended benefits are widely advertised, and include reduced turn-around time, increased productivity, better accuracy and reduced burn-out.
Let's take a look at the relevant medical purposes and risks for each of these different clinical documents, tabulated as follows:
| Clinical Document | Medical Purpose | Known Risks |
|---|---|---|
| Clinical Note | The Subjective, Objective, Assessment and Plan (SOAP) note is an acronym representing a widely used method of documentation for healthcare providers. The SOAP note is a way for healthcare workers to document in a structured and organised way. - Association of Healthcare Journalists | An unintended consequence of electronic documentation is the ability to incorporate large volumes of data easily. These data-filled notes risk burdening a busy clinician if the data are not useful. As importantly, the patient may be harmed if the information is inaccurate. - Podder et al 2023 |
| Discharge Summary | A discharge summary is a handover document that explains to any other healthcare professional why the patient was admitted, what has happened to them in hospital, and all the information that they need to pick up the care of that patient quickly and effectively. - British Medical Journal 2015 | High quality discharge communication is critical to patient safety. This is particularly the case for patients who are not able to advocate for themselves or who have complex clinical problems that need to be monitored closely. An important part of discharge communication is the timely handover of diagnostic tests ordered or to be ordered including results received and those requiring follow-up. Breakdown in this aspect of communication is common and contributes to unsafe patient care by increasing the risk of missed or delayed diagnosis which may lead to patient dissatisfaction and sub-optimal patient outcomes with potential medico-legal implications. - NHS England 2016 |
| Radiology Report | The purpose of an imaging report is to provide an accurate interpretation of images in a format that will prompt appropriate care for the patient. Imaging reports should relate the findings, both anticipated and unexpected, to the patient's current clinical symptoms and signs and to the results of other investigative tests and procedures. When appropriate, the imaging study report should incorporate advice to the referring clinician on further investigation, management or referral to another specialist team. - RCR 2018 | Problems will remain with false-negative and false-positive identification by AI software, which will require validation by a human reporter. Much needs to be done to define the appropriate use of AI in the reporting of imaging investigations, setting standards for AI interoperability, testing AI algorithms, as well as addressing regulatory, legal and ethical issues. - RCR 2018 |
So, we can see that all three clinical documentation processes have a medical purpose (as opposed to a general purpose or wellness one), and share common clinical benefits and risks. The benefits are that relevant information is summarised quickly, normally for handover from one clinician to another. The risks are to patient safety: delays in diagnosis and/or treatment caused by delays in information transfer, or incorrect diagnosis or management caused by inaccuracies in the information transferred. Of course, humans are not perfect, and there are often errors and mistakes in human-produced summaries too - for which we have existing professional standards, mechanisms of feedback, and methods for reconciliation, mitigation and correction where appropriate, which at a minimum should also apply to LLMs.
Why could LLM automation present as medical devices?
Let's start off by agreeing that what is being automated by LLMs is currently part of a regulated human activity - the practice of medicine. Medical practice is heavily regulated as a whole, and within that sphere each medical speciality has its own regulatory and professional standards. Doctors, nurses and allied healthcare professionals must all be accredited and maintain professional standards in their work, or risk losing their jobs. Since part of their job is to produce clinical documentation, it stands to reason that this activity is part of their professional duties. Indeed, a non-accredited person is not allowed to produce clinical documentation, discharge summaries or radiology reports! A doctor who constantly produced inaccurate summaries would harm patients and eventually be disciplined, re-trained or struck off.
Therefore it does not stretch logic too far to suggest that AI that automates that same part of a regulated professional's work (which, as we have established, comes with a medical purpose and with risks) should also be regulated. However, software is not human, and is not regulated as humans are - the closest equivalents are the specific regulations governing software, either as "general health software" or "medical device software", both of which require a degree of regulatory oversight.
For the UK, the Medicines and Healthcare products Regulatory Agency (MHRA) has produced Guidance on medical device stand-alone software including apps (including IVDMDs) which includes the following text on decision support:
Decision support software is usually considered a medical device when it applies automated reasoning such as a simple calculation, an algorithm or a more complex series of calculations. For example, dose calculations, symptom tracking, clinicians guides to help when making decisions in healthcare. This is likely to fall within the scope of the UK MDR 2002. Some decision support software may not be considered to be a medical device if it exists only to provide reference information to enable a healthcare professional to make a clinical decision, as they ultimately rely on their own knowledge. However, if the software/ app performs a calculation or interprets or interpolates data and the healthcare professional does not review the raw data, then this software may be considered a medical device. Apps are increasingly being used by clinicians who will rely on the outputs from this software and may not review the source/raw data.
The key point here is that an LLM-based summariser is in fact interpreting/interpolating data. Its output is not a human-reasoned textual summarisation of the raw data, nor a direct transcript of a clinical event - it is a combination of input text, prompt engineering by the developer and a probabilistic prediction of the most likely next words or phrases. By "choosing" what to include and what to omit in the summary, some form of black-box probabilistic process is occurring that is far beyond a simple calculation. This is deemed by the MHRA to be "high" functionality, as laid out in their recent guidance on Digital Mental Health Technologies, which gives the following as an example of high functionality: "Generative AI converts speech to text transcript and then provides summary which is inputted into patient's electronic health record to help with decisions related to care".
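To make the "probabilistic prediction" point concrete, here is a deliberately tiny sketch - not a real LLM, and the candidate phrases and probabilities are invented for illustration - of temperature-style sampling, where identical input can yield different draft wording:

```python
import random

# Toy next-phrase distribution for a summary beginning "Patient reports...".
# Invented for illustration; a real LLM samples from a vocabulary of
# tens of thousands of tokens at every step.
candidates = ["chest pain", "mild dyspnoea", "intermittent headache"]
weights = [0.6, 0.3, 0.1]  # model-assigned probabilities

def sample_next_phrase(rng: random.Random) -> str:
    # Temperature > 0 decoding: choose a continuation at random,
    # weighted by the model's probabilities - not a fixed calculation.
    return rng.choices(candidates, weights=weights, k=1)[0]

# Re-running the "same" summarisation with different random states
# produces different drafts for identical input.
outputs = {sample_next_phrase(random.Random(seed)) for seed in range(20)}
print(outputs)  # more than one distinct continuation
```

The output is neither a transcript nor a deterministic transformation of the input - which is precisely why it goes beyond the "simple calculation" threshold in the MHRA guidance.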
Additionally, the intention of the developer is that the healthcare professional receiving the summary does not review the entire raw data, i.e. the entire EHR or the recording of the patient conversation. They will of course have to review the output, edit it, and sign off on it, but that does not exempt software from being a medical device. Plenty of existing AI tools already provide outputs for human review and sign-off, as seen in the explosion of radiology and pathology tools, and they are all medical devices. Indeed, as the MHRA point out in their guidance, a "healthcare professional may not have time to verify all summaries", which of course introduces risk.
Of note, if LLM summarisers were to go further than pure summarisation and offer, for example, differential diagnoses or treatment plans/recommendations, then they would certainly qualify as medical devices, since the intended purpose would be to provide information to support the diagnosis, treatment, monitoring, alleviation, prediction or prognosis of a disease - which is the very definition of a medical device.
The MHRA guidance also cites a European Commission guidance document MEDDEV 2.1/6 Guidance document Medical Devices - Scope, field of application, definition - Qualification and Classification of stand alone software which makes the following points:
If the software does not perform an action on data, or performs an action limited to storage, archival, communication, "simple search" or lossless compression (i.e. using a compression procedure that allows the exact reconstruction of the original data) it is not a medical device. Altering the representation of data for embellishment purposes does not make the software a medical device. In other cases, including where the software alters the representation of data for a medical purpose, it could be a medical device.
"Simple search" refers to the retrieval of records by matching record metadata against record search criteria, e.g. library functions. Simple search does not include software which provides interpretative search results, e.g. to identify medical findings in health records or on medical images.
Software which is intended to create or modify medical information might be qualified as a medical device. If such alterations are made to facilitate the perceptual and/or interpretative tasks performed by the healthcare professionals when reviewing medical information (e.g. when searching the image for findings that support a clinical hypothesis as to the diagnosis or evolution of therapy) the software could be a medical device.
Since "software which is intended to create or modify medical information might be qualified as a medical device", we can start to see that LLM-based summarisation software could be regulated as medical devices. It certainly isn't performing "simple search". Even more convincingly, LLMs perform a type of lossy, not lossless, compression of medical information - they compress (summarise) from long form to short form, which is irreversible back to the long form.
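The lossless/lossy distinction is easy to demonstrate. In this minimal sketch (the note and summary text are invented), lossless compression round-trips exactly, whereas a summary discards information that cannot be recovered from the short form:

```python
import zlib

note = ("Patient admitted with community-acquired pneumonia. "
        "Treated with IV antibiotics, switched to oral on day 3. "
        "Discharged on a 5-day course; GP to arrange repeat chest "
        "X-ray in 6 weeks to confirm resolution.")

# Lossless compression (MEDDEV's non-device carve-out): the original
# text is exactly reconstructable, so no information is lost or created.
compressed = zlib.compress(note.encode())
assert zlib.decompress(compressed).decode() == note

# A summary is lossy: the follow-up instruction has been dropped and
# cannot be reconstructed from the short form alone.
summary = "Admitted with pneumonia, treated with antibiotics, discharged."
assert "X-ray" not in summary and len(summary) < len(note)
```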
Turning to the EU's Borderline Manual, an example concerning a different type of software device (under "Classification of software for information management and patient monitoring") reinforces the point shown above from MEDDEV 2.1/6:
[MEDDEV 2.1/6] is applicable for the listed functions performed by the patient monitoring platform with the exception of the alarm filtering function. The software is not considered to be "generating" an alarm as it is the bedside device that generates the alarm based on its analysis of patient physiological data. The bedside device also assigns a severity to that alarm. While this software does not generate the original patient alarm, it applies user-defined filtering rules to each alarm category (e.g. severity) received by the software. This filter function is considered to be performing a search of the nearly "live" data received from the bedside device that results in a specific action being taken on that alarm i.e. alarm is delayed. The action of the filter function is not considered a "simple search" of archived data. The delay to the alarm that results from the filter function is considered to lead to the generation of new or additional information that contributes to the monitoring and follow-up of the patient connected to the bedside device. If an alarm is noted on the system the users are instructed to interact with the bedside device, which would be considered to be influencing the use of the bedside device. Therefore this software, having one of its functions qualified as a medical device, complies with the definition of a medical device and should be qualified as such. When classifying this device implementing rule 2.3 of Annex IX applies.
We can therefore start to understand the regulators' thinking here. If the software applies "filtering rules", it is a medical device. Since LLMs apply computer-derived inference to medical information (including outputs from medical devices found in the EHR or discussed in a clinical consultation), adjusted and guided by rules in the form of prompts to the LLM, they could be considered medical devices. Additionally, there is generation of new information - an LLM is a form of generative AI, which is surely generating new information (i.e. a new summary that did not exist before) by restructuring and summarising the information in the prompts and inputs it is given.
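In an LLM summariser, the analogue of those "user-defined filtering rules" is the developer's prompt, which travels with every request and steers what the model keeps and omits. A hypothetical sketch - the rules and field names are illustrative, not taken from any real product:

```python
# Hypothetical system prompt; the rules below are illustrative only.
SYSTEM_PROMPT = """You are a clinical consultation summariser.
Rules:
- Include: presenting complaint, examination findings, assessment, plan.
- Exclude: small talk and any third-party identifiers.
- Flag any medication name or dose you are not certain of.
"""

def build_request(transcript: str) -> dict:
    # The developer-defined rules are applied to every consultation,
    # analogous to the filtering rules in the Borderline Manual example.
    return {"system": SYSTEM_PROMPT, "user": transcript}

request = build_request("Doctor: What brings you in today? ...")
```

These prompt rules are written by the manufacturer, not the clinician, which is exactly why the "developer-defined processing of medical information" framing applies.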
For the EU, the guidance document MDCG 2019-11 associated with the EU Medical Device Regulation (MDR), reflects what the predecessor guidance MEDDEV 2.1/6 states, but adds:
Software which is intended to process, analyse, create or modify medical information may be qualified as a medical device software if the creation or modification of that information is governed by a medical intended purpose.
It is important to note that "intended purpose" is a term defined in the MDR (any device intended to be used for monitoring, diagnosis, treatment, prognosis or prevention). Clearly a clinical summary or report has a medical purpose - to transfer clinical information from one clinician to another, or to a patient, for the purpose of communicating information about a diagnosis, treatment, prognosis etc. Going back to the point on how humans are regulated (even NHS medical secretaries must have RSA level 2 and AMSPAR or equivalent qualifications) - the fact that only accredited professionals can produce this type of documentation clearly implies it is indeed for a medical purpose.
For the USA, we have the Clinical Decision Support Software Guidance for Industry and Food and Drug Administration Staff Document which sets out four criteria to define non-device Clinical Decision Support systems. All four must be met for the software not to be a medical device.
| Criterion number | Definition | Met? | Reason |
|---|---|---|---|
| 1 | The software is not intended to acquire, process, or analyse a medical image or a signal from an in vitro diagnostic device or a pattern or signal from a signal acquisition system. | No | Ambient scribes / LLMs can acquire, process or analyse data from a medical image, IVD or other medical device, since such data can either be discussed during a clinical consultation from which a summary is derived, or recorded in the EHR from which the LLM summary is derived.* |
| 2 | The software is intended for the purpose of displaying, analysing, or printing medical information about a patient or other medical information (such as peer-reviewed clinical studies and clinical practice guidelines). | Yes | LLM summarisers are intended to display or print medical information about a patient. |
| 3 | The software is intended for the purpose of supporting or providing recommendations to a health care professional about prevention, diagnosis, or treatment of a disease or condition. | No | LLM summarisers are not providing recommendations; they are just summarising information.* |
| 4 | The software is intended for the purpose of enabling an HCP [healthcare professional] to independently review the basis for the recommendations that such software presents so that it is not the intent that the HCP rely primarily on any of such recommendations to make a clinical diagnosis or treatment decision regarding an individual patient. | No | While many LLM summarisers state that the draft summary must be signed off by an appropriately trained professional, the intent is not for that professional to review all of the relevant information each and every time, as this wouldn't reduce workload in the way summariser tools claim.* |
*Our interactions with the FDA during 513(g) classification requests confirm their thinking - LLM-based summarisers do not meet criteria 1, 3 and 4 of the CDS guidance.
Therefore, since LLM summarisers only meet one out of the four criteria, they cannot be classified as non-device clinical decision support (CDS) systems in the USA, and must be considered as medical devices.
Finally, the International Medical Device Regulators Forum (IMDRF), consisting of participants from Australia, Brazil, Canada, the EU, Japan, Singapore, South Korea, Switzerland, the UK, the USA, and the World Health Organisation (WHO), has published standardised terminology for reporting adverse events caused by medical devices. Here is a small selection of the health impact codes from the IMDRF terminologies for categorised Adverse Event Reporting (AER): terms, terminology structure and codes, published in March 2020:
F04 Patient diagnosis was clinically significantly delayed as a consequence of device performance.
F05 Patient treatment was delayed as a consequence of device performance.
F06 Situation in which the use of the device impedes or affects a subsequent medical procedure or use of a medicine or device. The time elapsed between the use of the device and the medical procedure is not a factor. It is not necessary for the device to have broken or malfunctioned.
F07 Use of the device has led to worsening of the existing disease or condition.
So in this selection of the IMDRF codes we can see the potential impacts of errors in the application of LLM summarisers, and how these are anticipated in the coding of adverse events related to the use of medical devices. "Device performance" for an LLM can be taken to mean its accuracy and precision, i.e. the absence of the well-known effects of LLM hallucination and confabulation.
How could medical device certification be approached for LLMs?
Since there are currently no direct references to, or carve-outs for, LLM summarisers in any published medical device regulation, guidance or borderline manual, we must use the closest antecedent precedent: the current regulations that cover all software and AI medical devices. That means that LLM-based summarisers should undergo the same conformity assessment as any other SaMD or AIaMD, and obtain the relevant jurisdictional regulatory clearance/approval. In the UK, that means UKCA marking; in the EU, CE marking; and in the USA, FDA approval (not clearance, since at the time of writing there are no appropriate predicate devices for a 510(k) submission).
Developers should have an appropriate Quality Management System (minimally to ISO 13485), and produce a set of technical and clinical documentation conforming to the various required elements of the medical device regulations. This includes, but is not limited to, software development plans, risk/benefit analyses, clinical evaluation reports and solid post-market surveillance processes. Additionally, since we are dealing with software, manufacturers are required by law to prove and maintain their cyber-security credentials. If this sounds like a lot, then consider that, in the UK at least, manufacturers of general software are already mandated to become DTAC compliant to sell to the NHS, which includes much of this type of documentation, minus the QMS and clinical investigations. We've previously outlined the process for "diagnostic" LLMs, and many of the same principles apply to LLM summarisers. The difficulty for manufacturers will be to demonstrate that their software is at least as safe and effective as the "state of the art", which is currently human-derived documentation. This would need to be done in adequately powered studies across all intended deployment environments. The risks would also need to be outweighed by the benefits in a quantifiable way, meaning that manufacturers would need to measure the error rates, hallucinations and omissions of both LLMs and humans meaningfully and robustly. Herein lies a problem: LLM outputs are not generally reproducible - given a certain input, the output is not always the same - making the required sample sizes even greater.
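As a sense of scale, a standard two-proportion sample size calculation shows how large such comparative studies get when the gap between human and LLM error rates is small. The error rates below are purely illustrative assumptions, not measured figures:

```python
import math

def n_per_arm(p1: float, p2: float,
              z_alpha: float = 1.96, z_beta: float = 0.8416) -> int:
    # Normal-approximation sample size per arm for detecting a difference
    # between two proportions (two-sided alpha = 0.05, 80% power).
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Illustrative assumption: human summaries contain a clinically relevant
# error in 5% of documents, LLM drafts in 8%.
print(n_per_arm(0.05, 0.08))  # roughly a thousand documents per arm
```

And because a stochastic summariser can give different outputs for the same input, each "document" may itself need repeated generations to estimate its error rate, inflating these numbers further.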
Another significant problem is that, for the most part, LLM summarisers call APIs of off-the-shelf third-party LLMs, such as those from OpenAI. This makes the LLM itself Software of Unknown Provenance (SoUP), which is an issue when it comes to quality management. The developer has little to no control over the training, validation and updating of the underlying LLM, which is unlikely to meet the standards in IEC 62304 for software development, and presents a significant unmitigatable risk. Additionally, the terms of use provided by companies such as OpenAI explicitly prohibit medical use of their technologies, so developers would be better off considering fine-tuning an open-source LLM that they at least have fixed version control over, or - even better but unlikely - training their own from scratch.
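In practice, addressing the SoUP problem starts with pinning exactly which weights and prompts are deployed, so that the model is under the kind of configuration control IEC 62304 expects. A minimal sketch - every identifier, hash and version here is hypothetical:

```python
# Hypothetical release configuration for an in-house fine-tuned model.
# All names, hashes and versions are illustrative only.
MODEL_CONFIG = {
    "base_model": "example-open-weights-llm",  # weights the team controls
    "weights_revision": "a1b2c3d",   # pinned hash: the exact model is
                                     # known, re-testable and auditable
    "prompt_version": "2.1.0",       # prompts are controlled documents too
    "temperature": 0.0,              # reduce output variability
}

def release_identifier(cfg: dict) -> str:
    # A single traceable identifier for post-market surveillance records.
    return (f"{cfg['base_model']}@{cfg['weights_revision']}"
            f"+prompt-{cfg['prompt_version']}")

print(release_identifier(MODEL_CONFIG))
# example-open-weights-llm@a1b2c3d+prompt-2.1.0
```

With a remote third-party API, none of these fields is under the manufacturer's control - which is the crux of the quality-management objection.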
The final question remains, then: which risk classification do LLM summarisers fall under? Understandably, LLM developers would want their devices to be considered as low risk as possible to avoid unnecessary regulatory burden. In the UK, since these devices may not be considered to provide a "direct diagnosis", there is an argument that they would be Class I. However, the EU MDR is stricter on risk classification, so in the EU developers may need to progress towards a full Class II regulatory process, which involves audit by accredited organisations, such as Approved Bodies in the UK, Notified Bodies in the EU, or the FDA in the USA. For the EU, such software would also be classified as high risk under the EU AI Act, with the commensurate additional conformity assessment burden.
Conclusion
So, there we have it. A pretty solid argument that LLM software intended for the purpose of summarising clinical information is a medical device. These tools perform part of a regulated medical professional's work that only accredited professionals are allowed to do, they have a medical purpose, they come with risks to patients, they meet the definition of a medical device on multiple fronts, and in the USA they do not meet the criteria for non-device Clinical Decision Support software. While they may be relatively low risk, they still need to undergo the same level of scrutiny as any other software medical device, especially considering that the magnitude of their benefits and risks is as yet unquantified.
Our previous experiences with the FDA confirm their thinking on the non-CDS criteria, but we are yet to see an FDA-approved LLM. Only innovators at the coal face who are willing to approach the regulators will be able to set an example - and we hope they do. This technology has much promise to change day-to-day clinical practice - but, as Uncle Ben said, with great power comes great responsibility - so let's hope that developers are prepared to take on the required responsibilities in order to provide clinicians with the powerful tools they have built.
Hardian Health is a clinical digital consultancy focused on leveraging technology into healthcare markets through clinical strategy, scientific validation, regulation, health economics, and intellectual property.