5 simple questions to ask clinical AI vendors before you buy
If you're a hospital exec, departmental lead, or run a clinical service, you've likely been approached by a gazillion AI vendors with all sorts of shiny new tech that's just bursting with promise. Right?
If so, I know the feeling. My inbox is full of the stuff every single day. To the uninitiated it can be hard to discern which ones are actually going to help (if at all), which are fads, and which are just plain dangerous. The wheat needs careful separating from the chaff. Not everyone has a degree in machine learning, understands the regulatory landscape and the clinical pathways, and can pick apart exaggerated claims (yes, that was a humble brag).
In this blog, I'll cover the five key questions you need to ask each of these vendors, giving you the low-down on common marketing misunderstandings, statistical obfuscation, and creative stretching of medical device regulation. Here goes…
1 What is your intended use?
This sounds like an obvious question, doesn't it? (I told you these questions were simple). In fact, it's not. Not at all. Figuring out an intended use is the keystone of medical device regulation: it's what manufacturers design their product around, it defines the risk class, it sets the scene for all of their design processes, and it sets a limit on the claims they can make. The last one is the most important here - without fully understanding an AI product's intended use you simply can't know what its limitations are. I'll give you an example:
Company X approaches you with an FDA-approved proprietary AI system that can, according to their swish marketing, "diagnose diabetes with 92% accuracy based on a urine test."
So, what is the intended use? Is it to diagnose diabetes on a urine test?
NO!
The intended use is to support clinical decision making based on a prediction of the risk of a patient having type II diabetes based on AI analysis of urine.
Actually, that's not quite right. The intended use is to support clinical decision making based on a prediction of the risk of a patient having type II diabetes based on AI analysis of urine in patients who are either at high risk or have presented with clinical symptoms of diabetes.
No, wait, still not quite there…
The intended use is to support clinical decision making of trained physicians based on a prediction of the risk of adult patients between the ages of 21 and 65 having type II diabetes based on AI analysis of a fasted urine sample, collected and processed using X, Y or Z techniques only, in patients who are either at high risk or have presented with clinical symptoms of diabetes.
That's better. Now we know both the intended use AND the indications for use:
Who the product is intended to be used by
In what population of patients
Which disease subtype
Under what circumstances
What other processing is required
Whether it is stand-alone or supports decision making
and more…
The terminology here can vary depending on geography. The FDA like to see intention and indication as separate regulatory terms, but EU regulators are fine with a single Use Specification document. In any case, "intended use" needs to describe exactly what a device is to be used for, and "indications for use" need to describe the exact conditions and situations under which that device can be used.
The point I am making here is that without knowledge of a detailed intention and indication for use, a buyer might simply be led to believe that the AI system being sold can actually diagnose (wrong) diabetes (wrong) on any urine sample (wrong) on any patient (wrong again). That's why it's so important to ask!
The good news is that the theoretical product in this example has received regulatory approval - which means the vendor has already set these statements in stone. They have a regulatory document, which you are entitled to ask to see, called their Use Specification, which sets out exactly what their product can do. It should also cover what it can't do, including exclusions for use and, even more importantly, what the company is doing to prevent "foreseeable misuse" (like when a doctor decides they are going to use it to try and diagnose something else).
If these concepts sound familiar to you, that's because they should be. The exact same regulatory terms have been used for drugs and physical medical devices for decades. Putting "AI" on the box doesn't make a jot of difference! If you want to know more about these statements and just how rigorous they need to be, I invite you to read section 8 of IEC 62366-1 on Usability Engineering.
Why does this matter? Currently there is no overseeing body or code of practice for claims surrounding software as a medical device (SaMD), unlike the UK's ABPI Code of Practice for pharmaceuticals (the code of practice is a "visible sign of the commitment of the pharmaceutical industry to benefiting patients by operating in a professional, ethical and transparent manner"). Without such a code, the sector remains a wild west, with vendors able to make whatever claims they like with very little comeback on false advertising. Yes, the regulators are supposed to have some oversight on this - but they are too under-resourced to monitor all marketing materials from all vendors, leaving huge potential for abuse of good faith. Until the market matures and a code of practice is enforced, it's buyer beware! You might be thinking "surely no-one would dare lie like that?". Well my friend, yes they would. And they have. Go read Bad Blood.
2 What evidence is there?
OK, so now you know the intended use and indications for use of this newfangled AI thingummy. Time to dig deeper…
For each of the above points, the manufacturer needs to have evidence that their product can do what they claim EXACTLY as it is labelled in their use specification. And I mean EXACTLY.
Taking the above example - the vendor needs to ensure that their evidence supports each of their claims perfectly. All too often I see AI companies claiming that their AI can do X when in reality it has only been investigated for Y. In the example above, this could mean that the AI system has only been tested on urine samples processed using X technique, but not Y or Z. Or maybe they only included adults between the ages of 40 and 60 in their studies. Either way, the reality is that they are severely limiting the potential utility of the algorithm, and mis-marketing it.
In addition to testing against the correct claims, the vendors should have demonstrated performance against an accepted gold standard. This will vary depending on the clinical use case, but in all situations there should have been an attempt at either citing accepted standard pathway performance data, or collecting it.
Buyers should demand all the pre-market evidence available, and scrutinise it. This should come in the form of standard recognised performance metrics, which in the case of many AI systems and prediction models is a confusion matrix, ROC curve or decision-point curve. If an AI vendor of a binary decision system can't or won't show you their confusion matrix - walk away!
Back to our above example. I said that company X were claiming 92% accuracy. What does that mean exactly, and how can you tell if they are telling the truth? First things first, 92% is a very reasonable level of accuracy, if it's true. But how has the vendor defined accuracy? Do they literally mean "accuracy" as per the conventional formula of (TP + TN) / (TP + TN + FP + FN)? In this theoretical example, let's assume they've made the easy marketing mistake of stating the sensitivity only.
A sensitivity of 92% means that their AI can find approximately 9 out of 10 positive cases in any given cohort. Sounds good, right? Hang on, there's more. What they haven't told you in their marketing materials is that their specificity is only 45%. That's super-low - worse than a coin flip. It means the system is essentially 50/50 at correctly identifying negative cases in a cohort, which in the situation company X is applying their system to makes it essentially redundant. No-one wants a diabetes urine test that flags up negative results correctly less than 50% of the time. Yet, bizarrely, company X may well go to market touting their higher number of 92% only. Call it a communication void between the scientists and the marketeers, or whatever, but it's potentially misleading and bound to cause confusion amongst the buyer market.
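To make the arithmetic concrete, here is a minimal sketch using entirely hypothetical confusion-matrix counts, chosen only to reproduce the 92% sensitivity and 45% specificity of this made-up example (none of these numbers come from a real product):

```python
# Hypothetical confusion-matrix counts for the fictional Company X urine test.
# 100 patients truly have type II diabetes; 1,000 do not.
tp, fn = 92, 8      # true positives, false negatives
tn, fp = 450, 550   # true negatives, false positives

sensitivity = tp / (tp + fn)                 # 0.92 -> the number on the brochure
specificity = tn / (tn + fp)                 # 0.45 -> the number left out
accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.49 by the conventional formula
ppv = tp / (tp + fp)                         # ~0.14 -> chance a flagged patient truly has diabetes

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"Accuracy:    {accuracy:.2f}")
print(f"PPV:         {ppv:.2f}")
```

In this hypothetical cohort the textbook "accuracy" comes out at roughly 49%, and only around one in seven patients flagged as positive actually has the disease - numbers that look nothing like the headline 92%.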
The other thing to watch out for is whether the algorithm has been validated on an independent external dataset, or whether internal validation metrics are being reported from a split of the development dataset. For the purposes of gaining regulatory approval, a retrospective study on internal data can suffice, as long as there is a concurrent reader study to compare performance to humans or current best practice. However, as there is often a noticeable drop in algorithmic performance when new datasets from different sources are used for validation, it is always wise to ask how an algorithm performs on new data, and what a manufacturer is doing to ensure consistency of performance for new deployments.
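If you want to pin this down with a vendor, the comparison you are really asking for looks something like the sketch below - a rough, assumed workflow in which load_vendor_dataset and load_external_dataset are hypothetical placeholders for the development data and a genuinely independent dataset from a new site, and the simple logistic regression merely stands in for whatever model the vendor actually uses:

```python
# Sketch: compare performance on an internal hold-out split vs an external dataset.
# A sizeable gap between the two AUCs is a warning sign of brittleness.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_vendor_dataset()            # hypothetical: data the model was developed on
X_ext, y_ext = load_external_dataset()  # hypothetical: data from an unseen site

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

auc_internal = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"Internal (split) AUC: {auc_internal:.2f}")
print(f"External AUC:         {auc_external:.2f}")
```

A credible vendor will have run exactly this kind of comparison themselves, and should be able to show you external figures for sites similar to yours.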
Bottom line - "accuracy" is a loose term that can be abused (both deliberately and innocently), and can depend on the data being used for validation. Do your due diligence, dig deeper into the evidence, and demand recognised reporting metrics. The medical device regulations do not mandate that clinical investigations of proprietary software be made public or published in an academic journal, so it's up to the buyers to insist on seeing the results. This is especially important for devices that have received regulatory approval based only on "equivalence" to pre-existing devices, as these rarely have full clinical investigations performed.
3 What's your post-market plan?
So far, the questions have been largely similar to what you should ask for any drug or medical device, but this third question takes things a step further. Because AI systems are relatively novel, have no long-term follow-up data, and have a propensity to be brittle when faced with new data sources, it is crucially important to understand what post-market monitoring, reporting and corrective plans need to be in place.
As a buyer of clinical AI software, you aren't just buying a software product, installing it, and letting it run. You are entering into a partnership with the vendor, one that covers the lifetime of the software. You see, the vendor is mandated by the medical device regulations to have both a proactive and reactive Post-Market Surveillance (PMS) plan AND a Post-Market Clinical Follow-up (PMCF) plan. This means that you as a buyer need to be aware of what resources you need to be prepared to supply in order for that vendor to carry out their post-market activities.
Let's go back to our example above. Company X have a PMS plan which entails collecting adverse events, field safety corrective actions, bug reports and customer feedback. To do this, they set up an email address to record user-sourced feedback, they have an automated bug-reporting feature in their software, and they have a customer support line for any queries. This allows them to produce what are known as Periodic Safety Update Reports (PSURs), submitted on a regular basis to a central regulatory body, which as a buyer you should ask to see every year.
Additionally, according to the vendor's PMCF plan, they may also need to monitor the sensitivity and specificity of their AI at least once a year at a variety of deployed locations, to ensure no algorithmic drift is taking place, to identify off-label misuse, to identify previously unknown errors and issues, and to check that their performance claims can be upheld. To do this, they will need clinical data from you, the buyer, often provided in a prospective fashion. You will need to work with the vendor to supply de-identified clinical data on patient demographics, indications for use, patient histories, confirmation or rejection of a diagnosis of type II diabetes, and other ancillary information. In all likelihood, you aren't collecting this kind of data routinely, so new systems will need to be set up. Without this data, the vendor simply can't have any realistic expectation of monitoring their software's performance in a live clinical setting. This prospective data gathering will have a cost, and will require active engagement from your clinical and operational staff to set up and manage. The vendor may offer solutions to help streamline this data gathering, but there will always be a resource cost associated with it.
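Purely as an illustration (the field names below are hypothetical assumptions, not a regulatory template or any vendor's actual schema), the sort of de-identified record a site might agree to supply for PMCF monitoring could look like this:

```python
# Sketch of a de-identified PMCF monitoring record for the fictional Company X example.
# Field names are illustrative assumptions, not a mandated format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PMCFRecord:
    pseudonymised_id: str                        # site-generated ID, not reversible by the vendor
    age_band: str                                # e.g. "40-49", within the labelled 21-65 range
    indication: str                              # "high risk" or "symptomatic presentation"
    sample_technique: str                        # collection/processing technique X, Y or Z
    ai_risk_output: float                        # the risk score the algorithm produced
    confirmed_type_ii_diabetes: Optional[bool]   # final clinical diagnosis, once known
    adverse_event: bool = False                  # flag feeding the vendor's PMS/PSUR reporting
```

Agreeing the fields, who collects them, and who pays for the collection is exactly the kind of resourcing conversation worth having before signing, not after.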
The more complex the AI prediction, the harder it will be to gather prospective data. For example, an AI predicting 5-year mortality will need very careful planning on what 5-year data to gather on each patient in order to monitor the algorithm's performance. Even simple binary predictors such as "normal or abnormal" tests can have vastly complicated follow-up pathways to ensure correct categorisation.
4 What is the risk class?
This question is for EU-certified AI clinical decision support products (i.e. those with a CE mark). It is designed to tease out how well the manufacturer understands the concept of clinical risk:benefit, and how engaged they have been with proper regulation. Under current EU regulation (the MDD), AI decision support systems could be classed as low risk (Class I) and self-certified, with no official body looking at any documentation at all! This was a loophole left over from previous regulatory reform, which did not predict the rapid increase in utility of automated software. The loophole was due to be closed by the advent of the new EU MDR in May 2020 - however, the EU Parliament recently voted to delay the MDR by one year, to May 2021, as a result of the COVID pandemic. The MDR represents "beefed-up" regulation of medical devices, including a vastly increased focus on patient safety and post-market monitoring. Under the MDR, almost all AI systems would be up-classed to at least medium risk (Class II), meaning a formal external audit of their quality systems and technical documentation by an independent regulatory body would be required. As a result, under the new MDR, AI manufacturers can no longer "self-certify" as Class I and benefit from the regulatory light-touch loophole.
Figure: Risk class flow chart for the new MDR, under which most AI software would be Class II
There is however a catch - devices self-certified under the MDD as Class I before May 2021 can keep that designation until 2024 - meaning that an AI company can sell its clinical decision support software for the next few years without ever having had a formal regulatory audit or independent scrutiny. This is why, for clinical decision support systems, you should ask "what class of device is your product?" - and if the answer is Class I, then you should be very cautious, as there has been no independent regulatory approval, and no guarantee of quality assurance or proper clinical investigation. Only devices approved as Class II or above have been audited, and it is those that you should consider for procurement.
5 Who is your Notified Body?
Again, this question is for EU products only, as the FDA covers all devices in America. A Notified Body (NB) is a third-party organisation licensed by EU countries to assess the conformity of certain products before they are placed on the market. There are currently only 12 NBs designated to hand out CE marking under the new MDR.
This question has two purposes - firstly, you can get a feel for how well a company is engaged with regulation by how well they know their Notified Body. If they can't tell you, or look blankly at the wall, or ask "what's a Notified Body?", it's a sure sign that they have no idea about regulation, and that their company culture is not focussed on it. A Notified Body should have a deep relationship with a manufacturer, and should have made multiple on-site visits to check that all is in order. They can carry out surprise audits at any time, and every year there should be regular reporting to the NB of safety data and updated clinical evaluations. Everyone from the CEO down should be aware of who the NB is, and why they are important. Avoid companies who can't tell you!
The second reason to ask this question is more controversial - some NBs have a reputation for being more "strict" than others. I'm not going to rate them for you, but let's just say some can be considered slightly more lenient, so it may be worth finding out which one was used, and then checking the current sentiment regarding their reputation.
So, there you have it - five super simple questions you should be asking AI vendors who approach you with clinical decision support systems. The most important thing to remember is that adding AI into the mix does not necessarily make anything easier in terms of procurement, and that deep due diligence into the value proposition and robust quality assurance should remain fundamental, just as they would for any purchase decision for any other medical product.
If you are considering purchase of such a system, please do get in touch to find out how our expert team can help. Proper process and expert input in the early phases can save significant time (and money) later!
Hardian Health is a clinical digital consultancy focussed on leveraging technology into healthcare markets through clinical strategy, scientific validation, regulation, health economics and intellectual property.