Tools for AIaMD Transparency and Reporting

Over the last few years, several initiatives have emerged to create standards, guidelines and tools that improve documentation for medical AI applications.

As well as ensuring a minimum quality of published evidence, these tools help to build trust, transparency and accountability. In this blog, we’ll be signposting some of them and discussing how and when they should be used, focusing on 3 key areas: the evidence, the model and the data.

Reporting Guidelines: documenting your study

Over the last 2 years, an international consortium of researchers has been working to develop new guidance on the transparency of evidence for AI in health. For research papers, several reporting guidelines for clinical studies of AI medical devices have been developed to improve transparency in the scientific literature.

Reporting guidelines (often presented as a checklist) are like a shopping list of information authors must include when writing up a scientific paper. The purpose is to make sure the most important information is included, so the reader can easily appraise the evidence and decide whether the findings are valid. Reporting guidelines don’t dictate how a study should be done; they simply ask authors to write down exactly what they did.

There are several reporting guidelines to be aware of:

SPIRIT-AI and CONSORT-AI are reporting guidelines for clinical trials of AI devices. SPIRIT-AI is for clinical trial protocols (‘this is what we’re going to do’), and CONSORT-AI is for clinical trial reports (‘this is what we did’). SPIRIT 2013 and CONSORT 2010 (the non-AI versions) were developed some time ago and list what should be reported for any clinical trial, no matter what is being evaluated: drugs, medical devices, surgical treatments, psychological therapies and so on. Most journals already mandate their use in the Instructions to Authors. The AI extensions, published in 2020, added a number of not-so-surprising items, including: how the AI device is intended to sit within the clinical pathway, what the inputs and outputs of the AI device are, how it will integrate into a health setting, whether there are human-AI interaction elements, and which version of the algorithm is being used. The full lists can be found here. Adherence to these guidelines can be neatly shown using a checklist, which journals will ask you to submit alongside your paper.

Several other AI reporting guidelines are currently in development for different study designs: STARD-AI (for AI diagnostic test accuracy studies), TRIPOD-AI (for the development and validation of AI prediction and prognosis models) and DECIDE-AI (for early-stage, exploratory clinical investigations of AI, including studies of human factors and usability).

Model facts labels: documenting your model

Users, whether they be front-line clinical practitioners, patients or the public, require information about AI devices in a concise, understandable and easily accessible way. At the point of use, the priority should be to ensure appropriate use and avoid unnecessary risk.

One suggestion for communicating this to end-users is a ‘Model Card’. Originally proposed by Mitchell et al. in the seminal paper ‘Model Cards for Model Reporting’, these short documents are intended to summarise key details of the model, including training details, intended use, performance, a description of the training data, and ethical considerations.

Applying this idea to health, Sendak and colleagues proposed the ‘Model Facts’ label (akin to the commonly used ‘Drug Facts’ label) to improve risk communication at the clinical front-line: making the right information available to end-users so they can make sound choices.

Figure from Sendak et al., npj Digital Medicine 2020: Example Model Facts label for a sepsis prediction model.

As shown in the example from Sendak’s paper, short and concise ‘instructions for use’ make clear to end-users exactly when and how the model is intended to be used. Making this information easily accessible at the point of use safeguards against intentional and unintentional misuse, and reduces the likelihood of model errors, adverse events and potential harm. In fact, the Model Facts label is a perfect opportunity to communicate the Intended Use Specification (IUS) directly to end-users.
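To make this concrete, here’s a minimal sketch of how a Model Facts label might be captured as structured data, so the same content can be rendered wherever the model’s output appears. The field names loosely follow the sections in Sendak et al.’s example, but the schema and the values below are purely illustrative, not a published standard.

```python
from dataclasses import dataclass, field

@dataclass
class ModelFactsLabel:
    """Illustrative structure for a Model Facts label (fields loosely based on Sendak et al.)."""
    model_name: str
    version: str
    summary: str                 # one-line description of what the model does
    intended_use: str            # the Intended Use Specification, in plain language
    target_population: str       # who the model was developed and validated on
    inputs: list = field(default_factory=list)        # data the model needs
    output: str = ""                                   # what the end-user actually sees
    performance: dict = field(default_factory=dict)    # key validation metrics
    warnings: list = field(default_factory=list)       # situations where the model should not be used

# Hypothetical sepsis prediction model, for illustration only.
label = ModelFactsLabel(
    model_name="Example sepsis risk model",
    version="1.0.2",
    summary="Predicts risk of sepsis within the next 4 hours for adult inpatients.",
    intended_use="Decision support for ward staff; not a diagnostic device.",
    target_population="Adults admitted to general medical wards.",
    inputs=["vital signs", "laboratory results", "demographics"],
    output="Risk score between 0 and 1, refreshed hourly.",
    performance={"AUROC": 0.85, "sensitivity_at_threshold": 0.80},
    warnings=["Not validated in paediatric or maternity patients."],
)
```

Keeping the label in a machine-readable form like this means one source of truth can drive both the on-screen summary clinicians see and the documentation supplied with the device.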

It’s quite possible that the majority of clinicians implementing AIaMD either don’t know what an IUS is, or have never asked to see it.


Clear communication of the Intended Use Specification is an obvious safeguard against misuse of AI devices - it seems a no-brainer that users should have access to it; even better, why not make it publicly available?

Datasheets: documenting your data

Key to the performance of AI devices are the datasets that underpin them. The recently published guidance from the FDA, Health Canada and MHRA on Good Machine Learning Practice (GMLP) specifically calls for careful selection of datasets (and how they link to the intended use and intended population) and clear documentation of how data was split for model training. If you missed it before, Hardian Health consultant Mike Pogose’s December blog covers the new GMLP principles.
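To give a flavour of what that documentation might look like in practice, here’s a minimal sketch of a patient-level train/test split, recorded so it can be reported later. It assumes scikit-learn and uses made-up variable names and toy data; the point is simply that splitting by patient (rather than by individual record) avoids the same patient leaking into both sets, and that the split itself is something you can write down.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for the real dataset; names and values are illustrative only.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
X = list(range(len(patient_ids)))   # placeholder for the actual features
y = [0, 1, 0, 1, 0, 1, 0, 1]        # placeholder for the labels

# Split by patient ID so no patient appears in both training and test data.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# Record how the split was done, so it can be reported alongside the model.
split_record = {
    "method": "GroupShuffleSplit by patient ID",
    "test_fraction": 0.25,
    "random_state": 42,
    "train_patients": sorted({patient_ids[i] for i in train_idx}),
    "test_patients": sorted({patient_ids[i] for i in test_idx}),
}
print(split_record)
```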

There are also increasing ethical concerns about the potential for AI systems to underperform in population groups that are poorly represented in the data. Researchers are stressing the importance of purposefully testing performance in relevant subgroups, so that if underperformance does exist (previously described as ‘hidden stratification’), it can be detected and mitigated.

Unfortunately, several review papers have found poor reporting of key demographic information in many publicly available datasets (such as this one for eye imaging datasets, this one and this one for skin cancer images, and this one for radiology AI models). Unless datasets report these characteristics, it is difficult to judge whether they reflect the intended use population, and whether the algorithm is likely to underperform in certain population groups. You can’t even do subgroup testing using these datasets, because you don’t know what the subgroups are!
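Where a dataset does record the relevant attributes, subgroup testing itself is straightforward. Here’s a minimal sketch (with made-up column names, toy data and AUROC as the example metric) that computes the same metric overall and per subgroup, so that any hidden stratification becomes visible:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy results table: model outputs plus one demographic attribute per case.
# Column names and values are illustrative only.
results = pd.DataFrame({
    "y_true":    [0, 1, 0, 1, 1, 0, 1, 0],
    "y_score":   [0.2, 0.9, 0.4, 0.7, 0.6, 0.1, 0.3, 0.5],
    "ethnicity": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Overall performance...
print(f"Overall AUROC: {roc_auc_score(results['y_true'], results['y_score']):.2f}")

# ...and the same metric for each subgroup, with the subgroup size,
# so underperformance in a poorly represented group can be spotted.
for group, subset in results.groupby("ethnicity"):
    auc = roc_auc_score(subset["y_true"], subset["y_score"])
    print(f"AUROC for subgroup {group}: {auc:.2f} (n={len(subset)})")
```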

‘Datasheets for Datasets’, by Timnit Gebru and team, is a structured template for documenting the essential characteristics of datasets, to improve their transparency (and utility!). It suggests documenting the processes and decisions behind data curation, including why the dataset was created, how and by whom, how labels were created (and the assumptions behind them - a major flaw with many medical datasets), what processes were used to clean and transform the data, and how the dataset is maintained and distributed. Although Datasheets for Datasets is not specifically tailored to healthcare applications, its principles are cross-cutting and speak to many of the problems we have seen with medical datasets so far. Researchers are also working to improve standards for diversity in healthcare datasets and creating datasets focused on health outcomes.
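As a rough sketch of what that might look like alongside a dataset, a datasheet can be as simple as a structured file stored and versioned with the data. The headings below paraphrase a handful of the prompts from Gebru et al.; the structure and field names are illustrative rather than a fixed schema.

```python
import json

# Illustrative datasheet; the values here are prompts to be filled in,
# not real dataset documentation.
datasheet = {
    "motivation": {
        "purpose": "Why was the dataset created, and for what intended use?",
        "creators": "Which team or institution created it, and with what funding?",
    },
    "composition": {
        "instances": "What does each record represent (images, lab results, episodes of care)?",
        "demographics": "Age, sex and ethnicity breakdown of the included population.",
        "labels": "How were labels produced, by whom, and under what assumptions?",
    },
    "collection_and_preprocessing": {
        "collection": "How, when and from where was the data collected?",
        "cleaning": "What cleaning, transformation or exclusion steps were applied?",
    },
    "maintenance_and_distribution": {
        "distribution": "How is the dataset shared, and under what licence?",
        "maintenance": "Who maintains it, and how are errors or updates handled?",
    },
}

# Keeping the datasheet as a file next to the data makes it easy to version and share.
with open("datasheet.json", "w") as f:
    json.dump(datasheet, f, indent=2)
```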

Transparency builds trust

Table 1: Low adherence to existing model reporting guidelines by commonly used clinical prediction models. J. H. Lu et al., https://doi.org/10.1101/2021.07.21.21260282

You’ll notice hardly any of the guidelines mentioned in this article say what you should or shouldn’t do. They simply ask you to make it really easy to understand what you did. This gives everyone else the ability to make sense of the evidence, the model or the data, and decide how to act. Research shows that the majority of deployed AI systems do not adhere to the relevant guidance, with only 61% providing adequate documentation for model users to ensure that deployed models are useful, reliable, transparent and fair.

Ultimately, transparency is the first step to building trust, but we still have some way to go to earn that trust in medical AI.

Hardian Health is a clinical digital consultancy focussed on leveraging technology into healthcare markets through clinical strategy, scientific validation, regulation, health economics and intellectual property.

By Dr Xiao Liu - Hardian Health Industry Fellow, ophthalmologist and AI researcher
