As adoption of artificial intelligence in medical imaging grows, the need for careful evaluation of the true performance of these algorithms has become more urgent. The deep learning literature typically relies on single metrics such as the area under the receiver operating characteristic curve (AUC) to benchmark the performance of AI algorithms in non-medical applications. This approach has carried over into medical imaging AI, and today most AI vendors and papers still report these metrics alone.
However, as early as 2005, before the current rush of deep learning applications, researchers cautioned against relying too heavily on these metrics to conclude that a model’s performance is sufficient. This lesson is steadily being rediscovered in medical imaging AI, with recent research demonstrating that in populations with causally relevant subclasses, a single AUC number is not sufficient to describe the performance of a model. For instance, when detecting pneumothorax, the high overall accuracy of an AI model may mask the fact that it is simply relying on the presence or absence of a chest tube and missing the clinically relevant cases of pneumothorax that have yet to be identified and treated. This is sometimes referred to as hidden stratification or shortcut learning, further explained in this paper.
A recent publication in Nature Machine Intelligence by Degrave et al. again rediscovers this phenomenon in the topical application of detecting COVID-19 on chest radiographs. Using saliency maps as well as more advanced explainability techniques such as pseudo-healthy patient synthesis using generative adversarial networks (an approach initially described by Seah et al. in ‘Chest Radiographs in Congestive Heart Failure: Visualizing Neural Network Learning’), the researchers identified several possible confounders, such as the laterality markers used in each dataset and differences in letterboxing. The impact of these confounders is then examined by synthetically altering the radiographs, leading the researchers to conclude that almost half of the performance of the AI models examined is due to these confounders.
Aware of such pitfalls, our clinical AI team have designed Annalise CXR from the ground up to be resilient to confounders, and have devised methods to carefully evaluate the true performance of our algorithms rather than simply relying on a single metric for each finding. To start with, radiologists perform large-scale annotation of each image, recording both the findings and potential confounders on the image, e.g., whether a pneumothorax and/or an intercostal drain is present. Additionally, Annalise CXR is trained with the benefit of pixel-level annotations where clinically relevant, as radiologists indicate on the image exactly where each finding is. Indicating the relevant parts of the image to the algorithm focuses its attention on the true abnormality, reducing its reliance on shortcuts and confounders. This effect has been recently described in the literature by Rueckel et al.
Pixel-level annotations also enable Annalise CXR not only to identify whether a finding is present or absent, but also to indicate which pixels the algorithm predicts to be involved. This form of explainability lets users check whether the algorithm is truly focusing on the correct abnormality.
For validation, our clinical AI team have devised a list of confounders thought to be clinically relevant for each finding, including patient positioning and the concurrent presence of other findings. For instance, for pneumothorax, the AUC for Annalise CXR is measured both in a held-out test set and in the subset of cases that do not have an intercostal drain, a common confounder. Any statistically significant difference in AUC values between these two populations prompts a review of the AI model design and outputs.
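To make the idea of a stratified AUC check concrete, here is a minimal sketch in Python. It is an illustration only, not the evaluation pipeline used for Annalise CXR: the `(label, score, has_drain)` case format, the function names, the toy scores and the choice of a case-level bootstrap to judge the gap are all assumptions made for this example. Because the drain-free subset is nested inside the full test set, the sketch resamples whole cases so that both AUCs are recomputed from the same resample.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(cases):
    """Return (overall AUC, AUC on cases without an intercostal drain).

    `cases` is a list of (label, score, has_drain) tuples; this schema is
    hypothetical and chosen only for the example.
    """
    overall = auc([c[0] for c in cases], [c[1] for c in cases])
    no_drain = [c for c in cases if not c[2]]
    subset = auc([c[0] for c in no_drain], [c[1] for c in no_drain])
    return overall, subset


def bootstrap_gap(cases, n_boot=1000, seed=0):
    """Percentile 95% interval for (overall AUC - drain-free AUC).

    The bootstrap is an assumption of this sketch, not a prescribed test.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        sample = [rng.choice(cases) for _ in cases]
        try:
            overall, subset = stratified_auc(sample)
        except ValueError:
            continue  # this resample lacked a class; skip it
        gaps.append(overall - subset)
    if not gaps:
        raise ValueError("no usable bootstrap resamples")
    gaps.sort()
    lower = gaps[int(0.025 * len(gaps))]
    upper = gaps[min(len(gaps) - 1, int(0.975 * len(gaps)))]
    return lower, upper


# Toy data: a model that scores treated (drain present) cases highly
# but is much less sure about untreated pneumothoraces.
cases = [
    (1, 0.90, True), (1, 0.95, True), (1, 0.85, True),      # pneumothorax, drain
    (1, 0.40, False), (1, 0.30, False), (1, 0.60, False),   # pneumothorax, no drain
    (0, 0.20, False), (0, 0.50, False), (0, 0.35, False), (0, 0.10, False),
]

overall, subset = stratified_auc(cases)
print(f"overall AUC:    {overall:.3f}")   # 0.875
print(f"drain-free AUC: {subset:.3f}")    # 0.750
print("95% interval for the gap:", bootstrap_gap(cases))
```

In this toy example the headline AUC looks respectable while the drain-free subset, the clinically important one, performs noticeably worse. With only ten cases the bootstrap interval is very wide, so a real evaluation would need a far larger test set before a gap could be called statistically significant.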
At Annalise.ai, we believe that development and validation of AI in medical imaging should be clinically led, to ensure that AI products are beneficial and do not harm patients. Accurately measuring the true performance of AI and mitigating shortcut learning and hidden stratification are complex tasks that require a team with a deep clinical understanding of radiology as well as a technical understanding of artificial intelligence.