Uncertainty and confidence in medical imaging AI

4.06.21

Dealing with uncertainty is central to medicine – medical practitioners are trained in differential diagnosis and how to manage myriad possibilities. For AI systems in medical imaging, this represents both an opportunity and a challenge. On the one hand, AI systems are great at precisely quantifying statistical probabilities, a task at which humans are notoriously bad. On the other hand, displaying the magnitude and sources of uncertainty to the user in an intelligible manner is a user design challenge that should not be underestimated.

Fundamentally, when determining whether a finding is present or absent, AI systems produce a continuous probability score, where a higher score indicates that the finding is more likely to be present. To convert this into a usable prediction, most systems define a threshold above which the finding is taken to be ‘present’. AI systems can convey their degree of confidence by displaying the score alongside the present-or-absent call. Cases that barely exceed the threshold are less likely to be truly positive than those with much higher scores, and seeing this helps users decide whether to trust the AI system’s prediction.
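As a minimal sketch of the thresholding step described above, the snippet below turns a score into a present-or-absent call while keeping the score and its distance from the threshold available for display. The names Prediction and classify and the default threshold of 0.5 are illustrative assumptions, not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    score: float    # continuous probability score from the model
    present: bool   # True if the score is above the threshold
    margin: float   # how far the score sits above (+) or below (-) the threshold


def classify(score: float, threshold: float = 0.5) -> Prediction:
    """Convert a probability score into a call, keeping the score for display.

    The threshold of 0.5 is a hypothetical default; real systems choose it
    per finding to balance sensitivity and specificity.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    return Prediction(score=score, present=score > threshold, margin=score - threshold)


if __name__ == "__main__":
    # Both cases are called 'present', but the first only barely clears the bar.
    for s in (0.52, 0.97):
        print(classify(s))
```

Showing the margin next to the call is one way to let a reader tell a borderline positive from a clear-cut one.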

Many AI systems are ensembles of multiple models, meaning that the final prediction is tallied from the votes of the individual models. This reflects the stochastic nature of deep learning: retraining the same model on a slightly different dataset, or even on the same dataset, can lead to differing predictions on the same data. This source of uncertainty is analogous to inter- and intra-observer variability in clinical practice, where giving the same image to different readers, or to the same reader on different occasions, can result in different reports. AI systems can capture this information by predicting not just a numerical score for a case but also the uncertainty range around that score.
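One simple way to obtain such a range, sketched here under stated assumptions, is to summarise the scores of the individual ensemble members. The function name ensemble_summary is hypothetical, and averaging the member scores and reporting their minimum and maximum is only one of several possible ways to combine them; real systems may use voting, weighting or other measures of spread.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def ensemble_summary(member_scores: Sequence[float]) -> Tuple[float, float, float, float]:
    """Summarise per-model scores for a single case.

    Returns (combined score, lowest score, highest score, standard deviation).
    Assumes each member outputs a probability for the same finding.
    """
    if len(member_scores) < 2:
        raise ValueError("need scores from at least two models")
    combined = mean(member_scores)   # simple average as the final score
    spread = stdev(member_scores)    # sample standard deviation across members
    return combined, min(member_scores), max(member_scores), spread


if __name__ == "__main__":
    # Members broadly agree: narrow range around a high score.
    print(ensemble_summary([0.88, 0.91, 0.90, 0.87]))
    # Members disagree: similar average, much wider range.
    print(ensemble_summary([0.99, 0.65, 0.98, 0.94]))
```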

Doing so helps users troubleshoot predictions with which they may disagree. A low score with a narrow range, for example one that only just clears the threshold, suggests that the case is inherently ambiguous even though the AI system has been trained on many similar images. A wide uncertainty range, even with a high score, suggests that the AI system may not have seen enough similar cases, leading to disagreement among the models within it.
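The reading of score and range described above can be sketched as a simple rule. The function name interpret, the labels it returns, and the cut-offs for a "wide" range (0.3) and a "borderline" score (within 0.1 of the threshold) are all hypothetical values chosen for illustration; they would need to be set and validated for each finding.

```python
def interpret(score: float, low: float, high: float,
              threshold: float = 0.5,
              wide: float = 0.3,
              borderline: float = 0.1) -> str:
    """Label a case from its score and uncertainty range (illustrative only)."""
    if not low <= score <= high:
        raise ValueError("score must lie within the range [low, high]")
    width = high - low
    if width >= wide:
        # Models disagree: the system may have seen too few similar cases.
        return "wide_range"
    if abs(score - threshold) <= borderline:
        # Models agree, but the score sits close to the threshold.
        return "ambiguous_case"
    # Models agree and the score is well away from the threshold.
    return "confident"


if __name__ == "__main__":
    print(interpret(0.90, 0.50, 1.00))  # high score, wide range
    print(interpret(0.55, 0.50, 0.60))  # narrow range, near the threshold
    print(interpret(0.95, 0.90, 1.00))  # narrow range, clear of the threshold
```

Checking the width of the range first reflects the point that a high score alone does not guarantee that the models agree.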

Managing and displaying confidence and uncertainty well can bring significant benefits: it helps users decide when to trust an AI prediction and understand why a particular case may be problematic for the AI, which in turn can improve user accuracy. However, this challenge is overlooked by many current AI vendors and poorly understood by many end-users.

As this emerging field matures, decision-makers must consider how AI systems deal with uncertainty and proactively educate users on the topic, so that AI assistance delivers its full potential improvement in diagnostic performance.