
How to evaluate a diagnostic radiology AI paper?

Dr Jarrel Seah | Associate Clinical AI Director

As artificial intelligence enters practical and daily use, radiologists will increasingly engage with radiology AI algorithms as end-users or stakeholders in deciding whether to implement algorithms in clinical practice. As radiologists, we need to address the most important question: Can we use this AI solution in clinical practice safely? 

Most artificial intelligence algorithms will come with a variety of evidence, either a white paper provided by the manufacturer or a peer-reviewed academic paper published in the literature. Since the most common types of AI algorithms will likely be performing computer-aided diagnosis, most of this evidence will take the form of a diagnostic accuracy study for classification.  

Evaluating diagnostic accuracy studies is a well-known topic that predates artificial intelligence or machine learning. Well-established guidelines such as the Standards for Reporting Diagnostic Accuracy Studies (STARD)[1] are suited for this task and should be included as part of the assessment of any diagnostic radiology AI paper.  

However, when evaluating a diagnostic radiology AI paper, several additional facets should be considered, particularly around the dataset used to train and test the AI algorithm, the AI training procedure, and the results presented in the paper. 


Not all data are equal 

The first questions to consider are based on the data referenced in a paper. 

   1. Training data vs testing data 

Is the source of the data used to test the AI algorithm the same as the source of the data used to train it? If so, how was the testing data kept separate from the training data? If the AI algorithm was evaluated on the same patients on which it was trained, it could memorise the personal characteristics of patients with particular diseases, falsely inflating its accuracy on the testing data.
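
One way to keep testing data separate is to split by patient rather than by image. The following is a minimal sketch only, not a method described in any of the cited papers: the function names, the `(patient_id, study_id)` record format and the 20% test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split study records so that no patient appears in both train and test.

    `records` is a list of (patient_id, study_id) tuples. Both identifiers are
    hypothetical stand-ins for whatever a real dataset provides.
    """
    by_patient = defaultdict(list)
    for patient_id, study_id in records:
        by_patient[patient_id].append(study_id)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    train = [(p, s) for p in patients[n_test:] for s in by_patient[p]]
    test = [(p, s) for p in patients[:n_test] for s in by_patient[p]]
    return train, test


def patient_overlap(train, test):
    """Return the patient IDs present in both splits (ideally an empty set)."""
    return {p for p, _ in train} & {p for p, _ in test}


# Made-up records: patient P1 has two studies
records = [("P1", "S1"), ("P1", "S2"), ("P2", "S3"),
           ("P3", "S4"), ("P4", "S5"), ("P5", "S6")]
train, test = split_by_patient(records)
print(patient_overlap(train, test))  # an empty set means no patient-level leakage
```

The same `patient_overlap` check can be applied to any published split where patient identifiers are available. A split made at the image level, such as `records[:1]` versus `records[1:]` here, would place patient P1 on both sides.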

   2. Multi-sourced data 

The source of the testing data is also important when considering whether the results of the paper are applicable in your particular setting. The testing data’s inclusion and exclusion criteria will indicate if the patients are similar to those in your practice.  

Example: The testing data may combine positive tuberculosis patients from India with negative controls from Australia, producing what is known as a Frankenstein dataset [2]. This is unlikely to mirror your patient population, since it is unlikely that your practice sees patients from both India and Australia at the same time. An algorithm tested on such data may also appear accurate simply by telling the two sites apart rather than by recognising the disease.
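
A quick sanity check for this kind of dataset is to tabulate labels by data source and flag any source that contributes only positives or only negatives. This is an illustrative sketch: the site names, counts and function names are invented, and a real check would use whatever source field the paper reports.

```python
from collections import Counter


def label_counts_by_source(cases):
    """Count cases per label for each data source.

    `cases` is a list of (source, label) pairs, with label 1 = disease present
    and label 0 = disease absent.
    """
    counts = {}
    for source, label in cases:
        counts.setdefault(source, Counter())[label] += 1
    return counts


def single_class_sources(cases):
    """Return the sources that contain only positives or only negatives."""
    return sorted(
        source
        for source, c in label_counts_by_source(cases).items()
        if c[0] == 0 or c[1] == 0
    )


# Made-up example: site_A supplies only positives, site_B only negatives
cases = (
    [("site_A", 1)] * 40
    + [("site_B", 0)] * 60
    + [("site_C", 1)] * 10
    + [("site_C", 0)] * 30
)
print(single_class_sources(cases))
```

When a source appears in this list, the label is perfectly predictable from the source alone, so reported accuracy may partly reflect differences between sites (equipment, protocols, demographics) rather than the disease.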

   3. Hidden stratification 

Subgroups within the patient population that are of particular interest to you should be evaluated separately. Good overall performance of the AI algorithm may mask underlying deficiencies.  

Example: The AI algorithm may detect pneumothoraces on chest radiographs very well overall, but may fail to detect pneumothorax in patients without an intercostal chest drain, a phenomenon known as hidden stratification[3]. Since intercostal drains are obvious on a chest radiograph, AI algorithms often easily pick up patients whose pneumothorax has already been treated with a drain. But they may fail in cases where the patient has yet to be treated, which are more concerning for clinicians.
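
The effect is easy to see when sensitivity is computed per subgroup instead of only overall. The sketch below uses invented counts and a hypothetical `has_drain` field purely for illustration. It does not reproduce results from any paper.

```python
def sensitivity(cases):
    """Fraction of true positives the algorithm flagged; None if there are none."""
    positives = [c for c in cases if c["label"] == 1]
    if not positives:
        return None
    return sum(c["predicted"] for c in positives) / len(positives)


def sensitivity_by_subgroup(cases, key):
    """Compute sensitivity separately for each value of the field `key`."""
    groups = {}
    for c in cases:
        groups.setdefault(c[key], []).append(c)
    return {name: sensitivity(group) for name, group in groups.items()}


# Made-up pneumothorax cases: 80 with a drain (78 detected),
# 20 without a drain (8 detected)
cases = (
    [{"label": 1, "has_drain": True, "predicted": 1}] * 78
    + [{"label": 1, "has_drain": True, "predicted": 0}] * 2
    + [{"label": 1, "has_drain": False, "predicted": 1}] * 8
    + [{"label": 1, "has_drain": False, "predicted": 0}] * 12
)

overall = sensitivity(cases)                            # 86 / 100
by_drain = sensitivity_by_subgroup(cases, "has_drain")  # 78 / 80 and 8 / 20
print(overall, by_drain)
```

In this invented example the overall sensitivity of 86% hides a sensitivity of only 40% in the untreated subgroup, which is the subgroup that matters most clinically.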

   4. Plausibility and assumption 

The training procedure should also be assessed for its plausibility. Staying on the topic of pneumothorax, many papers describe AI algorithms that use 256×256 images to detect pneumothorax on chest radiographs[4]. This is implausible from a medical perspective, as the radiological signs of pneumothorax are often very subtle and require high-resolution images to detect. It suggests that the AI algorithm may be using other features, such as the presence of an intercostal drain, to identify the pneumothorax instead of the salient features of the pneumothorax itself.

   5. Baseline 

Establish a robust baseline against which the AI algorithm can be compared, and a clinically valid metric by which such a comparison can be made. A common metric used in diagnostic radiology is the area under the receiver operating characteristic curve (AUC), which measures the ability of the AI algorithm to discriminate between positive and negative cases. An AUC of 0.5 indicates that the algorithm is no better than random guessing, and an AUC of 1.0 indicates perfect discrimination. It is less obvious, however, what a “good” or “acceptable” AUC is, as different modalities and diseases will have different diagnostic performance.
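
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The short sketch below computes it directly from that definition, counting ties as half. The function name and scores are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts as 1 if the positive case has the higher score and as 0.5
    if the two scores are tied.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


print(auc([0.9, 0.8], [0.2, 0.1]))            # perfect separation: 1.0
print(auc([0.5, 0.5], [0.5, 0.5]))            # uninformative scores: 0.5
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8 of 9 pairs ranked correctly
```

Because the AUC depends only on how cases are ranked, it says nothing about which operating threshold would be used in practice, which is one reason a single AUC value is hard to judge in isolation.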

A common baseline used in many studies is that the AI algorithm should be non-inferior to the current standard of care, which is typically a radiologist or clinician evaluating the study without AI. Such a baseline may be established retrospectively or prospectively. Retrospective baselines may underestimate true clinical performance, as readers may be less motivated than in clinical practice, or conversely may overestimate it if readers are primed to search for the pathology being evaluated.
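
A non-inferiority comparison can be sketched as follows, assuming the AI algorithm and the reader have each been scored as correct or incorrect on the same set of cases. Everything here is an illustrative assumption rather than a prescribed method: the use of accuracy as the metric, the percentile bootstrap, the 5 percentage point margin and the function names. A real study would pre-specify its metric and margin.

```python
import random


def accuracy_difference_ci(ai_correct, reader_correct,
                           n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for (AI accuracy - reader accuracy).

    Both inputs are equal-length lists of 0/1 flags, one entry per case.
    Cases are resampled as pairs, which preserves the paired design.
    """
    if len(ai_correct) != len(reader_correct) or not ai_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(ai_correct)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(ai_correct[i] - reader_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_non_inferior(ai_correct, reader_correct, margin=0.05, **kwargs):
    """True if the lower bound of the difference lies above -margin."""
    lower, _ = accuracy_difference_ci(ai_correct, reader_correct, **kwargs)
    return lower > -margin


# Made-up example: the AI agrees with the reader on every case
reader = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(is_non_inferior(reader, reader))
```

The key design point is that the conclusion rests on the lower bound of the interval, not on the point estimate: an algorithm with slightly higher observed accuracy can still fail a non-inferiority test if the study is too small to rule out a meaningful deficit.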



Many of these principles are discussed in existing checklists such as CLAIM [5], and this article only touches on the basics of evaluating such papers. As the field progresses and matures, the evidence base will grow in size and complexity, but radiologists and clinicians will always hold ultimate responsibility for using AI algorithms safely in clinical practice.



1.  Cohen JF, Korevaar DA, Altman DG, et al (2016) STARD 2015 guidelines for reporting diagnostic accuracy studies: Explanation and elaboration. BMJ Open 6:e012799. https://doi.org/10.1136/bmjopen-2016-012799 

2.  Roberts M, Driggs D, Thorpe M, et al (2021) Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 3:199–217. https://doi.org/10.1038/s42256-021-00307-0 

3.  Oakden-Rayner L, Dunnmon J, Carneiro G, Re C (2020) Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In: ACM CHIL 2020 – Proceedings of the 2020 ACM Conference on Health, Inference, and Learning. Association for Computing Machinery, Inc, pp 151–159 

4.  Rajpurkar P, Irvin J, Zhu K, et al (2017) CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning 

5.  Mongan J, Moy L, Kahn CE (2020) Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell 2:e200029. https://doi.org/10.1148/ryai.2020200029 

About Dr Jarrel Seah | Associate Clinical AI Director

Dr. Jarrel Seah is a radiology registrar at the Alfred Hospital, Melbourne and an AI engineer at annalise.ai. He has a keen interest in developing and applying novel deep learning algorithms in radiology and particularly in explainable AI. He has published and presented in both medical and technical conferences such as SPIE and Radiology, and is a coordinator of the inaugural RANZCR Catheter and Line Position Kaggle challenge.

Named in Forbes 30 Under 30, he co-developed Eyenaemia in 2014 while studying medicine at Monash University. The app, which allows people to use their cell phone to screen for anaemia, won the World Championship and World Citizenship Competition at the Imagine Cup, an annual competition run by Microsoft.

Jarrel provides clinical and technical guidance in the development and validation of AI models at Annalise.ai and Harrison.ai, and undertakes cutting-edge research within the field of deep learning in radiology.
