The Annalise CXR validation study, published in The Lancet Digital Health, describes the model's development, the curation and labelling of the training data, the curation of the test set, and the multi-reader, multi-case (MRMC) analysis used to validate model performance.
Annalise CXR is the most comprehensive AI model tested to date, developed to identify 127 findings on chest x-rays (CXR). The findings in the ontology tree of the underlying AI model were carefully collated to represent those encountered in the vast majority of emergent, urgent, and non-urgent settings where CXR would be performed. They range from technical factor assessment through to the most time-sensitive findings of pneumothorax, pneumoperitoneum and acute bony trauma.
The training dataset incorporated over 800,000 CXR images, each hand-labelled multiple times by radiologists using a well-described, well-defined ontology tree. A held-out, enriched set of images was curated for testing, resulting in 2,568 CXR studies (comprising 4,568 images) obtained from Australia and the US, which formed the MRMC test set. Each test set study comprised at least one frontal CXR and, optionally, one lateral projection. Each study was ground-truthed by three subspecialist chest radiologists, with consensus obtained on each finding for each study.
The standalone performance of the Annalise CXR model against the ground-truthed test set labels was assessed on a per-study basis for each finding. Performance was measured with the area under the curve (AUC), a commonly used summary statistic of diagnostic accuracy that is independent of the prevalence of the finding in the test set. The Matthews correlation coefficient (MCC) was also used as a secondary analysis metric for similar reasons. The low prevalence of several findings precluded their assessment in the dataset, reducing the number of assessable findings to 124.
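To make the two metrics concrete, the following is a minimal sketch in plain Python of how AUC and MCC could be computed for a single finding. It is not the study's code: the function names, the toy labels and scores, and the 0.5 operating threshold are all illustrative assumptions.

```python
from math import sqrt

def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive study scores
    higher than a randomly chosen negative one (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # A finding with no positive (or no negative) studies cannot be assessed.
        raise ValueError("AUC needs at least one positive and one negative study")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(labels, predictions):
    """Matthews correlation coefficient from binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data for one hypothetical finding: 1 = present in the ground truth.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]  # illustrative model outputs
threshold = 0.5  # hypothetical operating point, needed only for MCC
predictions = [int(s >= threshold) for s in scores]

print(f"AUC = {auc(labels, scores):.3f}")
print(f"MCC = {mcc(labels, predictions):.3f}")
```

AUC is computed from the ranking of scores alone, whereas MCC requires the scores to be binarised at a chosen threshold. The error raised for a finding with no positive studies illustrates, in the extreme case, why very low prevalence can prevent a finding from being assessed.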
In the MRMC protocol, 20 radiologists each read the test set studies and rated each finding on a scale from 1 to 5, without access to the model output. After a suitably long washout period, the same radiologists re-read the test set cases in a different randomised order, this time with access to the model output. For each finding, the diagnostic accuracy of the radiologist group was then compared between the assisted and unassisted reads across all studies in the test set.
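The assisted-versus-unassisted comparison for a single finding can be sketched as below. This is a simplified illustration only: the reader names and ratings are invented, and a plain average of per-reader AUC differences stands in for a dedicated MRMC analysis, which would additionally model reader and case variance. The statistical method the study actually used is not described here.

```python
from statistics import mean

def auc(labels, ratings):
    """Rank-based AUC; works directly on ordinal 1-5 ratings (ties count half)."""
    pos = [r for y, r in zip(labels, ratings) if y == 1]
    neg = [r for y, r in zip(labels, ratings) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_gain_per_reader(labels, unassisted, assisted):
    """Return (mean gain, per-reader gains) in AUC from unassisted to assisted reads.

    `unassisted` and `assisted` map a reader id to that reader's ratings,
    one rating per study, in the same order as `labels`.
    """
    gains = {
        reader: auc(labels, assisted[reader]) - auc(labels, unassisted[reader])
        for reader in unassisted
    }
    return mean(gains.values()), gains

# Toy example: four studies, two hypothetical readers, one finding.
labels = [1, 1, 0, 0]
unassisted = {"reader_a": [4, 2, 3, 1], "reader_b": [5, 3, 3, 2]}
assisted = {"reader_a": [5, 4, 3, 1], "reader_b": [5, 4, 2, 1]}

mean_gain, gains = auc_gain_per_reader(labels, unassisted, assisted)
print(gains)      # per-reader change in AUC
print(mean_gain)  # average change across readers
```

Because each radiologist reads the same cases in both sessions, the comparison is paired per reader, which is why the washout period and the re-randomised reading order matter.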
Compared with unassisted radiologist performance, assisted radiologist performance showed significant improvements across 102 CXR findings, was statistically non-inferior in 19 findings, and showed no significant decrement in performance for the remaining findings. Furthermore, direct comparison of model performance with unassisted radiologist performance indicated that the model AUC was non-inferior to that of radiologists for all findings and statistically superior for 117 findings. These results were obtained in a non-clinical setting, and further research is necessary to assess whether they generalise to a clinical setting.
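One common way to sort findings into categories such as "improved" and "non-inferior" is to examine a confidence interval for the difference in AUC (assisted minus unassisted) against a pre-specified margin. The sketch below illustrates that general idea only; the decision rule, the function name and the 0.05 margin are assumptions for illustration and are not taken from the study.

```python
def classify_finding(ci_low, ci_high, margin=0.05):
    """Classify one finding from a confidence interval on the AUC difference
    (assisted minus unassisted). `margin` is a hypothetical non-inferiority margin.
    """
    if ci_low > 0:
        return "improved"        # whole interval lies above zero
    if ci_low > -margin:
        return "non-inferior"    # any loss is smaller than the margin
    if ci_high < 0:
        return "decrement"       # whole interval lies below zero, beyond the margin
    return "inconclusive"        # interval too wide to support either claim

# Illustrative intervals, not study data.
examples = {
    "finding_1": (0.02, 0.08),
    "finding_2": (-0.01, 0.03),
    "finding_3": (-0.08, 0.02),
}
for name, (low, high) in examples.items():
    print(name, classify_finding(low, high))
```

Under this rule, a finding is reported as non-inferior when the data rule out a loss larger than the margin, even if an improvement could not be demonstrated.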
Whilst this study was performed in a non-clinical environment, it demonstrates the ability of a comprehensive CXR AI model to achieve excellent diagnostic performance across a wide variety of findings compared with radiologists, and to improve radiologist performance over a similarly wide range of findings. Thorough, high-quality dataset labelling and carefully considered ontology tree development are integral to this process. A robust validation methodology, in this case a multi-reader, multi-case study design, is vital in establishing the model’s effect on radiologists. In this study, radiologist performance improved for 102 clinical findings, and the model detected pathology with an AUC superior to that of radiologists in 117 findings. Validation of these non-clinical results in the clinical sphere is ongoing, as are validations for specific use cases, populations and disease-specific cohorts. We eagerly await further literature on the real-world performance of this model.
Further details of the study results can be found in The Lancet Digital Health.