Radiology. First published online 26 September 2023.
Authors: Plesner LL, Müller FC, Brejnebøl MW, Laustrup LC, Rasmussen F, Nielsen OW, Boesen M & Anderson MB
Commercially available artificial intelligence (AI) tools can assist radiologists in interpreting chest radiographs, but their real-life diagnostic accuracy remains unclear.
To evaluate the diagnostic accuracy of four commercially available AI tools for detection of airspace disease, pneumothorax, and pleural effusion on chest radiographs.
Materials and Methods
This retrospective study included consecutive adult patients who underwent chest radiography at one of four Danish hospitals in January 2020. Two thoracic radiologists (or three, in cases of disagreement) who had access to all previous and future imaging labeled chest radiographs independently for the reference standard. Area under the receiver operating characteristic curve, sensitivity, and specificity were calculated. Sensitivity and specificity were additionally stratified according to the severity of findings, number of findings on chest radiographs, and radiographic projection. The χ2 and McNemar tests were used for comparisons.
The data set comprised 2040 patients (median age, 72 years [IQR, 58–81 years]; 1033 female), of whom 669 (32.8%) had target findings. The AI tools demonstrated areas under the receiver operating characteristic curve ranging 0.83–0.88 for airspace disease, 0.89–0.97 for pneumothorax, and 0.94–0.97 for pleural effusion. Sensitivities ranged 72%–91% for airspace disease, 63%–90% for pneumothorax, and 62%–95% for pleural effusion. Negative predictive values ranged 92%–100% for all target findings. In airspace disease, pneumothorax, and pleural effusion, specificity was high for chest radiographs with normal or single findings (range, 85%–96%, 99%–100%, and 95%–100%, respectively) and markedly lower for chest radiographs with four or more findings (range, 27%–69%, 96%–99%, 65%–92%, respectively) (P < .001). AI sensitivity was lower for vague airspace disease (range, 33%–61%) and small pneumothorax or pleural effusion (range, 9%–94%) compared with larger findings (range, 81%–100%; P value range, > .99 to < .001).
Current-generation AI tools showed moderate to high sensitivity for detecting airspace disease, pneumothorax, and pleural effusion on chest radiographs. However, they produced more false-positive findings than radiology reports, and their performance decreased for smaller-sized target findings and when multiple findings were present.