
Abstract

Purpose: To validate the performance of autonomous diabetic retinopathy (DR) grading by comparing a human grader and a self‐developed deep‐learning (DL) algorithm with gold‐standard evaluation.

Methods: We included 500 six‐field retinal images graded by an expert ophthalmologist (gold standard) according to the International Clinical Diabetic Retinopathy Disease Severity Scale, represented as DR levels 0–4 (97, 100, 100, 103 and 100 images, respectively). Weighted kappa was calculated to measure DR classification agreement with the gold standard for three models: (1) a certified human grader without assistance, (2) a certified human grader with assistance from a DL algorithm and (3) the DL algorithm operating autonomously. Using any DR (level 0 vs. 1–4) as a cutoff, we calculated sensitivity, specificity, and positive and negative predictive values (PPV and NPV). Finally, we assessed lesion discrepancies between Model 3 and the gold standard.

Results: Compared with the gold standard, weighted kappa for Models 1–3 was 0.88, 0.89 and 0.72, sensitivities were 95%, 94% and 78%, and specificities were 82%, 84% and 81%. Extrapolating to a real‐world DR prevalence of 23.8%, the PPVs were 63%, 64% and 57% and the NPVs were 98%, 98% and 92%. Discrepancies between the gold standard and Model 3 were mainly incorrect detection of artefacts (n = 49), missed microaneurysms (n = 26) and inconsistencies between the segmentation and classification (n = 51).

Conclusion: While the autonomous DL algorithm for DR classification performed on par with a human grader only for some measures in a high‐risk population, extrapolation to a real‐world population demonstrated an excellent 92% NPV, which could make it clinically feasible to use autonomously to identify non‐DR patients.
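The extrapolation of PPV and NPV to a 23.8% DR prevalence can be illustrated with Bayes' rule. The sketch below is not the authors' code: it assumes the standard prevalence-adjusted formulas and uses the rounded sensitivities and specificities quoted in the abstract, so its output may differ from the reported PPV and NPV by about one percentage point. The function name `predictive_values` and the dictionary `MODELS` are hypothetical names chosen for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence.

    Standard Bayes' rule formulas (an assumption; the abstract does not
    state how the extrapolation was computed):
      PPV = sens*prev / (sens*prev + (1 - spec)*(1 - prev))
      NPV = spec*(1 - prev) / (spec*(1 - prev) + (1 - sens)*prev)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Real-world DR prevalence quoted in the abstract.
PREVALENCE = 0.238

# Rounded (sensitivity, specificity) pairs quoted in the abstract.
MODELS = {
    "Model 1 (human grader)": (0.95, 0.82),
    "Model 2 (human grader + DL)": (0.94, 0.84),
    "Model 3 (autonomous DL)": (0.78, 0.81),
}

for name, (sens, spec) in MODELS.items():
    ppv, npv = predictive_values(sens, spec, PREVALENCE)
    print(f"{name}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Because only about one in four patients has DR at this prevalence, true negatives far outnumber false negatives, which is why the NPV stays high even at the lower sensitivity of Model 3.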
validation, Male, decision support, Diabetic Retinopathy, automated classification, Reproducibility of Results, Middle Aged, deep-learning, Diabetic Retinopathy/diagnosis, Severity of Illness Index, Retina, diabetic retinopathy, Deep Learning, Humans, Original Article, Female, Retina/diagnostic imaging, Algorithms, Aged, Retrospective Studies
