Code [GitHub]
Paper [arXiv]
Dataset [HuggingFace]
Overall, the proposed dataset contains 39,026 cases comprising 192,675 images from 9 diverse imaging modalities and 7 human anatomical regions. Note that each case may contain images from multiple scans. The data covers 5,568 different disorders, which have been manually mapped to 930 ICD-10-CM codes.
Specifically, cases in our dataset are sourced from the Radiopaedia website, a growing peer-reviewed educational radiology resource that allows clinicians to upload 3D volumes to better reflect real clinical scenarios. Additionally, all privacy issues have already been resolved by the clinicians at upload time.
For each case in Radiopaedia, the 'Related Radiopaedia articles' section contains links to related articles named after the disorders shown in the radiology images. These articles are treated as diagnosis labels and have been meticulously peer-reviewed by experts on the Radiopaedia Editorial Board.
After filtering articles, manually mapping disorders, and adding normal cases, we obtain 39,026 cases containing 192,675 images, labeled with 5,568 disorder classes and 930 ICD-10-CM classes. We will continue to maintain the dataset and grow the number of cases.
Analysis of the Cases in RP3D-DiagDS dataset
RP3D-DiagDS comprises images from 9 modalities, namely, computed tomography (CT), magnetic resonance imaging (MRI), X-ray, Ultrasound, Fluoroscopy, Nuclear medicine, Mammography, DSA (angiography), and Barium Enema. Each case may include images from multiple modalities, to ensure precise and comprehensive diagnosis of disorders. Overall, approximately 19.4% of the cases comprise images from two modalities, while around 2.9% involve images from three to five modalities. The remaining cases are associated with image scans from a single modality.
RP3D-DiagDS comprises images from various anatomical regions, including head and neck, spine, chest, breast, abdomen and pelvis, upper limb, and lower limb, providing comprehensive coverage of the entire human body.
For both disorder and disease classification, each case can correspond to multiple disorders, making RP3D-DiagDS a long-tailed, multi-label classification dataset. We define the 'head class' category as classes with more than 100 cases, the 'body class' category as classes with between 30 and 100 cases, and the 'tail class' category as classes with fewer than 30 cases.
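The head/body/tail split described above can be illustrated with a minimal sketch. The function name `split_by_frequency` and the input layout (one list of labels per case) are hypothetical and not taken from the released code; the sketch also assumes the 30 and 100 boundaries of the 'body' range are inclusive.

```python
from collections import Counter


def split_by_frequency(case_labels, head_min=100, tail_max=30):
    """Group classes into head/body/tail by the number of cases per class.

    case_labels: iterable of label lists, one list per case (multi-label).
    Assumption: a class with exactly 30 or exactly 100 cases counts as 'body'.
    """
    # Count each class at most once per case.
    counts = Counter(label for labels in case_labels for label in set(labels))
    groups = {"head": [], "body": [], "tail": []}
    for label, n in sorted(counts.items()):
        if n > head_min:
            groups["head"].append(label)
        elif n >= tail_max:
            groups["body"].append(label)
        else:
            groups["tail"].append(label)
    return groups
```

Because a case may carry several labels, a single case can contribute to the counts of a head class and a tail class at the same time.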
R1: Classification results on Disorders and ICD-10-CM levels.
In the table, "FM" denotes the fusion module and "KE" denotes the knowledge enhancement strategy. We report the results on the Head/Medium/Tail class sets separately.
R2: ROC curves on Disorders and ICD-10-CM.
In the ROC curves above, the shaded region shows the 95% confidence interval (CI); FM and KE are short for Fusion Module and Knowledge Enhancement.
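For readers unfamiliar with such intervals, the sketch below shows one common way to obtain a 95% CI for an AUC score: a percentile bootstrap over test cases. This is an assumption made for illustration only; the page does not state how the intervals in the figure were computed, and the function names `auc` and `bootstrap_auc_ci` are hypothetical. The sketch also covers the scalar AUC of a single class rather than the full ROC curve.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC (illustrative only)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        # Resample cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

In a multi-label setting, such an interval would be computed per class and then summarized over the class set of interest.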
R3: The AUC Score Comparison on Various External Datasets.
For each dataset, we run experiments with different portions of the training data, denoted as 1% to 100% in the table. For example, 30% means that we use 30% of the data in the downstream training set, either to fine-tune our model or to train from scratch. "SOTA" denotes the best performance of prior work (cited with the corresponding reference) on each dataset, and "Zero-shot" denotes directly evaluating our model on the external datasets. The gap between our model and training from scratch is shown as a subscript beside the up arrows in the table.
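As a rough illustration of the data-portion protocol, the sketch below draws a fixed-seed random subset of a downstream training set. The helper `sample_portion` is hypothetical, and uniform random sampling is an assumption; the page does not say how the subsets were selected (for instance, whether they were stratified by class).

```python
import random


def sample_portion(train_ids, portion, seed=0):
    """Return a reproducible random subset holding `portion` of the training IDs.

    portion: fraction in (0, 1], e.g. 0.3 for the 30% setting.
    Assumption: uniform sampling without replacement.
    """
    if not 0 < portion <= 1:
        raise ValueError("portion must be in (0, 1]")
    ids = list(train_ids)
    k = max(1, round(len(ids) * portion))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))
```

Using the same subset both for fine-tuning and for training from scratch keeps the two settings comparable at each portion.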
For more detailed ablation studies and results, please refer to our paper.
@article{zheng2023large,
title={Large-scale Long-tailed Disease Diagnosis on Radiology Images},
author={Zheng, Qiaoyu and Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang,
Ya and Wang, Yanfeng and Xie, Weidi},
journal={arXiv preprint arXiv:2312.16151},
year={2023}
}