Accurate diagnosis remains a central challenge for medical large language models due to inherent knowledge limitations and hallucinations. While retrieval-augmented generation (RAG) and tool-augmented agentic methods show potential for mitigating these issues, they remain limited by suboptimal use of external knowledge and by the decoupling of retrieval feedback from reasoning traces, a consequence of insufficient supervision.
Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis.
LLMs do not know how to retrieve effectively from large-scale, complex, and long-tailed real-world medical corpora.
They follow rigid inference-only workflows and depend on manually defined queries, lacking adaptive mechanisms to incorporate retrieval feedback into reasoning.
Comprehensive Medical Retrieval Corpus: We build the most comprehensive medical diagnostic retrieval corpus to date, integrating patient records, clinical guidelines, and general medical knowledge, forming the foundation for diagnostic agentic RAG.
End-to-End RL Training: Treating the LLM as the agentic core and the corpus as the environment, we conduct end-to-end RL training to optimize the RAG policy. We design soft verifiable rewards that guide both reasoning steps and final diagnoses.
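The exact reward formula is not given here, but the idea of a "soft verifiable reward" that supervises both reasoning steps and the final diagnosis can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name, the reciprocal-rank credit for the final answer, the `step_valid_fraction` process signal, and the blending weight `alpha` are all hypothetical, not the paper's actual design.

```python
def soft_diagnosis_reward(predicted_ranking, target, step_valid_fraction, alpha=0.5):
    """Hypothetical soft verifiable reward for an agentic RAG rollout.

    predicted_ranking: ranked list of candidate diagnoses from the agent
    target: ground-truth diagnosis string
    step_valid_fraction: fraction of intermediate steps (e.g. retrieval
        queries, evidence citations) that pass a format/validity check
    alpha: weight on the final-answer component vs. the process component
    """
    # Final-answer component: reciprocal rank of the target diagnosis in
    # the predicted differential (0 if absent) -- softer than exact match,
    # so near-misses still receive a gradient signal.
    normalized = [p.strip().lower() for p in predicted_ranking]
    try:
        rank = normalized.index(target.strip().lower()) + 1
        answer_reward = 1.0 / rank
    except ValueError:
        answer_reward = 0.0

    # Blend outcome supervision with process supervision, so the policy
    # is rewarded for valid intermediate reasoning, not only the answer.
    return alpha * answer_reward + (1.0 - alpha) * step_valid_fraction
```

Under this sketch, a rollout that ranks the correct diagnosis second with fully valid intermediate steps still earns substantial reward, whereas a target-only scheme would score it the same as a rollout with malformed retrieval queries.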
In comparative studies against eight state-of-the-art methods, Deep-DxSearch surpasses the strongest prior RAG design, powered by the commercial LLM GPT-4o, by 23.62% in rare disease diagnosis, despite using a much smaller agentic LLM.
Ablation studies highlight two key aspects: the effectiveness of our reward design and the contribution of our curated retrieval corpus. Our reward design for co-optimizing retrieval and reasoning policies yields a 17% improvement in top-1 accuracy for common diseases and 22% for rare diseases over a target-only supervision scheme.
Interpretability analysis of the learned RAG policy further quantifies how agents evolve during training across three critical dimensions: retrieval relevance, differential diagnosis, and irrelevance exclusion.
As highlighted in the well-known "Bitter Lesson" of AI research: while human knowledge and handcrafted strategies may offer short-term gains, long-term advances depend on exploiting statistical regularities from large-scale data.