Accurate diagnosis remains a central challenge for medical large language models due to inherent knowledge limitations and hallucinations. While retrieval-augmented generation (RAG) and tool-augmented agentic methods show potential for mitigating these issues, they remain limited by suboptimal use of external knowledge and by the decoupling of retrieval feedback from reasoning traces, a consequence of insufficient supervision.
Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis.
LLMs do not know how to retrieve effectively from large-scale, complex, and long-tailed real-world medical corpora.
They follow rigid inference-only workflows and depend on manually defined queries, lacking adaptive mechanisms to incorporate retrieval feedback into reasoning.
Comprehensive Medical Retrieval Corpus: We build the most comprehensive medical diagnostic retrieval corpus to date, integrating patient records, clinical guidelines, and general medical knowledge, forming the foundation for diagnostic agentic RAG.
End-to-End RL Training: Treating the LLM as the agentic core and the corpus as the environment, we conduct end-to-end RL training to optimize the RAG policy. We design soft verifiable rewards that guide both reasoning steps and final diagnoses.
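The exact reward formula is not given here, but the idea of a "soft verifiable reward" that supervises both reasoning steps and the final diagnosis can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name, the reciprocal-rank credit for the final answer, the `step_valid_fraction` process signal, and the blending weight `alpha` are all hypothetical, not the paper's actual design.

```python
def soft_diagnosis_reward(predicted_ranking, target, step_valid_fraction, alpha=0.5):
    """Hypothetical soft verifiable reward for an agentic RAG rollout.

    predicted_ranking: ranked list of candidate diagnoses from the agent
    target: ground-truth diagnosis string
    step_valid_fraction: fraction of intermediate steps (e.g. retrieval
        queries, evidence citations) that pass a format/validity check
    alpha: weight on the final-answer component vs. the process component
    """
    # Final-answer component: reciprocal rank of the target diagnosis in
    # the predicted differential (0 if absent) -- softer than exact match,
    # so near-misses still receive a gradient signal.
    normalized = [p.strip().lower() for p in predicted_ranking]
    try:
        rank = normalized.index(target.strip().lower()) + 1
        answer_reward = 1.0 / rank
    except ValueError:
        answer_reward = 0.0

    # Blend outcome supervision with process supervision, so the policy
    # is rewarded for valid intermediate reasoning, not only the answer.
    return alpha * answer_reward + (1.0 - alpha) * step_valid_fraction
```

Under this sketch, a rollout that ranks the correct diagnosis second with fully valid intermediate steps still earns substantial reward, whereas a target-only scheme would score it the same as a rollout with malformed retrieval queries.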
In comparative studies against eight state-of-the-art methods, Deep-DxSearch surpasses the strongest prior RAG design, powered by the commercial LLM GPT-4o, by 23.62% in rare disease diagnosis, despite using a much smaller agentic LLM.
Ablation studies highlight two key aspects: the effectiveness of our reward design and the contribution of our curated retrieval corpus. Our reward design for co-optimizing retrieval and reasoning policies yields a 17% improvement in top-1 accuracy for common diseases and 22% for rare diseases over a target-only supervision scheme.
Interpretability analysis of the learned RAG policy further quantifies how agents evolve during training across three critical dimensions: retrieval relevance, differential diagnosis, and irrelevance exclusion.
As highlighted in the well-known "Bitter Lesson" of AI research: while human knowledge and handcrafted strategies may offer short-term gains, long-term advances depend on exploiting statistical regularities from large-scale data.