RAG in 2025-2026 · State of the Art

01 - Overview

The Big Shift

RAG has graduated from a 2023 hack to the dominant enterprise LLM architecture - now being reshaped by reasoning agents.

The defining shift of 2025: systems such as Search-R1 and OpenAI Deep Research replace static "retrieve-then-generate" pipelines with RL-trained reasoning agents that decide when, what, and how to retrieve, while LazyGraphRAG makes graph-based retrieval far cheaper to index. RAG didn't die when context windows grew - it specialized.

Adoption Trajectory

[Chart: production RAG adoption in 2023 and 2024, with early adopters (Snowflake); values not recoverable]

Why RAG Persists vs Long-Context

Long-context windows of 1-10M tokens reignited the "RAG is dead" debate. The LaRA benchmark (ICML 2025) tested 2,326 cases across 11 LLMs and concluded that neither RAG nor long-context is a silver bullet. RAG dominates when corpora exceed 2M tokens, when freshness matters, or when source attribution is required; long-context is also 8-82x more expensive at scale.

Key Timeline

RAG Evolution

2020

Original RAG Paper

Lewis et al. (Facebook AI) introduce Retrieval-Augmented Generation for knowledge-intensive NLP tasks. Retriever + seq2seq generator as a single system.

2022-23

Pipeline RAG Goes Mainstream

LangChain, LlamaIndex, and vector databases (Pinecone, Weaviate) democratize embedding-based retrieval. HyDE and Self-RAG introduce query augmentation and self-reflection.

2024 H1

GraphRAG & CRAG

Microsoft Research releases GraphRAG (arXiv:2404.16130) with knowledge-graph community summaries. Corrective RAG adds retrieval quality scoring and web-search fallback.

2024 Q3

Modular RAG + Multimodal

ColPali eliminates OCR for document retrieval. The Modular RAG paper formalizes LEGO-like orchestration. Anthropic's Contextual Retrieval reduces retrieval failures by up to 67% when combined with reranking.

2025

Agentic RL-Trained Retrieval

Search-R1, ReSearch, and DeepResearcher use PPO/GRPO to train LLMs to reason and search jointly. OpenAI Deep Research scores 26.6% on Humanity's Last Exam. LazyGraphRAG cuts GraphRAG cost to 0.1%.

2026

Hybrid Routing Becomes Standard

The productive question shifts from "RAG or not?" to "what mixture of cached context, sparse/dense retrieval, graph traversal, and agentic search does this query class need?"

02 - Techniques

The 8 Defining Methods


Graph-Based

GraphRAG

Extracts entity-relationship knowledge graphs via Leiden community detection, generating hierarchical summaries for global "sensemaking" queries.

80% accuracy (vs ~50% baseline)

Microsoft Research (arXiv:2404.16130). On global queries over million-token corpora, GraphRAG wins 70-80% of head-to-head comparisons. Trade-off: 100-1,000x more LLM calls than vector RAG. GraphRAG-Bench (ICLR'26) found only +4.5% on HotpotQA at 2.3x higher latency - suggesting it shines on complex relational queries, not simple lookups.

Graph-Based · Efficient

LazyGraphRAG

Cuts full GraphRAG indexing cost to just 0.1% while achieving 100% win rate (96/96 queries) versus vector RAG, RAPTOR, and LightRAG.

0.1% of GraphRAG cost

Announced by Microsoft Research in November 2024, with further benchmark results published in June 2025. Lazy evaluation defers expensive community summarization until query time, building only what a query needs. This matters most for enterprise deployments where full GraphRAG indexing was previously cost-prohibitive. Open-source HippoRAG 2 and LightRAG achieve similar quality at 10-30x lower cost and 6-13x lower latency via Personalized PageRank over LLM-extracted graphs.

Agentic

Agentic RAG

LLM agents with reflection, planning, and tool-use decide when and how to retrieve - replacing fixed pipelines with dynamic multi-step reasoning.

78% vs 34% on complex queries

Singh et al. survey (arXiv:2501.09136) and Li et al. (arXiv:2507.09477) document the shift. Agentic iterative RAG beats Iter-RetGen by +8.3 F1 on HotpotQA. Production frameworks: LangGraph, LlamaIndex Agents, Microsoft Semantic Kernel, Hugging Face smolagents. Deployed at ServiceNow (P1 incident management) and Workday HR.
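The decide-then-retrieve loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the API of any framework named here: `agent_policy`, `search`, and the in-memory `CORPUS` are hypothetical stand-ins for an LLM planner and a real retriever.

```python
from typing import Callable

# Hypothetical toy corpus; a real system would query a vector or keyword index.
CORPUS = {
    "leiden": "GraphRAG groups entities with Leiden community detection.",
    "crag": "CRAG falls back to web search when retrieval looks wrong.",
}

def search(query: str) -> list[str]:
    """Toy retriever: return documents whose key appears in the query."""
    return [text for key, text in CORPUS.items() if key in query.lower()]

def agent_policy(question: str, evidence: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM decision step (assumption: rule-based here).

    Returns ("search", query) to retrieve more, or ("answer", text) to stop.
    """
    if not evidence:
        return "search", question
    return "answer", evidence[0]

def agentic_rag(question: str, policy: Callable = agent_policy, max_steps: int = 3) -> str:
    """Loop until the policy answers, retrieval comes back empty, or steps run out."""
    evidence: list[str] = []
    for _ in range(max_steps):
        action, payload = policy(question, evidence)
        if action == "answer":
            return payload
        hits = search(payload)
        if not hits:
            break  # nothing retrievable; stop rather than loop forever
        evidence.extend(hits)
    return evidence[0] if evidence else "I could not find supporting evidence."
```

The step budget (`max_steps`) is the main design choice: it bounds cost and latency, which a fixed pipeline gets for free but an agent has to enforce explicitly.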

Self-Improving

Self-RAG

Trains an LM to emit reflection tokens (Retrieve, IsRel, IsSup, IsUse) to adaptively decide when to retrieve and self-critique outputs.

55.8% PopQA vs 14.7% Llama2

Asai et al. (ICLR 2024, arXiv:2310.11511). A 13B Self-RAG model dramatically outperforms same-size baselines. Hallucinations drop from 15-20% to just 2%. The model trains end-to-end with special reflection tokens in vocabulary. Key limitation: tied to specific model architecture, unlike CRAG which is model-agnostic.

Corrective · Self-Improving

CRAG

A lightweight 0.77B T5-Large evaluator classifies retrievals as Correct/Incorrect/Ambiguous, triggering web-search fallback when retrieval fails.

+36.6% on PubHealth

Yan et al. (arXiv:2401.15884, ICLR 2024). Results: +19% PopQA, +14.9% FactScore on Biography, +36.6% PubHealth, +8.1% ARC-Challenge vs baseline RAG. Unlike Self-RAG, CRAG remains effective when the underlying LLM is swapped - making it the most enterprise-practical self-correcting approach. The LangGraph CRAG tutorial is the most-replicated 2025 RAG pattern.
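The Correct/Incorrect/Ambiguous gate can be sketched as a thresholding step over evaluator scores. The threshold values, the function names, and the `web_search` callable below are assumptions for illustration; the paper's knowledge-refinement step is omitted.

```python
from typing import Callable

def classify_retrieval(scores: list[float], upper: float = 0.5, lower: float = -0.5) -> str:
    """Map per-document evaluator scores to a retrieval verdict.

    Assumed rule: any score above `upper` means correct, all scores below
    `lower` means incorrect, anything else is ambiguous.
    """
    if any(s > upper for s in scores):
        return "correct"
    if all(s < lower for s in scores):
        return "incorrect"
    return "ambiguous"

def select_context(
    docs: list[str],
    scores: list[float],
    web_search: Callable[[], list[str]],
    upper: float = 0.5,
    lower: float = -0.5,
) -> list[str]:
    """Choose which evidence reaches the generator."""
    verdict = classify_retrieval(scores, upper, lower)
    if verdict == "correct":
        return docs                      # trust the retrieved documents
    if verdict == "incorrect":
        return web_search()              # discard them and fall back to the web
    return docs + web_search()           # ambiguous: combine both sources
```

Because the gate only looks at scores, the generator behind it can be swapped without retraining anything, which is the model-agnostic property noted above.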

Query Augmentation

HyDE

Generates a hypothetical answer document from the query and embeds that for retrieval - matching fine-tuned retrievers with zero training data.

Zero-shot, no relevance labels

Gao et al. (ACL 2023, arXiv:2212.10496). HyDE matches fine-tuned dense retrievers on TREC DL19/20, BEIR, and Mr. TyDi without any labeled data. Now standard in LangChain, LlamaIndex, Haystack, and Milvus. Trade-off: adds 25-60% latency due to extra generation step. Particularly powerful when the query is short and the relevant documents are verbose.
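A toy version of the HyDE flow: draft a hypothetical answer, embed it, and retrieve with that vector instead of the query's. The bag-of-words `embed` and the canned `generate_hypothetical_doc` are assumptions standing in for a dense encoder and an LLM call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_hypothetical_doc(query: str) -> str:
    """Stub for the LLM call that drafts a plausible answer passage."""
    return "the leiden algorithm detects communities in a graph by refining partitions"

def hyde_retrieve(query: str, docs: list[str], generate=generate_hypothetical_doc) -> str:
    """Retrieve the document closest to the hypothetical answer, not the raw query."""
    hypo_vec = embed(generate(query))
    return max(docs, key=lambda d: cosine(hypo_vec, embed(d)))
```

The short query "what is leiden?" shares almost no vocabulary with a verbose document, while the drafted passage does; that vocabulary bridge is why the extra generation step can pay for its latency.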

Hierarchical Indexing

RAPTOR

Recursively clusters and summarizes documents into a multi-level tree, enabling retrieval at different abstraction levels for complex multi-hop questions.

+20% QuALITY accuracy (GPT-4)

Sarthi et al. (Stanford, ICLR 2024, arXiv:2401.18059). RAPTOR builds a tree from leaf chunks to root summaries via Gaussian Mixture Model clustering. Querying at multiple tree levels captures both detail and global context. The +20 absolute point improvement on GPT-4 QuALITY benchmark made this the standard for long-document RAG until GraphRAG arrived.
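The recursive cluster-then-summarize loop can be sketched as below. Two simplifications are assumed and should not be read as the paper's method: `cluster` groups adjacent nodes into fixed-size buckets instead of using GMM clustering, and `summarize` is a string-joining stub instead of an LLM call.

```python
def summarize(texts: list[str]) -> str:
    """Stub summarizer (assumption): keeps the first sentence of each input."""
    return " / ".join(t.split(".")[0] for t in texts)

def cluster(nodes: list[str], size: int = 2) -> list[list[str]]:
    """Stand-in for GMM clustering: fixed-size buckets of adjacent nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def build_tree(chunks: list[str], size: int = 2) -> list[list[str]]:
    """Return tree levels from leaf chunks (index 0) up to a single root summary."""
    if size < 2:
        raise ValueError("size must be at least 2 so each level shrinks")
    levels = [chunks]
    while len(levels[-1]) > 1:
        levels.append([summarize(group) for group in cluster(levels[-1], size)])
    return levels

def collapsed_tree(levels: list[list[str]]) -> list[str]:
    """Flatten all levels into one candidate pool for retrieval."""
    return [node for level in levels for node in level]
```

Retrieving over `collapsed_tree(levels)` lets a single query match either a detailed leaf or a higher-level summary, which is how the tree serves both detail and global context.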

Speculative · Efficient

Speculative RAG

A small drafter LM generates parallel multi-perspective draft answers from retrieved docs; a larger verifier LM selects the best - cutting latency dramatically.

17-51% latency reduction

Wang et al. (Google Research, ICLR 2025, arXiv:2407.08223). The drafter generates multiple drafts, each grounded in different retrieved subsets. The verifier LM picks the most accurate. Results: 2-13% accuracy gains across TriviaQA, MuSiQue, PubHealth, and ARC-C while simultaneously cutting latency 17-51% versus sequential RAG. Applies speculative decoding ideas from inference optimization to the RAG paradigm.
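A minimal sketch of the draft-then-verify pattern. The `draft` and `verify` functions are stubs for the small drafter and large verifier models, and the round-robin split in `make_subsets` is an assumed simplification of how evidence is divided among drafts.

```python
def make_subsets(docs: list[str], n_subsets: int) -> list[list[str]]:
    """Round-robin split so each draft sees a different slice of the evidence."""
    return [docs[i::n_subsets] for i in range(n_subsets)]

def draft(question: str, subset: list[str]) -> str:
    """Stub drafter (assumption): a small LM would write an answer from the subset."""
    return " ".join(subset)

def verify(question: str, draft_text: str) -> float:
    """Stub verifier (assumption): fraction of question terms the draft covers."""
    terms = set(question.lower().split())
    if not terms:
        return 0.0
    return len(terms & set(draft_text.lower().split())) / len(terms)

def speculative_rag(question: str, docs: list[str], n_subsets: int = 2) -> str:
    """Generate one draft per evidence subset and return the best-scoring one."""
    drafts = [draft(question, s) for s in make_subsets(docs, n_subsets) if s]
    if not drafts:
        return ""
    return max(drafts, key=lambda d: verify(question, d))
```

The drafts are independent of one another, so a real system can generate them in parallel; the list comprehension here runs them sequentially only to stay self-contained.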

03 - Architecture

Architecture Wars

Three battles: end-to-end optimization, multimodal retrieval, and long-context vs RAG.

Battle 1

RAG 2.0 vs Modular RAG

RAG 2.0

End-to-End Optimization · Contextual AI

Strengths:
Pretrains parser, embedder, retriever, reranker, and generator jointly
Backpropagates through both retriever and LLM
~10x better parameter efficiency (7B ≈ 70B baseline)
GA 2025 with HSBC, Qualcomm, US DoD deployments

Trade-offs:
Requires full retraining when knowledge changes
Less flexible to swap individual components
Higher upfront cost

Modular RAG

LEGO-like Orchestration · LangChain / LlamaIndex

Strengths:
Plug-and-play: swap retrievers, rerankers, generators independently
Compose Self-RAG, CRAG, RAPTOR, Search-R1 as flow patterns
Adopted by LangChain and LlamaIndex as reference architecture
Easier A/B testing and incremental improvement

Trade-offs:
Component interfaces can cause error propagation
Not jointly optimized - each module has its own objectives
Orchestration overhead at scale

Battle 2

Multimodal Retrieval

ColPali (ICLR 2025) indexes PDF page images directly via a PaliGemma-3B vision-language model, removing OCR, layout detection, and chunking from the pipeline. Late-interaction MaxSim scoring over a 32x32 grid of patch embeddings outperformed the OCR-based pipelines it was compared against on the ViDoRe benchmark. The descendant family (ColQwen2, ColSmol, Jina-ColBERT-v2) has rapidly displaced traditional document-RAG stacks for visually rich content.
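Late-interaction MaxSim scoring is simple to state: for each query-token embedding, take its best match among the page's patch embeddings, then sum those maxima. The sketch below uses plain Python lists and tiny 2-dimensional vectors as an assumption for readability; real systems run this over batched tensors.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Sum over query tokens of the maximum similarity to any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def rank_pages(query_vecs: list[list[float]], pages: dict[str, list[list[float]]]) -> list[str]:
    """Order page ids from best to worst MaxSim score."""
    return sorted(pages, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)
```

Because every query token is matched independently, a page only needs one strongly matching patch per token to score well, which is what lets the method work on raw page images without a chunking step.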

Vision-Language

ColPali

Indexes page images via PaliGemma-3B. No OCR needed. Best on ViDoRe benchmark.

Unified Embedding

Voyage-multimodal-3

+19.63% retrieval accuracy over next-best multimodal baseline (Nov 2024).

200-Page Context

Cohere Embed v4

128K-token context (≈200 pages), interleaved text+image, Matryoshka dimensions (April 2025).
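Matryoshka-style embeddings are trained so that a prefix of the vector remains usable on its own. A generic sketch of using one follows: keep the first `dims` components and re-normalize. This is not Cohere's API; `truncate_embedding` is a hypothetical helper name.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit L2 norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Shorter vectors reduce index size and search cost at some loss of accuracy, so the dimension becomes a tunable storage-versus-quality setting rather than a fixed property of the model.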

Battle 3

RAG vs Long-Context

RAG wins when...

Retrieval Augmented Generation

Corpus exceeds 2M tokens (all models degrade)
Freshness matters - retrieval over live indexes
Source attribution / citation is required
Per-token cost dominates (RAG is 8-82x cheaper)
Structured knowledge bases with precise facts

Long-Context wins when...

1M-10M token windows

Full document coherence needed (code repos, books)
Queries require dense reading of entire corpus
Retrieval errors would be catastrophic
Latency, not cost, is the primary constraint

Caveats:
Only Gemini 1.5 sustains accuracy past 1M tokens
Open-source models degrade sharply past 32K
8-82x more expensive per query at scale

04 - Research

Breakthrough Papers

Key published works shaping the 2024-2026 RAG landscape.

The most consequential 2025 development: RL-trained agents that learn to interleave reasoning with search. Search-R1, ReSearch, and DeepResearcher use PPO/GRPO reinforcement learning - no supervised reasoning data required. The field's center of gravity has shifted to what Li et al. call "Synergized RAG-Reasoning".

ICLR 2024

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Asai et al. - Reflection tokens (Retrieve, IsRel, IsSup, IsUse) enable adaptive retrieval. 13B model outperforms 70B+ baselines on knowledge-intensive benchmarks.

arXiv:2310.11511

ICLR 2024

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Sarthi et al. (Stanford) - Multi-level document trees via GMM clustering. +20 absolute points on GPT-4 QuALITY benchmark.

arXiv:2401.18059

ICLR 2024

Corrective Retrieval Augmented Generation (CRAG)

Yan et al. - Lightweight T5-Large evaluator triggers web-search fallback for poor retrievals. +36.6% on PubHealth, model-agnostic design.

arXiv:2401.15884

NAACL 2024

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

Jeong et al. - T5-Large complexity classifier routes queries to no-retrieval, single-hop, or multi-hop strategies. ~60% of queries need no expensive retrieval.

arXiv:2403.14403
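The three-way routing idea can be sketched as a classifier followed by a lookup. The keyword heuristic in `classify_complexity` is an assumption standing in for the trained T5-Large classifier, and the retrieval-round counts are illustrative values, not figures from the paper.

```python
def classify_complexity(query: str) -> str:
    """Heuristic stand-in for the trained complexity classifier (assumption)."""
    q = query.lower()
    if any(marker in q for marker in (" and ", "compare", "both")):
        return "multi_hop"
    if q.startswith(("who", "when", "where", "which")):
        return "single_hop"
    return "no_retrieval"

# Illustrative retrieval budget per route (assumed values).
RETRIEVAL_ROUNDS = {"no_retrieval": 0, "single_hop": 1, "multi_hop": 3}

def route(query: str) -> tuple[str, int]:
    """Return the chosen strategy and how many retrieval rounds it may use."""
    label = classify_complexity(query)
    return label, RETRIEVAL_ROUNDS[label]
```

The saving comes from the first branch: queries routed to `no_retrieval` skip the index entirely, so the classifier only has to be cheap relative to one retrieval call to be worthwhile.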

Apr 2024

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge et al. (Microsoft Research) - Knowledge graph + Leiden community detection + hierarchical summaries. 70-80% win rate on global sensemaking queries.

arXiv:2404.16130

ICLR 2025

ColPali: Efficient Document Retrieval with Vision Language Models

Faysse et al. - PDF page image indexing via PaliGemma-3B. Eliminates OCR/chunking. State-of-the-art on ViDoRe benchmark.

arXiv:2407.01449

ICLR 2025

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Wang et al. (Google) - Parallel multi-perspective drafts + verifier selection. 17-51% latency reduction, 2-13% accuracy gains.

arXiv:2407.08223

Jan 2025

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

Singh et al. - Comprehensive survey of agent-controlled retrieval systems with planning, reflection, and multi-agent collaboration patterns.

arXiv:2501.09136

Mar 2025

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

UIUC + Google - First open-source PPO/GRPO pipeline for RL-trained search-reasoning agents. +41% on Qwen2.5-7B over RAG baselines across 7 datasets.

arXiv:2503.09516

Apr 2025

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

RL training on the live web - outperforms Search-R1 and R1-Searcher on out-of-distribution datasets. Bridges the gap between benchmarks and real retrieval.

arXiv:2504.03160

Jun 2025

When to use Graphs in RAG: GraphRAG-Bench

Independent benchmark accepted at ICLR 2026. GraphRAG yields only +4.5% reasoning depth on HotpotQA at 2.3x higher latency - underperforms on simple factoid lookups.

arXiv:2506.05690

05 - Enterprise

The $37B Stack

Where RAG actually ships across industries and infrastructure.

76% of AI use cases are now purchased rather than built (vs 53% in 2024) per Menlo Ventures' December 2025 survey of 495 decision-makers. Enterprise AI spending tripled to $37B in 2025. The infrastructure has consolidated around a few dominant frameworks and cloud-managed services.

Framework Adoption

Framework	Adoption signal
LangChain	130M+
LlamaIndex	↑ fast
Haystack	~18K★
Semantic Kernel	(value not preserved)
smolagents	(value not preserved)

Vector DB Market Growth

Year	Market size
2025	$2.55B
2028 (est.)	~$7B
2034 (proj.)	$15.1B

Projected CAGR: 22.3%. Vendors include Pinecone (~4,000 customers), Weaviate, Qdrant, Milvus, and MongoDB Atlas (following its Voyage AI acquisition).

Production Deployments

Who's Shipping

Product	Domain	Key Metric	RAG Approach
Glean	Enterprise Search	$200M ARR · 27B+ docs	Knowledge graph + semantic search
Harvey	Legal AI	$190M ARR · $11B valuation	Domain-specific legal document RAG
Hebbia	Legal / Finance	92% vs 68% stock RAG	Multi-agent matrix analysis
Cursor	Code Assistance	$200M ARR	Merkle-tree repo re-indexing (10 min)
Morgan Stanley AI@MS	Financial Advisory	98% adoption · 100K new clients	100,000-doc corpus embedding search
Perplexity	Research Search	~30M queries/day · $21B val.	Live web retrieval + citation grounding
GitHub Copilot	Code	$300M+ ARR	Standard repo embedding RAG
Augment Code	Code (Enterprise)	70.6% SWE-bench Verified	Semantic dependency graphs
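The Merkle-tree re-indexing idea in the table can be illustrated generically: hash each file, fold the hashes into a root, and re-embed only files whose hash changed. This is a sketch of the general technique, not Cursor's actual implementation; all function names are hypothetical.

```python
import hashlib

def h(data: bytes) -> str:
    """SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()

def leaf_hashes(files: dict[str, str]) -> dict[str, str]:
    """Hash each file's content, keyed by path."""
    return {path: h(content.encode()) for path, content in files.items()}

def merkle_root(leaves: dict[str, str]) -> str:
    """Fold leaf hashes pairwise (in sorted path order) into a single root hash."""
    level = [leaves[p] for p in sorted(leaves)]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash to complete a pair
        level = [h((level[i] + level[i + 1]).encode()) for i in range(0, len(level), 2)]
    return level[0]

def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Comparing roots is a constant-size check for "did anything change?"; only when the roots differ does the indexer need to find the changed paths and re-embed them. A full tree implementation would descend only into differing subtrees, which this flat comparison simplifies away.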

06 - Risks & Limits

What Still Breaks

Stubborn failure modes, security vulnerabilities, and unresolved research challenges.

Despite maturation, RAG still has stubborn failure modes. Hallucination rates of 3-27% persist in production even with retrieval. Cleanlab's 2025 benchmarks found that two popular detection tools, RAGAS and DeepEval, failed on 83.5% and 58.9% of production examples respectively. In the 2024 Air Canada chatbot case (Moffatt v. Air Canada), a British Columbia tribunal held the airline liable for its chatbot's incorrect advice, a widely cited signal that companies cannot disclaim responsibility for AI output.

🔴 Critical · Accuracy

Persistent Hallucinations

3-27% hallucination rate in production RAG systems. Stanford found specialized legal AI tools hallucinate 17-34% of the time; general ChatGPT 58-82% on legal queries. ICLR 2025 research: models hallucinate even with all relevant info present when it isn't clearly structured.

🔴 Critical · Security

PoisonedRAG Attack

USENIX Security 2025: injecting just 5 crafted documents per target question manipulates RAG responses with roughly 90% success in million-document corpora. This undermines the assumption that retrieval always improves reliability.

🔴 Critical · Security

Prompt Injection

OWASP LLM Top 10's #1 risk (LLM01:2025). 73% of audited production AI deployments had prompt-injection vulnerabilities; only 34.7% had dedicated defenses. The RAG corpus itself is an attack surface.

🟡 High · Security

EchoLeak / CamoLeak

Microsoft 365 Copilot EchoLeak (CVE-2025-32711, CVSS 9.3) and GitHub Copilot CamoLeak (CVE-2025-53773, CVSS 9.6) - invisible Markdown enabling RCE through RAG-augmented context.

🟡 High · Reliability

Evaluation Tool Failures

RAGAS failed on 83.5% of production examples; DeepEval on 58.9%. Evaluating RAG in production remains an unsolved problem - developers lack reliable signals for system degradation.

🟡 High · Technical

Chunking Strategy

The precision-coherence tradeoff: small chunks enable precise retrieval but fragment context; large chunks preserve coherence but reduce specificity. No universal optimal strategy exists.
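The trade-off is easiest to see in a sliding-window chunker, where `size` and `overlap` are the two knobs. The word-based splitting below is an assumption for simplicity; production chunkers usually count tokens and respect sentence or section boundaries.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]
```

Smaller `size` gives more precise matches but splits ideas across chunks; larger `overlap` repairs some of that fragmentation at the cost of a bigger index and near-duplicate hits.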

🟣 Medium · Quality

Domain Embedding Gap

General-purpose embedding models underperform domain-specific embeddings by 20-40% in legal and medical contexts. Most teams default to general embeddings due to fine-tuning cost.

🟣 Medium · Architecture

Multi-Hop Reasoning Failures

Standard single-pass RAG systematically fails on questions requiring 3+ reasoning hops. GraphRAG and Hebbia's multi-agent approach target this gap, but at significant cost.

🟣 Medium · Operations

Freshness / Staleness

Re-indexing at scale remains costly and error-prone, and was reportedly a factor in Pinecone losing Notion as a customer. Live-web retrieval (Perplexity, Search-R1) bypasses this but introduces new reliability risks.

Agentic security risk: Cisco's State of AI Security 2026 found 83% of organizations plan to deploy agentic AI but only 29% feel ready to secure it. As RAG evolves toward autonomous retrieval agents, the attack surface grows substantially - each tool call is a potential injection point.
