What is RAG (Retrieval-Augmented Generation)?

The technique that helps turn LLMs from confident hallucinators into tools that answer from sources. Here is the honest explainer.

The short answer. RAG (Retrieval-Augmented Generation) is a pattern where the system retrieves relevant documents from a knowledge base before the model generates a response, then conditions the response on those documents. The retrieval step usually uses vector search over embeddings, served by a vector database such as Qdrant, Pinecone, or Weaviate. RAG mitigates hallucination by grounding the model in real, citable text, and it lets a model "know" information that was not in its training data: your company's docs, recent news, a specific PDF.

Why RAG exists

Large language models have two problems with raw recall: they hallucinate (confidently invent facts) and they have a knowledge cutoff (they cannot know what happened after training). RAG addresses both by inserting a retrieval step: the model is given the relevant sources before it writes, so it can paraphrase real text rather than invent plausible text.

How it actually works

Documents are split into chunks (typically 256-1024 tokens each), and each chunk is embedded and stored in a vector store.
The query is embedded with the same model and matched against the store, usually by cosine similarity.
The top-k matching chunks are retrieved (k is usually 4-12).
The chunks are prepended to the prompt as context.
The model generates a response that cites or paraphrases the retrieved context.
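
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, chunk sizes are counted in words rather than tokens, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 4) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Prepend the retrieved chunks to the question as numbered sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


docs = [
    "The refund window is 30 days from the date of delivery.",
    "Support is available by email on weekdays.",
    "Shipping to Europe takes five to seven business days.",
]
# Index: chunk every document and store (chunk text, embedding) pairs.
store = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How long is the refund window?"
top = retrieve(question, store, k=2)
prompt = build_prompt(question, top)  # this string is what gets sent to the model
```

In a real system the only parts that change shape are `embed` (a call to an embedding model) and `store` (a vector database query); the chunk, retrieve, prepend, generate flow stays the same.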

Advanced RAG patterns

Hybrid search (vector + keyword) often beats either alone. HyDE (Hypothetical Document Embeddings) has the model write a hypothetical answer first and embeds that instead of the query, which can outperform embedding the query directly. Re-ranking uses a second model, often a cross-encoder, to re-score and re-sort the top results. Self-RAG lets the model decide whether to retrieve at all. Luna's Cortex uses several of these.
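
One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique, not a description of how Luna's Cortex merges results; the document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    so items ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only rank positions, so it avoids having to normalise cosine scores against keyword scores, which live on different scales.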

When RAG is the wrong tool

For creative writing, RAG often produces worse output: the model latches onto retrieved chunks instead of generating freely. For things the model already knows (basic coding, common knowledge), retrieval adds latency without value. For very long documents, fine-tuning may beat RAG. Use RAG when the source of truth is documents the model has not seen.

How Luna uses RAG

Luna's Quantum Cortex uses RAG across three layers: your Memory Pod (semantic recall of past conversations), the Heaven knowledge base (HEAVEN_ECO_HUB_KNOWLEDGE, product docs), and the OmegaKnowledgeCore (a Qdrant-backed index over Wikipedia, arXiv, Stack Overflow, ConceptNet, PubMed).

When you ask Luna a research-heavy question, the swarm runs RAG over multiple corpora in parallel and synthesises the results. When you ask her about your own life, RAG pulls from your Memory Pod. The user experience is "she just knows"; the engineering is retrieval underneath.
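
As a rough illustration of querying several corpora in parallel, here is a generic sketch. It is not Luna's implementation: the corpus names, the `search_all` helper, and the lambda "searchers" are hypothetical stand-ins for real vector-store queries, and it assumes similarity scores are comparable across corpora.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Hit = tuple[float, str]  # (similarity score, chunk text)


def search_all(
    query: str,
    corpora: dict[str, Callable[[str], list[Hit]]],
    k: int = 4,
) -> list[tuple[float, str, str]]:
    """Query every corpus concurrently and return the k best hits overall.

    Each result is (score, corpus name, chunk text).
    """
    with ThreadPoolExecutor(max_workers=len(corpora)) as pool:
        futures = {name: pool.submit(search, query) for name, search in corpora.items()}
        merged = [
            (score, name, text)
            for name, future in futures.items()
            for score, text in future.result()
        ]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)[:k]


# Hypothetical searchers: each would normally call a vector store.
corpora = {
    "memory": lambda q: [(0.91, "You mentioned a trip to Lisbon in May.")],
    "docs": lambda q: [(0.42, "Product docs: how to export a conversation.")],
    "reference": lambda q: [
        (0.77, "Lisbon is the capital of Portugal."),
        (0.30, "Unrelated passage."),
    ],
}

hits = search_all("Where was I travelling in May?", corpora, k=2)
```

Threads suit this case because each search mostly waits on network I/O. If scores from different corpora are not on the same scale, a rank-based merge is a safer choice than sorting on raw scores.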

Related questions people ask

Do I need RAG to build an AI product?

What is the cheapest vector database?

How is RAG different from fine-tuning?

Does RAG eliminate hallucination?
