Retrieval-Augmented Generation (RAG) has transformed how we interact with Large Language Models (LLMs). By allowing models to pull in external information, RAG makes them more factual, up-to-date, and trustworthy. A particularly powerful approach is graph-based RAG, where knowledge is organized into a structured knowledge graph (KG)—a web of interconnected facts—that helps LLMs perform complex reasoning.
But there’s a hidden problem in how we build these graphs, a fundamental disconnect that has held back their true potential.
Today, we’re excited to introduce AutoGraph-R1, a new framework from our research that tackles this problem head-on. Described in our recent paper, “AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction,” it represents a paradigm shift: from building graphs that are merely “good” on paper to building graphs that are demonstrably “useful” in practice.
The Great Disconnect: Why a “Good” Graph Can Be a Bad Tool
The standard way to create a KG for a RAG system is a two-step process:
- Construction: An LLM reads through documents and extracts facts as (Subject, Relation, Object) triples, which are then assembled into a graph. This graph is judged on “intrinsic” metrics like factual precision and recall.
- Application: This static, pre-built graph is handed over to a RAG system to help answer user questions.
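To make the two-step setup concrete, here is a minimal, illustrative sketch in Python. It is not the paper’s actual pipeline: `llm_extract_triples` is a hypothetical placeholder for whatever LLM extraction prompt you use, and the graph is assembled with networkx.

```python
import networkx as nx


def llm_extract_triples(document: str) -> list[tuple[str, str, str]]:
    """Hypothetical helper: prompt an LLM to return (Subject, Relation, Object) triples."""
    raise NotImplementedError("plug in your LLM extraction call here")


def build_kg(documents: list[str]) -> nx.MultiDiGraph:
    """Step 1 (Construction): extract triples from each document and assemble a graph."""
    kg = nx.MultiDiGraph()
    for doc in documents:
        for subj, rel, obj in llm_extract_triples(doc):
            kg.add_edge(subj, obj, relation=rel)
    return kg


# Step 2 (Application): the finished, static graph is handed to a separate RAG system,
# which traverses it at query time. The graph builder never sees that step.
```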
The problem? A graph that scores high on intrinsic metrics isn’t always useful for answering real-world questions.
Imagine you ask: “What is the government of the state where the Golden Gate Bridge’s city is?”
A standard, fact-focused graph builder might create a chain of facts like this:
(Golden Gate Bridge, connects, San Francisco) → (San Francisco, is the center of, Northern California) → (Northern California, is in, California) → (California, has government, Republic)
This graph is factually correct, but the reasoning path is long and fragile. A retriever might get lost trying to traverse four hops and fail to find the answer.
What if the graph was built differently?
(Golden Gate Bridge, is located in, California) → (California, has government, Republic)
This graph is simpler, more direct, and far more useful. The retriever can now find the answer in just two hops. This is the core idea behind AutoGraph-R1: we teach the graph builder to create these useful, optimized structures.
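To see why the hop count matters for retrieval, here is a toy comparison of the two graphs above using networkx; the entities and relations are just the ones from the example, not real output of our system.

```python
import networkx as nx

# The long, fact-focused chain vs. the more direct, useful graph from the example above.
long_chain = [
    ("Golden Gate Bridge", "connects", "San Francisco"),
    ("San Francisco", "is the center of", "Northern California"),
    ("Northern California", "is in", "California"),
    ("California", "has government", "Republic"),
]
direct = [
    ("Golden Gate Bridge", "is located in", "California"),
    ("California", "has government", "Republic"),
]


def hops_to_answer(triples, start="Golden Gate Bridge", end="Republic"):
    """Number of edges a retriever must traverse from the question entity to the answer."""
    g = nx.DiGraph()
    g.add_edges_from((s, o, {"relation": r}) for s, r, o in triples)
    return nx.shortest_path_length(g, start, end)


print(hops_to_answer(long_chain))  # 4 hops: a long, fragile reasoning path
print(hops_to_answer(direct))      # 2 hops: direct and easy to retrieve
```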

The Solution: Learning from the Final Exam
So, how do we bridge this gap? The challenge is that graph construction is a discrete, non-differentiable process, so standard deep learning methods (like backpropagation) cannot carry a “success signal” from the final question-answering task all the way back to the graph builder.
Our solution is Reinforcement Learning (RL).
Think of it like training a student. You don’t just grade them on how well they memorize flashcards (intrinsic quality). You grade them on how well they perform on the final exam (downstream task).
AutoGraph-R1 sets up a training loop where an LLM-based graph constructor learns through trial and error:
- Construct: The constructor model reads source documents and builds a knowledge graph.
- Test: A frozen, off-the-shelf RAG system immediately uses this graph to try and answer a question.
- Reward: The system gets a “reward” based on how useful the graph was for the task.
- Learn: Using this reward signal, the constructor updates its strategy (its policy) to build a better, more useful graph next time.
This “end-to-end” optimization closes the loop between construction and application, forcing the constructor to learn what a “useful” graph actually looks like.
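In pseudocode, one pass through this loop looks roughly like the sketch below. The helper names (`constructor.build_graph`, `frozen_rag.answer`, `compute_reward`, `constructor.update_policy`) are hypothetical stand-ins for illustration, not our released API.

```python
def train_constructor(constructor, frozen_rag, dataset, compute_reward, num_steps=1000):
    """Sketch of the end-to-end RL loop: construct -> test -> reward -> learn."""
    for _ in range(num_steps):
        docs, question, gold_answer = dataset.sample()

        # 1. Construct: the trainable LLM reads the documents and builds a knowledge graph.
        graph = constructor.build_graph(docs)

        # 2. Test: a frozen, off-the-shelf RAG system uses that graph to answer the question.
        retrieved, answer = frozen_rag.answer(question, graph)

        # 3. Reward: score how useful the graph actually was for the task.
        reward = compute_reward(graph, retrieved, answer, gold_answer)

        # 4. Learn: update the constructor's policy from the reward signal.
        constructor.update_policy(reward)
```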
Designing a “Reward” for Usefulness
A key innovation in AutoGraph-R1 is designing reward functions that capture a graph’s utility. We developed two task-aware rewards tailored to how the graph is used:
- Knowledge-Carrying Reward (RC): For when the graph itself is the knowledge source. The reward is simple: after retrieving a subgraph, can the answer be directly deduced from the facts within it? This teaches the model to create graphs that are informationally complete.
- Knowledge-Indexing Reward (RI): For when the graph acts as an index to find relevant text passages. The reward measures retrieval quality: how many of the correct “gold” passages did the graph help us find? This teaches the model to build a clean, high-fidelity index that connects related documents effectively.
Critically, our experiments showed that these specific, functional rewards are far more stable and effective than using a noisy, high-level signal like the final answer’s F1 score.
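To give a feel for what these rewards measure, here is a deliberately simplified sketch (the RC check is approximated by string matching here; in practice, judging whether an answer is deducible from a retrieved subgraph is more involved):

```python
def knowledge_carrying_reward(retrieved_triples, gold_answer: str) -> float:
    """RC sketch: reward 1.0 if the answer can be read off the retrieved subgraph."""
    subgraph_text = " ".join(" ".join(triple) for triple in retrieved_triples)
    return 1.0 if gold_answer.lower() in subgraph_text.lower() else 0.0


def knowledge_indexing_reward(retrieved_passages, gold_passages) -> float:
    """RI sketch: recall of the gold supporting passages reached through the graph index."""
    if not gold_passages:
        return 0.0
    hits = len(set(retrieved_passages) & set(gold_passages))
    return hits / len(gold_passages)
```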
The Results: Demonstrably More Useful Graphs
We put AutoGraph-R1 to the test across five challenging question-answering benchmarks, using models from the Qwen and Llama families. The results were clear and consistent.
- Significant Performance Gains: KGs built by AutoGraph-R1 enabled state-of-the-art RAG methods to achieve significant F1 score improvements over graphs built with the same models in a standard, task-agnostic way. For some models and tasks, we saw average F1 gains of over +9 points.
- Better Indexing: When used as a knowledge index, our graphs led to much better retrieval. For example, with the Llama-1B model, passage recall improved by over +15 points, proving our framework creates a more effective “map” to the underlying knowledge.
- Utility and Quality Go Hand-in-Hand: Interestingly, optimizing for downstream utility didn’t come at the cost of factual accuracy. In fact, it improved it! The graphs built by AutoGraph-R1 had higher intrinsic precision and recall, showing that the pressure to be useful also encourages the model to be more accurate.
A Real-World Example
Let’s look at a concrete case from the 2WikiMultihopQA dataset.
Question: Who is the child of the director of the film Los Pagares De Mendieta?
- Standard KG: The baseline model correctly extracted (Los Pagares de Mendieta, directed by, Leopoldo Torres Ríos) but failed to extract the director’s relationship to his child. The reasoning path was broken, and the system failed.
- AutoGraph-R1 KG: Our RL-trained model learned that connecting people across relationships is crucial for multi-hop questions. It successfully extracted both (Los Pagares de Mendieta, directed by, Leopoldo Torres Ríos) and (Leopoldo Torres Ríos, father of, Leopoldo Torre Nilsson). The path was complete, and the system answered correctly.
The Big Picture
AutoGraph-R1 is the first framework to use reinforcement learning to directly optimize the KG construction process itself for downstream performance. Our work demonstrates that by closing the loop between how a knowledge base is built and how it’s used, we can create AI systems that are not just more knowledgeable, but fundamentally more capable.
We are moving from an era of building intrinsically “good” graphs to one of building demonstrably “useful” ones—and that makes all the difference.
Want to dive deeper?
- Read the full paper on arXiv: https://arxiv.org/pdf/2510.15339
- We believe in open and reproducible research. Our source code is available at https://github.com/HKUST-KnowComp/AutoGraph-R1 and our model checkpoints at https://huggingface.co/collections/gzone0111/autograph-r1.