This blog post introduces our recent work, PatchWorld. At first glance, it may look like a project about world models, agents, and program synthesis, which seems somewhat different from my previous work on knowledge graph reasoning, knowledge structuring, and GraphRAG. But for me, it is actually a very natural next question along the same research line.
For a long time, I have been studying structured knowledge. In the beginning, I focused more on reasoning over knowledge graphs: complex query answering, logical reasoning, inductive and abductive reasoning, and related problems. After working in this area for a while, I increasingly felt that the bottleneck was often not only “how to reason over a graph”, but also “where the structured data that can be reasoned over comes from”. If the graph itself is sparse and hard to scale, then even a sophisticated reasoning model will eventually be constrained by this data boundary.
That is why I later moved toward data structuring, including works such as AutoSchemaKG and IntentionKG. These projects ask how we can automatically extract and organize structures from raw text, web-scale corpora, and user behavior: which concepts should be preserved, which relations are worth modeling, how schemas should grow dynamically, and how the constructed graph can truly serve downstream tasks, as in AutoGraph-R1. My concern gradually became that a structured knowledge base should not only “look correct”; it should also be genuinely useful for RAG, question answering, and reasoning.
But in the past few years, with the emergence of GraphRAG, agent memory, workflow memory, and many engineering-oriented memory frameworks, I started to repeatedly ask myself a more fundamental question: what is the more general form behind all these systems? Are we building knowledge graphs? Are we building memory? Or are we, in fact, trying to construct some more abstract and executable “description of the world”?
Around 2023, before working on AutoSchemaKG, I already had this thought: knowledge structuring may be viewed as a special form of code generation.
By “code generation”, I do not narrowly mean generating Python or Java. I mean transforming unstructured data into some intermediate representation that can be executed, queried, composed, and verified. In traditional event extraction, we use parsers to extract event structures from natural language; in information extraction, we extract named entities and relations; in knowledge graph construction, we output triples; in schema induction, we output concepts, types, and relation patterns. Formally, all of these look like generating a structured program. More essentially, they are doing abstraction and conceptualization: using a simpler, more compressed, and more operational structure to characterize, represent, and fit the original data as well as possible.
From this perspective, a knowledge graph is not merely a database, nor just a collection of facts. It is closer to a structured model of the world: what objects exist, what attributes they have, how they interact with each other, which relations can be composed, and which rules can be executed.
Following this line of thought, a natural question is: if knowledge structuring is fundamentally a form of abstraction and programming, can it be applied to the more general problem of world modeling?
Today, world models have many schools of thought and many definitions. Some works care about video prediction, some about latent dynamics, some about embodied control, and some, such as JEPA, emphasize energy minimization in latent space. But as the name suggests, they all try to answer one question: how do we model the world, and how do actions change the world?
Most modern world models represent the world with vectors, latent spaces, or neural states, learn state transitions with gradient-based algorithms, and sometimes even use gradients or energy minimization for planning. This line of work is of course very powerful. But I have always been curious: if we set aside neural latent spaces for a moment, and instead look back at symbolic methods, structured representations, and program synthesis, can we also build a kind of world model? It may not be as flexible as a neural model, but could it be more readable, more inspectable, and easier to repair locally?
This is the starting point of PatchWorld. I see it as pushing the “structured knowledge” line toward a more general object: structured world dynamics.
Text agents provide a very suitable entry point. They live in partially observable environments: the simulator maintains a hidden state, and after each action the agent only sees the rendered text. In AlfWorld, you cannot see what is inside a closed drawer; in Wordle, the target word is not revealed during interaction; in WebShop, backend relevance scores are also hidden from the agent.
In this setting, world modeling is no longer just about predicting the next sentence. It requires maintaining an implicit belief (a data structure updated from the interaction history), understanding how actions change the environment, and supporting later simulation and planning.
The paper and code are now available:
- Paper: PatchWorld: Gradient-Free Optimization of Executable World Models
- Code: HKBU-KnowComp/PatchWorld
PatchWorld asks a direct question: if I only give you a set of offline trajectories, can you write a Python program that behaves like a small simulator, maintains its own belief, predicts what will happen after an action, and can be used for planning?
Why This Problem Is Not Simple
On the surface, offline trajectories are just ordinary sequences: what the current observation is, what action the agent takes, what the next observation is, what the reward is, and whether the task has ended.
The real difficulty is that the same set of trajectories can be explained by many completely different programs.
The most extreme approach is, of course, a lookup table. We can memorize every (observation, action, next observation) transition and perform well on the training set. But such a “model” has not really understood the environment; it is simply memorizing answers. Once it encounters a new state, a new combination, or a new action sequence, it can easily fail.
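To make the failure mode concrete, here is a minimal sketch of such a lookup-table "model". The class name and the two toy transitions are made up for illustration; they are not taken from the PatchWorld code or data.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy transitions: (observation, action, next_observation).
Transition = Tuple[str, str, str]


class LookupTableModel:
    """Memorizes transitions verbatim; it has no notion of hidden state."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[str, str], str] = {}

    def fit(self, transitions: List[Transition]) -> None:
        for obs, action, next_obs in transitions:
            self.table[(obs, action)] = next_obs

    def predict(self, obs: str, action: str) -> Optional[str]:
        # Returns None for any (observation, action) pair never seen in training.
        return self.table.get((obs, action))


train = [
    ("You are in the kitchen.", "go north", "You are in the hall."),
    ("You are in the hall.", "go south", "You are in the kitchen."),
]
model = LookupTableModel()
model.fit(train)

seen = model.predict("You are in the kitchen.", "go north")
unseen = model.predict("You are in the hall.", "go north")
```

The table is perfect on the pairs it has seen and has nothing to say about the new one, which is exactly the gap a compact program with an internal state is meant to close.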
What we really want is something else: a more compact program. It should maintain an implicit state from observations, know which information has already been discovered and which remains uncertain; it should know which variables an action will likely change; and it should be able to render its internal state back into a text observation.
This returns to the “structuring” problem I mentioned earlier. We do not want to copy all data as-is. We want to find a more abstract and more compressed structure that still has predictive power. This time, however, the structure is no longer triples or a schema, but executable code.
Text environments make the problem even more interesting because they are essentially POMDPs (partially observable Markov decision processes). Compared with many game environments, text environments can have an extremely small visible surface. Therefore, if a model only predicts the next sentence from the most recent history, it may imitate the surface text reasonably well, but it may not maintain a belief that is useful for multi-step simulation and action comparison. For planning, the key is not only “does the next observation look similar”, but also “can the model distinguish the consequences of different actions”.
Our Approach: Let the LLM Write Programs, Then Repair Them With Counterexamples
The idea behind PatchWorld is simple: if what we want is an executable structure, then let the model directly write an executable world model.
Concretely, we ask an LLM to generate a Python program from the environment description and trajectories. This program implements several core functions: how to initialize belief, how to correct belief using real observations, how to predict belief changes under an action, and how to render the belief into the next observation.
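As a rough illustration of what such a program can look like, here is a hand-written sketch for a toy drawer environment. The function names (`init_belief`, `update`, `predict`, `render`), the `Belief` fields, and the observation strings are all assumptions made for this example; the interface in the actual repository may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Belief:
    drawer_open: bool = False
    drawer_contents: Optional[str] = None  # None means "not yet observed"


def init_belief() -> Belief:
    """Start with a closed drawer whose contents are unknown."""
    return Belief()


def update(belief: Belief, observation: str) -> Belief:
    """Correct the belief with whatever a real observation reveals."""
    prefix = "The drawer is open. You see "
    if observation.startswith(prefix):
        item = observation[len(prefix):].rstrip(".")
        return replace(belief, drawer_open=True, drawer_contents=item)
    return belief


def predict(belief: Belief, action: str) -> Belief:
    """Predict how the belief changes under an action."""
    if action == "open drawer":
        return replace(belief, drawer_open=True)
    if action == "close drawer":
        return replace(belief, drawer_open=False)
    return belief


def render(belief: Belief) -> str:
    """Turn the belief back into a text observation."""
    if not belief.drawer_open:
        return "The drawer is closed."
    if belief.drawer_contents is None:
        return "The drawer is open."
    return f"The drawer is open. You see {belief.drawer_contents}."


b = predict(init_belief(), "open drawer")
first_guess = render(b)  # contents are still unknown at this point
b = update(b, "The drawer is open. You see a key.")
b = predict(predict(b, "close drawer"), "open drawer")
second_guess = render(b)  # the remembered contents are rendered this time
```

The point of the sketch is the division of labor: the belief remembers what has been discovered, so reopening the drawer later renders the key even though no recent observation mentions it.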
Then we do something very important: replay.
We run the program back on the trajectories. If it predicts incorrectly, we do not merely say “the model is wrong”; we turn the error into a concrete counterexample: under what belief, after what action, what the program predicted, and what the true observation was. This is similar to counterexample-guided repair in program synthesis.
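The replay step can be sketched as follows. This is a simplified illustration, not the repository's implementation: the `Counterexample` fields, the `replay` signature, and the deliberately buggy light-switch model are all hypothetical, and the sketch omits belief correction from real observations to stay short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Counterexample:
    step: int
    belief: dict
    action: str
    predicted: str
    actual: str


# Here a world model is just a function:
# (belief, action) -> (new_belief, predicted_observation).
StepFn = Callable[[dict, str], Tuple[dict, str]]


def replay(step_fn: StepFn, init_belief: dict,
           trajectory: List[Tuple[str, str]]) -> List[Counterexample]:
    """Run the model over (action, true_observation) pairs and record mismatches."""
    belief = dict(init_belief)
    failures: List[Counterexample] = []
    for t, (action, actual) in enumerate(trajectory):
        new_belief, predicted = step_fn(belief, action)
        if predicted != actual:
            # Keep the belief *before* the action, so the failure is reproducible.
            failures.append(Counterexample(t, dict(belief), action, predicted, actual))
        belief = new_belief
    return failures


def buggy_step(belief: dict, action: str) -> Tuple[dict, str]:
    """Deliberately incomplete candidate: handles "turn on" but ignores "turn off"."""
    new_belief = dict(belief)
    if action == "turn on":
        new_belief["light"] = True
    text = "The light is on." if new_belief["light"] else "The light is off."
    return new_belief, text


trajectory = [("turn on", "The light is on."), ("turn off", "The light is off.")]
failures = replay(buggy_step, {"light": False}, trajectory)
```

Each recorded failure carries the belief, the action, the prediction, and the true observation, which is the kind of concrete evidence a repair prompt can be built around.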
Next, the LLM does not rewrite the entire model from scratch. It proposes local patches. A patch is accepted only if it truly improves replay performance on the validation set. In other words, the “patch” in PatchWorld is not just a metaphor: we are literally patching the world model.
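The acceptance rule can be illustrated with a small greedy loop. In this sketch the patches are hand-written functions standing in for LLM-proposed edits, and the door environment, the `accuracy` metric, and the strict-improvement criterion are assumptions chosen for illustration rather than the exact procedure in the paper.

```python
from typing import Callable, List, Tuple

Model = Callable[[str, str], str]   # (state, action) -> predicted observation
Patch = Callable[[Model], Model]    # wraps a model with one local change
Example = Tuple[str, str, str]      # (state, action, true observation)


def accuracy(model: Model, data: List[Example]) -> float:
    return sum(model(s, a) == o for s, a, o in data) / len(data)


def accept_patches(model: Model, patches: List[Patch],
                   validation: List[Example]) -> Tuple[Model, float, List[int]]:
    """Apply patches one by one, keeping only those that strictly improve replay."""
    best, best_score, accepted = model, accuracy(model, validation), []
    for i, patch in enumerate(patches):
        candidate = patch(best)
        candidate_score = accuracy(candidate, validation)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            accepted.append(i)
    return best, best_score, accepted


def base(state: str, action: str) -> str:
    return "Nothing happens."


def patch_open(model: Model) -> Model:
    def patched(state: str, action: str) -> str:
        if state == "closed" and action == "open":
            return "The door is open."
        return model(state, action)
    return patched


def patch_always_open(model: Model) -> Model:
    # Over-general rule: wrong when the door is already open.
    def patched(state: str, action: str) -> str:
        if action == "open":
            return "The door is open."
        return model(state, action)
    return patched


def patch_close(model: Model) -> Model:
    def patched(state: str, action: str) -> str:
        if state == "open" and action == "close":
            return "The door is closed."
        return model(state, action)
    return patched


validation = [
    ("closed", "open", "The door is open."),
    ("open", "close", "The door is closed."),
    ("open", "open", "Nothing happens."),
]
final_model, final_score, accepted = accept_patches(
    base, [patch_open, patch_always_open, patch_close], validation
)
```

The over-general second patch is rejected because it lowers validation accuracy, while the two targeted patches are kept. The model only ever changes when replay says the change helps.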
This is quite different from training a neural network. We do not apply gradient descent to the world model itself, nor do we turn it into an unreadable hidden-state predictor. Improvement comes from discrete program search and local repair. The final result is a Python program that can be opened and inspected: where belief is maintained, where state is updated, where special actions are handled, and where text is rendered.
Two Variants, One Pareto Line
One interesting finding in this work is that making the observation prediction more similar does not necessarily make planning better.
At first, this seems counterintuitive. We often assume that a more accurate world model should lead to better planning. But in text environments, many textual details are not that important for action selection. The phrasing of a room description, the order of a list, or certain templated sentences may affect text matching scores, but may not affect “what should I do next”.
Conversely, a model may fail to reproduce every word perfectly, but still correctly distinguish the consequences of key actions. For planning, that can be more important.
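A toy example makes the gap between the two criteria visible. The observations, the two hand-made "models", and the use of `difflib.SequenceMatcher` as a text-similarity score are all assumptions for illustration; they are not the metrics used in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical ground truth: what the environment says after each action.
truth: Dict[str, str] = {
    "take key": "You are in the dusty old kitchen. You pick up the key.",
    "take lamp": "You are in the dusty old kitchen. You cannot take that.",
}

# Model A copies the surface text well but predicts the same outcome for both actions.
model_a: Dict[str, str] = {action: truth["take key"] for action in truth}

# Model B gets the wording wrong but separates success from failure.
model_b: Dict[str, str] = {
    "take key": "Kitchen. You pick up the key.",
    "take lamp": "Kitchen. You cannot take that.",
}


def fidelity(pred: Dict[str, str], truth: Dict[str, str]) -> float:
    """Average character-level similarity between predicted and true observations."""
    scores = [SequenceMatcher(None, pred[a], truth[a]).ratio() for a in truth]
    return sum(scores) / len(scores)


def distinguishes_actions(pred: Dict[str, str]) -> bool:
    """True if every action leads to a different predicted observation."""
    return len(set(pred.values())) == len(pred)


a_fidelity, b_fidelity = fidelity(model_a, truth), fidelity(model_b, truth)
a_useful, b_useful = distinguishes_actions(model_a), distinguishes_actions(model_b)
```

Model A wins on text similarity yet cannot tell the two actions apart, while Model B loses on similarity but gives a planner something to choose between.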
So we build two versions, standing on different sides of this tradeoff:
| Variant | Idea | Strength |
|---|---|---|
| PatchWorld-Simple | Pure symbolic belief and dynamics | Best planning utility among code-based methods |
| PatchWorld-Residual | Adds a constrained residual-memory bias to capture exact textual signatures | Best observation reconstruction among code-based methods |
PatchWorld-Simple is closer to my intuitive picture of a symbolic world model: explain the environment as much as possible through belief and transition rules. PatchWorld-Residual acknowledges that some textual details are hard to generate fully from compact rules, so it allows a constrained residual memory to remember some high-confidence surface patterns.
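One way to picture a constrained residual memory is a thin wrapper around the symbolic renderer. The sketch below is only my illustration of the idea: the `ResidualRenderer` class, the `min_count` threshold, and the "seen consistently" rule are assumptions, not the mechanism implemented in PatchWorld-Residual.

```python
from collections import Counter, defaultdict
from typing import Callable, DefaultDict, Hashable


class ResidualRenderer:
    """Falls back to symbolic rendering unless a surface text is high-confidence."""

    def __init__(self, symbolic_render: Callable[[Hashable], str],
                 min_count: int = 2) -> None:
        self.symbolic_render = symbolic_render
        self.min_count = min_count
        self.memory: DefaultDict[Hashable, Counter] = defaultdict(Counter)

    def observe(self, key: Hashable, text: str) -> None:
        """Record the exact text seen for a belief signature."""
        self.memory[key][text] += 1

    def render(self, key: Hashable) -> str:
        seen = self.memory.get(key)
        # Use the memorized text only if it was seen often enough and never contradicted.
        if seen and len(seen) == 1:
            text, count = next(iter(seen.items()))
            if count >= self.min_count:
                return text
        return self.symbolic_render(key)


renderer = ResidualRenderer(lambda room: f"You are in the {room}.")
renderer.observe("kitchen", "You are in the kitchen. A faint smell of bread lingers.")
renderer.observe("kitchen", "You are in the kitchen. A faint smell of bread lingers.")
renderer.observe("hall", "You are in the hall. It is quiet.")

kitchen_text = renderer.render("kitchen")  # memorized: seen twice, always identical
hall_text = renderer.render("hall")        # seen once only: symbolic fallback
cellar_text = renderer.render("cellar")    # never seen: symbolic fallback
```

The constraint matters: without the count and consistency checks, the memory would drift back toward the lookup table discussed earlier.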
The results show that the Residual version is indeed better at reconstructing text, while the Simple version achieves better planning performance among code-based world models. This makes me think that world models should not be evaluated by a single metric. Observation fidelity and planning utility likely form a Pareto frontier rather than a monotonic relationship.
Results on Seven AgentGym Environments
We evaluate on seven text-agent environments from AgentGym: maze, BabyAI, TextCraft, AlfWorld, SciWorld, Wordle, and WebShop. They cover several different regimes: some environments have relatively deterministic structure, where hidden state can gradually be recovered through exploration; some have intrinsic uncertainty; and in others, the dynamics are learnable but the rendered text is very complex.
Across these environments, PatchWorld-Simple achieves the best planning score among code-based methods. In live one-step lookahead planning, it reaches 76.4% macro success. More importantly, the world-model prediction module makes no LLM calls at inference time. Once the program is induced, it is just an ordinary Python world model.
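For readers unfamiliar with one-step lookahead, the idea can be sketched in a few lines. The corridor environment, the `predict` rule, and the distance-based `score` heuristic are hypothetical stand-ins; the live planner in the repository works with AgentGym environments and may score candidates differently.

```python
from typing import Callable, List, Sequence

GOAL = 3  # hypothetical goal position in a 1-D corridor


def predict(position: int, action: str) -> int:
    """Toy transition rule standing in for an induced world model."""
    return position + {"left": -1, "right": 1, "wait": 0}[action]


def score(position: int) -> float:
    """Hypothetical heuristic: being closer to the goal is better."""
    return -abs(GOAL - position)


def one_step_lookahead(belief: int, actions: Sequence[str],
                       predict_fn: Callable[[int, str], int],
                       score_fn: Callable[[int], float]) -> str:
    """Simulate each action once with the model and pick the best predicted outcome."""
    return max(actions, key=lambda a: score_fn(predict_fn(belief, a)))


position = 1
plan: List[str] = []
while position != GOAL:
    action = one_step_lookahead(position, ["left", "right", "wait"], predict, score)
    plan.append(action)
    position = predict(position, action)
```

Every call inside the loop is plain Python: the planner queries the world model as an ordinary function, with no language model in the prediction path.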
We also compare against methods such as LLM-Direct, Word2World, PoE-World, and WorldCoder. The repository includes scripts for RQ1 (one-step fidelity), RQ2 (rollout robustness), and RQ3 (live agent planning), as well as trajectory data on Hugging Face and baseline launchers for reproduction.
Why I Think This Matters
Neural world models are of course very powerful. They learn dynamics in latent space; they are continuous, flexible, and scalable. But I think symbolic and executable world models have their own unique value, especially in agent settings.
When an agent behaves incorrectly, we may not only want to know that the loss became larger. We may want to know: what did it believe existed in the world? Why did it think this action would lead to that consequence? Which rule caused the wrong prediction? If there is a concrete Python program, we can at least open it, step through it, change one rule, and write tests.
This is what I find most interesting about PatchWorld: the LLM is not the final world model. It is more like a symbolic optimizer. It helps us search for programs, propose patches, and repair mistakes; but the final system used for prediction and planning is executable, inspectable, and locally editable.
In this sense, PatchWorld is not separate from my previous work on knowledge graphs, schema induction, and GraphRAG. They are all asking the same question: how can we abstract a structure from raw data, so that this structure can be queried, reasoned over, verified, and eventually used for downstream tasks?
The difference is that in PatchWorld, the structure is no longer only static knowledge, but a dynamic world model; no longer only triples and schemas, but executable code that describes how the environment changes.
Code
For the full RQ1/RQ2/RQ3 runbooks, AgentGym server setup for live planning, and comparative baselines, see EXPERIMENTS.md in the repository.
Citation
If you find PatchWorld useful, please cite:
Read the paper: https://arxiv.org/pdf/2605.30880
Browse the code: https://github.com/HKBU-KnowComp/PatchWorld
中文版
这篇博客想介绍我们最近的工作 补丁世界模型(PatchWorld)。乍一看,它似乎是一个关于 world model、agent 和 program synthesis 的工作,和我之前做的知识图谱推理、知识结构化、GraphRAG 方向不太一样。但对我自己来说,它其实是同一条研究线往前走之后很自然会遇到的问题。
过去很长一段时间,我一直在研究结构化知识。一开始更多是在知识图谱上做推理:复杂查询回答、逻辑推理、归纳和溯因推理等等。做了一段时间之后,我越来越强烈地感觉到,很多时候问题不只是“怎么在图上推理”,而是“可以被推理的结构化数据到底从哪里来”。如果图谱本身很稀疏、数据很难规模化,那么后面的推理模型再复杂,也会被这个数据边界卡住。
所以后来我开始做数据结构化,比如 AutoSchemaKG、IntentionKG 这些工作。它们关心的是如何从原始文本、网页语料、用户行为里自动抽取和组织结构:哪些概念应该被保留下来,哪些关系值得建模,schema 应该如何动态生长,构建出来的图谱又如何真的服务下游任务, 比如AutoGraph-R1。我关心的问题进一步变成:一个结构化知识库不仅要“看起来对”,还要在 RAG、问答和推理任务里真的有用。
但最近几年,GraphRAG、agent memory、workflow memory、各种偏工程化的 memory 框架越来越多。我也开始反复想一个更本质的问题:这些东西背后更一般的形式到底是什么?我们是在做知识图谱吗?是在做 memory 吗?还是说,我们其实一直在尝试构建某种更抽象的、可执行的“世界描述”?
大概从 2023 年前后,在做 AutoSchemaKG 之前,我就有一个想法:知识结构化或许可以看作一种特殊的代码生成。
这里的“代码生成”不是狭义地生成 Python 或 Java,而是说,把非结构化的数据转换成某种可以被执行、被查询、被组合、被验证的中间表示。传统事件抽取里,我们用 parser 从自然语言里 parse 出事件结构;信息抽取里,我们抽命名实体和关系;知识图谱构建里,我们输出三元组;schema induction 里,我们输出概念、类型和关系模式。形式上看,它们都像是在生成一种结构化程序。更本质地说,它们做的是抽象化和概念化:用更简单、更压缩、更可操作的结构,尽可能好地刻画、表示和拟合原始数据。
如果从这个角度看,知识图谱就不只是一个数据库,也不只是一个 facts collection。它更像是一种对世界的结构化建模:什么对象存在,它们有什么属性,彼此之间如何作用,哪些关系可以被组合,哪些规则可以被执行。
沿着这个思路继续往前走,一个自然的问题就是:如果知识结构化本质上是一种抽象化和程序化,那么它能否被用到更一般的世界建模里?
现在 world model 这个方向有很多流派,也有很多不同定义。有的工作关心视频预测,有的关心 latent dynamics,有的关心 embodied control,有的像 JEPA 一样强调在隐空间里做能量最小化。但顾名思义,它们都在试图回答一个问题:如何建模这个世界,以及行动会如何改变世界。
绝大多数现代世界模型都用向量、隐空间或者神经网络状态来表示世界,用梯度算法学习状态变化,甚至在规划时也用梯度或能量最小化来搜索行为。这条路线当然非常强大。但我一直很好奇:如果我们暂时不从 neural latent space 出发,而是回头看看符号主义、结构化表示、程序合成这些方法,能不能也做出一种世界模型?它也许没有神经模型那么柔性,但它是不是会更可读、更可检查、更容易局部修补?
这就是 补丁世界模型(PatchWorld) 的出发点。我认为这是把“结构化知识”这条线推进到一个更一般的对象:结构化世界动态。
文本智能体给了我们一个很合适的切入点。它们生活在部分可观测的环境里:模拟器维持着一个隐藏状态,智能体每次行动后只能看到渲染出来的文字。AlfWorld 里抽屉里有什么看不见;Wordle 里目标词不会在互动中揭晓;WebShop 的后台相关性分数也对智能体不可见。
这时,世界建模就不再只是预测下一句话,而是要维护一个隐式信念 (belief,具体而言就是一个根据历史信息维护的数据结构),理解行动会如何改变环境,并支持之后的仿真和规划。
这文章的论文和代码已经开源:
补丁世界模型想回答的问题其实很直接:如果我只给你一些离线轨迹,你能不能写出一个 Python 程序,像一个小型模拟器一样,维护自己的 belief,预测行动之后会发生什么,并且可以被拿来做规划?
为什么这个问题并不简单
表面上看,离线轨迹只是一些很普通的序列:当前观测是什么,智能体做了什么动作,下一条观测是什么,奖励是多少,任务有没有结束。
但真正麻烦的地方在于:同一批轨迹可以被很多完全不同的程序解释。
最极端的做法当然是查表。把每个 observation-action-next observation 都记下来,训练集上可以做得很好。但这样的“模型”并没有真的理解环境,它只是在背答案。一旦遇到新的状态、新的组合、新的行动序列,就很容易崩掉。
我们真正想要的是另一种东西:一个更紧凑的程序。它应该能从观测中维护一个隐式状态,知道哪些信息是已经发现的,哪些信息还不确定;它应该知道某个 action 大概会改变哪些变量;它也应该能把内部状态重新渲染成文字观测。
这就回到了我前面说的“结构化”的问题。我们不是要把所有数据原样记下来,而是要找到一个更抽象、更压缩、但仍然有预测力的结构。只不过这次这个结构不再是三元组或 schema,而是一段可以运行的代码。
文本环境还会让这个问题更有意思,因为它们本质上是 POMDP(部分可观测马尔可夫决策过程)。而且相比于很多游戏环境,文本环境的可见范围可能非常非常的小。所以,如果模型只是根据最近几轮 history 去预测下一句话,它也许可以把表面文本模仿得不错,但它不一定真的维护了一个可以用于多步仿真和行动比较的 belief。对于规划来说,关键并不只是“下一条 observation 像不像”,而是“不同 action 会不会被模型区分开”。
我们怎么做:让 LLM 写程序,再用反例修程序
补丁世界模型的做法很朴素:既然我们想要的是一个可执行的结构,那就直接让模型写一个可执行的世界模型。
具体来说,我们让 LLM 先根据环境描述和轨迹,生成一个 Python 程序。这个程序要实现几个核心函数:如何初始化 belief,如何根据真实观测修正 belief,如何根据 action 预测 belief 的变化,以及如何把 belief 渲染成下一条 observation。
然后我们做一件非常重要的事情:回放。
把这个程序放回轨迹里跑。如果它预测错了,我们不只是说“模型错了”,而是把错误整理成一个具体的反例:在什么 belief 下,执行什么 action,程序预测了什么,真实观测是什么。这有点像程序综合里的 counterexample-guided repair。
接下来,LLM 不再从零开始重写整个模型,而是提出局部补丁。补丁只有在验证集上真的提升回放效果时才会被接受。也就是说,PatchWorld 里的“patch”不是一个比喻,它真的是在不断给世界模型打补丁。
这和训练一个神经网络很不一样。我们没有对世界模型本身做梯度下降,也没有把它变成一个不可读的 hidden state predictor。改进来自离散的程序搜索和局部修复。最后得到的东西是一段可以打开看的 Python:哪里维护 belief,哪里更新状态,哪里处理特殊 action,哪里负责渲染文本,都可以检查。
两个变体,一条帕累托线
做这个工作时,一个很有意思的发现是:把 observation 预测得更像,不一定会让 planning 更好。
这件事一开始看起来有点反直觉。我们通常会想,世界模型越准确,规划应该越好。但在文本环境里,很多文字细节其实对行动选择没那么重要。比如房间描述里的措辞、列表顺序、某些模板化句子,它们会影响文本匹配分数,但未必影响“我下一步应该做什么”。
反过来,一个模型可能没有完美复现每一个词,但它正确地区分了关键 action 的后果。对于 planning 来说,这反而更重要。
所以我们做了两个版本,它们分别站在这个权衡的两边:
| 变体 | 思路 | 优势 |
|---|---|---|
| PatchWorld-Simple | 纯符号信念与动力学 | 基于代码的方法中规划效用最高 |
| PatchWorld-Residual | 增加受约束的残差记忆偏置,捕捉确切文本签名 | 基于代码的方法中观测还原最好 |
PatchWorld-Simple 更像我直觉中想要的符号世界模型:尽量用 belief 和 transition rules 去解释环境。PatchWorld-Residual 则承认有些文本细节很难用紧凑规则完全生成,于是允许一个受约束的 residual memory 去记住一些高置信度的表面模式。
结果显示,Residual 版本确实更擅长还原文本;Simple 版本则在代码世界模型里取得了更好的规划效果。这让我觉得,世界模型不应该只用一个指标来评价。observation fidelity 和 planning utility 很可能是一条帕累托前沿,而不是一个单调关系。
七个 AgentGym 环境上的结果
我们在 AgentGym 的七个文本智能体环境上做了实验,包括 maze、BabyAI、TextCraft、AlfWorld、SciWorld、Wordle、WebShop。它们覆盖了几种不同情况:有些环境结构比较确定,探索之后就能逐渐恢复隐藏状态;有些环境有本质的不确定性;还有一些环境动力学可以学,但渲染文本非常复杂。
在这些环境里,PatchWorld-Simple 取得了最高的基于代码的规划分数。其中,在在线单步前瞻规划里,它达到了 76.4% 的宏观成功率。更重要的是,世界模型预测模块在推理时内部不调用 LLM。也就是说,一旦程序被归纳出来,它就是一个普通的 Python world model。
我们也和 LLM-Direct、Word2World、PoE-World、WorldCoder 等方法做了对比。仓库里放了 RQ1(单步保真度)、RQ2(rollout 鲁棒性)、RQ3(在线智能体规划)的脚本,也包括 Hugging Face 上的轨迹数据和一些 baseline launcher,方便复现。
我为什么觉得这件事重要
神经世界模型当然非常强。它们在隐空间里学习动力学,连续、柔性、可扩展。但我觉得,符号化和可执行的世界模型也有它独特的价值,尤其是在 agent 场景里。
如果一个 agent 的行为出了问题,我们不一定只想知道 loss 变大了。我们可能想知道:它到底以为世界里有什么?它为什么觉得这个 action 会有这样的后果?是哪条规则让它做出了错误预测?如果有一个具体的 Python 程序,我们至少可以打开来看,可以单步执行,可以改一条规则,可以写测试。
这也是我对 PatchWorld 最感兴趣的地方:LLM 在这里不是最终的 world model,而更像是一个符号优化器。它帮助我们搜索程序、提出补丁、修复错误;但最后用于预测和规划的,是一个可执行、可检查、可局部修改的系统。
从这个意义上说,PatchWorld 和我之前做的知识图谱、schema induction、GraphRAG 并不是割裂的。它们都在问同一个问题:如何从原始数据中抽象出一种结构,让这个结构可以被查询、被推理、被验证,并最终服务于任务。
只是在 PatchWorld 里,这个结构不再只是静态知识,而是动态的世界模型;不再只是三元组和 schema,而是一段描述环境变化的可执行代码。
代码
完整 RQ1/RQ2/RQ3 流程、AgentGym 服务器配置(用于在线规划)及对比基线,见仓库中的 EXPERIMENTS.md。
引用
若觉得补丁世界模型(PatchWorld)对你有帮助,欢迎引用:
阅读论文: https://arxiv.org/pdf/2605.30880
浏览代码: https://github.com/HKBU-KnowComp/PatchWorld