Disclosure: I work on knowledge graph reasoning and complex query answering (Query2Particles, SQE, NeurIPS 2023, Neural Graph Databases). The JEPA ↔ query encoding connection in Section 5 reflects my own interpretation.
If you follow AI research, you’ve probably heard of Yann LeCun’s vision for the future of AI: World Models and specifically JEPA (Joint Embedding Predictive Architecture).
But what exactly is JEPA, and why is it such a massive breakthrough for AI planning? Let’s break it down without getting lost in the dense academic jargon.
The Problem with Pixels: Why We Need Embeddings
To plan, an AI needs a “world model”—an internal understanding of how the environment works. If I push a glass off a table, it falls. If I turn a steering wheel left, the car goes left.
Historically, AI world models were generative. If an AI was driving a car, it tried to predict the exact pixels of the next video frame. The problem? The real world is incredibly messy. Imagine trying to predict the exact movement of every leaf on a tree blowing in the wind. It’s computationally exhausting, and if the AI guesses the leaves wrong, those errors compound until the AI’s prediction of the future completely falls apart.
The JEPA Solution: JEPA doesn’t predict pixels; it predicts embeddings. It takes the current state of the world and compresses it into an abstract mathematical representation (an embedding). It then predicts the embedding of the future. By operating in this “latent space,” JEPA learns to ignore irrelevant details (the leaves in the wind) and focuses only on what matters (the road, the other cars). This makes it fast, efficient, and far more resistant to compounding prediction errors.
What is “Energy” in an Energy-Based Model?
JEPA is an Energy-Based Model (EBM). In physics, systems naturally settle into the lowest possible energy state, like a ball rolling to the bottom of a valley. In JEPA, Energy is simply a compatibility score, typically measured as a distance in the embedding space.
- Low Energy = Things make sense together (the prediction matches reality or the goal).
- High Energy = Things don’t make sense together (the prediction is wrong or far from the goal).
The system is always trying to minimize this energy, but it does so differently depending on what phase it is in:
- During Training (Learning): Energy is the distance between the AI’s predicted future embedding and the actual future embedding. The AI updates its brain (neural network weights) to shrink this distance toward zero.
- During Planning (Acting): Energy is the distance between the AI’s predicted future embedding and its desired goal embedding.
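As a toy illustration of these two phases (the vectors here are made-up stand-ins for learned embeddings, not output of any real model), both use the same distance computation, just against a different target:

```python
# Toy sketch: energy as squared Euclidean distance in embedding space.
import numpy as np

def energy(predicted, target):
    """Energy = squared distance between two embeddings."""
    return float(np.sum((predicted - target) ** 2))

predicted_future = np.array([0.9, 0.1])   # what the world model imagines
actual_future    = np.array([1.0, 0.0])   # what really happened (training target)
goal             = np.array([0.0, 1.0])   # what we want to happen (planning target)

training_energy = energy(predicted_future, actual_future)  # small: prediction ~ reality
planning_energy = energy(predicted_future, goal)           # large: far from the goal
```

Training drives the first number down by changing the network; planning drives the second number down by changing the actions, as the next section explains.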
The Magic Trick: Planning via Gradient Descent
Here is where JEPA does something truly mind-blowing. In traditional AI (like ChatGPT or standard Reinforcement Learning), gradient descent is only used during training. When the AI is actually acting, it just does a single forward pass and spits out an answer.
In JEPA, planning is treated as Inference by Optimization. The AI uses gradient descent while it is acting.
Imagine playing mini-golf. A standard AI just looks at the ball and swings. A JEPA agent does a mental simulation first:
- It imagines swinging with a certain force and angle.
- Its internal world model predicts the future embedding (where the ball ends up).
- The “Cost Module” calculates the Energy (e.g., “You missed the hole by 2 feet”).
- Because the entire mental simulation is differentiable, the AI uses backpropagation on its actions.
The math tells it exactly how to adjust:
$$ A_{\text{new}} = A_{\text{old}} - \text{learning rate} \times \text{gradient} $$

It tweaks its imagined swing, runs the simulation again, and repeats this loop until the Energy (the distance to the hole) drops low enough. Then, it takes the actual swing in the real world.
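Here is a minimal sketch of that inner loop, under toy assumptions: the “world model” is a frozen linear map `W`, the energy is the squared distance between the predicted future embedding and a goal embedding, and the gradient is written out by hand (a real system would use autograd):

```python
# Planning as inference-time gradient descent on the ACTION, not the weights.
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 2.0]])            # frozen "world model": action -> predicted future embedding
goal = np.array([1.0, 1.0])           # desired future embedding (the hole)

def predict(action):
    return W @ action

def energy(action):
    diff = predict(action) - goal     # how far the imagined outcome is from the goal
    return float(diff @ diff)

def energy_grad(action):
    # Hand-derived gradient of the squared-distance energy w.r.t. the action.
    return 2.0 * W.T @ (predict(action) - goal)

action = np.zeros(2)                  # initial imagined swing
lr = 0.1
for _ in range(200):                  # the planning loop: tweak, re-simulate, repeat
    action -= lr * energy_grad(action)
# action now converges to the swing whose predicted outcome matches the goal
```

Note that `W` never changes inside the loop: only the imagined action is optimized, which is exactly the training/planning split described above.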
This “inference by optimization” pattern isn’t unique to vision and robotics. As I’ll discuss in Section 5, exactly the same mechanism—gradient descent at inference time over a differentiable energy landscape—appears in complex query answering on knowledge graphs, a problem I’ve worked on for several years.
The Catch: Continuous vs. Discrete Actions
If you’re thinking, “Wait, gradient descent requires smooth, continuous math—how does this work for discrete choices like moving a chess piece?” …you are exactly right!
Pure gradient-based planning is a superpower designed specifically for continuous action spaces (like steering angles, robotic joint torques, or applying physical force). You can smoothly adjust a steering wheel from \(15.5^\circ\) to \(15.3^\circ\).
You cannot take a smooth mathematical step between “Jump” and “Crouch.” So, how does JEPA handle discrete environments?
- Tree Search: Instead of calculus, it uses search algorithms (like Monte Carlo Tree Search). It imagines branching futures (Action A, B, or C), predicts the embeddings for all of them, and picks the branch with the lowest Energy.
- Continuous Relaxation: Engineers sometimes trick the system by turning discrete actions into continuous probabilities (e.g., 80% Jump, 20% Crouch). The AI uses gradient descent to optimize those probabilities, and then snaps to the highest one when it’s time to act.
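The second trick can be sketched in a few lines. Everything here is a toy assumption: three named actions, a fixed cost (energy) per action standing in for the world model’s prediction, and a hand-derived gradient of the expected cost through the softmax:

```python
# Continuous relaxation: optimize a probability distribution over discrete
# actions with gradient descent, then snap to the most probable action.
import numpy as np

actions = ["jump", "crouch", "wait"]
costs = np.array([0.2, 1.5, 0.9])     # toy energy of the predicted future per action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.zeros(3)                  # start undecided: uniform probabilities
lr = 0.5
for _ in range(100):
    p = softmax(logits)
    expected_cost = p @ costs         # differentiable surrogate for the discrete choice
    grad = p * (costs - expected_cost)  # d(expected_cost)/d(logits), derived by hand
    logits -= lr * grad

best = actions[int(np.argmax(softmax(logits)))]  # snap back to a discrete action
```

The gradient steadily shifts probability mass onto the lowest-cost action, so the final snap picks “jump”.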
JEPA on Graphs: The Connection to Complex Query Answering
I’ve spent several years working on complex query answering (CQA) over knowledge graphs—predicting answers to first-order logic queries on incomplete graphs. Looking at JEPA through this lens, I was struck by how deeply the two paradigms share the same bones.
Query Encoding is a Joint Embedding Predictive Architecture
In query encoding (QE), we encode a structured logical query into an embedding \(e_q\), and every entity in the knowledge graph has its own embedding \(e_v\). We then rank candidate answers by computing:
$$ \text{score}(q, v) = \langle e_q, e_v \rangle $$

Rewrite this as energy:

$$ E(q, v) = -\langle e_q, e_v \rangle $$

Low energy means the entity satisfies the query; high energy means it doesn’t. Two heterogeneous inputs—a logical structure and a graph entity—are mapped into a shared latent space, and the “decoder” is simply a comparison, never a reconstruction. This is precisely the Joint Embedding architecture: predict where in embedding space the answer should land, not what the answer “looks like” in the original graph.
The parallel goes deeper than a surface analogy:
| JEPA (Vision/Planning) | Query Encoding (KG Reasoning) |
|---|---|
| Observation → Encoder → embedding | Anchor entities + structure → Query Encoder → \(e_q\) |
| Predictor maps context embedding to target embedding | Query encoder maps logical pattern to answer-set embedding |
| Energy = distance between predicted and actual embeddings | Energy = negative similarity between \(e_q\) and \(e_v\) |
| No pixel reconstruction | No graph/subgraph reconstruction |
| Self-supervised from unlabeled video/images | Self-supervised from graph structure (no external labels) |
| EMA target encoder prevents collapse | Contrastive softmax over all entities prevents collapse |
Planning = Inference-Time Optimization = CQD
The deepest connection is to planning via gradient descent. In JEPA, the agent imagines actions, predicts future embeddings, computes energy against its goal, and backpropagates through its actions to find the optimal plan.
Continuous Query Decomposition (CQD) by Arakelyan et al. does exactly the same thing for logical queries. Given a complex query with existentially quantified variables, CQD:
- Initializes continuous embeddings for the unknown variables.
- Uses a pre-trained link predictor to score each atomic sub-query (the “world model”).
- Aggregates scores via t-norms (the “cost module”).
- Runs gradient descent on the variable embeddings to minimize the total energy.
- Snaps to the nearest discrete entity.
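Under toy assumptions (a single query atom \(r(\text{head}, Z)\), and a TransE-style distance standing in for the pre-trained link predictor), the CQD-CO loop looks like this:

```python
# Toy CQD-CO: gradient descent on a variable embedding, then snap to an entity.
import numpy as np

rng = np.random.default_rng(1)
entities = rng.normal(size=(10, 4))   # "pre-trained" entity embeddings (toy)
head, rel = entities[3], rng.normal(size=4)

def atom_energy(z):
    """TransE-style energy for the atom r(head, Z): low when z ~ head + rel."""
    d = z - (head + rel)
    return float(d @ d)

def atom_grad(z):
    return 2.0 * (z - (head + rel))

z = np.zeros(4)                       # continuous embedding for the variable Z
for _ in range(100):
    z -= 0.1 * atom_grad(z)           # optimize the variable, not the model

# Snap the continuous assignment to the nearest discrete entity.
answer = int(np.argmin(np.linalg.norm(entities - z, axis=1)))
```

With several atoms, the per-atom energies would be aggregated with a t-norm before taking gradients; the skeleton of the loop stays the same.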
The structural isomorphism is almost perfect:
$$ \underbrace{A_{\text{new}}}_{\text{JEPA: action}} = \underbrace{A_{\text{old}}}_{\text{current plan}} - \eta \cdot \nabla_A \underbrace{E(A)}_{\text{energy}} $$

$$ \underbrace{z_{\text{new}}}_{\text{CQD: variable}} = \underbrace{z_{\text{old}}}_{\text{current assignment}} - \eta \cdot \nabla_z \underbrace{E(z)}_{\text{query energy}} $$

JEPA optimizes over actions to reach a goal state; CQD optimizes over variable assignments to satisfy a logical query. Both treat inference as optimization in a differentiable latent space, using a frozen (or pre-trained) world model.
The Discrete-Continuous Gap Shows Up Here Too
Knowledge graphs are discrete—entities are atoms, not vectors. This creates the same challenge as discrete action spaces in JEPA planning. The solutions are also parallel:
- Continuous relaxation (CQD-CO): Represent variables as continuous embeddings, optimize with gradients, snap to the nearest entity at the end. Same trick as softening discrete actions into probability distributions.
- Combinatorial search (CQD-Beam): Enumerate candidate substitutions via beam search. Same principle as tree search over discrete action branches.
- Neural encoding (e.g., our Query2Particles and SQE): Train an encoder to directly map the query structure to the answer embedding in one forward pass—amortizing the optimization. This is analogous to training a policy network that amortizes JEPA’s planning loop into a single inference step.
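The beam-search variant can be sketched for a 2-hop query \(r_2(r_1(\text{anchor}))\), again with a toy TransE-style scorer and min as a simple t-norm for combining hop scores (all embeddings here are random stand-ins):

```python
# Toy CQD-Beam-style search: keep top-k intermediate entities, expand, combine.
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
entities = rng.normal(size=(n, d))
rel1, rel2 = rng.normal(size=d), rng.normal(size=d)

def link_score(h, r, tails):
    """Toy link predictor: higher = more plausible (negative TransE distance)."""
    return -np.linalg.norm(h + r - tails, axis=1)

beam_width = 3
anchor = entities[0]

# Hop 1: keep the top-k intermediate candidates instead of a single argmax.
s1 = link_score(anchor, rel1, entities)
beam = np.argsort(-s1)[:beam_width]

# Hop 2: expand each beam entry; combine hop scores with min (a Gödel t-norm).
best_score, best_answer = -np.inf, None
for v in beam:
    s2 = link_score(entities[v], rel2, entities)
    combined = np.minimum(s1[v], s2)
    t = int(np.argmax(combined))
    if combined[t] > best_score:
        best_score, best_answer = float(combined[t]), t
```

This is the discrete-branching counterpart of the gradient loop: enumerate a few futures, score them with the frozen predictor, keep the lowest-energy branch.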
In fact, the progression from CQD → neural QE mirrors the progression from pure planning → amortized policy learning in the world-model literature. CQD does expensive optimization at inference time; SQE trains a sequence encoder that produces the answer embedding directly, trading inference-time compute for training-time learning—exactly the same tradeoff between model-predictive control and a learned policy.
Self-Supervised Learning on Knowledge Graphs
There’s another angle worth emphasizing: query encoding is self-supervised learning on knowledge graphs. The training signal comes entirely from the graph structure itself—queries are sampled by walking the graph backwards from a random node (no human annotation, no external labels). The “pretext task” is: given a logical pattern extracted from the graph, recover which entities satisfy it. This is structurally identical to how JEPA derives its training signal from unlabeled images or video through masking and prediction.
In our eventuality knowledge graph work (NeurIPS 2023), we extended this to queries with implicit logical constraints over eventualities—pushing the “world model” to reason about events, causality, and temporal relations, all within the same embed-and-compare framework.
Looking Forward: Neural Graph Databases as World Models
In our recent work on neural graph databases, we push this connection further—building systems that combine the structured query interface of a graph database with the inferential power of neural link predictors. From the JEPA perspective, a neural graph database is a world model over relational knowledge, with the query engine serving as the planning module.
The top ten challenges we identified for agentic neural graph databases—including compositional generalization, multi-hop reasoning under uncertainty, and efficient inference-time search—map directly onto open problems in JEPA-style planning. Can we build systems that seamlessly plan over both perceptual (video, image) and symbolic (entity, relation) latent spaces using a unified energy-based framework? I believe the answer is yes, and the architectural convergence between JEPA and neural query answering suggests we’re already partway there.
The Takeaway
JEPA turns planning from a messy, pixel-perfect guessing game into a streamlined, high-level logical deduction process. By combining abstract embeddings with energy minimization, it allows AI to think ahead, simulate outcomes, and course-correct its plans—just like we do.
Key Papers for Further Reading
If you want to dive deeper into the original research behind these concepts, here are the foundational papers:
- A Path Towards Autonomous Machine Intelligence
  Yann LeCun (2022)
  Published in: OpenReview
  This is the foundational position paper where LeCun outlines the entire JEPA architecture, world models, and planning via energy minimization.
- Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
  Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas (2023)
  Published in: CVPR (Computer Vision and Pattern Recognition)
  The first major paper demonstrating I-JEPA working successfully on static images.
- V-JEPA: Latent Video Prediction for Visual Representation Learning
  Adrien Bardes, Jean Ponce, Yann LeCun (2024)
  Published in: arXiv (arXiv:2404.08471)
  The extension of JEPA into video, which is crucial for building actual world models that understand time and physics.
- A Tutorial on Energy-Based Learning
  Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, Fu Jie Huang (2006)
  Published in: Predicting Structured Data (MIT Press)
  A classic paper that explains the math and intuition behind Energy-Based Models long before JEPA was formalized.
- Complex Query Answering with Neural Link Predictors
  Erik Arakelyan, Daniel Daza, Pasquale Minervini, Michael Cochez (2021)
  Published in: ICLR 2021 (Oral)
  Demonstrates inference-time gradient optimization for answering logical queries on knowledge graphs, structurally isomorphic to JEPA’s planning loop.
- Query2Particles: Knowledge Graph Reasoning with Particle Embeddings
  Jiaxin Bai, Zifan Wang, Hongming Zhang, Yangqiu Song (2022)
  Published in: Findings of NAACL 2022
  Represents query answers as particle distributions in embedding space, a multi-modal energy landscape for query answering.
- Sequential Query Encoding for Complex Query Answering on Knowledge Graphs
  Jiaxin Bai, Tianshi Zheng, Yangqiu Song (2023)
  Published in: Transactions on Machine Learning Research (TMLR)
  Shows that a single sequence encoder can amortize the entire query-answering optimization into one forward pass, the “learned policy” counterpart to CQD’s planning approach.
- Top Ten Challenges Towards Agentic Neural Graph Databases
  Jiaxin Bai et al. (2025)
  Published in: arXiv
  Outlines the research agenda for neural graph databases as world models over relational knowledge.
Meanwhile, there is a minimal JEPA implementation at https://github.com/keon/jepa. I will try it later to see whether I can do something more concrete about bridging the two together.