The World Model (Gemini 2.5 Pro Translated Version)

A simple explanation of the four panels in the figure above:
- What V-JEPA (Video Joint Embedding Predictive Architecture) essentially does is encode video content that represents the real world. In the ideal case, models of this kind implement the world through their feature space, and points in that feature space form a bijection with states of the real world (a toy sketch of this idea follows the list).
- The foundational papers on World Models also essentially deal with encoding, but the objects they encode are worlds derived from the real world. These worlds have small capacity but are closed systems. The rightmost column of the diagram usually lacks this closed nature (in fact, the top two items in that column also have corresponding academic papers, e.g., MAE on ImageNet).
- Models in the next row down of the diagram go a step further and directly express information that humans can perceive and reason about. Sora is a typical example of directly expressing the world.
- LLMs sit in the middle because the world of language is itself an independent world derived from the real world. An LLM may be a god within this linguistic world, but when its outputs are mapped back to the real world, the so-called "hallucinations" appear.
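
To make the first bullet concrete, here is a minimal, hypothetical sketch of a JEPA-style objective: predict the embedding of masked video content from the embedding of visible content, with the loss computed entirely in feature space rather than pixel space. All module shapes, dimensions, and names below are illustrative assumptions, not the actual V-JEPA implementation.

```python
import torch
import torch.nn as nn

EMB = 64  # embedding dimension (assumed for this toy example)

# Context encoder sees visible patches; target encoder embeds masked patches.
context_encoder = nn.Sequential(nn.Linear(256, EMB), nn.GELU(), nn.Linear(EMB, EMB))
target_encoder = nn.Sequential(nn.Linear(256, EMB), nn.GELU(), nn.Linear(EMB, EMB))
predictor = nn.Sequential(nn.Linear(EMB, EMB), nn.GELU(), nn.Linear(EMB, EMB))

# Fake "video": a batch of 8 clips, each reduced to a 256-dim vector for the
# visible (context) part and the masked (target) part.
context_patches = torch.randn(8, 256)
target_patches = torch.randn(8, 256)

# Target embeddings come from a frozen (e.g. EMA-updated) encoder: no gradient.
with torch.no_grad():
    target_emb = target_encoder(target_patches)

# Predict the target embeddings purely in feature space.
pred_emb = predictor(context_encoder(context_patches))

# The loss is measured between points in the feature space, never in pixel
# space; this is the sense in which the model "implements the world" through
# its embeddings.
loss = nn.functional.smooth_l1_loss(pred_emb, target_emb)
loss.backward()
print(float(loss))
```

The design choice the sketch highlights is that reconstruction happens in latent space, so the quality of the world model depends entirely on how faithfully points in the feature space correspond to states of the real world.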