Aryan V S

AI World Models (Keyon Vafa)

functionalities we may want from an AI system:

all of these can be performed by a model that has learned the "correct" world model.

What does it mean to have a world model?

one of the early tests was "Othello" (a board game played with black and white discs on an 8x8 grid; a placed disc flips every opponent disc it brackets).

example: two successive moves, showing only the centre of the board (1 and 0 are the two disc colours; each placed disc flips the discs it brackets):

- - - - - -  | - - - - - - | - - - - - -
- - 1 0 - -  | - - 1 0 - - | - 0 0 0 - -
- - 0 1 - -  | - 1 1 1 - - | - 1 1 1 - -
- - - - - -  | - - - - - - | - - - - - -

question: is world-model recovery possible? can a transformer trained only on move sequences uncover the implicit rules and state of the Othello board?
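For concreteness, a minimal sketch (assuming an 8x8 list-of-lists board with "1"/"0" for the two colours, matching the diagram above) of the board-update rule the transformer would have to infer implicitly, since it only ever sees sequences of moves:

    # ground-truth rule of Othello: a placed disc flips every run of opponent
    # discs it brackets, in all eight directions. the transformer never sees
    # this rule directly, only sequences of legal moves.
    EMPTY, BLACK, WHITE = "-", "1", "0"
    DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def apply_move(board, row, col, player):
        opponent = WHITE if player == BLACK else BLACK
        board = [r[:] for r in board]            # board: 8x8 list of lists
        board[row][col] = player
        for dr, dc in DIRS:
            run, r, c = [], row + dr, col + dc
            while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
                run.append((r, c))
                r, c = r + dr, c + dc
            # flip only if the run of opponent discs is closed off by our own disc
            if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
                for fr, fc in run:
                    board[fr][fc] = player
        return board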

two kinds of world models:

testbed: taxi rides on the Manhattan road network

transformer trained on sequences of taxi rides (pick-up and drop-off intersections, time, directions):

7283 1932 SW SW SW NE SE N N ... end
2919 4885 SW SW NE NE N S ... end

Training objective: predict the next token of each sequence (like language-model training). Evaluate the model's ability to generate new rides:

510 3982 <generate> end
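A minimal sketch of this setup, assuming some autoregressive `model` that maps token ids to per-position logits and a `vocab` dict (both hypothetical names, not the paper's code):

    import torch
    import torch.nn.functional as F

    # training: standard next-token cross-entropy over ride sequences,
    # exactly as in language-model training.
    def next_token_loss(model, tokens):                  # tokens: (batch, seq) ids
        logits = model(tokens[:, :-1])                   # predict token t from tokens < t
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    # evaluation: prompt with "pickup dropoff" and sample directions until "end".
    def generate_ride(model, pickup, dropoff, vocab, max_len=200):
        ids = [vocab[pickup], vocab[dropoff]]
        while len(ids) < max_len and ids[-1] != vocab["end"]:
            logits = model(torch.tensor([ids]))[0, -1]
            ids.append(int(torch.multinomial(F.softmax(logits, dim=-1), 1)))
        return ids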

The model looked good: the rides it generated looked like plausible routes between the requested intersections.

has the model discovered the world model for Manhattan?

taxi traversals obey a deterministic finite automaton (DFA): states are intersections, and the legal transitions out of each state are the directions you can drive from that intersection.

definition: a generative model recovers a DFA if every sequence it generates is valid in the DFA, and every sequence that is valid in the DFA can be generated by the model.

result: if a model always predicts legal next tokens, it has recovered the DFA.

suggests a test: measure how often a model's predicted tokens are valid.
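A sketch of both pieces, assuming the street map is given as a nested dict `dfa` mapping each intersection id to its legal directions and where they lead (illustrative names, not the paper's code):

    # states are intersections; from each one, only some compass directions are
    # legal, and each legal direction leads deterministically to one neighbour.
    # dfa = {intersection_id: {"NE": next_id, "SW": next_id, ...}, ...}

    def ride_is_valid(dfa, pickup, dropoff, directions):
        state = pickup
        for d in directions:
            if d not in dfa[state]:
                return False                             # illegal turn at this corner
            state = dfa[state][d]
        return state == dropoff                          # must actually reach the drop-off

    def next_token_validity(dfa, rides):
        """fraction of generated direction tokens that are legal where they occur"""
        legal = total = 0
        for pickup, _dropoff, directions in rides:
            state = pickup
            for d in directions:
                total += 1
                if d not in dfa[state]:
                    break                                # drove off the map; stop scoring
                legal += 1
                state = dfa[state][d]
        return legal / max(total, 1)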

single next tokens aren't enough to differentiate states: two prefixes that end at different intersections can still have the same set of legal next moves, so a model can pass the next-token test while conflating states.

new metrics motivated by going beyond next-token prediction: sequence compression (prefixes that end in the same state should accept exactly the same continuations) and sequence distinction (prefixes that end in different states should be separated by some continuation), each measured as a precision and a recall.
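The gist of the compression test, as I understand it (the paper's actual estimators are precision/recall versions of this; `model_accepts` is an assumed helper, e.g. thresholding the model's probability of a continuation):

    def end_state(dfa, pickup, directions):
        """where a prefix of directions ends up, according to the true map"""
        state = pickup
        for d in directions:
            state = dfa[state][d]
        return state

    def compression_disagreements(model_accepts, prefix_a, prefix_b, continuations):
        # prefix_a and prefix_b are rides that, per the DFA, end at the SAME
        # intersection. a model with the right world model should accept exactly
        # the same continuations after either; count where it disagrees.
        return sum(model_accepts(prefix_a + c) != model_accepts(prefix_b + c)
                   for c in continuations)

The distinction test is the mirror image: prefixes that end at different intersections should be separated by at least one continuation the model accepts after one but not the other.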

three kinds of training data:

this results in all models having next-token accuracy >99.9%, but their compression/distinction precision and recall are ~0. A true world model would score ~1.

why should we care about world models?

attempt to visualize the implicit world model (a reconstructed street map of Manhattan): the recovered map is full of edges that don't exist in the real street grid.

while these definitions and tests are specific to today's generative models, we've been here before:

but what if the model gets perfect predictions? is that always good enough?

Foundation model -> adaptation -> tasks

something that provides a good enough base structure to solve new tasks

tasks: question answering, sentiment analysis, information extraction, image captioning, object recognition, instruction following, etc.

no free lunch theorem for learning algorithms: every foundation model has inductive bias toward some set of functions (todo read: Wolpert and Macready, 1997)

world model: a restriction on the space of functions, to those that depend on the input only through an underlying state space

goal: test if a foundation model's inductive bias is towards a given world model

inductive bias probe: adapt a foundation model to many small datasets and test whether the functions it extrapolates to are consistent with the world model's state space

example: lattice (1-D state tracking). the foundation model shows a good inductive bias for small state spaces, but this worsens quickly as the number of states grows.
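A rough sketch of the probe on the lattice example (K states, tokens "L"/"R" moving left/right and clipping at the ends); `adapt` stands in for whatever small-data fine-tuning procedure is used, so everything here is illustrative rather than the paper's exact method:

    # the latent world model: position on a 1-D lattice of K states
    K = 5
    def lattice_state(seq):
        state = 0
        for tok in seq:
            state = max(0, min(K - 1, state + (1 if tok == "R" else -1)))
        return state

    def probe_consistency(model, adapt, train_seqs, test_seqs, target_fn):
        # adapt the foundation model to a small labeled dataset whose labels
        # depend only on the latent state ...
        predictor = adapt(model, [(s, target_fn(lattice_state(s))) for s in train_seqs])
        # ... then check: do held-out sequences that end in the same state get
        # the same prediction? a state-space inductive bias says they should.
        agree = total = 0
        for a in test_seqs:
            for b in test_seqs:
                if a != b and lattice_state(a) == lattice_state(b):
                    total += 1
                    agree += int(predictor(a) == predictor(b))
        return agree / max(total, 1)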

example: foundation model of planetary orbits: it predicts trajectories accurately, but when adapted to predict force vectors it does not recover Newton's law of gravitation.
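For reference, the ground-truth world model in this example is Newtonian gravity; a tiny helper for the force law the probe would hope to recover (the actual probing setup is the paper's and isn't reproduced here):

    import numpy as np

    G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

    def gravitational_force(m1, m2, pos1, pos2):
        """Newton's law: force on body 1 from body 2, F = G*m1*m2/r^2, along the line joining them."""
        diff = np.asarray(pos2, dtype=float) - np.asarray(pos1, dtype=float)
        r = np.linalg.norm(diff)
        return G * m1 * m2 / r**2 * (diff / r)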

so, what are inductive biases toward?

related ideas:

so far, we've taken a functional approach: evaluate models by their input-output behavior (todo read: Toshniwal et al., 2021; Patel and Pavlick, 2022; Treutlein et al., 2024; Yan et al., 2024)

Mechanistic approach: evaluate a model's inner workings.

Mechanistic Interpretability: Tools for understanding the internal mechanisms of neural networks

Comprehensive mechanistic understanding would make it easy to evaluate if models understand the world. How feasible is comprehensive understanding?

todo read: "The Dark Matter of Neural Networks" by Chris Olah.

"If you're aiming to explain 99.9% of a model's performance, there's probably going to be a long tail of random crap you need to care about" - Neel Nanda (Google DeepMind)

todo read: "Emergent world representations: exploring a sequence model trained on a synthetic task" - Kenneth Li et al. (Harvard) - uncovers evidence of an emergent nonlinear internal representation of the board state

todo read: "Actually, Othello-GPT has a linear emergent world representation" - Neel Nanda

todo read: "OthelloGPT learned a bag of heuristics" - jylin04, JackS, Adam Karvonen, Can (AI Alignment Forum)

related idea: use of world models in RL: predictive models of an environment's dynamics (todo read: Ha and Schmidhuber, 2018; Hafner, 2019; Guan, 2023; Genie 3 team, 2025; and many more)

So, we've seen:

Links to papers mentioned (shared by Keyon in a YouTube comment):