Science June 16, 2026

World Models Grew Up: AI Stopped Generating Scenes and Started Predicting Actions

NVIDIA's Cosmos 3, DeepMind's Project Genie, Waymo's driving simulator, and World Labs' spatial tools point to the same 2026 shift: models that predict actions a machine can execute, not just scenes a human can watch.

For most of the last three years, the public face of artificial intelligence has been a text box. You type, it talks back. That paradigm produced extraordinary tools, but it also hid a structural limitation that researchers have argued about for years: a system trained to predict the next word is very good at sounding right and surprisingly fragile at being right about the physical world. It can describe a falling glass without reliably predicting where the shards land.

In 2026, the industry's most interesting bet is on a different kind of model entirely - one designed not to generate sentences or even pretty video, but to predict what happens next when something acts on the world. The clearest signal came on June 1, when NVIDIA used its GTC Taipei keynote to launch Cosmos 3, which the company describes as an open "foundation model for Physical AI." The headline feature is not photorealism. It is action.

From Watchable To Actionable

The distinction sounds subtle and turns out to be everything. A video model generates frames that look plausible to a human eye. A world model, in the sense the field now uses the term, tries to learn the underlying dynamics of a scene - how objects move, how forces propagate, how an environment changes in response to an action - so it can predict future states rather than merely render them.
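To make the distinction concrete, here is a minimal Python sketch of what "predicting the next state" means. It is a toy, not any vendor's model: the State class, the predict_next_state function, and the point-mass physics are all assumptions chosen for illustration, with hand-written equations standing in for what a real world model would learn from data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    """Hypothetical scene state: a single object moving along a line."""
    position: float  # metres
    velocity: float  # metres per second


def predict_next_state(state: State, force: float,
                       dt: float = 0.1, mass: float = 1.0) -> State:
    """Stand-in for a learned dynamics model: next_state = f(state, action).

    A real world model would learn this mapping; here it is hand-written
    physics so the example stays runnable.
    """
    acceleration = force / mass
    velocity = state.velocity + acceleration * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(state: State, actions: List[float]) -> List[State]:
    """Predict a sequence of future states for a sequence of actions."""
    states = [state]
    for force in actions:
        state = predict_next_state(state, force)
        states.append(state)
    return states


# Same starting point, two different action sequences, two different futures.
start = State(position=0.0, velocity=0.0)
pushed_right = rollout(start, [1.0, 1.0])
pushed_left = rollout(start, [-1.0, -1.0])
```

The point of the sketch is the function signature. The prediction depends on the action, so the same starting scene can unfold in different ways, which is something a generator that only continues frames does not expose.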

NVIDIA leaned hard into that gap. According to the company and reporting from Axios, Cosmos 3 doesn't just output video; it generates robot action data - joint angles, gripper positions, and movement trajectories - the raw material needed to train a machine to physically do something. Ming-Yu Liu, who leads NVIDIA's Cosmos Lab, framed the difference plainly to Axios: the model is built to capture how machines move, not just how scenes look. NVIDIA describes the architecture as a "mixture-of-transformers" that pairs an autoregressive reasoning component with a diffusion-based generator, and says the model was trained on roughly 20 trillion tokens of multimodal data spanning images, real and synthetic video, audio, text, and recorded action from humans and robots. Those are company-reported figures; independent benchmarks will take time.
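For readers who have not worked with robot data, the sketch below shows roughly what "action data" can look like as a data structure. The schema is hypothetical: the ActionStep fields, the units, and the max_joint_speed helper are assumptions for illustration, not the format Cosmos 3 emits.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ActionStep:
    """One timestep of a hypothetical robot-arm trajectory."""
    t: float                         # seconds since the start of the motion
    joint_angles: Tuple[float, ...]  # radians, one entry per joint
    gripper_opening: float           # 0.0 = fully closed, 1.0 = fully open


def max_joint_speed(trajectory: List[ActionStep]) -> float:
    """Return the fastest joint speed (rad/s) between consecutive steps.

    A check like this is one simple way to reject generated trajectories
    that a physical arm could not follow.
    """
    fastest = 0.0
    for prev, curr in zip(trajectory, trajectory[1:]):
        dt = curr.t - prev.t
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        for before, after in zip(prev.joint_angles, curr.joint_angles):
            fastest = max(fastest, abs(after - before) / dt)
    return fastest


# A short two-joint reach that ends with the gripper closing.
reach = [
    ActionStep(t=0.0, joint_angles=(0.0, 0.0), gripper_opening=1.0),
    ActionStep(t=0.5, joint_angles=(0.1, -0.2), gripper_opening=1.0),
    ActionStep(t=1.0, joint_angles=(0.3, -0.2), gripper_opening=0.0),
]
peak_speed = max_joint_speed(reach)
```

Unlike a video frame, every field here is something a controller can execute or a training loop can use as a target, which is why this kind of output matters more to roboticists than visual polish.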

The practical pitch is about closing a data gap that has quietly throttled robotics. Real-world training data for rare, dangerous events - a robot collision, an unusual road hazard - is expensive or unsafe to collect. A world model can synthesize those scenarios on demand. NVIDIA claims this can compress certain robot training and evaluation cycles "from months to days." Treat that timeline as a vendor projection, but the direction is the point: simulation that a robot can learn to act from, not just footage a person can watch.

NVIDIA Is Not Alone

What makes this more than a single product launch is that three other credible players in AI are converging on the same idea from different directions.

Google DeepMind's Genie 3 was unveiled as a general-purpose world model that generates interactive environments in real time; the company cites around 24 frames per second, with consistency holding for a few minutes. It moved from a tightly held research preview into wider testing on January 29, when Google opened "Project Genie" to Google AI Ultra subscribers in the U.S. It remains a staged rollout, not open availability, but the trajectory toward a usable product is unmistakable.

The most concrete proof point may be Waymo. In February, the company introduced the Waymo World Model, built on Genie 3 and adapted for driving, to generate photorealistic simulations complete with multi-sensor outputs like camera and lidar. Engineers can summon long-tail edge cases that are nearly impossible to collect on real roads - Waymo's examples reportedly ranged from tornadoes to an elephant in the road - and can turn ordinary dashcam footage into an interactive scenario with the weather or traffic changed. That is a world model already doing load-bearing work inside a safety-critical product, not a demo reel.

And at the startup frontier, Fei-Fei Li's World Labs has spent the past several months turning "spatial intelligence" from a slogan into an interface. Its Marble system reached general availability late in 2025, letting users generate, edit, and export explorable 3D worlds, and in January the company opened a World API so developers can produce navigable 3D scenes from text, images, panoramas, or video. The framing is consistent with everyone else's: build models that understand space and dynamics, then make that understanding programmable.

Why This Answers A Years-Old Complaint

It helps to remember what world models are reacting against. Critics of pure language models - Yann LeCun chief among them - have long argued that next-token prediction does not force a system to learn a persistent, causal model of reality. An LLM can imitate the form of reasoning without building the grounded internal representation that supports planning, object permanence, or counterfactual "what if I push this" thinking. The benchmarks that expose this are exactly the ones that require state tracking, 3D consistency, and long-horizon planning - the places where text-trained systems produce answers that are locally plausible and globally incoherent.

World models are a direct architectural response to that critique. Instead of asking a network to predict the next word, you ask it to predict the next state of an environment. Do that well, and you get something an agent can plan against and a robot can act on. It is not a replacement for language models so much as the missing half - the part that knows how the world behaves when no one is narrating it.
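What "plan against" means can be shown with a few lines of Python. This is a minimal sketch under stated assumptions: predict_next_state is a hand-written stand-in for a learned model, and plan uses brute-force search over a handful of discrete forces, a deliberately simple technique rather than the method any lab named here uses.

```python
from itertools import product
from typing import Sequence, Tuple


def predict_next_state(position: float, velocity: float, force: float,
                       dt: float = 0.1) -> Tuple[float, float]:
    """Stand-in for a learned world model: predicts the next (position, velocity)."""
    velocity = velocity + force * dt
    position = position + velocity * dt
    return position, velocity


def plan(position: float, velocity: float, goal: float,
         horizon: int = 3,
         forces: Sequence[float] = (-1.0, 0.0, 1.0)) -> Tuple[Tuple[float, ...], float]:
    """Try every action sequence in imagination and keep the best one.

    Returns the sequence whose predicted final position lands closest to
    the goal, along with the remaining distance.
    """
    best_sequence: Tuple[float, ...] = ()
    best_distance = float("inf")
    for sequence in product(forces, repeat=horizon):
        p, v = position, velocity
        for force in sequence:
            p, v = predict_next_state(p, v, force)
        distance = abs(goal - p)
        if distance < best_distance:
            best_sequence, best_distance = sequence, distance
    return best_sequence, best_distance


# Nothing moves in the real world here; every candidate is only simulated.
best_actions, remaining = plan(position=0.0, velocity=0.0, goal=10.0)
```

The agent never touches the real environment while choosing. It evaluates candidate futures inside the model and commits only to the best one, which is also why an inaccurate model leads directly to bad decisions.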

The Catch Worth Keeping In View

None of this is solved. World models inherit hard problems: they can hallucinate physics just as confidently as a chatbot hallucinates citations, and a simulation that is subtly wrong can teach a robot the wrong lesson at scale. A claim like "state-of-the-art physics accuracy" deserves outside verification, and the gap between an impressive keynote and a robot reliably stocking a warehouse shelf is still wide. The open-versus-closed split matters too: NVIDIA is positioning Cosmos 3 as open for developers to customize, while DeepMind's most capable world model arrives gated behind a premium subscription. Those are different bets about who gets to build on this layer.

But step back and the shape of 2026 is clearer than it was even six months ago. The frontier is quietly migrating from systems that describe the world to systems that model it well enough to act. If the last era of AI taught machines to talk, this one is teaching them to move - and the difference between those two verbs may turn out to be the whole game.

So the question to sit with isn't whether AI can generate a convincing video of a robot picking up a cup. It's whether the model knows enough about cups, hands, and gravity to make a real robot do it - on the first try, in a kitchen it has never seen. That is the bar world models are now openly aiming at, and 2026 is the year they started clearing the warm-up height.

Sources

NVIDIA newsroom, Cosmos 3 announcement at GTC Taipei, June 1, 2026; Axios reporting on Cosmos 3 and NVIDIA's physical AI push, June 1, 2026.

Google DeepMind, Genie 3 world model research update; Google Blog, Project Genie rollout to Google AI Ultra subscribers in the U.S., January 29, 2026.

Waymo, The Waymo World Model, February 2026; Bloomberg reporting on Waymo's world-model simulation work, February 6, 2026.

World Labs, Marble world model general availability and World API launch materials.
