
AI Cinema: Are we ready for “World + Observer”?

What the keynote by Xi Wang, Hi! PARIS Chair, reveals about the real future of filmmaking with AI

From generative video and diffusion models to “text-to-movie” promises, the last two years have flooded the public conversation with a single impression: cinema is about to be automated.

Yet if there is one idea that the keynote by Xi Wang (Hi! PARIS Chair, Assistant Professor at École polytechnique, IP Paris) made clear at the Computer Vision Workshop, it is this: cinema is not simply about generating frames. It is about structuring meaning through what happens and how it is observed.

His talk, “The Past, the Present, and the Next: Are We Ready for AI Cinema?”, proposed a deceptively simple formula that reorients the entire debate:

Narrative = World + Observer

Cinema is not only a sequence of frames. It is a world unfolding through an observer: action perceived through a camera, and storytelling shaped by movement, framing, lens behavior, rhythm, and editing choices. The observer is not a neutral device. It is what makes cinema cinematic.

If AI cinema is ever to be reliable, deployable, and socially acceptable, it cannot be a black box. It must be controllable, interpretable, and aligned with human intention. Spectacular output alone is not a roadmap.

Why bigger models may not be the answer

Public attention increasingly gravitates toward scale. The implicit belief is that sufficiently large models, trained on sufficiently large datasets, will eventually solve cinema as a “generation problem.”

Wang argued that this assumption misses the core of filmmaking. In cinema, progress is not measured only by realism, detail, or resolution. It is measured by control. A director does not want randomness packaged as creativity. They want the ability to guide intent, maintain continuity, and iterate with confidence. In that sense, the demands of cinema resemble the demands of other high-stakes environments: a model must be dependable, steerable, and consistent.

This is why the keynote positioned AI cinema not as a single monolithic generator but as a cinema machine: a structured set of capabilities that can support direction rather than replace it.

Learning to film: the “past” of AI cinema

Wang first looked back at earlier approaches that learn camera behavior from video data without generating full worlds from scratch. The key idea is not to ask a model to invent cinema, but to learn the cinematic logic already present in film.

This work extracts meaningful features from video, capturing signals from character motion and scene dynamics. These signals can include optical flow and other representations of movement. The extracted features are then used to condition a motion controller that generates camera behavior in a controlled environment.
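The feature-to-controller pipeline described above can be illustrated with a toy sketch. It is not the method presented in the talk: tracked 2D character positions stand in for optical flow, a hand-written smoothing rule stands in for the learned motion controller, and every name and parameter (`motion_features`, `camera_track`, `lead`, `smoothing`) is hypothetical.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def motion_features(positions: List[Vec2], dt: float) -> List[Vec2]:
    """Finite-difference velocities of a tracked character.

    Assumption: this is a simple stand-in for richer motion signals
    such as optical flow.
    """
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


def camera_track(
    positions: List[Vec2],
    dt: float,
    lead: float = 0.5,       # hypothetical: how far ahead of the motion to aim
    smoothing: float = 0.5,  # hypothetical: fraction of the gap closed per step
) -> List[Vec2]:
    """Condition a camera path on the extracted motion features.

    The camera aims slightly ahead of the character along its velocity
    and moves toward that target with exponential smoothing.
    """
    velocities = motion_features(positions, dt)
    camera = positions[0]
    path = [camera]
    for (px, py), (vx, vy) in zip(positions[1:], velocities):
        target = (px + lead * vx, py + lead * vy)
        camera = (
            camera[0] + smoothing * (target[0] - camera[0]),
            camera[1] + smoothing * (target[1] - camera[1]),
        )
        path.append(camera)
    return path
```

The point of the sketch is the separation of stages: features are extracted first, and the controller only sees those features, so its behavior can be tuned or swapped without touching the analysis step.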

This is a fundamental distinction. Rather than end-to-end generation from nothing, the system learns from examples and transfers style. What matters is not merely what the system produces, but how it can be controlled and directed.

Keyframes as control: why filmmaking needs constraints

A central theme of Wang’s presentation was keyframe-based control. Instead of relying on free-form generation, keyframes define anchor points and constraints that guide the model’s behavior. The system learns how to transition and interpolate between these states in a way that remains consistent with cinematic objectives.
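To make the idea of keyframes as constraints concrete, here is a minimal sketch. It assumes plain linear interpolation between keyframes, whereas the systems discussed in the talk learn these transitions. The `CameraKeyframe` fields and the `interpolate_camera` function are hypothetical names chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CameraKeyframe:
    time: float                           # seconds from the start of the shot
    position: Tuple[float, float, float]  # camera position in scene units
    focal_length: float                   # millimetres


def lerp(a: float, b: float, t: float) -> float:
    """Linear blend between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_camera(keyframes: List[CameraKeyframe], time: float) -> CameraKeyframe:
    """Camera state at `time`, constrained to pass through every keyframe.

    Assumption: linear interpolation stands in for a learned transition
    model. Times outside the keyframe range clamp to the nearest keyframe.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    frames = sorted(keyframes, key=lambda k: k.time)
    if time <= frames[0].time:
        return frames[0]
    if time >= frames[-1].time:
        return frames[-1]
    # Index of the first keyframe strictly after `time`.
    i = bisect_right([k.time for k in frames], time)
    prev, nxt = frames[i - 1], frames[i]
    t = (time - prev.time) / (nxt.time - prev.time)
    position = tuple(lerp(p, n, t) for p, n in zip(prev.position, nxt.position))
    return CameraKeyframe(time, position, lerp(prev.focal_length, nxt.focal_length, t))
```

Whatever replaces the linear blend, the contract stays the same: the director fixes the anchor states, and the system is only free to decide what happens between them.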

This is not only a technical design choice. It is a view of creative AI that aligns closely with responsibility. When control is lost, authorship becomes unclear, accountability becomes fuzzy, and creative work becomes harder to trust. When control is explicit, AI becomes a tool that can support human direction, rather than a system that unpredictably produces content.

Xi Wang (École polytechnique, Institut Polytechnique de Paris) at the Computer Vision Workshop

The “present”: toward generative cinema

From this foundation, Wang traced the evolution toward what he described as the present: generative cinema. Here the object of learning becomes broader, covering both motion and camera control, including the stylistic choices that make camera behavior cinematic rather than purely mechanical.

A key step in this transition is the rise of diffusion-based text-to-trajectory generation. He highlighted the importance of moving beyond synthetic training setups toward methods that learn from real film data. This shift matters because cinema is not only motion. It is culture, convention, and visual grammar. Those priors are embedded in film itself.

One of the most meaningful changes presented in his talk is joint generation. Rather than treating camera motion as a secondary layer placed on top of an already generated world, recent work aims to generate camera motion and character motion together. This improves coherence because the camera is not independent from the story. It follows action, anticipates movement, and positions itself to convey narrative meaning.

The camera is not a parameter: it is the observer

Perhaps the most distinctive aspect of the talk was its insistence that the observer must become a first-class research component. In a world captivated by video generators, camera behavior is often treated as a detail. Wang treated it as central.

The observer encompasses not only motion but also camera-specific properties and cinematic controls, such as changes in zoom, optical characteristics, and stylistic camera behavior. This view leads to a more precise definition of what “cinematic control” actually means: the ability to direct how a scene is seen, not only what exists in it. The implication is clear. AI cinema should not be reduced to “content generation.” It is the construction of an observer that can express film language.
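One way to picture the observer as a first-class object is a small data structure that holds camera-specific properties separately from the scene. The sketch below is illustrative only: the `Observer` class and its fields are hypothetical, the 36 mm sensor width is an assumed default, and the field-of-view calculation uses the standard pinhole approximation for a lens focused at infinity.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observer:
    position: Tuple[float, float, float]  # where the camera stands
    look_at: Tuple[float, float, float]   # what it points at
    focal_length_mm: float                # zoom is a change in this value
    sensor_width_mm: float = 36.0         # assumption: full-frame sensor width

    def horizontal_fov_degrees(self) -> float:
        """Horizontal field of view under the pinhole approximation."""
        half_angle = math.atan(self.sensor_width_mm / (2.0 * self.focal_length_mm))
        return math.degrees(2.0 * half_angle)


# Same world, same camera placement, two different ways of seeing it.
wide = Observer((0.0, 1.6, -5.0), (0.0, 1.6, 0.0), focal_length_mm=18.0)
tele = Observer((0.0, 1.6, -5.0), (0.0, 1.6, 0.0), focal_length_mm=85.0)
print(round(wide.horizontal_fov_degrees(), 1))
print(round(tele.horizontal_fov_degrees(), 1))
```

Nothing in the scene changes between the two observers. Only the focal length differs, yet the 18 mm lens takes in a 90-degree view while the 85 mm lens isolates a narrow slice of it, which is the sense in which control over the observer is control over how a scene is seen.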

What is missing: “probably everything”

World models remain slow and computationally heavy. Physical plausibility and causality are still unreliable. Training agent-like systems remains difficult. And perhaps most importantly, capturing nuanced directing cues and implicit cinematic knowledge remains an unsolved problem.

Xi Wang also questioned whether language is sufficient as the interface for direction. Even if we can instruct systems in text, cinema depends on more than verbal intent. It depends on taste, rhythm, editing, and emotional judgment. These are not optional components. They are the substance of filmmaking.

The “next”: world models and AI agents

The future of AI cinema lies at the intersection of two research pillars: world models and AI agents.

World models aim to represent environments with memory, causality, and coherent long-term behavior. AI agents represent a shift in the observer itself, from programmable or conditionable control systems toward intelligent, more spontaneous decision-making, with planning and execution that can follow high-level instructions, learn through interaction, and align with human objectives.

If narrative is built from a world and an observer, the future of AI cinema is not a single generator. It is the integration of a cinematic world model with a cinematic observer agent.

A more realistic definition of readiness

Are we ready for AI cinema? Short-form content may be approaching feasibility. High-quality long-form cinema is not imminent.

For Hi! PARIS, this is the most valuable takeaway. It allows us to move beyond spectacle and toward what matters: controllability, interpretability, and human agency. AI cinema, as Xi Wang presented it, is not about handing creativity over to machines. It is about building systems that enable new forms of direction, while keeping responsibility where it belongs: with humans.