sources:
1. OpenAI's blog piece: Video generation models as world simulators
2. DiTs (Diffusion Transformers): Scalable Diffusion Models with Transformers
SORA
Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
This is so far the most contentious point about SORA: whether it is "learning" physics and can generate reliable simulations.
From the underlying ML training mechanism, e.g. gradient descent, we know for a fact that the model is not learning the physical interactions of objects in an explicit, logical manner; but it does not need to.
The essence of "learning physics" is being able to predict what will happen next with consistent precision and accuracy. As humans, limited in memory and computational power, we resort to logical breakdown and abstraction of the physical world in order to understand it. While logical reasoning is perhaps the most intuitive and efficient way to learn physics, the physically plausible snippets generated by SORA show that it is not the only way: a data-driven physics engine, rather than one driven by explicit logical or analytical algorithms, is indeed possible.
Now we follow OpenAI's blog post and look at (1) how SORA unifies different types of visual data for training, and (2) the capabilities and limitations of the model.
Turning visual data into patches
We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.
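To make the patching step concrete, here is a minimal sketch (my own illustration, not OpenAI's code) of how a compressed latent video tensor could be decomposed into flattened spacetime patches. The function name `spacetime_patchify`, the patch sizes, and the channel count are assumptions chosen only for the example.

```python
import torch

def spacetime_patchify(latents, patch_t=2, patch_h=2, patch_w=2):
    """Decompose a latent video into flattened spacetime patches.

    latents: (B, C, T, H, W) tensor produced by a (hypothetical)
    video compression network; T, H, W must be divisible by the
    corresponding patch sizes.
    Returns: (B, N, D) with N = (T/pt)*(H/ph)*(W/pw) patches and
    D = C*pt*ph*pw features per patch, i.e. a token sequence that a
    transformer can consume.
    """
    B, C, T, H, W = latents.shape
    x = latents.reshape(
        B, C, T // patch_t, patch_t, H // patch_h, patch_h, W // patch_w, patch_w
    )
    # Group the patch-index axes together, then the within-patch axes.
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # (B, T', H', W', C, pt, ph, pw)
    return x.reshape(B, -1, C * patch_t * patch_h * patch_w)

# Example: a 16-frame latent clip at 32x32 spatial resolution with 8 channels
tokens = spacetime_patchify(torch.randn(1, 8, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 64])
```

The reverse mapping (tokens back to a latent video) is just the inverse reshape/permute, which is what lets the same transformer handle videos and images of different durations, resolutions and aspect ratios as variable-length token sequences.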
Video de/compression network(s)
We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
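OpenAI does not disclose the architecture of this compression network. Purely as a sketch, assuming a simple 3D-convolutional autoencoder, the interface described above (raw video in, temporally and spatially compressed latent out, plus a decoder back to pixel space) could look like the toy module below; every layer choice, stride, and channel count is an assumption, not Sora's actual design.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Toy spatiotemporal autoencoder: raw video -> compressed latent -> video.

    Not Sora's network (which is unpublished); it only illustrates the
    interface from the blog: an encoder that compresses in both time and
    space, and a decoder mapping latents back to pixel space.
    """
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # stride (2, 4, 4): 2x temporal and 4x spatial downsampling per layer
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_ch, kernel_size=3, stride=(2, 4, 4), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 64, kernel_size=(2, 4, 4), stride=(2, 4, 4)),
            nn.SiLU(),
            nn.ConvTranspose3d(64, in_ch, kernel_size=(2, 4, 4), stride=(2, 4, 4)),
        )

    def forward(self, video):    # video: (B, 3, T, H, W)
        z = self.encoder(video)  # latent: (B, latent_ch, T/4, H/16, W/16)
        return self.decoder(z), z

ae = VideoAutoencoder()
recon, z = ae(torch.randn(1, 3, 16, 128, 128))
print(z.shape, recon.shape)  # (1, 8, 4, 8, 8) and (1, 3, 16, 128, 128)
```

In this reading, the diffusion transformer would be trained entirely on latents like `z`, with `spacetime_patchify` from the previous section turning each latent clip into the token sequence the transformer operates on.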