Oasis: an interactive, explorable world model
We're excited to release Oasis, the first playable AI model that generates open-world games. Unlike many AI video models, which generate video from text, Oasis generates video frame-by-frame from keyboard and mouse inputs. Oasis is the first model in a research partnership between Etched and Decart, a new AI lab focused on building generative experiences.
We designed Oasis’ architecture to be highly optimized for Sohu, our upcoming Transformer ASIC. Oasis features a Diffusion Transformer backbone, a new Transformer-based autoencoder, and more. You can learn more about the model architecture by checking out our technical report, model weights, code, and playable demo.
Today, you can play the Oasis model on H100s at 360p resolution. On Sohu, we can serve next-generation 100B+ parameter models in 4K video, scaling to >10x more users than on H100s.
The future of the internet is interactive video
Within a decade, we believe a majority of internet content will be AI-generated. Currently, more than 70% of internet traffic is video, from social media to video calls to streaming. Video is also incredibly compute-intensive: generating video requires >10x more FLOPs than generating text or images. Thus, we believe a majority of AI inference workloads will be video.
As video models start scaling, they are learning to represent entire physical worlds and games, enabling entirely new product categories. Whether gaming, generative content, or education, we believe that large, low-latency, interactive video models will be central to the next wave of AI products.
Today, interactive video models are too slow and expensive to run in production. With specialized chips like Sohu, we can run video models in high definition, at playable frame rates, and for many simultaneous users: all requirements to unlock these new use cases at scale.
Building a new interactive architecture
We ran hundreds of architectural and data experiments to identify the best architecture for fast, autoregressive, interactive video generation. We chose a Transformer-based architecture (surprise), featuring a Transformer-based variational autoencoder, an accelerated causal axial spatiotemporal attention mechanism, and new strategies to overcome model divergence over long sequences.
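To make the attention pattern concrete, here is a minimal PyTorch sketch of what an axial, causal spatiotemporal attention block can look like: full bidirectional attention among the tokens within each frame, and causal attention across frames. The module, names, and shapes are illustrative assumptions, not Oasis’ actual implementation.

```python
import torch
import torch.nn as nn

class AxialSpatiotemporalAttention(nn.Module):
    """Axial attention: bidirectional within a frame, causal across frames.
    `dim` must be divisible by `heads`; norms and MLPs are omitted for brevity."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim), where space = H*W latent tokens per frame
        b, t, s, d = x.shape

        # Spatial attention: every token attends to all tokens in its own frame.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)

        # Temporal attention: each spatial position attends only to the same
        # position in earlier frames; the causal mask keeps generation autoregressive.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        xt, _ = self.temporal(xt, xt, xt, attn_mask=mask)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x
```

Factorizing attention along the spatial and temporal axes this way replaces one quadratic pass over all T·S tokens with two much cheaper passes, a standard trade-off for making long video sequences tractable.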
Unlike OpenAI’s Sora, which generates video from text in clips of up to 60 seconds, Oasis generates one frame at a time. This makes Oasis highly steerable, letting users control the generation with their inputs.
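A minimal sketch of what this frame-by-frame loop looks like in practice; `model.generate_frame` and `poll_user_input` are hypothetical stand-ins for the real components, but the structure, conditioning each new frame on the history plus the latest user action, is the key difference from clip-at-once generation.

```python
def play(model, poll_user_input, num_frames: int):
    frames = []  # generated latent frames so far
    for _ in range(num_frames):
        action = poll_user_input()  # e.g. keys pressed, mouse deltas
        # Denoise a single new frame, conditioned on the history and the action.
        next_frame = model.generate_frame(context=frames, action=action)
        frames.append(next_frame)
        yield next_frame  # decoded and displayed in real time
```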
Oasis demonstrates physical understanding, allowing players to break blocks, build structures, and explore the game. As we scale future generations of the model, we’re excited to explore multi-minute context lengths, deeper world models, and, eventually, a transition beyond gaming into fully interactive multimodal video generation.
The demo contains many more small innovations, such as the use of dynamic noise at inference time to increase stability, and optimized inference kernels. If you’re interested, learn more in our technical report or check out the weights on HuggingFace.
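As one illustration, here is a minimal sketch of one plausible reading of dynamic noising: lightly re-noising the conditioning frames at inference time so the model never sees a perfectly clean (and slightly off-distribution) history, which damps the accumulation of autoregressive errors. The blend schedule below is invented for illustration; see the technical report for the actual method.

```python
import torch

def renoise_context(frames: torch.Tensor, noise_level: float = 0.1) -> torch.Tensor:
    # frames: (time, space, dim) latent history; blend in a small amount
    # of Gaussian noise before using the history as conditioning.
    noise = torch.randn_like(frames)
    return (1 - noise_level) * frames + noise_level * noise
```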
Sohu makes real-time AI video work at scale
While today’s text-to-video models generate great videos, they are extremely slow. The best models average less than one frame per second and can cost up to $1 per minute for each user.
Video models run poorly on GPUs because they are extremely compute-intensive: each frame contains hundreds or thousands of tokens, which must be processed in parallel many times over to fully denoise the frame. This is exactly the problem Sohu was designed to solve.
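A rough back-of-envelope calculation shows the scale involved. All numbers below are illustrative assumptions, not measurements of Oasis:

```python
tokens_per_frame = 1_000        # latent tokens in one frame
denoising_steps  = 20           # parallel passes needed to denoise a frame
params           = 100e9        # a next-generation 100B-parameter model
flops_per_token  = 2 * params   # ~2 FLOPs per parameter per token (forward pass)

flops_per_frame = tokens_per_frame * denoising_steps * flops_per_token
print(f"{flops_per_frame:.2e} FLOPs per frame")  # ~4e15, i.e. petaFLOPs per frame
```

Under these assumptions, a single user at a playable frame rate demands tens of petaFLOPs per second, which is why batch efficiency and utilization dominate the economics.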
Sohu can parallelize large models at large batch sizes very efficiently, supporting them at 4K resolution.
Sohu can also serve an order of magnitude more concurrent users, enabling generative video models to be deployed globally.
Building models and hardware together
Soon, AI models and products will be co-designed with custom chips. While we're early in Sohu's development, we're excited to explore new research directions and products that become far faster, cheaper, and more feasible on it. We’re particularly interested in real-time video, voice, inference-time reasoning, agents, search, and more.
This is our first public research partnership in this direction, and we’d like to thank Decart for their collaboration on the project. Their inference engine was crucial to running Oasis in real time on GPUs; you can try it for yourself in our live demo.
If you’re interested in early access to Sohu, compute grants for inference and Sohu-related research, or research partnerships, please fill out the form here.