Can Test-Time Scaling Improve World Foundation Model?

Abstract

World foundation models (WFMs), which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models typically require substantial computational resources for pretraining and are further constrained by limited data availability during post-training.

To address this, we introduce SWIFT, a test-time scaling framework specifically designed for WFMs. Rather than relying on retraining or enlarging the model, SWIFT focuses on scaling computation during inference. It integrates an extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search.

Empirical results on the COSMOS model demonstrate that test-time scaling is not only feasible but can also follow compute-optimal trends. Our findings reveal that test-time scaling laws hold for WFMs, and that SWIFT offers a practical and scalable pathway to improve inference performance without modifying the model itself.

Key Contributions:

WFM Evaluation Toolkit: A modular and extensible toolkit with metrics for evaluating WFMs across diverse aspects.
SWIFT Framework: The first test-time scaling framework for WFMs, featuring a fast tokenizer, Top-K pruning, and beam search for efficient inference.
Empirical Insights: SWIFT enables smaller models to rival or outperform larger ones under the same compute budget by leveraging test-time scaling.

Evaluation Toolkit

Evaluation Toolkit for WFMs: We introduce a modular and extensible evaluation toolkit for world foundation models (WFMs) that supports diverse domains. It will be open-sourced to promote reproducibility and further research.

SWIFT

We adopt autonomous driving as our primary testbed for demonstrating test-time scaling. It is an ideal yet challenging domain, with high demands for realism, diversity, and efficiency, and it aligns with COSMOS’s target use case.

We use rule-based rewards for robustness and extensibility, since they consistently outperform preference-based ones.

Test-Time Scaling Exists in WFMs, even with naive best-of-N: simply sampling more candidates improves quality.
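As a minimal sketch of best-of-N selection, the snippet below draws N candidate rollouts and keeps the one with the highest reward. The names `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical stand-ins, not part of SWIFT or COSMOS; a real setup would replace them with WFM sampling and a rule-based reward.

```python
import random
from typing import Callable, List, Tuple

Rollout = List[float]


def best_of_n(
    generate: Callable[[random.Random], Rollout],
    reward: Callable[[Rollout], float],
    n: int,
    seed: int = 0,
) -> Tuple[Rollout, float]:
    """Draw n candidates and return the highest-reward one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scored = [(reward(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-in for sampling a rollout from a world model.
def toy_generate(rng: random.Random) -> Rollout:
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


# Hypothetical rule-based reward: smoother rollouts score higher.
def toy_reward(rollout: Rollout) -> float:
    return -sum(abs(b - a) for a, b in zip(rollout, rollout[1:]))
```

With a fixed seed, the candidates drawn for a larger N include those drawn for a smaller N, so the selected reward in this sketch never decreases as N grows.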

Test-Time Scaling Is Surprisingly Compute-Optimal: a 4B model with test-time scaling can match or exceed a 12B model under the same FLOPs.
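To make the compute-matched comparison concrete, here is a rough back-of-the-envelope sketch. It assumes the common approximation that a forward pass costs about 2 FLOPs per parameter per generated token, and it ignores reward-model and decoding overhead; the page does not state how FLOPs were actually matched, so the function names and the accounting are illustrative only.

```python
def forward_flops(params: int, tokens: int) -> int:
    """Approximate forward-pass cost as 2 * params * tokens (an assumption)."""
    return 2 * params * tokens


def affordable_samples(small_params: int, large_params: int, tokens: int) -> int:
    """Number of small-model samples that fit in one large-model sample's budget."""
    budget = forward_flops(large_params, tokens)
    per_sample = forward_flops(small_params, tokens)
    return budget // per_sample


# Under this approximation, a 4B model can draw 3 samples
# for the cost of a single 12B sample of the same length.
samples = affordable_samples(4_000_000_000, 12_000_000_000, tokens=1000)
```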

Because decoding with diffusion models is costly, SWIFT uses a fast tokenizer decoder to accelerate the decision process.

Probability-based top-K pruning introduces controlled exploration and performs better than both top-1 sampling and deterministic top-K sorting.
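One plausible reading of the contrast above can be sketched as follows: deterministic top-K keeps the K highest-scoring candidates, while a probability-based variant samples K distinct candidates with probability proportional to a softmax over their scores. The softmax weighting, the `temperature` parameter, and the function names are assumptions for illustration; the page does not specify the exact sampling rule SWIFT uses.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def probabilistic_top_k(
    scores: Sequence[float],
    k: int,
    rng: random.Random,
    temperature: float = 1.0,
) -> List[int]:
    """Sample k distinct indices without replacement, weighted by softmax(score)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(k):
        probs = softmax([scores[i] for i in remaining], temperature)
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen
```

In this sketch a lower `temperature` moves the sampled set toward the deterministic top-K, while a higher one spreads selection across lower-scoring candidates.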

SWIFT adapts test-time scaling to WFMs practically and efficiently by combining fast decoding, probability-based top-K pruning, and beam search.
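The process-level search can be illustrated with a small beam search sketch: at every step each surviving partial rollout proposes several continuations, all candidates are scored, and only a fixed number are kept. For brevity the pruning step here simply keeps the highest-scoring candidates; a probability-based selection could be substituted at that line. `beam_search`, `toy_propose`, and `toy_score` are hypothetical names, and the integer "tokens" stand in for whatever units a real WFM generates.

```python
import random
from typing import Callable, List

Partial = List[int]


def beam_search(
    propose: Callable[[Partial, random.Random], int],
    score: Callable[[Partial], float],
    steps: int,
    beam_width: int,
    branch: int,
    seed: int = 0,
) -> Partial:
    """Grow rollouts step by step, keeping the beam_width best partial rollouts."""
    rng = random.Random(seed)
    beams: List[Partial] = [[]]
    for _ in range(steps):
        # Each beam proposes `branch` possible next steps.
        expanded = [b + [propose(b, rng)] for b in beams for _ in range(branch)]
        # Pruning step: score partial rollouts and keep the best ones.
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Hypothetical stand-in for sampling the next chunk from a world model.
def toy_propose(prefix: Partial, rng: random.Random) -> int:
    return rng.randint(0, 9)


# Hypothetical reward on a partial rollout.
def toy_score(partial: Partial) -> float:
    return float(sum(partial))
```

Scoring partial rollouts at every step is what makes a cheap decoder valuable in this kind of loop, since the reward is evaluated `beam_width * branch` times per step rather than once per finished sample.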

Human Study: outputs from the smaller model enhanced with test-time scaling are often preferred by human raters over those from the larger baseline.

Video Demos

Left: COSMOS-4B | Right: COSMOS-4B With SWIFT

Left: COSMOS-5B | Right: COSMOS-5B With SWIFT

The video captures a moment at an intersection in a suburban area. A white Mazda is prominently seen from behind, waiting at a green traffic light. The intersection is relatively empty with a few cars visible in the distance. The surroundings include trees and buildings, suggesting a residential or commercial neighborhood. The sky is clear with some clouds, indicating good weather conditions.

The video depicts a sunny day on a suburban street. Several cars are seen driving on a multi-lane road, with a white sedan in the foreground. The street is lined with trees and landscaped medians featuring pink flowering plants. Residential houses and utility poles are visible on the sides of the road. The sky is clear and blue, indicating good weather.
