🏆 World-in-World: Building a Closed-Loop World Interface to Evaluate World Models
Video World Models Leaderboard
Measuring performance on task success and generation quality
Zero-Shot Video Models
Post-Trained Video Models
Performance Rankings
| Rank | Model | Organization | Category | Task Success | Gen Quality |
| --- | --- | --- | --- | --- | --- |
| 1 | Wan2.1† | Alibaba | Video Generation | 62.6% | 0.380 |
| 2 | SVD† | Stability AI | Video Generation | 61.0% | 0.363 |
| 3 | Cosmos† | Nvidia | Video Generation | 60.3% | 0.360 |
| 4 | Wan2.1 | Alibaba | Video Generation | 58.3% | 0.478 |
| 5 | Hunyuan | Tencent | Video Generation | 57.7% | 0.396 |
| 5 | SVD | Stability AI | Video Generation | 57.7% | 0.371 |
| 7 | SE3DS | Open Source | 3D Scene | 57.5% | 0.365 |
| 7 | LTX† | Lightricks | Video Generation | 57.5% | 0.340 |
| 9 | NWM | Meta | World Models | 57.4% | 0.325 |
| 10 | Pathdreamer | Open Source | Pano Path Generation | 57.0% | 0.339 |
| 11 | Wan2.2† | Alibaba | Video Generation | 56.3% | 0.380 |
| 12 | Cosmos | Nvidia | Video Generation | 55.4% | 0.480 |
Task Success Rate measures completion of video generation objectives.
Generation Quality measures visual fidelity and coherence.
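The rank column above is consistent with standard competition ranking on Task Success, where tied scores share a rank and the next rank is skipped (5, 5, 7, 7, 9). The page does not state its ranking rule, so this is an assumption. The sketch below reproduces the ranks from the published Task Success figures; `RESULTS` and `competition_ranks` are illustrative names, not part of the benchmark's code.

```python
from typing import List, Tuple

# (model, task success in percent), copied from the leaderboard above.
RESULTS: List[Tuple[str, float]] = [
    ("Wan2.1†", 62.6), ("SVD†", 61.0), ("Cosmos†", 60.3), ("Wan2.1", 58.3),
    ("Hunyuan", 57.7), ("SVD", 57.7), ("SE3DS", 57.5), ("LTX†", 57.5),
    ("NWM", 57.4), ("Pathdreamer", 57.0), ("Wan2.2†", 56.3), ("Cosmos", 55.4),
]


def competition_ranks(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign 1-based ranks by descending score; ties share the lower rank.

    Assumption: the leaderboard uses standard competition ranking ("1224"
    style). This is inferred from the displayed ranks, not documented.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for i, (name, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]  # same score as previous entry: share its rank
        else:
            rank = i + 1          # position in the sorted list, 1-based
        ranked.append((rank, name, score))
    return ranked


if __name__ == "__main__":
    for rank, name, score in competition_ranks(RESULTS):
        print(f"{rank:>2}  {name:<12} {score:.1f}%")
```

Sorting on Task Success alone matches the displayed order; Gen Quality does not appear to act as a tie-breaker, since both tied pairs keep a shared rank.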
This leaderboard showcases performance metrics across different types of AI models in world modeling tasks:
Model Categories
- VLM: Vision-Language Models
- Image Gen.: Image Generation Models
- Video Gen.: Video Generation Models
- Video Gen. Post-Train: Video Generation Models specialized through post-training
Metrics Explained
- Acc. ↑: Accuracy score (higher values indicate better performance)
- Mean Traj. ↓: Mean trajectory error (lower values indicate better performance)
Notes
- † indicates a post-trained (specialized) model
- XXX indicates results pending/unavailable
- – indicates not applicable or not available
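Because † marks post-trained entries, a marked entry can be paired with the unmarked entry of the same name to estimate how Task Success changes with post-training. The sketch below does this arithmetic on the published figures. It assumes that a †-marked entry and the unmarked entry sharing its name refer to the same base model, which the page implies but does not state; `TASK_SUCCESS` and `post_training_gains` are illustrative names.

```python
from typing import Dict

# Task success in percent, copied from the leaderboard.
TASK_SUCCESS: Dict[str, float] = {
    "Wan2.1†": 62.6, "SVD†": 61.0, "Cosmos†": 60.3, "Wan2.1": 58.3,
    "Hunyuan": 57.7, "SVD": 57.7, "SE3DS": 57.5, "LTX†": 57.5,
    "NWM": 57.4, "Pathdreamer": 57.0, "Wan2.2†": 56.3, "Cosmos": 55.4,
}


def post_training_gains(scores: Dict[str, float]) -> Dict[str, float]:
    """Return post-trained minus zero-shot task success, in percentage points.

    Assumption: "X†" is the post-trained version of "X". Entries with no
    unmarked counterpart (here LTX† and Wan2.2†) are skipped.
    """
    gains: Dict[str, float] = {}
    for name, score in scores.items():
        if name.endswith("†"):
            base = name[:-1]
            if base in scores:
                # Round to one decimal to match the leaderboard's precision.
                gains[base] = round(score - scores[base], 1)
    return gains


if __name__ == "__main__":
    for model, gain in post_training_gains(TASK_SUCCESS).items():
        print(f"{model}: {gain:+.1f} points")
```

Only three models appear in both forms, so these differences describe those pairs and should not be read as a general effect of post-training.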
Results represent performance on world modeling evaluation benchmarks and may vary across different evaluation settings.