TLDR: UrbanVerse is a novel data-driven system that transforms real-world city-tour videos into physics-aware, interactive urban simulation environments. It comprises UrbanVerse-100K, a large database of annotated 3D urban assets, and UrbanVerse-Gen, an automated pipeline that extracts scene layouts from videos to build simulations. This system addresses the need for scalable and realistic training environments for embodied AI agents, demonstrating significant improvements in robot navigation generalization and zero-shot sim-to-real transfer compared to existing methods.
Embodied AI agents such as delivery robots and quadrupeds are becoming increasingly common in our cities. These robots need extensive training in diverse, realistic urban environments to navigate complex streets effectively. However, building such high-fidelity simulation environments has been a significant challenge: existing approaches rely either on hand-crafted scenes, which do not scale, or on procedurally generated scenes, which don’t accurately reflect the real world.
A new system called UrbanVerse addresses this challenge by introducing a data-driven approach that converts real-world urban scenes from crowd-sourced city-tour videos into physics-aware, interactive simulation environments. This innovative system promises to enable scalable robot learning in urban spaces with strong real-world generalization capabilities.
What is UrbanVerse?
UrbanVerse is a comprehensive real-to-simulation system built on two main pillars:
- UrbanVerse-100K: This is a vast repository containing over 100,000 annotated urban 3D assets. These assets come with detailed semantic and physical attributes, such as mass and friction, making them suitable for realistic physics simulations. The database also includes diverse ground materials and sky maps to ensure varied and realistic appearances and lighting conditions.
- UrbanVerse-Gen: An automatic pipeline that takes city-tour videos as input and extracts scene layouts, object semantics, ground composition, and sky illumination from them. It then retrieves matching assets from UrbanVerse-100K to construct metric-scale 3D simulations, which run in IsaacSim, NVIDIA's robotics simulation platform.
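To make the retrieval idea concrete, here is a minimal sketch in Python. It is not the UrbanVerse implementation: the article does not say how assets are matched, so this toy version assumes a simple rule (same semantic category, then closest metric height). The `Asset` and `DetectedObject` records, their field names, the `retrieve_asset` function, and all values in `LIBRARY` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Asset:
    """Hypothetical asset record (not the actual UrbanVerse-100K schema)."""
    asset_id: str
    category: str     # semantic label, e.g. "bench"
    height_m: float   # metric height of the 3D model
    mass_kg: float    # physical attribute a simulator could use
    friction: float   # friction coefficient


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical object extracted from a video frame."""
    category: str
    height_m: float   # estimated metric height


def retrieve_asset(obj: DetectedObject, library: List[Asset]) -> Optional[Asset]:
    """Assumed matching rule: same category, then nearest height."""
    candidates = [a for a in library if a.category == obj.category]
    if not candidates:
        return None  # nothing in the library matches this label
    return min(candidates, key=lambda a: abs(a.height_m - obj.height_m))


# Made-up library entries for illustration only.
LIBRARY = [
    Asset("bench_01", "bench", 0.45, 30.0, 0.6),
    Asset("bench_02", "bench", 0.85, 45.0, 0.6),
    Asset("hydrant_01", "fire hydrant", 0.75, 60.0, 0.7),
]

match = retrieve_asset(DetectedObject("bench", 0.9), LIBRARY)
missing = retrieve_asset(DetectedObject("mailbox", 1.1), LIBRARY)
```

Here `match` is the taller of the two bench assets, and `missing` is `None` because the toy library has no mailbox. A real pipeline would likely compare appearance and geometry as well; the point is only that each retrieved asset carries its physical attributes (mass, friction) into the simulation.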
The system has successfully created 160 high-quality scenes from 24 countries, along with a benchmark of 10 artist-designed test scenes. This extensive library allows for robust training and evaluation of AI agents.
Bridging the Gap Between Real and Simulated Worlds
The core idea behind UrbanVerse is to create “digital cousin” scenes that map 2D video footage to a 3D virtual world while preserving real-world layouts, semantics, and physics. This approach combines the rich diversity of real-world data with the interactive capabilities of simulation, so scene generation can scale with the available video while maintaining true-to-life street-level distributions.
Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, and human evaluators rated their realism as comparable to that of manually crafted scenes. In urban navigation tasks, policies trained in UrbanVerse environments achieved higher success rates than previous methods: +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer. One policy even completed a 300-meter real-world mission with only two human interventions.
Impact on Embodied AI
The ability to generate diverse, high-fidelity, and interactive urban simulation environments at scale is crucial for the advancement of embodied AI. UrbanVerse provides a solution to the limitations of existing simulators, which often suffer from a lack of layout realism, limited asset diversity, and insufficient physics annotations.
By leveraging crowd-sourced videos and a sophisticated pipeline, UrbanVerse enables AI agents to learn and generalize across a wide range of real-world urban settings. This means that delivery robots, quadrupeds, and other mobile AI agents can be trained more effectively to navigate the chaotic and unpredictable nature of city streets, ultimately leading to safer and more efficient urban services.
The researchers plan to open-source all assets, scenes, and code of UrbanVerse to further accelerate embodied AI research, making this powerful tool accessible to the wider scientific community.