World Model-based Perception for Visual Legged Locomotion
We present World Model-based Perception (WMP), an end-to-end framework inspired by the role of mental models in animal cognition, which achieves the best traversal performance reported on the A1 robot to date.
Abstract
Legged locomotion over diverse terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods first learn a teacher policy with access to privileged information and then train a student policy to imitate the teacher's behavior from visual input. Despite some progress, this imitation framework prevents the student policy from reaching optimal performance because of the information gap between its inputs and the teacher's. Moreover, the learning process is unnatural: animals intuitively learn to traverse different terrains based on their understanding of the world, without any privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy on top of it. We show that, although trained entirely in simulation, the world model makes accurate predictions of real-world trajectories, thereby providing informative signals to the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness.
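To make the idea concrete, here is a minimal sketch of one plausible way to condition a controller on a learned world model: a recurrent network fuses proprioception with encoded depth images into a latent state, and the policy acts on that latent state together with proprioception, never on privileged simulator state. This is an illustration under our own assumptions (layer sizes, a GRU recurrence, a depth-only camera, a proprioception-prediction head); the paper's actual architecture and training losses may differ.

```python
# Minimal sketch of world-model-based perception; NOT the authors' code.
# All module names, dimensions, and the prediction target are assumptions.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Recurrent model that fuses proprioception and vision into a latent
    state; trained in simulation to predict future observations."""
    def __init__(self, proprio_dim=48, img_feat_dim=128, latent_dim=256):
        super().__init__()
        # Hypothetical encoder for depth images from a forward camera.
        self.vision_enc = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ELU(),
            nn.Flatten(), nn.LazyLinear(img_feat_dim),
        )
        self.rnn = nn.GRUCell(proprio_dim + img_feat_dim, latent_dim)
        # Head used only for the observation-prediction loss during training.
        self.predictor = nn.Linear(latent_dim, proprio_dim)

    def step(self, h, proprio, depth_img):
        feat = self.vision_enc(depth_img)
        return self.rnn(torch.cat([proprio, feat], dim=-1), h)

class Policy(nn.Module):
    """Controller conditioned on the world model's latent state."""
    def __init__(self, latent_dim=256, proprio_dim=48, action_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, h, proprio):
        return self.net(torch.cat([h, proprio], dim=-1))

# One interaction step: the latent state h summarizes past observations.
wm, pi = WorldModel(), Policy()
h = torch.zeros(1, 256)
proprio = torch.randn(1, 48)       # joint angles, velocities, etc.
depth = torch.randn(1, 1, 64, 64)  # placeholder depth image
h = wm.step(h, proprio, depth)
action = pi(h, proprio)            # 12 joint targets for a quadruped
```

Because the world model is trained only to predict observations, it needs no privileged information, which is what lets the same perception module transfer from simulation to the real robot.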
Real-World Results
Overall Performance
Our method, WMP, exhibits more stable control behavior and successfully traverses more challenging terrains, achieving the best traversal performance reported on the A1 robot to date.
Gap (85cm)
Climb (55cm)
Crawl (22cm)
Tilt (29cm)
Stair (16cm)
Slope (27°)
One-Shot Video
Baseline
The baseline policy fails at the more difficult Gap and Climb settings, and cannot handle terrains such as Crawl and Tilt due to the inherent limitations of scandots.
Baseline (Gap, 80cm)
Baseline (Climb, 50cm)
Framework
Simulation Results
Gap (90cm)
Climb (54cm)
Crawl (21cm)
Tilt (28cm)
Stair (21cm)
Slope (30°)
Citation
@article{lai2024worldmodelbasedperceptionvisual,
  title   = {World Model-based Perception for Visual Legged Locomotion},
  author  = {Hang Lai and Jiahang Cao and Jiafeng Xu and Hongtao Wu and Yunfeng Lin and Tao Kong and Yong Yu and Weinan Zhang},
  journal = {arXiv preprint arXiv:2409.16784},
  year    = {2024}
}