Self-supervised AI stack for off-road autonomous vehicles

(Image courtesy of Carnegie Mellon University)
Researchers at Carnegie Mellon University (CMU) in Pittsburgh (USA) have developed an AI system for training off-road autonomous vehicles, writes Nick Flaherty.
Successfully navigating off-road terrain requires a vehicle that can interpret its surroundings in real time. Most current systems require months of data labelling, design and field testing.
To overcome this problem, the TartanDriver team at CMU’s AirLab created a new self-supervised autonomy stack that enables vehicles to safely traverse complex terrains with speed and accuracy without the need for time-consuming human intervention.
This stack can be used for autonomous vehicles designed for mining, search and rescue, wildfire management, exploration, defence and any other uses that might take them into unpredictable, off-road terrain.
“By combining foundation models and the flexibility of self-supervision, the AirLab at CMU is pushing the limits of autonomous driving in challenging terrains,” said Wenshan Wang, a systems scientist with the AirLab and a member of the TartanDriver team. The foundation models can recognise natural features such as tall grasses and trees without the researchers having to label everything, making the data collection process far more efficient.
The team focused on three key principles: self-supervision, multimodality and uncertainty awareness. They equipped the autonomous off-road vehicle with Lidar sensors (to detect objects), cameras, IMUs, shock travel sensors and wheel encoders. The resulting system uses self-supervised navigation: inverse reinforcement learning, fed by the perception data from all the sensors, decides where to go while balancing risk against performance.
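As a rough illustration of that risk/performance trade-off, the sketch below combines per-cell terrain features into a traversability cost map and scores candidate paths against it. The feature names, weights and gains are placeholder assumptions, not the TartanDriver implementation; in an inverse reinforcement learning setup the cost weights would be learned from expert driving demonstrations rather than hand-set as they are here.

```python
import numpy as np

# Hypothetical illustration: fuse per-cell terrain features into a
# traversability cost map, then trade progress against accumulated risk.

def traversability_cost(slope, roughness, semantic_risk, weights):
    """Weighted cost per grid cell; higher means riskier to drive over."""
    return (weights["slope"] * slope
            + weights["roughness"] * roughness
            + weights["semantic"] * semantic_risk)

def score_path(path_cells, cost_map, speed_gain=1.0, risk_gain=5.0):
    """Balance performance (progress along the path) against terrain risk."""
    risk = sum(cost_map[r, c] for r, c in path_cells)
    progress = len(path_cells) * speed_gain
    return progress - risk_gain * risk

# Example: score three candidate paths over a random 10x10 cost map.
rng = np.random.default_rng(0)
cost_map = traversability_cost(
    slope=rng.random((10, 10)),
    roughness=rng.random((10, 10)),
    semantic_risk=rng.random((10, 10)),
    weights={"slope": 0.4, "roughness": 0.3, "semantic": 0.3},
)
candidates = [[(i, i) for i in range(10)],   # diagonal
              [(0, j) for j in range(10)],   # along the first row
              [(i, 0) for i in range(10)]]   # along the first column
best = max(candidates, key=lambda p: score_path(p, cost_map))
print("best candidate path:", best)
```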

“Since there is no prior map or GPS, our system relies on SLAM [Simultaneous Localisation and Mapping] to track position and build a local understanding of the environment in real time,” said Micah Nye, a TartanDriver team member. “This enables consistent perception across terrains that have different visual environments.”
The 3D multimodal map is built from voxels. All the 3D information coming from the different perception sensors and components is processed and stored in voxels. The voxel map's data structure is composed of 2D grid cells (XY), where each cell contains a vertical stack of voxels (Z).
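A minimal Python sketch of that layout is shown below, assuming a dictionary-backed grid and arbitrary cell and voxel resolutions; none of this is the team's actual code.

```python
from collections import defaultdict
import numpy as np

# Minimal sketch of the described layout: a 2D grid of cells indexed by
# (x, y), where each cell holds a vertical stack of voxels indexed by z.
# The resolutions and dict-based storage are illustrative assumptions.

class VoxelMap:
    def __init__(self, cell_size=0.5, voxel_height=0.2):
        self.cell_size = cell_size        # XY resolution of a grid cell (m)
        self.voxel_height = voxel_height  # Z resolution of a voxel (m)
        # cells[(ix, iy)][iz] -> list of 3D points falling in that voxel
        self.cells = defaultdict(lambda: defaultdict(list))

    def insert_point(self, p):
        """Route a single 3D point into its (x, y) cell and z voxel."""
        ix = int(np.floor(p[0] / self.cell_size))
        iy = int(np.floor(p[1] / self.cell_size))
        iz = int(np.floor(p[2] / self.voxel_height))
        self.cells[(ix, iy)][iz].append(np.asarray(p))

    def column(self, ix, iy):
        """All voxels (keyed by z index) stored in one 2D grid cell."""
        return self.cells.get((ix, iy), {})
```

Keeping a full vertical stack of voxels per cell, rather than a single 2.5D height value, lets the map represent overhanging structure such as tree canopies above drivable ground.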
Each Lidar point is processed and stored in its corresponding voxel. Ground elevation is estimated from the lowest reliable voxel in each 2D cell, and a ground elevation map, height map and slope map are generated. Using all the voxel point information in a cell, a singular value decomposition (SVD) map is also generated.
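The per-cell processing could look roughly like the following sketch, which assumes the voxel-column structure above and uses a hypothetical min_points threshold to stand in for the "reliable" test; it is not the CMU code.

```python
import numpy as np

# Hedged sketch: estimate ground elevation from the lowest sufficiently
# populated voxel, then run an SVD on the cell's points to get a surface
# normal and slope. The min_points threshold is an assumed parameter.

def cell_ground_and_slope(voxel_column, min_points=5):
    """voxel_column: dict {z_index: list of 3D points} for one 2D cell."""
    # Lowest voxel with enough points is treated as the reliable ground return.
    ground_z = None
    for z in sorted(voxel_column):
        if len(voxel_column[z]) >= min_points:
            ground_z = float(np.mean([p[2] for p in voxel_column[z]]))
            break

    # SVD over all points in the cell: the right singular vector with the
    # smallest singular value approximates the local surface normal.
    pts = np.vstack([p for vox in voxel_column.values() for p in vox])
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    normal = vt[-1]
    slope_deg = np.degrees(np.arccos(abs(normal[2])))  # angle from vertical

    return ground_z, slope_deg
```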
After an image is processed by the trained visual model, the corresponding Lidar point cloud and 3D voxel semantic information are projected onto the predicted image to generate a semantic map, which is used to drive the vehicle.
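The projection step can be illustrated with standard pinhole-camera geometry. The calibration matrices K, R and t and the terrain class IDs below are assumptions for the sake of the example, not values from the CMU system.

```python
import numpy as np

# Illustrative sketch: project lidar points into a semantic segmentation
# output so each 3D point (and its voxel) inherits a terrain class. K is the
# camera intrinsic matrix; R, t are assumed lidar-to-camera extrinsics.

def label_points_from_segmentation(points_lidar, seg_image, K, R, t):
    """Return a semantic class per 3D point, or -1 if it falls off-image."""
    h, w = seg_image.shape
    pts_cam = (R @ points_lidar.T + t.reshape(3, 1)).T    # lidar -> camera frame
    labels = np.full(len(points_lidar), -1, dtype=int)

    in_front = pts_cam[:, 2] > 0.1                        # keep points ahead of camera
    uvw = (K @ pts_cam[in_front].T).T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)

    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = seg_image[v[valid], u[valid]]           # e.g. 0=grass, 1=tree, 2=rock
    return labels
```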
The team implemented their system on an all-terrain vehicle, and tested its abilities in several complex terrains that included grassy fields, rocky paths and varying inclines. The vehicle was able to successfully navigate the environments and adjust to changes without human correction. Now the team is adding thermal cameras to the payload to extend the dataset.
“Thermal cameras detect heat instead of light, allowing us to see through smoke and in other visually degraded conditions,” said Yifei Liu of the TartanDriver team.
The team has also begun testing the autonomy stack on robot dogs and urban motorised wheelchairs.