Author: Selventhiran Rengaraj is an Associate Technical Project Manager in the Mobility & Transportation Business Unit at MulticoreWare. He has hands-on experience in developing robotics stacks for ground and underwater robots and is working on cutting-edge AI and ADAS perception stack optimization on leading automotive semiconductor platforms.
Introduction
In our previous blog (BEV-A Primer to the paradigm shift in Autonomous Robotics), we delved into the captivating world of Bird’s Eye View (BEV) and its remarkable applications in autonomy. Now, we will take a closer look at the fascinating domain of BEV transformations and explore the various methods used to generate this magical viewpoint.
Perspective View to Bird’s Eye View Transformation
The ability to transform a Perspective View (PV) into a Bird’s Eye View (BEV) is nothing short of a technological marvel. The key approaches that researchers and industry practitioners employ to perform this PV-to-BEV transformation are as follows:
1. Homographic Method
In recent decades, homography-based methods in computer vision, particularly Inverse Perspective Mapping (IPM), have facilitated Bird’s Eye View (BEV) interpretation by exploiting the geometric projection relationship between the Perspective View (PV) and the ground plane. IPM efficiently connects PV and BEV, enabling the projection of images and features for perception tasks. Its simplicity, relying on matrix multiplications rather than complex machine learning, makes it a reliable solution in computer vision.
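To make this concrete, below is a minimal sketch of an IPM-style warp using OpenCV, assuming the homography is derived from four known ground-plane correspondences. The point coordinates, input file name, and output resolution are hypothetical placeholders rather than values from any specific setup.

```python
# Minimal IPM sketch: warp a perspective camera image onto the flat ground
# plane with a single 3x3 homography. The four point correspondences, the
# input file name, and the output size are illustrative placeholders.
import cv2
import numpy as np

# Pixel coordinates of four road points in the perspective image (hypothetical),
# and where those same points should land in the BEV image.
src_pts = np.float32([[520, 460], [760, 460], [1180, 720], [100, 720]])
dst_pts = np.float32([[300, 0], [500, 0], [500, 600], [300, 600]])

# Homography mapping the ground plane from PV to BEV.
H = cv2.getPerspectiveTransform(src_pts, dst_pts)

image = cv2.imread("front_camera.png")           # hypothetical input frame
bev = cv2.warpPerspective(image, H, (800, 600))  # top-down view of the road
```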
However, there are certain limitations such as:
- IPM-based methods assume that all points lie on a flat ground plane, which simplifies the PV-BEV transformation but limits their use in 3D scenarios due to the absence of depth information.
- The homographic method depends on the intrinsic and extrinsic parameters of the camera. Any modifications to the camera calibration will significantly impact the PV-BEV transformation.
- IPM works well only when the camera remains fixed relative to the road and the road surface is flat and obstacle-free.
2. Depth / Point Cloud Based Approach
To address the limitations of homographic methods, depth information is required to lift 2D pixels and features into 3D space. Depth-based PV-to-BEV methods are therefore built on an explicit 3D representation and can be categorized into point-based or voxel-based methods, depending on the representation used.
- Point-based methods directly use depth estimation to convert pixels into point clouds scattered in continuous 3D space. These methods are straightforward and can easily integrate mature techniques from monocular depth estimation and LiDAR-based 3D detection.
- Voxel-based methods use a uniform depth vector or an explicitly predicted depth distribution to lift 2D features into a 3D voxel space and perform BEV-based perception, as sketched after this list. This approach has been receiving significant attention recently due to its computational efficiency and flexibility.
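To illustrate the lifting step, here is a minimal NumPy sketch of a voxel-style lift: per-pixel features are back-projected along their camera rays using a predicted depth map and accumulated into a flat BEV grid. The intrinsics, feature shapes, and grid extents are illustrative assumptions, not taken from any specific published method.

```python
# Minimal depth-based lift sketch: back-project per-pixel image features to 3D
# with a predicted depth map, then splat them onto a BEV grid.
# All shapes, intrinsics, and grid extents below are illustrative assumptions.
import numpy as np

H, W, C = 64, 96, 32                           # feature map size and channels
feats = np.random.rand(H, W, C)                # image features from a backbone
depth = np.random.uniform(2.0, 50.0, (H, W))   # predicted metric depth per pixel

fx = fy = 80.0                                 # hypothetical intrinsics (feature-map scale)
cx, cy = W / 2.0, H / 2.0

# Back-project every pixel to a 3D point in the camera frame.
us, vs = np.meshgrid(np.arange(W), np.arange(H))
x = (us - cx) / fx * depth                     # lateral position
z = depth                                      # forward distance (height is dropped in BEV)

# Splat features into a 200 x 200 BEV grid covering x in [-25, 25] m, z in [0, 50] m.
bev = np.zeros((200, 200, C))
col = np.clip(((x + 25.0) / 50.0 * 200).astype(int), 0, 199)
row = np.clip((z / 50.0 * 200).astype(int), 0, 199)
np.add.at(bev, (row.ravel(), col.ravel()), feats.reshape(-1, C))
```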
3. NN Transformers Based Approaches
NN-based methods, primarily leveraging Transformer architectures, are on the rise for their capacity to convert Perspective View to Bird’s Eye View. Their growing popularity is due to the impressive performance and robust relationship modeling of Transformer-based view-projectors. These methods employ an encoder-decoder structure based on transformers to translate image features from multiple cameras into a Bird’s Eye View, making use of contextual information within individual images and inter-image relationships from different views.
A typical NN-based model for BEV works as follows (a minimal skeleton is sketched after this list):
- First, it encodes the input data with a backbone network and performs the BEV projection using a transformer.
- Then, the BEV features are fused temporally and spatially.
- Finally, the BEV features are decoded by different heads, such as segmentation and object detection, depending on the application.
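The skeleton below sketches that three-stage flow in PyTorch: a small backbone encodes the camera images, learnable BEV queries cross-attend to the flattened image features to perform the view projection, and a lightweight head decodes the BEV features. The module sizes and names are illustrative assumptions rather than any specific published architecture, and temporal fusion is omitted for brevity.

```python
# Illustrative skeleton of a transformer-based PV-to-BEV model
# (temporal fusion of BEV features is omitted for brevity).
import torch
import torch.nn as nn

class SimpleBEVModel(nn.Module):
    def __init__(self, num_classes=4, bev_h=50, bev_w=50, dim=256):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # 1) Backbone: encodes each camera image into a feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 2) View projection: learnable BEV queries cross-attend to image features.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # 3) Task head: per-cell semantic segmentation of the BEV grid.
        self.seg_head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, images):                       # images: (B, N_cams, 3, H, W)
        b = images.shape[0]
        feats = self.backbone(images.flatten(0, 1))  # (B*N_cams, dim, h, w)
        feats = feats.flatten(2).transpose(1, 2)     # (B*N_cams, h*w, dim)
        feats = feats.reshape(b, -1, feats.shape[-1])  # gather tokens from all cameras
        queries = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.cross_attn(queries, feats, feats)         # (B, bev_h*bev_w, dim)
        bev = bev.transpose(1, 2).reshape(b, -1, self.bev_h, self.bev_w)
        return self.seg_head(bev)                    # (B, num_classes, bev_h, bev_w)
```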
Amid the prevailing trend of building large transformer models in Natural Language Processing (NLP), researchers in autonomous driving are exploring the effectiveness of transformers in generating robust representations for tasks such as tracking and prediction using BEV perception.
Overall, the transformer- and depth-based approaches are gaining prominence in the computer vision community.
4. Semantic Occupancy Grid – BEV
In our previous blog about BEV, we highlighted the similarity between Occupancy Grids and BEV. Both techniques divide the environment into grid cells, with each cell representing the occupancy of a specific location in the environment.
In a Semantic Occupancy Grid, each cell is assigned a specific class, such as road, terrain, sidewalk, or traffic signal, which is especially useful in perception applications. Unlike a binary occupancy representation, the Semantic Occupancy Grid incorporates semantic information, which can be further refined through training with advanced deep neural networks. This enables object detection and the generation of Bird’s Eye View maps.
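As a small illustration, a Semantic Occupancy Grid can be represented as a 2D array of class labels over the ground plane. The class set, resolution, and extents below are hypothetical choices, not a fixed standard.

```python
# Minimal Semantic Occupancy Grid sketch: a 2D grid of class IDs over the
# ground plane. The class set, resolution, and extents are hypothetical.
import numpy as np

CLASSES = {0: "unknown", 1: "road", 2: "sidewalk", 3: "terrain", 4: "traffic_signal"}

resolution = 0.25                                  # metres per cell
grid_size = 200                                    # 200 x 200 cells -> 50 m x 50 m area
grid = np.zeros((grid_size, grid_size), dtype=np.uint8)  # every cell starts as "unknown"

def world_to_cell(x, y, origin=(-25.0, -25.0)):
    """Map a ground-plane point (metres, vehicle frame) to a grid cell index."""
    col = int((x - origin[0]) / resolution)
    row = int((y - origin[1]) / resolution)
    return row, col

# Example: mark a cell detected as "road" 5 m ahead and 2 m to the right.
r, c = world_to_cell(2.0, 5.0)
grid[r, c] = 1
```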
Occupancy Grids, despite being decades old, remain a highly relevant and potent representation. They seamlessly integrate with modern methods such as BEVs and hold significant potential for enhancing the navigation of autonomous vehicles.
Limitations of BEV Perception
Though BEV-based perception finds numerous applications in the field of autonomy and robotics, there are a few limitations, such as:
- Occlusion: Overhead views can be obstructed by tall objects or overhangs, which may obscure important details and events happening at ground level.
- Vertical Perception: BEV perception often lacks critical vertical information, like building and vehicle heights, limiting its understanding of the 3D environment.
- Limited Environment Conditions: Camera-based BEV perception may be less reliable in adverse weather conditions such as heavy rain, fog, or snow, leading to reduced visibility and accuracy in perception.
McW Expertise in BEV & SOTA Vision Techniques
- We have optimized SOTA (State-of-the-Art) BEV-based Vision Transformers, namely BEV-Former, BEV-SegFormer, and Lift-Splat-Shoot (LSS), for various automotive-grade AI accelerators.
- McW possesses expertise in quantizing camera & LiDAR fusion-based NN models such as DeepFusion, BEV-Det, and DeepInteraction from FP32 to INT8 without compromising accuracy.
- Our expertise extends to building algorithms with BEV projections tailored for Micro Mobility, two-wheeled and four-wheeled vehicles.
- We have an in-depth understanding of and proficiency in 80+ AI models, and we specialize in optimizing them for low-power edge devices, DSPs, NSPs, etc.
Conclusion
At MulticoreWare, we possess the capability to enhance & accelerate your perception solutions by developing custom Transformer-based Sensor Fusion models designed for your unique use-cases.
For more information, please contact us at info@multicorewareinc.com