7 Big Breakthroughs From Argo Research at CVPR 2022

The 2022 Conference on Computer Vision and Pattern Recognition (CVPR 2022) is nearly here. Thousands of computer scientists, software engineers, and researchers from around the globe will gather in New Orleans to review and discuss their latest work in artificial intelligence and related computing, research that is advancing the way we work, live, play, create, and move about the world.

Scientists affiliated with Argo AI, a leading autonomy products and services company, are among the many people presenting new research at CVPR 2022. From using lidar to predict the future movements of objects to turning 2D drone photography into 3D maps, the papers from the Carnegie Mellon University-Argo Center for Autonomous Vehicle Research, an ongoing collaboration between the company and the esteemed robotics institute, cover a wide range of advancements that could benefit autonomous driving and other important efforts, including search and rescue.

Below, you can find the links to, and read brief descriptions of, all the research from Argo being presented at CVPR 2022:

Predict the Future With Lidar?

In this paper, titled “Forecasting from LiDAR via Future Object Detection,” researchers describe a new method for using lidar data to forecast multiple possible future paths of moving objects. Importantly, the method, called FutureDet, is “end-to-end,” not modular, which means that from lidar data alone, the program can predict with high accuracy where objects are likely to be in the future and provide enough information for an autonomous vehicle to plan its motion in response. It also presents the future paths of moving objects in the same manner as the lidar sensor detects objects in the present, which means existing lidar detection toolboxes can be repurposed for future prediction. Read our summary on Ground Truth and the full paper here.
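To make the “forecasting as detection” idea concrete, here is a minimal, hypothetical sketch: a single detection head that outputs boxes for the current frame and several future timesteps from shared lidar features. The names, shapes, and layers are assumptions for illustration, not the FutureDet implementation.

```python
import torch
import torch.nn as nn

class FutureDetectionHead(nn.Module):
    """Toy detection head that predicts object boxes for the current frame
    and for several future timesteps from a shared lidar feature map.
    Illustrative only; names and shapes are assumptions, not the paper's code."""

    def __init__(self, feat_dim: int = 256, num_future_steps: int = 3, box_dim: int = 7):
        super().__init__()
        self.num_steps = 1 + num_future_steps  # present frame + future frames
        # One box-regression branch per timestep, all fed by the same features.
        self.box_heads = nn.ModuleList(
            nn.Conv2d(feat_dim, box_dim, kernel_size=1) for _ in range(self.num_steps)
        )
        self.score_head = nn.Conv2d(feat_dim, self.num_steps, kernel_size=1)

    def forward(self, bev_features: torch.Tensor):
        # bev_features: (B, feat_dim, H, W) bird's-eye-view lidar features
        boxes = torch.stack([head(bev_features) for head in self.box_heads], dim=1)
        scores = self.score_head(bev_features)  # (B, T+1, H, W) objectness per timestep
        return boxes, scores                    # boxes: (B, T+1, box_dim, H, W)

# The same decoding used for present-frame detections can then be reused
# per timestep to read out the forecasted boxes.
feats = torch.randn(2, 256, 128, 128)
boxes, scores = FutureDetectionHead()(feats)
```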

A.I. Training

In “Long-Tailed Recognition via Weight Balancing,” the authors tackle a problem that appears across the entire field of artificial intelligence and scientific observation more generally — the “long-tailed recognition problem.” Essentially, this problem comes up when a computer program, such as a neural network, is created to study lots of different types of objects (or other data) and recognize them accurately going forward.

For self-driving cars, these objects are usually people, animals, and other actors on the roads — pedestrians, bicyclists, other cars, motorcycles, skateboarders, dogs, squirrels. But what about objects that occur with less frequency on the road, such as hoverboards, unicycles, land skiers, or parade floats? Because these objects are rarer, there is less available data on which to train an artificial intelligence program to recognize them, and consequently, the program is typically less accurate in identifying these objects.

However, these researchers propose a way to “balance” how an AI learns about different objects so it is not significantly worse at recognizing rare ones, while still maintaining its high accuracy on more common objects. By using two overlapping methods that give extra “weight” to rare objects, but not so much that they overwhelm recognition of common ones, the researchers achieved an overall identification accuracy of 53.35%, which they note is “significantly higher” than the 38.38% of an unweighted baseline and better than various other methods whose accuracy falls in the 40s and low 50s.
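As a rough illustration of the weighting intuition, here is one common way to give rare classes more influence on a training loss. This is a sketch, not the paper’s own recipe (which balances the classifier’s weight norms, for example through weight decay and a max-norm constraint), but it conveys the same idea of upweighting the long tail.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(class_counts, beta: float = 0.999):
    """Per-class weights that grow as a class gets rarer (the "effective number
    of samples" heuristic). Illustrative of reweighting in general; not the
    CVPR paper's exact method."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights * len(counts) / weights.sum()  # normalize so weights average to 1

# Hypothetical example: 10,000 images of cars, 500 of bicycles, 20 of unicycles.
weights = class_balanced_weights([10_000, 500, 20])
logits = torch.randn(8, 3)                  # a batch of 8 predictions over 3 classes
labels = torch.randint(0, 3, (8,))
loss = F.cross_entropy(logits, labels, weight=weights)  # rare classes count more
```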

Purging Old Data

Argo creates richly detailed, high-resolution 3D maps of the cities and other areas in which its AVs drive. Even though the cars are equipped with a whole constellation of sensors for seeing the world in real time, the maps act as an important resource for the vehicles: they help the cars figure out exactly where they are in the world, and they can be accessed even without an internet connection.

These maps are updated before every AV drive to account for new changes to roadway infrastructure, including everything from temporary construction sites to more fixed, longer-term changes like the installation of new protected bike lanes, barriers, bike parking, and pedestrian areas. While these frequent updates help keep the maps accurate and comprehensive, AV companies that rely on maps also need to determine which map data has become irrelevant and automatically delete it to save space in the AV computers’ memory.

Fortunately, researchers affiliated with Argo and other institutions have an idea. As they describe in their paper, “Long-term Visual Map Sparsification with Heterogeneous GNN,” their method involves turning the existing AV map into a graph (in computer science, a collection of data organized around vertices and edges) and running it through a graph neural network (GNN), a category of artificial intelligence algorithm optimized for this kind of data. The results? Their approach, built on graph attention (GATConv) layers, performed as well as or better than other algorithms designed to drop old map data while retaining key points of interest.
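For a sense of how a graph neural network can be used this way, here is a minimal sketch (assuming PyTorch Geometric is installed) that scores each map point for retention with graph attention layers and keeps only the top-scoring points. The graph construction, heterogeneous node types, and training objective from the paper are omitted; all names and sizes are assumptions.

```python
import torch
from torch_geometric.nn import GATConv

class MapPointScorer(torch.nn.Module):
    """Toy GNN that assigns each 3D map point a "keep this point" score."""

    def __init__(self, in_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim)
        self.conv2 = GATConv(hidden_dim, 1)  # one retention score per node

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index).squeeze(-1)

# Toy graph: 100 map points with 16-dim features, randomly connected.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
scores = MapPointScorer()(x, edge_index)
keep = scores.topk(k=30).indices  # retain only the 30 most useful points
```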

Turning Large Images Into 3D Maps

This research is all about NeRFs. No, not that kind — in this case, NeRF is an acronym that stands for “neural radiance fields,” a type of artificial intelligence algorithm that can analyze 2D imagery and turn it into a fully fleshed out 3D model that you can virtually walk or fly through.

NeRFs have been largely limited to indoor spaces or relatively small areas of up to about 5,000 square feet (463 square meters), according to the researchers behind this work. In their paper, “Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs,” the researchers describe a new method of scaling NeRFs built from still photography up to areas of roughly 14 million square feet (1.3 million square meters). The researchers note that their algorithm could prove tremendously useful in situations where a 3D map needs to be created quickly in a challenging environment, say a search and rescue on a mountainside, using instruments capable only of capturing 2D imagery, like a typical drone video camera.

The researchers say that by restructuring their NeRF code to reduce the amount of data the algorithm has to process, while still preserving spatial awareness of the area being mapped, they trained the algorithm roughly three times faster while preserving high-resolution imagery.
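A toy sketch of the partitioning idea, not the Mega-NeRF code itself: split the scene into a grid of cells, give each cell its own small network, and route every 3D sample point to the network responsible for it, so no single model has to memorize the whole area. The grid size, layer widths, and routing here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartitionedNeRF(nn.Module):
    """Toy spatially partitioned NeRF: each grid cell gets its own small MLP."""

    def __init__(self, grid=(2, 2), hidden=64):
        super().__init__()
        self.grid = grid
        self.cells = nn.ModuleList(
            nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 4))
            for _ in range(grid[0] * grid[1])
        )  # each cell's MLP predicts RGB + density for points inside it

    def forward(self, points):  # points: (N, 3) coordinates normalized to [0, 1]^3
        ix = (points[:, 0] * self.grid[0]).clamp(max=self.grid[0] - 1).long()
        iy = (points[:, 1] * self.grid[1]).clamp(max=self.grid[1] - 1).long()
        cell_id = ix * self.grid[1] + iy
        out = torch.zeros(points.shape[0], 4)
        for c, net in enumerate(self.cells):  # route each point to its cell's MLP
            mask = cell_id == c
            if mask.any():
                out[mask] = net(points[mask])
        return out

out = PartitionedNeRF()(torch.rand(1024, 3))
```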

Estimating Paths of Moving Objects Smoothly

Using lidar data gathered from AVs driving on public roads and uploaded to Argoverse, Argo’s open source library, a group of researchers has created a new algorithm that estimates the trajectories of all the moving objects in a lidar scene with high accuracy, directly from the lidar points, while conserving valuable computer memory.

Their method produces more detailed, smoother, and more continuous trajectory estimates in 3D than competing methods, including those built around 2D data points, allowing moving objects to be tracked by lidar sensors more accurately.
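To illustrate what a “neural prior” can mean in this setting, here is a hypothetical sketch: a small MLP that maps time to 3D position is fit to one object’s noisy observations, and the network’s own smoothness yields a continuous trajectory. This is an assumption-laden toy, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class TrajectoryPrior(nn.Module):
    """Small MLP mapping a timestamp to a 3D position; its smoothness acts as
    an implicit regularizer on the estimated trajectory."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # (x, y, z) at time t
        )

    def forward(self, t):  # t: (N, 1) timestamps
        return self.net(t)

# Noisy observed positions of one tracked object at 20 timestamps (synthetic).
t_obs = torch.linspace(0, 1, 20).unsqueeze(1)
xyz_obs = torch.cat([t_obs * 5, torch.sin(t_obs * 6), torch.zeros_like(t_obs)], dim=1)
xyz_obs += 0.05 * torch.randn_like(xyz_obs)

model = TrajectoryPrior()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):  # fit the MLP to the noisy observations
    opt.zero_grad()
    loss = ((model(t_obs) - xyz_obs) ** 2).mean()
    loss.backward()
    opt.step()

# Query the fitted network densely to read out a smooth, continuous trajectory.
smooth_path = model(torch.linspace(0, 1, 200).unsqueeze(1))
```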

Furthermore, as they detail in their paper, “Neural Prior for Trajectory Estimation,” when they combine their method with a previous algorithm called PAUL (which stands for “Procrustean Autoencoder for Unsupervised Lifting”), the combination achieves the highest trajectory-estimation accuracy of any algorithm they reviewed. Their method also yields an impressive, if alarming, reconstruction of the movements and trajectories of a human face (seen above).

Creating More Accurate 3D Scenes

Back to NeRFs — neural radiance fields, algorithms that can transform 2D imagery into 3D scenes on a computer.

In their paper, “Depth-supervised NeRF: Fewer Views and Faster Training for Free,” Argo-affiliated researchers and their colleagues rely on “structure-from-motion.” Essentially, this means using an artificial intelligence algorithm that can analyze the different angles and locations of the camera(s) that captured 2D imagery, and estimate exactly where those cameras were in relation to the space they photographed or recorded. From these estimates, the algorithm can measure distances and determine sizes of the photographed 3D objects.

However, it’s not always possible to estimate the distance, or “depth,” of every object from the camera in each photo or 2D video frame, so the researchers developed a workaround: relying on “keypoints,” established, accurate depth values in the image, which can be used to train the algorithm rapidly and to produce accurate depth and size estimates for the other 3D objects. Importantly, such depth information can also be provided directly by a perception sensor, such as lidar.
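Here is a minimal sketch of what depth supervision can look like in a NeRF training loss, assuming a batch of rendered rays: the usual photometric term plus a penalty wherever a sparse keypoint (or lidar return) provides a known depth. The function name, data layout, and weighting are illustrative assumptions, not the paper’s code.

```python
import torch

def depth_supervised_loss(rgb_pred, rgb_gt, depth_pred, depth_keypoints, lambda_depth=0.1):
    """Toy combined loss: the standard NeRF photometric term plus an extra
    penalty at pixels where a sparse depth value is known, e.g. from
    structure-from-motion keypoints or a lidar return."""
    photometric = ((rgb_pred - rgb_gt) ** 2).mean()

    # depth_keypoints: dict {ray_index: known_depth} for the sparse keypoints
    idx = torch.tensor(list(depth_keypoints.keys()))
    target = torch.tensor(list(depth_keypoints.values()))
    depth_term = ((depth_pred[idx] - target) ** 2).mean()

    return photometric + lambda_depth * depth_term

# Example with 1,024 rendered rays, three of which have a known depth.
loss = depth_supervised_loss(
    rgb_pred=torch.rand(1024, 3), rgb_gt=torch.rand(1024, 3),
    depth_pred=torch.rand(1024) * 10,
    depth_keypoints={12: 4.2, 87: 6.9, 500: 2.5},
)
```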

After testing on 2D sample imagery, they were able to train their algorithm two to three times as fast as previous methods, with more accurate results from less data.

Can An AV Follow Any Object?

As previously covered on Ground Truth, one of the hardest challenges in developing AVs and safe autonomous driving in the real world is dealing with the “open world.” While a self-driving computer system can be trained to recognize some finite and continuously expanding list of different object categories — bicycles, other cars, pedestrians, animals, street signs, things that would be considered part of its known or “closed world” — there is no known way to prepare it for everything it may possibly encounter on the roads.

So, what happens when an AV sees something it doesn’t recognize?

In their paper “Opening up Open World Tracking,” Argo-affiliated researchers and colleagues propose a new method by which AVs can identify never-seen-before objects and separate them from the background (a non-trivial task), track their movements across a scene even if they don’t know exactly what the object is, and maintain object permanence even if the object re-orients itself or moves around and behind other objects.

They created an algorithm called the Open World Tracking Baseline and trained it on 3,000 videos in which 800 objects, including rare ones, were labeled by human observers, along with 87,358 tracks of those known objects (where they moved throughout the scene) and 20,522 tracks of unknown objects. The result: their algorithm tracked both known and unknown objects more accurately than many other algorithms and, significantly, tracked many more objects than those algorithms did.
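To show what class-agnostic tracking means at its simplest, here is a toy sketch that associates detections between frames purely by box overlap, so an object can be followed even when its category is unknown. This greedy matcher is a simplification for illustration, not the Open World Tracking Baseline itself.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(prev_boxes: List[Box], curr_boxes: List[Box], thresh: float = 0.3):
    """Greedily match each current detection to the previous-frame box it
    overlaps most, regardless of object class."""
    matches, used = [], set()
    for j, cur in enumerate(curr_boxes):
        best_i, best_iou = -1, thresh
        for i, prev in enumerate(prev_boxes):
            if i not in used and iou(prev, cur) > best_iou:
                best_i, best_iou = i, iou(prev, cur)
        if best_i >= 0:
            matches.append((best_i, j))
            used.add(best_i)
    return matches  # list of (previous_track_index, current_detection_index)

# One previous track, two current detections: only the overlapping one matches.
print(associate([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]))
```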
