Imagine you have two pictures of the same scene taken from different angles. Most of the objects in both pictures are the same; you simply see them from different viewpoints. In computer vision, objects are assumed to have certain features like edges, corners, and blobs, and matching these features between images is critical for many applications. But what would it take to match features between two pictures?
Finding correspondence between images is the prerequisite for estimating 3D structure and camera poses in computer vision tasks such as simultaneous localization and mapping (SLAM) and structure-from-motion (SfM). This is done by matching local features, and it’s tricky to achieve due to the changes in lighting conditions, occlusion, blur, etc.
Traditionally, finding correspondences follows a multi-step pipeline. First, a front-end extracts visual features from the images. These features are then matched, with matching typically modeled as a linear assignment problem and solved with simple heuristics such as nearest-neighbor search. Finally, a back-end applies bundle adjustment and pose estimation on top of the matched features.
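As a toy illustration of this classical matching step, the sketch below pairs two sets of descriptors by mutual nearest neighbor, a common heuristic stand-in for the linear assignment step (the descriptors here are hypothetical NumPy arrays, not the output of any specific detector):

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """Match descriptors by mutual nearest neighbor, a simple stand-in
    for the heuristic matchers used in classical pipelines."""
    # Pairwise Euclidean distances between the two descriptor sets.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = dists.argmin(axis=1)  # best match in B for each keypoint in A
    nn_ba = dists.argmin(axis=0)  # best match in A for each keypoint in B
    # Keep only the pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

Heuristics like this consider each keypoint in isolation, which is exactly the limitation SuperGlue targets.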
As in all other domains, deep neural networks have played a crucial role in recent years in feature matching problems. They have been used to learn better sparse detectors and local descriptors from data using convolutional neural networks (CNNs).
However, they were usually a component in the feature matching problem, not an end-to-end solution. What if a single neural network could perform context aggregation, matching, and filtering in a single architecture? Time to introduce the SuperGlue.
SuperGlue approaches feature matching in a different way. It learns the matching process from pre-existing local features using a graph neural network. This replaces existing approaches in which task-agnostic features are learned first and then matched with heuristics and simple methods. Being an end-to-end approach gives SuperGlue a strong advantage over existing methods. SuperGlue is a learnable middle-end that can be used to improve existing pipelines.
So how does SuperGlue achieve this? It peeks through a new window and views feature matching as a partial assignment between two sets of local features. Instead of solving a linear assignment problem, it treats matching as an optimal transport problem, and a graph neural network (GNN) predicts the cost function of this transport optimization.
We all know how transformers achieved massive success in natural language processing and, recently, computer vision tasks. SuperGlue utilizes a transformer to leverage both spatial relationships of key points and their visual appearances.
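The two ingredients, keypoint positions and visual descriptors, can be fused before attention layers aggregate context across keypoints. The sketch below illustrates that idea in NumPy with placeholder random weights; the function names, the single-layer positional encoder, and the single attention head are simplifications for illustration, not the real architecture:

```python
import numpy as np

def encode_keypoints(positions, descriptors, w, b):
    """Fuse keypoint positions into the visual descriptors
    (roughly: x_i = d_i + encoder(p_i)). `w` and `b` are placeholder
    weights, not learned parameters of the real network."""
    return descriptors + np.tanh(positions @ w + b)

def self_attention(x):
    """Single-head scaled dot-product attention over all keypoints: the
    basic operation that attention-based layers repeat to aggregate context."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x
```

After such layers, every keypoint's representation depends on where it is, what it looks like, and which other keypoints are present in the image.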
SuperGlue is trained end-to-end on image pairs. Priors for pose estimation are learned from a large annotated dataset; therefore, SuperGlue develops an understanding of the underlying 3D scene.
SuperGlue can be applied to many problems where high-quality feature correspondences are required in multiple-view geometry. It runs in real time on commodity hardware and works with both classical and learned features. You can find more information about SuperGlue at the links below.
Check out the paper, project, and code. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.