Category Archives: Computer Vision

An open-source implementation of visual SLAM with a very nice related-work section

R. Mur-Artal and J. D. Tardós, ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras, IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, DOI: 10.1109/TRO.2017.2705103.

We present ORB-SLAM2, a complete simultaneous localization and mapping (SLAM) system for monocular, stereo and RGB-D cameras, including map reuse, loop closing, and relocalization capabilities. The system works in real time on standard central processing units in a wide variety of environments from small hand-held indoors sequences, to drones flying in industrial environments and cars driving around a city. Our back-end, based on bundle adjustment with monocular and stereo observations, allows for accurate trajectory estimation with metric scale. Our system includes a lightweight localization mode that leverages visual odometry tracks for unmapped regions and matches with map points that allow for zero-drift localization. The evaluation on 29 popular public sequences shows that our method achieves state-of-the-art accuracy, being in most cases the most accurate SLAM solution. We publish the source code, not only for the benefit of the SLAM community, but with the aim of being an out-of-the-box SLAM solution for researchers in other fields.

Survey on visual attention in 3D for robotics

Ekaterina Potapova, Michael Zillich, and Markus Vincze, Survey of recent advances in 3D visual attention for robotics, The International Journal of Robotics Research, Vol 36, Issue 11, pp. 1159 – 1176, DOI: 10.1177/0278364917726587.

3D visual attention plays an important role in both human and robotics perception that yet has to be explored in full detail. However, the majority of computer vision and robotics methods are concerned only with 2D visual attention. This survey presents findings and approaches that cover 3D visual attention in both human and robot vision, summarizing the last 30 years of research and also looking beyond computational methods. First, we present work in such fields as biological vision and neurophysiology, studying 3D attention in human observers. This provides a view of the role attention plays at the system level for biological vision. Then, we cover computer and robot vision approaches that take 3D visual attention into account. We compare approaches with respect to different categories, such as feature-based, data-based, or depth-based visual attention, and draw conclusions on what advances will help robotics to cope better with complex real-world settings and tasks.

Interesting implementation of visual graph SLAM in C++ for educational purposes

Dominik Schlegel, Mirco Colosi, Giorgio Grisetti, ProSLAM: Graph SLAM from a Programmer’s Perspective/strong>, arXiv:1709.04377.

In this paper we present ProSLAM, a lightweight stereo visual SLAM system designed with simplicity in mind. Our work stems from the experience gathered by the authors while teaching SLAM to students and aims at providing a highly modular system that can be easily implemented and understood. Rather than focusing on the well known mathematical aspects of Stereo Visual SLAM, in this work we highlight the data structures and the algorithmic aspects that one needs to tackle during the design of such a system. We implemented ProSLAM using the C++ programming language in combination with a minimal set of well known used external libraries. In addition to an open source implementation, we provide several code snippets that address the core aspects of our approach directly in this paper. The results of a thorough validation performed on standard benchmark datasets show that our approach achieves accuracy comparable to state of the art methods, while requiring substantially less computational resources.

Interleaving segmentation (semantics) and dense 3D reconstruction (metrics)

C. Häne, C. Zach, A. Cohen and M. Pollefeys, Dense Semantic 3D Reconstruction, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1730-1743, DOI: 10.1109/TPAMI.2016.2613051.

Both image segmentation and dense 3D modeling from images represent an intrinsically ill-posed problem. Strong regularizers are therefore required to constrain the solutions from being `too noisy’. These priors generally yield overly smooth reconstructions and/or segmentations in certain regions while they fail to constrain the solution sufficiently in other areas. In this paper, we argue that image segmentation and dense 3D reconstruction contribute valuable information to each other’s task. As a consequence, we propose a mathematical framework to formulate and solve a joint segmentation and dense reconstruction problem. On the one hand knowing about the semantic class of the geometry provides information about the likelihood of the surface direction. On the other hand the surface direction provides information about the likelihood of the semantic class. Experimental results on several data sets highlight the advantages of our joint formulation. We show how weakly observed surfaces are reconstructed more faithfully compared to a geometry only reconstruction. Thanks to the volumetric nature of our formulation we also infer surfaces which cannot be directly observed for example the surface between the ground and a building. Finally, our method returns a semantic segmentation which is consistent across the whole dataset.

A new feature for 3D point clouds that is more efficient than the state-of-the-art SHOT

Sai Manoj Prakhya, Bingbing Liu, Weisi Lin, Vinit Jakhetiya & Sharath Chandra Guntuku, B-SHOT: a binary 3D feature descriptor for fast Keypoint matching on 3D point clouds, Auton Robot (2017) 41:1501–1520, DOI: 10.1007/s10514-016-9612-y.

We present the first attempt in creating a binary 3D feature descriptor for fast and efficient keypoint matching on 3D point clouds. Specifically, we propose a linarization technique and apply it on the state-of-the-art 3D feature descriptor, SHOT to create the first binary 3D feature descriptor, which we call B-SHOT. B-SHOT requires 32 times lesser memory for its representation while being six times faster in feature descriptor matching, when compared to the SHOT feature descriptor. Next, we propose a robust evaluation metric, specifically for 3D feature descriptors. A comprehensive evaluation on standard benchmarks reveals that B-SHOT offers comparable keypoint matching performance to that of the state-of-the-art real valued 3D feature descriptors, albeit at dramatically lower computational and memory costs.

A nice general model for camera calibration

S. Ramalingam and P. Sturm, “A Unifying Model for Camera Calibration,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1309-1319, July 1 2017. DOI: 10.1109/TPAMI.2016.2592904.

This paper proposes a unified theory for calibrating a wide variety of camera models such as pinhole, fisheye, cata-dioptric, and multi-camera networks. We model any camera as a set of image pixels and their associated camera rays in space. Every pixel measures the light traveling along a (half-) ray in 3-space, associated with that pixel. By this definition, calibration simply refers to the computation of the mapping between pixels and the associated 3D rays. Such a mapping can be computed using images of calibration grids, which are objects with known 3D geometry, taken from unknown positions. This general camera model allows to represent non-central cameras; we also consider two special subclasses, namely central and axial cameras. In a central camera, all rays intersect in a single point, whereas the rays are completely arbitrary in a non-central one. Axial cameras are an intermediate case: the camera rays intersect a single line. In this work, we show the theory for calibrating central, axial and non-central models using calibration grids, which can be either three-dimensional or planar.

A very good survey of visual saliency methods, with a list of robotic tasks that have benefit from attention

Ali Borji, Dicky N. Sihite, and Laurent Itti, Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study, IEEE Transactions on Image Processing, V. 22, N. 1, 2013, DOI: 10.1109/TIP.2012.2210727.

Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from their surroundings. The latter are often referred to as “visual saliency.” Modeling bottom-up visual saliency has been the subject of numerous research efforts during the past 20 years, with many successful applications in computer vision and robotics. Available models have been tested with different datasets (e.g., synthetic psychological search arrays, natural images or videos) using different evaluation scores (e.g., search slopes, comparison to human eye tracking) and parameter settings. This has made direct comparison of models difficult. Here, we perform an exhaustive comparison of 35 state-of-the-art saliency models over 54 challenging synthetic patterns, three natural image datasets, and two video datasets, using three evaluation scores. We find that although model rankings vary, some models consistently perform better. Analysis of datasets reveals that existing datasets are highly center-biased, which influences some of the evaluation scores. Computational complexity analysis shows that some models are very fast, yet yield competitive eye movement prediction accuracy. Different models often have common easy/difficult stimuli. Furthermore, several concerns in visual saliency modeling,
eye movement datasets, and evaluation scores are discussed and insights for future work are provided. Our study allows one to assess the state-of-the-art, helps to organizing this rapidly growing field, and sets a unified comparison framework for gauging future efforts, similar to the PASCAL VOC challenge in the object recognition and detection domains.

Including selective attention and cortical magnification to improve computer vision

Ala Aboudib, Vincent Gripon, Gilles Coppin, A Biologically Inspired Framework for Visual Information Processing and an Application on Modeling Bottom-Up Visual Attention, Cognitive Computation, December 2016, Volume 8, Issue 6, pp 1007–1026, DOI: 10.1007/s12559-016-9430-8.

An emerging trend in visual information processing is toward incorporating some interesting properties of the ventral stream in order to account for some limitations of machine learning algorithms. Selective attention and cortical magnification are two such important phenomena that have been the subject of a large body of research in recent years. In this paper, we focus on designing a new model for visual acquisition that takes these important properties into account.We propose a new framework for visual information acquisition and representation that emulates the architecture of the primate visual system by integrating features such as retinal sampling and cortical magnification while avoiding spatial deformations and other side effects produced by models that tried to implement these two features. It also explicitly integrates the notion of visual angle, which is rarely taken into account by vision models. We argue that this framework can provide the infrastructure for implementing vision tasks such as object recognition and computational visual attention algorithms.To demonstrate the utility of the proposed vision framework, we propose an algorithm for bottom-up saliency prediction implemented using the proposed architecture. We evaluate the performance of the proposed model on the MIT saliency benchmark and show that it attains state-of-the-art performance, while providing some advantages over other models.

Anticipating human actions through recognition of object affordances and use of Anticiopatory Conditional Random Fields

Koppula, H.S.; Saxena, A., Anticipating Human Activities Using Object Affordances for Reactive Robotic Response, in Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.38, no.1, pp.14-29, Jan. 1 2016, DOI: 10.1109/TPAMI.2015.2430335.

An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities will a human do next (and how) can enable an assistive robot to plan ahead for reactive responses. Furthermore, anticipation can even improve the detection accuracy of past activities. The challenge, however, is two-fold: We need to capture the rich context for modeling the activities and object affordances, and we need to anticipate the distribution over a large space of future human activities. In this work, we represent each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances. We then consider each ATCRF as a particle and represent the distribution over the potential futures using a set of particles. In extensive evaluation on CAD-120 human activity RGB-D dataset, we first show that anticipation improves the state-of-the-art detection results. We then show that for new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of top three predictions actually happened) of 84.1, 74.4 and 62.2 percent for an anticipation time of 1, 3 and 10 seconds respectively. Finally, we also show a robot using our algorithm for performing a few reactive responses.

Detecting objects in images through the timing of the changes in the visual sensor, rather than through the analysis of frames (without time information)

Orchard, G.; Meyer, C.; Etienne-Cummings, R.; Posch, C.; Thakor, N.; Benosman, R., HFirst: A Temporal Approach to Object Recognition, in Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.37, no.10, pp.2028-2040, Oct. 1 2015 DOI: 10.1109/TPAMI.2015.2392947.

This paper introduces a spiking hierarchical model for object recognition which utilizes the precise timing information inherently present in the output of biologically inspired asynchronous address event representation (AER) vision sensors. The asynchronous nature of these systems frees computation and communication from the rigid predetermined timing enforced by system clocks in conventional systems. Freedom from rigid timing constraints opens the possibility of using true timing to our advantage in computation. We show not only how timing can be used in object recognition, but also how it can in fact simplify computation. Specifically, we rely on a simple temporal-winner-take-all rather than more computationally intensive synchronous operations typically used in biologically inspired neural networks for object recognition. This approach to visual computation represents a major paradigm shift from conventional clocked systems and can find application in other sensory modalities and computational tasks. We showcase effectiveness of the approach by achieving the highest reported accuracy to date (97.5% ± 3.5%) for a previously published four class card pip recognition task and an accuracy of 84.9% ± 1.9% for a new more difficult 36 class character recognition task.