Category Archives: Computer Vision

Interesting survey on heart-reat detection through conventional cameras

X. Chen, J. Cheng, R. Song, Y. Liu, R. Ward and Z. J. Wang, Video-Based Heart Rate Measurement: Recent Advances and Future Prospects. IEEE Transactions on Instrumentation and Measurement, vol. 68, no. 10, pp. 3600-3615, DOI: 10.1109/TIM.2018.2879706.

Heart rate (HR) estimation and monitoring is of great importance to determine a person’s physiological and mental status. Recently, it has been demonstrated that HR can be remotely retrieved from facial video-based photoplethysmographic signals captured using professional or consumer-level cameras. Many efforts have been made to improve the detection accuracy of this noncontact technique. This paper presents a timely, systematic survey on such video-based remote HR measurement approaches, with a focus on recent advancements that overcome dominating technical challenges arising from illumination variations and motion artifacts. Representative methods up to date are comparatively summarized with respect to their principles, pros, and cons under different conditions. Future prospects of this promising technique are discussed and potential research directions are described. We believe that such a remote HR measurement technique, taking advantages of unobtrusiveness while providing comfort and convenience, will be beneficial for many healthcare applications.

Learning multiple-factors metrics for measuring the similarity between objects

H. Ye, D. Zhan, Y. Jiang and Z. Zhou, What Makes Objects Similar: A Unified Multi-Metric Learning Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 5, pp. 1257-1270, DOI: 10.1109/TPAMI.2018.2829192.

Linkages are essentially determined by similarity measures that may be derived from multiple perspectives. For example, spatial linkages are usually generated based on localities of heterogeneous data. Semantic linkages, however, can come from even more properties, such as different physical meanings behind social relations. Many existing metric learning models focus on spatial linkages but leave the rich semantic factors unconsidered. We propose a Unified Multi-Metric Learning (Um$^2$2l) framework to exploit multiple types of metrics with respect to overdetermined similarities between linkages. In Um$^2$2l, types of combination operators are introduced for distance characterization from multiple perspectives, and thus can introduce flexibilities for representing and utilizing both spatial and semantic linkages. Besides, we propose a uniform solver for Um$^2$2l, and the theoretical analysis reflects the generalization ability of Um$^2$2l as well. Extensive experiments on diverse applications exhibit the superior classification performance and comprehensibility of Um$^2$2l. Visualization results also validate its ability to physical meanings discovery.

Selecting the best visual cues in the next future for reducing the computational cost of localization under limited computational resources

L. Carlone and S. Karaman, Attention and Anticipation in Fast Visual-Inertial Navigation, IEEE Transactions on Robotics, vol. 35, no. 1, pp. 1-20, Feb. 2019 DOI: 10.1109/TRO.2018.2872402.

We study a visual-inertial navigation (VIN) problem in which a robot needs to estimate its state using an on-board camera and an inertial sensor, without any prior knowledge of the external environment. We consider the case in which the robot can allocate limited resources to VIN, due to tight computational constraints. Therefore, we answer the following question: under limited resources, what are the most relevant visual cues to maximize the performance of VIN? Our approach has four key ingredients. First, it is task-driven, in that the selection of the visual cues is guided by a metric quantifying the VIN performance. Second, it exploits the notion of anticipation, since it uses a simplified model for forward-simulation of robot dynamics, predicting the utility of a set of visual cues over a future time horizon. Third, it is efficient and easy to implement, since it leads to a greedy algorithm for the selection of the most relevant visual cues. Fourth, it provides formal performance guarantees: we leverage submodularity to prove that the greedy selection cannot be far from the optimal (combinatorial) selection. Simulations and real experiments on agile drones show that our approach ensures state-of-the-art VIN performance while maintaining a lean processing time. In the easy scenarios, our approach outperforms appearance-based feature selection in terms of localization errors. In the most challenging scenarios, it enables accurate VIN while appearance-based feature selection fails to track robot’s motion during aggressive maneuvers.

Interesting use of RL (deep-RL) for detection – reformulation of detection as a sequential decision process

F. Ghesu et al., Multi-Scale Deep Reinforcement Learning for Real-Time 3D-Landmark Detection in CT Scans, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 176-189, DOI: 10.1109/TPAMI.2017.2782687.

Robust and fast detection of anatomical structures is a prerequisite for both diagnostic and interventional medical image analysis. Current solutions for anatomy detection are typically based on machine learning techniques that exploit large annotated image databases in order to learn the appearance of the captured anatomy. These solutions are subject to several limitations, including the use of suboptimal feature engineering techniques and most importantly the use of computationally suboptimal search-schemes for anatomy detection. To address these issues, we propose a method that follows a new paradigm by reformulating the detection problem as a behavior learning task for an artificial agent. We couple the modeling of the anatomy appearance and the object search in a unified behavioral framework, using the capabilities of deep reinforcement learning and multi-scale image analysis. In other words, an artificial agent is trained not only to distinguish the target anatomical object from the rest of the body but also how to find the object by learning and following an optimal navigation path to the target object in the imaged volumetric space. We evaluated our approach on 1487 3D-CT volumes from 532 patients, totaling over 500,000 image slices and show that it significantly outperforms state-of-the-art solutions on detecting several anatomical structures with no failed cases from a clinical acceptance perspective, while also achieving a 20-30 percent higher detection accuracy. Most importantly, we improve the detection-speed of the reference methods by 2-3 orders of magnitude, achieving unmatched real-time performance on large 3D-CT scans.

Deep reinforcement learning applied to learn both attention and classification in a task of vehicle classification

D. Zhao, Y. Chen and L. Lv, Deep Reinforcement Learning With Visual Attention for Vehicle Classification, IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 356-367, DOI: 10.1109/TCDS.2016.2614675.

Automatic vehicle classification is crucial to intelligent transportation system, especially for vehicle-tracking by police. Due to the complex lighting and image capture conditions, image-based vehicle classification in real-world environments is still a challenging task and the performance is far from being satisfactory. However, owing to the mechanism of visual attention, the human vision system shows remarkable capability compared with the computer vision system, especially in distinguishing nuances processing. Inspired by this mechanism, we propose a convolutional neural network (CNN) model of visual attention for image classification. A visual attention-based image processing module is used to highlight one part of an image and weaken the others, generating a focused image. Then the focused image is input into the CNN to be classified. According to the classification probability distribution, we compute the information entropy to guide a reinforcement learning agent to achieve a better policy for image classification to select the key parts of an image. Systematic experiments on a surveillance-nature dataset which contains images captured by surveillance cameras in the front view, demonstrate that the proposed model is more competitive than the large-scale CNN in vehicle classification tasks.

Improving the search of matching image features using the usual coherence present in true matches

W. Y. Lin et al, CODE: Coherence Based Decision Boundaries for Feature Correspondence, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 34-47, DOI: 10.1109/TPAMI.2017.2652468.

A key challenge in feature correspondence is the difficulty in differentiating true and false matches at a local descriptor level. This forces adoption of strict similarity thresholds that discard many true matches. However, if analyzed at a global level, false matches are usually randomly scattered while true matches tend to be coherent (clustered around a few dominant motions), thus creating a coherence based separability constraint. This paper proposes a non-linear regression technique that can discover such a coherence based separability constraint from highly noisy matches and embed it into a correspondence likelihood model. Once computed, the model can filter the entire set of nearest neighbor matches (which typically contains over 90 percent false matches) for true matches. We integrate our technique into a full feature correspondence system which reliably generates large numbers of good quality correspondences over wide baselines where previous techniques provide few or no matches.

Interesting survey on Visual SLAM without filtering and of its future lines of research

Georges Younes, Daniel Asmar, Elie Shammas, John Zelek, Keyframe-based monocular SLAM: design, survey, and future directions, Robotics and Autonomous Systems, Volume 98, 2017, Pages 67-88, DOI: 10.1016/j.robot.2017.09.010.

Extensive research in the field of monocular SLAM for the past fifteen years has yielded workable systems that found their way into various applications in robotics and augmented reality. Although filter-based monocular SLAM systems were common at some time, the more efficient keyframe-based solutions are becoming the de facto methodology for building a monocular SLAM system. The objective of this paper is threefold: first, the paper serves as a guideline for people seeking to design their own monocular SLAM according to specific environmental constraints. Second, it presents a survey that covers the various keyframe-based monocular SLAM systems in the literature, detailing the components of their implementation, and critically assessing the specific strategies made in each proposed solution. Third, the paper provides insight into the direction of future research in this field, to address the major limitations still facing monocular SLAM; namely, in the issues of illumination changes, initialization, highly dynamic motion, poorly textured scenes, repetitive textures, map maintenance, and failure recovery.

First end-to-end implementation of (monocular) Visual Odometry with deep neural networks, including output with the uncertainty of the result

Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni, End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks, The International Journal of Robotics Research Vol 37, Issue 4-5, pp. 513 – 542, DOI: 0.1177/0278364917734298.

This paper studies visual odometry (VO) from the perspective of deep learning. After tremendous efforts in the robotics and computer vision communities over the past few decades, state-of-the-art VO algorithms have demonstrated incredible performance. However, since the VO problem is typically formulated as a pure geometric problem, one of the key features still missing from current VO systems is the capability to automatically gain knowledge and improve performance through learning. In this paper, we investigate whether deep neural networks can be effective and beneficial to the VO problem. An end-to-end, sequence-to-sequence probabilistic visual odometry (ESP-VO) framework is proposed for the monocular VO based on deep recurrent convolutional neural networks. It is trained and deployed in an end-to-end manner, that is, directly inferring poses and uncertainties from a sequence of raw images (video) without adopting any modules from the conventional VO pipeline. It can not only automatically learn effective feature representation encapsulating geometric information through convolutional neural networks, but also implicitly model sequential dynamics and relation for VO using deep recurrent neural networks. Uncertainty is also derived along with the VO estimation without introducing much extra computation. Extensive experiments on several datasets representing driving, flying and walking scenarios show competitive performance of the proposed ESP-VO to the state-of-the-art methods, demonstrating a promising potential of the deep learning technique for VO and verifying that it can be a viable complement to current VO systems.

Automatic hierarchization for the recognition of places in images

Chen Fan, Zetao Chen, Adam Jacobson, Xiaoping Hu, Michael Milford, Biologically-inspired visual place recognition with adaptive multiple scales,Robotics and Autonomous Systems, Volume 96, 2017, Pages 224-237, DOI: 10.1016/j.robot.2017.07.015.

In this paper we present a novel adaptive multi-scale system for performing visual place recognition. Unlike recent previous multi-scale place recognition systems that use manually pre-fixed scales, we present a system that adaptively selects the spatial scales. This approach differs from previous multi-scale methods, where place recognition is performed through a non-optimized distance metric in a fixed and pre-determined scale space. Instead, we learn an optimized distance metric which creates a new recognition space for clustering images with similar features while separating those with different features. Consequently, the method exploits the natural spatial scales present in the operating environment. With these adaptive scales, a hierarchical recognition mechanism with multiple parallel channels is then proposed. Each channel performs place recognition from a coarse match to a fine match. We present specific techniques for training each channel to recognize places at varying spatial scales and for combining the place recognition hypotheses from these parallel channels. We also conduct a systematic series of experiments and parameter studies that determine the effect on performance of using different numbers of combined recognition channels. The results demonstrate that the adaptive multi-scale approach outperforms the previous fixed multi-scale approach and is capable of producing better than state of the art performance compared to existing robotic navigation algorithms. The system complexity is linear in the number of places in the reference static map and can realize the online place recognition in mobile robotics on typical dataset sizes We analyze the results and provide theoretical analysis of the performance improvements. Finally, we discuss interesting insights gained with respect to future work in robotics and neuroscience in this area.

An open-source implementation of visual SLAM with a very nice related-work section

R. Mur-Artal and J. D. Tardós, ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras, IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, DOI: 10.1109/TRO.2017.2705103.

We present ORB-SLAM2, a complete simultaneous localization and mapping (SLAM) system for monocular, stereo and RGB-D cameras, including map reuse, loop closing, and relocalization capabilities. The system works in real time on standard central processing units in a wide variety of environments from small hand-held indoors sequences, to drones flying in industrial environments and cars driving around a city. Our back-end, based on bundle adjustment with monocular and stereo observations, allows for accurate trajectory estimation with metric scale. Our system includes a lightweight localization mode that leverages visual odometry tracks for unmapped regions and matches with map points that allow for zero-drift localization. The evaluation on 29 popular public sequences shows that our method achieves state-of-the-art accuracy, being in most cases the most accurate SLAM solution. We publish the source code, not only for the benefit of the SLAM community, but with the aim of being an out-of-the-box SLAM solution for researchers in other fields.