Category Archives: Computer Vision

Brief but nice related work about structured prediction (MRFs, CRFs, etc.)

Bratieres, S.; Quadrianto, N.; Ghahramani, Z., GPstruct: Bayesian Structured Prediction Using Gaussian Processes, Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.37, no.7, pp.1514,1520, July 1 2015, DOI: 10.1109/TPAMI.2014.2366151.

We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M ^3 N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

Example of both bottom-up and top-down processes that are integrated in a solution for the recognition of shapes

Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos, A Gestaltist approach to contour-based object recognition: Combining bottom-up and top-down cues, The International Journal of Robotics Research April 2015 34: 627-652, first published on March 25, 2015, DOI: 10.1177/0278364914558493.

This paper proposes a method for detecting generic classes of objects from their representative contours that can be used by a robot with vision to find objects in cluttered environments. The approach uses a mid-level image operator to group edges into contours which likely correspond to object boundaries. This mid-level operator is used in two ways, bottom-up on simple edges and top-down incorporating object shape information, thus acting as the intermediary between low-level and high-level information. First, the mid-level operator, called the image torque, is applied to simple edges to extract likely fixation locations of objects. Using the operator’s output, a novel contour-based descriptor is created that extends the shape context descriptor to include boundary ownership information and accounts for rotation. This descriptor is then used in a multi-scale matching approach to modulate the torque operator towards the target, so it indicates its location and size. Unlike other approaches that use edges directly to guide the independent edge grouping and matching processes for recognition, both of these steps are effectively combined using the proposed method. We evaluate the performance of our approach using four diverse datasets containing a variety of object categories in clutter, occlusion and viewpoint changes. Compared with current state-of-the-art approaches, our approach is able to detect the target with fewer false alarms in most object categories. The performance is further improved when we exploit depth information available from the Kinect RGB-Depth sensor by imposing depth consistency when applying the image torque.

Reinforcement learning used for an adaptive attention mechanism, and integrated in an architecture with both top-down and bottom-up vision processing

Ognibene, D.; Baldassare, G., Ecological Active Vision: Four Bioinspired Principles to Integrate Bottom–Up and Adaptive Top–Down Attention Tested With a Simple Camera-Arm Robot, Autonomous Mental Development, IEEE Transactions on , vol.7, no.1, pp.3,25, March 2015. DOI: 10.1109/TAMD.2014.2341351.

Vision gives primates a wealth of information useful to manipulate the environment, but at the same time it can easily overwhelm their computational resources. Active vision is a key solution found by nature to solve this problem: a limited fovea actively displaced in space to collect only relevant information. Here we highlight that in ecological conditions this solution encounters four problems: 1) the agent needs to learn where to look based on its goals; 2) manipulation causes learning feedback in areas of space possibly outside the attention focus; 3) good visual actions are needed to guide manipulation actions, but only these can generate learning feedback; and 4) a limited fovea causes aliasing problems. We then propose a computational architecture (“BITPIC”) to overcome the four problems, integrating four bioinspired key ingredients: 1) reinforcement-learning fovea-based top-down attention; 2) a strong vision-manipulation coupling; 3) bottom-up periphery-based attention; and 4) a novel action-oriented memory. The system is tested with a simple simulated camera-arm robot solving a class of search-and-reach tasks involving color-blob “objects.” The results show that the architecture solves the problems, and hence the tasks, very efficiently, and highlight how the architecture principles can contribute to a full exploitation of the advantages of active vision in ecological conditions.