Voting-Based Computational Framework for Motion Analysis

Overview

A traditional formulation of the visual motion analysis problem is the following: given two or more image frames, the goal is to determine three types of information - a dense velocity field, motion boundaries, and regions. From a computational point of view, one of the most powerful and most often used constraints is the smoothness of motion. Most approaches rely on parametric models that restrict the types of motion that can be analyzed, and also involve iterative methods which depend heavily on initial conditions and are subject to instability. Moreover, previous techniques usually encounter difficulties in image regions where motion is not smooth (i.e., around motion boundaries). This problem has led to numerous inconsistent methods, with ad-hoc criteria introduced to account for motion discontinuities.

In order to address these difficulties, we developed a novel approach for motion analysis, formulating it as the inference of motion layers from a noisy and possibly sparse point set in a 4-D space. Our method is based on a layered 4-D representation of data and a voting scheme for token communication, within a tensor voting computational framework. From a possibly sparse input consisting of identical point tokens in two frames, the image position (x y) and potential velocity (v_x v_y) of each token are encoded into a 4-D tensor. Within this 4-D space, moving regions are conceptually represented as smooth surface layers, and are extracted through a voting process that enforces the smoothness constraint. By using an additional 2-D voting step that incorporates intensity information (edges) from the original images, we infer accurate boundaries and regions.

Our framework is able to consistently handle both smooth moving regions and motion discontinuities. It also benefits from the fact that it is non-iterative and it does not depend on critical thresholds, the only free parameter being the scale of analysis. Moreover, no assumption is made regarding the type of motion - the only criterion used is the smoothness of image motion.

Tensor Voting Framework

Tensor voting is a non-iterative methodology for the inference of statistically salient features from possibly sparse and noisy data. The input data is encoded as tensors, then support information (including proximity and smoothness of continuity) is propagated by voting. The only free parameter is the scale of analysis, which is an inherent property of visual perception.

For a more intuitive explanation, the framework is summarized here for the 2-D case, where the salient features to be extracted are points and curves. Each token is encoded as a second order symmetric 2-D tensor, geometrically equivalent to an ellipse. It is described by a 2x2 matrix, whose eigenvectors e₁ and e₂ give the ellipse orientation, and eigenvalues l₁ and l₂ give the ellipse size.

An input token that represents a curve element is encoded as an elementary stick tensor, where e₂ represents the curve tangent and e₁ the curve normal, while l₁=1 and l₂=0. An input point element is encoded as an elementary ball tensor, with no preferred orientation, and l₁=l₂=1.
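The two elementary encodings can be written down directly as 2x2 matrices. The following is a minimal sketch, assuming NumPy; the helper names stick_tensor and ball_tensor are hypothetical and are used only for illustration.

```python
import numpy as np

def stick_tensor(normal):
    """Elementary 2-D stick tensor for a curve element with the given normal.

    The result is n n^T, so e1 is the normal, l1 = 1 and l2 = 0.
    """
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.outer(n, n)

def ball_tensor():
    """Elementary 2-D ball tensor: no preferred orientation, l1 = l2 = 1."""
    return np.eye(2)

# A horizontal curve element: tangent along x, normal along y.
stick = stick_tensor([0.0, 1.0])
ball = ball_tensor()
```

Geometrically, the stick is a degenerate ellipse collapsed onto its normal direction, while the ball is a circle.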

The communication between tokens is performed through a voting process, where each token casts a vote at each site in its neighborhood. The size and shape of this neighborhood, and the vote strength and orientation are encapsulated in predefined voting fields, one for each feature type - there is a stick voting field and a ball voting field in the 2-D case. Vote orientation corresponds to the smoothest local curve continuation from voter to recipient, while vote strength decays exponentially with distance and curvature.


Fig. 1. Voting in 2-D: (a) vote generation, (b) stick field, (c) ball field

Fig. 1(a) shows how votes are generated to build the 2-D stick field. A tensor P where curve information is locally known casts a vote at its neighbor Q. The vote orientation is chosen so that it ensures a smooth curve continuation through a circular arc from voter P to recipient Q. Fig. 1(b) shows the 2-D stick field, with its color-coded strength. When the voter is a ball tensor, with no information known locally, the vote is generated by rotating a stick vote in the 2-D plane and integrating all contributions. The 2-D ball field is shown in Fig. 1(c).
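The vote construction described above can be sketched as follows. This is an illustrative sketch, not the implementation behind the figures: the decay function exp(-(s^2 + c k^2) / sigma^2), with arc length s and curvature k, is one common choice consistent with "decays exponentially with distance and curvature"; the constant c, the 45-degree cutoff, and the function names are assumptions. The voter sits at the origin with its tangent along x and its normal along y.

```python
import numpy as np

def stick_vote(q, sigma=5.0, c=1.0):
    """Vote (2x2 tensor) cast at q by a unit stick voter at the origin."""
    x, y = float(q[0]), float(q[1])
    l = np.hypot(x, y)
    if l == 0.0:
        return np.zeros((2, 2))
    # Angle between the voter tangent and the line P->Q (field is mirror-symmetric in x).
    theta = np.arctan2(y, abs(x))
    if abs(theta) > np.pi / 4:            # assumed cutoff: no vote beyond 45 degrees
        return np.zeros((2, 2))
    s = l / np.sinc(theta / np.pi)        # arc length theta*l/sin(theta); equals l when theta = 0
    kappa = 2.0 * np.sin(theta) / l       # curvature of the circular arc through P and Q
    strength = np.exp(-(s ** 2 + c * kappa ** 2) / sigma ** 2)
    # Normal of the circular arc at Q: the voter normal rotated by 2*theta.
    normal = np.array([-np.sin(2 * theta), np.cos(2 * theta)])
    if x < 0:
        normal[0] = -normal[0]            # mirror for receivers on the other side
    return strength * np.outer(normal, normal)

def ball_vote(q, sigma=5.0, c=1.0, n_angles=36):
    """Ball vote: stick votes from a voter rotated over the full circle, averaged."""
    q = np.asarray(q, dtype=float)
    total = np.zeros((2, 2))
    for phi in np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False):
        cs, sn = np.cos(phi), np.sin(phi)
        R = np.array([[cs, -sn], [sn, cs]])
        # Express q in the rotated voter frame, vote, and rotate the vote back.
        total += R @ stick_vote(R.T @ q, sigma, c) @ R.T
    return total / n_angles
```

For a receiver straight ahead of the voter, for example stick_vote((3, 0)), the vote keeps the voter's own normal; for a receiver off the tangent line, the normal is tilted so that both tokens lie on a common circular arc.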

At each receiving site, the collected votes are combined through simple tensor addition, producing generic 2-D tensors. During voting, tokens that lie on a smooth curve reinforce each other, and the tensors deform according to the prevailing orientation. Each tensor encodes the local orientation of geometric features (given by the tensor orientation), and their saliency (given by the tensor shape and size). For a generic 2-D tensor, its curve saliency is given by (l₁-l₂), the curve normal orientation by e₁, while its point saliency is given by l₂. Therefore, the voting process infers curves and junctions simultaneously, while also identifying outlier noise (tokens that receive very little support).
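As a small illustration of this interpretation step, the sketch below sums a few votes and reads the saliencies off the eigen-decomposition. It assumes NumPy; the function name decompose is hypothetical.

```python
import numpy as np

def decompose(T):
    """Return (curve saliency l1 - l2, curve normal e1, point saliency l2)."""
    w, v = np.linalg.eigh(T)      # eigenvalues in ascending order
    l2, l1 = w
    e1 = v[:, 1]                  # eigenvector of the largest eigenvalue
    return l1 - l2, e1, l2

# Two agreeing stick votes (normal along y) reinforce each other ...
curve_like = np.outer([0.0, 1.0], [0.0, 1.0]) * 2.0
# ... while two orthogonal stick votes add up to a ball (no preferred orientation).
junction_like = np.outer([1.0, 0.0], [1.0, 0.0]) + np.outer([0.0, 1.0], [0.0, 1.0])

curve_sal, normal, point_sal = decompose(curve_like)
j_curve_sal, _, j_point_sal = decompose(junction_like)
```

A token with high curve saliency lies on a curve, a token with high point saliency and low curve saliency behaves like a junction, and a token where both are low is treated as noise.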

The generality of the voting framework allows for easy extension to higher dimensions. The 3-D case is similar, where the geometric features are points, curves and surfaces, while in the 4-D case the features are points, curves, surfaces and volumes.

Approach

1) Generating candidate matches

We take as input two image frames that involve general motion, as shown in Fig. 2. For every pixel in the first image, the goal at this stage is to produce candidate matches in the second image. We use a normalized cross-correlation procedure, where all peaks of correlation are retained as candidates. Each candidate match is represented as a (x y v_x v_y) point in the 4-D space of image coordinates and pixel velocities, with respect to the first image. Since we want to increase the likelihood of including the correct match among the candidates, we repeat this process at multiple scales, by using different correlation window sizes.
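The candidate generation step can be sketched as below. This is a simplified illustration assuming NumPy, grayscale images stored as 2-D arrays, integer displacements, and a 3x3 neighborhood test for correlation peaks; the function names, window sizes, and search range are hypothetical choices, not values taken from the experiments.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def candidate_matches(img1, img2, x, y, half=2, max_disp=3):
    """Return a (x, y, vx, vy) tuple for every local peak of the NCC score."""
    ref = img1[y - half:y + half + 1, x - half:x + half + 1]
    d = max_disp
    scores = np.full((2 * d + 1, 2 * d + 1), -np.inf)
    for vy in range(-d, d + 1):
        for vx in range(-d, d + 1):
            yy, xx = y + vy, x + vx
            if (yy - half < 0 or xx - half < 0 or
                    yy + half >= img2.shape[0] or xx + half >= img2.shape[1]):
                continue                      # window falls outside the second image
            win = img2[yy - half:yy + half + 1, xx - half:xx + half + 1]
            scores[vy + d, vx + d] = ncc(ref, win)
    padded = np.pad(scores, 1, constant_values=-np.inf)
    peaks = []
    for i in range(2 * d + 1):
        for j in range(2 * d + 1):
            s = scores[i, j]
            # Keep every displacement whose score is maximal in its 3x3 neighborhood.
            if np.isfinite(s) and s >= padded[i:i + 3, j:j + 3].max():
                peaks.append((x, y, j - d, i - d))
    return peaks

rng = np.random.default_rng(0)
img1 = rng.random((24, 24))
img2 = np.roll(img1, shift=(1, 2), axis=(0, 1))   # true motion: vx = 2, vy = 1

cands = candidate_matches(img1, img2, 10, 10)
# Multiple scales: repeat with several window sizes and pool the candidates.
multi = sorted(set().union(*(candidate_matches(img1, img2, 10, 10, half=h)
                             for h in (1, 2, 3))))
```

Retaining all peaks rather than only the best one yields wrong candidates as well; these are left for the voting stage to reject.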

The resulting candidates appear as a cloud of (x y v_x v_y) points in the 4-D space. Fig. 3 shows a 3-D view of the candidate matches - the 3 dimensions shown are x and y (in the horizontal plane), and v_x (the height). The motion layers can be already perceived as their tokens are grouped in smooth surfaces surrounded by noisy matches.

Fig. 2. An input image


Fig. 3. Matching candidates
Fig. 4. Dense layers

2) Extraction of motion layers

Within our 4-D representation, the smoothness constraint is embedded in the concept of surface saliency exhibited by the data. By letting the tokens communicate their mutual affinity through voting, noisy matches are eliminated as they receive little support, and distinct regions are extracted as smooth, salient surface layers.

Selection. Since no information is initially known, each potential match is encoded into a 4-D ball tensor. Then each token casts votes by using the corresponding ball voting field. During voting there is strong support between tokens that lie on a smooth surface (layer), while isolated tokens receive little or no support. For each pixel we retain the candidate match with the highest surface saliency, and we reject the others as outliers.
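The selection step can be illustrated with the sketch below, which rests on several assumptions. The precomputed 4-D ball voting field is replaced by a simplified closed-form ball vote, exp(-d^2 / sigma^2) (I - v v^T), where v is the unit vector between voter and receiver; surface saliency is taken as l2 - l3 of the accumulated 4-D tensor, since a surface in 4-D has two normals; and image coordinates and velocities are assumed to share the same units. Function names and parameter values are hypothetical.

```python
import numpy as np

def surface_saliency(tokens, sigma=3.0):
    """Surface saliency (l2 - l3) at each 4-D token after simplified ball voting."""
    tokens = np.asarray(tokens, dtype=float)
    n, dim = tokens.shape
    sal = np.zeros(n)
    for i in range(n):
        diff = tokens - tokens[i]
        d2 = (diff ** 2).sum(axis=1)
        mask = (d2 > 0) & (d2 < (3 * sigma) ** 2)      # neighbors within 3*sigma
        w = np.exp(-d2[mask] / sigma ** 2)             # vote strength decays with distance
        v = diff[mask] / np.sqrt(d2[mask])[:, None]    # unit directions to the voters
        # Sum of w * (I - v v^T) over all voters.
        T = w.sum() * np.eye(dim) - (v * w[:, None]).T @ v
        lam = np.linalg.eigvalsh(T)[::-1]              # eigenvalues, descending
        sal[i] = lam[1] - lam[2]
    return sal

def select_matches(tokens, sigma=3.0):
    """For each pixel (x, y), keep the candidate with the highest surface saliency."""
    tokens = np.asarray(tokens, dtype=float)
    sal = surface_saliency(tokens, sigma)
    best = {}
    for tok, s in zip(tokens, sal):
        key = (tok[0], tok[1])
        if key not in best or s > best[key][1]:
            best[key] = (tok, s)
    return np.array([t for t, _ in best.values()])

# One smooth layer moving with velocity (1, 0), plus three isolated wrong candidates.
layer = [(x, y, 1.0, 0.0) for x in range(7) for y in range(7)]
wrong = [(3, 3, 6.0, -5.0), (0, 0, -6.0, 5.0), (6, 2, 5.0, 6.0)]
selected = select_matches(layer + wrong)
```

In this toy input the three wrong candidates lie far from the layer in the 4-D space, so they collect almost no votes and lose to the correct match at their pixel.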

Orientation refinement. In order to obtain an estimation of the layer orientations as accurate as possible, we perform an orientation refinement through another voting process, but now with the selected matches only. After voting, the eigenvectors give the local layer orientations at each token. The remaining outliers are also rejected at this step, based on their low surface saliency.

Densification. Since the previous step created holes (i.e., pixels where no velocity is available), we infer this information from the neighbors by using a smoothness constraint. This is performed through an additional dense voting step, by generating discrete velocity candidates, collecting votes at each such location, and retaining the candidate with maximal surface saliency. By following this procedure at every image location we generate a dense velocity field. A 3-D view of the dense layers (the height represents v_x) is shown in Fig. 4.
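The dense voting step can be sketched for a single hole as follows. As a simplification, the sketch lets the selected tokens cast unoriented ball-type votes, exp(-d^2 / sigma^2) (I - v v^T), whereas the procedure above also has the refined layer orientations available; the velocity grid, the saliency measure l2 - l3, and all names are assumptions made for illustration.

```python
import numpy as np

def surface_saliency_at(p, tokens, sigma=3.0):
    """Surface saliency (l2 - l3) of the votes collected at an arbitrary 4-D location p."""
    diff = tokens - p
    d2 = (diff ** 2).sum(axis=1)
    mask = d2 > 0
    w = np.exp(-d2[mask] / sigma ** 2)
    v = diff[mask] / np.sqrt(d2[mask])[:, None]
    T = w.sum() * np.eye(4) - (v * w[:, None]).T @ v
    lam = np.linalg.eigvalsh(T)[::-1]        # eigenvalues, descending
    return lam[1] - lam[2]

def densify_pixel(x, y, tokens, v_max=3.0, v_step=0.5, sigma=3.0):
    """Try discrete velocity candidates at (x, y) and keep the most salient one."""
    tokens = np.asarray(tokens, dtype=float)
    grid = np.arange(-v_max, v_max + v_step / 2, v_step)
    best_v, best_s = None, -np.inf
    for vx in grid:
        for vy in grid:
            s = surface_saliency_at(np.array([x, y, vx, vy], dtype=float), tokens, sigma)
            if s > best_s:
                best_v, best_s = (vx, vy), s
    return best_v

# A layer with velocity (1, 0) on a 7x7 grid, with a hole at pixel (3, 3).
tokens = [(x, y, 1.0, 0.0) for x in range(7) for y in range(7) if (x, y) != (3, 3)]
filled = densify_pixel(3, 3, tokens)
```

The candidate that lies on the surrounding layer receives the most coherent support, so the hole is filled with the layer velocity.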

3) Boundary inference

After grouping the tokens into regions, based on the smoothness of both velocities and layer orientations, it becomes apparent that the extracted layers may still be over- or under-extended along the true object boundaries. Fig. 5 illustrates the recovered v_x velocities within layers (dark corresponds to low velocity), and Fig. 6 shows the layer boundaries superimposed on the first input image.

This situation typically occurs in areas subject to occlusion, where the initial correlation procedure may generate wrong matches that are consistent with the correct ones, and therefore could not be rejected as outlier noise. However, the key observation is that one should not only rely on motion cues in order to perform motion segmentation. Examining the original images reveals a multitude of monocular cues, such as intensity edges, that can aid in identifying the true object boundaries.

The boundaries of the extracted layers give us a good estimate for the position and overall orientation of the true boundaries. We combine this knowledge with monocular cues (intensity edges) from the original images in order to build a boundary saliency map within the uncertainty zone along the layer margins. The smoothness and continuity of the boundary is then enforced through a 2-D voting process, and the true boundary is extracted as the most salient curve within the saliency map. Finally, pixels from the uncertainty zone are reassigned to regions according to the new boundaries, and their velocities are recomputed. Fig. 7 shows the refined velocities within layers, and Fig. 8 shows the refined motion boundaries, which indeed correspond to the actual objects.


Fig. 5. Layer velocities
Fig. 6. Layer boundaries


Fig. 7. Refined velocities
Fig. 8. Refined boundaries

Results

1) From motion cues only

Our framework is also able to infer structure even in cases where only motion cues are available. Human vision can handle these cases remarkably well, and their study is fundamental for understanding the motion analysis process. Nevertheless, they are very difficult from a computational perspective - most existing methods cannot handle such examples in a consistent and unified manner.

The input consists of two sets of 400 points each, representing an opaque rotating disk against a translating background - see Fig. 9(a). After processing, only 2 matches among 400 are wrongly estimated. This is a very difficult case even for human vision, due to the fact that around the left extremity of the disk the two motions are almost identical. The key fact is that we rely not only on the 4-D positions, but also on the local layer orientations that are still different and therefore provide a good affinity measure. Fig. 9(b) shows a 3-D view of the recovered dense set of tokens (the height represents v_x) and their associated layer normals.


Fig. 9. Rotating disk - translating background: (a) input, (b) dense layers

2) Integrating motion and monocular cues

The example in Fig. 10 illustrates the performance of our approach for boundary inference in a cluttered environment, when texture edges strongly compete with the true object edges. Fig. 10(c) shows the boundary saliency map, with the local curve tangent shown at each tensor. Through voting, the saliency of the spurious texture edges has been diminished by the overall dominance of saliency and orientation of the correct object edges.


Fig. 10. Candy box sequence: (a) an input image, (b) dense layers, (c) boundary saliency map, (d) refined boundaries

3) Handling transparent motion

Since our framework allows for overlapping motion layers, it can successfully handle images containing reflections and transparency. We first estimate the dominant motion a, using the same methodology as described above. A "nulling mechanism" is then used to remove motion a from the sequence, and the remaining motion b is estimated by applying our framework again. Finally, we join the two sets of 4-D tokens (with motions a and b), and we continue with densification and segmentation on the joined set. Note that the entire procedure recovers the motions without separating the patterns. In order to show the accuracy of our results, we compute two "temporal average" images after registering the input frames using the two recovered motions - Fig. 11(b) and Fig. 11(c). In each of these, the registered pattern is sharp, while the other one is blurred due to the image motion.
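The "temporal average" check can be illustrated on synthetic data. The sketch below assumes additive transparency, two patterns that each undergo a constant integer translation per frame, and cyclic image borders (np.roll); these are toy assumptions, and the sketch covers only the registration and averaging, not the motion estimation or the nulling mechanism. All names are hypothetical.

```python
import numpy as np

def temporal_average(frames, motion):
    """Register each frame to frame 0 using the translation (vx, vy) per frame, then average."""
    vx, vy = motion
    registered = [np.roll(f, shift=(-vy * t, -vx * t), axis=(0, 1))
                  for t, f in enumerate(frames)]
    return np.mean(registered, axis=0)

rng = np.random.default_rng(0)
A = rng.random((32, 32))          # pattern moving with motion a
B = rng.random((32, 32))          # pattern moving with motion b
a, b = (1, 0), (-2, 1)            # (vx, vy) in pixels per frame

# Each frame is the sum of the two shifted patterns (additive transparency).
frames = [np.roll(A, (a[1] * t, a[0] * t), axis=(0, 1)) +
          np.roll(B, (b[1] * t, b[0] * t), axis=(0, 1))
          for t in range(5)]

avg_a = temporal_average(frames, a)   # pattern A stays fixed, pattern B is smeared
avg_b = temporal_average(frames, b)   # pattern B stays fixed, pattern A is smeared
```

After registration with motion a, pattern A occupies the same pixels in every frame and survives the averaging unchanged, while pattern B is averaged over several displaced copies, which is the sharp versus blurred behavior described above.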


Fig. 11. Transparent motion: (a) an input image, (b) registered background, (c) registered foreground

Relevant Publications

Mircea Nicolescu, Gerard Medioni, "A Voting-Based Computational Framework for Visual Motion Analysis and Interpretation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pages 739-752, May 2005.
Mircea Nicolescu, Gerard Medioni, "Layered 4D Representation and Voting for Grouping from Motion", IEEE Transactions on Pattern Analysis and Machine Intelligence - Special Issue on Perceptual Organization in Computer Vision, vol. 25, no. 4, pages 492-501, April 2003.
Mircea Nicolescu, Gerard Medioni, "Motion Segmentation with Accurate Boundaries - A Tensor Voting Approach", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pages 382-389, Madison, Wisconsin, June 2003.
Mircea Nicolescu, Gerard Medioni, "4-D Voting for Matching, Densification and Segmentation into Motion Layers", Proceedings of the International Conference on Pattern Recognition, vol. III, pages 303-308, Quebec City, Canada, August 2002. (Best Student Paper Award)
Mircea Nicolescu, Gerard Medioni, "Perceptual Grouping from Motion Cues Using Tensor Voting in 4-D", Proceedings of the European Conference on Computer Vision, vol. III, pages 423-437, Copenhagen, Denmark, May 2002.

Support

This research has been funded in part by the Integrated Media Systems Center (IMSC), a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152, and by National Science Foundation Grant 9811883.

Created by: Mircea NICOLESCU (e-mail: mircea@cse.unr.edu)