Analyzing bottom-up saliency in natural movies

by Eleonora Vig — last modified 2010-09-03 16:49

Analyzing bottom-up saliency in natural movies (Presented at the European Conference on Visual Perception 2010)

Eleonora Vig, Michael Dorr, and Erhardt Barth

We investigate the contribution of local spatio-temporal variations of image intensity to saliency. To measure different types of variations, we use invariants of the structure tensor. Treating a video as a function of two spatial axes (x,y) and one temporal axis t, the n-dimensional structure tensor (nD-ST) can be evaluated for different combinations of axes (2D- and 3D-ST) and also for the (degenerate) case of a single axis (1D-ST).
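As an illustration only, the following Python sketch shows one way an nD-ST invariant could be computed for a chosen subset of axes. It is not the implementation used for the poster. The function names, the box-filter smoothing, the window radius, and the use of `np.gradient` for the derivatives are all assumptions. Of the possible invariants, only the product of all eigenvalues (the determinant) is computed here.

```python
import numpy as np

def box_smooth(a, radius=1):
    """Average over a (2*radius+1)-wide window along every axis (edge-padded).

    Assumption: a simple box filter stands in for whatever smoothing
    kernel the original analysis used.
    """
    out = a.astype(float)
    width = 2 * radius + 1
    for axis in range(a.ndim):
        n = a.shape[axis]
        pad = [(radius, radius) if i == axis else (0, 0) for i in range(a.ndim)]
        padded = np.pad(out, pad, mode="edge")
        # Sum the shifted copies of the signal along this axis, then normalise.
        out = sum(np.take(padded, np.arange(k, k + n), axis=axis)
                  for k in range(width)) / width
    return out

def structure_tensor_invariant(video, axes):
    """Determinant of the structure tensor restricted to `axes`.

    video: array indexed as (t, y, x)
    axes:  any non-empty selection of 't', 'y', 'x', e.g. "tyx", "yt", "x"
    Returns an array with the same shape as `video`.
    """
    index = {"t": 0, "y": 1, "x": 2}
    grads = [np.gradient(video.astype(float), axis=index[a]) for a in axes]
    n = len(grads)
    J = np.empty(video.shape + (n, n))
    for i in range(n):
        for j in range(n):
            # Locally averaged products of partial derivatives.
            J[..., i, j] = box_smooth(grads[i] * grads[j])
    # The determinant equals the product of all eigenvalues of J.
    return np.linalg.det(J)
```

For a video whose intensity changes only along x, the 1D-ST on x responds everywhere, whereas the 3D-ST determinant vanishes because there is no variation along y or t.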

Eye movements recorded on 18 natural videos are used to label locations as fixated or non-fixated. For each location, we compute the invariants (products of eigenvalues) of the nD-ST and use them to predict eye movements on unseen videos with an SVM classifier. The 3D-ST performs best (average ROC score of 0.656), which means that the most predictive regions of a movie are those where intensity varies along all spatial and temporal directions. Among 2-dimensional variations, the 2D-ST evaluated on the axes (y,t) gave the best score (0.638), followed by (x,y) (0.626) and (x,t) (0.625). The 1D-ST yielded 0.606 along the temporal axis, 0.604 along the horizontal axis, and 0.602 along the vertical axis. We conclude that bottom-up saliency is determined by joint spatio-temporal variations of image intensity rather than by purely spatial or purely temporal variations.
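To make the ROC scores above easier to interpret, here is a minimal sketch of how such a score can be computed from classifier outputs. It is not the evaluation code behind the reported numbers, and the SVM training step is omitted. The function name `roc_auc` and the example scores are hypothetical. The score is the probability that a randomly chosen fixated location receives a higher output than a randomly chosen non-fixated one, so 0.5 corresponds to chance and 1.0 to perfect separation.

```python
def roc_auc(scores_fixated, scores_nonfixated):
    """Area under the ROC curve, computed by pairwise comparison.

    Counts how often a fixated location outscores a non-fixated one;
    ties count as half a win.
    """
    wins = 0.0
    for p in scores_fixated:
        for n in scores_nonfixated:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_fixated) * len(scores_nonfixated))

# Hypothetical classifier outputs for two fixated and two non-fixated locations.
example = roc_auc([0.9, 0.4], [0.5, 0.1])  # 3 of the 4 pairs are ordered correctly
```

The pairwise form is quadratic in the number of locations. It is used here for clarity, and a rank-based computation would be preferable for large data sets.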

Poster in pdf format.
