Video Face Recognition

E.G. Ortiz, A. Wright, and M. Shah. "Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-based Classification". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Motivation

Interest in a movie is largely driven by its cast, making annotation of all occurrences of cast members within a movie essential. This work addresses the difficult problem of identifying a video face track using a dictionary of still face images of many people, while rejecting unknown individuals. We employ a large database of still images from the Internet to perform end-to-end video face recognition, from face tracking to face track identification.

Face Tracking

Our method performs the difficult task of face tracking based on face detections extracted with the high-performance SHORE face detector. We generate tracks using two metrics: one spatial and one appearance-based. The spatial metric computes the percent overlap of the current bounding box with the previous one. The appearance metric computes a histogram intersection over the local bounding box, which can handle abrupt changes in the scene and the face. We compare each new face detection to the existing tracks; if both the location and appearance metrics indicate a match, the face is added to the track, otherwise a new track is created. Finally, we use a global histogram over the entire frame, which encodes scene information, to detect scene boundaries, and we impose a lifespan of 20 frames without a detection to terminate tracks.
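The association step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values, the box format `(x, y, w, h)`, and the helper names are all assumptions.

```python
import numpy as np

def bbox_overlap(a, b):
    """Percent overlap of box b with box a; boxes are (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(a[2] * a[3])

def hist_intersection(h1, h2):
    """Histogram intersection of two normalized appearance histograms."""
    return np.minimum(h1, h2).sum()

def assign_detection(det_box, det_hist, tracks,
                     overlap_thresh=0.3, appear_thresh=0.5):
    """Append a detection to a matching track, or start a new track.

    A track matches when both the spatial metric (bounding-box overlap
    with the track's last box) and the appearance metric (histogram
    intersection with the track's last histogram) exceed their
    thresholds. Threshold values here are illustrative assumptions.
    """
    for track in tracks:
        if (bbox_overlap(det_box, track['boxes'][-1]) > overlap_thresh and
                hist_intersection(det_hist, track['hists'][-1]) > appear_thresh):
            track['boxes'].append(det_box)
            track['hists'].append(det_hist)
            return track
    track = {'boxes': [det_box], 'hists': [det_hist]}
    tracks.append(track)
    return track

# A detection overlapping an existing track extends it; a distant one
# starts a new track.
tracks = []
h = np.ones(8) / 8
assign_detection((10, 10, 20, 20), h, tracks)
assign_detection((12, 11, 20, 20), h, tracks)    # near previous box
assign_detection((200, 200, 20, 20), h, tracks)  # far away
```

Scene-boundary detection via a global frame histogram and the 20-frame lifespan would sit around this loop, resetting or closing tracks.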

Mean Sequence Sparse Representation-based Classification

In recent years, Sparse Representation-based Classification (SRC) has received much attention due to its high precision and its ability to handle occlusion. More recently, we found that, combined with several features, SRC works well for real-world face recognition and excels at rejecting unknown identities (see Face Recognition for Web-Scale Datasets). Given a face track \( Y = [y_1, y_2, \ldots, y_M] \) with \(M\) frames, we make the strong assumption that all frames share a single coefficient vector \(x\): since every frame belongs to the same person, they should intuitively be linearly represented by the same people in the dictionary. This assumption leads to the following formulation:

\( \hat{x}_{\ell_1} = \min_x \sum_{i=1}^M \| y_i - Ax \|_2^2 + \lambda \| x \|_1 \),

in which we minimize the summed residual error between every frame \(y_i\) and the linear combination \(Ax\) while encouraging sparsity of \(x\). Expanding the least-squares residual gives \( \sum_{i=1}^M \| y_i - Ax \|_2^2 = M \| \bar{y} - Ax \|_2^2 + \sum_{i=1}^M \| y_i \|_2^2 - M \| \bar{y} \|_2^2 \), where the last two terms do not depend on \(x\). The problem therefore reduces (absorbing the factor \(M\) into \(\lambda\)) to one on the mean face track vector:

\(\hat{x}_{\ell_1} = \min_x \| \bar{y} - Ax \|_2^2 + \lambda \| x \|_1\),

where \( \bar{y} = \frac{1}{M} \sum_{i=1}^M y_i \). This formulation yields at least a 5x speedup over a naive frame-by-frame application of SRC, depending on the average length of the input face tracks.
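The reduction can be checked numerically. The sketch below uses a random stand-in dictionary \(A\) and track \(Y\) (assumptions for illustration) and verifies that the frame-summed squared residual equals \(M\) times the mean-based residual plus a constant independent of \(x\), so both objectives share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, n = 5, 16, 8                 # frames, feature dim, dictionary atoms
A = rng.standard_normal((d, n))    # stand-in dictionary of still images
Y = rng.standard_normal((d, M))    # columns y_1..y_M of one face track
y_bar = Y.mean(axis=1)             # mean face track vector

x = rng.standard_normal(n)         # an arbitrary candidate coefficient vector

# Frame-by-frame objective (data term only).
sum_obj = sum(np.linalg.norm(Y[:, i] - A @ x) ** 2 for i in range(M))

# Mean-based objective, plus the x-independent constant from the expansion.
mean_obj = M * np.linalg.norm(y_bar - A @ x) ** 2
const = (Y ** 2).sum() - M * (y_bar ** 2).sum()

assert np.isclose(sum_obj, mean_obj + const)
```

Since `const` does not depend on `x`, minimizing the mean-based objective recovers the same coefficient vector while solving a single sparse regression instead of \(M\) of them, which is the source of the reported speedup.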

Movie Trailer Face Dataset

We built our Movie Trailer Face Dataset using 113 movie trailers from YouTube, all from the 2010 release year, that contained celebrities present in our supplemented PubFig+10 dataset. These videos were then processed to generate face tracks using the method described above. The resulting dataset contains 3,585 face tracks, 63% of which belong to unknown identities (not present in PubFig+10) and 37% to known ones.

Video Face Recognition Toolbox

To allow benchmarking of future methods on our data or other custom data, we provide a Video Face Recognition Toolbox. It contains implementations of the tested algorithms (NN, SVM, L2, SRC, and MSSRC). There are two principal scripts:

  • trailerExperiments: Entry script for the execution of methods.
  • trailerResults: Consolidates all results and outputs PR curves.

Video Presentation

http://www.youtube.com/watch?v=2GqUu6EViVE

Description