Statistical Science

Learning Active Basis Models by EM-Type Algorithms

Zhangzhang Si, Haifeng Gong, Song-Chun Zhu, and Ying Nian Wu



The EM algorithm is a convenient tool for maximum likelihood model fitting when the data are incomplete or when there are latent variables or hidden states. In this review article we explain that the EM algorithm is a natural computational scheme for learning image templates of object categories when the learning is not fully supervised. We represent an image template by an active basis model, which is a linear composition of a selected set of localized, elongated and oriented wavelet elements that are allowed to slightly perturb their locations and orientations to account for the deformations of object shapes. The model can easily be learned when the objects in the training images are in the same pose and appear at the same location and scale; this is often called supervised learning. When the objects may appear at different unknown locations, orientations and scales in the training images, we have to incorporate these unknown locations, orientations and scales as latent variables in the image generation process and learn the template by EM-type algorithms. The E-step imputes the unknown locations, orientations and scales based on the currently learned template. This step can be considered self-supervision: it uses the current template to recognize the objects in the training images. The M-step then re-learns the template based on the imputed locations, orientations and scales, which is essentially the same as supervised learning. The EM learning process thus iterates between recognition and supervised learning. We illustrate this scheme with several experiments.
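The alternation between recognition (E-step) and supervised learning (M-step) described above can be sketched in a few lines of code. This is a deliberately simplified illustration, not the authors' method: it uses raw pixel templates and plain correlation in place of the active basis model's wavelet sparse coding, and it imputes only the unknown location (not orientation or scale). All function and variable names here are our own.

```python
import numpy as np

def learn_template_em(images, tpl_h, tpl_w, n_iters=5):
    """Toy EM-style template learning.

    E-step ("recognition"): slide the current template over each image
    and impute the location with the highest correlation score.
    M-step ("supervised learning"): re-learn the template by averaging
    the patches aligned at the imputed locations.
    """
    H, W = images[0].shape
    # A flat initial template: the first E-step then simply picks the
    # brightest window in each image.
    template = np.ones((tpl_h, tpl_w))

    for _ in range(n_iters):
        patches = []
        for img in images:
            # E-step: exhaustive search over candidate locations.
            best_score, best_patch = -np.inf, None
            for y in range(H - tpl_h + 1):
                for x in range(W - tpl_w + 1):
                    patch = img[y:y + tpl_h, x:x + tpl_w]
                    score = np.sum(patch * template)
                    if score > best_score:
                        best_score, best_patch = score, patch
            patches.append(best_patch)
        # M-step: re-estimate the template from the aligned patches.
        template = np.mean(patches, axis=0)
    return template
```

In the paper the M-step instead re-selects and re-weights wavelet elements by shared matching pursuit, and the E-step also scans over orientations and scales; the control flow, however, has the same recognition/re-learning structure.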

Article information

Statist. Sci., Volume 25, Number 4 (2010), 458-475.

First available in Project Euclid: 14 March 2011

Keywords: generative models; object recognition; wavelet; sparse coding


Si, Zhangzhang; Gong, Haifeng; Zhu, Song-Chun; Wu, Ying Nian. Learning Active Basis Models by EM-Type Algorithms. Statist. Sci. 25 (2010), no. 4, 458--475. doi:10.1214/09-STS281.


