The Annals of Applied Statistics

Maximum likelihood features for generative image models

Lo-Bin Chang, Eran Borenstein, Wei Zhang, and Stuart Geman

Most approaches to computer vision can be thought of as lying somewhere on a continuum between generative and discriminative. Although each approach has had its successes, recent advances have favored discriminative methods, most notably the convolutional neural network. Still, there is some doubt about whether this approach will scale to human-level performance, given the number of samples needed to train state-of-the-art systems. Here, we focus on the generative or Bayesian approach, which is more model-based and, in theory, more efficient. Challenges include latent-variable modeling, computationally efficient inference, and data modeling. We restrict ourselves to the problem of data modeling, which is possibly the most daunting, and specifically to the generative modeling of image patches. We formulate a new approach, which can be broadly characterized as an application of “conditional modeling,” designed to sidestep the high dimensionality and complexity of image data. A series of experiments, learning appearance models for faces and parts of faces, illustrates the flexibility and effectiveness of the approach.

Article information

Ann. Appl. Stat. Volume 11, Number 3 (2017), 1275-1308.

Received: July 2016
Revised: January 2017
First available in Project Euclid: 5 October 2017

Keywords: Computer vision; image models; appearance models; generative models; conditional modeling; sufficiency; features


Chang, Lo-Bin; Borenstein, Eran; Zhang, Wei; Geman, Stuart. Maximum likelihood features for generative image models. Ann. Appl. Stat. 11 (2017), no. 3, 1275–1308. doi:10.1214/17-AOAS1025.

