Maximum likelihood features for generative image models

Lo-Bin Chang; Eran Borenstein; Wei Zhang; Stuart Geman

doi:10.1214/17-AOAS1025

Ann. Appl. Stat. 11(3): 1275-1308 (September 2017). DOI: 10.1214/17-AOAS1025

Abstract

Most approaches to computer vision can be thought of as lying somewhere on a continuum between generative and discriminative. Although each approach has had its successes, recent advances have favored discriminative methods, most notably the convolutional neural network. Still, there is some doubt about whether this approach will scale to human-level performance, given the number of samples needed to train state-of-the-art systems. Here, we focus on the generative or Bayesian approach, which is more model based and, in theory, more efficient. Challenges include latent-variable modeling, computationally efficient inference, and data modeling. We restrict ourselves to the problem of data modeling, which is possibly the most daunting, and specifically to the generative modeling of image patches. We formulate a new approach, which can be broadly characterized as an application of “conditional modeling,” designed to sidestep the high dimensionality and complexity of image data. A series of experiments, learning appearance models for faces and parts of faces, illustrates the flexibility and effectiveness of the approach.

Citation

Lo-Bin Chang, Eran Borenstein, Wei Zhang, and Stuart Geman "Maximum likelihood features for generative image models," The Annals of Applied Statistics 11(3), 1275-1308, (September 2017). https://doi.org/10.1214/17-AOAS1025

Received: 1 July 2016; Published: September 2017

JOURNAL ARTICLE
34 PAGES
