Most approaches to computer vision can be thought of as lying somewhere on a continuum between generative and discriminative. Although each approach has had its successes, recent advances have favored discriminative methods, most notably the convolutional neural network. Still, there is some doubt about whether this approach will scale to human-level performance, given the number of samples needed to train state-of-the-art systems. Here, we focus on the generative or Bayesian approach, which is more model-based and, in theory, more efficient. Challenges include latent-variable modeling, computationally efficient inference, and data modeling. We restrict ourselves to the problem of data modeling, which is possibly the most daunting, and specifically to the generative modeling of image patches. We formulate a new approach, which can be broadly characterized as an application of “conditional modeling,” designed to sidestep the high dimensionality and complexity of image data. A series of experiments, learning appearance models for faces and parts of faces, illustrates the flexibility and effectiveness of the approach.
Ann. Appl. Stat. 11(3): 1275-1308 (September 2017). DOI: 10.1214/17-AOAS1025