The Annals of Statistics

Rates of convergence in active learning

Steve Hanneke


Abstract

We study the rates of convergence in generalization error achievable by active learning under various types of label noise. Additionally, we study the general problem of model selection for active learning with a nested hierarchy of hypothesis classes and propose an algorithm whose error rate provably converges to the best achievable error among classifiers in the hierarchy at a rate adaptive to both the complexity of the optimal classifier and the noise conditions. In particular, we state sufficient conditions for these rates to be dramatically faster than those achievable by passive learning.
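
As a rough illustration of the selective-sampling idea underlying this line of work, the sketch below implements disagreement-based querying for one-dimensional threshold classifiers in the noise-free case, in the spirit of Cohn, Atlas and Ladner [13]. It is not the article's Algorithm 1, and every name and constant in it is illustrative only; it is included to show why an active learner can often make far fewer label requests (roughly logarithmic in the stream length here) than a passive learner.

    # Disagreement-based (selective sampling) active learning for 1-D
    # threshold classifiers, noise-free case.  Illustrative sketch only,
    # in the spirit of Cohn, Atlas and Ladner [13]; NOT the article's
    # Algorithm 1.  All names and constants below are invented.
    import random

    def make_thresholds(k):
        """Candidate classifiers h_t(x) = 1{x >= t} on a grid of k + 1 thresholds."""
        return [i / k for i in range(k + 1)]

    def active_learn(stream, true_threshold, candidates):
        """Request a label only when the surviving candidates disagree on x."""
        version_space = list(candidates)
        queries = 0
        for x in stream:
            predictions = {int(x >= t) for t in version_space}
            if len(predictions) > 1:              # disagreement region: query the label
                y = int(x >= true_threshold)      # noise-free labeling oracle
                queries += 1
                version_space = [t for t in version_space if int(x >= t) == y]
        return version_space, queries

    if __name__ == "__main__":
        random.seed(0)
        unlabeled_stream = [random.random() for _ in range(10_000)]
        survivors, queries = active_learn(unlabeled_stream,
                                          true_threshold=0.37,
                                          candidates=make_thresholds(1_000))
        # A passive learner would label all 10,000 points; here only a handful
        # of labels (on the order of log n) are requested.
        print(f"labels requested: {queries}, surviving thresholds: {len(survivors)}")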

Article information

Source
Ann. Statist. Volume 39, Number 1 (2011), 333–361.

Dates
First available in Project Euclid: 3 December 2010

Permanent link to this document
http://projecteuclid.org/euclid.aos/1291388378

Digital Object Identifier
doi:10.1214/10-AOS843

Mathematical Reviews number (MathSciNet)
MR2797849

Zentralblatt MATH identifier
1274.62510

Subjects
Primary:
  • 62L05: Sequential design
  • 68Q32: Computational learning theory [See also 68T05]
  • 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
  • 68T05: Learning and adaptive systems [See also 68Q32, 91E40]
Secondary:
  • 68T10: Pattern recognition, speech recognition {For cluster analysis, see 62H30}
  • 68Q10: Modes of computation (nondeterministic, parallel, interactive, probabilistic, etc.) [See also 68Q85]
  • 68Q25: Analysis of algorithms and problem complexity [See also 68W40]
  • 68W40: Analysis of algorithms [See also 68Q25]
  • 62G99: None of the above, but in this section

Keywords
Active learning; sequential design; selective sampling; statistical learning theory; oracle inequalities; model selection; classification

Citation

Hanneke, Steve. Rates of convergence in active learning. Ann. Statist. 39 (2011), no. 1, 333–361. doi:10.1214/10-AOS843. http://projecteuclid.org/euclid.aos/1291388378.



References

  • [1] Alexander, K. S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12 1041–1067.
  • [2] Alexander, K. S. (1985). Rates of growth for weighted empirical processes. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer II 475–493. Wadsworth, Belmont, CA.
  • [3] Alexander, K. S. (1986). Sample moduli for set-indexed Gaussian processes. Ann. Probab. 14 598–611.
  • [4] Alexander, K. S. (1987). Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probab. Theory Related Fields 75 379–423.
  • [5] Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, Cambridge.
  • [6] Balcan, M.-F., Beygelzimer, A. and Langford, J. (2006). Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM, New York.
  • [7] Balcan, M.-F., Broder, A. and Zhang, T. (2007). Margin based active learning. In Proceedings of the 20th Conference on Learning Theory. Lecture Notes in Computer Science 4539 35–50. Springer, Berlin.
  • [8] Balcan, M.-F., Hanneke, S. and Wortman, J. (2008). The true sample complexity of active learning. In Proceedings of the 21st Conference on Learning Theory. Omnipress, Madison, WI.
  • [9] Balcan, M.-F., Beygelzimer, A. and Langford, J. (2009). Agnostic active learning. J. Comput. System Sci. 75 78–89.
  • [10] Beygelzimer, A., Dasgupta, S. and Langford, J. (2009). Importance weighted active learning. In International Conference on Machine Learning. ACM, New York.
  • [11] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36 929–965.
  • [12] Castro, R. and Nowak, R. (2008). Minimax bounds for active learning. IEEE Trans. Inform. Theory 54 2339–2353.
  • [13] Cohn, D., Atlas, L. and Ladner, R. (1994). Improving generalization with active learning. Machine Learning 15 201–221.
  • [14] Dasgupta, S., Hsu, D. and Monteleoni, C. (2007). A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
  • [15] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
  • [16] Friedman, E. (2009). Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory. Montreal, Quebec.
  • [17] Giné, E. and Koltchinskii, V. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34 1143–1216.
  • [18] Giné, E., Koltchinskii, V. and Wellner, J. (2003). Ratio limit theorems for empirical processes. In Stochastic Inequalities (E. Giné, C. Houdré and D. Nualart, eds.) 249–278. Birkhäuser, Basel.
  • [19] Hanneke, S. (2007). A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning (Z. Ghahramani, ed.) 353–360. ACM, New York.
  • [20] Hanneke, S. (2010). Proofs and supplements to “Rates of convergence in active learning.” DOI: 10.1214/10-AOS843SUPP.
  • [21] Kääriäinen, M. (2006). Active learning in the non-realizable case. In Proceedings of the 17th International Conference on Algorithmic Learning Theory. Springer, Berlin.
  • [22] Kearns, M. J., Schapire, R. E. and Sellie, L. M. (1994). Toward efficient agnostic learning. Mach. Learn. 17 115–141.
  • [23] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [24] Koltchinskii, V. (2008). Oracle inequalities in empirical risk minimization and sparse recovery problems: Lecture notes. Technical report, École d'Été de Probabilités de Saint-Flour.
  • [25] Li, Y. and Long, P. M. (2007). Learnability and the doubling dimension. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
  • [26] Mammen, E. and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
  • [27] Massart, P. and Nédélec, E. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
  • [28] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • [29] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • [30] Vapnik, V. (1982). Estimation of Dependencies Based on Empirical Data. Springer, New York. Translated from the Russian by Samuel Kotz.
  • [31] Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
  • [32] Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280.
  • [33] Wang, L. (2009). Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA.

Supplemental materials

  • Supplementary material: Proofs and Supplements for “Rates of Convergence in Active Learning”. The supplementary material contains three additional Appendices, namely, Appendices B, C and D. Specifically, Appendix B provides detailed proofs of Theorems 5–9, as well as several abstract lemmas from which these results are derived. Appendix C discusses the use of estimators in Algorithm 1. Finally, Appendix D includes a proof of a general minimax lower bound ∝ n^(−κ ∕ (2κ − 2)) for any nontrivial hypothesis class, generalizing a result of Castro and Nowak [12].
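
As background on the exponent κ in the lower bound above: a common way this parameter enters the literature is through a Mammen–Tsybakov-type noise condition [26, 28] on the regression function η(x) = P(Y = 1 | X = x); a sketch of one standard formulation (not quoted from the article) is

  P(0 < |η(X) − 1∕2| ≤ t) ≤ c · t^(1 ∕ (κ − 1))   for all t > 0,

for some constant c > 0 and κ > 1. Under such a condition, the minimax rate of excess risk for passive learning over a fixed VC class is typically of order n^(−κ ∕ (2κ − 1)) up to logarithmic factors [27, 28]; the active learning lower bound ∝ n^(−κ ∕ (2κ − 2)) is therefore still strictly faster than the passive rate for every κ > 1.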