Electronic Journal of Statistics

Fast Bayesian hyperparameter optimization on large datasets

Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter

Full-text: Open access

Abstract

Bayesian optimization has become a successful tool for optimizing the hyperparameters of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets training and validating even a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows preliminary configurations to be explored on small subsets, with their performance extrapolated to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.
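
To make the idea concrete, below is a minimal, self-contained Python sketch of a FABOLAS-style loop. It is not the authors' implementation: it assumes a scikit-learn Gaussian process surrogate over (hyperparameter, dataset fraction), and it substitutes a simpler stand-in acquisition, expected improvement predicted at the full dataset divided by the predicted cost of the candidate evaluation, for the information-gain-per-cost criterion described in the paper. The function train_and_validate, the toy loss and cost surfaces, and the subset grid are hypothetical placeholders.

    # Schematic FABOLAS-style loop (illustrative sketch, not the authors' code):
    # one GP models validation loss over (hyperparameter x, dataset fraction s),
    # a second GP models log training cost; candidates are scored by expected
    # improvement at s = 1 per predicted evaluation cost.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern


    def train_and_validate(x, s):
        """Hypothetical placeholder: train on a fraction s of the data and
        return (validation loss, wall-clock cost)."""
        loss = (x[0] - 0.3) ** 2 + 0.5 / (1.0 + 10.0 * s)  # toy loss, improves with s
        cost = 1.0 + 100.0 * s                              # toy cost, grows with s
        return loss, cost


    rng = np.random.default_rng(0)
    subsets = np.array([1 / 64, 1 / 16, 1 / 4, 1.0])        # dataset fractions s

    X, losses, log_costs = [], [], []
    # Initial design: random configurations evaluated on small subsets only.
    for _ in range(5):
        x = rng.uniform(0.0, 1.0, size=1)
        s = rng.choice(subsets[:2])
        loss, cost = train_and_validate(x, s)
        X.append([x[0], s])
        losses.append(loss)
        log_costs.append(np.log(cost))

    for _ in range(20):
        # Fit one GP for the loss and one for the log training cost over (x, s).
        loss_gp = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(X, losses)
        cost_gp = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(X, log_costs)

        # Candidates over (hyperparameter, subset fraction); score them by the
        # improvement they promise on the full dataset (s = 1) per unit of cost
        # paid to evaluate them on their own, possibly small, subset.
        cand = np.array([[x, s] for x in np.linspace(0.0, 1.0, 50) for s in subsets])
        at_full = cand.copy()
        at_full[:, 1] = 1.0
        mu, sigma = loss_gp.predict(at_full, return_std=True)
        incumbent = np.min(loss_gp.predict(np.array(X) * [1.0, 0.0] + [0.0, 1.0]))
        gamma = (incumbent - mu) / np.maximum(sigma, 1e-9)
        ei_at_full = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
        predicted_cost = np.exp(cost_gp.predict(cand))
        x_next, s_next = cand[np.argmax(ei_at_full / predicted_cost)]

        loss, cost = train_and_validate(np.array([x_next]), s_next)
        X.append([x_next, s_next])
        losses.append(loss)
        log_costs.append(np.log(cost))

    print("best observed (hyperparameter, subset fraction):", X[int(np.argmin(losses))])

The design choice mirrored here is that cheap evaluations on small subsets inform the model's predictions at s = 1, so most of the budget is spent on small fractions of the data and full training runs are requested only when the model expects them to pay off.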

Article information

Source
Electron. J. Statist. Volume 11, Number 2 (2017), 4945-4968.

Dates
Received: June 2017
First available in Project Euclid: 15 December 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1513306864

Digital Object Identifier
doi:10.1214/17-EJS1335SI

Rights
Creative Commons Attribution 4.0 International License.

Citation

Klein, Aaron; Falkner, Stefan; Bartels, Simon; Hennig, Philipp; Hutter, Frank. Fast Bayesian hyperparameter optimization on large datasets. Electron. J. Statist. 11 (2017), no. 2, 4945–4968. doi:10.1214/17-EJS1335SI. https://projecteuclid.org/euclid.ejs/1513306864



References

  • A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets, 2017a.
  • G. Montavon, G. Orr, and K.-R. Müller, editors. Neural Networks: Tricks of the Trade - Second Edition. LNCS. Springer, 2012.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NIPS’11), pages 2546–2554, 2011.
  • F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In C. Coello, editor, Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION’11), volume 6683 of Lecture Notes in Computer Science, pages 507–523. Springer-Verlag, 2011.
  • J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
  • J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS’12), pages 2960–2968, 2012.
  • R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML’13), pages 199–207. Omnipress, 2013.
  • J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML’13), pages 115–123. Omnipress, 2013.
  • K. Swersky, J. Snoek, and R. Adams. Multi-task Bayesian optimization. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Proceedings of the 27th International Conference on Advances in Neural Information Processing Systems (NIPS’13), pages 2004–2012, 2013.
  • K. Swersky, J. Snoek, and R. Adams. Freeze-thaw Bayesian optimization. arXiv:1406.3896, 2014.
  • J. Snoek, K. Swersky, R. Zemel, and R. Adams. Input warping for Bayesian optimization of non-stationary functions. In E. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning (ICML’14), pages 1674–1682. Omnipress, 2014.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML’15), volume 37, pages 2171–2180. Omnipress, 2015.
  • A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Q. Yang and M. Wooldridge, editors, Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI’15), pages 3460–3468, 2015.
  • L. Bottou. Stochastic gradient tricks. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer, 2012.
  • K. Kandasamy, G. Dasarathy, J. Oliva, J. Schneider, and B. Póczos. Gaussian process optimisation with multi-fidelity evaluations. In Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NIPS’16), 2016.
  • M. U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016.
  • T. Nickson, M. A. Osborne, S. Reece, and S. Roberts. Automated machine learning on big data using stochastic algorithm tuning. CoRR, 2014.
  • T. Krueger, D. Panknin, and M. Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research, 2015.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of the International Conference on Learning Representations (ICLR’17), 2017. Published online: iclr.cc.
  • P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
  • E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599, 2010.
  • B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • J. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NIPS’16), 2016.
  • C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
  • B. Matérn. Spatial variation. Meddelanden från Statens Skogsforskningsinstitut, 1960.
  • D. J. C. MacKay and R. M. Neal. Automatic relevance determination for neural networks. Technical report, University of Cambridge, 1994.
  • J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML’10), pages 1015–1022. Omnipress, 2010.
  • J. Hernández-Lobato, M. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Proceedings of the 28th International Conference on Advances in Neural Information Processing Systems (NIPS’14), 2014.
  • T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI’01). Morgan Kaufmann Publishers Inc., 2001.
  • J. Cunningham, P. Hennig, and S. Lacoste-Julien. Approximate Gaussian integration using expectation propagation, pages 1–11, January 2012.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001. URL http://www.iro.umontreal.ca/~lisa/pointeurs/lecun-01a.pdf.
  • A. Klein, S. Bartels, S. Falkner, P. Hennig, and F. Hutter. Towards efficient Bayesian optimization for big data. In NIPS 2015 Bayesian Optimization Workshop, December 2015.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Efficient hyperparameter optimization and infinitely many armed bandits, 2016.
  • K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
  • B. Williams, T. Santner, and W. Notz. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 2000.
  • D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman. emcee: The MCMC Hammer. PASP, 2013.
  • D. R. Jones. A taxonomy of global optimization methods based on response surfaces., Journal of Global Optimization, 21:345–383, 2001.
  • P. Sollich. Gaussian process regression with mismatched models. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
  • K. Eggensperger, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In B. Bonet and S. Koenig, editors, Proceedings of the Twenty-Ninth National Conference on Artificial Intelligence (AAAI’15), pages 1114–1120. AAAI Press, 2015.
  • J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60, June 2014.
  • J. P. Siebert. Vehicle Recognition Using Rule Based Methods. Turing Institute, 1987.
  • J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric., 1999.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML’15), volume 37. Omnipress, 2015.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, 2015.
  • R. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1996.
  • J. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML’15), volume 37. Omnipress, 2015.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML’15), volume 37, pages 1613–1622. Omnipress, 2015.
  • A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curve prediction with Bayesian neural networks. In Proceedings of the International Conference on Learning Representations (ICLR’17), 2017b. Published online: iclr.cc.