The Annals of Applied Statistics

Clustering correlated, sparse data streams to estimate a localized housing price index

You Ren, Emily B. Fox, and Andrew Bruce


Understanding how housing values evolve over time is important to policy makers, consumers and real estate professionals. Existing housing indices are computed at a coarse spatial granularity, such as metropolitan regions, which can mask or distort price dynamics apparent in local markets, such as neighborhoods and census tracts. A challenge in moving to estimates at, for example, the census tract level is the scarcity of spatiotemporally localized house sales observations. Our work aims to address this challenge by leveraging observations from multiple census tracts discovered to have correlated valuation dynamics. Our proposed Bayesian nonparametric approach builds on the framework of latent factor models to enable a flexible, data-driven method for inferring the clustering of correlated census tracts. We explore methods for scalability and parallelizability of computations, yielding a housing valuation index at the level of census tract rather than zip code, and on a monthly basis rather than quarterly. We demonstrate our approach on a large Seattle metropolitan housing dataset.
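To make the idea of a data-driven clustering of census tracts concrete, the following is a minimal sketch of drawing a random partition from a Chinese restaurant process, a standard Bayesian nonparametric prior over clusterings. It assumes a Dirichlet-process-style prior, which the abstract does not spell out; the function name `sample_crp_partition`, the concentration parameter `alpha`, and the tract count are hypothetical and chosen only for illustration. It shows only the prior, not the paper's latent factor model or its posterior inference.

```python
import random


def sample_crp_partition(n_tracts, alpha, seed=0):
    """Draw cluster labels for n_tracts items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values favor more clusters.
    This is an illustrative prior draw, not the paper's sampler.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of tracts currently assigned to cluster k
    for _ in range(n_tracts):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


# Hypothetical usage: 100 tracts, concentration 2.0.
labels = sample_crp_partition(n_tracts=100, alpha=2.0)
print(len(set(labels)), "clusters for", len(labels), "tracts")
```

The number of clusters is not fixed in advance; it grows with the number of tracts at a rate governed by `alpha`, which is what lets the data determine how many groups of correlated tracts to use.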

Article information

Ann. Appl. Stat., Volume 11, Number 2 (2017), 808-839.

Received: April 2015
Revised: December 2016
First available in Project Euclid: 20 July 2017

Keywords: Bayesian nonparametrics; clustering; housing price index; multiple time series; state space models


Ren, You; Fox, Emily B.; Bruce, Andrew. Clustering correlated, sparse data streams to estimate a localized housing price index. Ann. Appl. Stat. 11 (2017), no. 2, 808-839. doi:10.1214/17-AOAS1019.


Supplemental materials

  • Supplement to “Clustering correlated, sparse data streams to estimate a localized housing price index”. We detail aspects of our MCMC sampler, including: (i) the required likelihood calculation via Kalman filtering variants and (ii) a parallel implementation of sampling the cluster memberships. We also include further synthetic data experiments and results from our Seattle City analysis, and specify the various settings used in our experiments. Finally, we provide additional details on our model selection, specification, and computations for the joint global trend analysis. A link to our code base and related housing data sources is included.
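As a rough illustration of a likelihood calculation via Kalman filtering, the sketch below computes the Gaussian log-likelihood of a single series under a scalar local-level model, skipping months with no observation. The model, the function name `local_level_loglik`, and the parameters `q`, `r`, `m0` and `p0` are assumptions made for this example; the supplement's actual state space model and filtering variants are not described here.

```python
import math


def local_level_loglik(ys, q, r, m0=0.0, p0=1e6):
    """Log-likelihood of ys under x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    q = Var(w_t), r = Var(v_t); (m0, p0) are the prior mean and variance of
    the state before the first time step. Entries of ys equal to None are
    treated as missing (no sales observed in that period).
    """
    m, p = m0, p0
    loglik = 0.0
    for y in ys:
        # Predict: the random-walk state keeps its mean, variance grows by q.
        p = p + q
        if y is None:
            # No observation: carry the prediction forward unchanged.
            continue
        # Innovation and its variance.
        v = y - m
        s = p + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + v * v / s)
        # Update the state estimate with the Kalman gain.
        k = p / s
        m = m + k * v
        p = (1.0 - k) * p
    return loglik


# Hypothetical sparse monthly series with two missing months.
series = [0.10, None, 0.14, 0.13, None, 0.20]
print(local_level_loglik(series, q=0.01, r=0.05))
```

Handling a missing month by running only the predict step is what makes this kind of filter convenient for sparse data streams: uncertainty about the latent value simply accumulates until the next observation arrives.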