Source: Ann. Appl. Stat.
Volume 6, Number 4
Deeds, or charters, dealing with property rights provide a
continuous documentary record that historians can use to
study social, economic and political change.
This study is concerned with charters (written in Latin) dating
from the tenth through early fourteenth centuries in England. Of
these, at least one million were left undated, largely due to
administrative changes introduced by William the Conqueror in
1066. Correctly dating such charters is of vital importance in
the study of English medieval history. This paper is concerned
with computer-automated statistical methods for dating such
document collections, with the goal of reducing the considerable
effort required to date them manually and of improving the
accuracy of assigned dates. The proposed methods are based on data
such as the variation over time of word and phrase usage, and on
measures of distance between documents. The extensive (and
dated) Documents of Early England Data Set (DEEDS) maintained at
the University of Toronto was used for this purpose.
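
As a rough illustration of how a distance between documents could drive a date estimate, the Python sketch below weights the known dates of a few dated texts by their closeness to an undated one. It is a minimal sketch under stated assumptions, not the method of the paper: the two-word shingle distance, the Gaussian-style weight, the bandwidth value, the function names and the toy charters with their years are all invented here for illustration.

```python
import math


def shingles(text, k=2):
    """Return the set of k-word shingles (runs of k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a, b, k=2):
    """One minus the shared fraction of k-word shingles of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    if not union:
        return 0.0  # both texts too short to form any shingle
    return 1.0 - len(sa & sb) / len(union)


def estimate_date(undated, dated_docs, bandwidth=0.5, k=2):
    """Kernel-weighted average of known years; closer documents weigh more.

    dated_docs is a list of (text, year) pairs. The bandwidth is an
    assumed tuning constant, not a value taken from the paper.
    """
    if not dated_docs:
        raise ValueError("need at least one dated document")
    weighted = []
    for text, year in dated_docs:
        d = jaccard_distance(undated, text, k)
        weighted.append((math.exp(-(d / bandwidth) ** 2), year))
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total


# Toy example: the texts and years below are invented for illustration only.
dated = [
    ("sciant presentes et futuri quod ego dedi", 1160),
    ("sciant presentes et futuri quod ego", 1150),
    ("noverint universi quod ego concessi", 1250),
]
estimate = estimate_date("sciant presentes et futuri quod ego concessi", dated)
print(round(estimate))
```

The estimate always lies between the earliest and latest known year, and it is pulled toward the dated texts that share the most phrasing with the undated one. A smaller bandwidth makes the estimate depend more heavily on the nearest documents.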