Institute of Mathematical Statistics Collections

Characteristics of hand and machine-assigned scores to college students’ answers to open-ended tasks

Stephen P. Klein

Source: Deborah Nolan and Terry Speed, eds., Probability and Statistics: Essays in Honor of David A. Freedman (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2008), 76-89.

Abstract

Assessment of learning in higher education is a critical concern to policy makers, educators, parents, and students. Assessing learning appropriately is likely to require including constructed-response tests in the assessment system. We examined whether scoring costs and other concerns with using open-ended measures on a large scale (e.g., turnaround time and inter-reader consistency) could be addressed by machine grading the answers. Analyses with 1359 students from 14 colleges found that two human readers agreed highly with each other in the scores they assigned to the answers to three types of open-ended questions. These reader-assigned scores also agreed highly with those assigned by a computer. The correlations of the machine-assigned scores with SAT scores, college grades, and other measures were comparable to the correlations of these variables with the hand-assigned scores. Machine scoring did not widen differences in mean scores between racial/ethnic or gender groups. Our findings demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.
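
The agreement comparisons summarized in the abstract can be illustrated with a minimal sketch. It assumes that agreement is measured with a Pearson correlation and that the hand score for an answer is the mean of the two readers' scores; the abstract states neither choice. The function `pearson` and all score values are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six answers (illustrative values only).
reader_1 = [3.0, 4.0, 2.0, 5.0, 4.0, 1.0]
reader_2 = [3.0, 5.0, 2.0, 4.0, 4.0, 2.0]
machine = [3.2, 4.4, 2.1, 4.6, 3.9, 1.7]

# Assumption: the hand score is the mean of the two readers' scores.
hand = [(a + b) / 2 for a, b in zip(reader_1, reader_2)]

inter_reader_r = pearson(reader_1, reader_2)  # reader vs. reader
hand_machine_r = pearson(hand, machine)       # hand vs. machine

print(f"reader 1 vs reader 2: r = {inter_reader_r:.2f}")
print(f"hand vs machine:      r = {hand_machine_r:.2f}")
```

The same function could be applied to an external measure, such as a list of SAT scores, once with the hand scores and once with the machine scores, to compare the two correlations in the way the abstract describes.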

Primary Subjects: 62P99
Keywords: constructed response; hand scoring; machine scoring essay answers; open-ended tasks; reasoning tasks

Full-text: Open access

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.imsc/1207580079
Digital Object Identifier: doi:10.1214/193940307000000392

2010 © Institute of Mathematical Statistics