Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88 669-679.
Albert, J. H. and Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika 82 747-759.
Besag, J. E. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. Ser. B 48 259-302.
Bradlow, E. T. (1994). Analysis of ordinal survey data with "no answer" responses. Ph.D. dissertation, Dept. Statistics, Harvard Univ.
Bradlow, E. T. and Zaslavsky, A. M. (1997). A hierarchical latent variable model for ordinal data with "no answer" responses. Preprint.
Caulkin, J., Larkey, P. and Wei, J. (1996). Adjusting GPA to reflect course difficulty. Working paper, Heinz School of Public Policy and Management, Carnegie Mellon Univ.
Cowles, M. K., Carlin, B. P. and Connett, J. E. (1996). Bayesian Tobit modeling of longitudinal ordinal clinical trial compliance data with nonignorable missingness. J. Amer. Statist. Assoc. 91 86-98.
Elliott, R. and Strenta, A. (1988). Effects of improving the reliability of the GPA on prediction generally and on comparative predictions for gender and race particularly. Journal of Educational Measurement 25 333-347.
Fischer, G. H. and Molenaar, I. W., eds. (1995). Rasch Models: Foundations, Recent Developments, and Applications. Springer, New York.
Gelfand, A. E. (1996). Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice (W. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 145-161. Chapman and Hall, London.
Gilks, W., Richardson, S. and Spiegelhalter, D. J. (1996). Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice (W. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) Chapman and Hall, London.
Goldman, R., Schmidt, D., Hewitt, B. and Fisher, R. (1974). Grading practices in different major fields. American Educational Research Journal 11 343-357.
Goldman, R. and Widawski, M. (1976). A within-subjects technique for comparing college grading standards: implications in the validity of the evaluation of college achievement. Educational and Psychological Measurement 36 381-390.
Johnson, V. E. (1996). On Bayesian analysis of multirater ordinal data. J. Amer. Statist. Assoc. 91 42-51.
Larkey, P. (Jan. 25, 1991). A better way to find the top scorer. Golf World 72-74.
Larkey, P. and Caulkin, J. (1992). Incentives to fail. Working Paper 92-51, Heinz School of Public Policy and Management, Carnegie Mellon Univ.
Linn, R. (1966). Grade adjustments for prediction of academic performance: a review. Journal of Educational Measurement 3 313-329.
Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, MA.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psychological Measurement 14 59-71.
Nandram, B. and Chen, M.-H. (1996). Accelerating Gibbs sampler convergence in the generalized linear models via a reparameterization. J. Statist. Comput. Simulation 45. To appear.
Samejima, F. (1969). Estimation of latent ability using a pattern of graded scores. Psychometrika, Monograph Supplement No. 17.
Strenta, A. and Elliott, R. (1987). Differential grading standards revisited. Journal of Educational Measurement 24 281-291.
Tanner, M. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82 528-549.
van der Linden, W. J. and Hambleton, R. K., eds. (1997). Handbook of Modern Item Response Theory. Springer, New York.
Wang, X., Wainer, H. and Thissen, D. (1995). On the viability of some untestable assumptions in equating exams that allow examinee choice. Applied Measurement in Education 8 211-225.
Young, J. W. (1989). Developing a universal scale for grades: investigating predictive validity in college admissions. Ph.D. dissertation, School of Education, Stanford Univ.
Young, J. W. (1990). Adjusting the cumulative GPA using item response theory. Journal of Educational Measurement 12 175-186.
Young, J. W. (1992). A general linear model approach to adjusting cumulative GPA. Journal of Research in Education 2 31-37.
Young, J. W. (1993). Grade adjustment methods. Review of Educational Research 63 151-165.
Council (Gose, 1997). The initiative is apparently dead at Duke for the moment.
(Gose, 1997). The second obstacle is communicating the diagnosis and the designed solution to audiences including members who have difficulty understanding analytic arguments and whose interests may be to not understand them. The source of opposition at Duke is not surprising. The few studies that have been done all indicate that there has been relatively more grade inflation in "softer" subjects. While there has been inflation in the physical sciences and mathematics, there is apparently more resistance to inflation in domains with more sharply and logically defined right and wrong answers. Grade inflation has been an important edge for some fields in competing for students as core curricula have waned and student choices have waxed. It is a perverse form of price competition; they have been able to offer higher grades for equivalent or lesser amounts of work. Not incidentally, they also get higher levels of student satisfaction and better external appraisals of their "teaching quality" which increasingly influence promotions and salary adjustments in elite research institutions. The general analytic problem of which correcting grade point average is a specific instance is:
ginning of the 20th century (Young, 1993). What is the evidence on grade inflation? At the high school level, more than twice as many first-year college students (32% versus 15%) reported in a 1996 national survey that their overall high school grade point average (GPA) was an A-minus or better, as compared with 30 years earlier (Hornblower,
1997). Yet if students are doing better in schools, why are there now more complaints about the poor quality of schools? In many school districts, it is not uncommon for top-ranking students to flaunt GPA's of 4.2 on a grade scale that supposedly ranges from 0 to 4. One has to wonder what interpretation
Statistics, 1995). Scholastic Assessment Test (SAT) scores declined over a 20-year period starting in the late 1960's and have risen only slightly since. Longitudinal changes in SAT scores are more difficult to interpret due to demographic changes in the composition of the test-taking population and because of self-selection factors in choosing to take the test. However, it is clear that the trend in SAT scores has not been commensurate with the corresponding rise in grades. Furthermore, because of less rigorous grading standards and grade inflation in high schools, the advantage of high school GPA over SAT scores in predicting college GPA has essentially vanished (Bejar and Blew, 1981). The correlation of high school GPA with college GPA has declined while the correlation of SAT scores with college GPA has held constant, so that both are now equally good predictors.

Data from colleges and universities are no less convincing about grade inflation. For example, the average undergraduate grade at Harvard rose from the midpoint between a B- and a B to better than a B+ during the 25-year period from 1967 to 1992, without a corresponding rise in test scores (Lambert, 1993). In addition, grades in the humanities at Harvard rose much more than in the social or natural sciences. The so-called gentleman's C is now at least the "gentleman's B," if not higher. Other first-tier universities, such as Stanford, report equally dramatic increases in the average grade assigned. With continuing grade inflation, one unfortunate impact is that admissions to graduate and professional schools will be made primarily on test results rather than on grades, since students can no longer be distinguished on the basis of grades.

Solutions to the problem of grade inflation, such as the one developed by Professor Johnson, are much needed. All of the pressures on grading are upward; remediating grade inflation cannot be accomplished individually but must be carried out collectively. Some institutions are attempting to deal with grade inflation by reinstating low grades; for example, after a long absence from students' transcripts, the F grade is again part of the Stanford grading system (Fetter, 1995). However, this approach is likely to have an impact only at the bottom end of the grade distribution without materially affecting the average grade assigned. An alternative approach at Dartmouth, begun two years ago, is to report the median course grade along with the student's grade so that some context is available for interpreting student transcripts. Though worthy of consideration, both the Stanford and Dartmouth solutions are crude instruments being applied globally. The implementation of Johnson's achievement index at Duke will lead to fairer comparisons among students. Grading differences among instructors, courses and departments will affect students' GPA's to a much lesser degree than is true now. Students who major in departments with rigorous grading standards (typically, in the natural and mathematical sciences) will not be at so large a disadvantage in the competition for employment, for admission to graduate school and for fellowships and scholarships. Furthermore, the incentive system for instructors will be converted to one of assigning high grades to deserving students in a course, rather than the present system of assigning high grades to everyone. Grade inflation is a serious and pervasive problem throughout education today. In the course of time, we will learn whether Johnson's achievement index has served its purpose or whether further refinements are necessary, but this solution moves us in the right direction. In conclusion, Professor Johnson is to be commended for developing his achievement index, and the faculty at Duke is to be praised for their courage in implementing it.
Patz (1996) and Zwick (1992). In fact, this identical model has appeared in research on general methods for ordinal data structures (e.g., grades) using latent variables in a Bayesian framework: see, for example, Albert and Chib (1993), Bradlow (1994) and
Bradlow and Zaslavsky (1996). Hemker, Sijtsma, Molenaar and Junker (1997) provide a comparative review of such models that explores the relationship between total score (unadjusted GPA) rankings and latent variable (adjusted GPA) rankings in some generality.
ature (see, e.g., Holland and Rubin, 1982). Equating problems also arise when SAT scores are compared from one year to the next; when students' scores on sequentially designed tests such as the current computerized version of the GRE are compared; and when scores in complex educational surveys with incomplete block designs for administering test questions are analyzed. How do we compare performances of different students based on different stimuli? Equating is clearly an experimental design question. In the examples above, the experimenter has enough control over the design to make the missingness plausibly ignorable (in the sense of Rubin, 1987; summarized in Gelman, Carlin, Stern and Rubin, 1995, Chapter 7) or modelable (e.g., Chang and
Ying, 1996). When Val distinguishes his Xi from the latent traits usually defined in item response models, calling it the "mean classroom achievement of the ith student, in classes selected by student i," the point is that there is a missingness process, in contrast to most general treatments of item response models, and it is not ignorable: clearly, students' self-selection of classes, as well as the grades themselves, is informative for student rankings. A model such as Val's, which treats this missing data mechanism as ignorable or missing at random, is vulnerable to severe biases in estimation of the Xi, as illustrated recently
by Bradlow and Thomas (1997). Mislevy and Wu (1996) provide general conditions needed for ignorability in various inferences from educational testing data, and Wang, Wainer and Thissen (1995) explore this problem in the context of equating exam scores when students are allowed to choose among several essay questions to answer.
linear model tried by Larkey and Caulkin (1992). On the other hand, Val can and did expand to a multicomponent model, which is not possible with the Larkey-Caulkin approach. Such expansions are interesting substantively (we even found ourselves wondering if a single composite of "math" and "verbal" ability would adequately capture performance across the spectrum of undergraduate courses our statistics departments offer) as well as pragmatically; and we are happy to see that Val is exploring further possibilities along these lines as well. Stricker et al. (1994) considered a related multicomponent model for GPA. On a pragmatic level the multicomponent model provided a meaningful improvement in the correlation of adjusted GPA ranks with other predictors. So it was disappointing to us that such extensions are apparently being dismissed in the initial proposal for the adjusted GPA at Duke as being too hard for clients and users to understand. Though Val writes that a multicomponent model may be introduced later, after a certain "comfort level" with
1994). Val's weighted-average achievement index based on the two-component model is a model-based version of this compromise; perhaps this would provide an opportunity to make room for the multicomponent model as it becomes better understood.

2.3 Computation

We thought the data augmentation approach and corresponding prior for collapsing grade categories, discussed at the end of Section 2, was clever and interesting. The related discussion in Section 1.2 indicates the delicacy of the problem; for example, Bradlow (1994) shows that changing which grade category cutoff you fix can have dramatic effects on the MCMC convergence rates. We want to raise two concerns about this prior choice, however. First, while suppressing unobserved grade categories works well for fitting observed data for one cohort of students, it may be necessary, for the related problem of predicting future grades, to allow for the occurrence of such grade categories by putting nonnegligible prior probabilities on them. Second, it appears that classes in which all students are assigned the same grade will be ignored by the model; there is little comparative information from which to assess the class's difficulty. It may be possible to use historical grade data to supply prior information on the difficulty of such classes, as long as the problem of equating across years can be overcome. Turning to a different matter, when Val shows in Section 4.1 that the achievement index gives the correct ranking in the Larkey-Caulkin data, we would like to have seen posterior probabilities for all rankings. The beauty of obtaining Monte Carlo posterior samples is that simple counts can give interesting insights: was the "reverse" order, or some other order, a close competitor in posterior probability, for example?
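The counting idea is simple enough to sketch in a few lines. The draws below are simulated stand-ins for MCMC output (the array shape, student means and sample size are our invention, not Val's actual posterior); the point is the tallying logic, which estimates the posterior probability of every complete ranking:

```python
from collections import Counter

import numpy as np

# Hypothetical posterior draws of latent achievement for three students:
# rows are MCMC iterations, columns are students. In practice these would
# come from the fitted model's sampler rather than a normal generator.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[0.0, 0.3, 0.6], scale=1.0, size=(5000, 3))

# For each draw, record the implied ranking (best to worst), then count
# how often each complete ordering occurs across the posterior sample.
orderings = Counter(tuple(np.argsort(-row)) for row in draws)

for order, count in orderings.most_common():
    # Relative frequency = Monte Carlo estimate of the posterior
    # probability of that ordering.
    print(order, count / len(draws))
```

Each tuple lists student indices from highest to lowest latent achievement, and the relative frequencies estimate the posterior probability of each complete ordering, so the share held by the "reverse" order, or any other competitor, can be read directly off the table.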
1996). As noted in the text, fixing any of the grade cutoffs makes the model unsuitable for analyzing undergraduate grade data, and I suspect the same is true for many other multirater ordinal datasets as well. Finally, Junker and Bradlow's comments regarding EB models are well taken. As White notes, the model was criticized at Duke for not "counting" independent study courses or small seminar courses in which only one grade was assigned (90% of independent study grades at Duke are either an A+, A-, or A; the median grade in classes containing fewer than five students is an A). In response to
Algina, J. (1992). Special issue: the National Assessment of Educational Progress (Editor's note). Journal of Educational Measurement 29 93-94.
Bejar, I. I. and Blew, E. O. (1981). Grade inflation and the validity of the Scholastic Aptitude Test. American Educational Research Journal 18 143-156.
Bradlow, E. T. and Thomas, N. (1997). Item response theory models applied to data allowing examinee choice. Journal of Educational and Behavioral Statistics. To appear.
Caulkins, J. P., Barnett, A., Larkey, P. D., Yuan, Y. and Goranson, J. (1993). The on-time machines: some analyses of airline punctuality. Oper. Res.
Chang, H.-H. and Ying, Z. (1996). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. Ann. Statist. To appear.
Fetter, J. H. (1995). Questions and Admissions: Reflections on 100,000 Admissions Decisions at Stanford. Stanford Univ. Press.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995). Bayesian Data Analysis. Chapman and Hall USA, New York.
Gose, B. (1997). The Chronicle of Higher Education, March 21, A53.
Hemker, B. T., Sijtsma, K., Molenaar, I. W. and Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika. To appear.
Holland, P. W. and Rubin, D. B., eds. (1982). Test Equating. Academic Press, New York.
Hornblower, M. (1997). Learning to earn. Time 149 (February 24) 34.
Johnson, E. G., Mislevy, R. J. and Thomas, N. (1994). Theoretical background and philosophy of NAEP scaling procedures. In Technical Report of the NAEP 1992 Trial State Assessment Program in Reading (E. G. Johnson, J. Mazzeo and D. L. Kline, eds.) Chapter 8, 133-146. Office of Educational Research and Improvement, U.S. Dept. Education, Washington, D.C.
Junker, B. W. (1991). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika 56 255-278.
Junker, B. W. and Stout, W. F. (1994). Robustness of ability estimation when multiple traits are present with one trait dominant. In Modern Theories of Measurement: Problems and Issues (D. Laveault, B. D. Zumbo, M. E. Gessaroli and M. W. Boss, eds.) Chapter 2. Univ. Ottawa.
Lambert, C. (1993). Desperately seeking summa. Harvard Magazine 95 (May-June) 36-40.
Mislevy, R. J. and Wu, P. K. (1996). Missing Responses and IRT ability estimation: omits, choice, time limits, and adaptive testing. Technical Report RR-96-30-ONR, Educational Testing Service, Princeton, NJ.
National Center for Education Statistics (1995). Digest of Education Statistics 1995. U.S. Dept. Education, Washington, D.C.
Patz, R. J. (1996). Markov chain Monte Carlo methods for item response theory models with applications for the National Assessment of Educational Progress. Ph.D. dissertation, Carnegie Mellon Univ.
Pedersen, D. (1997). When an A is average. Newsweek March 3, 64.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Samejima, F. (1997). Graded response model. In Handbook of Modern Item Response Theory (W. J. van der Linden and R. K. Hambleton, eds.) 85-100. Springer, New York.
Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika 55 293-325.
Stricker, L. J., Rock, D. A., Burton, N. W., Muraki, E. and Jirele, T. J. (1994). Adjusting college grade point average criteria for variations in grading standards: a comparison of methods. Journal of Applied Psychology 79 178-183.
Zwick, R. (1992). Special issue on the National Assessment of Educational Progress. Journal of Educational Statistics 17 93-94.