Comparative approaches to the assessment of writing: Reliability and validity of benchmark rating and comparative judgement


  • Renske Bouwer, Utrecht University
  • Marije Lesterhuis, University of Antwerp
  • Fien De Smedt, Ghent University
  • Hilde Van Keer, Ghent University
  • Sven De Maeyer, University of Antwerp



writing assessment, benchmark rating, comparative judgement, reliability, convergent validity


In recent years, comparative assessment approaches have gained ground as a viable method for assessing text quality. Instead of assigning absolute scores to a text, as in holistic or analytic scoring methods, raters in comparative assessments judge text quality either by comparing texts to pre-selected benchmarks representing different levels of writing quality (i.e., the benchmark rating method) or through a series of pairwise comparisons with other texts in the sample (i.e., comparative judgement; CJ). The present study compares text quality scores from the benchmark rating method and CJ in terms of their reliability, convergent validity, and score distribution. Results show that benchmark ratings and CJ ratings were highly consistent and converged on the same construct of text quality. However, the distribution of benchmark ratings showed a central tendency. We discuss how both methods can be integrated and used so that writing can be assessed reliably and validly, but also efficiently, in both writing research and practice.
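
In CJ, the outcomes of the pairwise comparisons have to be converted into a quality scale. The following is a minimal sketch of one common way to do this, assuming a Bradley-Terry model fitted with simple iterative updates. The abstract does not state which model or software the study used, so the model choice, the function name `bradley_terry`, and the toy data are all illustrative assumptions.

```python
import math


def bradley_terry(n_texts, comparisons, n_iter=200):
    """Estimate a log-quality score per text from (winner, loser) pairs.

    Assumption: a Bradley-Terry model, fitted with iterative
    minorization-maximization updates. This is an illustration only,
    not the procedure used in the study.
    """
    wins = [0] * n_texts
    losses = [0] * n_texts
    pair_counts = {}  # number of comparisons per unordered pair of texts
    for winner, loser in comparisons:
        wins[winner] += 1
        losses[loser] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    # The estimates are only finite if every text wins and loses at least once.
    if any(w == 0 for w in wins) or any(l == 0 for l in losses):
        raise ValueError("each text must win and lose at least one comparison")

    strength = [1.0] * n_texts
    for _ in range(n_iter):
        denom = [0.0] * n_texts
        for (i, j), n in pair_counts.items():
            share = n / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        strength = [wins[i] / denom[i] for i in range(n_texts)]
        # Rescale so the strengths keep a fixed mean (the scale is arbitrary).
        total = sum(strength)
        strength = [s * n_texts / total for s in strength]

    return [math.log(s) for s in strength]


# Hypothetical toy data: three texts, each pair compared three times.
toy_comparisons = [
    (0, 1), (0, 1), (1, 0),
    (1, 2), (1, 2), (2, 1),
    (0, 2), (0, 2), (2, 0),
]
scores = bradley_terry(3, toy_comparisons)
ranking = sorted(range(3), key=lambda t: scores[t], reverse=True)
```

In this toy example, `ranking` orders the texts from highest to lowest estimated quality. Scores of this kind are what could then be correlated with benchmark ratings of the same texts to examine convergent validity.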




How to Cite

Bouwer, R., Lesterhuis, M., De Smedt, F., Van Keer, H., & De Maeyer, S. (2023). Comparative approaches to the assessment of writing: Reliability and validity of benchmark rating and comparative judgement. Journal of Writing Research, 15(3), 497–518.