Augmenting AI scoring of essays with GPT-generated responses
DOI: https://doi.org/10.17239/jowr-2026.17.03.06

Keywords: AI scoring, writing assessment, large language models, GPT, sample augmentation

Abstract
In this study, we examine the feasibility of augmenting student-written essays with essays generated by large language models (LLMs) for training essay-scoring models. We found that, with appropriate instructions, generative AI systems such as GPT-4 and GPT-4o can produce essays similar to those written by students in terms of surface-level linguistic features, although material differences may still exist. Systematic analyses revealed that scoring models trained with synthetic data perform comparably to models trained on student essays, although performance varies across prompts and training-sample sizes. The augmented models could alleviate large discrepancies between human and AI scores at the subgroup level that may arise from a lack of training samples for a particular subgroup or from inherent biases in LLMs. We also explored an established token-importance method, DecompX, to identify influential tokens and help explain AI predictions. Future research directions and limitations of this study are also discussed.
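As a concrete illustration of the augmentation idea summarized above (not the authors' actual pipeline), the sketch below shows how LLM-generated essays might be produced and pooled with student essays to fine-tune a transformer-based scoring model. The prompt wording, the 1-6 score scale, the gpt-4o and microsoft/deberta-v3-base model identifiers, the toy data, and the single training step are all illustrative assumptions made for this sketch.

```python
# Minimal sketch: augment a scoring-model training set with GPT-generated
# essays, then fine-tune a transformer regressor on the pooled sample.
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_synthetic_essay(task_prompt, grade_level, target_score):
    """Ask the LLM to write an essay mimicking a student response at a given score point."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": (f"You are a {grade_level} student. Write an argumentative essay "
                         f"typical of a response scored {target_score} on a 1-6 rubric. "
                         "Use age-appropriate vocabulary and allow natural errors.")},
            {"role": "user", "content": task_prompt},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content

# Pool authentic and synthetic essays into one training sample (toy-sized here).
student_essays = [("Essay text written by a student ...", 4)]  # (text, human score)
synthetic_essays = [
    (generate_synthetic_essay(
        "Should schools require community service? Take a position.", "8th-grade", s), s)
    for s in (2, 4, 6)
]
texts, scores = zip(*(student_essays + synthetic_essays))

# Fine-tune a transformer with a single regression output as the scoring model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=1, problem_type="regression")
batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor(scores, dtype=torch.float)
loss = model(**batch, labels=labels).loss  # MSE loss; full training loop omitted
loss.backward()
```

In practice, a full training loop, held-out human-scored essays for evaluation, and subgroup-level checks of human-AI score discrepancies would follow this step.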
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In International conference on learning representations. https://arxiv.org/abs/1409.0473
Bejar, I. I., Mislevy, R. J., & Zhang, M. (2016). Automated scoring with validity in mind. In A. A. Rupp & J. P. Leighton (Eds.), The Wiley handbook of cognition and assessment. Wiley.
Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Human scoring. In D. M. Williamson, I. I. Bejar, & R. J. Mislevy (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 49-79). Mahwah, NJ: Lawrence Erlbaum Associates.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–17.
Bennett, R. E., & Zhang, M. (2016). Validity and automated scoring. In F. Drasgow (Ed.), Technology in testing: Measurement issues (pp. 142-173). Taylor & Francis.
Bibal, A., Marion, R., von Sachs, R., & Frénay, B. (2021). BIOT: Explaining multidimensional nonlinear MDS embeddings using the best interpretable orthogonal transformation. Neurocomputing, 453, 109-118.
Chen, J., Tam, D., Raffel, C., Bansal, M., & Yang, D. (2023). An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11, 191-211.
Chen, J., Zhang, M., & Bejar, I. I. (2017). An investigation of the e-rater® automated scoring engine’s grammar, usage, mechanics, and style microfeatures and their aggregation model. ETS RR-17-04. Princeton, NJ: Educational Testing Service.
Crossley, S. (2024). Persuade_corpus_2.0. Github. https://github.com/scrosseye/persuade_corpus_2.0
Crossley, S. A., Tian, Y., Baffour, P., Franklin, A., Benner, M., & Boser, U. (2024). A large-scale corpus for assessing written argumentation: PERSUADE 2.0. Assessing Writing, 61, 100865.
Dai, H., Liu, Z., Liao, W., Huang, X., Wu, Z., Zhao, L., . . . Li, X. (2023). AugGPT: Leveraging ChatGPT for text data augmentation. Retrieved from https://arxiv.org/pdf/2302.13007
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://arxiv.org/pdf/1810.04805
Dikli, S. (2006). An overview of automated essay scoring. The Journal of Technology, Learning, and Assessment, 5(1).
Ding, B., Qin, C., Zhao, R., Luo, T., . . . Joty, S. (2024). Data augmentation using large language models: Data perspectives, learning paradigms and challenges. Retrieved from https://arxiv.org/pdf/2403.02990.pdf
ETS. (2021). Best practices for constructed-response scoring. Educational Testing Service, Princeton, NJ. Retrieved from https://www.ets.org/pdfs/about/cr_best_practices.pdf
ETS. (2025). Responsible use of AI for measurement and learning: Principles and practices. RR-25-03. Princeton, NJ: Educational Testing Service.
Fang, L., Lee, G.-G., & Zhai, X. (2023). Using GPT-4 to augment unbalanced data for automatic scoring. Retrieved from https://arxiv.org/abs/2310.18365
Haberman, S. (2019). Measures of agreement versus measures of prediction accuracy. RR-19-20. Princeton, NJ: Educational Testing Service.
He, P., Gao, J., & Chen, W. (2021). DeBERTaV3: Improving DeBERTa using ELECTRA-Style pre-training with gradient-disentangled embedding sharing. Retrieved from https://arxiv.org/abs/2111.09543
Hernandez, D., Brown, T. B., Conerly, T., DasSarma, N.,. . . McCandlish, S. (2022). Scaling laws and interpretability of learning from repeated data. Retrieved from https://arxiv.org/abs/2205.10487
Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. Retrieved from http://arxiv.org/abs/1503.02531
Johnson, M., & Zhang, M. (2024). Examining the responsible use of zero-shot AI approaches to scoring essays. Scientific Reports, 14, 30064.
Johnson, M. S., & McCaffrey, D. F. (2023). Evaluating fairness of automated scoring in educational measurement. In V. Yaneva and M. von Davier (Eds.), Advancing natural language processing in educational assessment (1st ed.). New York: Routledge.
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland. Association for Computational Linguistics.
Liu, Y., Ott, M., Goyal, N., Du, J., . . . Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. Retrieved from https://arxiv.org/abs/1907.11692
Long, L., Wang, R., Xiao, R., Zhao, J., . . . Wang, H. (2024). On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11065-11082, Bangkok, Thailand. Association for Computational Linguistics.
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. Retrieved from https://arxiv.org/abs/1711.05101
Modarressi, A., Fayyaz, M., Aghazadeh, E., Yaghoobzadeh, Y., & Pilehvar, M. T. (2023). DecompX: Explaining transformers decisions by propagating token decomposition. In Proceedings of the 61st annual meeting of the association for computational linguistics (ACL). Vol. 1, Long Papers, pp. 2649-2664.
Morris, W., Holmes, L., Choi, J. S., & Crossley, S. (2025). Automated scoring of constructed response items in math assessment using large language models. International Journal of Artificial Intelligence in Education, 35, 559-586.
OpenAI. (2023). GPT-4 technical report. Retrieved from https://openai.com/index/gpt-4-research/
Page, E. B. (1966). The imminence of grading essays by computer. The Phi Delta Kappan, 47(5), 238–243.
Raheja, V., Kumar, D., Koo, R., & Kang, D. (2023). CoEdIT: Text editing by task-specific instruction tuning. In Conference on Empirical Methods in Natural Language Processing.
Sun, S., Cheng, Y., Gan, Z., & Liu, J. (2019). Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332, Hong Kong, China. Association for Computational Linguistics.
Tirumala, K., Simig, D., Aghajanyan, A., & Morcos, A. (2023). D4: Improving LLM pretraining via document de-duplication and diversification. In NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems. Article No.: 2348, pp. 53983–53995.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., . . . Lample, G. (2023). LLaMA: Open and efficient foundation language models. Retrieved from https://arxiv.org/abs/2302.13971
Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., . . . Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762
Whitehouse, C., Choudhury, M., & Aji, A. F. (2023). LLM-powered data augmentation for enhanced cross-lingual performance. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 671-686. Association for Computational Linguistics.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.
Yao, L., Haberman, S. J., & Zhang, M. (2019a). Penalized best linear prediction of true test scores. Psychometrika, 84(1), 186-211.
Yao, L., Haberman, S. J., & Zhang, M. (2019b). Prediction of writing true scores in automated scoring of essays by best linear predictors and penalized best linear predictors. RR-19-13. Princeton, NJ: Educational Testing Service.
Yoo, K. M., Park, D., Kang, J., Lee, S.-W., & Park, W. (2021). GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Yu, Z., He, L., Wu, Z., Dai, X., & Chen, J. (2023). Towards better chain-of-thought prompting strategies: A survey. Retrieved from https://arxiv.org/abs/2310.04959
Yuan, L., Tay, F. E. H., Li, G., Wang, T., & Feng, J. (2021). Revisiting knowledge distillation via label smoothing regularization. Retrieved from https://arxiv.org/pdf/1909.11723
Zhang, M. (2013). Contrasting automated and human scoring of essays (ETS R&D Connections No. 21). Princeton, NJ: Educational Testing Service.
Zhang, M., & Bennett, R. E. (2022). Automated scoring of constructed-response items in educational assessment. In International encyclopedia of education (4th edition) (p. 397-403). Elsevier.
Zhang, M., Johnson, M., & Ruan, C. (2024). Investigating sampling impacts on an LLM-based AI scoring approach: Prediction accuracy and fairness. Journal of Measurement and Evaluation in Education and Psychology, 15, 348-360.
Zhang, M., Williamson, D. M., Breyer, F. J., & Trapani, C. (2012). Comparison of e-rater® automated essay scoring model calibration methods based on distributional targets. International Journal of Testing, 12, 345–364.
License
Copyright (c) 2026 Mo Zhang, Akshay Badola, Matthew Johnson, Chen Li

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 Unported License.