LLM feedback for academic writing: Effects on students’ performance and engagement Robert Glüsing, Johanna Fleckenstein, Fabian T.C. Schmidt, Jens Möller Contemporary Educational Psychology, 2026 Writing and revising academic texts is a demanding task that benefits significantly from feedback provided by teachers or peers. However, providing elaborated formative feedback on students’ academic writing is time-intensive and therefore hard to implement in educational practice. As a supplementary resource, large language models (LLMs) offer the potential to support the writing process by generating automated feedback to help students enhance their texts. The present study examined the accuracy of LLM-generated feedback on student texts and its effectiveness in improving university students’ revision performance and engagement in academic writing. In a randomized controlled experiment, a sample of N = 144 university students wrote an abstract summarizing a research article. All participants were then instructed to revise their abstracts; half received individualized feedback generated by GPT-4 using a standardized prompting procedure. Controlling for the quality of the initial drafts, regression analyses revealed that LLM-generated feedback led to higher revision quality and increased behavioral engagement, as measured by revision time and edit distance. Furthermore, behavioral engagement partially mediated the effect of feedback on revision quality. These findings demonstrate that LLMs can provide high-accuracy, effective feedback on academic writing. The study discusses the potential applications and implications of this technology within higher education contexts.
On the role of engagement in automated feedback effectiveness: Insights from keystroke logging Ronja Schiller, Johanna Fleckenstein, Lars Höft, Andrea Horbach, Jennifer Meyer Computers and Education, 2025 Feedback research increasingly focuses on the role of learners’ engagement in the feedback process. Process measures from technology-based learning environments that reflect writing behavior can provide new insights into the mechanisms underlying feedback effectiveness by making engagement visible. Previous research has shown that log data and similarity measures mediate the effects of automated feedback on learners’ revision performance. In the present study, we aimed to replicate and extend previous research using measures obtained from keystroke logging that represent the revision process on a more fine-grained level. We considered behavioral engagement (i.e., number of keystrokes and typing time) and writing pauses as potential indicators of cognitive engagement. In a classroom experiment, N = 453 English-as-a-foreign-language (EFL) learners (M age = 16.11) completed a writing task and revised their draft, receiving either feedback generated by a large language model (i.e., GPT 3.5 Turbo) or no feedback. A second writing task served as a transfer task. All texts were scored automatically to assess performance. The effect of automated feedback on learners’ revision and transfer performance was mediated through the different indicators of behavioral engagement during the text revision, although the direct effect of automated feedback on the transfer task was not significant. We found small effects of feedback on pause length and the number of pauses, but the indirect effects were not significant. The study provides further evidence on the role of learning engagement in feedback effectiveness and illustrates how online measures (i.e., keystroke logging) can be used to gain new insights into the effectiveness of automated feedback.
The use of different process measures to assess learning engagement is discussed.
Self-assessment accuracy in the age of artificial Intelligence: Differential effects of LLM-generated feedback Lucas W. Liebenow, Fabian T.C. Schmidt, Jennifer Meyer, Johanna Fleckenstein Computers and Education, 2025 Feedback is a promising intervention to foster students’ self-assessment accuracy (SAA), but the effect can vary depending on students' initial skill levels or prior performance. In particular, lower-performing students who are less accurate might benefit more from feedback in terms of SAA. To deepen our understanding, the present study investigated the mechanism and dependencies of feedback effects on SAA in the realm of large language models (LLMs). Within a randomized controlled experiment, we examined the effect of LLM-generated feedback on SAA by considering students’ initial performance and initial SAA as potential moderators. A sample of N = 459 upper secondary students wrote an argumentative essay in English as a foreign language and revised their text. After finishing their first draft (pretest) and revision (posttest) of the draft, students self-assessed their writing performance. Students in the experimental group received GPT-3.5-turbo-generated feedback on their first draft during their revision. In the control group, students could revise their text without feedback. Our results indicated no significant main effect of LLM-generated feedback on students’ SAA. However, we found a significant interaction effect between feedback and students' pretest SAA on SAA changes, indicating that less well-calibrated students improved their SAA more with feedback than students with similar pretest SAA who received no feedback. Exploratory analyses revealed that students with higher pretest SAA did not improve but instead decreased their SAA with feedback. We discuss this nuanced evidence and draw implications for research and practice using LLM-generated feedback in education.
• LLM-generated feedback did not improve self-assessment accuracy (SAA) on average. • Feedback effectiveness depended on students' initial SAA, not performance. • Students with lower initial SAA improved their SAA after LLM feedback. • LLM-generated feedback offers a scalable way to support students who need it most.
Neural Networks or Linguistic Features? - Comparing Different Machine-Learning Approaches for Automated Assessment of Text Quality Traits Among L1- and L2-Learners’ Argumentative Essays Julian F. Lohmann, Fynn Junge, Jens Möller, Johanna Fleckenstein, Ruth Trüb, Stefan Keller, Thorben Jansen, Andrea Horbach International Journal of Artificial Intelligence in Education, 2025 Recent investigations in automated essay scoring research imply that hybrid models, which combine feature engineering and the powerful tools of deep neural networks (DNNs), reach state-of-the-art performance. However, most of these findings are from holistic scoring tasks. In the present study, we use a total of four prompts from two different corpora consisting of both L1 and L2 learner essays annotated with trait scores (e.g., content, organization, and language quality). In our main experiments, we compare three variants of trait-specific models using different inputs: (1) models based on 220 linguistic features, (2) models using essay-level contextual embeddings from the distilled version of the pre-trained transformer BERT (DistilBERT), and (3) a hybrid model using both types of features. Results imply that when trait-specific models are trained based on a single resource, the feature-based models slightly outperform the embedding-based models. These differences are most prominent for the organization traits. The hybrid models outperform the single-resource models, indicating that linguistic features and embeddings indeed capture partially different aspects relevant for the assessment of essay traits. To gain more insights into the interplay between both feature types, we run addition and ablation tests for individual feature groups. Trait-specific addition tests across prompts indicate that the embedding-based models can most consistently be enhanced in content assessment when combined with morphological complexity features. 
Most consistent performance gains in the organization traits are achieved when embeddings are combined with length features, and most consistent performance gains in the assessment of the language traits when combined with lexical complexity, error, and occurrence features. Cross-prompt scoring again reveals slight advantages for the feature-based models.
(De)motivating Zero-Performing Students With Negative Feedback: Does the Salience of Performance Information Matter? Marlene Steinbach, Johanna Fleckenstein, Livia Kuklick, Jennifer Meyer Journal of Computer Assisted Learning, 2025 Background: Providing students with information on their current performance could help them improve by stimulating their reflection, but negative feedback that saliently mirrors task‐related failure can harm motivation. In the context of automated scoring based on artificial intelligence, we explored how feedback on written texts might be designed to be least detrimental for zero‐performing students who are likely to receive negative feedback frequently and might suffer from its motivational consequences. Objectives: This experiment set out to investigate whether making the negative performance information in automated feedback messages less salient reduces the potential threat of negative feedback for zero‐performing students' task‐specific self‐concept, intrinsic value, and performance. Methods: A sample of 105 (M age = 13.97 years) zero‐performing students received negative feedback with either more or less salient performance information after completing an English writing task. We used regression analysis to examine pre–post effects and group differences in self‐concept, intrinsic value, and performance. Results and Conclusions: The analyses showed that zero‐performing students' performance improved but their self‐concept and intrinsic value declined over the course of two writing tasks, with feedback provided after the initial task. Contrary to expectations, our findings showed that students' task‐specific self‐concept and intrinsic value declined more in the condition with less salient performance information (i.e., without a red cross as a salient visual performance cue).
Our findings highlight the motivational potential of performance information and are discussed in terms of the need for further research into how negative feedback can be designed to effectively motivate and support zero‐performing learners.
“Can (A)I do this task?” The role of AI as a socializer of students' self-beliefs of their abilities Thorben Jansen, Jennifer Meyer, Johanna Fleckenstein, Allan Wigfield, Jens Möller Learning and Individual Differences, 2025 Students' beliefs about their own academic abilities – their answers to the question “Can I do this task?” – are crucial to their success. Learning within AI-supported environments, alongside AI agents, influences students' beliefs about their abilities. Studies show enhancing and diminishing influences that remain unexplained by motivation theory, limiting theories' explanatory effect in AI-supported learning environments, and leaving educational technology research without a solid theoretical foundation. The following article specifies the situated expectancy-value theory (SEVT) for students' self-belief formation in the context of an AI-driven society. The expanded theory conceptualizes AI as becoming an artificial socializer, capturing the role of AI as an instrumental tool and social agents making up students' individual environments. Bridging AI and motivational research provides a framework for systematically investigating students' self-beliefs in AI-supported contexts and how educational technology can support positive self-beliefs, considering students' contexts and individual differences. • Summarizes and explains empirical influences of AI on students' ability self-beliefs. • Integrates AI into situated expectancy-value theory. • Provides a framework to investigate AI effects on students' ability self-beliefs. • Describes potential mechanisms in the self-belief formation considering AI.
Nonengagement and unsuccessful engagement with feedback in lower secondary education: The role of student characteristics Jennifer Meyer, Thorben Jansen, Johanna Fleckenstein Contemporary Educational Psychology, 2025 • We investigated feedback engagement in a sample of lower-secondary students. • Findings show that 20% of students did not engage, 47% unsuccessfully engaged in a text revision. • We focused on the role of individual differences in feedback engagement. • We considered the role of gender, cognitive and noncognitive variables. Feedback can be a powerful learning intervention and learners’ active engagement is assumed to be one of the most important determinants of feedback effectiveness. But not all students successfully engage with feedback. In the present study, we aimed to make students’ engagement with feedback visible by focusing on their text revisions as an indicator of feedback response. On the basis of theoretical models of feedback processing, we differentiated between behavioral nonengagement (i.e., not revising at all after receiving feedback) and unsuccessful engagement (i.e., revising after receiving feedback, but not improving in the process). Capitalizing on this distinction, we compared the characteristics of students in both groups with those of students who (successfully) engaged with the feedback. We provided automated computer-based feedback on a writing task to a sample of 937 students in lower secondary education in Germany (49% female, Grades 7[28%], 8 [29%], and 9[43%]), asking students to revise their texts according to the feedback. We found that 20% of the students did not make any revisions to their text after receiving feedback (nonengagement) and that 47% of the students did not improve their performance after working with the feedback during a text revision (unsuccessful engagement). Male students and students with lower cognitive abilities were more likely to show nonengagement. 
For unsuccessful engagement, cognitive abilities and the English grade were relevant predictors, hinting at the role that domain-specific competencies play in translating feedback into effective revision. We also found significant positive associations of intrinsic task value with successful feedback engagement. We discuss how future research could advance understanding of feedback processing by taking a more fine-grained approach to investigating feedback response.
Understanding individual differences in students’ responses to technology-based feedback on a writing task: the role of achievement motives and initial task performance Jennifer Meyer, Thorben Jansen, Martin Daumiller, Johanna Fleckenstein Journal of Research on Technology in Education, 2025 Computer-based feedback interventions are generally effective—but not for all students. Students’ achievement motives (hopes for success, fear of failure) might explain how students respond to feedback in interplay with initial task performance. In a sample of 949 secondary school students in Germany (Grades 7–9) we found that when the task criterion was initially not met, higher hopes for success were positively associated with students’ subsequent task performance after receiving automated feedback. When the criterion was initially met, a higher fear of failure was negatively related to the subsequent task performance. Our results suggest that achievement motives can play a complex role at different levels of initial task performance. These insights could inform personalized feedback design to enhance feedback effectiveness in cognitively demanding tasks.
Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human-extracted data from systematic review databases. Thorben Jansen, Lucas W. Liebenow, Ute Mertens, Fabian T. C. Schmidt, Julian F. Lohmann, Johanna Fleckenstein, Jennifer Meyer Psychological Bulletin, 2025 Psychological science requires reliable measures. Within systematic literature reviews, reliability hinges on high interrater agreement during data extraction. Yet, the extraction process has been time-consuming. Efforts to accelerate the process using technology have shown limited success until generative artificial intelligence (genAI), particularly large language models (LLMs), accurately extracted variables from medical studies. Nonetheless, for psychological researchers, it remains unclear how to utilize genAI for data extraction, given the range of tested variables, the medical context, and the variability in accuracy. We systematically assessed extraction accuracy and error patterns across domains in psychology by comparing genAI-extracted and human-extracted data from 22 systematic review databases published in the Psychological Bulletin. Eight LLMs extracted 312,329 data points from 2,179 studies on 186 variables. LLM extractions achieved unacceptable accuracy on all metrics for 20% of variables. For 46% of variables, accuracy was acceptable for some metrics and unacceptable for others. LLMs reached acceptable but not high accuracy on all metrics in 15%, high but not excellent in 8%, and excellent accuracy in 12% of variables. Accuracy varied most between variables, less between systematic reviews, and least between LLMs. Moderator analyses using a hierarchical logistic regression, hierarchical linear model, and meta-analysis revealed that accuracy was higher for variables describing studies' context and moderator variables compared to variables for effect size calculation. 
Also, accuracy was higher in systematic reviews with more detailed variable descriptions and positively correlated with model sizes. We discuss directions for investigating ways to use genAI to accelerate data extractions while ensuring meaningful human control. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
Understanding the effectiveness of automated feedback: Using process data to uncover the role of behavioral engagement Ronja Schiller, Johanna Fleckenstein, Ute Mertens, Andrea Horbach, Jennifer Meyer Computers and Education, 2024 In the last couple of years, feedback research has shifted towards a feedback-as-process approach, taking a learner-centered perspective and focusing on the proactive role of the learner in feedback effectiveness. Process measures can provide new insights into the role of the learner by making learners’ actual behavioral engagement visible. We conducted an experimental study, comparing two groups (feedback vs. no feedback) of English-as-a-foreign-language learners in lower secondary schools (N = 189). The learners completed a writing task and revised it with or without feedback. A second writing task served as a transfer task. Performance was automatically assessed using a scoring algorithm. To determine the level of learners’ behavioral engagement during the text revision, we used the revision time and the edit distance (i.e., a similarity measure) as behavioral measures. Our analyses showed a positive effect of feedback on text revision. We found a full mediation of the effect of feedback on text revision through revision time with an estimated portion of mediation (POM) of .63*** and a partial mediation of the feedback effect on text revision through the edit distance with a POM of .30**. We did not find significant mediation effects of either engagement variable regarding performance in a transfer task. Our findings contribute to the understanding of feedback effectiveness, highlighting the central role of learner engagement in the feedback process. • We investigate feedback effectiveness from a process-oriented perspective. • Log-data and computer-linguistic features as objective indicators of engagement. • Feedback is associated with higher levels of engagement during text revision.
• Behavioral engagement mediates positive effect of feedback on revision performance. • We did not find an effect of the feedback on a transfer task.
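The abstract above uses edit distance between draft and revision as a behavioral engagement measure. As a minimal sketch of that idea, assuming a character-level Levenshtein distance (the study's exact implementation and any normalization are not stated here, so the function names and the length normalization below are illustrative assumptions):

```python
def edit_distance(draft: str, revision: str) -> int:
    """Character-level Levenshtein distance (assumed variant, for illustration)."""
    # previous[j] = distance between the draft prefix processed so far and revision[:j]
    previous = list(range(len(revision) + 1))
    for i, a in enumerate(draft, start=1):
        current = [i]  # turning draft[:i] into an empty string takes i deletions
        for j, b in enumerate(revision, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from the draft
                current[j - 1] + 1,      # insert a character from the revision
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(draft: str, revision: str) -> float:
    """Edit distance divided by the longer text's length (hypothetical normalization)."""
    longest = max(len(draft), len(revision))
    return edit_distance(draft, revision) / longest if longest else 0.0
```

A value of 0 means the student submitted the draft unchanged, while larger values indicate more extensive revision; whether to work on characters or tokens is a design choice this sketch does not settle.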
Sequence Tagging in EFL Email Texts as Feedback for Language Learners Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023), 2023
Measuring Task-Level Behavioral Learning Engagement During Text Revision R Schiller, J Fleckenstein, U Mertens, J Meyer Computers & Education, 105656, 2026
The Future of Feedback: How Can AI Help Transform Feedback to Be More Engaging, Effective, and Scalable? J Meyer, O Köller, T Jansen, J Fleckenstein, MW Asher, S Bichler, ... arXiv preprint arXiv:2603.12463, 2026
On the role of engagement in automated feedback effectiveness: Insights from keystroke logging R Schiller, J Fleckenstein, L Höft, A Horbach, J Meyer Computers & Education 238, 105386, 2025 Citations: 7
Self-assessment accuracy in the age of artificial Intelligence: Differential effects of LLM-generated feedback LW Liebenow, FTC Schmidt, J Meyer, J Fleckenstein Computers & Education 237, 105385, 2025 Citations: 15
Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human-extracted data from systematic review databases. T Jansen, LW Liebenow, U Mertens, FTC Schmidt, JF Lohmann, ... Psychological Bulletin 151 (10), 1280, 2025 Citations: 15
Neural networks or linguistic features?-Comparing different machine-learning approaches for automated assessment of text quality traits among L1-and L2-learners’ argumentative … JF Lohmann, F Junge, J Möller, J Fleckenstein, R Trüb, S Keller, T Jansen, ... International Journal of Artificial Intelligence in Education 35 (3), 1178-1217, 2025 Citations: 9
Testing teacher judgments comprehensively: Accuracy, halo, frame of reference, strategy, and personality effects in holistic and analytic assessments of student essays. JF Lohmann, F Lötscher, F Junge, S Keller, T Jansen, J Fleckenstein, ... Journal of Educational Psychology, 2025 Citations: 2
“Can (A)I do this task?” The role of AI as a socializer of students' self-beliefs of their abilities T Jansen, J Meyer, J Fleckenstein, A Wigfield, J Möller Learning and Individual Differences 122, 102731, 2025 Citations: 8
(De)motivating Zero‐Performing Students With Negative Feedback: Does the Salience of Performance Information Matter? M Steinbach, J Fleckenstein, L Kuklick, J Meyer Journal of Computer Assisted Learning 41 (4), e70070, 2025 Citations: 2
Nonengagement and unsuccessful engagement with feedback in lower secondary education: The role of student characteristics J Meyer, T Jansen, J Fleckenstein Contemporary Educational Psychology 81, 102363, 2025 Citations: 22
Understanding individual differences in students’ responses to technology-based feedback on a writing task: the role of achievement motives and initial task performance J Meyer, T Jansen, M Daumiller, J Fleckenstein Journal of Research on Technology in Education, 1-31, 2025 Citations: 8
LLM feedback for academic writing: Effects on students’ performance and engagement R Glüsing, J Fleckenstein, F Schmidt, J Möller Available at SSRN 5445319, 2025 Citations: 3
Negative Feedback: Does the Salience of Performance Information Matter? M Steinbach, J Fleckenstein, L Kuklick, J Meyer 2025
Understanding the effectiveness of automated feedback: Using process data to uncover the role of behavioral engagement R Schiller, J Fleckenstein, U Mertens, A Horbach, J Meyer Computers & Education 223, 105163, 2024 Citations: 30
How am I going? Behavioral engagement mediates the effect of individual feedback on writing performance J Fleckenstein, T Jansen, J Meyer, R Trüb, EE Raubach, SD Keller Learning and Instruction 93, 101977, 2024 Citations: 22
Language quality, content, structure: What analytic ratings tell us about EFL writing skills at upper secondary school level in Germany and Switzerland SD Keller, J Lohmann, R Trüb, J Fleckenstein, J Meyer, T Jansen, J Möller Journal of Second Language Writing 65, 101129, 2024 Citations: 19
Two-way immersion promotes additional language learning: performance of bilingual sixth-grade students in English as a third language S Preusler, J Fleckenstein, S Zitzmann, J Baumert, J Möller International Journal of Bilingual Education and Bilingualism 27 (7), 910-922, 2024 Citations: 7
Do teachers spot AI? Evaluating the detectability of AI-generated texts among student essays J Fleckenstein, J Meyer, T Jansen, SD Keller, O Köller, J Möller Computers and Education: Artificial Intelligence 6, 100209, 2024 Citations: 225
Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions J Meyer, T Jansen, R Schiller, LW Liebenow, M Steinbach, A Horbach, ... Computers and Education: Artificial Intelligence 6, 100199, 2024 Citations: 486
Empirische arbeit: comparing generative AI and expert feedback to students’ writing: insights from student teachers T Jansen, L Höft, L Bahr, J Fleckenstein, J Möller, O Köller, J Meyer Psychologie in Erziehung und Unterricht 71 (2), 80-92, 2024 Citations: 67
MOST CITED SCHOLAR PUBLICATIONS
Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions J Meyer, T Jansen, R Schiller, LW Liebenow, M Steinbach, A Horbach, ... Computers and Education: Artificial Intelligence 6, 100199, 2024 Citations: 486
Measuring grit FTC Schmidt, J Fleckenstein, J Retelsdorf, L Eskreis-Winkler, J Möller European Journal of Psychological Assessment, 2017 Citations: 306
Same same, but different? Relations between facets of conscientiousness and grit FTC Schmidt, G Nagy, J Fleckenstein, J Möller, JAN Retelsdorf European Journal of Personality 32 (6), 705-720, 2018 Citations: 227
Do teachers spot AI? Evaluating the detectability of AI-generated texts among student essays J Fleckenstein, J Meyer, T Jansen, SD Keller, O Köller, J Möller Computers and Education: Artificial Intelligence 6, 100209, 2024 Citations: 225
Expectancy value interactions and academic achievement: Differential relationships with achievement measures J Meyer, J Fleckenstein, O Köller Contemporary Educational Psychology 58, 58-74, 2019 Citations: 204
Automated feedback and writing: a multi-level meta-analysis of effects on students' performance J Fleckenstein, L Liebenow, J Meyer Frontiers in Artificial Intelligence 6, 2023 Citations: 171
The relationship of personality traits and different measures of domain-specific achievement in upper secondary education J Meyer, J Fleckenstein, J Retelsdorf, O Köller Learning and Individual Differences 69, 45-59, 2019 Citations: 126
The long‐term proficiency of early, middle, and late starters learning English as a foreign language at school: A narrative review and empirical study J Baumert, J Fleckenstein, M Leucht, O Köller, J Möller Language Learning 70 (4), 1091-1135, 2020 Citations: 94
Linking TOEFL iBT® writing rubrics to CEFR levels: Cut scores and validity evidence from a standard setting study J Fleckenstein, S Keller, M Krüger, RJ Tannenbaum, O Köller Assessing Writing 43, 100420, 2020 Citations: 80
Erfolgreich integrieren-die Staatliche Europa-Schule Berlin J Möller, F Hohenstein, J Fleckenstein, O Köller, J Baumert Waxmann Verlag, 2017 Citations: 73
Empirische arbeit: comparing generative AI and expert feedback to students’ writing: insights from student teachers T Jansen, L Höft, L Bahr, J Fleckenstein, J Möller, O Köller, J Meyer Psychologie in Erziehung und Unterricht 71 (2), 80-92, 2024 Citations: 67
Is a long essay always a good essay? The effect of text length on writing assessment J Fleckenstein, J Meyer, T Jansen, S Keller, O Köller Frontiers in Psychology 11, 562462, 2020 Citations: 67
English writing skills of students in upper secondary education: Results from an empirical study in Switzerland and Germany SD Keller, J Fleckenstein, M Krüger, O Köller, AA Rupp Journal of Second Language Writing 48, 100700, 2020 Citations: 63
Pädagogische und didaktische Anforderungen an die häusliche Aufgabenbearbeitung O Köller, J Fleckenstein, K Guill, J Meyer Langsam vermisse ich die Schule…“. Schule während und nach der Corona …, 2020 Citations: 48
Conscientiousness and cognitive ability as predictors of academic achievement: Evidence of synergistic effects from integrative data analysis J Meyer, O Lüdtke, FTC Schmidt, J Fleckenstein, U Trautwein, O Köller European Journal of Personality 38 (1), 36-52, 2024 Citations: 46
Teachers’ judgement accuracy concerning CEFR levels of prospective university students J Fleckenstein, M Leucht, O Köller Language Assessment Quarterly 15 (1), 90-101, 2018 Citations: 40
Wer hat Biss? Beharrlichkeit und beständiges Interesse von Lehramtsstudierenden J Fleckenstein, FTC Schmidt, J Möller Psychologie in Erziehung und Unterricht 61 (4), 281-286, 2014 Citations: 40
Mehrsprachigkeit als Ressource J Fleckenstein, J Möller, J Baumert Zeitschrift für Erziehungswissenschaft 21 (1), 97-120, 2018 Citations: 38
Proficient beyond borders: assessing non-native speakers in a native speakers’ framework J Fleckenstein, M Leucht, HA Pant, O Köller Large-scale Assessments in Education 4 (1), 19, 2016 Citations: 38
Promoting mathematics achievement in one-way immersion: Performance development over four years of elementary school J Fleckenstein, SK Gebauer, J Möller Contemporary Educational Psychology 56, 228-235, 2019 Citations: 37