Johanna Fleckenstein

Scopus Publications

LLM feedback for academic writing: Effects on students’ performance and engagement
Robert Glüsing, Johanna Fleckenstein, Fabian T.C. Schmidt, Jens Möller
Contemporary Educational Psychology, 2026
Writing and revising academic texts is a demanding task that benefits significantly from feedback provided by teachers or peers. However, providing elaborated formative feedback on students’ academic writing is time-intensive and therefore hard to implement in educational practice. As a supplementary resource, large language models (LLMs) offer the potential to support the writing process by generating automated feedback to help students enhance their texts. The present study examined the accuracy of LLM-generated feedback on student texts and its effectiveness in improving university students’ revision performance and engagement in academic writing. In a randomized controlled experiment, a sample of N = 144 university students wrote an abstract summarizing a research article. All participants were then instructed to revise their abstracts; half received individualized feedback generated by GPT-4 using a standardized prompting procedure. Controlling for the quality of the initial drafts, regression analyses revealed that LLM-generated feedback led to higher revision quality and increased behavioral engagement, as measured by revision time and edit distance. Furthermore, behavioral engagement partially mediated the effect of feedback on revision quality. These findings demonstrate that LLMs can provide high-accuracy, effective feedback on academic writing. The study discusses the potential applications and implications of this technology within higher education contexts.
On the role of engagement in automated feedback effectiveness: Insights from keystroke logging
Ronja Schiller, Johanna Fleckenstein, Lars Höft, Andrea Horbach, Jennifer Meyer
Computers and Education, 2025
Feedback research increasingly focuses on the role of learners’ engagement in the feedback process. Process measures from technology-based learning environments that reflect writing behavior can provide new insights into the mechanisms underlying feedback effectiveness by making engagement visible. Previous research has shown that log data and similarity measures mediate the effects of automated feedback on learners’ revision performance. In the present study, we aimed to replicate and extend previous research using measures obtained from keystroke logging that represent the revision process on a more fine-grained level. We considered behavioral engagement (i.e., number of keystrokes and typing time) and writing pauses as potential indicators of cognitive engagement. In a classroom experiment, N = 453 English-as-a-foreign-language (EFL) learners (mean age = 16.11) completed a writing task and revised their draft, receiving either feedback generated by a large language model (i.e., GPT 3.5 Turbo) or no feedback. A second writing task served as a transfer task. All texts were scored automatically to assess performance. The effect of automated feedback on learners’ revision and transfer performance was mediated through the different indicators of behavioral engagement during the text revision, although the direct effect of automated feedback on the transfer task was not significant. We found small effects of feedback on pause length and the number of pauses, but the indirect effects were not significant. The study provides further evidence on the role of learning engagement in feedback effectiveness and illustrates how online measures (i.e., keystroke logging) can be used to gain new insights into the effectiveness of automated feedback. The use of different process measures to assess learning engagement is discussed.
Self-assessment accuracy in the age of artificial Intelligence: Differential effects of LLM-generated feedback
Lucas W. Liebenow, Fabian T.C. Schmidt, Jennifer Meyer, Johanna Fleckenstein
Computers and Education, 2025
Feedback is a promising intervention to foster students’ self-assessment accuracy (SAA), but the effect can vary depending on students' initial skill levels or prior performance. In particular, lower-performing students who are less accurate might benefit more from feedback in terms of SAA. To deepen our understanding, the present study investigated the mechanism and dependencies of feedback effects on SAA in the realm of large language models (LLMs). Within a randomized controlled experiment, we examined the effect of LLM-generated feedback on SAA by considering students’ initial performance and initial SAA as potential moderators. A sample of N = 459 upper secondary students wrote an argumentative essay in English as a foreign language and revised their text. After finishing their first draft (pretest) and revision (posttest) of the draft, students self-assessed their writing performance. Students in the experimental group received GPT-3.5-turbo-generated feedback on their first draft during their revision. In the control group, students could revise their text without feedback. Our results indicated no significant main effect of LLM-generated feedback on students’ SAA. However, we found a significant interaction effect between feedback and students' pretest SAA on SAA changes, indicating that lower-calibrated students who received feedback improved their SAA more than students with similar pretest SAA who did not. Exploratory analyses revealed that students with higher pretest SAA did not improve their SAA with feedback; instead, their SAA decreased. We discuss this nuanced evidence and draw implications for research and practice using LLM-generated feedback in education.
• LLM-generated feedback did not improve self-assessment accuracy (SAA) on average.
• Feedback effectiveness depended on students' initial SAA, not performance.
• Students with lower initial SAA improved their SAA after LLM feedback.
• LLM-generated feedback offers a scalable way to support students who need it most.
Neural Networks or Linguistic Features? - Comparing Different Machine-Learning Approaches for Automated Assessment of Text Quality Traits Among L1- and L2-Learners’ Argumentative Essays
Julian F. Lohmann, Fynn Junge, Jens Möller, Johanna Fleckenstein, Ruth Trüb, Stefan Keller, Thorben Jansen, Andrea Horbach
International Journal of Artificial Intelligence in Education, 2025
Recent investigations in automated essay scoring research imply that hybrid models, which combine feature engineering and the powerful tools of deep neural networks (DNNs), reach state-of-the-art performance. However, most of these findings are from holistic scoring tasks. In the present study, we use a total of four prompts from two different corpora consisting of both L1 and L2 learner essays annotated with trait scores (e.g., content, organization, and language quality). In our main experiments, we compare three variants of trait-specific models using different inputs: (1) models based on 220 linguistic features, (2) models using essay-level contextual embeddings from the distilled version of the pre-trained transformer BERT (DistilBERT), and (3) a hybrid model using both types of features. Results imply that when trait-specific models are trained based on a single resource, the feature-based models slightly outperform the embedding-based models. These differences are most prominent for the organization traits. The hybrid models outperform the single-resource models, indicating that linguistic features and embeddings indeed capture partially different aspects relevant for the assessment of essay traits. To gain more insights into the interplay between both feature types, we run addition and ablation tests for individual feature groups. Trait-specific addition tests across prompts indicate that the embedding-based models can most consistently be enhanced in content assessment when combined with morphological complexity features. Most consistent performance gains in the organization traits are achieved when embeddings are combined with length features, and most consistent performance gains in the assessment of the language traits when combined with lexical complexity, error, and occurrence features. Cross-prompt scoring again reveals slight advantages for the feature-based models.
(De)motivating Zero-Performing Students With Negative Feedback: Does the Salience of Performance Information Matter?
Marlene Steinbach, Johanna Fleckenstein, Livia Kuklick, Jennifer Meyer
Journal of Computer Assisted Learning, 2025
Background: Providing students with information on their current performance could help them improve by stimulating their reflection, but negative feedback that saliently mirrors task-related failure can harm motivation. In the context of automated scoring based on artificial intelligence, we explored how feedback on written texts might be designed to be least detrimental for zero-performing students who are likely to receive negative feedback frequently and might suffer from its motivational consequences. Objectives: This experiment set out to investigate whether making the negative performance information in automated feedback messages less salient reduces the potential threat of negative feedback for zero-performing students' task-specific self-concept, intrinsic value, and performance. Methods: A sample of 105 (mean age = 13.97 years) zero-performing students received negative feedback with either more or less salient performance information after completing an English writing task. We used regression analysis to examine pre–post effects and group differences in self-concept, intrinsic value, and performance. Results and Conclusions: The analyses showed that zero-performing students' performance improved but their self-concept and intrinsic value declined over the course of two writing tasks, with feedback provided after the initial task. Contrary to expectations, our findings showed that students' task-specific self-concept and intrinsic value declined more in the condition with less salient performance information (i.e., without a red cross as a salient visual performance cue). Our findings highlight the motivational potential of performance information and are discussed in terms of the need for further research into how negative feedback can be designed to effectively motivate and support zero-performing learners.
“Can (A)I do this task?” The role of AI as a socializer of students' self-beliefs of their abilities
Thorben Jansen, Jennifer Meyer, Johanna Fleckenstein, Allan Wigfield, Jens Möller
Learning and Individual Differences, 2025
Students' beliefs about their own academic abilities – their answers to the question “Can I do this task?” – are crucial to their success. Learning within AI-supported environments, alongside AI agents, influences students' beliefs about their abilities. Studies show enhancing and diminishing influences that remain unexplained by motivation theory, limiting theories' explanatory effect in AI-supported learning environments, and leaving educational technology research without a solid theoretical foundation. The following article specifies the situated expectancy-value theory (SEVT) for students' self-belief formation in the context of an AI-driven society. The expanded theory conceptualizes AI as becoming an artificial socializer, capturing the role of AI as an instrumental tool and social agents making up students' individual environments. Bridging AI and motivational research provides a framework for systematically investigating students' self-beliefs in AI-supported contexts and how educational technology can support positive self-beliefs, considering students' contexts and individual differences.
• Summarizes and explains empirical influences of AI on students' ability self-beliefs.
• Integrates AI into situated expectancy-value theory.
• Provides a framework to investigate AI effects on students' ability self-beliefs.
• Describes potential mechanisms in the self-belief formation considering AI.
Nonengagement and unsuccessful engagement with feedback in lower secondary education: The role of student characteristics
Jennifer Meyer, Thorben Jansen, Johanna Fleckenstein
Contemporary Educational Psychology, 2025
• We investigated feedback engagement in a sample of lower-secondary students.
• Findings show that 20% of students did not engage, 47% unsuccessfully engaged in a text revision.
• We focused on the role of individual differences in feedback engagement.
• We considered the role of gender, cognitive and noncognitive variables.
Feedback can be a powerful learning intervention and learners’ active engagement is assumed to be one of the most important determinants of feedback effectiveness. But not all students successfully engage with feedback. In the present study, we aimed to make students’ engagement with feedback visible by focusing on their text revisions as an indicator of feedback response. On the basis of theoretical models of feedback processing, we differentiated between behavioral nonengagement (i.e., not revising at all after receiving feedback) and unsuccessful engagement (i.e., revising after receiving feedback, but not improving in the process). Capitalizing on this distinction, we compared the characteristics of students in both groups with those of students who (successfully) engaged with the feedback. We provided automated computer-based feedback on a writing task to a sample of 937 students in lower secondary education in Germany (49% female, Grades 7 [28%], 8 [29%], and 9 [43%]), asking students to revise their texts according to the feedback. We found that 20% of the students did not make any revisions to their text after receiving feedback (nonengagement) and that 47% of the students did not improve their performance after working with the feedback during a text revision (unsuccessful engagement). Male students and students with lower cognitive abilities were more likely to show nonengagement. For unsuccessful engagement, cognitive abilities and the English grade were relevant predictors, hinting at the role that domain-specific competencies play in translating feedback into effective revision. We also found significant positive associations of intrinsic task value with successful feedback engagement. We discuss how future research could advance understanding of feedback processing by taking a more fine-grained approach to investigating feedback response.
Understanding individual differences in students’ responses to technology-based feedback on a writing task: the role of achievement motives and initial task performance
Jennifer Meyer, Thorben Jansen, Martin Daumiller, Johanna Fleckenstein
Journal of Research on Technology in Education, 2025
Computer-based feedback interventions are generally effective—but not for all students. Students’ achievement motives (hopes for success, fear of failure) might explain how students respond to feedback in interplay with initial task performance. In a sample of 949 secondary school students in Germany (Grades 7–9) we found that when the task criterion was initially not met, higher hopes for success were positively associated with students’ subsequent task performance after receiving automated feedback. When the criterion was initially met, a higher fear of failure was negatively related to the subsequent task performance. Our results suggest that achievement motives can play a complex role at different levels of initial task performance. These insights could inform personalized feedback design to enhance feedback effectiveness in cognitively demanding tasks.
Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human-extracted data from systematic review databases.
Thorben Jansen, Lucas W. Liebenow, Ute Mertens, Fabian T. C. Schmidt, Julian F. Lohmann, Johanna Fleckenstein, Jennifer Meyer
Psychological Bulletin, 2025
Psychological science requires reliable measures. Within systematic literature reviews, reliability hinges on high interrater agreement during data extraction. Yet, the extraction process has been time-consuming. Efforts to accelerate the process using technology have shown limited success until generative artificial intelligence (genAI), particularly large language models (LLMs), accurately extracted variables from medical studies. Nonetheless, for psychological researchers, it remains unclear how to utilize genAI for data extraction, given the range of tested variables, the medical context, and the variability in accuracy. We systematically assessed extraction accuracy and error patterns across domains in psychology by comparing genAI-extracted and human-extracted data from 22 systematic review databases published in the Psychological Bulletin. Eight LLMs extracted 312,329 data points from 2,179 studies on 186 variables. LLM extractions achieved unacceptable accuracy on all metrics for 20% of variables. For 46% of variables, accuracy was acceptable for some metrics and unacceptable for others. LLMs reached acceptable but not high accuracy on all metrics in 15%, high but not excellent in 8%, and excellent accuracy in 12% of variables. Accuracy varied most between variables, less between systematic reviews, and least between LLMs. Moderator analyses using a hierarchical logistic regression, hierarchical linear model, and meta-analysis revealed that accuracy was higher for variables describing studies' context and moderator variables compared to variables for effect size calculation. Also, accuracy was higher in systematic reviews with more detailed variable descriptions and positively correlated with model sizes. We discuss directions for investigating ways to use genAI to accelerate data extractions while ensuring meaningful human control.
Understanding the effectiveness of automated feedback: Using process data to uncover the role of behavioral engagement
Ronja Schiller, Johanna Fleckenstein, Ute Mertens, Andrea Horbach, Jennifer Meyer
Computers and Education, 2024
In the last couple of years, feedback research has shifted towards a feedback-as-process approach, taking a learner-centered perspective and focusing on the proactive role of the learner in feedback effectiveness. Process measures can provide new insights into the role of the learner by making learners’ actual behavioral engagement visible. We conducted an experimental study, comparing two groups (feedback vs. no feedback) of English-as-a-foreign-language learners in lower secondary schools (N = 189). The learners completed a writing task and revised it with or without feedback. A second writing task served as a transfer task. Performance was automatically assessed using a scoring algorithm. To determine the level of learners’ behavioral engagement during the text revision, we used the revision time and the edit distance (i.e., a similarity measure) as behavioral measures. Our analyses showed a positive effect of feedback on text revision. We found a full mediation of the effect of feedback on text revision through revision time with an estimated portion of mediation (POM) of .63*** and a partial mediation of the feedback effect on text revision through the edit distance with a POM of .30**. We did not find significant mediation effects of either engagement variable regarding performance in a transfer task. Our findings contribute to the understanding of feedback effectiveness, highlighting the central role of learner engagement in the feedback process.
• We investigate feedback effectiveness from a process-oriented perspective.
• Log-data and computer-linguistic features as objective indicators of engagement.
• Feedback is associated with higher levels of engagement during text revision.
• Behavioral engagement mediates positive effect of feedback on revision performance.
• We did not find an effect of the feedback on a transfer task.
How am I going? Behavioral engagement mediates the effect of individual feedback on writing performance
Johanna Fleckenstein, Thorben Jansen, Jennifer Meyer, Ruth Trüb, Emily E. Raubach, Stefan D. Keller
Learning and Instruction, 2024
Language quality, content, structure: What analytic ratings tell us about EFL writing skills at upper secondary school level in Germany and Switzerland
Stefan D. Keller, Julian Lohmann, Ruth Trüb, Johanna Fleckenstein, Jennifer Meyer, Thorben Jansen, Jens Möller
Journal of Second Language Writing, 2024
Do teachers spot AI? Evaluating the detectability of AI-generated texts among student essays
Johanna Fleckenstein, Jennifer Meyer, Thorben Jansen, Stefan D. Keller, Olaf Köller, Jens Möller
Computers and Education: Artificial Intelligence, 2024
Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions
Jennifer Meyer, Thorben Jansen, Ronja Schiller, Lucas W. Liebenow, Marlene Steinbach, Andrea Horbach, Johanna Fleckenstein
Computers and Education: Artificial Intelligence, 2024
Individualizing goal-setting interventions using automated writing evaluation to support secondary school students’ text revisions
Thorben Jansen, Jennifer Meyer, Johanna Fleckenstein, Andrea Horbach, Stefan Keller, Jens Möller
Learning and Instruction, 2024
Two-way immersion promotes additional language learning: performance of bilingual sixth-grade students in English as a third language
Sandra Preusler, Johanna Fleckenstein, Steffen Zitzmann, Jürgen Baumert, Jens Möller
International Journal of Bilingual Education and Bilingualism, 2024
Conscientiousness and Cognitive Ability as Predictors of Academic Achievement: Evidence of Synergistic Effects From Integrative Data Analysis
Jennifer Meyer, Oliver Lüdtke, Fabian T. C. Schmidt, Johanna Fleckenstein, Ulrich Trautwein, Olaf Köller
European Journal of Personality, 2024
Comparing Generative AI and Expert Feedback to Students’ Writing: Insights from Student Teachers
Thorben Jansen, Lars Höft, Luca Bahr, Johanna Fleckenstein, Jens Möller, Olaf Köller, Jennifer Meyer
Psychologie in Erziehung und Unterricht, 2024
Machine Learning in the educational context: Evidence of prediction accuracy considering essays in English as a foreign language
Jennifer Meyer, Thorben Jansen, Johanna Fleckenstein, Stefan Keller, Olaf Köller
Zeitschrift für Pädagogische Psychologie, 2023
A closer look at the domain-specific associations of openness with language achievement: Evidence on the role of intrinsic value from two large-scale longitudinal studies
Jennifer Meyer, Fabian T. C. Schmidt, Johanna Fleckenstein, Olaf Köller
British Journal of Educational Psychology, 2023
Automated feedback and writing: a multi-level meta-analysis of effects on students' performance
Johanna Fleckenstein, Lucas W. Liebenow, Jennifer Meyer
Frontiers in Artificial Intelligence, 2023
Sequence Tagging in EFL Email Texts as Feedback for Language Learners
Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023), 2023
Read at home to do well at school: informal reading predicts achievement and motivation in English as a foreign language
Jennifer Meyer, Johanna Fleckenstein, Maleika Krüger, Stefan Daniel Keller, Nicolas Hübner
Frontiers in Psychology, 2023
Judgment accuracy of German student texts: Do teacher experience and content knowledge matter?
Jens Möller, Thorben Jansen, Johanna Fleckenstein, Nils Machts, Jennifer Meyer, Raja Reble
Teaching and Teacher Education, 2022
Correction to: Studies on the Acculturation of Young Refugees in the Educational Domain: A Scoping Review of Research and Methods (Adolescent Research Review, (2021), 6, 1, (15-31), 10.1007/s40894-019-00129-7)
Débora B. Maehler, Steffen Pötzschke, Howard Ramos, Paul Pritchard, Johanna Fleckenstein
Adolescent Research Review, 2021
Studies on the Acculturation of Young Refugees in the Educational Domain: A Scoping Review of Research and Methods
Débora B. Maehler, Steffen Pötzschke, Howard Ramos, Paul Pritchard, Johanna Fleckenstein
Adolescent Research Review, 2021
The Long-Term Proficiency of Early, Middle, and Late Starters Learning English as a Foreign Language at School: A Narrative Review and Empirical Study
Jürgen Baumert, Johanna Fleckenstein, Michael Leucht, Olaf Köller, Jens Möller
Language Learning, 2020
Is a Long Essay Always a Good Essay? The Effect of Text Length on Writing Assessment
Johanna Fleckenstein, Jennifer Meyer, Thorben Jansen, Stefan Keller, Olaf Köller
Frontiers in Psychology, 2020
Is younger always better? Early foreign language learning at primary school
Johanna Fleckenstein, Jens Möller, Jürgen Baumert
Zeitschrift für Pädagogische Psychologie, 2020
English writing skills of students in upper secondary education: Results from an empirical study in Switzerland and Germany
Stefan D. Keller, Johanna Fleckenstein, Maleika Krüger, Olaf Köller, André A. Rupp
Journal of Second Language Writing, 2020
Linking TOEFL iBT® writing rubrics to CEFR levels: Cut scores and validity evidence from a standard setting study
Johanna Fleckenstein, Stefan Keller, Maleika Krüger, Richard J. Tannenbaum, Olaf Köller
Assessing Writing, 2020
Writing skills in English as a foreign language in upper secondary school
Olaf Köller, Johanna Fleckenstein, Jennifer Meyer, Anna Lara Paeske, Maleika Krüger, Andre A. Rupp, Stefan Keller
Zeitschrift für Erziehungswissenschaft, 2019
Expectancy value interactions and academic achievement: Differential relationships with achievement measures
Jennifer Meyer, Johanna Fleckenstein, Olaf Köller
Contemporary Educational Psychology, 2019
Measuring grit: A German validation and a domain-specific approach to grit
Fabian T. C. Schmidt, Johanna Fleckenstein, Jan Retelsdorf, Lauren Eskreis-Winkler, Jens Möller
European Journal of Psychological Assessment, 2019
Promoting mathematics achievement in one-way immersion: Performance development over four years of elementary school
Johanna Fleckenstein, Sandra Kristina Gebauer, Jens Möller
Contemporary Educational Psychology, 2019
The relationship of personality traits and different measures of domain-specific achievement in upper secondary education
Jennifer Meyer, Johanna Fleckenstein, Jan Retelsdorf, Olaf Köller
Learning and Individual Differences, 2019
Same Same, but Different? Relations Between Facets of Conscientiousness and Grit
Fabian T.C. Schmidt, Gabriel Nagy, Johanna Fleckenstein, Jens Möller, Jan Retelsdorf
European Journal of Personality, 2018
Multilingualism as a resource: Dual-immersion students’ achievement in English as a third language
Johanna Fleckenstein, Jens Möller, Jürgen Baumert
Zeitschrift für Erziehungswissenschaft, 2018
Editorial
Jens Möller, Johanna Fleckenstein, Sandra Preusler, Isabell Paulick, Jürgen Baumert
Zeitschrift für Erziehungswissenschaft, 2018
Variations and effects of bilingual education in schools
Jens Möller, Johanna Fleckenstein, Friederike Hohenstein, Sandra Preusler, Isabell Paulick, Jürgen Baumert
Zeitschrift für Erziehungswissenschaft, 2018
Teachers’ Judgement Accuracy Concerning CEFR Levels of Prospective University Students
Johanna Fleckenstein, Michael Leucht, Olaf Köller
Language Assessment Quarterly, 2018
Proficient beyond borders: assessing non-native speakers in a native speakers’ framework
Johanna Fleckenstein, Michael Leucht, Hans Anand Pant, Olaf Köller
Large Scale Assessments in Education, 2016
What works in school? Expert and novice teachers’ beliefs about school effectiveness
Johanna Fleckenstein, Friederike Zimmermann, Olaf Köller, Jens Möller
Frontline Learning Research, 2015
Who's got Grit? Perseverance and consistency of interest in pre-service teachers. A German adaptation of the 12-Item Grit Scale
Johanna Fleckenstein, Fabian T.C. Schmidt, Jens Möller
Psychologie in Erziehung und Unterricht, 2014
