Open Source Language Models Can Provide Feedback : Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Koutcheme, Charles; Dainese, Nicola; Sarsa, Sami; Hellas, Arto; Leinonen, Juho; Denny, Paul

doi:10.1145/3649217.3653612

dc.contributor.author	Koutcheme, Charles
dc.contributor.author	Dainese, Nicola
dc.contributor.author	Sarsa, Sami
dc.contributor.author	Hellas, Arto
dc.contributor.author	Leinonen, Juho
dc.contributor.author	Denny, Paul
dc.date.accessioned	2024-08-07T08:50:54Z
dc.date.available	2024-08-07T08:50:54Z
dc.date.issued	2024
dc.identifier.citation	Koutcheme, C., Dainese, N., Sarsa, S., Hellas, A., Leinonen, J., & Denny, P. (2024). Open Source Language Models Can Provide Feedback : Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge. In ITiCSE 2024 : Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1 (pp. 52-58). ACM. https://doi.org/10.1145/3649217.3653612
dc.identifier.other	CONVID_221088392
dc.identifier.uri	https://jyx.jyu.fi/handle/123456789/96533
dc.description.abstract	Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.	en
dc.format.extent	754
dc.format.mimetype	application/pdf
dc.language.iso	eng
dc.publisher	ACM
dc.relation.ispartof	ITiCSE 2024 : Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1
dc.rights	CC BY 4.0
dc.subject.other	open source
dc.subject.other	large language models
dc.subject.other	generative AI
dc.subject.other	LLMs
dc.subject.other	automatic feedback
dc.subject.other	automatic evaluation
dc.subject.other	programming feedback
dc.subject.other	LLM-as-a-judge
dc.subject.other	Zephyr
dc.subject.other	Code Llama
dc.subject.other	GPT-4
dc.title	Open Source Language Models Can Provide Feedback : Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge
dc.type	conferenceObject
dc.identifier.urn	URN:NBN:fi:jyu-202408075409
dc.contributor.laitos	Informaatioteknologian tiedekunta	fi
dc.contributor.laitos	Faculty of Information Technology	en
dc.type.uri	http://purl.org/eprint/type/ConferencePaper
dc.relation.isbn	979-8-4007-0600-4
dc.type.coar	http://purl.org/coar/resource_type/c_5794
dc.description.reviewstatus	peerReviewed
dc.format.pagerange	52-58
dc.type.version	publishedVersion
dc.rights.copyright	© 2024 the Authors
dc.rights.accesslevel	openAccess	fi
dc.relation.conference	Conference on Innovation and Technology in Computer Science Education
dc.subject.yso	programming
dc.subject.yso	chatbots
dc.subject.yso	open source
dc.subject.yso	artificial intelligence
dc.subject.yso	higher education (teaching)
dc.subject.yso	feedback
dc.subject.yso	language models
dc.format.content	fulltext
jyx.subject.uri	http://www.yso.fi/onto/yso/p4887
jyx.subject.uri	http://www.yso.fi/onto/yso/p39028
jyx.subject.uri	http://www.yso.fi/onto/yso/p17089
jyx.subject.uri	http://www.yso.fi/onto/yso/p2616
jyx.subject.uri	http://www.yso.fi/onto/yso/p1246
jyx.subject.uri	http://www.yso.fi/onto/yso/p1236
jyx.subject.uri	http://www.yso.fi/onto/yso/p40335
dc.rights.url	https://creativecommons.org/licenses/by/4.0/
dc.relation.doi	10.1145/3649217.3653612
jyx.fundinginformation	This research was partially supported by the Research Council of Finland (Academy Research Fellow grant number 356114).
dc.type.okm	A4

Files in this item

Name: 3649217.3653612.pdf
Size: 911.6Kb
Format: PDF
Description: publishedVersion


This item appears in the following collections

Faculty of Information Technology [2293]


Unless otherwise noted, this item is licensed under CC BY 4.0

Showing items with a similar title or keywords.

Evaluating Contextually Personalized Programming Exercises Created with Generative AI

Logacheva, Evanfiya; Hellas, Arto; Prather, James; Sarsa, Sami; Leinonen, Juho (ACM, 2024)

Programming skills are typically developed through completing various hands-on exercises. Such programming problems can be contextualized to students’ interests and cultural backgrounds. Prior research in educational ...

"Like a Nesting Doll" : Analyzing Recursion Analogies Generated by CS Students Using Large Language Models

Bernstein, Seth; Denny, Paul; Leinonen, Juho; Kan, Lauren; Hellas, Arto; Littlefield, Matt; Sarsa, Sami; Macneil, Stephen (ACM, 2024)

Grasping complex computing concepts often poses a challenge for students who struggle to anchor these new ideas to familiar experiences and understandings. To help with this, a good analogy can bridge the gap between ...

How do Finnish and Chinese students’ diverse pedagogical experiences shape feedback interpretation?

Liontou, Magdalini (Suomen soveltavan kielitieteen yhdistys ry, 2023)

Due to the dissemination of joint degree programmes in higher education, more students from different educational backgrounds are exposed to the same teaching and assessment without sharing a common pedagogical culture. ...

Unfolding principles for student peer feedback : a comparative analysis of examples across higher education contexts

Ellegaard, Marianne; Niss, Martin; Bruun, Jesper; Lämsä, Joni; Voetman Christiansen, Frederik; Linell, Gry Green; Fogh Larsen, Camilla; Nyman, Rimma; Johannsen, Bjørn Friis (Cappelen Damm AS - Cappelen Damm Akademisk, 2022)

In this paper we conceptualize formative peer feedback principles by analyzing and comparing six empirical examples of formative peer feedback in a set of international STEM (science, technology, engineering, and mathematics) ...

The Relevance of Versatile Learning Online Assessment Feedback for University Student

Maunula, Minna; Maunumäki, Minna; Harju-Luukkainen, Heidi (International Journal of Multidisciplinary Perspectives in Higher Education, 2023)

In the process of learning, assessment is relevant from multiple perspectives. Learning assessment guides student learning and teaching either knowingly or unconsciously. This study takes a closer look at the meanings given ...
