Peer assessment of code readability

Photo credit: Pixabay, Pexels, CC0

In this post, Dr Paul Anderson, a Research Fellow in the School of Informatics, shares some unexpected findings from a research project exploring a peer-based approach called Adaptive Comparative Judgement…

You may think that marking computer code is a very objective activity – we can just run the code and see if it produces the “right answer”. We can even write an “automarking” program that does this for us. But for anything more than the most trivial examples, this isn’t the whole story. As well as providing the instructions for the machine, the code acts as a description of the algorithm for other humans to read. If this isn’t clear, then the code isn’t useful in any “real” situation: nobody else can check that it isn’t hiding some obscure bug or security flaw, or modify it to keep it working when the world changes. Assessing these “non-functional” aspects of the code is often neglected because it is hard to do. “Readability” is a subjective criterion, so large classes with multiple markers present a big challenge in terms of both resources and consistency.

I was inspired by a report by Hardy et al. (2016), Ask, Answer, Assess: Peer learning from student-generated content [1], to try to solve this problem using a peer approach based on Adaptive Comparative Judgement [2]. Adaptive Comparative Judgement (ACJ) starts from the premise that it is much easier to compare two samples and just say which is “best” than it is to produce a consistent ranking from a large number of samples. The ACJ algorithm selects pairs of samples to be compared, and then computes a ranking based on the results of these pairwise comparisons. A large number of comparisons are needed, but each one can usually be made very quickly. This has proven extremely successful for marking large exams, with more reliable results and less susceptibility to marker variation than traditional methods.
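To give a flavour of how a ranking can emerge from pairwise judgements, here is a minimal sketch in Python. It is not the ACJ software we used – it fits a simple Bradley–Terry model (a standard statistical model for paired comparisons, related to the theory behind ACJ) to a list of “winner beats loser” judgements, and all names and data are invented for illustration:

```python
from collections import defaultdict

def rank_from_comparisons(comparisons, iterations=100):
    """Estimate a ranking from pairwise (winner, loser) judgements
    by iteratively fitting Bradley-Terry strengths (illustrative only)."""
    items = set()
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # times each unordered pair was compared
    for winner, loser in comparisons:
        items.update((winner, loser))
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new = {}
        for i in items:
            # Minorisation-maximisation update: wins divided by a sum
            # over the pairs this item actually appeared in.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and pair_counts[frozenset((i, j))]
            )
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {i: s / total for i, s in new.items()}
    return sorted(items, key=strength.get, reverse=True)

# Hypothetical judgements: C beats A and B, A beats B.
judgements = [("C", "A"), ("C", "B"), ("A", "B"), ("C", "A")]
print(rank_from_comparisons(judgements))  # ['C', 'A', 'B']
```

The real appeal of the *adaptive* part of ACJ is in how the next pair to compare is chosen – typically a pair whose current strengths are close, so that each judgement carries maximum information.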

Hardy et al. (2016) used this ACJ technique but asked the students themselves to make the judgements. This seemed to me a brilliant solution – not only would it provide a peer ranking which could be used to inform the marking, but it would also expose the students to many different styles and help them appreciate what makes code easy to read (or not).

We initially implemented this in 2017 for an MSc course on Java Programming (IJP). After the students submitted their code, we asked them to visit a website which presented them with pairs of solutions from other students and asked them to decide which they thought was clearer and easier to read, simply using their own subjective judgement. We also gave them the opportunity to leave comments on the samples, which were later collated and passed back to the authors. We did not give any formal credit for making these comparisons, but we did require that they completed a small number of them.

The results from this initial experiment were inconclusive. Most students appreciated the opportunity to see others’ code, and many left useful peer feedback. But the algorithm failed to generate a ranking that correlated in any way with the manual marking.

The following year, we repeated the experiment. This time, we used draft design documents rather than code. We hoped that the opportunity to see other students’ solutions before submitting their own final design would be sufficient motivation to participate. We also used an improved version of the software, and avoided some of the procedural mistakes from the first experiment.

75% of the students engaged with this, but the correlation with the manual marking was again poor – certainly not useful as a contribution to the marking process. Many students found it useful to see others’ ideas, but it is difficult to tell how much this contributed to any improvement in the final designs. Notably, several students actually went on to change their designs for the worse after seeing the other students’ submissions!
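For readers curious how “poor correlation” between two rankings can be quantified, a common choice is Spearman’s rank correlation. The post does not say exactly which statistic we used, so the following is just an illustrative sketch, with made-up submission names:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same items,
    each given as a list ordered best-to-worst (assumes no ties)."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # Sum of squared differences between each item's two rank positions.
    d_sq = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Hypothetical data: identical rankings give 1.0, reversed give -1.0.
acj_rank = ["s1", "s2", "s3", "s4"]
manual_rank = ["s4", "s3", "s2", "s1"]
print(spearman_rho(acj_rank, acj_rank))     # 1.0
print(spearman_rho(acj_rank, manual_rank))  # -1.0
```

A value near zero – the situation we observed – means the peer-generated ordering carries essentially no information about the manual marks.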

Figure 1: ACJ vs manual marking

I still believe that this approach has great potential, both for assessment and for the opportunity it provides for students to see alternative approaches. It also scales particularly well to large numbers, and develops the students’ assessment literacy. But I’m still uncertain why the experiments performed so poorly – possible factors range from the algorithm itself to the students’ uncertainty about the criteria they should be considering. We are not repeating the experiment with this year’s course, but we are continuing to experiment with the algorithms and simulations, and I hope to try this again in the future. For presentation slides and simulation videos from this project, you can visit my homepage.

This post is based on a project, Helping students to write readable code, which was funded by the Principal’s Teaching Award Scheme.

[1] Hardy, J., Galloway, R., Rhind, S., McBride, K., Hughes, K., and Donnelly, R. (2016). Ask, Answer, Assess: Peer learning from student-generated content. Higher Education Academy, UK.

[2] Pollitt, A. (2012). The method of Adaptive Comparative Judgement. Assessment in Education: Principles, Policy & Practice, 19(3), pp. 281-300.

Paul Anderson

Paul Anderson is a Research Fellow in the School of Informatics. He has been teaching programming for nearly 40 years and is currently teaching an “Introduction to Practical Programming With Objects” for the MSc in Informatics.
