Last year I posted an article, The Unfairness of Measuring Teaching Performance, concerning anonymous student comments that said that the teacher was “too old”. An article published on the online site Phys.org found that male teachers were most likely to be evaluated the highest by students, and female teachers from a non-English-speaking background the lowest. Further, the bias showed up most in student surveys in Science and Business and was largely absent from student surveys in Engineering and other disciplines.
This study was based upon 500,000 student surveys of teaching at the University of NSW, Sydney between 2010 and 2016. It involved more than 3000 teachers over 2000 courses across 5 Faculties.
In my previous article, I strongly supported teaching surveys as a tool for professional teacher development using tailored questions that are teacher selectable. My argument there was that it is
… the impersonal nature of the survey, as well as the fact that it is exclusively university administered, that is the heart of the problem.
The use of institution-centric and standardised teaching surveys promotes sameness of approach to teaching. Nowadays, completion rates of such surveys are often very low, suggesting that students tend to complete them out of a sense of obligation. The example given in my earlier article shows 30 completions for the teacher-initiated survey I used in 1998 (100% of the students attending the lecture at which the survey was administered and ∼95% of the enrolled students) compared with 4 completions (24% of the enrolled students) for the standardised institutional survey in 2015.
In saying this I’m not expressing a desire to turn back technology — I’m expressing a desire for teachers to have more control over the way that the survey technology is used and integrated into the classroom experience, even if that classroom experience is more virtual space than physical. In my opinion, this is the answer to the question I posed in my earlier article:
This raises an even more pertinent question: why after 17 years, with all the intervening developments in information technology, do we have less fairness in evaluating the quality of teaching in the university system in 2018 than we had in 1998?
What’s to be done?
The article in Phys.org states that the use of student evaluations of teaching is fallible as a measure of teacher performance because of the bias that this study demonstrates. This implies that the use of such data in formal staff evaluations is therefore also fallible.
Given the demonstrated bias found in this study, student survey data have no place being used to assess the performance of teachers or to help decide questions of whether a given academic should be re-hired, promoted or fired. To do so can only further entrench the bias towards male leadership in the university sector.
Does this mean that we can expect to see less use of student survey metrics in higher education? This is highly unlikely given the Australian Government’s commitment to the “Quality Indicators for Learning and Teaching” (QILT) initiative and similar measures by other Governments such as the European Commission’s U-Multirank.
Indeed, in its conclusions, the article in Phys.org goes on to say that universities need to reduce the sources of bias by better educating students and staff, as well as encouraging more women and members of minorities to work at universities at all levels. Of course, these are admirable objectives, but they are disappointing as a conclusion because they add nothing to what universities have already declared to be their goals and responsibilities.
The article on bias in Phys.org sidesteps how the use of quality metrics, based upon student surveys of the learning experience, could be improved. The underlying assumption seems to be that quality metrics are sacrosanct and beyond our reach to improve.
This is very disappointing for reasons I’ll outline in the next section.
Gaming the System
It is my conjecture that the evidence of bias from the statistics presented in the Phys.org article reflects that men who rate higher in the student outcomes metrics are simply better at gaming the system than are women and members of other minorities. Furthermore, those men game the system to exclude others.
I can’t prove this of course, but it follows from Goodhart’s Law, named after the British economist who formulated it. Marilyn Strathern’s rephrasing of Goodhart’s Law states:
“When a measure becomes a target, it ceases to be a good measure.” Marilyn Strathern
This occurs when individuals try to anticipate the outcome of policy changes by taking actions that change the outcomes. To put it another way, anything that can be measured and rewarded will be gamed. In the university sector, the gaming that has received the most attention is the lowering of the standards required to pass.
In my experience, I’ve seen academics make runs to the supermarket or candy store prior to their lectures to stock up on wrapped sweets that they then throw out to students who answer their questions correctly in the lecture hall. Usually, these academics do well in student evaluations. They might argue that this is an effective engagement technique, but I say that it displays either a discredited form of operant conditioning or, worse, gaming the system of student evaluations.
There are other, more subtle forms of gaming the system. Because students tend to view surveys as a fill-in-the-form exercise, they tend to rank teachers according to their ranking in the university hierarchy: Dean of Faculty higher than Head of School higher than Professor higher than lecturer higher than casual tutor. This is a kind of donkey vote for when you can’t remember much about the teacher concerned. This effect alone could explain some of the bias observed. This behaviour might be reinforced in a class by the senior academic in the way they speak to students, a subtle form of gaming.
The data reported by Phys.org show a statistically significant bias towards males rating higher in student surveys of teaching, based upon a large sample size of 500,000 evaluations and several thousand teachers and courses. You can either choose to treat these findings dismissively and say that they are a problem that we’re trying to deal with (as the article itself tends to do), or you can say that these findings reveal a dark and ugly side of academic behaviour that should be exposed to the very scrutiny and accountability that teaching quality metrics claim to bring about. In this case, the quality metrics, or the system of implementation, have been found wanting.
Perhaps a little more trust between university administrators and academics would go a long way towards improving the quality of teaching and reducing the desire by some (mostly male) players in the academic sector to game the system and thereby exclude others from gaining recognition by merit.
Acknowledgement: I’d like to thank Jillian Rowe from Griffith University for reading over this article and helping me to sharpen its focus, trim the wordage and generally save me from embarrassing typos and grammatical errors.
 By “unfairness” I’m referring to comments that are fiercely critical of the person (in my earlier post) or bias based upon gender, race or sexuality (this post).
 Merlin Crossley, Emma Johnston and Yanan Fan, “Male teachers are most likely to rate highly in university student feedback”, Phys.org, available online; published 14 Feb; accessed 15 Feb.
 Le Hoa Phan and Kristin Childs, “Student surveys of teaching & learning quality”, Institute for Teaching and Learning Innovation (ITaLI), The University of Queensland, available online; published Jan 2017; accessed 15 Feb.
 Jerry Muller, “The Tyranny of Metrics”, Princeton University Press (2018).
2 Replies to “The Unfairness of Measuring Teaching Performance – Revisited”
Interesting. In data analysis for software development, a survey is about seeing whether a current system is running well; the survey is there to help with evaluation, with a new plan or strategy implemented, and changes made, only if needed. It seems the Board are using data analysis. If the response to the survey among those being gauged has dropped from 100% down to 24%, the question should be whether this evaluation represents the majority, and whether the evaluation is getting out through the right medium, given today’s tech.
You bring an interesting perspective, but what is being referred to here is a benchmarking exercise against standard measures called metrics, rather than statistics per se. It’s the HR department that sets the benchmarks and metrics under the guidance of the senior management of the organisation. They usually take into account sector-wide standards as well.