Assessment is a computing education grand challenge

How do you know what someone knows about computing?

This question is foundational and pops up everywhere. It arises in classrooms, where teachers need to accurately determine what a student has learned, both to help them learn better (through formative assessments) and to establish a record of how well they’ve learned it (through summative assessments). It also arises in professional settings such as hiring: when an applicant says they “know” Java, what does that actually mean? What is it predictive of? Surely there are better ways for an employer to gauge how well someone knows a programming language than self-report or having passed a university class. We don’t even know how well these indicators actually predict ability.

Isn’t this just a matter of writing tests? It turns out that writing good tests is very difficult. It’s not enough to write an exam that asks people to define concepts and solve problems. If the wording of a question is off, people may get it wrong even though they know the material, or get it right even though they don’t. These are examples of poor test validity, where the test measures something other than the knowledge one is trying to assess. Some tests also aren’t reliable, in that giving the test repeatedly produces different results for the same individual in different settings. Reliability issues can arise from ambiguous wording, ill-defined concepts, or poorly constructed definitions of correct answers that lead to inconsistent scoring.
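To make the idea of reliability a bit more concrete, here is a minimal sketch of one common way to quantify it: the correlation between the same students’ scores on two sittings of the same test (often called test-retest reliability). The scores are made up for illustration, and the function names are my own, not part of any standard assessment toolkit.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five students on two sittings of one exam.
first_sitting = [72, 85, 60, 90, 78]
second_sitting = [70, 88, 65, 91, 74]

# Values near 1.0 mean students are ranked consistently across sittings;
# values near 0 mean the test is mostly measuring noise.
r = pearson(first_sitting, second_sitting)
print(f"test-retest reliability: {r:.2f}")
```

A high correlation here says nothing about validity: a test can rank students consistently while still measuring the wrong thing, such as reading speed rather than programming knowledge.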

Making a reliable, valid test is a considerable amount of work. Several of the students from Mark Guzdial’s lab have spent a substantial portion of their time as doctoral students developing reliable, valid tests for measuring how well students can mentally simulate (or trace) the behavior of simple imperative programs (see the FCS1 and SCS1). Even after their rigorous efforts, these assessments are hard to reproduce and sensitive to overuse, making it difficult to scale these efforts to other concepts or other languages.

The implications of unreliable, low-validity tests can be severe. Bad tests in introductory programming classes can fail students who actually know quite a lot or pass students who know very little. This poor signal can trickle down to employers, who might use courses, grades, and other credentials as indicators of ability. And because tests are garbage-in, garbage-out, all of this happens without a teacher or employer ever really knowing, producing a garbled, sometimes overconfident sense of what students know.

I’ve seen these problems as a student myself. I remember graduating back in 2002 with my undergraduate degree in CS, with many of my high-performing peers admitting that, despite all of their high grades, they still couldn’t sit down in front of an empty code editor and write a program to solve a problem. Sure, they solved lots of problems in class with the help of peers, TAs, and highly scaffolded assignments within the scope of problems their teachers had discussed. But they often didn’t know why their solutions worked. I remember getting partial credit for regurgitating partial solutions that I really didn’t understand, resulting in inflated grades that miscommunicated the level of my understanding.

Does it matter that developers understand the code that they write if the code still works? If correct code stayed correct and code could be “correct enough,” this might not matter. Unfortunately, correctness matters: programs receive unexpected inputs and developers have to debug, and developers can only do this well with a deep understanding of the semantics of a program’s execution. Moreover, this deep understanding likely would have prevented some of these defects from occurring in the first place.

All this said, there are some people who obviously develop a deep, nuanced, accurate understanding of computation. These people are our best programmers, our computer science faculty, and others who’ve likely devoted their lives to eradicating every misconception about computation from their minds through incredible amounts of deliberate practice. I suspect these individuals aren’t confined to the limits of assessment because they’ve learned to self-assess their knowledge. In fact, computing might be unique in that people can actually test their understanding of computation by carefully probing program behavior, using the computer itself as a source of feedback about their understanding. Perhaps this is how people are able to develop robust understandings of computation despite the failures of assessment. This might also explain why CS teachers appear to believe that some students “get it” and some don’t: what’s really going on is that some have an insatiable curiosity about how computers behave, and use that to fuel a limitless quest for more robust knowledge of computing.
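As a small illustration of what “probing program behavior” can look like, here is a sketch in Python: write down a prediction about what a snippet will do, run it, and compare. The `check_prediction` helper is hypothetical, something I made up for this example, and the list-aliasing probe is just one misconception a learner might test this way.

```python
def check_prediction(description, predicted, actual):
    """Compare what we expected a snippet to do with what it actually did."""
    matched = predicted == actual
    label = "confirmed" if matched else "misconception"
    print(f"{label}: {description} (predicted {predicted}, got {actual})")
    return matched


# Probe 1: a learner believes that `b = a` makes a copy of the list,
# so appending to b should leave a unchanged.
a = [1, 2, 3]
b = a
b.append(4)
naive = check_prediction("b = a copies the list", [1, 2, 3], a)

# Probe 2: a refined belief, that list(a) makes an independent copy,
# so appending to c should leave a unchanged.
c = list(a)
c.append(5)
refined = check_prediction("list(a) copies the list", [1, 2, 3, 4], a)
```

The first probe fails and the second succeeds, which is exactly the kind of feedback that lets someone replace a wrong mental model with a better one without waiting for an exam to reveal it.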

Because knowing what computing knowledge is in someone’s head is so hard and so important, I believe it’s a grand challenge of computing education research. If we don’t discover reliable, valid, scalable, replicable ways of knowing what people know about computing (or find a way to give more people an insatiable curiosity about computing), we’ll continue to overlook deficiencies in knowledge, producing defective, unreliable code. It’s up to researchers to make these discoveries and up to society to fund that work.

2 thoughts on “Assessment is a computing education grand challenge”

  1. This is, indeed, a grand challenge! Once upon a time, one hypothesis I had (that I still hold) was that developers are good at assessing the quality of their peers that they have worked with. Here’s a paper on the topic:

    Jeffrey C. Carver, Lorin Hochstein, Jason Oslin, “Identifying Programmer Ability Using Peer Evaluation: An Exploratory Study”, First Workshop on Human Aspects of Software Engineering (HAoSE 2009), OOPSLA, Orlando, Florida. September 2009.

    • It’s a great hypothesis and some compelling early evidence. One of the biggest barriers to progress on the topic is that assessment and measurement aren’t nearly as compelling to Ph.D. students, or to the research community for that matter. It’s just really, really difficult, important work. Thanks for helping to build a foundation!
