Testing. It’s become so much a part of the life of a learner or a teacher, at any age. And it’s a fascinating topic.
Okay, I’m one of those weird people who think of test-taking as a sort of competitive athletic event, one at which I’m really quite good, even while believing that the vast majority of tests I’ve ever taken were almost completely pointless. No, that’s not my impostor syndrome kicking in. It has to do with a central concept in test design, which I’ll explain below.
What I love most about assessment is how useful it can be when done well. One of my colleagues says that testing doesn’t bring out the best in people, it doesn’t bring out the worst in people, but it brings out the most in people. We put you in a situation where your normal compensatory strategies for getting along in the world aren’t going to work. As Peter Ossorio says, when you ask a person to do something they can’t do, they’ll do something they can do. You’ll figure out something to do, the best you can, and what you do will be a reflection, in some way, of you. It’s like science — each test is an experiment that you and I do together. No one bit of data proves anything by itself, but when we put things together and look for themes, consistencies, divergences, a story begins to emerge, and it often does so surprisingly quickly.
But what bugs me is how little most folks understand about tests of all stripes — most importantly, how they’re built, how they work, what they’re good for… and what they aren’t. So what I’d like to do is kick off a random-access series of posts on various aspects of assessment, including ordinary classroom tests; high-stakes testing for No Child Left Behind (also known as No Child Allowed Ahead, or No Teacher Left Standing) and other similar “accountability” movements; bubble tests like the dreaded SAT and its ilk; and, of course, my favorite, the one-on-one kinds of tests used for special education and other diagnostic work, the kind that seriously geeky people like me give. Those include cognitive tests, neuropsychological tests, academic tests, psychological tests, behavioral questionnaires, and other fun stuff. I’ll start there because, well, because I like them and I think they’re really pretty interesting. I’ll try to bite off manageable chunks to talk about, and over time, I hope people learn something.
The most serious and popular misconception I encounter is a fundamental misunderstanding of what tests can do. They’re not magic, and those of us who give them aren’t magicians. We’re just very observant (or at least we’re supposed to be!), and we’re using them to make a series of structured observations.
Again, this is like science. When I was training as a molecular biologist, one of the things I had thwacked into my head (through reading in the literature some of the truly impressively weird things that happened when people didn’t remember it) was that no experiment ever tells you anything about the real world. It tells you what happened on that day when that person did that experiment in that way. You might use that information to conjecture about the nature of the real world based on your data, and over time, as you build up more data, you can get a better and better sense of what the real world might be like. But you might see a different experiment, claiming to answer the same question, that gets different results. Uh-oh. Where do you look to figure out what was going on, to find the difference that made the difference? In the Materials and Methods: the specifics of how each experiment was designed and constructed. Very often, that’s where the difference lies. You cannot separate data from the experiment that generated it.
Same with assessment. No test, no matter how beautifully it’s designed, how skillfully it’s administered, and how insightfully it’s interpreted, can possibly tell you anything incontrovertibly true about the real human being. The test tells you what that person did on that day on that test with that tester in that environment. It might reflect something probably true about the person, but you have to stay humble with your interpretation.
Since you will always value what you measure, it makes sense to think very carefully about how to measure what you actually value. In education, we talk about the idea of “alignment” — we’d say that this test is or is not well-aligned to the skills we want the student to be able to demonstrate. That’s what I was talking about above, why I don’t respect the very bubble tests that I tend to be able to blow out of the water. They typically test what is easy to measure, but not what a thoughtful professional would consider all that valuable. At the conclusion of many thousands of hours of clinical training, psychologists in most states have to take a detailed fact-recall bubble test covering basically the entire field. We have to prove that we know which classic theorist suggested that you were running from the bear because you were afraid, versus which one suggested that you were afraid because you were running from the bear. But we don’t have to demonstrate the capacity to actually manifest any clinical competencies with actual, oh, I dunno, human beings in distress. In test design, we talk about the very-closely-related concept of “validity,” which comes in many flavors. In this case, the construct validity of the test — how it defines what it is that it’s trying to measure — is awful. Fact knowledge within a domain is a useful thing, and might be a good prerequisite to beginning clinical work. But the public is not protected from incompetent psychologists by choosing only those who can remember the facts printed in their textbooks.
I think the best-aligned test I ever took was the qualifying exam for the Ph.D. I didn’t get in cancer biology. I was required to dive into fields I was unfamiliar with, learn about the prior research in those fields, and propose new lines of research that would answer important unanswered questions. Minus the speed with which I had to do it (three of these, in completely different fields, within a single week!), this test was testing very much what I would need to do if I became a principal investigator running my own lab someday. Of course, the alignment/construct validity of that test wasn’t perfect either. What it didn’t explore was the personality traits that set me up to be a very sad and bored and frustrated person in the lab, the precise difference between thinking about science, which I love and am good at, and doing bench science on a day-to-day basis, which I don’t and am not.
What I find most concerning about the high-stakes testing (aka “accountability”) movement in education is that it tends to use tests with poor validity in a variety of domains (construct validity, content validity, and predictive validity being the most notable), and that it tends to ignore other underlying methodological differences between comparison groups (most notably, differences in the populations being served and the resources available to teachers and administrators to serve them, but also differences in how various jurisdictions define their goals and standards). When science teachers teach kids about experimental controls, we start with the idea of a “fair game.” But there’s no way on earth that these “games” are fair. There’s nothing truly “standardized” about these experiments, and almost every interpretation that is made of them is a massive overinterpretation from inadequate data. Gives serious testing a bad name. Harrumph.
Okay, so my plans for this series of posts right now involve topics like the various types of validity and reliability (the twin pillars of assessment for people who actually want usable data!), and a sort of overview of each of the major types of clinical testing (e.g., cognitive, academic, neuropsychological, behavioral, projective) and what they are and aren’t good for. I’ll do classroom and educational and high-stakes stuff later, but I’d rather start with what I do the most of. If there are specific ideas or questions you’d like me to address, feel free to drop them in the comments area here.