Some Cautions on the Exclusive Use of Standardized Assessments in Recovery-Oriented Treatment

By Eric Larsson, PhD, LP, BCBA-D 

A common practice in clinical research is to obtain norm-referenced, standardized assessments for pre- and post-testing. This practice should be employed only very cautiously for a variety of reasons. These reasons include basic experimental design concerns and the theoretical assumptions underlying the construction of such tests. Traditional tests provide standard scores that are merely hypothetical constructs. The actual standard score is based upon a number of questionable theoretical assumptions and decisions made when the test was constructed. The theoretical problems are further exacerbated by sample sizes that in practice are much smaller than might be perceived in a cursory examination of the test manual. If valid behavioral results are being called into question by the results of the standardized assessment, then the competent behavioral researcher is a victim of a straw-man argument.

A simple reason for caution with standardized test results is that pre-post comparisons do not allow for an experimental analysis of the effects of an intervention upon that change (Baer, Wolf, & Risley, 1968). Although this caution might be taken as obvious in research in Applied Behavior Analysis, a quick sample of academic interventions reported in the Journal of Applied Behavior Analysis, the most rigorous journal in the field, shows simple pre- and post-testing to be included in 27 percent of the studies reviewed. Of course, the common rationale for including pre-post standardized assessments is to provide some evidence of social validity. However, as will be discussed below, such data are often very difficult to interpret.

Standardized tests have one purpose, and that is to provide a norm-referenced score for the assessed performance of a child: to compare the child’s performance to that of normal peers. The typical standard scores are Intelligence Quotients, age-equivalents, and grade-equivalents. The value of the standard scores is that they allow a single child’s score to be compared to that of a group of peers who were tested under similar conditions (Anastasi, 1982). This comparison is the only meaning of the standard scores.

However, a standard score should be understood as a hypothetical construct (MacCorquodale & Meehl, 1948; Morris, Higgins, & Bickel, 1982), and to apply this construct to a child’s performance is to commit the logical fallacy of affirming the consequent by relying on an undistributed middle term (Algozzine, 1980). In other words, the norm group is judged to possess a certain property: a given age-level of development, for example. The norm group is then found to receive a certain score on the test. Finally, the child of interest is found to receive the same score on the test. Therefore, the child of interest is concluded to possess the same age-level development as the norm group. However, this conclusion can only be logically drawn if the middle term is distributed: that is, if only children with the given age-level development receive that score on the test. Unfortunately, those same scores are typically spread across a wide variety of children with actually differing age-level development (Baer, 1970). Children with very different skills receive the same test scores; this is particularly true for children who have disabilities. When scores are spread this widely, the application of the hypothetical construct to experimental results is rendered tenuous. Thus the meaning of the standard score becomes difficult to understand.

The application of norm references to experimental results through the use of standardized assessments is further complicated by the questionable validity of the assessments (Kamin, 1974; Sternberg, Grigorenko, & Bundy, 2001). Each test is organized around a theoretical decision-making process for assigning the original standardized scores to given raw performances. For example, in the case of grade-equivalents, the test construction requires a process of deciding which performances or groups of children qualify as the reference standard for a given grade level. Given the variety of curricula and student performances currently found, this decision-making process is open to question.

This problem is one major reason for the wide distribution of scores. The usual means of judging grade-level performance is to find students who are in a given grade according to the criteria of the participating school district. These criteria vary widely across schools, administrators, and teachers. The advocacy of individual parents, the social skills of individual students, and the age of the students are also factors that determine grade level. Remember also that the standard scores reflect grade level at one-month intervals. The variability of this assignment process is then multiplied by the variability of the performances of the given students in the norm group on the day that they sat for the test. Age-equivalents have similar difficulties, given the wide variability in development across children at any given month of age. When Intelligence Quotients are examined, the meaning is even less clear. There is no outside standard for an Intelligence Quotient, save the results of another I.Q. test.

It might be argued that, if not understandable, at least the standardized testing procedure yields some replicable score that is understandable by the reader of a research report, at least in terms of within-subject comparisons. Another factor that might give the scores some utility is the large sample size of the norm group, thus averaging out the kinds of discrepancies described above. However, even these assumptions may be misleading, as will be discussed below.

To begin with, the norm group is typically not large enough to provide a full set of empirically derived standard scores (Horst, 1976). Typically, the norm group is composed of either one or two samples per grade, age, or mental-age stratum. All other gradations within the stratum are arrived at by interpolation. For example, a grade equivalent is often based upon testing only those children who are theoretically judged to be at even-grade-level and grade-level-plus-nine-months. All other equivalents within that grade level are arrived at by interpolation. The interpolation process typically assumes regular growth across the grade level. Actual growth, however, is more likely to be uneven, in particular, showing a loss of performance during the last three months of the school year (summer recess). As such, a given raw score performance could be typical of three different grade equivalents within each grade level, but this awkward relationship is smoothed out by the graduated interpolation of scores to create a one-to-one relationship of raw and standard scores. This procedure, then, can result in an unwarranted apparent growth when no growth is more likely to be the case. In an experimental study, this procedure can show a control condition to result in an unwarranted growth of performance that may mask the relative growth of an experimental intervention. Such an effect may be a source of the poor data seen in the Follow Through experiments (Bushell, 1978), and may routinely restrict the apparent effects of experimental interventions (Greenwood et al., 1984).
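The smoothing effect of interpolation can be illustrated with a minimal sketch. Every number in it is hypothetical and none comes from a real test: months 0 and 9 of grade 3 are assumed to be the only empirically normed points, the manual is assumed to fill in the months between by straight-line interpolation, and the month-by-month "actual" scores are invented to show uneven growth.

```python
# Hypothetical illustration; none of these numbers come from a real test.
# Months 0 and 9 of the grade are assumed to be the only normed points.
NORMED = {0: 20.0, 9: 30.0}  # month of grade -> mean raw score of norm sample

# Assumed typical raw score at each month (0..9), with uneven growth.
ACTUAL_TYPICAL = [20, 22, 24, 26, 27, 27, 26, 27, 28, 30]


def interpolated_grade_equivalent(raw_score, grade=3):
    """Grade equivalent reported under straight-line interpolation."""
    low, high = NORMED[0], NORMED[9]
    months = 9 * (raw_score - low) / (high - low)
    return round(grade + months / 10, 2)


def months_where_typical(raw_score):
    """Months at which the raw score is typical in the assumed data."""
    return [m for m, typical in enumerate(ACTUAL_TYPICAL) if typical == raw_score]


# The interpolated table reports a single grade equivalent for a raw score of 27,
# while the assumed data show that score to be typical at several months.
print(interpolated_grade_equivalent(27))
print(months_where_typical(27))
```

Under these assumptions, the interpolated table assigns one grade equivalent to a raw score that is typical of three different months, which is the one-to-one smoothing described above.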

A related problem occurs when the standardization norm groups are not large enough to provide a full distribution of possible standard scores. Typically, 67 percent of the standard sample receive scores within a single stratum, 28 percent receive scores within the neighboring strata, and 5 percent receive scores within the other strata. In a typical case, where an apparently large sample of 1000 subjects is tested across 10 grade levels (about 100 subjects per grade), this might amount to a sample of 67 subjects serving as the norm group for a given grade equivalent, 10 serving as the norm group for the grade-equivalent-minus-one-year, 18 serving as the norm group for the grade-equivalent-plus-one-year, and 5 subjects serving as the norm group for all other grade equivalents. The 10 versus 18 discrepancy would be the result of random variation in subject performance. The five extreme scores would then be thrown out as unreliable, and all other grade equivalents would be based upon a theoretical extrapolation of the scores of the three groups of only 67, 18, and 10 subjects to all of the other strata. This is a particular problem for research in developmental disabilities, where the majority of scores are likely to be below grade level, and thus primarily based upon theoretical extrapolation from a small test sample.
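The arithmetic in the example above can be checked with a short sketch. The 1000 subjects, 10 grade levels, and the 67/28/5 percent split come from the text; the function name, the labels, and the assumption that subjects are spread evenly across grade levels are illustrative only.

```python
def norm_group_sizes(total_subjects, grade_levels, percent_by_stratum):
    """Subjects behind each group of scores, assuming an even split across grades."""
    per_grade = total_subjects // grade_levels
    return {
        label: per_grade * percent // 100
        for label, percent in percent_by_stratum.items()
    }


# Percentages taken from the example in the text.
sizes = norm_group_sizes(
    total_subjects=1000,
    grade_levels=10,
    percent_by_stratum={
        "own grade": 67,
        "neighboring grades": 28,
        "all other grades": 5,
    },
)

# Subjects left once the extreme scores are discarded as unreliable.
usable = sizes["own grade"] + sizes["neighboring grades"]

print(sizes)
print(f"{usable} of {sum(sizes.values())} subjects tested at this grade level remain")
```

The 28 subjects in the neighboring grades are the 10 and 18 described in the text, so every grade equivalent outside those three strata rests on extrapolation from these 95 subjects.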

Another theoretical process determines comparisons between differing scores. At each stratum of a test, a child’s standard score is based upon a different set of assessment items and peers. Two children who are tested at different levels of the test therefore have performances that are based on different items and norm-referenced against different samples of peers. A score derived from one stratum cannot be directly compared to a score from a different stratum. If a child’s experimental progress is across two strata, then the pre-post comparison is of two different performances which are only comparable in terms of the theoretical premises of the construction of that test.

Similarly, standardized intelligence tests are actually designed to eliminate change by continually renormalizing test results across each stratum of the standardization sample. Therefore, if a child’s achievement spans two or more strata, the resulting scores can be expected to remain the same, rather than to improve. Not only are such comparisons of standardized scores across strata inappropriate, but they may yield data that suggest a loss of gains due to the planned regression of the scores through the normalization procedure.
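A minimal sketch can show how renormalization hides raw-score gains. It assumes the common deviation-score convention (mean 100, standard deviation 15) and two hypothetical age strata whose norm means and standard deviations are invented for illustration; no real test's norms are used.

```python
def standard_score(raw, norm_mean, norm_sd, mean=100.0, sd=15.0):
    """Deviation-style standard score relative to one stratum's norm sample."""
    return mean + sd * (raw - norm_mean) / norm_sd


# Hypothetical norm table: age stratum -> (mean raw score, SD of raw scores).
NORMS = {6: (40.0, 10.0), 7: (50.0, 10.0)}

# A child tested at age 6 and again at age 7, gaining 10 raw-score points.
pre = standard_score(30.0, *NORMS[6])
post_same_pace = standard_score(40.0, *NORMS[7])

# A second scenario: the child gains only 5 raw-score points over the year.
post_slower = standard_score(35.0, *NORMS[7])

print(pre, post_same_pace, post_slower)
```

Under these assumptions, a child who gains as many raw-score points as the norm group receives an unchanged standard score, and a child who gains fewer appears to have lost ground even though raw performance improved.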

The standard scores of most tests are also not standardized in any meaningful way for the common practice of averaging a diverse experimental group’s performance. Further, the standard scores are not standardized cross-sectionally to show change in a single child’s behavior. Finally, when experimental groups are matched on standardized scores, these matches are inappropriate when the children were tested at different strata, as is common (Thurston, 1977). In order to avoid the logical problems discussed above, experimental group averages should actually only be referenced to normal groups who received the same test items and who had the same characteristics of age, grade, etc., as did the experimental subjects. Given the problems with norm-referenced tests, the use of criterion-referenced standardized assessments has much to offer. Comparisons of subject performance to the criteria alone will yield results which are much more easily understood by the reader of the report.

Now, the use of standardized IQ and adaptive behavior scores is very helpful in communicating the impact of an intervention to broad audiences. In fact, if a child who previously had scored in the developmentally delayed range is now testing within the normal range, that is a substantial measure of typical performance. However, as discussed, it is by no means the only measure of effective progress. In practice, that is why we use all of these forms, as well as a broad battery of behavioral measures, and consider the most valid predictor to be the multi-modal assessment.


Algozzine, B. (1980). Current assessment practices: A new use for the Susan B. Anthony dollar? In C. L. Hansen (Ed.), Child assessment: The process and the product (U.S. Office of Special Education and Rehabilitative Services contract #300-79-0062). Seattle, WA: Program Development Assistance System.

Anastasi, A., & Urbina, S. (2009). Psychological Testing. New York: Prentice-Hall.

Baer, D. M. (1970). An age-irrelevant concept of development. Merrill-Palmer Quarterly, 16, 238-245.

Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91-97.

Bushell, D. (1978). An engineering approach to the elementary classroom: The Behavior Analysis Follow-Through Project. In A. C. Catania & T. A. Brigham (Eds.), Handbook of applied behavior analysis: Social and instructional processes. New York: Irvington.

Greenwood, C. R., Dinwiddie, G., Terry, B., Wade, L., Stanley, S. O., Thibadeau, S., & Delquadri, J. C. (1984). Teacher- versus peer-mediated instruction: An ecobehavioral analysis of achievement outcomes. Journal of Applied Behavior Analysis, 17, 521-538.

Horst, D. P. (1976). What’s bad about grade-equivalent scores? (U.S. Office of Education contract). Mountain View, CA: RMC Research Corporation.

Kamin, L. J. (1974). The Science and Politics of IQ. Mahwah, NJ: Lawrence Erlbaum Associates.

MacCorquodale, K., & Meehl, P. E. (1948). On a distinction between hypothetical constructs and intervening variables. Psychological Review, 55, 95-107.

Morris, E. K., Higgins, S. T., & Bickel, W. K. (1982). Comments on cognitive science in the experimental analysis of behavior. The Behavior Analyst, 5, 109-126.

Sternberg, R. J., Grigorenko, E. L., & Bundy, D. A. (2001). The predictive value of IQ. Merrill-Palmer Quarterly, 47, 1-41.

Thurston, L. P. (1977). The experimental analysis of a parent-tutoring program to increase reading enjoyment, and oral reading and comprehension skills of urban elementary school children. Unpublished doctoral dissertation, University of Kansas, Lawrence, KS.

Citation for this article:

Larsson, E. (2014). Focus on Science: Some cautions on the exclusive use of standardized assessments in recovery-oriented treatment. Science in Autism Treatment, 11(3), 22-25.
