Spreckley, M., & Boyd, R. (2009). Efficacy of applied behavioral intervention in preschool children with autism for improving cognitive, language, and adaptive behavior: A systematic review and meta-analysis. The Journal of Pediatrics, 338-344.
Reviewed by Jonathan W. Kimball, PhD, BCBA
Woodfords Family Services, Portland, ME
Spreckley and Boyd (2009) have written a meta-analysis of the efficacy of “applied behavior intervention” (ABI) programs for preschoolers with Autism Spectrum Disorders (ASD). Their article stirred much interest and conversation among a variety of individuals concerned with education and treatment for children with autism in the state where I work, because of their conclusion that “there is inadequate evidence that ABI has better outcomes than standard care for children with autism” (p. 338). An implication of this statement is that the authors compared ABI to another uniform type of intervention, when in fact they did no such thing. At this time there is no universally accepted form of “standard care” analogous to what exists for other disorders or illnesses. We have only the most general guidelines recommending features that any comprehensive program should have (e.g., National Research Council, 2001). Spreckley and Boyd did not refer to such guidelines, nor did they demonstrate, for children who were not receiving ABI, that any uniform type of care was delivered across the studies they evaluated. On the other hand, specific behavior analytic interventions employed in comprehensive treatment programs for children with autism are well represented in hundreds of peer-reviewed studies conducted over several decades and carried out by researchers worldwide (Baer, 2005; Matson, Benavidez, Compton, Paclawskyj, & Baglio, 1996). In what follows I will explain my skepticism concerning the validity of Spreckley and Boyd's conclusion, and discuss why it ought not to have significant impact on autism treatment, research, or policy. I am a clinician, and more a well-informed consumer than a producer of research, and it is in this capacity that I will discuss Spreckley and Boyd. I will not, therefore, provide a nuanced critique of their meta-analysis: With respect to this article my intention is as much to inoculate as to illuminate.
A meta-analysis involves combining effect sizes (i.e., the quantitative expressions of response to treatment relative to results for a comparison group or to pre-treatment performance) reported across several studies for a given variable. Results thus aggregated have greater statistical power and thus, it is believed, lead to more valid conclusions about a treatment's effects than they would if studies were considered individually. Spreckley and Boyd sought to examine comprehensive behavior analytic intervention in terms of its effects on cognitive, adaptive, and language development of children with ASD (incidentally, what they referred to as ABI is otherwise known as early intensive behavioral intervention [EIBI] or, in an unfortunate conflation of discipline and intervention, “ABA”). One of their criteria for including a study in the analysis was that it must have been a randomized or quasi-randomized controlled trial (RCT), which requires random assignment of participants to treatment or comparison groups. Randomly assigning some children to receive treatment and others to receive none or less treatment is very difficult to accomplish ethically in research with human participants, and therefore just four studies met all criteria: Eikeseth, Smith, Jahr, and Eldevik (2002); Eikeseth, Smith, Jahr, and Eldevik (2007); Sallows and Graupner (2005); and Smith, Groen, and Wynn (2000). These studies are but a fraction of the behavioral research that has been conducted with children with autism—a fact to which we will return below. At any rate, based on their statistical analysis of these four studies the authors concluded “that ABI did not result in significant improvement in cognitive, language, or adaptive behavioral outcomes compared with standard care” (pp. 341-342).
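The pooling step can be sketched concretely. The following is a minimal, hypothetical illustration of one common method, fixed-effect inverse-variance weighting; the function name and every number are my own inventions for illustration and are not drawn from Spreckley and Boyd (2009) or from the four underlying studies.

```python
# Hypothetical sketch of fixed-effect, inverse-variance pooling of effect
# sizes. All numbers are invented; they are not data from the actual studies.

def pooled_effect(effects, variances):
    """Inverse-variance-weighted mean of study effect sizes (fixed effect)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)  # pooled estimate and its variance

# Three hypothetical studies reporting standardized mean differences
# (treatment minus comparison); a negative value favors the comparison group.
effects = [0.6, 0.4, -0.3]
variances = [0.10, 0.15, 0.08]  # smaller variance -> larger weight

d, v = pooled_effect(effects, variances)
print(f"pooled d = {d:.2f} (variance {v:.3f})")
```

Because the third, negative study has the smallest variance it carries the largest weight, so a single study favoring the comparison group can dominate the pooled estimate.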
For each of the four variables Spreckley and Boyd examined—IQ, receptive and expressive language, and adaptive behavior—they took the aggregated scores from respective studies, combined them, and compared the result with similarly aggregated scores for comparison groups.
Although four studies met their inclusion criteria, for each of the four variables data happened to be available from only three of them, in differing combinations. When these data are examined more closely, a startling realization dawns: Sallows and Graupner (2005) is included in the analysis of all four variables, and in every case, the mean score in that study favors the comparison group over the treatment group. Later in the text, Spreckley and Boyd tell us “The results of this review should be interpreted with caution because [for two studies, Sallows & Graupner, 2005, and Smith et al., 2000] the content of the intervention was the same for the comparison group, although at reduced intensity (80% and 16%)” (p. 343, emphasis added). In other words, Sallows and Graupner's comparison group—the one favored in the analysis—was also receiving “ABI”! The comparison was really between clinic-directed vs. parent-directed behavioral programming (the latter actually having 82% of the overall hours [T. Smith, personal communication, June 30, 2009]). The higher parent-directed comparison score considerably lowered the average for Spreckley and Boyd's aggregated “treatment” group for all four variables. This low average was the principal basis for Spreckley and Boyd's statement that “Current evidence does not support ABI as a superior intervention for children with ASD” (p. 342). Their bold conclusion—never mind the unsubstantiated and ideologically hand-tipping closing remark that “the overwhelming majority of children with ASD change over time as part of their development as opposed to change resulting from an intervention” (p. 343)—thus went well beyond the data and seemed to ignore their own admonition to interpret the results with caution.
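The arithmetic behind this point is easy to demonstrate. The sketch below uses invented group means and sample sizes (not the actual study data) to show how a single study whose comparison group outscores its treatment group can nearly erase an aggregate treatment advantage:

```python
# Invented numbers (not the actual study data) showing how one study whose
# comparison group scores higher can drag down an aggregated treatment mean.

def weighted_mean(means, ns):
    """Sample-size-weighted mean of group means across studies."""
    return sum(m * n for m, n in zip(means, ns)) / sum(ns)

# Hypothetical treatment and comparison group means in three studies;
# in study C the comparison group happens to outscore the treatment group.
treatment  = [95, 90, 80]   # studies A, B, C
comparison = [80, 78, 92]
ns = [15, 14, 24]           # study C contributes the most participants

print("all three studies: "
      f"{weighted_mean(treatment, ns):.1f} vs. "
      f"{weighted_mean(comparison, ns):.1f}")
print("studies A and B:   "
      f"{weighted_mean(treatment[:2], ns[:2]):.1f} vs. "
      f"{weighted_mean(comparison[:2], ns[:2]):.1f}")
```

With all three hypothetical studies aggregated, the treatment advantage shrinks to about two points; without study C it is roughly thirteen points. The aggregate answer depends heavily on that one study.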
I think it would be widely acknowledged that what is measured and evaluated in autism treatment is behavior. Further, it has been argued that children with autism are unique, as different from each other as from typically developing children. Some of the research that employs inferential statistics with aggregated data reminds me of a metaphor facetiously related by one of my professors, who said that if medical research were practiced in a manner that disregarded individual uniqueness, we could put every participant's blood into the same vessel and just run one test. As James Johnston (1988) wrote, “Behavior exists only between individual organisms and their environments, and in order to be effective, experimental studies must respect this fact…This means that if experimenters do anything that contaminates, dilutes, or otherwise distorts measures of behavior change, there is likely to be some deleterious effect on the inferences that can be drawn from the data. Among other actions, this caveat clearly includes the variety of measurement and data processing techniques that result in collating individual data into some group amalgam” (p. 2).
What Spreckley and Boyd did, in discussing the results of their meta-analysis, was to draw conclusions based on an amalgam of already amalgamated data. When done well, this method of analysis can be appropriate to retrospectively assess the degree of confidence we should have in extant data, and by extension, the conclusions drawn from those data (see Reichow & Wolery, 2009, for a more comprehensive meta-analysis and more humble conclusions). When it comes to learning about behavior, however, this method is like inserting yet another mattress between the princess and the pea. If the Sallows and Graupner data were to be eliminated from the analysis, the conclusion would be quite different yet probably no more enlightening: in terms of behavior, individual differences would still be lost, with no accompanying gain on the statistical side because any analysis would be suspect due to an insufficient number of participants. Adding a much larger sample of children in order to increase statistical power, however, would also increase the heterogeneity of the sample. A large sample would then entail the use of additional statistical procedures to help determine the extent to which heterogeneity influences the generalizability of the findings—which seems like taking one pill to help manage the undesirable effects of another.
Despite Spreckley and Boyd's faulty conclusion, their argument was really not about efficacy, but about evidence itself. By “evidence,” Spreckley and Boyd were primarily referring to a particular kind of experiment, the RCT, as the grail of research, as the one true means of producing valid data for making sound conclusions about reported effects. Much has been written about insufficient RCT data to support comprehensive behavioral intervention (cf. two fine articles by Reichow & Wolery, 2009, and Rogers & Vismara, 2008), but I think this concern is to some extent misplaced or premature. Reichow and Wolery may have been technically accurate when they said that “Without comparisons between EIBI and empirically validated treatment programs, it is not possible to determine if EIBI is more or less effective than other treatment options” (p. 39), but while comparison may constitute one kind of worthwhile pursuit, it has its own shortcomings.
First, comparison studies are as much an actuarial endeavor as a clinical one. The chief aim of such research is “to estimate dividends and risks for general categories based on statistical records alone, that is, without attempting to understand the reasons for each event so as to allow prediction in a more individualized fashion” (Johnston, 1988, p. 3). Well-conducted RCTs can indeed help identify individual characteristics that, statistically, seemed to enhance or impede response to a given treatment—valuable information indeed—but if a specific participant does not respond to that treatment, researchers are more often left merely with ignorance than with alternatives. The autism community would be better served by studies that seek to match child characteristics with the most promising type of treatment program, as Sherer and Schreibman (2005) did with Pivotal Response Training and a more structured behavioral approach (some benefited from the former, others, the latter—these are individual judgments and cannot be made a priori on the basis of aggregated data).
Second, when it comes to treatment of children with autism, there is really very little to compare. There are considerable data supporting the efficacy of EIBI, while the amount of experimental research produced by non-behavioral programs is minimal at present (Schreibman, 2005). There is no law requiring the use of behavioral techniques for children with autism; there is, however, a federal law (the Individuals with Disabilities Education Act) mandating that practices be based on “peer-reviewed research,” and when it comes to children with autism, most of the extant research is behavioral.
At this point, I must clarify what is meant by “EIBI” and “behavioral research.” Saying a child received EIBI is like saying a child received college. I think Don Baer (2005) put it very well:
“ABA [EIBI] acknowledges from the outset of each case that each child with autism requires a unique sequence of behavior changes made by different procedures to maximize his or her chances of achieving the best outcome possible. ABA is, as far as I know, the only approach that has always measured its outcomes objectively, reliably, and validly. Approximately 500 published studies show that one or a few of the many behavior changes children with autism require can be made by ABA programming. True, perhaps 300 of those 500 studies lacked a convincing experimental design and formal evidence of reliable measurement, but the other 200 replicated their results and extended them with good measurement and convincing designs. ABA is, as far as I know, the only approach that has evaluated outcomes in well-controlled clinical trials…. ABA has produced unprecedented good results…[and] no other approach has proved that it can do nearly as well, as far as I know (p. 6).”
In this paragraph, Baer not only suggested that no two children will, or ought to, receive identical intervention components under the umbrella of EIBI, but also indicated the single-subject nature of behavioral research. What characterizes EIBI is less any given intervention, which may happen to have been developed by behavior analysts, and more that the effects (dependent variables) of every intervention (independent variable) are measured frequently and reliably. An instructional method is employed because (a) it has a documented track record of effectively teaching specific skills under similar circumstances in the past, and (b) it remains effective in its current use with a child, as demonstrated by regular monitoring of performance data. In other words, individual outcomes matter more than whatever specific technique reliably produced them, but it happens that we know more about behavioral methods because their outcomes have been so extensively measured—sometimes well enough that alternative explanations for the change can be confidently ruled out.
Spreckley and Boyd's concern was not only with the kind of available evidence—RCTs versus single-subject research—but also with the amount of available evidence. This is not a merely academic consideration when federal law calls for the use of empirically supported practices but does not spell out how much peer-reviewed research is enough. Spreckley and Boyd chose to apply a strict “threshold” standard, an all-or-nothing judgment whereby an intervention program that is not supported by a certain number of RCTs is not considered to be supported at all. In contrast to this, there are also hierarchical standards of evidence, such as those of the American Psychological Association, which place interventions, whether investigated via group comparison or single-subject research, along a continuum from “well established” to “probably efficacious” to “experimental” (Detrich, 2008, p. 29). These standards are merely arbitrary conventions, a matter of consensus within a given field, and with respect to single-subject research we are “just beginning the process of determining the professional standards that allow demonstration of an evidence-based practice” for special education (Horner et al., 2005). Horner et al. (2005) suggested five criteria that must be met in order for an intervention that has been effective in single-subject research to be considered evidence-based, one of which proposes that:
“A practice may be considered evidence based when (a) a minimum of five single-subject studies that meet minimally acceptable methodological criteria and document experimental control have been published in peer-reviewed journals, (b) the studies are conducted by at least three different researchers across at least three different geographical locations, and (c) the five or more studies include a total of at least 20 participants (p. 176).”
Parents, providers, and policy makers will find many specific behavioral interventions that meet this standard.
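For concreteness, the quoted decision rule can be expressed as a simple check. The data structure and names below are my own invention; only the thresholds (five studies, three researchers, three locations, 20 participants) come from the Horner et al. (2005) criteria quoted above.

```python
# A sketch of the decision rule quoted from Horner et al. (2005). The class
# and function names are hypothetical; only the numeric thresholds come
# from the quoted criteria.
from dataclasses import dataclass

@dataclass
class Study:
    researcher: str
    location: str
    participants: int
    acceptable_methods: bool   # meets minimally acceptable methodological criteria
    documents_control: bool    # documents experimental control

def meets_horner_threshold(studies):
    """True if a set of single-subject studies satisfies conditions (a)-(c)."""
    q = [s for s in studies if s.acceptable_methods and s.documents_control]
    return (len(q) >= 5                               # (a) at least 5 studies
            and len({s.researcher for s in q}) >= 3   # (b) 3+ researchers...
            and len({s.location for s in q}) >= 3     #     ...in 3+ locations
            and sum(s.participants for s in q) >= 20) # (c) 20+ participants
```

For example, five qualifying studies from three research groups in three locations with a combined 20 participants would pass, while an otherwise identical set of four studies would not.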
It is the dual standards and models of research—threshold vs. hierarchical, actuarial comparisons of programs vs. single-subject evaluation of interventions—that the much-cited National Research Council (NRC) report (2001) attempted to reconcile. On one hand, the report applied a very high threshold standard to evidence itself, and on this point even Spreckley and Boyd were correct that there is negligible RCT research on any general approach to teaching children with autism (in fact, for many putative treatments there is no research at all). On the other hand, the NRC applied a hierarchical standard to specific instructional interventions, and on this basis made numerous more or less qualified recommendations, almost all of which happened to be supported by behavioral research. It is not necessarily that these interventions are inherently superior to other methods—any judgment of that kind is indeed premature. Rather, they are simply supported by a preponderance of evidence that meets conventional standards for being well-established and that does, therefore, offer firm enough footing for policy. In suggesting otherwise, Spreckley and Boyd's logic would seem to be not only unsupported by conventions of psychology and special education, but also, in proffering a nearly impossible standard of evidence, potentially harmful. Tobacco companies once chose to adopt a similarly impossible standard and on the basis of that self-serving choice disputed that there was adequate “evidence” that cigarettes “caused” lung cancer—despite compelling and continuously mounting demonstrations of a relationship. Spreckley and Boyd's putatively scientific claim could be exploited to argue against providing behavioral intervention, and in the supposed absence of evidence, authority is left as the only arbiter of treatment decisions.
Baer, D. M. (2005). Letters to a lawyer. In W. L. Heward, T. E. Heron, N. A. Neef, S. M. Peterson, D. M. Sainato, G. Cartledge, R. Gardner III, L. D. Peterson, S. B. Hersh, & J. C. Dardig (Eds.), Focus on behavior analysis in education: Achievements, challenges, and opportunities. Columbus, OH: Pearson.
Detrich, R. (2008). Evidence-based, empirically supported, or best practice? A guide for the scientist-practitioner. In J. K. Luiselli, D. C. Russo, W. P. Christian, & S. M. Wilczynski (Eds.), Effective practices for children with autism. New York: Oxford University Press.
Eikeseth, S., Smith, T., Jahr, E., & Eldevik, S. (2002). Intensive behavioral treatment at school for 4- to 7-year-old children with autism: A 1-year comparison controlled study. Behavior Modification, 26, 49-68.
Eikeseth, S., Smith, T., Jahr, E., & Eldevik, S. (2007). Outcome for children with autism who began intensive behavioral treatment between ages 4 and 7: A comparison controlled study. Behavior Modification, 31, 264-278.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.
Johnston, J. M. (1988). Strategic and tactical limits of comparison studies. The Behavior Analyst, 11, 1-9.
Matson, J. L., Benavidez, D. A., Compton, L. S., Paclawskyj, T., & Baglio, C. (1996). Behavioral treatment of autistic persons: A review of research from 1980 to the present. Research in Developmental Disabilities, 17, 433-465.
National Research Council. (2001). Educating children with autism. Committee on Educational Interventions for Children with Autism, Division of Behavioral and Social Sciences and Education. Washington, D.C.: National Academy Press.
Reichow, B. & Wolery, M. (2009). Comprehensive synthesis of early intensive behavioral interventions for young children with autism based on the UCLA young autism project model. Journal of Autism and Developmental Disorders, 39, 23-41.
Rogers, S. J. & Vismara, L. A. (2008). Evidence-based comprehensive treatments for early autism. Journal of Clinical Child and Adolescent Psychology, 37, 8-38.
Sallows, G. O., & Graupner, T. D. (2005). Intensive behavioral treatment for children with autism: Four- year outcome and predictors. American Journal on Mental Retardation, 110, 417-438.
Schreibman, L. (2005). The science and fiction of autism. Cambridge, MA: Harvard University Press.
Sherer, M. R., & Schreibman, L. (2005). Individual behavioral profiles and predictors of treatment effectiveness for children with autism. Journal of Consulting and Clinical Psychology, 73, 525-538.
Smith, T., Groen, A. D., & Wynn, J. W. (2000). Randomized trial of intensive early intervention for children with pervasive developmental disorder. American Journal on Mental Retardation, 105, 269-285.
Spreckley, M., & Boyd, R. (2009, March). Efficacy of applied behavioral intervention in preschool children with autism for improving cognitive, language, and adaptive behavior: A systematic review and meta-analysis. The Journal of Pediatrics, 338-344.