Science Corner: Interpreting standardized assessment scores

ASAT's Science Corner

Sarah Connolly, PhD, BCBA-D, ABPP
University of Nebraska Medical Center, Munroe-Meyer Institute

In the vignette included below, imagine the journey of a parent who is absorbing the news that their child has just been diagnosed with autism spectrum disorder (ASD). In an instant, their world shifts—and so does the flood of information that follows.

Vignette: Part 1

During the diagnostic feedback appointment, the psychologist provides an overview of the ASD diagnosis, an explanation of how the diagnosis was determined, and a long list of recommendations and referrals. The psychologist delivers the news of the diagnosis with clarity and sensitivity, but the parent finds it difficult to digest the large volume of information in such a brief appointment. Even though the diagnosis did not come as a surprise, the parent is overwhelmed by the number of recommendations and referrals.

“Where do I begin?” they wonder.

As the parent manages their own emotional reaction to learning of their child’s diagnosis, they dig deep and find within themselves the strength to shift into “action mode” on behalf of their child. The parent combs through the multiple pages of recommendations, and the bolded text reading, “pursue intensive applied behavior analysis (ABA) therapy” stands out to them.

The parent has heard the term “ABA,” but they seek to gather additional information regarding whether this approach to intervention is best suited for their child. At the recommendation of the diagnosing psychologist, the parent visits the virtual resource center of a reputable non-profit organization that focuses on autism advocacy, research, and awareness. They see links to scientific journal articles that present evidence for the use of ABA for children with ASD.

Upon opening one of the articles and skimming through the first few pages, the parent asks themselves, “How do I know whether the autistic children included in this study are anything like my child?” They can see, from the results section, that the intervention worked for the children in the research study, but the parent asks themselves, “How do I figure out if this intervention may work for my child too?”

Research articles rely on various tools to both describe the characteristics of the participants (found in the article’s method section) and measure the effectiveness of the intervention (described in the results section). However, these participant characteristics are often not described in ways that are accessible to caregivers, making it difficult for caregivers to determine if the participants in the study share common characteristics with their child. The assessments used and their corresponding scores may be unfamiliar, making it difficult to interpret the results. Readers may get lost in the jargon-filled descriptions of the assessments used in the article and are often not provided with the information necessary to determine the relevance of the findings to their loved one. This article, therefore, aims to help readers overcome those challenges by walking them through different types of standardized assessments, the ways in which they are scored, and how they should be interpreted. Additional tools for interpreting assessment information are also provided to empower readers as they delve into complex research studies.

Understanding Norm-Referenced Assessments

Norm-referenced assessments (NRAs) compare an individual’s performance with that of a comparable norm group. A norm group is a sample drawn from the general population that is used to represent the broader population and provides a reference group to which an individual’s scores can be compared. Scores on NRAs allow the individual’s performance to be compared to what is “typical” or “average” within a larger population. Testing manuals that accompany the assessment will describe the normative sample, which is the large group of people who are selected to represent the population for whom the test was designed. While some assessments may include a normative sample of individuals with ASD (such as assessments that specifically measure ASD symptoms), many other assessments (such as cognitive or language assessments) are normed on broader, non-autistic populations. NRAs consist of questions or tasks that are presented according to highly specific, standardized procedures, and to which strict scoring criteria are applied. Some examples of NRAs that are used to characterize individuals with autism are shown in Figure 1.

Figure 1

Examples of Norm-referenced Assessments

Autism Symptomology	Early Developmental Functioning	Intellectual Functioning	Adaptive Functioning
Autism Diagnostic Observation Schedule – Second Edition (ADOS-2)	Mullen Scales of Early Learning (MSEL)	Differential Ability Scales – Second Edition (DAS-II)	Vineland Adaptive Behavior Scales – Third Edition (VABS-3)
Childhood Autism Rating Scale – Second Edition (CARS-2)	Bayley Scales of Infant and Toddler Development (Bayley-4)	Stanford Binet Intelligence Scales – Fifth Edition (SB-5)	Adaptive Behavior Assessment System – Third Edition (ABAS-3)

Scores on NRAs are reported according to the normal distribution, which is also known as the bell curve (Figure 2). In well-designed instruments, individual scores are distributed along this curve, with the majority of scores clustering near the mean or average, and fewer appearing at the higher and lower ends of the distribution. Raw scores (e.g., the number of items scored as correct or total points awarded) often cannot be meaningfully interpreted, as they do not account for contextual factors such as age, gender, diagnoses, or education. Therefore, raw scores are converted to standard scores (also called derived scores). Once the assessment has been administered and a raw score is obtained, various statistical calculations are used to transform the raw score into a standardized score that can be interpreted according to the normal curve. Upon calculating the standardized score, the assessor can determine where the individual’s score falls according to the normal distribution—for example, whether the individual’s score is within the average range, or somewhat higher or lower.
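The conversion described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the norm-group mean and standard deviation below are hypothetical values, not figures from any published test, and the function name is made up for this example. In practice, publishers supply age-based conversion tables rather than a single formula.

```python
def raw_to_standard(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Place a raw score on a standard-score scale (default: mean 100, SD 15).

    norm_mean and norm_sd describe the raw scores of the norm group;
    the values used in the example below are hypothetical.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # Step 1: express the raw score as a distance from the norm-group mean,
    # measured in standard deviations (a z-score).
    z = (raw - norm_mean) / norm_sd
    # Step 2: re-express that distance on the reporting scale.
    return scale_mean + scale_sd * z


# Hypothetical norm group: mean raw score of 40, standard deviation of 8.
print(raw_to_standard(32, norm_mean=40, norm_sd=8))  # 85.0
```

Here a raw score of 32 sits one standard deviation below the hypothetical norm-group mean, so it lands at 85 on a scale with a mean of 100 and a standard deviation of 15.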

Figure 2

Normal Distribution or “Bell Curve”

Interpreting Standard Scores

Summaries of the standardized scores obtained within a research study may appear in the methods or results section of a published research article, particularly when participant characteristics or outcomes of the study are being reported. When readers, such as parents, understand how to interpret standardized scores, they will likely be better equipped to determine whether the intervention and findings described in the research article are relevant to their child.

Test publishers often include raw score conversion tables within their test manuals, which provide evaluators with a convenient way to convert a raw score into a standard score; computerized scoring programs also provide this information. While there are various types of scores that can be calculated and converted, Table 1 includes a sample of commonly used standard/derived scores within psychological assessments, followed by guidance on interpreting these scores.

Table 1

Examples of standardized test scores in ASD assessment and research
Standard score	Scores commonly found in measuring broad psychological constructs such as intelligence, adaptive functioning, or academic achievement
T-Score	Scores commonly found in measuring broad psychological constructs, as well as more specific scales of functioning such as social and behavioral functioning
Scaled score	Scores often used to represent performance on a subtest of a broader construct
Z-Score	Scores commonly used to easily understand how many standard deviations a score falls from the mean; also used to interpret scores across various metrics (e.g., T-score, standard score, scaled scores)
Percentile rank	Scores used to reflect an individual score’s relative standing within a data set

With each standard score, it can be helpful to understand the mean (or average) for that score, as well as the standard deviation. The standard deviation indicates the variability of scores across the normal distribution (Cohen, 2013). With any measurement, we would expect there to be a level of variability within and across participants; the standard deviation indicates how far typical scores deviate from the mean. The “empirical rule” (Rakrak, 2025) for interpreting standard deviation within psychological assessments states the following:

Approximately 68% of the population falls within one standard deviation above and below the mean
Approximately 95% of the population falls within two standard deviations of the mean
Approximately 99.7% of the population falls within three standard deviations of the mean
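For readers who want to see where these percentages come from, the short Python sketch below recomputes them from the standard normal distribution. It assumes scores are exactly normally distributed, which real test data only approximate; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The printed values come out to roughly 68.3%, 95.4%, and 99.7%, which is why the rule is usually quoted as 68, 95, and 99.7.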

Among the most well-studied psychological applications of the bell curve is the measurement of intellectual functioning (IQ). IQ tests are generally designed to follow a normal distribution, and data are reported as standard scores in which the mean is 100 and the standard deviation is 15. This means that, on well-established measures of IQ, approximately 68% of the population scores between 85 and 115. In other words, individuals who score between 85 and 115 fall within one standard deviation above or below the mean. Outliers or atypicality can be noted when a score falls outside of the average range; standard deviations tell us how far outside of the average range a score falls. Most published measures additionally provide specific qualitative descriptors within testing manuals or automated score reports, which also aid in interpretation; these descriptors may vary slightly across instruments. Table 2 reflects a sample of qualitative descriptors that may be used to interpret standard scores on norm-referenced assessments.

Table 2

Qualitative Interpretation of Standard Scores
Standard Score	Qualitative Descriptor
130+	Significantly Above Average
115-130	Above Average
85-115	Average
70-85	Below Average
<70	Significantly Below Average

Note: Standard Score: Mean = 100; Standard deviation = 15
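A lookup like the one in Table 2 can be written as a small Python function. Note that the ranges in the table share their endpoints (for example, 115 appears in two rows), so the sketch below has to assume how boundary scores are handled: here 85 and 115 are treated as Average, 70 as Below Average, and 130 as Significantly Above Average. That boundary rule is an assumption made for illustration; a test manual's own descriptors should take precedence.

```python
def describe_standard_score(score):
    """Return a Table 2 style descriptor for a standard score (mean 100, SD 15).

    Boundary handling is an assumption of this sketch, not a published rule.
    """
    if score >= 130:
        return "Significantly Above Average"
    if score > 115:
        return "Above Average"
    if score >= 85:
        return "Average"
    if score >= 70:
        return "Below Average"
    return "Significantly Below Average"


for example_score in (132, 118, 100, 78, 64):
    print(example_score, describe_standard_score(example_score))
```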

While standard scores with a mean of 100 and a standard deviation of 15 are largely utilized in psychological metrics such as measures of intelligence, adaptive functioning, academic achievement, and expressive and receptive language, there are many other derived scores reported through standardized psychological assessments. Even within a given instrument, a combination of different standardized scores may be reported.

T-scores represent another commonly reported standardized test score used to determine where an individual score falls according to the normal distribution. T-scores are also used to compare test scores across various instruments. T-scores have a mean of 50 and a standard deviation of 10 (Cohen, 2013); T-scores between 40 and 60 are generally considered to fall within the average range. Whereas higher scores on assessments of intellectual or adaptive functioning are reflective of higher, or advanced, functioning, it is important to attend to the context and construct that is being reported through the T-score. For example, an above average score on a measure of hyperactivity likely reflects concern for elevated behavioral symptomology, while an above average score on a measure of adaptability may represent an area of behavioral strength. Table 3 reflects an example of qualitative descriptors that may be used to interpret a T-score, though specific descriptors will vary by instrument and are generally published within the test manual.

Table 3

Qualitative Interpretation of T-scores
T-score	Qualitative Descriptor
70+	Significantly Above Average
60-70	Above Average
40-60	Average
30-40	Below Average
<30	Significantly Below Average

Note: T-score: Mean = 50; Standard deviation = 10

Another standardized score often found within psychological measurement is the scaled score. Scaled scores reflect performance on a specific subtest of an assessment, whereas a standard score reflects performance on a broader domain that is often made up of various subtests. For example, on a measure of intellectual functioning, a Nonverbal IQ may be reported as a standard score, which is derived from performance on various subtests that measure nonverbal intelligence; the subtests are often reported as scaled scores. Scaled scores have a mean of 10 and a standard deviation of 3. Examples of qualitative descriptors for scaled scores can be found in Table 4, though minor variability in descriptions may be noted across instruments.

Table 4

Qualitative Interpretation of Scaled Scores
Scaled score	Qualitative Descriptor
16+	Significantly Above Average
13-16	Above Average
7-13	Average
4-7	Below Average
<4	Significantly Below Average

Note: Scaled score: Mean = 10; Standard deviation = 3

Z-scores, referred to by some as the “Swiss army knife” of psychological assessment, allow scores to be converted to a common standardized metric, permitting easy comparison across various distributions. For example, to see how an IQ score (reported as a standard score) compares to ASD symptomology on an ASD rating scale (reported as a T-score), both scores can be converted to z-scores and evaluated according to the same metric. Some practicing psychologists prefer to present z-scores for all scores obtained within a psychological assessment, to minimize confusion in interpreting the various scores that may be reported within a psychological evaluation report. By utilizing z-scores, readers can easily detect patterns in comparing an individual’s performance to the normative sample, regardless of the specific instrument that was used. Z-scores have a mean of 0 and a standard deviation of 1; therefore, a positive z-score indicates a score that is above the mean, while a negative z-score reflects a score that falls below the mean (Cohen, 2013). For example, a z-score of +2.0 would be significantly above average and indicates that the data point is two standard deviations above the mean; a z-score of -2.0 would be significantly below average and indicates that the data point falls two standard deviations below the mean.
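The conversion to z-scores is simple arithmetic: subtract the scale's mean and divide by its standard deviation. The Python sketch below applies this to the three score types discussed in this article, using the means and standard deviations stated above. The example scores and the names `SCALES`, `to_z`, and `from_z` are hypothetical and chosen only for illustration.

```python
# (mean, standard deviation) for each score type described in this article.
SCALES = {
    "standard": (100.0, 15.0),
    "t": (50.0, 10.0),
    "scaled": (10.0, 3.0),
}


def to_z(score, scale):
    """Convert a score on the named scale to a z-score."""
    mean, sd = SCALES[scale]
    return (score - mean) / sd


def from_z(z, scale):
    """Convert a z-score back onto the named scale."""
    mean, sd = SCALES[scale]
    return mean + z * sd


# Hypothetical scores: a standard score of 85 and a rating-scale T-score of 70.
print(to_z(85, "standard"))  # -1.0, one SD below the mean
print(to_z(70, "t"))         # 2.0, two SDs above the mean
```

Once both scores are on the z-score metric, it is easy to see that the second score is further from its mean than the first, even though the original numbers (85 and 70) are not directly comparable.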

Another widely utilized score is the percentile rank, which also allows us to understand how a person’s performance compares to others within their normative group (e.g., age, gender) (Cohen, 2013). A percentile rank of 50 indicates that the individual score is higher than 50% of the scores in the norming sample, with the 50th percentile marking the center of the distribution, where scores cluster most densely. Percentile ranks are generally well understood among non-professional audiences, as they are commonly reported in settings such as pediatricians’ offices, where a child’s weight and height are plotted on a growth chart according to age and gender, or within educational settings in which state-wide academic achievement tests are conducted. However, as with all standard scores, there are limitations in relying on percentile rank as the primary method of interpreting data, particularly within research. Specifically, percentile ranks do not progress in equal intervals across the bell curve and cannot be interpreted across various instruments. Additionally, percentile ranks don’t tell us how far the individual score deviates from the mean, and it is difficult to know whether the difference that is observed is of clinical or statistical significance, or simply reflective of measurement error or chance. To determine statistical significance, a percentile rank often needs to be converted into another standard score (e.g., a z-score) to determine how far the individual score deviates from the mean. While percentile ranks essentially tell us where you stand in line, they are generally not utilized in research.
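The point about unequal intervals can be made concrete with a short Python sketch. It assumes scores follow an exact normal distribution with a mean of 100 and a standard deviation of 15, which is an idealization; published tests derive percentile ranks from their own norm tables. The function name is illustrative.

```python
from statistics import NormalDist


def percentile_rank(score, mean=100.0, sd=15.0):
    """Approximate percentile rank of a score under an assumed normal distribution."""
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(score)


# The same 5-point gap in standard scores, near the mean and far from it.
near_mean_gap = percentile_rank(105) - percentile_rank(100)
in_tail_gap = percentile_rank(135) - percentile_rank(130)

print(f"100 -> 105: {near_mean_gap:.1f} percentile points")
print(f"130 -> 135: {in_tail_gap:.1f} percentile points")
```

Under these assumptions, a 5-point difference near the mean spans about 13 percentile points, while the same 5-point difference in the upper tail spans only a little over 1. This is why a change in percentile rank cannot be read as a fixed amount of change in the underlying score.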

Vignette: Part 2

The parent comes across a research article about early intensive behavioral intervention (EIBI), an ABA-based model of intervention aimed at improving functioning in young children with ASD (Eikeseth et al., 2012). In the research article, several tools, including the Vineland Adaptive Behavior Scales (VABS) and the Childhood Autism Rating Scale (CARS), are described in the results section. The parent references their child’s recent diagnostic report and notes that these same tools were used in the diagnostic evaluation. Further, a review of their child’s scores reveals similarities to those described in the study. With this information, the parent has increased confidence as they consider pursuing EIBI for their child.

Tools for Understanding Instruments Described in Autism Research

Given the vast number of metrics that can be used to measure skills, performance, abilities, and behaviors within autism research, it would not be feasible for even the most expert of readers to be familiar with every instrument described within a research study. While familiarity with commonly utilized assessment tools and the corresponding scores reported in research can ease readability, readers are also encouraged to utilize additional tools and resources, such as artificial intelligence (AI), when presented with an unfamiliar measurement tool. Even the most proficient autism evaluators are likely to encounter measures described in the literature with which they are not familiar. The good news? A lack of familiarity with any given instrument does not preclude the ability to be a well-informed consumer of scientific literature. For example, in a study in which participants are characterized using a measure of executive functioning, such as the Behavior Rating Inventory of Executive Function – Second Edition (BRIEF-2), the article may report participants’ T-scores. If readers are unfamiliar with the BRIEF-2, they may use a generative AI tool, such as ChatGPT, to help interpret the results and to learn more about the instrument’s purpose and limitations. Figure 3 shows an example of the type of information about the BRIEF-2 that can be gathered through AI (ChatGPT, personal communication, March 4, 2025).

Figure 3

Example of ChatGPT Output on an Assessment Tool

Utilization of tools such as AI can increase readers’ understanding of scientific literature and may allow the reader to more easily determine the relevance of the methods and findings to the individual, population, or context of interest.

Conclusion

As readers set out to be well-informed consumers of scientific literature, tools such as the tables, figures, and resources provided in this article may be useful, regardless of the reader’s level of expertise. Characterization tools and outcome measures inform how research is understood and applied. Specifically, an understanding of the characterization tools used within a study allows readers to ascertain the relevance of the study’s findings to diverse autistic populations (e.g., those with or without intellectual or language impairments). Readers, including parents, educators, or practitioners, may then be able to determine whether the intervention is relevant to their child, student, or client based on characteristics shared with the research participants. Misunderstanding of participant characterization tools may result in overgeneralization or inappropriate application of the interventions described in the study. Additionally, familiarity with the outcome measures described in scientific literature helps readers to judge whether the reported outcomes are meaningful, reliable, and applicable to other populations and settings. Finally, because autism research is highly heterogeneous, with a vast number of potential constructs to be studied and reported, familiarity with characterization and outcome tools can allow readers to make informed decisions when comparing, interpreting, and synthesizing the findings across multiple studies. In sum, instruments reported in the scientific autism literature directly influence the interpretation and impact of research; readers’ ability to understand these instruments allows them to engage with the scientific literature effectively and responsibly.

References

Bayley, N. (2019). Bayley Scales of Infant and Toddler Development (4th ed.). Pearson.

Cohen, B. H. (2013). Explaining psychological statistics (4th ed.). Wiley.

Eikeseth, S., Klintwall, L., Jahr, E., & Karlsson, P. (2012). Outcome for children with autism receiving early and intensive behavioral intervention in mainstream preschool and kindergarten settings. Research in Autism Spectrum Disorders, 6(2), 829–835.

Elliott, C. D. (2007). Differential Ability Scales (2nd ed.). Harcourt Assessment.

Gioia, G. A., Isquith, P. K., Guy, S. C., & Kenworthy, L. (2000). Behavior Rating Inventory of Executive Function (BRIEF) professional manual. Psychological Assessment Resources.

Harrison, P. L., & Oakland, T. (2015). Adaptive Behavior Assessment System (3rd ed.). Western Psychological Services.

Lord, C., Rutter, M., DiLavore, P. C., Risi, S., Gotham, K., & Bishop, S. (2012). Autism Diagnostic Observation Schedule (2nd ed.). Western Psychological Services.

Mullen, E. M. (1995). Mullen Scales of Early Learning (AGS ed.). American Guidance Service.

Rakrak, M. (2025). Exploring variability in data: The role of range, variance, and standard deviation. International Journal of Multidisciplinary Research and Analysis, 8(3), 1327–1331. https://doi.org/10.47191/ijmra/v8-i03-47

Roid, G. H. (2003). Stanford–Binet Intelligence Scales (5th ed.). Riverside Publishing.

Schopler, E., Van Bourgondien, M. E., Wellman, G. J., & Love, S. R. (2010). Childhood Autism Rating Scale (2nd ed.). Western Psychological Services.

Sparrow, S. S., Cicchetti, D. V., & Saulnier, C. A. (2016). Vineland Adaptive Behavior Scales (3rd ed.). Pearson Assessments.

Reference for this Article:

Connolly, S. (2025). Science Corner: Interpreting standardized assessment scores in participant characterizations. Science in Autism Treatment, 22(8).
