Star Assessments™ for Reading

Technical Manual
Renaissance Learning
PO Box 8036
Wisconsin Rapids, WI 54495-8036
Telephone: (800) 338-4204
(715) 424-3636
Outside the US: 1.715.424.3636
Fax: (715) 424-4242
Email (general/technical questions): support@[Link]
Email (international support): worldsupport@[Link]
Website: [Link]

Copyright Notice
Copyright © 2024 by Renaissance Learning, Inc. All Rights Reserved.
This publication is protected by US and international copyright laws. It is unlawful to duplicate or reproduce
any copyrighted material without authorization from the copyright holder. This document may be reproduced
only by staff members in schools that have a license for Star Reading software. For more information, contact
Renaissance Learning, Inc., at the address above.
All logos, designs, and brand names for Renaissance’s products and services, including but not limited to
Accelerated Reader, Accelerated Reader Bookfinder, AR, AR Bookfinder, AR Bookguide, Accelerated Math,
Freckle, Lalilo, myIGDIs, myON, myON Classics, myON News, Renaissance, Renaissance Growth Alliance,
Renaissance Growth Platform, Renaissance Learning, Renaissance Place, Renaissance Smart Start,
Renaissance-U, Star Assessments, Star 360, Star CBM, Star Reading, Star Math, Star Early Literacy, Star
Custom, Star Spanish, Schoolzilla, and Renaissance are trademarks of Renaissance Learning, Inc., and its
subsidiaries, registered, common law, or pending registration in the United States. All other product and company
names should be considered the property of their respective companies and organizations.
Macintosh is a trademark of Apple Inc., registered in the U.S. and other countries.
METAMETRICS®, the METAMETRICS® logo and tagline, LEXILE®, the LEXILE® logo, POWERV®,
QUANTILE®, and the QUANTILE® logo are registered trademarks of MetaMetrics, Inc. in the United States and
abroad. Copyright © 2024 MetaMetrics, Inc. All rights reserved.

6/2024 SRRP
Contents

Introduction......................................................... 1
Star Reading: Screening and Progress-Monitoring Assessment............................1
Star Reading Purpose.............................................................................................1
Design of Star Reading...........................................................................................2
Three Generations of Star Reading Assessments...........................................2
Overarching Design Considerations.................................................................3
Improvements Specific to Star Reading Versions 3 and Higher................5
Test Interface..........................................................................................................6
Practice Session.....................................................................................................6
Adaptive Branching/Test Length.............................................................................7
Test Length.......................................................................................................7
Test Repetition........................................................................................................9
Item Time Limits......................................................................................................9
Accessibility and Test Accommodations ........................................................11
Unlimited Time................................................................................................12
Test Security.........................................................................................................12
Split-Application Model...................................................................................12
Individualized Tests........................................................................................12
Data Encryption..............................................................................................12
Access Levels and Capabilities......................................................................13
Test Monitoring/Password Entry.....................................................................13
Final Caveat...................................................................................................13
Test Administration Procedures............................................................................14

Content and Item Development............................. 15


Content Specification: Star Reading.....................................................................15
The Educational Development Laboratory’s Core Vocabulary List:
ATOS Graded Vocabulary List...................................................................18
Content Specification: Star Reading.....................................................................19
Item Development Specifications: Star Reading...................................................19
Vocabulary-in-Context Item Specifications.....................................................19

Authentic Text Passage Item Specifications...................................................20


Reading Skills Item Specifications........................................................................23
Adherence to Skills.........................................................................................23
Level of Difficulty: Readability.........................................................................23
Level of Difficulty: Cognitive Load, Content Differentiation,
and Presentation........................................................................................24
Efficiency in Use of Student Time...................................................................25
Balanced Items: Bias and Fairness ...............................................................25
Accuracy of Content.......................................................................................26
Language Conventions...................................................................................26
Item Components...........................................................................................26
Metadata Requirements and Goals................................................................27
Star Reading and Renaissance Learning Progressions for Reading....................27

Item and Scale Calibration.................................... 31


Background...........................................................................................................31
Calibration of Star Reading Items for Use in Version 2.........................................31
Sample Description...............................................................................................32
Item Presentation..................................................................................................34
Item Difficulty.........................................................................................................36
Item Discrimination...............................................................................................36
Item Response Function.......................................................................................36
Rules for Item Retention.................................................................................38
Scale Calibration and Linking...............................................................................39
Online Data Collection for New Item Calibration...................................................41
Computer-Adaptive Test Design...........................................................................43
Scoring in the Star Reading Tests.........................................................................44
A New Scale for Reporting Star Reading Test Scores..........................................45

Reliability and Measurement Precision................. 48


34-Item Star Reading Tests...................................................................................50
Generic Reliability...........................................................................................50
Split-Half Reliability.........................................................................................51
Alternate Form Reliability...............................................................................52
Standard Error of Measurement.....................................................................54
Decision Accuracy and Decision Consistency................................................57
25-Item Star Reading Progress Monitoring Tests.................................................60

Reliability Coefficients....................................................................................60
Standard Error of Measurement.....................................................................61
Decision Accuracy and Consistency...............................................................63

Validity................................................................. 64
Content Validity.....................................................................................................64
Construct Validity..................................................................................................64
Internal Evidence: Evaluation of Unidimensionality of Star Reading....................65
External Evidence: Relationship of Star Reading Scores to Scores
on Other Tests of Reading Achievement..........................................................68
Relationship of Star Reading Scores to Scores on State Tests
of Accountability in Reading..............................................................................71
Relationship of Star Reading Scores to Scores on Multi-State Consortium
Tests in Reading...............................................................................................72
Meta-Analysis of the Star Reading Validity Data...................................................73
Additional Validation Evidence for Star Reading...................................................74
A Longitudinal Study: Correlations with SAT9................................................74
Concurrent Validity: An International Study of Correlations with
Reading Tests in England...........................................................................75
Construct Validity: Correlations with a Measure of Reading
Comprehension..........................................................................................76
Investigating Oral Reading Fluency and Developing the Estimated
Oral Reading Fluency Scale......................................................................78
Cross-Validation Study Results......................................................................80
Classification Accuracy of Star Reading...............................................................81
Accuracy for Predicting Proficiency on a State Reading Assessment............81
Accuracy for Identifying At-Risk Students.......................................................81
Brief Description of the Current Sample and Procedure..........................82
Disaggregated Validity and Classification Data..............................................83
Evidence of Technical Accuracy for Informing Screening
and Progress Monitoring Decisions...........................................................84
Screening.......................................................................................................85
Progress Monitoring.......................................................................................89
Additional Research on Star Reading as a Progress Monitoring Tool.....92
Differential Item Functioning.................................................................................92
Summary of Star Reading Validity Evidence.........................................................95

Norming................................................................ 96
Background...........................................................................................................96
The 2024 Star Reading Norms.......................................................................96
Sample Characteristics.........................................................................................97
Geographic region..........................................................................................99
School size.....................................................................................................99
Socioeconomic status as indexed by the percent of school
students with free and reduced lunch........................................................99
Test Administration..............................................................................................101
Data Analysis......................................................................................................101
Growth Norms.....................................................................................................103

Score Definitions................................................... 105


Types of Test Scores...........................................................................................105
Enterprise Scale Scores...............................................................................106
Unified Scale Scores....................................................................................106
Grade Equivalent (GE).................................................................................106
Comparing the Star Reading Test with Conventional Tests...................107
Estimated Oral Reading Fluency (Est. ORF)................................................108
Instructional Reading Level (IRL).................................................................108
Special IRL Scores.......................................................................................109
Understanding IRL and GE Scores..............................................................109
Percentile Rank (PR)....................................................................................110
Normal Curve Equivalent (NCE)................................................................... 111
Student Growth Percentile (SGP).................................................................112
Lexile® Measures.........................................................................................113
Lexile Measures of Students and Books: Measures of Student
Reading Achievement and Text Readability .....................................114
Special Star Reading Scores..............................................................................116
Zone of Proximal Development (ZPD)..........................................................116
Grade Placement..........................................................................................116
Indicating the Appropriate Grade Placement.........................................116
Compensating for Incorrect Grade Placements.....................................117

Conversion Tables................................................. 118

Appendix A: Estimated Oral Reading Fluency....... 135

Appendix B: Detailed Evidence of Star Reading


Validity................................................................. 136
Relationship of Star Reading Scores to Scores on Other Tests
of Reading Achievement.................................................................................136
Relationship of Star Reading Scores to Scores on State Tests
of Accountability in Reading............................................................................156
Relationship of Star Reading Enterprise Scores to Scores
on Previous Versions......................................................................................160
Data from Post-Publication Studies....................................................................162
Predictive Validity: Correlations with SAT9 and the California
Standards Tests........................................................................................162
Linking Star and State Assessments: Comparing Student- and
School-Level Data..........................................................................................163
Methodology Comparison.............................................................................163
Student-Level Data................................................................................164
School-Level Data..................................................................................164
Accuracy Comparisons.................................................................................165
Aggregated Classification Accuracy Data.....................................................169
Receiver Operating Characteristic (ROC) Curves
as defined by NCRTI:........................................................................169
Brief Description of the Current Sample and Procedure........................169
Aggregated Validity Data..............................................................................172
Disaggregated Validity and Classification Data............................................172

References............................................................ 174

Index..................................................................... 177

Introduction

Star Reading: Screening and Progress-Monitoring Assessment

Since the 2011–2012 school year, two different versions of Star Reading have
been available for use in assessing the reading achievement of students in
grades K–12. The comprehensive version is a 34-item standards-based adaptive
assessment, aligned to state and national curriculum standards, that takes an
average of less than 20 minutes. A shorter, 25-item version assesses reading
comprehension only, and takes an average of less than 10 minutes, making
it a popular choice for progress monitoring in programs such as Response
to Intervention. Both versions provide immediate feedback to teachers and
administrators on each student’s reading development.

Star Reading Purpose


As a periodic progress-monitoring assessment, Star Reading Progress Monitoring
serves three purposes for students with at least a 100-word sight vocabulary. First,
it provides educators with quick and accurate estimates of reading comprehension
using students’ instructional reading levels. Second, it assesses reading
achievement relative to national norms. Third, it provides the means for tracking
growth in a consistent manner longitudinally for all students. This is especially
helpful to school- and district-level administrators.

The lengthier Star Reading serves similar purposes, but tests a greater breadth of
reading skills appropriate to each grade level. While the Star Reading test provides
accurate normed data like traditional norm-referenced tests, it is not intended to
be used as a “high-stakes” test. Generally, states are required to use high-stakes
assessments to document growth, adequate yearly progress, and mastery of
state standards. These high-stakes tests are also used to report end-of-period
performance to parents and administrators or to determine eligibility for promotion
or placement. Star Reading is not intended for these purposes. Rather, because
of the high correlation between the Star Reading test and high-stakes instruments,
classroom teachers can use Star Reading scores to fine-tune instruction while
there is still time to improve performance before the regular test cycle. At the same
time, school- and district-level administrators can use Star Reading to predict
performance on high-stakes tests. Furthermore, Star Reading results can easily be
disaggregated to identify and address the needs of various groups of students.


The Star Reading test’s repeatability and flexible administration provide specific
advantages for everyone responsible for the education process:
• For students, Star Reading software provides a challenging, interactive, and
brief test that builds confidence in their reading ability.
• For teachers, the Star Reading test facilitates individualized instruction by
identifying children who need remediation or enrichment most.
• For principals, the Star Reading software provides regular, accurate reports on
performance at the class, grade, building, and district level.
• For district administrators and assessment specialists, it provides a wealth of
reliable and timely data on reading growth at each school and districtwide. It
also provides a valid basis for comparing data across schools, grades, and
special student populations.

This manual documents the suitability of Star Reading computer-adaptive testing
for these purposes and demonstrates quantitatively how well this innovative
instrument in reading assessment performs.

Star Reading is similar in many ways to the Star Reading Progress Monitoring
version, but with some enhanced features, including additional reports and
expanded benchmark management.

Design of Star Reading


Three Generations of Star Reading Assessments
The introduction of the current version of Star Reading in 2011 marked the third
generation of Star Reading assessments. The first generation consisted of Star
Reading version 1, which was a variable-length adaptive assessment of reading
comprehension that employed a single item type: vocabulary-in-context items.
Star Reading’s original item bank contained 800+ such items. Although it was a
breakthrough computer adaptive test, Star Reading 1 was based on classical test
theory.

The second generation consisted of Star Reading versions 2 through 4.4, including
the current Star Reading Progress Monitoring version. This second generation
differed from the first in three major respects: It replaced classical test theory with
Item Response Theory (IRT) as the psychometric foundation for adaptive item
selection and scoring; its test length was fixed at twenty-five items (rather than
the variable length of version 1); and its content included a second item type: the
original vocabulary-in-context items were augmented in grades 3–12 by the use
of longer, authentic text passages for the last 5 items of each test. The second
generation versions differed from one another primarily in terms of the size of their
item banks, which grew to over 2000 items in version 4.4. Like the first generation
of Star Reading tests, the second generation continued to measure a single
construct: reading comprehension.

The third generation is represented by the current version of Star Reading. This
is the first version of Star Reading to be designed as a standards-based test; its
items are organized into 5 blueprint domains, 10 skill sets, 36 general skills, and
over 470 discrete skills—all designed to align to national and state curriculum
standards in reading and language arts, including the Common Core State
Standards. Like the second generation of Star Reading tests, the third generation
Star uses fixed-length adaptive tests. Its tests are longer than the second
generation test—34 items in length—both to facilitate broader standards coverage
and to improve measurement precision and reliability.

Overarching Design Considerations


One of the fundamental Star Reading design decisions involved the choice of
how to administer the test. The primary advantage of using computer software
to administer Star Reading tests is the ability to tailor each student’s test based
on his or her responses to previous items. Conventional assessments, including
paper-and-pencil tests, typically entail fixed test forms: every student must respond
to the same items in the same sequence. Using computer-adaptive procedures,
it is possible for students to test on items that appropriately match their current
level of proficiency. The item selection procedures, termed Adaptive Branching,
effectively customize the test for each student’s achievement level.

Adaptive Branching offers significant advantages in terms of test reliability, testing
time, and student motivation. Reliability improves over fixed-form tests because
the test difficulty is adjusted to each individual’s performance level; students do not
have to fit a “one test fits all” model. Most of the test items that students respond
to are at levels of difficulty that closely match their achievement level. Testing time
decreases because, unlike in paper-and-pencil tests, there is no need to expose
every student to a broad range of material, portions of which are inappropriate
because they are either too easy for high achievers or too difficult for those with
low current levels of performance. Finally, student motivation improves for these
same reasons: test time is minimized and test content is neither too
difficult nor too easy.

Another fundamental Star Reading design decision involved the choice of the
content and format of items for the test. Many types of stimulus and response
procedures were explored, researched, discussed, and prototyped. These
procedures included the traditional reading passage followed by sets of literal or
inferential questions, previously published extended selections of text followed by
open-ended questions requiring student-constructed answers, and several cloze-
type procedures for passage presentation. While all of these procedures can be
used to measure reading comprehension and overall reading achievement, the
vocabulary-in-context format was selected as the primary item format for the first
generation Star Reading assessments. This decision was made for interrelated
reasons of efficiency, breadth of construct coverage, and objectivity and simplicity
of scoring.

Four fundamental arguments support the use of the original Star Reading design
for obtaining quick and reliable estimates of reading comprehension and reading
achievement:

1. The vocabulary-in-context test items, while using a common format for
assessing reading, require reading comprehension. Each test item is a
complete, contextual sentence with a tightly controlled vocabulary level. The
semantics and syntax of each context sentence are arranged to provide clues
as to the correct cloze word. The student must actually interpret the meaning
of (in other words, comprehend) the sentence in order to choose the correct
answer because all of the answer choices “fit” the context sentence either
semantically or syntactically. In effect, each sentence provides a mini-selection
on which the student demonstrates the ability to interpret the correct meaning.
This is, after all, what most reading theorists believe reading comprehension to
be—the ability to draw meaning from text.

2. In the course of taking the vocabulary-in-context section of Star Reading tests,
students read and respond to a significant amount of text. The Star Reading
test typically asks the student to demonstrate comprehension of material that
ranges over several grade levels. Students will read, use context clues from,
interpret the meaning of, and attempt to answer 20 to 25 cloze sentences
across these levels, generally totaling more than 300 words. The student must
select the correct word from sets of words that are all at the same reading
level, and that at least partially fit the sentence context. Students clearly must
demonstrate reading comprehension to correctly respond to these 20 to 25
questions.

3. A child’s level of vocabulary development is a major factor—perhaps the
major factor—in determining his or her ability to comprehend written material.
Decades of reading research have consistently demonstrated that a student’s
level of vocabulary knowledge is the most important single element in
determining the child’s ability to read with comprehension. Tests of vocabulary
knowledge typically correlate better than do any other components of reading
with valid assessments of reading comprehension. In fact, vocabulary tests
often relate more closely to sound measures of reading comprehension than
various measures of comprehension do to each other. Knowledge of word
meaning is simply a fundamental component of reading comprehension.


4. The student’s performance on the vocabulary-in-context section is used to
determine the initial difficulty level of the subsequent authentic text passage
items. Although this section consists of just five items, the accurate entry level
and the continuing adaptive selection process mean that all of the authentic
text passage items are closely matched to the student’s reading ability level.
This results in unusually high measurement efficiency.

The current third-generation tests expand the breadth of item formats and content
beyond that of the previous versions. Each test consists of 34 items; of these, the
first 10 are vocabulary-in-context items, while the last 24 items spiral their content
to include standards-based material from all five blueprint domains.

The introduction of the 34-Item Star Reading version does not replace the
previous version or make it obsolete. The previous version continues to be
available as “Star Reading Progress Monitoring,” the familiar 25-item measure of
reading comprehension. Star Reading thus gives users a choice between a brief
assessment focusing on reading comprehension alone, or a longer, standards-
based assessment which ensures that a broad range of reading skills,
appropriate to student grade level and performance, is included in each
assessment.

For these reasons, the Star Reading test design and item format provide a valid
procedure for assessing a student’s reading comprehension. Data and information
presented in this manual reinforce this.

Improvements Specific to Star Reading Versions 3 and Higher

Versions 3 and 4 are adaptations of version 2 designed specifically for use on a
computer with web access. In versions 3 and higher, all management and test
administration functions are controlled using a management system which is
accessed by means of a computer with web access.

This makes a number of new features possible:


• It makes it possible for multiple schools to share a central database, such as
a district-level database. Records of students transferring between schools
within the district will be maintained in the database; the only information that
needs revision following a transfer is the student’s updated school and class
assignments.
• The same database that contains Star Reading data can contain data on
other Star tests, including Star Early Literacy and Star Math. The Renaissance
program is a powerful information management program that allows you to
manage all your district, school, personnel, parent, and student data in one
place. Changes made to district, school, teacher, parent, and student data for
any of these products, as well as other Renaissance software, are reflected in
every other Renaissance program sharing the central database.
• Multiple levels of access are available, from the test administrator within a
school or classroom to teachers, principals, district administrators, and even
parents.
• Renaissance takes reporting to a new level. Not only can you generate reports
from the student level all the way up to the school level, but you can also limit
reports to specific groups, subgroups, and combinations of subgroups. This
supports “disaggregated” reporting; for example, a report might be specific
to students eligible for free or reduced lunch, to English language learners,
or to students who fit both categories. It also supports compiling reports by
teacher, class, school, grade within a school, and many other criteria such as
a specific date range. In addition, the Renaissance consolidated reports allow
you to gather data from more than one program (such as Star Reading and
Accelerated Reader) at the teacher, class, school, and district level and display
the information in one report.
• Since the Renaissance software is accessed through a web browser, teachers
(and administrators) will be able to access the program from home—provided
the district or school gives them that access.

Test Interface
The Star Reading test interface was designed to be both simple and effective.
Students can use either the mouse or the keyboard to answer questions.
• If using the keyboard, students press one of the four number keys (1, 2, 3, and
4) and then press the Enter key (or the return key on Macintosh computers).
• If using the mouse, students click the answer of choice and then click Next to
enter the answer.
• On a tablet, students tap their answer choice; then, they tap Next.

Practice Session
Star Reading software includes a provision for a brief practice test preceding the
test itself. The practice session allows students to get comfortable with the test
interface and to make sure that they know how to operate it properly. As soon
as a student has answered three practice questions correctly, the program takes
the student into the actual test. As long as they possess the requisite 100-word
vocabulary, even the lowest-level readers should be able to answer the sample
questions correctly. If the student has not successfully answered three items by
the end of the practice session, Star Reading will halt the testing session and tell
the student to ask the teacher for help. It may be that the student cannot read at
even the most basic level, or it may be that the student needs help operating the
interface, in which case the teacher should help the student through the practice
session the next time. Before beginning the next test with the student, the program
will recommend that the teacher assist the student during the practice.

Once a student has successfully passed a practice session, the student will not
be presented with practice items again on a test of the same type taken within the
next 180 days.

Adaptive Branching/Test Length


Star Reading’s branching control uses a proprietary approach somewhat
more complex than the simple Rasch maximum information IRT model. The
Star Reading approach was designed to yield reliable test results for both the
criterion-referenced and norm-referenced scores by adjusting item difficulty to the
responses of the individual being tested while striving to minimize test length and
student frustration.

In order to minimize student frustration, the first administration of the Star Reading
test begins with items that have a difficulty level that is below what a typical
student at a given grade can handle—usually one or two grades below grade
placement. On the average, about 85 percent of students will be able to answer
the first item correctly. Teachers can override this typical value by entering an
even lower Estimated Instructional Reading Level for the student. On the second
and subsequent administrations, the Star Reading test again begins with items
that have a difficulty level lower than the previously demonstrated reading ability.
Students generally have an 85 percent chance of answering the first item correctly
on second and subsequent tests.
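The effect of starting below the student’s expected level can be quantified with the same Rasch model: a first item placed roughly 1.7 logits below the ability estimate yields about an 85 percent chance of a correct response. The following is our own arithmetic illustration, not the operational Star Reading start rule:

```python
import math

def starting_difficulty(theta_estimate, target_p=0.85):
    """Item difficulty at which a student of ability theta_estimate has a
    target_p chance of a correct answer under the Rasch model. Solving
    target_p = 1 / (1 + exp(-(theta - b))) for b gives
    b = theta - ln(target_p / (1 - target_p))."""
    return theta_estimate - math.log(target_p / (1.0 - target_p))

# Example: with an ability estimate of 0.0 logits, the first item would sit
# at about -1.73 logits, noticeably easier than the estimated ability.
b_first = starting_difficulty(0.0)
p_first = 1.0 / (1.0 + math.exp(-(0.0 - b_first)))
print(round(b_first, 2), round(p_first, 2))   # -1.73 0.85
```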

Test Length
Once the testing session is underway, the Star Reading test administers 34 items
(the Star Reading Progress Monitoring test administers 25 items) of varying
difficulty based on the student’s responses; this is sufficient information to obtain a
reliable Scaled Score and to determine the student’s Instructional Reading Level.

The length of time needed to complete a Star Reading test varies across students.

Table 1 provides an overview of the testing time by grade for the students who
took the full-length 34-item version of Star Reading during the 2018–2019 school
year. The results of the analysis of test completion time indicate that half or more
of students completed the test in less than 20 minutes, depending on grade, and
even in the slowest grade (grade K) 95% of students finished their Star Reading
test in less than 34 minutes.

Table 2 provides an overview of the Star Reading Progress Monitoring testing
time by grade for students, using data from the 2017–2018 and 2018–2019
school years. For that version of the test, about half of the students at every grade
completed the Star Reading Progress Monitoring test in less than 10 minutes, and
even in the slowest grade (grade 1) 95 percent of students finished in less than 18
minutes.

Table 1: Average and Percentiles of Total Time to Complete the 34-item Star Reading Assessment
During the 2018–2019 School Year

                              Time to Complete Test (in Minutes)

Grade   Sample Size   Mean    Standard    5th          50th         95th         99th
                              Deviation   Percentile   Percentile   Percentile   Percentile
K       77,319        18.39   8.16        8.73         16.68        34.00        43.83
1       1,734,368     18.84   7.38        9.08         17.75        32.50        40.52
2       3,574,122     19.17   6.57        9.80         18.42        31.08        37.63
3       4,047,336     18.78   5.22        10.47        18.55        27.75        31.60
4       3,872,024     19.75   5.21        11.18        19.65        28.55        32.12
5       3,758,949     19.63   5.01        11.42        19.53        28.07        31.60
6       2,827,076     19.59   4.89        11.48        19.53        27.75        31.13
7       2,190,539     19.33   4.83        11.25        19.30        27.35        30.67
8       2,063,913     19.13   4.80        11.12        19.12        27.10        30.45
9       914,315       18.92   4.87        10.68        18.93        26.95        30.27
10      724,030       18.51   4.90        10.35        18.48        26.67        30.12
11      448,315       18.25   4.98        10.02        18.22        26.55        30.03
12      275,495       17.95   5.12        9.70         17.85        26.58        30.18


Table 2: Average and Percentiles of Total Time to Complete the 25-item Star Reading Progress
Monitoring Assessment During the 2017–2018 and 2018–2019 School Years

                              Time to Complete Test (in Minutes)

Grade   Sample Size   Mean    Standard    5th          50th         95th         99th
                              Deviation   Percentile   Percentile   Percentile   Percentile
1       10,260        10.27   3.55        5.62         9.67         17.15        20.37
2       31,898        9.35    2.87        5.60         8.83         14.85        17.75
3       33,128        9.67    2.57        5.95         9.38         14.43        16.67
4       31,340        9.48    2.46        5.93         9.20         13.98        16.33
5       28,656        9.35    2.47        5.82         9.03         13.93        16.00
6       14,980        9.02    2.42        5.65         8.68         13.52        16.07
7       10,196        8.71    2.29        5.57         8.40         12.95        15.10
8       10,232        8.59    2.33        5.45         8.25         12.93        15.67
9       1,800         8.55    2.34        5.45         8.13         13.14        15.31
10      1,451         8.11    2.07        5.32         7.78         11.85        14.12
11      738           8.00    2.10        5.32         7.62         12.18        14.12
12      483           7.92    2.09        5.30         7.67         11.97        14.93

Test Repetition
Star Reading score data can be used for multiple purposes such as screening,
placement, planning instruction, benchmarking, and outcomes measurement. The
frequency with which the assessment is administered depends on the purpose for
assessment and how the data will be used. Renaissance Learning recommends
assessing students only as frequently as necessary to get the data needed.
Schools that use Star for screening purposes typically administer it two to five
times per year. Teachers who want to monitor student progress more closely or
use the data for instructional planning may use it more frequently. Star Reading
may be administered monthly for progress monitoring purposes, and as often as
weekly when needed.

Star Reading keeps track of the questions presented to each student from test
session to test session and will not ask the same question more than once in any
120-day period.
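A minimal sketch of that repetition rule follows; the data structures and names are hypothetical (the operational system presumably tracks item exposure per student in its database), but the filter illustrates the 120-day exclusion window:

```python
from datetime import date, timedelta

def eligible_items(item_pool, exposure_history, today=None, window_days=120):
    """Remove items this student has been asked within the last window_days.

    item_pool: dict mapping item_id -> item data (hypothetical structure).
    exposure_history: dict mapping item_id -> date the item was last
    presented to this student.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    return {
        item_id: item
        for item_id, item in item_pool.items()
        if item_id not in exposure_history or exposure_history[item_id] < cutoff
    }
```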

Item Time Limits


Star Reading tests place no limits on total testing time. However, there are time
limits for each test item. The per-item time limits are generous, and ensure that
more than 90 percent of students can complete each item within the normal time
limits.

Star Reading provides the option of extended time limits for selected students
who, in the judgment of the test administrator, require more than the standard
amount of time to read and answer the test questions.

Extended time may be a valuable accommodation for English language learners
as well as for some students with disabilities. Test users who elect the extended
time limit for their students should be aware that Star Reading norms, as well as
other technical data such as reliability and validity, are based on test administration
using the standard time limits. When the extended time limit accommodation is
elected, students have three times longer than the standard time limits to answer
each question.

Table 3 shows the Star Reading Progress Monitoring version’s test time-out limits
for individual items. These time limits are based on a student’s grade level.

Table 3: Star Reading Progress Monitoring Time-Out Limits

Grade   Question Type                              Standard Time Limit   Extended Time Limit
                                                   (seconds/item)        (seconds/item)
K–2     Practice                                   60                    180
        Test, questions 1–25 [a]                   60                    180
        Skill Test—Practice (Calibration)          60                    180
        Skill Test—Test (Calibration)              60                    180
3–12    Practice                                   60                    180
        Test, questions 1–20 [a]                   45                    135
        Test, questions 21–25 [b]                  90                    270
        Skill Test—Practice (Calibration)          60                    180
        Skill Test—Test (Calibration)              90                    270

a. Vocabulary-in-context items.
b. Authentic text/passage comprehension items.

These time-out values are based on latency data obtained during item validation.
Very few vocabulary-in-context items at any grade had latencies longer than 30
seconds, and almost none (fewer than 0.3 percent) had latencies of more than
45 seconds. Thus, the time-out limit was set to 45 seconds for most students and
increased to 60 seconds for the very young students. Longer time limits were
allowed for the lengthier authentic text passage items.

Table 4 shows time limits for the 34-item Star Reading version’s test questions:


Table 4: Star Reading Time-Out Limits

Grade   Question Type                              Standard Time Limit   Extended Time Limit
                                                   (seconds/item)        (seconds/item)
K–2     Practice                                   60                    180
        Test Section A, questions 1–10 [a]         120                   360
        Test Section B, questions 11–34 [b]        180                   405
3–12    Practice                                   60                    180
        Test Section A, questions 1–10 [a]         105                   315
        Test Section B, questions 11–34 [b]        150                   450

a. Vocabulary-in-context items.
b. Items from 5 domains in 5 blocks, including some vocabulary-in-context.

At all grades, regardless of the extended time limit setting, when a student has
only 15 seconds remaining for a given item, a time-out warning appears, indicating
that he or she should make a final selection and move on. Items that time out
are counted as incorrect responses unless the student has the correct answer
selected when the item times out. If the correct answer is selected at that time, the
item will be counted as a correct response.

If a student doesn’t respond to an item, the item times out and the student briefly
sees a message describing what has happened. Then the next item is
presented. The student does not have an opportunity to retake the item. If a
student doesn’t respond to any item, all items are scored as incorrect.
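The time-out behavior above reduces to two simple rules, sketched here with our own naming as an illustration rather than the actual implementation: a warning appears when 15 seconds remain, and a timed-out item is scored correct only if the correct choice happens to be selected at that moment.

```python
def should_show_warning(seconds_remaining_for_item):
    """The on-screen time-out warning appears when 15 seconds remain."""
    return seconds_remaining_for_item <= 15

def score_timed_out_item(selected_choice, correct_choice):
    """A timed-out item counts as correct only if the correct answer is the
    one currently selected; an item with no selection is scored incorrect."""
    return selected_choice is not None and selected_choice == correct_choice
```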

Accessibility and Test Accommodations


The Star Reading test can be accessed in an accessible format that is in
compliance with WCAG 2.1 AA. This format allows for users with different ability
levels to access the test utilizing different modalities, including assistive technology
such as the JAWS screen reader. The content of the item bank is the same as the
traditional item delivery format, although the user interface is modified slightly. A
student will be presented with the WCAG 2.0 AA version of the test after educators
select one of the relevant test accommodations available in that student’s Personal
Needs Profile. Some of the available accommodations are the ability to change
the size of the text or the color contrast, a highlighter, a line reader, an answer
choice eliminator or unlimited time to answer questions. In order to provide the
best experience for students and teachers, the available accommodations could
be modified during the school year.


Unlimited Time
Beginning with the 2022–23 school year, a new preference has been added: the
Accommodations Preference. Among other things, this preference allows teachers
to give students virtually unlimited time to answer questions: 15 minutes for both
practice questions and test questions. When this preference is set, the student will
not see a time-out warning when there are 15 seconds left; however, if there is no
activity at all from the student within 15 minutes of a question first being presented,
the student will be shown a dialog box. The student will have 60 seconds to close
the dialog box and return to the test. If the student does not close the dialog box
within 60 seconds, the student’s current progress on the test will be saved and the
test will be ended (and can be resumed the same way as a paused test).
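The behavior of this preference can be summarized in a small illustrative decision function (our own naming and structure, assumed from the description above, not Renaissance’s code):

```python
def inactivity_action(seconds_since_last_activity, dialog_open_seconds=None):
    """Decide what happens under the unlimited-time accommodation.

    Returns one of "continue", "show_dialog", or "save_and_end":
    - after 15 minutes with no activity on a question, a dialog box appears;
    - if the dialog is not closed within 60 seconds, progress is saved and
      the test ends (it can be resumed like a paused test).
    """
    if dialog_open_seconds is not None:
        return "save_and_end" if dialog_open_seconds > 60 else "continue"
    if seconds_since_last_activity >= 15 * 60:
        return "show_dialog"
    return "continue"
```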

Test Security
Star Reading software includes a number of security features to protect the
content of the test and to maintain the confidentiality of the test results.

Split-Application Model
When students log into Star Reading, they do not have access to the same
functions that teachers, administrators, and other personnel can access. Students
are allowed to take the test, but no other features available in Star Reading are
available to them; therefore, they have no access to confidential information.
When teachers and administrators log in, they can manage student and class
information, set preferences, and create informative reports about student test
performance.

Individualized Tests
Using Adaptive Branching, every Star Reading test consists of items chosen
from a large number of items of similar difficulty based on the student’s estimated
ability. Because each test is individually assembled based on the student’s past
and present performance, identical sequences of items are rare. This feature,
while motivated chiefly by psychometric considerations, contributes to test security
by limiting the impact of item exposure.

Data Encryption
A major defense against unauthorized access to test content and student test
scores is data encryption. All of the items and export files are encrypted. Without
the appropriate decryption code, it is practically impossible to read, access, or
change the Star Reading data with other software.

Access Levels and Capabilities


Each user’s level of access to a Renaissance program depends on the primary
position assigned to that user. Each primary position is part of a user permission
group. There are six of these groups: district level administrator, district dashboard
owner, district staff, school level administrator, school staff, and teacher. By
default, each user permission group is granted a specific set of user permissions;
each user permission corresponds to one or more tasks that can be performed in
the program. The user permissions for these groups can be changed, and user
permissions can be granted or removed on an individual level.

Renaissance also allows you to restrict students’ access to certain computers. This
prevents students from taking Star Reading tests from unauthorized computers
(such as home computers). For more information, see [Link]
com/setup/22509.

The security of the Star Reading data is also protected by each person’s user
name (which must be unique) and password. User names and passwords
identify users, and the program only allows them access to the data and features
that they are allowed based on their primary position and the user permissions
that they have been granted. Personnel who log in to Renaissance (teachers,
administrators, or staff) must enter a user name and password before they
can access the data and create reports. Parents who are granted access to
Renaissance must also log in with a user name and password before they can
access information about their children. Without an appropriate user name and
password, personnel and parents cannot use the Star Reading software.

Test Monitoring/Password Entry


Test monitoring is another useful Star Reading security feature. Test monitoring
is implemented using the Password Requirement preference, which specifies
whether monitors must enter their passwords at the start of a test. Students are
required to enter a user name and password to log in before taking a test. This
ensures that students cannot take tests using other students’ names.

Final Caveat
While Star Reading software can do much to provide specific measures of test
security, the most important line of defense against unauthorized access or misuse
of the program is the user’s responsibility. Teachers and test monitors need to be
careful not to leave the program running unattended and to monitor all testing to
prevent students from cheating, copying down questions and answers, or performing
“print screens” during a test session. Taking these simple precautionary steps will
help maintain Star Reading’s security and the quality and validity of its scores.

Test Administration Procedures


In order to ensure consistency and comparability of results to the Star Reading
norms, students taking Star Reading tests should follow standard administration
procedures. The testing environment should be as free from distractions for the
student as possible.

The Test Administration Manual included with the Star Reading product describes
the standard test orientation procedures that teachers should follow to prepare
their students for the Star Reading test. These instructions are intended for
use with students of all ages; however, the Star Reading test should only be
administered to students who have a reading vocabulary of at least 100 words.
The instructions were successfully field-tested with students ranging from grades
1–8. It is important to use these same instructions with all students before they
take the Star Reading test.

Content and Item Development

Content Specification: Star Reading


The scale and scope of Star Reading content have steadily grown since content
development first started many years ago. Since the original test in 1995, which
was exclusively Vocabulary-in-Context, other item types have been added to test
additional skills, and new item designs continue to be added as state standards
evolve.

Star Reading is based upon the assessment of 36 Blueprint Skills organized
within 5 Blueprint Domains of reading (see Table 5), and maps the progressions
of reading skills and understandings as they develop in sophistication from
kindergarten through grade 12. Each Star item is designed to assess a specific
skill within the test blueprint. The test blueprint is structured to provide a
consistent assessment experience even as state-specific Renaissance Reading
Learning Progressions may change, as well as the set of items associated with
the blueprint. The Star Reading test blueprint is largely fixed. Renaissance may
alter the blueprint if there are data-driven reasons to make a major change to the
content.

For information regarding the development of Star Reading items, see “Item
Development Specifications: Star Reading” on page 19. Before inclusion
in the Star Reading item bank, all items are reviewed to ensure they meet the
content specifications for Star Reading item development. Items that do not
meet the specifications are either discarded or revised for recalibration. All new
item development adheres to the content specifications and all items have been
calibrated using the dynamic calibration method.

The first stage of expanded Star Reading development was to identify the set of skills
to be assessed. Multiple resources were consulted to determine the set of skills
most appropriate for assessing the reading development of K–12 US students.
The resources include but are not limited to:
• Reading Next—A Vision for Action and Research in Middle and High School
Literacy: A Report to Carnegie Corporation of New York © 2004 by Carnegie
Corporation of New York. [Link]
[Link].
• NCTE Principles of Adolescent Literacy Reform, A Policy Research Brief,
Produced by The National Council of Teachers of English, April 2006.
[Link]


• Improving Adolescent Literacy: Effective Classroom and Intervention Practices,
August 2008. [Link]
• Reading Framework for the 2009 National Assessment of Educational
Progress. [Link]
reading/[Link].
• Common Core State Standards Initiative (2010). Common Core State
Standards for English Language Arts & Literacy in History/Social Studies,
Science, and Technical Subjects.
• Individual state standards from all 50 states.

The development of the skills list included iterative reviews by reading and
assessment experts and psychometricians specializing in educational assessment.
See Table 5 for the Star Reading Blueprint Skills List. The skills list is organized
into five blueprint domains:
• Word Knowledge and Skills
• Comprehension Strategies and Constructing Meaning
• Analyzing Literary Text
• Understanding Author’s Craft
• Analyzing Argument and Evaluating Text

The second stage of development includes item development and calibration.
Assessment items are developed according to established specifications for
grade-level appropriateness and then reviewed to ensure the items meet the
specifications. Grade-level appropriateness is determined by multiple factors
including reading skill, reading level, cognitive load, vocabulary grade level,
sentence structure, sentence length, subject matter, and interest level. All writers
and editors have content-area expertise and relevant classroom experience and
use those qualifications in determining grade-level appropriateness for subject
matter and interest level. A strict development process is maintained to ensure
quality item development.

Assessment items, once written, edited, and reviewed, are field tested and
calibrated to estimate their Rasch difficulty parameters and goodness of fit to the
model. Field testing and calibration are conducted in a single step. This dynamic
calibration method is done by embedding new items in appropriate, random
positions within the Star assessments to collect the item response data needed
for psychometric evaluation and calibration analysis. Following these analyses,
each assessment item—along with both traditional and Item Response Theory
(IRT) analysis information (including fit plots) and information about the test level,
form, and item identifier—is stored in an item statistics database. A panel of
content reviewers then examines each item within the proper context, to determine
whether the item meets all criteria for use in an operational assessment.
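The embedding step of dynamic calibration can be sketched as follows. The counts, names, and data structures are illustrative assumptions, not the operational specification; the point is that new, unscored items are placed at random positions within an otherwise ordinary test so that response data can be collected for Rasch calibration:

```python
import random

def embed_field_test_items(operational_items, field_test_pool, n_embedded=2, seed=None):
    """Build a test form with a few uncalibrated field-test items embedded
    at random positions among the calibrated, scored items.

    operational_items: ordered list of calibrated items (these are scored).
    field_test_pool: new items needing response data (these are not scored).
    """
    rng = random.Random(seed)
    form = [{"item": item, "scored": True} for item in operational_items]
    picks = rng.sample(field_test_pool, k=min(n_embedded, len(field_test_pool)))
    for new_item in picks:
        position = rng.randint(0, len(form))   # any slot, including the end
        form.insert(position, {"item": new_item, "scored": False})
    return form
```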

Table 5: Star Reading Assessment Organization: Star Reading Blueprint Domains, Skill Sets, and
Skills

Blueprint Domain: Word Knowledge and Skills
  Skill Set: Vocabulary Strategies
    • Use context clues
    • Use structural analysis
  Skill Set: Vocabulary Knowledge
    • Recognize and understand synonyms
    • Recognize and understand homonyms and multi-meaning words
    • Recognize connotation and denotation
    • Understand idioms
    • Understand analogies

Blueprint Domain: Comprehension Strategies and Constructing Meaning
  Skill Set: Reading Process Skills
    • Make predictions
    • Identify author’s purpose
    • Identify and understand text features
    • Recognize an accurate summary of text
  Skill Set: Constructing Meaning
    • Understand vocabulary in context
    • Draw conclusions
    • Identify and understand main ideas
    • Identify details
    • Extend meaning and form generalizations
    • Identify and differentiate fact and opinion
  Skill Set: Organizational Structure
    • Identify organizational structure
    • Understand cause and effect
    • Understand comparison and contrast
    • Identify and understand sequence

Blueprint Domain: Analyzing Literary Text
  Skill Set: Literary Elements
    • Identify and understand elements of plot
    • Identify and understand setting
    • Identify characters and understand characterization
    • Identify and understand theme
    • Identify the narrator and point of view
  Skill Set: Genre Characteristics
    • Identify fiction and nonfiction, reality and fantasy
    • Identify and understand characteristics of genres

Blueprint Domain: Understanding Author’s Craft
  Skill Set: Author’s Choices
    • Understand figurative language
    • Understand literary devices
    • Identify sensory detail

Blueprint Domain: Analyzing Argument and Evaluating Text
  Skill Set: Analysis
    • Identify bias and analyze text for logical fallacies
    • Identify and understand persuasion
  Skill Set: Evaluation
    • Evaluate reasoning and support
    • Evaluate credibility

An Example of Star Reading Item Adherence to a Specific Skill within Star Reading Blueprint Structure

Blueprint Domain: Analyzing literary text


Blueprint Skill Set: Literary Elements
Blueprint Skill: Identify characters and understand characterization
Grade-level subskill statements:
2nd grade: Describe major and minor characters and their traits using key details.
3rd grade: Identify and describe main characters' traits, motives, and feelings.
4th grade: Describe characters, interactions with other characters, and the relationship between actions, traits, and motives.

3rd Grade Star Reading Item

Ajay likes being the youngest child in his family. His two older brothers look after him. Before he goes to sleep, they tell him adventure stories. Ajay always falls asleep before the stories are over. The stories will be continued the next night.

How does Ajay feel about his brothers?

1. He wants to get bigger so he can play with them.
2. He likes that they look after him and tell him stories.
3. He wishes their stories didn't keep him awake.

The Educational Development Laboratory's Core Vocabulary List: ATOS Graded Vocabulary List
The original point of reference for the development of Star Reading items was the
1995 updated vocabulary lists that are based on the Educational Development
Laboratory’s (EDL) A Revised Core Vocabulary (1969) of 7,200 words. The
EDL vocabulary list is a soundly developed, validated list that is often used by
developers of educational instruments to create all types of educational materials
and assessments. It categorizes hundreds of vocabulary words according to grade
placement, from primer (pre-grade 1) through grade 13 (post-high school). This
was exactly the span desired for the Star Reading test.

Beginning with new test items introduced in version 4.3, Star Reading item
developers have used ATOS instead of the EDL word list. ATOS is a system for
evaluating the reading level of continuous text; it contains over 125,000 words in
its graded vocabulary list. This readability formula was developed by Renaissance
Learning, Inc., and designed by leading readability experts. ATOS is the first
formula to include statistics from actual student book reading.

Content Specification: Star Reading


The item bank for Star Reading has been expanding steadily since the original product launch and continues to grow today. Content development is driven by the test design and test purposes, which are to measure comprehension and general reading achievement. Based on the test purpose, the desired content had to meet certain criteria. First, it had to cover a range broad enough to test students from grades K–12; thus, items had to represent reading levels ranging from kindergarten through post-high school. Second, the collection of test items had to be large enough that students could test often without being given the same items twice.

The current item bank for Star Reading contains over 6,000 items.

Item Development Specifications: Star Reading


During item development, every effort is made to avoid the use of stereotypes,
potentially offensive language or characterizations, and descriptions of people
or events that could be construed as being offensive, demeaning, patronizing, or
otherwise insensitive. The editing process also includes a strict sensitivity review
of all items to attend to issues of gender and ethnic-group balance and fairness.

Vocabulary-in-Context Item Specifications


Each of the vocabulary items is written to the following specifications:

1. Each vocabulary-in-context test item consists of a single-context sentence. This sentence contains a blank indicating a missing word. Three or four possible answers are shown beneath the sentence. For questions developed at a kindergarten or first-grade reading level, three possible answers are given. Questions at a second-grade reading level and higher offer four possible answers.

2. To answer the question, the student selects the word from the answer choices
that best completes the sentence. The correct answer option is the word that
appropriately fits both the semantics and the syntax of the sentence. All of
the incorrect answer options either fit the syntax of the sentence or relate to
the meaning of something in the sentence. They do not, however, meet both
conditions.

3. The answer blanks are generally located near the end of the context sentence
to minimize the amount of rereading required.

4. The sentence provides sufficient context clues for students to determine the
appropriate answer choice. However, the length of each sentence varies
according to the guidelines shown in Table 6.

5. Typically, the words that provide the context clues in the sentence are below
the level of the actual test word. However, due to a limited number of available
words, not all of the questions at or below grade 2 meet this criterion—but even
at these levels, no context words are above the grade level of the item.

6. The correct answer option is a word selected from the appropriate grade
level of the item set. Incorrect answer choices are words at the same
grade level or one grade below. Through vocabulary-in-context test items,
Star Reading requires students to rely on background information, apply
vocabulary knowledge, and use active strategies to construct meaning from the
assessment text. These cognitive tasks are consistent with what researchers
and practitioners describe as reading comprehension.

Table 6: Maximum Sentence Length per Item Grade Level

Item Grade Level   Maximum Sentence Length (Including Sentence Blank)
Kindergarten and Grade 1 10 words

Grades 2 and 3 12 words

Grades 4–6 14 words

Grades 7–13 16 words

Authentic Text Passage Item Specifications


Authentic text items are used exclusively as an element of the Star Reading Progress Monitoring test. Authentic text passage items are passages of extended text administered to students at grade levels 3–13; to support students receiving items at grade levels K–3, some original passages were written. Authentic text items were developed by identifying authentic texts, extracting appropriate passages, and creating cloze-type questions and answers. Each passage comprises content that can stand alone as a unified, coherent text. Items were selected that assess passage-level, not merely sentence-level, understanding.
To answer the item correctly, the student needs to have a general understanding of
the context and content of the passage, not merely an understanding of the specific
content of the sentence.

The first authentic passages in Star Reading were extracted from children’s and
young adult literature, from nonfiction books, and from newspapers, magazines,
and encyclopedias. Passages were selected from combinations of three primary
categories for school-age children: popular fiction, classic fiction, and nonfiction.
Overall Flesch-Kincaid readability estimates of the source materials were used as
initial estimates of grade-level difficulty.

After the grade-level difficulty of a passage was estimated, the passage was
searched for occurrences of Educational Development Laboratory (EDL) words
at the same grade level difficulty. When an EDL word was found that, if replaced
with a blank space, would make the passage a good cloze passage, the passage
was extracted for use as an authentic text passage test item. Approximately 600
authentic text passage items were initially developed.

Each of the items in the resulting pool was then rated according to several criteria
in order to determine which items were best suited for inclusion in the tryout and
calibration. Three educators rated each item on the following criteria:
X Grade-level appropriateness of the text
X Cohesiveness of the passage
X Suitability of the passage for its grade level in terms of vocabulary
X Suitability of the passage for its grade level in terms of content density

To ensure a variety of authentic text passage items on the test, each passage was
also placed in one of the following categories, according to Meyer and Rice:

1. Antecedent-consequence—causal relationships are found between sentences.

2. Response—a question-answer or a problem-solving format.

3. Comparison—similarities and differences between sentences are found.

4. Collection—sentences are grouped together based on some common idea or event. This would include a sequence of events.

5. Description—sentences provide information by explaining, giving specific attributes of the topic, or elaborating on the setting.

Replacement passages and newly created items intended for use in versions 4.3
and later were extracted primarily from Accelerated Reader (AR) books. (Updated
content specifications were used for writing the new and replacement Star
Reading items in version 4.3.) Target words were selected in advance (based on
the average ATOS level of target words within a range of difficulty levels). Texts of
AR books with the fewest quiz requests were run through a text-analysis tool to find instances of the target words' use. This was done to decrease the possibility that students had already encountered an excerpt.

Consideration was given to including some passages from the public domain. When
necessary, original long items were written. In any case, passages excerpted or
adapted are attributed in “Item and Scale Calibration” on page 31.

Each of the authentic text passage items is written to the following specifications:

1. Each authentic text passage test item consists of a paragraph. The second half
of the paragraph contains a sentence with a blank indicating a missing word.
Four possible answers are shown beneath the sentence.

2. To answer the question, the student selects the word from the list of answer
choices that best completes the sentence based on the context of the
paragraph. The correct answer choice is the word that appropriately fits
both the semantics and the syntax of the sentence, and the meaning of the
paragraph. All of the incorrect answer choices either fit the syntax of the
sentence or relate to the meaning of the paragraph.

3. The paragraph provides sufficient context clues for students to determine the
appropriate answer choice. Average sentence length within the paragraphs is
8–16 words depending on the item’s grade level. Total passage length ranges
from 27–107 words, based on the average reading speed of each grade level,
as shown in Table 7.

Table 7: Authentic Text Passage Length


Grade   Average Reading Speed (Words/Minute)   Passage Length (Approximate Number of Words)
1 80 30
2 115 40
3 138 55
4 158 70
5–6 173, 185 80
7–9 195, 204, 214 90
10–12 224, 237, 250 100

4. Answer choices for authentic text passage items are EDL Core Vocabulary or
ATOS words selected from vocabulary levels at or below that of the correct
response. The correct answer for a passage is a word at the targeted level of
the item. Incorrect answers are words or appropriate synonyms at the same
EDL or ATOS vocabulary level or one grade below.

Reading Skills Item Specifications


Valid item development is contingent upon several interdependent factors. The
following section outlines the factors which guide Star Reading item content
development. Item content is comprised of stems, answer choices, and short
passages. Additional, detailed information may be found in the English Language
Arts Content Appropriateness Guidelines and Item Development Guidelines
outlined in the content specification.

Adherence to Skills
Star Reading assesses more than 600 grade-specific skills within the Renaissance
Core Progress for Reading Learning Progression. Item development is skill-specific.
Each item in the item bank is developed for and clearly aligned to one skill. An item
meets the alignment criteria if the knowledge and skill required to correctly answer
the item match the intended knowledge and skill being assessed. Answering an item
correctly does not require reading skill knowledge beyond the expected knowledge
for the skill being assessed. Star Reading items include only the information and text
needed to assess the skill.

Level of Difficulty: Readability


Readability is a primary consideration for level of item difficulty. Readability
relates to the overall ease of reading a passage and items. Readability involves
the reading level, as well as the layout and visual impact of the stem, passage/
support information/graphics, and the answer choices. Readability in Star item
development accounts for the combined impact, including intensity and density, of
each part of the item, even though the individual components of the item may have
different readability guidelines.

The reading level and grade level for individual words are determined by ATOS.
Item stems and answer choices present several challenges to accurately
determining reading level. Items may contain discipline-specific vocabulary that is typically above grade level but may still be appropriate for the item; examples include words such as summary, paragraph, and organized. Answer choices may be incomplete sentences, for which it is difficult to determine an accurate reading grade level. These factors are taken into account when determining reading level.

Item stems and answer choices that are complete sentences are written for the
intended grade level of the item. The words in answer choices and stems that are
not complete sentences are within the designated grade-level range. Reading
comprehension is not complicated by unnecessarily difficult sentence structure
and/or vocabulary.

Items and passages are written at grade level. Table 8 indicates the GLE range, item word count range, maximum sentence length, and vocabulary constraints for each grade level.

One exception exists for the reading skill use context clues. For those items, the
target word will be one grade level above the designated grade of the item.

Table 8: Readability Guidelines Table

Grade   GLE Range   Item Word Count Range   Maximum Sentence Length   Number of Words 1 Grade Above (per 100)   Number of Unrecognized Words
K Less than 30 < 10 0 As a rule, the only
unrecognized words will be:
names, common derivatives,
etc.
1 30 10 0
2 1.8–2.7 40 Up to 12 0
3 2.8–3.7 Up to 55 Up to 12 0
4 3.8–4.7 Up to 70 Up to 14 0
5 4.8–5.7 Up to 80 Up to 14 In grade 5 and above, only 1
and only when needed.

6 5.8–6.7 Up to 80 Up to 14 1
7 6.8–7.7 Up to 90 Up to 16 1
8 7.8–8.7 Up to 90 Up to 16 1
9 8.8–9.7 Up to 90 Up to 16 1
10–12 9.8–10.7 Up to 100 Up to 16 1

Level of Difficulty: Cognitive Load, Content Differentiation, and Presentation
In addition to readability, each item is constructed with consideration to cognitive
load, content differentiation, and presentation as appropriate for the ability and
experience of a typical student at that grade level.
X Cognitive Load: Cognitive load involves the type and amount of knowledge
and thinking that a student must have and use in order to answer the
item correctly. The entire impact of the stem and answer choices must be
considered.
X Content Differentiation: Content differentiation involves the level of detail
that a student must address to correctly answer the item. Determining and/
or selecting the correct answer should not be dependent on noticing subtle differences in the stem or answer choices.
X Depth of Knowledge: Depth of Knowledge is a language system used as
an evaluative tool for differentiating among the different levels, 1 through 4,
of complexity of specific learning expectations. Items are written to engage
students at the targeted depth of knowledge identified for each skill within the
assessment.
X Presentation: The presentation of the item includes consistent placement of
item components, including directions, stimulus components, questions, and
answer choices. Each of these should have a typical representation for the
discipline area and grade level. The level of visual differentiation needed to
read and understand the item components must be grade-level appropriate.

Efficiency in Use of Student Time


Efficiency is evidenced by a good return of information in relation to the amount
of time the student spends on the item. The action(s) required of the student
are clearly evident. Ideally, the student is able to answer the question without
reading the answer choices. Star Reading items have clear, concise, precise, and
straightforward wording.

Balanced Items: Bias and Fairness


Item development meets established demographic and contextual goals that are
monitored during development to ensure the item bank is demographically and
contextually balanced. Goals are established and tracked in the following areas:
use of fiction and nonfiction text, subject and topic areas, geographic region,
gender, ethnicity, occupation, age, and disability.
X Items are free of stereotyping, representing different groups of people in non-
stereotypical settings.
X Items do not refer to inappropriate content that includes but is not limited
to content that presents stereotypes based on ethnicity, gender, culture,
economic class, or religion.
X Items do not present any ethnicity, gender, culture, economic class, or religion
unfavorably.
X Items do not introduce inappropriate information, settings, or situations.
X Items do not reference illegal activities, sinister or depressing subjects,
religious activities or holidays based on religious activities, witchcraft, or unsafe
activities.

Accuracy of Content
Concepts and information presented in items are accurate, up-to-date, and
verifiable. This includes, but is not limited to, references, dates, events, and
locations.

Language Conventions
Grammar, usage, mechanics, and spelling conventions in all Star Reading items
adhere to the rules and guidelines in the approved content reference books.
Merriam-Webster's 11th Edition is the reference for pronunciation and spelling.
The Chicago Manual of Style 17th Edition is the anchor reference for grammar,
mechanics, and usage.

Item Components
In addition to the guidelines outlined above, there are criteria that apply to
individual item components. The guidelines for passages are addressed above.
Specific considerations regarding stem and distractors are listed below.

Item stems meet the following criteria with limited exceptions:


X The question is concise, direct, and a complete sentence. The question is
written so students can answer it without reading the distractors.
X Generally, completion (blank) stems are not used. If a completion stem is necessary (as is the case with vocabulary-in-context skills), the stem contains enough information for the student to complete the stem without reading the distractors, and the completion blank is as close to the end of the stem as possible.
X The stem does not include verbal or other clues that hint at correct or incorrect
distractors.
X The syntax and grammar are straightforward and appropriate for the grade
level. Negative construction is avoided.
X The stem does not contain more than one question or part.
X Concepts and information presented in the items are accurate, up-to-date, and
verifiable. This includes but is not limited to dates, references, locations, and
events.

Distractors meet the following criteria with limited exceptions:


X All distractors are plausible and reasonable.

X Distractors do not contain clues that hint at correct or incorrect distractors. Incorrect answers are created based on common student mistakes.
X Distractors that are not common mistakes may vary between being close to the
correct answer or close to a distractor that is the result of a common mistake.
X Distractors are independent of each other, are approximately the same length,
have grammatically parallel structure, and are grammatically consistent with
the stem.
X None of these, none of the above, not given, all of the above, and all of these
are not used as distractors.

Metadata Requirements and Goals


Due to the restrictions for modifying text, the content may not meet the following
goals; however, new item development works to bring the content into alignment
with these goals:
X Gender: After removing gender-neutral items, an equal number of male and
female items should be represented. In addition to names (Sara) and nouns
(sisters), gender is also represented by pronoun (she). Gender is not indicated
by subject matter or appeal. For instance, an item on cooking is not female
unless there is a female character in it.
X Ethnicity: The goal is to provide students with an assessment that reflects the
ethnic diversity of our school children within the US: 48% White, 15% Black or
African American, 27% Hispanic, 5% Middle Eastern, and 5% Asian or Indian.
Ethnicity can be based on name or subject matter.
X Subject: A variety of subject areas should be present across the items,
such as Arts/Humanities, Science, History, Physical Education, Math, and
Technology.

Metadata is tagged with codes for Genres, Ethnicity, Occupations, Subjects, Topics, and Regions.

Star Reading and Renaissance Learning Progressions for Reading
Star Reading bridges assessment and instruction through research-based learning progressions to help teachers make effective instructional decisions and to adjust instruction to meet the needs of students at different achievement levels. Star Reading assesses more than 600 grade-specific blueprint skills with items developed and aligned to each skill. The skills measured by Star Reading
are drawn from an overarching pool of skills known as the universal skills pool.
The universal skills pool contains the full range of skills reflected in state content
standards from all 50 US states and the District of Columbia from early literacy to
high-school level analysis and critique. The universal skills pool continues to grow
and evolve as state standards change and are updated. Learning progressions
are created by mapping the skills in the universal skills pool to different content
standards. Learning progressions define coherent and continuous pathways in
which students acquire knowledge and skills and present the knowledge and skills
in teachable orders that can be used to inform instructional decisions.

The first learning progression created for Star Reading was the Renaissance
Core Progress for Reading Learning Progression, which identifies a continuum of
reading skills that span from early literacy through high-school level analysis and
critique. It was developed in consultation with leading experts in early literacy and
reading by reviewing research and curricular documents and standards, including
the National Assessment of Educational Progress (NAEP) Reading framework,
Texas Essential Knowledge and Skills, and state reading standards. The
Renaissance Core Progress for Reading Learning Progression is supported by
calibration data and psychometric analyses and is regularly refined and updated.
Item calibration data from Star Reading continually show a strong correlation between the rank ordering of skills in the Renaissance Core Progress for Reading Learning Progression and the difficulty estimates of the Star Reading items written to measure those skills.

Figure 1 illustrates the relationship between the sequential order of skills in the
Renaissance Core Progress for Reading Learning Progression and the average
difficulty of the Star Early Literacy and Star Reading items measuring that skill on
the Star Reading Unified scale. Each skill is represented by a single data point
with skills in each learning progression domain represented by different color
points. The figure shows that skills that are ordered later in the Renaissance Core Progress for Reading Learning Progression are often more difficult than skills that are represented earlier in the progression.

Figure 1: Renaissance Core Progress for Reading Learning Progression

The relationships shown in Figure 1 continue to evolve as the validation process is ongoing and new items continue to be written. The continual updating of the
Renaissance Core Progress for Reading Learning Progression is important
to ensure that the ordering of the skills in the Renaissance Core Progress for
Reading Learning Progression is an accurate representation of the order in which
students learn early literacy and reading skills and concepts. To that end, item
calibration data collected from Star Reading is continuously used to validate and
refine learning progressions.
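
As a rough, hypothetical illustration of this kind of validation (and not a description of Renaissance's actual analysis), the agreement between a progression's skill order and the average calibrated difficulty of the items measuring each skill can be summarized with a Spearman rank correlation. All of the numbers in the sketch below are invented.

# Hypothetical sketch: rank-order agreement between a skill sequence and the
# mean Rasch difficulty of the items written to each skill. Invented numbers.
from scipy.stats import spearmanr

skill_order = [1, 2, 3, 4, 5, 6, 7, 8]                                  # position in the progression
mean_item_difficulty = [-3.2, -2.9, -2.1, -1.8, -0.6, 0.1, 1.4, 2.2]    # invented Rasch values

rho, p_value = spearmanr(skill_order, mean_item_difficulty)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")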

Renaissance now develops individualized learning progressions for all 50 states and the District of Columbia. These state-specific learning progressions are also
updated as state standards change. The state-specific learning progressions
cover specific skills represented in each state’s grade-level content standards. To
create these state-specific learning progressions, each state’s content standards
are analyzed, tagged, and mapped to skills in the universal skills pool. When
standards address areas of learning not yet addressed in the universal skills pool,
new skills are developed and added to the universal skills pool and potentially
added as new Star Reading skills. Since Star Reading CAT items are written
to specific skills which are in turn mapped to skills in the universal skills pool,
this allows data from Star Reading CAT items to inform state specific learning
progression and allows Star Reading to report results on state specific content
standards and learning progressions. This mapping of Star Reading CAT items
to skills in the universal skills pool which are in turn mapped to each state’s
grade-level content standards is one way in which Renaissance works to ensure
alignment between Star Reading and state content standards.

When a student completes a Star Reading assessment, the program uses that
student’s performance to place the student at the appropriate point in the learning
progression designated for that school. This learning progression is usually the
state-specific learning progression for the state in which the school is located.
Locating students in the learning progression helps teachers to identify the skills
that students are likely to have already learned and the skills they are ready
to learn next. It also indicates whether students are meeting the grade-level
performance expectations established by state content standards.

Item and Scale Calibration

Background
Star Reading was initially published in 1996, and quickly became one of the first
applications of computerized adaptive testing (CAT) to educational assessment
at the primary and secondary school levels. Unlike other early CAT applications,
the initial version of Star Reading was not based on item response theory (IRT).
Instead, it was an instance of stratified adaptive testing (Weiss, 1973¹). The items
in its item bank were sorted into grade levels (strata) based on their vocabulary
levels. Examinees started the test at the stratum corresponding to their school
grade; an algorithm branched them to easier or more difficult levels, contingent on
their performance.

IRT was introduced in Version 2 of Star Reading. At that time, hundreds of new
test items were developed, and both the new and the original items from Version
1 were calibrated as to difficulty on a vertical scale using the Rasch model. Star
Reading uses the calibrated Rasch difficulty of the test items as the basis for
adaptive item selection. And it uses the Rasch difficulty of the items administered
to a student, along with the pattern of right and wrong answers, to calculate a
maximum likelihood estimate of the location of the student on the Rasch scale.
To provide continuity with the non-IRT score scale of Version 1, equipercentile
equating was used to transform the Rasch scores to the original Star Reading
score scale.
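
As a simplified sketch of the scoring idea described above (not Renaissance's production algorithm), the maximum likelihood estimate of a student's location on the Rasch scale can be computed from the calibrated difficulties of the administered items and the pattern of right and wrong answers. The item difficulties and responses below are invented.

import math

def rasch_mle(difficulties, responses, iterations=25):
    """Maximum likelihood estimate of Rasch ability from item difficulties (b)
    and scored responses (1 = correct, 0 = incorrect). Simplified sketch that
    assumes a mixed response pattern (not all right or all wrong)."""
    theta = 0.0  # starting value on the Rasch scale
    for _ in range(iterations):
        # P(correct) for each item under the Rasch model
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        score = sum(x - p for x, p in zip(responses, probs))   # first derivative of the log-likelihood
        info = sum(p * (1.0 - p) for p in probs)               # test information at theta
        theta += score / info                                  # Newton-Raphson step
    return theta

# Invented example: four items of increasing difficulty, first three answered correctly.
print(rasch_mle([-1.5, -0.5, 0.5, 1.5], [1, 1, 1, 0]))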

Version 2’s Rasch model-based scale of item difficulty and student ability has
continued in use in all subsequent versions of Star Reading. This chapter begins
by presenting technical details of the development of that Rasch scale. Later, it
will describe improvements that have been made to the method of calibrating the
Rasch difficulty of new items. Finally, it will present details of the development of
a new scale for reporting Star Reading test scores—the Unified Score Scale, first
introduced in the 2017–2018 school year.

Calibration of Star Reading Items for Use in Version 2


This section summarizes the psychometric research and development undertaken
to prepare the large pool of calibrated reading test questions first used in Star
Reading 2, as well as the linkage of Star Reading 2 scores to the original Star Reading 1 score scale. This research took place in two stages: item calibration and score scale calibration. These are described in their respective sections below.

1. Weiss, D.J. The stratified adaptive computerized ability test (Research Report 73-3). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, 1973. [Link]

In Star Reading 2 development, a large-scale item calibration program was conducted in the spring of 1998. The Star Reading 2 item calibration study
incorporated all of the newly written vocabulary-in-context and authentic text
passage items, as well as over 800 vocabulary items in the Star Reading 1 item
bank. Two distinct phases comprised the item calibration study. The first phase
was the collection of item response data from a multi-level national student
sample. The second phase involved the fitting of item response models to the
data, and developing a single IRT difficulty scale spanning all levels from grades
1–12.

Sample Description
The data collection phase of the Star Reading 2 calibration study began with a
total item pool of over 2100 items. A nationally representative sample of students
tested these items. A total of 27,807 students from 247 schools participated in the
item calibration study. Table 9 provides the numbers of students in each grade who
participated in the study.

Table 9: Numbers of Students Tested by Grade, Star Reading 2 Item Calibration Study—Spring 1998

Grade Level   Number of Students Tested   Grade Level   Number of Students Tested   Grade Level   Number of Students Tested
1 4,037 5 2,167 9 2,030
2 3,848 6 1,868 10 1,896
3 3,422 7 1,126 11 1,326
4 3,322 8 713 12 1,715
Not Given 337

Table 10 presents descriptive statistics concerning the makeup of the calibration sample. This sample included 13,937 males and 13,626 females (244 student records did not include gender information). As Table 10 illustrates, the tryout sample approximated the national school population fairly well.

Table 10: Sample Characteristics, Star Reading 2 Calibration Study—Spring 1998 (N = 27,807 Students)

                                                   National %   Sample %
Geographic Region        Northeast                    20%         16%
                         Midwest                      24%         34%
                         Southeast                    24%         25%
                         West                         32%         25%
District Socioeconomic   Low: 31–100%                 30%         28%
Status                   Average: 15–30%              29%         26%
                         High: 0–14%                  31%         32%
                         Non-Public                   10%         14%
School Type & District   Public: < 200                17%         15%
Enrollment               Public: 200–499              19%         21%
                         Public: 500–2,000            27%         25%
                         Public: > 2,000              28%         24%
                         Non-Public                   10%         14%

Table 11 provides information about the ethnic composition of the calibration sample. As Table 11 shows, the students participating in the calibration sample closely approximate the national school population.

Table 11: Ethnic Group Participation, Star Reading 2 Calibration Study—Spring 1998 (N = 27,807 Students)

Ethnic Group        National %   Sample %
Asian                   3%          3%
Black                  15%         13%
Hispanic               12%          9%
Native American         1%          1%
White                  59%         63%
Unclassified            9%         10%

Item Presentation
For the calibration research study, seven levels of test booklets were constructed
corresponding to varying grade levels. Because reading ability and vocabulary
growth are much more rapid in the lower grades, only one grade was assigned
per test level for the first four levels of the test (through grade 4). As grade level
increases, there is more variation among both students and school curricula,
so a single test can cover more than one grade level. Grades were assigned to
test levels after extensive consultation with reading instruction experts as well as
considering performance data for items as they functioned in the Star Reading
1 test. Items were assigned to grade levels such that the resulting test forms
sampled an appropriate range of reading ability typically represented at or near the
targeted grade levels.

Grade levels corresponding to each of the seven test levels are shown in the first two
columns of Table 12. Students answered a set number of questions at their current
grade level, as well as a number of questions one grade level above and one grade
level below their grade level. Anchor items were included to support vertically scaling
the test across the seven test levels. Table 12 breaks down the composition of test
forms at each test level in terms of types and number of test questions, as well as the
number of calibration test forms at each level.

Table 12: Calibration Test Forms Design by Test Level, Star Reading 2 Calibration Study—Spring 1998

Test Level   Grade Levels   Items per Form   Anchor Items per Form   Unique Items per Form   Number of Test Forms
A 1 44 21 23 14
B 2 44 21 23 11
C 3 44 21 23 11
D 4 44 21 23 11
E 5–6 44 21 23 14
F 7–9 44 21 23 14
G 10–12 44 21 23 15

Each of the calibration test forms within a test level consisted of a set of 21
anchor items which were common across all test forms within a test level. Anchor
items consisted of items: a) on grade level, b) one grade level above, and c) one
grade level below the targeted grade level. The use of anchor items facilitated
equating of both test forms and test levels for purposes of data analysis and the
development of the overall score scale.

In addition to the anchor items, each form contained a set of 23 items that were unique to that specific test form (within a level). Items were selected for a specific test
level based on Star Reading 1 grade level assignment, EDL vocabulary grade
designation, or expert judgment. To avoid problems with positioning effects
resulting from the placement of items within each test booklet form, items were
shuffled within each test form. This created two variations of each test form such
that items appeared in different sequential positions within each “shuffled” test
form. Since the final items would be administered as part of a computer-adaptive
test, it was important to remove any effects of item positioning from the calibration
data so that each item could be administered at any point during the test.

The number of field test forms constructed for each of the seven test levels is shown in the last column of Table 12 (varying from 11–15 forms per level).
Calibration test forms were spiraled within a classroom such that each student
received a test form essentially at random. This design ensured that no more
than two or three students in any classroom attempted any particular tryout item.
Additionally, it ensured a balance of student ability across the various tryout forms.
Typically, 250–300 students at the designated grade level of the test item received
a given question on their test.

It is important to note that some performance data already existed for the majority
of the questions in the Star Reading 2 calibration study. All of the questions from
the Star Reading 1 item bank were included, as were many items that were
previously field tested, but were not included in the Star Reading 1 test.

Following extensive quality control checks, the Star Reading 2 calibration research
item response data were analyzed, by level, using both traditional item analysis
techniques and IRT methods. For each test item, the following information was derived
using traditional psychometric item analysis techniques:
X The number of students who attempted to answer the item
X The number of students who did not attempt to answer the item
X The percentage of students who answered the item correctly (a traditional
measure of difficulty)
X The percentage of students who selected each answer choice
X The correlation between answering the item correctly and the total score
(a traditional measure of item discrimination)
X The correlation between the endorsement of an alternative answer and the
total score

Item Difficulty
The difficulty of an item, in traditional item analysis, is the percentage of students
who answer the item correctly. This is typically referred to as the “p-value” of the
item. Low p-values (such as 15 percent) indicate that the item is difficult since only
a small percentage of students answered it correctly. High p-values (such as 90
percent) indicate that almost all students answered the item correctly, and thus the
item is easy. It should be noted that the p-value only has meaning for a particular
item relative to the characteristics of the sample of students who responded to it.

Item Discrimination
The traditional measure of the discrimination of an item is the correlation between
the “score” on the item (correct or incorrect) and the total test score. Items that
correlate well with total test score also tend to correlate well with one another and
produce a test that has more reliable scores (more internally consistent). For the
correct answer, the higher the correlation between item score and total score, the
better the item is at discriminating between low scoring and high scoring students.
Such items generally will produce optimal test performance. When the correlation
between the correct answer and total test score is low (or negative), it typically
indicates that the item is not performing as intended. The correlation between
endorsing incorrect answers and total score should generally be low since there
should not be a positive relationship between selecting an incorrect answer and
scoring higher on the overall test.
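
A small sketch of the traditional discrimination index described here is shown below; it computes the point-biserial correlation between a 0/1 item score and the total test score. The response data are invented for the example.

import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and total test score.
    Equivalent to the Pearson correlation with a dichotomous variable."""
    n = len(item_scores)
    p = sum(item_scores) / n                      # proportion answering correctly
    q = 1.0 - p
    mean_correct = statistics.mean(t for x, t in zip(item_scores, total_scores) if x == 1)
    mean_incorrect = statistics.mean(t for x, t in zip(item_scores, total_scores) if x == 0)
    sd_total = statistics.pstdev(total_scores)    # population standard deviation of total scores
    return (mean_correct - mean_incorrect) / sd_total * (p * q) ** 0.5

# Invented data: 0/1 scores on one item and total test scores for ten students.
item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
total = [28, 25, 14, 30, 18, 27, 22, 12, 26, 16]
print(round(point_biserial(item, total), 2))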

Item Response Function


In addition to traditional item analyses, the Star Reading calibration data were
analyzed using Item Response Theory (IRT) methods. Although IRT encompasses
a family of mathematical models, the Rasch model was selected for the Star
Reading 2 data both for its simplicity and its ability to accurately model the
performance of the Star Reading 2 items.

IRT attempts to model quantitatively what happens when a student with a specific
level of ability attempts to answer a specific question. IRT calibration places the
item difficulty and student ability on the same scale; the relationship between them
can be represented graphically in the form of an item response function (IRF),
which describes the probability of answering an item correctly as a function of the
student’s ability and the difficulty of the item.
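
Under the Rasch model used for Star Reading, this item response function depends only on the difference between the student's ability and the item's difficulty. The short sketch below evaluates that function; it is illustrative, with difficulty values chosen to echo the approximate examples discussed for Figure 2.

import math

def rasch_probability(theta, b):
    """Rasch item response function: probability of a correct response for a
    student of ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# At theta == b the expected percent correct is exactly 50.
for difficulty in (-1.67, 0.20, 1.25):
    print(difficulty, round(rasch_probability(difficulty, difficulty), 2))   # 0.5 at the item's difficulty
    print(difficulty, round(rasch_probability(0.0, difficulty), 2))          # probability for ability 0.0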

Figure 2 is a plot of three item response functions: one for an easy item, one for
a more difficult one, and one for a very difficult item. Each plot is a continuous
S-shaped (ogive) curve. The horizontal axis is the scale of student ability, ranging
from very low ability (–5.0 on the scale) to very high ability (+5.0 on the scale). The
vertical axis is the percent of students expected to answer each of the three items
correctly at any given point on the ability scale. Notice that the expected percent
correct increases as student ability increases, but varies from one item to another.

In Figure 2, each item’s difficulty is the scale point where the expected percent
correct is exactly 50. These points are depicted by vertical lines going from the 50
percent point to the corresponding locations on the ability scale. The easiest item
has a difficulty scale value of about –1.67; this means that students located at –1.67
on the ability scale have a 50-50 chance of answering that item right. The scale
values of the other two items are approximately +0.20 and +1.25, respectively.

Calibration of test items estimates the IRT difficulty parameter for each test
item and places all of the item parameters onto a common scale. The difficulty
parameter for each item is estimated, along with measures to indicate how well the
item conforms to (or “fits”) the theoretical expectations of the presumed IRT model.

Also plotted in Figure 2 are “empirical item response functions (EIRF)”: the actual
percentages of correct responses of groups of students to all three items. Each
group is represented as a small triangle, circle, or diamond. Each of those geometric
symbols is a plot of the percent correct against the average ability level of the group.
Ten groups’ data are plotted for each item; the triangular points represent the groups
responding to the easiest item. The circles and diamonds, respectively, represent
the groups responding to the moderate and to the most difficult item.
Figure 2: Example of Item Statistics Database Presentation of Information

For purposes of the Star Reading 2 calibration research, two different “fit”
measures (both unweighted and weighted) were computed. Additionally, if the IRT
model is functioning well, then the EIRF points should approximate the (estimated)
theoretical IRF. Thus, in addition to the traditional item analysis information, the
following IRT-related information was determined for each item administered
during the calibration research analyses:
X The IRT item difficulty parameter
X The unweighted measure of fit to the IRT model
X The weighted measure of fit to the IRT model
X The theoretical and empirical IRF plots
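
As an illustrative sketch of how an empirical item response function can be compared with the theoretical one (not the actual fit statistics used in the calibration research), the code below groups simulated examinees by ability, computes each group's observed proportion correct on one item, and lists it beside the Rasch-predicted value; large, systematic gaps would suggest poor fit.

import math
import random

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def empirical_irf(abilities, responses, b, n_groups=10):
    """Compare observed proportion correct with the Rasch-predicted proportion
    within ability-ordered groups (an empirical item response function)."""
    pairs = sorted(zip(abilities, responses))
    size = len(pairs) // n_groups
    rows = []
    for g in range(n_groups):
        group = pairs[g * size:(g + 1) * size]
        mean_theta = sum(t for t, _ in group) / len(group)
        observed = sum(x for _, x in group) / len(group)
        rows.append((round(mean_theta, 2), round(observed, 2), round(rasch_p(mean_theta, b), 2)))
    return rows

# Simulated responses to one item with difficulty 0.5.
random.seed(7)
thetas = [random.gauss(0.0, 1.5) for _ in range(2000)]
xs = [1 if random.random() < rasch_p(t, 0.5) else 0 for t in thetas]
for mean_theta, observed, predicted in empirical_irf(thetas, xs, b=0.5):
    print(mean_theta, observed, predicted)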

Rules for Item Retention


Following these analyses, each test item, along with both traditional and IRT
analysis information (including IRF and EIRF plots) and information about the
test level, form, and item identifier, were stored in an item statistics database.
A panel of content reviewers then examined each item, within content strands,
to determine whether the item met all criteria for inclusion into the bank of items
that would be used in the norming version of the Star Reading 2 test. The item
statistics database allowed experts easy access to all available information about
an item in order to interactively designate items that, in their opinion, did not meet
acceptable standards for inclusion in the Star Reading 2 item bank.

Items were eliminated when they met one or more of the following criteria:
X Item-total correlation (item discrimination) was < 0.30
X An incorrect answer option had a high item-total correlation
X Sample size of students attempting the item was less than 300
X The traditional item difficulty indicated that the item was too difficult or too easy
X The item did not appear to fit the Rasch model

For Star Reading version 2, after each content reviewer had designated certain
items for elimination, their recommendations were combined and a second review
was conducted to resolve issues where there was not uniform agreement among
all reviewers.

Of the initial 2100+ items administered in the Star Reading 2 calibration research
study, 1,409 were deemed of sufficient quality to be retained for further analyses.
Traditional item-level analyses were conducted again on the reduced data set that
excluded the eliminated items. IRT calibration was also performed on the reduced
data set and all test forms and levels were equated based on the information
provided by the embedded anchor items within each test form. This resulted in
placing the IRT item difficulty parameters for all items onto a single scale spanning
grades 1–12.

Table 13 summarizes the final analysis information for the test items included
in the calibration test forms by test level (A–G). As shown in the table, the item
placements in test forms were appropriate: the average percentage of students
correctly answering items is relatively constant across test levels. Note, however,
that the average scaled difficulty of the items increases across successive levels
of the calibration tests, as does the average scaled ability of the students who
answered questions at each test level. The median point-biserial correlation, as
shown in the table, indicates that the test items were performing well.

Table 13: Calibration Test Item Summary Information by Test Level, Star Reading 2 Calibration Study—Spring 1998

Test Level   Grade Level(s)   Number of Items   Sample Size   Average Percent Correct   Median Percent Correct   Median Point-Biserial   Average Scaled Difficulty   Average Scaled Ability
A 1 343 4,226 67 75 0.56 –3.61 –2.36
B 2 274 3,911 78 88 0.55 –2.35 –0.07
C 3 274 3,468 76 89 0.51 –1.60 0.76
D 4 274 3,340 69 81 0.51 –0.14 1.53
E 5–6 343 4,046 62 73 0.47 1.02 2.14
F 7–9 343 3,875 68 76 0.48 2.65 4.00
G 10–12 366 4,941 60 60 0.37 4.19 4.72

Scale Calibration and Linking


The outcome of the item calibration study described above was a sizable bank of
test items suitable for use in the Star Reading 2 test, with an IRT difficulty scale
parameter for each item. The item difficulty scale itself was devised such that it
spanned a range of item difficulty from grades 1–12. An important feature of Item
Response Theory is that the same scale used to characterize the difficulty of
the test items is also used to characterize examinees’ ability; in fact, IRT models
express the probability of a correct response as a function of the difference
between the scale values of an item’s difficulty and an examinee’s ability. The IRT
ability/difficulty scale is continuous; values of observed Rasch ability ranged from
about –20 to +20, with the zero value occurring at about the sixth-grade level.

This continuous Rasch score scale is very different from the Scaled Score metric
used in Star Reading version 1. Star Reading version 1 scaled scores ranged
from 50–1,350, in integer units. The relationship of those scaled scores to the IRT
ability scale introduced in Star Reading version 2 was expected to be direct, but
not necessarily linear. For continuity between Star Reading 1 and Star Reading 2
scoring, it was desirable to be able to report Star Reading 2 scores on the same
scale used in Star Reading 1. To make that possible, a scale linking study was
undertaken in conjunction with Star Reading 2 norming. At every grade from
1–12, a portion of the norming sample was asked to take both versions of the Star
Reading test: versions 1 and 2. The test score data collected in the course of the
linking study were used to link the two scales, providing a conversion table for
transforming Star Reading 2 ability scores into equivalent Star Reading 1 Scaled
Scores.

From around the country and spanning all 12 grades, 4,589 students participated
in the linking study. Linking study participants took both Star Reading 1 and Star
Reading 2 tests within a few days of each other. The order in which they took the
two test versions was counterbalanced to account for the effects of practice and
fatigue. Test score data collected were edited for quality assurance purposes,
and 38 cases with anomalous data were eliminated from the linking analyses;
the linking was accomplished using data from 4,551 cases. The linking of the two
score scales was accomplished by means of an equipercentile equating involving
all 4,551 cases, weighted to account for differences in sample sizes across
grades. The resulting table of 99 sets of equipercentile equivalent scores was then
smoothed using a monotonic spline function, and that function was used to derive
a table of Scaled Score equivalents corresponding to the entire range of IRT
ability scores observed in the norming study. These Star Reading 2 Scaled Score
equivalents range from 0–1400; the same scale has been used for all subsequent
Star Reading versions, from version 3 to the present.

Summary statistics of the test scores of the 4,551 cases included in the linking
analysis are listed in Table 14. The table lists actual Star Reading 1 Scaled
Score means and standard deviations, as well as the same statistics for Star
Reading 2 IRT ability estimates and equivalent Scaled Scores calculated using
the conversion table from the linking study. Comparing the Star Reading 1 Scaled
Score means to the IRT ability score means illustrates how different the two
metrics are.

Comparing the Star Reading 1 Scaled Score means to the Star Reading 2
Equivalent Scale Scores in the rightmost two columns of Table 14 illustrates how
successful the scale linking was.

Table 14: Summary Statistics of Star Reading 1 and 2 Scores from the Linking Study, by Grade—Spring 1999 (N = 4,551 Students)

Grade Level   Sample Size   Star Reading 1 Scaled Scores (Mean, S.D.)   Star Reading 2 IRT Ability Scores (Mean, S.D.)   Star Reading 2 Equivalent Scale Scores (Mean, S.D.)
1 284 216 95 –1.98 1.48 208 109
2 772 339 115 –0.43 1.60 344 148
3 476 419 128 0.33 1.53 419 153
4 554 490 152 0.91 1.51 490 187
5 520 652 176 2.12 1.31 661 213
6 219 785 222 2.98 1.29 823 248
7 702 946 228 3.57 1.18 943 247
8 545 958 285 3.64 1.40 963 276
9 179 967 301 3.51 1.59 942 292
10 81 1,079 292 4.03 1.81 1,047 323
11 156 1,031 310 3.98 1.53 1,024 287
12 63 1,157 299 4.81 1.42 1,169 229
1–12 4,551 656 345 1.73 2.36 658 353

Data from the linking study made it clear that Star Reading 2 software measures
ability levels extending beyond the minimum and maximum Star Reading 1 Scaled
Scores. In order to retain the superior bandwidth of Star Reading 2 software,
extrapolation procedures were used to extend the Scaled Score range below 50
and above 1,350; the range of reported scale scores for Star Reading versions 2
and later is 0 to 1400 for the Enterprise Scale. The Unified Scale reports scores
that range from 600 to 1400.

Online Data Collection for New Item Calibration


As described above, beginning with Star Reading Version 2, item calibration
involved administering new items and scale anchoring items to national student
samples in printed test booklets. Beginning with Star Reading version 4.3, data
needed for item calibration have been collected on-line, by embedding small
numbers of uncalibrated items within Star Reading tests. After sufficient numbers
of item responses have accumulated, the Rasch difficulty of each new item
is estimated by fitting a logistic model to the item response data and the Star
Reading Rasch scores of the students’ tests. Renaissance Learning calls this
overall process “dynamic calibration.”

Typically, dynamic calibration is done in batches of several hundred new test items. Each student's test may include between 1 and 5 uncalibrated items. Each
item is tagged with a grade level, and is typically administered only to students at
that grade level and the next higher grade. The selection of the uncalibrated items
to be administered to each student is at random, resulting in nearly equivalent
distributions of student ability for each item at a given grade level.

Both traditional and IRT item analyses are conducted on the item response data collected. The traditional analyses yield proportion-correct statistics, as well as biserial and point-biserial correlations between scores on the new items and actual scores on the Star Reading tests. The IRT analyses differ from those used in the calibration of Star Reading 2 items, in that the relationships between scores on each new item and the actual Star Reading scores are used to calibrate the Rasch difficulty parameters.

For dynamic calibration, a minimum of 1,000 responses per item is the data
collection target. In practice, because of the very large number of Star Reading
tests administered each year, the average number of students responding to
each new test item is typically several times the target. The calibration analysis
proceeds one item at a time, using SAS/STAT™ software to estimate the threshold
(difficulty) parameter of every new item by calculating the non-linear regression
of each new item score (0 or 1) on the Star Reading Rasch ability estimates.
The accuracy of the non-linear regression approach has been corroborated by
conducting parallel analyses using Winsteps software. In tests, the two methods
yielded virtually identical results.
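
A minimal sketch of this kind of calibration is shown below: for one new item, a Rasch difficulty (threshold) is estimated by maximum likelihood from examinees' 0/1 responses and their known Rasch ability estimates, the one-parameter analogue of the non-linear regression described above. The data are simulated, and the code is illustrative rather than the production SAS/STAT or Winsteps procedure.

import math
import random

def calibrate_item_difficulty(abilities, responses, iterations=50):
    """Estimate the Rasch difficulty (threshold) of one new item from examinees'
    known Rasch ability estimates and their 0/1 responses, with the slope fixed at 1."""
    b = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for theta in abilities]
        score = sum(p - x for p, x in zip(probs, responses))   # first derivative of the log-likelihood in b
        info = sum(p * (1.0 - p) for p in probs)               # information about b
        b += score / info                                      # Newton-Raphson step
    return b

# Simulate roughly 1,000 examinees responding to one item with true difficulty 0.8.
random.seed(1)
true_b = 0.8
abilities = [random.gauss(1.0, 1.5) for _ in range(1000)]
responses = [1 if random.random() < 1.0 / (1.0 + math.exp(-(t - true_b))) else 0
             for t in abilities]
print(round(calibrate_item_difficulty(abilities, responses), 2))  # should land close to 0.8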

Table 15 summarizes the final analysis information for the 854 new test items
introduced in Star Reading Version 4.3, in 2007, by the target grades tagged to
each item. Since that time, several thousand more Star Reading items have gone
through dynamic calibration; currently the Star Reading operational item bank
contains more than 6,000 items.

Table 15: Calibration Test Item Summary Information by Test Item Grade Level, Star Reading 4.3 Calibration Study—Fall 2007

Item Grade Level   Number of Items   Sample Size (a)   Average Percent Correct   Median Percent Correct   Median Point-Biserial   Average Scaled Difficulty   Average Scaled Ability
K 51 230,580 78 78 47 –3.77 –1.65
1 68 238,578 82 82 45 –3.68 –1.23
2 99 460,175 76 76 51 –2.91 –1.06
3 130 693,184 74 78 47 –1.91 –0.23
4 69 543,554 74 78 41 –1.05 0.64
5 44 514,146 70 72 40 –0.14 1.24
6 32 321,855 71 72 38 0.15 1.62
7 42 402,530 60 58 37 1.40 2.07
8 46 317,110 55 53 33 2.10 2.36
9 36 174,906 54 50 33 2.39 2.59
10 56 99,387 51 54 31 2.95 2.91
11 68 62,596 47 43 22 3.50 3.12
12 51 43,343 44 41 18 3.60 3.11
> 12 62 52,359 34 31 11 4.30 3.10
a. “Sample size” in this table is the total number of item responses. Each student was presented with 3, 4, or 5 new items, so
the sample size substantially exceeds the number of students.

Computer-Adaptive Test Design


In computer-adaptive tests like the Star Reading test, the items taken by a student
are dynamically selected in light of that student’s performance during the testing
session. Thus, a low-performing student may be branched to easier items in order to better estimate his or her reading achievement level. High-performing students may be branched to more challenging reading items in order to
better determine the breadth of their reading skills and their reading achievement
level.

During a Star Reading test, a student may be “routed” to items at the lowest
reading level or to items at higher reading levels within the overall pool of items,
depending on the student’s unfolding performance during the testing session. In
general, when an item is answered correctly, the student is then given a more
difficult item. When an item is answered incorrectly, the student is then given
an easier item. Item difficulty here is defined by results of the Star Reading item
calibration studies.


Students who have not taken a Star Reading test within six months initially receive
an item whose difficulty level is relatively easy for students at the examinee’s
grade level. Selecting an item that is a bit easier than average minimizes any effects of initial anxiety that students may have when starting the test and helps the student settle into the assessment. These starting
points vary by grade level and were based on research conducted as part of the
national item calibration study.

When a student has taken a Star Reading test within the last 120 days, the difficulty of
the first item depends on that student’s previous Star Reading test score information.
After the administration of the initial item, and after the student has entered an answer,
Star Reading software estimates the student’s reading ability. The software then
selects the next item randomly from among all of the items available that closely
match the student’s estimated reading ability.

Randomization of items with difficulty values near the student’s adjusted reading
ability allows the program to avoid overexposure of test items. Items that have
been administered to the same student within the past 120 days are not available
for administration. The large numbers of items available in the item pools,
however, ensure that this constraint has negligible impact on the quality of each
Star Reading computer-adaptive test.
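
The selection logic just described can be sketched as follows; the data structure, the size of the candidate window, and the function name are hypothetical, and the operational selection rules are proprietary.

# Illustrative sketch: pick the next item at random from the calibrated items
# whose difficulties most closely match the current ability estimate, skipping
# items administered to this student within the past 120 days.
import random

def select_next_item(item_bank, theta_hat, recently_seen_ids, window=10):
    """item_bank: list of dicts with an 'id' and a Rasch 'difficulty'."""
    eligible = [item for item in item_bank if item["id"] not in recently_seen_ids]
    # Rank the remaining items by how closely their difficulty matches theta_hat.
    eligible.sort(key=lambda item: abs(item["difficulty"] - theta_hat))
    # Randomize among the closest matches to limit overexposure of any one item.
    return random.choice(eligible[:window])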

Scoring in the Star Reading Tests


Following the administration of each Star Reading item, and after the student
has selected an answer, an updated estimate of the student’s reading ability
is computed based on the student’s responses to all items that have been
administered up to that point. A proprietary Bayesian-modal Item Response Theory
(IRT) estimation method is used for scoring until the student has answered at
least one item correctly and one item incorrectly. Once the student has met the
1-correct/1-incorrect criterion, Star Reading software uses a proprietary Maximum-
Likelihood IRT estimation procedure for scoring.
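
The Bayesian-modal and maximum-likelihood procedures themselves are proprietary, but the general form of maximum-likelihood Rasch scoring, with a conditional SEM derived from the test information, can be sketched as a textbook Newton-Raphson update (an illustration only, not Renaissance's implementation):

# Generic sketch of maximum-likelihood (ML) ability estimation for the Rasch model.
import numpy as np

def rasch_ml_ability(difficulties, responses, iterations=25):
    """Requires at least one correct and one incorrect response, as noted above."""
    b = np.asarray(difficulties, dtype=float)
    x = np.asarray(responses, dtype=float)
    theta = 0.0
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))  # probability of a correct answer to each item
        info = np.sum(p * (1.0 - p))            # test information at the current estimate
        theta += np.sum(x - p) / info           # Newton-Raphson step
    csem = 1.0 / np.sqrt(info)                  # conditional SEM on the Rasch ability metric
    return theta, csem

# Hypothetical example: five items of increasing difficulty, answered 1, 1, 0, 1, 0.
theta_hat, csem = rasch_ml_ability([-1.0, -0.5, 0.0, 0.5, 1.0], [1, 1, 0, 1, 0])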

This approach to scoring enables Star Reading to provide Scaled Scores that
are statistically consistent and efficient. Accompanying each Scaled Score is an
associated measure of the degree of uncertainty, called the conditional standard
error of measurement (CSEM). The CSEM values for the Star Reading test are
unique for each student. CSEM values are dependent on the particular items the
student received and on the student’s performance on those items.

Scaled Scores are expressed on a common scale that spans all grade levels
covered by Star Reading (grades K–12). Because of this common scale, Scaled
Scores are directly comparable with each other, regardless of grade level. Other scores, such as Percentile Ranks and Grade Equivalents, are derived from the
Scaled Scores.

A New Scale for Reporting Star Reading Test Scores


In 2001, five years following the publication of Star Reading Version 1,
Renaissance Learning released Star Early Literacy, an assessment of pre-literacy
skills that must be developed in order to learn to read. Although the Early Literacy
test measures constructs that are different from those assessed in Star Reading,
the two assessments are related developmentally, and scores on the two are
moderately highly correlated. Over time, many users of Star Reading have also
adopted Star Early Literacy; a frequent practice is to transition children from
the Early Literacy assessment to Star Reading when they are ready to take the
reading assessment. However, the two assessments had very different score
scales, making it difficult to recognize the transition point, and impossible to
assess growth in cases where Star Early Literacy was used early in the school
year, and replaced by Star Reading later in the same year.

What was needed was a common scale that can be used to report scores on
both tests. Such a scale, the Unified Score Scale, has been developed, and was
introduced into use in the 2017–2018 school year as an optional alternative scale
for reporting achievement on both tests. The Unified Scale is the default scale for
reporting test results starting in the 2022–2023 school year.

The Unified Score Scale is derived from the Star Reading Rasch scale of ability
and difficulty, which was first introduced with the development of Star Reading
Version 2.

The Unified Score Scale was developed by performing the following
steps:
X The Rasch scale used by Star Early Literacy was linked (transformed) to the
Star Reading Rasch scale.
X A linear transformation of the transformed Rasch scale was developed that
spans the entire range of knowledge and skills measured by both Star Early
Literacy and Star Reading.

Details of these two steps are presented below.

1. The Rasch scale used by Star Early Literacy was linked to the Star Reading
Rasch scale.

In this step, a linear transformation of the Star Early Literacy Rasch scale to the Rasch scale used by Star Reading was developed, using a method for linear equating of IRT (item response theory) scales described by Kolen and
Brennan (2004, pages 162–165).

2. Because Rasch scores are expressed as decimal fractions, and may be either
negative or positive, a more user-friendly scale score was developed that uses
positive integer numbers only. A linear transformation of the extended Star
Reading Rasch scale was developed that spans the entire range of knowledge
and skills measured by both Star Early Literacy and Star Reading. The
transformation formula is as follows:

Unified Scale Score = INT (42.93 * Star Reading Rasch Score + 958.74)

where the Star Reading Rasch score has been extended downwards to values
as low as –20.00.
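
Expressed as code, the published transformation is simply the following; the example value can be checked against Table 16 below.

def unified_scale_score(rasch_score):
    # INT in the formula above truncates to an integer.
    return int(42.93 * rasch_score + 958.74)

print(unified_scale_score(0.3390))  # 973, matching the Table 16 entry for that Rasch score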

Following are some features and considerations in the development of that scale, called here the “unified scale.”

a. The Unified Scale’s range is from 0 to approximately 1400. Anchor points were chosen such that the 0 point is lower than the Star
Reading Rasch scale equivalent of the lowest obtainable SEL scale
score, and the lowest obtainable Star Early Literacy (SEL) and Star
Reading (SR) scale scores correspond to cardinal numbers on the
new scale.

i. The minimum SEL scale score of 300 was set equal to 200 on the
Unified Scale.

ii. An SR scale score of 0 was set equal to 600 on the Unified Scale.

b. The scale uses integer scale scores. New scale scores from 200 to
1400 correspond respectively to the lowest current SEL scale score
of 300, and a point slightly higher than the highest current SR scale
score of 1400.

c. The scale is extensible upwards and downwards. Currently, the highest point on the unified scale is just under 1400; but there is
no theoretical limit: If SR content were extended beyond the high
school reading level, the range of the new scale can be extended
upward without limit, as needed. The lowest point is now set at 200—
equivalent to the lowest current SEL scale score (300); but the scale
can readily be extended downward as low as 0, if a reason arises to
do so.

Further details of the transformation of SEL Rasch scores to the SR Rasch scale
may be found in the 2018 edition of the Star Early Literacy Technical Manual.


Table 16 contains a table of selected Star Reading Rasch ability scores and their
equivalents on the Star Reading and Unified Score scales.

Table 16: Some Star Reading Rasch Scores and Their Equivalents on the Star
Reading and Unified Score Scales

Minimum Rasch Score | Star Reading Scaled Score | Unified Scale Score
–8.3500 0 600
–6.2845 50 688
–3.1790 100 822
–2.5030 150 851
–1.9030 200 877
–1.2955 250 903
–0.7075 300 928
–0.1805 350 950
0.3390 400 973
0.7600 450 991
1.2450 500 1012
1.6205 550 1028
1.9990 600 1044
2.3240 650 1058
2.5985 700 1070
2.8160 750 1079
3.0090 800 1087
3.2120 850 1096
3.4570 900 1107
3.7435 950 1119
3.9560 1000 1128
4.0780 1050 1133
4.2120 1100 1139
4.3650 1150 1146
4.5790 1200 1155
4.8280 1250 1166
5.0940 1300 1177

Reliability and Measurement Precision

Measurement is subject to error. A measurement that is subject to a great deal of error is said to be imprecise; a measurement that is subject to relatively little
error is said to be reliable. In psychometrics, the term reliability refers to the
degree of measurement precision, expressed as a proportion. A test with perfect
score precision would have a reliability coefficient equal to 1, meaning that 100
percent of the variation among persons’ scores is attributable to variation in the
attribute the test measures, and none of the variation is attributable to error.
Perfect reliability is probably unattainable in educational measurement; a reliability coefficient of 0.90, for example, is more realistic. On such a test,
90 percent of the variation among students’ scores is attributable to the attribute
being measured, and 10 percent is attributable to errors of measurement. Another
way to think of score reliability is as a measure of the consistency of test scores.
Two kinds of consistency are of concern when evaluating a test’s measurement
precision: internal consistency and consistency between different measurements.
First, internal consistency refers to the degree of confidence one can have in the
precision of scores from a single measurement. If the test’s internal consistency
is 95 percent, just 5 percent of the variation of test scores is attributable to
measurement error.

Second, reliability as a measure of consistency between two different measurements indicates the extent to which a test yields consistent results from
one administration to another and from one test form to another. Tests must yield
somewhat consistent results in order to be useful; the reliability coefficient is
obtained by calculating the coefficient of correlation between students’ scores on
two different occasions, or on two alternate versions of the test given at the same
occasion. Because the amount of the attribute being measured may change over
time, and the content of tests may differ from one version to another, the internal
consistency reliability coefficient is generally higher than the correlation between
scores obtained on different administrations.

There are a variety of methods of estimating the reliability coefficient of a test. Methods such as Cronbach’s alpha and split-half reliability are single
administration methods and assess internal consistency. Coefficients of correlation
calculated between scores on alternate forms, or on similar tests administered two
or more times on different occasions, are used to assess alternate forms reliability,
or test-retest reliability (stability).

In a computerized adaptive test such as Star Reading, content varies from one
administration to another, and it also varies with each student’s performance.
Another feature of computerized adaptive tests based on Item Response Theory (IRT) is that the degree of measurement error can be expressed for each student’s
test individually.

The Star Reading tests provide two ways to evaluate the reliability of scores:
reliability coefficients, which indicate the overall precision of a set of test scores,
and conditional standard errors of measurement (CSEM), which provide an
index of the degree of error in an individual test score. A reliability coefficient is a
summary statistic that reflects the average amount of measurement precision in a
specific examinee group or in a population as a whole. In Star Reading, the CSEM
is an estimate of the measurement error in each individual test score. While a reliability
coefficient is a single value that applies to the test in general, the magnitude of the
CSEM may vary substantially from one person’s test score to another’s.

Another part of evaluating reliability is looking at the reliability of classification decisions. In many applications of Star Reading, three normative benchmarks,
set at the 10th, 25th, and 40th percentile ranks, are used to classify students
into the performance categories of intensive intervention, intervention, on watch,
and at/above benchmark. These classifications are often used in a response-
to-intervention (RTI) and multi-tiered system of supports (MTSS) framework by
schools. To show reliability of classifications based on benchmarks, decision
accuracy and decision consistency indices can be computed. Like reliability
coefficients based on test scores, decision accuracy and consistency indices
range from 0 to 1 with values close to 1 indicating more accurate and consistent
classifications.

This chapter presents three different types of reliability coefficients: generic reliability, split-half reliability, and alternate forms (test-retest) reliability. This is
followed by statistics on the conditional standard error of measurement of Star
Reading test scores. The chapter also presents indices of decision accuracy and
consistency.

The reliability and measurement error presentation is divided into two sections
below: First is a section describing the reliability coefficients, standard errors of
measurement, and decision accuracy and consistency indices for the 34-item
Star Reading tests. Second, another brief section presents reliability coefficients,
standard errors of measurement, and decision accuracy and consistency indices
for the 25-item Star Reading progress monitoring tests.


34-Item Star Reading Tests


Generic Reliability
Test reliability is generally defined as the proportion of test score variance that is
attributable to true variation in the trait the test measures. This can be expressed
analytically as

reliability = 1 – (σ²error / σ²total)

where σ²error is the variance of the errors of measurement and σ²total is the
variance of test scores. In Star Reading, the variance of the test scores is easily
calculated from Scaled Score data. The variance of the errors of measurement
may be estimated from the conditional standard error of measurement (CSEM)
statistics that accompany each of the IRT-based test scores, including the Scaled
Scores, as depicted below.

σ²error = (1/n) Σ CSEMi²

where the summation is over the squared values of the reported CSEM for
students i = 1 to n. In each Star Reading test, CSEM is calculated along with the
IRT ability estimate and Scaled Score. Squaring and summing the CSEM values
yields an estimate of total squared error; dividing by the number of observations
yields an estimate of mean squared error, which in this case is tantamount to error
variance. “Generic” reliability is then estimated by calculating the ratio of error
variance to Scaled Score variance, and subtracting that ratio from 1.
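
A minimal sketch of that computation, assuming parallel arrays of Scaled Scores and their reported CSEM values, might look like this:

import numpy as np

def generic_reliability(scaled_scores, csems):
    scaled_scores = np.asarray(scaled_scores, dtype=float)
    csems = np.asarray(csems, dtype=float)
    error_variance = np.mean(csems ** 2)    # mean squared error of measurement
    score_variance = np.var(scaled_scores)  # variance of the observed Scaled Scores
    return 1.0 - error_variance / score_variance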

Using this technique with the Star Reading 2018–2019 school year data resulted
in the generic reliability estimates shown in Table 17 and Table 18 on page
53. Because this method is not susceptible to error variance introduced by
repeated testing, multiple occasions, and alternate forms, the resulting estimates
of reliability are generally higher than the more conservative alternate forms
reliability coefficients. These generic reliability coefficients are, therefore, plausible
upper-bound estimates of the internal consistency reliability of the Star Reading
computer-adaptive test.

Generic reliability estimates for scores on the Unified score scale are shown
in Table 17; Table 18 lists the reliability estimates for the older Star Reading
“Enterprise” scale scores. Results in Table 17 indicate that the overall reliability of
the Unified scale scores was about 0.98. Coefficients ranged from a low of 0.94
in grade 5 to a high of 0.97 in grade K. Results based on the Enterprise Scale in
Table 18 are slightly lower: the overall reliability of those scale scores was about 0.97; within-grade coefficients ranged from a low of 0.93 in grades 3 to 7 to a high of 0.95 in grades K, 1, 11, and 12.

As both tables show, Star Reading reliability is quite high, grade by grade and
overall. Star Reading also demonstrates high test-retest consistency as shown
in the rightmost columns of the same tables. Star Reading’s technical quality
for an interim assessment is on a virtually equal footing with the highest-quality
summative assessments in use today.

Split-Half Reliability
While generic reliability does provide a plausible estimate of measurement
precision, it is a theoretical estimate, as opposed to traditional reliability
coefficients, which are more firmly based on item response data. Traditional
internal consistency reliability coefficients such as Cronbach’s alpha and Kuder-
Richardson Formula 20 (KR-20) are not meaningful for adaptive tests. However,
an estimate of internal consistency reliability can be calculated using the split-half
method.

A split-half reliability coefficient is calculated in three steps. First, the test is divided
into two halves, and scores are calculated for each half. Second, the correlation
between the two resulting sets of scores is calculated; this correlation is an
estimate of the reliability of a half-length test. Third, the resulting reliability value
is adjusted, using the Spearman-Brown formula, to estimate the reliability of the
full-length test.
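
A minimal sketch of the second and third steps, assuming the two half-test scores have already been computed, is shown below.

import numpy as np

def split_half_reliability(odd_half_scores, even_half_scores):
    r_half = np.corrcoef(odd_half_scores, even_half_scores)[0, 1]  # half-length reliability
    return 2.0 * r_half / (1.0 + r_half)  # Spearman-Brown step-up to full length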

In internal simulation studies, the split-half method provided accurate estimates of the internal consistency reliability of adaptive tests, and so it has been used to
provide estimates of Star Reading reliability. These split-half reliability coefficients
are independent of the generic reliability approach discussed earlier and more
firmly grounded in the item response data. Split-half scores were based on all of
the 34 items of the Star Reading tests; scores based on the odd- and the even-
numbered items were calculated separately. The correlations between the two sets
of scores were corrected to a length of 34 items, yielding the split-half reliability
estimates displayed in Table 17 and Table 18 on page 53.

Results indicated that the overall split-half reliability of the Unified scores was 0.98.
The coefficients ranged from a low of 0.94 in grades 4 to 8 to a high of 0.96 in
grade 1. On the Enterprise Scale, the overall split-half reliability of the Enterprise
scores was 0.97. The coefficients ranged from a low of 0.92 in grades 4 and 5 to a
high of 0.95 in grades K, 1, and 12. These reliability estimates are quite consistent
across grades 1-12, and quite high, again a result of the measurement efficiency
inherent in the adaptive nature of the Star Reading test.


Alternate Form Reliability


Another method of evaluating the reliability of a test is to administer the test twice
to the same examinees. Next, a reliability coefficient is obtained by calculating the
correlation between the two sets of test scores. This is called a test-retest reliability
coefficient if the same test was administered both times and an alternate forms
reliability coefficient if different, but parallel, tests were used.

Content sampling, temporal changes in individuals’ performance, and growth or decline over time can affect alternate forms reliability coefficients, usually making
them appreciably lower than internal consistency reliability coefficients.

The alternate form reliability study provided estimates of Star Reading reliability
using a variation of the test-retest method. In the traditional approach to test-retest
reliability, students take the same test twice, with a short time interval, usually a
few days, between administrations. In contrast, the Star Reading alternate form
reliability study administered two different tests by excluding from the second test any items the student had encountered in the first test. All other aspects
of the two tests were identical. The correlation coefficient between the scores on
the two tests was taken as the reliability estimate.

The alternate form reliability estimates for the Star Reading test were calculated
using both the Star Reading Unified scaled scores and the Enterprise scaled
scores. Checks were made for valid test data on both test administrations and to
remove cases of apparent motivational discrepancies.

Table 17 and Table 18 include overall and within-grade alternate reliability, along
with an indication of the average number of days between testing occasions. The
average number of days between testing occasions ranged from 91–130 days.

Results indicated that the overall reliability of the scores on the Unified scale was
about 0.93. The alternate form coefficients ranged from a low of 0.73 in grade K
to a high of 0.87 in grade 9. Results for the Enterprise scale were similar to those
of the Unified Scale with an overall reliability of 0.93; its alternate form coefficients
ranged from a low of 0.76 in grade K to a high of 0.88 in grades 8, 9, and 10.

Because errors of measurement due to content sampling and temporal changes in individuals’ performance can affect this correlation coefficient, this type of reliability
estimate provides a conservative estimate of the reliability of a single Star Reading
administration. In other words, the actual Star Reading reliability is likely higher
than the alternate form reliability estimates indicate.


Table 17: Reliability Estimates from the Star Reading 2018–2019 Data on the Unified Scale

Reliability Estimates—Unified Scale


Generic Split-Half Alternate Forms
Average Days
Grade N ρxx N ρxx N ρxx between Testing
K 50,000 0.97 20,000 0.95 7,000 0.73 91
1 1,000,000 0.96 20,000 0.96 200,000 0.76 100
2 1,000,000 0.96 20,000 0.95 200,000 0.83 114
3 1,000,000 0.95 20,000 0.95 200,000 0.85 113
4 1,000,000 0.95 20,000 0.94 200,000 0.86 115
5 1,000,000 0.94 20,000 0.94 200,000 0.86 115
6 1,000,000 0.95 20,000 0.94 200,000 0.86 117
7 1,000,000 0.95 20,000 0.94 200,000 0.86 121
8 1,000,000 0.95 20,000 0.94 200,000 0.86 120
9 500,000 0.96 20,000 0.95 100,000 0.87 127
10 500,000 0.96 20,000 0.95 100,000 0.86 125
11 200,000 0.96 20,000 0.95 40,000 0.85 130
12 200,000 0.96 20,000 0.95 40,000 0.85 122
Overall 9,450,000 0.98 260,000 0.98 1,887,000 0.93 116


Table 18: Reliability Estimates from the Star Reading 2018–2019 Data on the Enterprise Scale

Reliability Estimates—Enterprise Scale


Generic Split-Half Alternate Forms
Average Days
Grade N ρxx N ρxx N ρxx between Testing
K 50,000 0.95 20,000 0.95 7,000 0.76 91
1 1,000,000 0.95 20,000 0.95 200,000 0.80 100
2 1,000,000 0.94 20,000 0.95 200,000 0.85 114
3 1,000,000 0.93 20,000 0.94 200,000 0.86 113
4 1,000,000 0.93 20,000 0.93 200,000 0.86 115
5 1,000,000 0.93 20,000 0.92 200,000 0.86 115
6 1,000,000 0.93 20,000 0.92 200,000 0.87 117
7 1,000,000 0.93 20,000 0.93 200,000 0.87 121
8 1,000,000 0.94 20,000 0.93 200,000 0.88 120
9 500,000 0.94 20,000 0.94 100,000 0.88 127
10 500,000 0.94 20,000 0.94 100,000 0.88 125
11 200,000 0.95 20,000 0.94 40,000 0.87 130
12 200,000 0.95 20,000 0.95 40,000 0.87 122
Overall 9,450,000 0.97 260,000 0.97 1,887,000 0.93 116

Star Reading was designed to be a standards-based assessment, meaning that its item bank measures skills identified by exhaustive analysis of national
and state standards in Reading, from grades K–12. The 34-item Star Reading
content covers many more skills than Star Reading versions 1 through 4.3, which
administered only 25 items.

The increased length of the current version of Star Reading, combined with its
increased breadth of skills coverage and enhanced technical quality, was expected
to result in improved measurement precision; this showed up as slightly increased
reliability, in both internal consistency reliability and alternate form reliability as
shown in the tables above. For comparison, see Table 22 on page 60 and Table
23 on page 61.

Standard Error of Measurement


When interpreting the results of any test instrument, it is important to remember
that the scores represent estimates of a student’s true ability level. Test scores
are not absolute or exact measures of performance. Nor is a single test score
infallible in the information that it provides. The standard error of measurement can
be thought of as a measure of how precise a given score is. The standard error of measurement describes the extent to which scores would be expected to fluctuate because of chance. If measurement errors follow a normal distribution, an SEM
of 17 means that if a student were tested repeatedly, his or her scores would
fluctuate within 17 points of his or her first score about 68 percent of the time, and
within 34 points (twice the SEM) roughly 95 percent of the time. Since reliability
can also be regarded as a measure of precision, there is a direct relationship
between the reliability of a test and the standard error of measurement for the
scores it produces.

The Star Reading tests differ from traditional tests in at least two respects with
regard to the standard error of measurement. First, Star Reading software
computes the SEM for each individual student based on his or her performance,
unlike most traditional tests that report the same SEM value for every examinee.
Each administration of Star Reading yields a unique “conditional” SEM (CSEM)
that reflects the amount of information estimated to be in the specific combination
of items that a student received in his or her individual test. Second, because
the Star Reading test is adaptive, the CSEM will tend to be lower than that of
a conventional test, particularly at the highest and lowest score levels, where
conventional tests’ measurement precision is weakest. Because the adaptive
testing process attempts to provide equally precise measurement, regardless of
the student’s ability level, the average CSEMs for the IRT ability estimates are very
similar for all students.

Table 19 and Table 20 contain two different sets of estimates of Star Reading
measurement error: conditional standard error of measurement (CSEM) and
global standard error of measurement (SEM). Conditional SEM was just described;
the estimates of CSEM in Table 19 and Table 20 are the average CSEM values
observed for each grade.

Global standard error of measurement is based on the traditional SEM estimation method, using internal consistency reliability and the variance of the test scores to
estimate the SEM:

SEM = SQRT(1 – ρ) σx

where

SQRT() is the square root operator

ρ is the estimated internal consistency reliability

σx is the standard deviation of the observed scores (in this case, Scaled Scores)
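
For example, with purely illustrative values (not figures taken from the tables below), a reliability of 0.95 and a Scaled Score standard deviation of 80 give:

import math

def global_sem(reliability, score_sd):
    return math.sqrt(1.0 - reliability) * score_sd

print(round(global_sem(0.95, 80), 1))  # 17.9 scale score units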

Table 19 and Table 20 summarize the distribution of CSEM values for the 2018–
2019 data, overall and by grade level. The overall average CSEM on the Unified
scale across all grades was 17 scaled score units and ranged from a low of 16 in grades 1–3 to a high of 17 in grades K and 4–12 (Table 19). The average CSEM
based on the Unified scale is similar across all grades. The overall average unified
scale score global SEM was 18, slightly higher than the average CSEM. Table 20
shows the average CSEM values on the Enterprise Star Reading scale. Although
the adaptive testing process attempts to provide equally precise measurement,
regardless of the student’s ability level, and the average CSEMs for the IRT ability
estimates are very similar for all students, the transformation of the Star Reading
IRT ability estimates into equivalent Scaled Enterprise Scores is not linear and the
resulting SEMs in the Enterprise Scaled Score metric are less similar.

The overall average CSEM on the Enterprise scale across all grades was 54
scaled score units and ranged from a low of 20 in kindergarten to a high of 71
in grade 8. Unlike the Unified scale, the Enterprise Scale CSEM values vary by
grade and increased with grade until grade 8. The global SEMs for the Enterprise
scale scores were higher at each grade, and overall, than the average CSEMs; the
overall average SEM was 56. This is attributable to the nonlinear transformation of
the Star Reading IRT ability estimates into equivalent Enterprise Scaled Scores.
The Unified scale, in contrast, is based on a linear transformation of the IRT ability
estimates; it eliminates the issues of variable and large CSEM values that are an
artifact of the Enterprise Scaled Score nonlinear transformation.

Table 19: Standard Error of Measurement for the 2018–2019 Star Reading
Data on the Unified Scale

Standard Error of Measurement—Unified Scale


Grade | Sample Size | Conditional Average | Conditional Standard Deviation | Global SEM
K 50,000 17 2.3 19
1 1,000,000 16 1.3 18
2 1,000,000 16 1.2 17
3 1,000,000 16 1.3 17
4 1,000,000 17 1.3 17
5 1,000,000 17 1.3 17
6 1,000,000 17 1.4 17
7 1,000,000 17 1.5 17
8 1,000,000 17 1.7 17
9 500,000 17 1.9 18
10 500,000 17 2.1 18
11 200,000 17 2.2 18
12 200,000 17 2.5 18
All 9,450,000 17 1.5 18


Table 20: Standard Error of Measurement for the 2018–2019 Star Reading
Data on the Enterprise Scale

Standard Error of Measurement—Enterprise Scale

Grade | Sample Size | Conditional Average | Conditional Standard Deviation | Global SEM
K 50,000 20 16.5 24
1 1,000,000 24 14.0 27
2 1,000,000 33 13.0 36
3 1,000,000 42 15.5 45
4 1,000,000 50 19.6 55
5 1,000,000 58 22.7 63
6 1,000,000 65 24.4 70
7 1,000,000 69 25.4 75
8 1,000,000 71 26.5 77
9 500,000 70 27.5 77
10 500,000 69 28.8 77
11 200,000 69 29.1 76
12 200,000 67 29.6 75
All 9,450,000 54 27.5 56

Decision Accuracy and Decision Consistency


Decision accuracy is generally defined as the degree to which observed
examinee classification decisions on a single assessment would agree with true
classifications for a given set of cut scores. There are multiple approaches to
estimate decision accuracy. Star Reading uses Rudner’s index (Rudner, 2001;
2005) based on item response theory (IRT), which assumes that the maximum
likelihood estimate of ability converges to a normal distribution with mean equal to
θ and standard deviation equal to the conditional standard error of measurement
(CSEM). Mathematically, this index can be computed as:

Decision Accuracy = Σ(P * W) / Ne

where Σ denotes the summation of all matrix elements, * denotes element-wise
matrix multiplication, Ne is the number of examinees, P is an Ne × C matrix of
expected probabilities with C being the number of performance categories on the
assessment, and W is an Ne × C matrix of binary weights used to indicate the
observed performance categories on the assessment. The P matrix is defined as:

        [ p^11    p^12    ...  p^1C  ]
        [ p^21    p^22    ...  p^2C  ]
    P = [  ...     ...    ...   ...  ]
        [ p^Ne,1  p^Ne,2  ...  p^Ne,C ]

with the expected probability p^ic in the above matrix estimated as:

    p^ic = ϕ(κic, κi(c+1), θ^i, σ^θi),

where ϕ(a, b, μ, σ) is the area from a to b under a normal curve with a mean of
μ and a standard deviation of σ, θ^i is examinee i’s IRT ability estimate, σ^θi is the
corresponding CSEM for the ability estimate θ^i, and κic and κi(c+1) are cut scores
with κi1 = –∞, κi2 being the cut score separating performance categories 1 and 2,
κi3 being the cut score separating performance categories 2 and 3, and so on with
the last cut score κi(C+1) = ∞. The W matrix of weights is defined as:

        [ w11    w12    ...  w1C  ]
        [ w21    w22    ...  w2C  ]
    W = [  ...    ...   ...   ... ]
        [ wNe,1  wNe,2  ...  wNe,C ]

where the weight, wic, equals 1 if the student was classified into performance level
category c based on their ability estimate and 0 otherwise.

A counterpart to decision accuracy is decision consistency, defined as the degree to which examinees would be classified into the same performance categories given parallel replications of the same assessment. The method used to estimate decision consistency is based on an extension to Rudner’s decision accuracy index, which is described in Wyse and Hao (2012). This index can be estimated as:

Decision Consistency = Σ(P * P) / Ne

where Ne is the number of examinees and P is the same Ne × C matrix of expected probabilities used when computing the decision accuracy index.
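
Both indices can be computed directly from each examinee's ability estimate and CSEM. The sketch below assumes, for simplicity, a single set of cut scores on the Rasch ability metric shared by all examinees and a normal approximation to each score distribution; it illustrates the two formulas above rather than the operational implementation.

import numpy as np
from scipy.stats import norm

def decision_indices(theta_hats, csems, cuts):
    """cuts: interior cut scores in increasing order; C categories = len(cuts) + 1."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    csems = np.asarray(csems, dtype=float)
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    n, c = len(theta_hats), len(bounds) - 1
    p = np.empty((n, c))
    for j in range(c):
        # Probability mass of each examinee's score distribution falling in category j.
        p[:, j] = (norm.cdf(bounds[j + 1], theta_hats, csems)
                   - norm.cdf(bounds[j], theta_hats, csems))
    observed = np.digitize(theta_hats, cuts)  # observed category for each examinee
    w = np.zeros_like(p)
    w[np.arange(n), observed] = 1.0           # binary weight matrix W
    accuracy = np.sum(p * w) / n              # Rudner decision accuracy
    consistency = np.sum(p * p) / n           # Wyse-Hao decision consistency
    return accuracy, consistency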


For Star Reading, three different classification decisions based on benchmarks set at the 10th, 25th, and 40th percentile ranks in the student norms are available
by default in the Star Reading software. These cut scores are used to separate
students into four different performance categories: intensive intervention,
intervention, on watch, and at/above benchmark. Table 21 shows estimates of
decision accuracy and consistency when identifying students based on the three
individual benchmarks as well as all three benchmarks together using random
samples of students that took Star Reading in the 2018–2019 school year.

Results indicate that decision accuracy and consistency were quite high overall
and across grades. For PR10, decision accuracy ranged from a low of 0.95 to
a high of 0.99, while decision consistency ranged from 0.93 to 0.99. For PR25,
decision accuracy ranged from a low of 0.93 to a high of 0.97, while decision
consistency ranged from 0.90 to 0.96. For PR40, decision accuracy ranged from a
low of 0.92 to a high of 0.95, while decision consistency ranged from 0.89 to 0.93.
Decision accuracy when using all three benchmarks together ranged from a low
of 0.81 to a high of 0.93, while decision consistency ranged from a low of 0.74 to
a high of 0.89. These are high levels of decision accuracy and consistency when
making classification decisions based on each individual benchmark or all three
benchmarks together, and support using Star Reading in RTI/MTSS frameworks.

Table 21: Decision Accuracy and Consistency for Different Benchmarks Based on 2018–2019 Star
Reading Tests
Decision Accuracy Decision Consistency
All 3 All 3
Grade N PR10 PR25 PR40 Benchmarks PR10 PR25 PR40 Benchmarks
K 50,000 0.99 0.97 0.95 0.92 0.99 0.96 0.93 0.89
1 1,000,000 0.99 0.96 0.94 0.89 0.98 0.94 0.92 0.85
2 1,000,000 0.97 0.95 0.94 0.86 0.96 0.93 0.91 0.81
3 1,000,000 0.97 0.94 0.94 0.84 0.94 0.92 0.90 0.79
4 1,000,000 0.97 0.94 0.92 0.83 0.95 0.92 0.89 0.77
5 1,000,000 0.96 0.93 0.92 0.82 0.95 0.90 0.89 0.75
6 1,000,000 0.95 0.93 0.92 0.81 0.94 0.90 0.89 0.74
7 1,000,000 0.96 0.93 0.92 0.81 0.94 0.90 0.89 0.75
8 1,000,000 0.96 0.93 0.92 0.81 0.94 0.90 0.89 0.74
9 500,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
10 500,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
11 200,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
12 200,000 0.95 0.93 0.93 0.81 0.93 0.90 0.91 0.74
Overall 9,450,000 0.96 0.94 0.93 0.83 0.95 0.91 0.90 0.77


25-Item Star Reading Progress Monitoring Tests


Star Reading is used for both universal screening and progress monitoring.
The 34-item Star Reading test is widely used for universal screening. A shorter
version—the 25-item Star Reading progress monitoring test—exists for use
in progress monitoring. The following section summarizes the reliability and
the standard error of measurement of the progress monitoring version of Star
Reading.

Reliability Coefficients
Table 22 and Table 23 show the reliability estimates of the Star Reading progress
monitoring test on both the Unified scale and the Enterprise scale using data from
the 2017–2018 and 2018–2019 school years.

Table 22: Reliability Estimates from the 2017–2018 and 2018–2019 Star
Reading Progress Monitoring Tests on the Unified Scale

Progress Monitoring Reliability Estimates—Unified Scale


Generic Split-Half
Grade N ρxx N ρxx
1 10,000 0.94 9,400 0.94
2 30,000 0.91 29,000 0.92
3 30,000 0.89 31,000 0.90
4 30,000 0.88 29,000 0.89
5 28,500 0.87 26,000 0.88
6 14,000 0.88 14,099 0.90
7 10,000 0.89 9,400 0.91
8 10,000 0.91 9,400 0.93
9 1,800 0.90 1,619 0.93
10 1,450 0.92 1,376 0.93
11 730 0.93 686 0.96
12 480 0.94 444 0.96
Overall 166,960 0.96 161,424 0.96


Table 23: Reliability Estimates from the 2017–2018 and 2018–2019 Star
Reading Progress Monitoring Tests on the Enterprise Scale

Progress Monitoring Reliability Estimates—Enterprise Scale


Generic Split-Half
Grade N ρxx N ρxx
1 10,000 0.94 9,400 0.94
2 30,000 0.92 29,000 0.92
3 30,000 0.90 31,000 0.89
4 30,000 0.89 29,000 0.88
5 28,500 0.88 26,000 0.87
6 14,000 0.89 14,099 0.88
7 10,000 0.91 9,400 0.89
8 10,000 0.93 9,400 0.91
9 1,800 0.93 1,619 0.91
10 1,450 0.94 1,376 0.92
11 730 0.95 686 0.94
12 480 0.96 444 0.95
Overall 166,960 0.94 161,424 0.94

The progress monitoring Star Reading reliability estimates are also quite high and
consistent across grades 1–12, for a test composed of only 25 items.

Overall, these coefficients also compare very favorably with the reliability
estimates provided for other published reading tests, which typically contain far
more items than the 25-item Star Reading progress monitoring tests. The Star
Reading progress monitoring test’s high reliability with minimal testing time is
a result of careful test item construction and an effective and efficient adaptive-
branching procedure.

Standard Error of Measurement


Table 24 and Table 25 show the conditional standard error of measurement
(CSEM) and the global standard error of measurement (SEM), overall and by
grade level.


Table 24: Estimates of 2017–2018 and 2018–2019 Star Reading Progress Monitoring Measurement
Precision by Grade and Overall, on the Unified Scale

Progress Monitoring Standard Error of Measurement—Unified Scale


Grade | Conditional Sample Size | Conditional Average | Conditional Standard Deviation | Global Sample Size | Global SEM
1 10,000 19 2.1 9,400 19
2 30,000 19 1.2 29,000 18
3 30,000 19 1.4 31,000 18
4 30,000 19 1.5 29,000 19
5 28,500 19 1.5 26,000 19
6 14,000 19 1.5 14,099 18
7 10,000 19 1.5 9,400 19
8 10,000 19 1.6 9,400 19
9 1,800 19 1.6 1,619 18
10 1,450 19 1.9 1,376 19
11 730 19 1.7 686 18
12 480 19 2.2 444 18
All 166,960 19 1.5 161,424 18

Table 25: Estimates of 2017–2018 and 2018–2019 Star Reading Progress Monitoring Measurement
Precision by Grade and Overall, on the Enterprise Scale

Progress Monitoring Standard Error of Measurement—Enterprise Scale


Grade | Conditional Sample Size | Conditional Average | Conditional Standard Deviation | Global Sample Size | Global SEM
1 10,000 21 15.3 9,400 25
2 30,000 34 12.9 29,000 36
3 30,000 44 12.9 31,000 45
4 30,000 52 17.4 29,000 55
5 28,500 59 20.5 26,000 63
6 14,000 64 22.4 14,099 67
7 10,000 71 26.0 9,400 76
8 10,000 78 28.8 9,400 82
9 1,800 78 29.1 1,619 78
10 1,450 78 32.7 1,376 85
11 730 78 32.7 686 79
12 480 77 33.7 444 81
All 166,960 51 24.1 161,424 56


Comparing the estimates of reliability and measurement error of Star Reading (Table 17, Table 18, Table 19, and Table 20) with those of Star Reading progress monitoring (Table 22, Table 23, Table 24, and Table 25) confirms that Star Reading
is slightly superior to the shorter Star Reading progress monitoring assessments in
terms of reliability and measurement precision.

Decision Accuracy and Consistency


Table 26 shows the decision accuracy and consistency indices for PR10, PR25,
and PR40 benchmarks for Star Reading Progress Monitoring based on data
collected in the 2017-2018 and 2018-2019 school years. Results suggest that the
decision accuracy and consistency for the Star Reading Progress Monitoring tests
was high, but slightly lower than the values observed for the 34-item Star Reading
tests. These high levels of decision accuracy and consistency support using Star
Reading tests in RTI/MTSS frameworks.

Table 26: Decision Accuracy and Consistency for Different Benchmarks Based on 2017–2018 and
2018–2019 Star Reading Progress Monitor Tests
Decision Accuracy Decision Consistency
All 3 All 3
Grade N PR10 PR25 PR40 Benchmarks PR10 PR25 PR40 Benchmarks
1 10,000 0.96 0.92 0.91 0.79 0.95 0.88 0.87 0.73
2 30,000 0.94 0.91 0.92 0.77 0.92 0.87 0.88 0.70
3 30,000 0.94 0.90 0.89 0.75 0.92 0.86 0.85 0.67
4 30,000 0.93 0.89 0.89 0.73 0.91 0.84 0.85 0.64
5 28,500 0.93 0.88 0.90 0.71 0.90 0.83 0.86 0.63
6 14,000 0.90 0.88 0.92 0.71 0.86 0.84 0.88 0.62
7 10,000 0.91 0.90 0.92 0.74 0.87 0.85 0.89 0.65
8 10,000 0.92 0.90 0.93 0.76 0.89 0.87 0.90 0.68
9 1,800 0.91 0.91 0.93 0.76 0.87 0.87 0.91 0.68
10 1,450 0.91 0.92 0.93 0.77 0.88 0.89 0.90 0.70
11 730 0.93 0.92 0.93 0.78 0.90 0.89 0.90 0.71
12 480 0.93 0.92 0.20 0.78 0.91 0.89 0.89 0.72
Overall 166,960 0.93 0.90 0.90 0.74 0.91 0.85 0.87 0.66

Validity

Test validity was long described as the degree to which a test measures what
it is intended to measure. A more current description is that a test is valid to
the extent that there are evidentiary data to support specific claims as to what
the test measures, the interpretation of its scores, and the uses for which
it is recommended or applied. Evidence of test validity is often indirect and
incremental, consisting of a variety of data that in the aggregate are consistent
with the theory that the test measures the intended construct(s), or is suitable for
its intended uses and interpretations of its scores. Determining the validity of a test
involves the use of data and other information both internal and external to the test
instrument itself.

Content Validity
One touchstone is content validity, which is the relevance of the test questions
to the attributes or dimensions intended to be measured by the test—namely
reading comprehension, reading vocabulary, and related reading skills, in the case
of the Star Reading assessments. The content of the item bank and the content
balancing specifications that govern the administration of each test together form
the foundation for “content validity” for the Star Reading assessments. These
content validity issues were discussed in detail in “Content and Item Development”
and were an integral part of the test items that are the basis of Star Reading today.

Construct Validity
Construct validity, which is the overarching criterion for evaluating a test, investigates
the extent to which a test measures the construct(s) that it claims to be assessing.
Establishing construct validity involves the use of data and other information external
to the test instrument itself. For example, Star Reading claims to provide an estimate
of a child’s reading comprehension and achievement level. Therefore, demonstration
of Star Reading’s construct validity rests on the evidence that the test provides such
estimates. There are a number of ways to demonstrate this.

For instance, in a study linking Star Reading Version 1 and the Degrees of
Reading Power comprehension assessment, a raw correlation of 0.89 was
observed between the two tests. Adjusting that correlation for attenuation due to
unreliability yielded a corrected correlation of 0.96 between the two assessments,
indicating that the constructs measured by the different tests are essentially
indistinguishable.
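
The correction for attenuation used here is the standard disattenuation formula, dividing the observed correlation by the square root of the product of the two tests' reliabilities. The reliability values in the sketch below are placeholders for illustration, not the values used in that study.

def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    return r_observed / (reliability_x * reliability_y) ** 0.5

print(round(disattenuated_correlation(0.89, 0.90, 0.94), 2))  # illustrative inputs only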


Since reading ability varies significantly within and across grade levels and
improves as a student’s grade placement increases, scores within Star Reading
should demonstrate these anticipated internal relationships; in fact, they do.
Additionally, scores for Star Reading should correlate highly with other accepted
procedures and measures that are used to determine reading achievement and
reading comprehension; this is external construct validity. This section deals
with both internal and external evidence of the validity of Star Reading as an
assessment of reading comprehension and reading skills.

Internal Evidence: Evaluation of Unidimensionality of Star Reading
Star Reading is a 34-item computerized-adaptive assessment that measures
reading comprehension. Its items are selected adaptively for each student, from
a very large bank of reading test items, each of which is aligned to one of five
blueprint domains:
X Word knowledge and skills,
X Comprehension strategies and constructing meaning,
X Analyzing literary text,
X Analyzing argument and evaluating text, and
X Understanding author’s craft.

Star Reading is an application of item response theory (IRT); each test item’s
difficulty has been calibrated using the Rasch model. One of the assumptions of
the Rasch model is unidimensionality: that a test measures only a single construct
such as reading comprehension in the case of Star Reading. To evaluate whether
Star Reading measures a single construct, factor analyses were conducted. Factor
analysis is a statistical technique used to determine the number of dimensions or
constructs that a test measures. Both exploratory and confirmatory factor analyses
were conducted across grades K to 12.

To begin, a large sample of student Star Reading data was assembled. The overall
sample consisted of 286,000 student records. That sample was divided into 2
sub-samples. The first sub-sample, consisting of 26,000 cases, was used for
exploratory factor analysis; the second sub-sample, 260,000 cases, was reserved
for confirmatory factor analyses that followed the initial exploratory analysis.

Within each sub-sample, each student’s 34 Star Reading item responses were
divided into subsets of items aligned to each of the 5 blueprint domains. Tests
administered in grades 4–12 included items from all five domains. Tests given in grades K–3 included items from just 4 domains; no items measuring analyzing
argument and evaluating text were administered in these grades. For each
student, separate Rasch ability estimates (subtest scores) were calculated from
each domain-specific subset of item responses. A Bayesian sequential procedure
developed by Owen (1969, 1975) was used for the subtest scoring. The number of
items included in each subtest ranged from 2 to 18, following the Star Reading test
blueprints, which specify different numbers of items per domain, depending on the
student’s grade level.

Intercorrelations of the blueprint domain-specific Rasch subtest scores were analyzed using exploratory factor analysis (EFA) to evaluate the number of dimensions/factors underlying Star Reading. Varimax rotation was used. In
each grade, the EFA analyses retained a single dominant underlying dimension
based on either the MINEIGEN (eigenvalue greater than 1) or the PROPORTION
criterion (proportion of variance explained by the factor), as expected. An example
of a scree plot from grade 2 based on the PROPORTION criterion is shown in
Figure 3.
Figure 3: Example Scree Plot from the Grade 2 Exploratory Factor Analysis in
Star Reading
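
For readers who want to reproduce the flavor of these criteria, the following sketch applies the eigenvalue-greater-than-1 (MINEIGEN) and proportion-of-variance checks to a matrix of domain subtest scores. The published analyses were run in R (with the lavaan package for the CFA), so this Python fragment is only an illustration.

import numpy as np

def dimensionality_summary(subtest_scores):
    """subtest_scores: one row per student, one column per blueprint-domain subtest."""
    corr = np.corrcoef(np.asarray(subtest_scores, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first
    return {
        "eigenvalues": eigenvalues,
        "n_factors_mineigen": int(np.sum(eigenvalues > 1.0)),
        "proportion_first_factor": float(eigenvalues[0] / eigenvalues.sum()),
    }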

Subsequent to the EFA analyses, confirmatory factor analyses (CFA) were also conducted using the subtest scores from the CFA sub-sample. A separate
confirmatory analysis was conducted for each grade. The CFA models tested
a single underlying factor, as shown in Figure 4. Two CFA models were fitted
because one of the Star Reading blueprint domains is not tested in grades
K to 3.


Figure 4: Confirmatory Factor Analyses (CFA) in Star Reading

The results of the CFA analyses are summarized in Table 27. As that table
indicates, sample sizes ranged from 18,723 to 20,653; because the chi-square
(Χ2) test is not a reliable test of model fit when sample sizes are large, fit indices
are presented. The comparative fit index (CFI) and Tucker-Lewis Index (TLI) are
shown; for these indices, values are either 1 or very close to 1, indicating strong
evidence of a single construct/dimension. In addition, the root mean square error
of approximation (RMSEA), and the standardized root mean square residual
(SRMR) are presented. RMSEA and SRMR values less than 0.08 indicate good
fit. Cutoffs for the indices are presented in Hu and Bentler, 1999. Overall, the CFA
results strongly support a single underlying construct in Star Reading.

Table 27: Summary of the Goodness-of-Fit of the CFA Models for Star Reading by Grade

Grade N Χ2 df CFI TLI RMSEA SRMR


K 20,000 16.005 2 1.000 1.000 0.019 0.002
1 20,000 8.716 2 1.000 1.000 0.013 0.002
2 20,000 34.23 2 1.000 0.999 0.028 0.003
3 20,000 34.982 2 1.000 0.999 0.029 0.003
4 20,000 109.821 5 0.999 0.997 0.032 0.005
5 20,000 53.772 5 0.999 0.999 0.022 0.004
6 20,000 127.682 5 0.998 0.997 0.035 0.006
7 20,000 154.811 5 0.998 0.996 0.039 0.006
8 20,000 193.981 5 0.998 0.995 0.043 0.007
9 20,000 218.099 5 0.997 0.995 0.046 0.007
10 20,000 253.103 5 0.997 0.994 0.050 0.007
11 20,000 229.383 5 0.997 0.994 0.047 0.007
12 20,000 240.141 5 0.997 0.994 0.048 0.007


The EFA analyses were conducted using the factor analysis procedure in R, while
the CFA analysis was conducted using R with the lavaan package (Rosseel, 2012).

External Evidence: Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement
In an ongoing effort to gather evidence for the validity of Star Reading scores,
continual research on score validity has been undertaken. In addition to original
validity data gathered at the time of initial development, numerous other studies
have investigated the correlations between Star Reading tests and other external
measures. In addition to gathering concurrent validity estimates, predictive validity
estimates have also been investigated. Concurrent validity was defined for students
taking a Star Reading test and external measures within a two-month time period.
Predictive validity provides an estimate of the extent to which scores on the Star
Reading test predicted scores on criterion measures given at a later point in time,
operationally defined as more than two months between the Star test (predictor)
and the criterion test. Studies of Star Reading tests’ concurrent and predictive
correlations with other tests between 1999 and 2013 included the following other
tests:
X AIMSweb
X Arkansas Augmented Benchmark Examination (AABE)
X California Achievement Test (CAT)
X Canadian Achievement Test (CAT)
X Colorado Student Assessment Program (CSAP)
X Comprehensive Test of Basic Skills (CTBS)
X Delaware Student Testing Program (DSTP)—Reading
X Dynamic Indicators of Basic Early Literacy Skills (DIBELS)—Oral Reading
Fluency
X Florida Comprehensive Assessment Test (FCAT, FCAT 2.0)
X Gates-MacGinitie Reading Test (GMRT)
X Idaho Standards Achievement Test (ISAT)
X Illinois Standards Achievement Test—Reading
X Iowa Test of Basic Skills (ITBS)
X Kansas State Assessment Program (KSAP)


X Kentucky Core Content Test (KCCT)
X Metropolitan Achievement Test (MAT)
X Michigan Educational Assessment Program (MEAP)—English Language Arts
and Reading
X Mississippi Curriculum Test (MCT2)
X Missouri Mastery Achievement Test (MMAT)
X New Jersey Assessment of Skills and Knowledge (NJ ASK)
X New York State Assessment Program
X North Carolina End-of-Grade (NCEOG) Test
X Ohio Achievement Assessment (OAA)
X Oklahoma Core Curriculum Test (OCCT)
X South Dakota State Test of Educational Progress (DSTEP)
X Stanford Achievement Test (SAT)
X State of Texas Assessments of Academic Readiness Standards Test (STAAR)
X Tennessee Comprehensive Assessment Program (TCAP)
X TerraNova
X Texas Assessment of Academic Skills (TAAS)
X Transitional Colorado Assessment Program (TCAP)
X West Virginia Educational Standards Test 2 (WESTEST 2)
X Woodcock Reading Mastery (WRM)
X Wisconsin Knowledge and Concepts Examination (WKCE)
X Wide Range Achievement Test 3 (WRAT 3)

Table 28 and Table 29 present summary evidence of concurrent validity collected between 1999 and 2013; between them, these tables summarize some 269
different analyses of concurrent validity with other tests, based on test scores of
more than 300 thousand school children. The within-grade average concurrent
validity coefficients for grades 1–6 varied from 0.72–0.80, with an overall average
of 0.74. The within-grade average concurrent validity for grades 7–12 ranged from
0.65–0.76, with an overall average of 0.72.

Table 30 and Table 31 present summary evidence of predictive validity collected over the same time span: 1999 through 2013. These two tables display summaries of data on 300 coefficients of correlation between Star Reading and other measures
administered at points in time at least two months later than Star Reading; more
than 1.45 million students’ test scores are represented in these two tables.
Predictive validity coefficients ranged from 0.69–0.72 in grades 1–6, with an
average of 0.71. In grades 7–12 the predictive validity coefficients ranged from
0.72–0.87 with an average of 0.80.

In general, these correlation coefficients reflect very well on the validity of the Star
Reading test as a tool for placement, achievement measurement, and intervention monitoring in
reading. In fact, the correlations are similar in magnitude to the validity coefficients of
these measures with one another. These validity results, combined with the supporting
evidence of reliability and the low standard error of measurement (SEM) estimates for the
Star Reading test, provide a quantitative demonstration of how well this instrument performs
as a measure of reading achievement.

For a compilation of all detailed validation information, see tables of correlations in
“Appendix B: Detailed Evidence of Star Reading Validity”.

Table 28: Concurrent Validity Data: Star Reading Correlations (r) with External Tests Administered
Spring 1999–Spring 2013, Grades 1–6

Summary
Grade(s) All 1 2 3 4 5 6
Number of students 255,538 1,068 3,629 76,942 66,400 54,173 31,686
Number of coefficients 195 10 18 47 41 32
Average validity 0.80 0.73 0.72 0.72 0.74 0.72
Overall average 0.74

Table 29: Concurrent Validity Data: Star Reading Correlations (r) with External Tests Administered
Spring 1999–Spring 2013, Grades 7–12

Summary
Grade(s) All 7 8 9 10 11 12
Number of students 48,789 25,032 21,134 1,774 755 55 39
Number of coefficients 74 30 29 7 5 2 1
Average validity 0.74 0.73 0.65 0.76 0.70 0.73
Overall average 0.72

Table 30: Predictive Validity Data: Star Reading Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 1–6

Summary
Grade(s) All 1 2 3 4 5 6
Number of students 1,227,887 74,887 188,434 313,102 289,571 217,416 144,477
Number of coefficients 194 6 10 49 43 47 39
Average validity 0.69 0.72 0.70 0.71 0.72 0.71
Overall average 0.71

Table 31: Predictive Validity Data: Star Reading Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 7–12

Summary
Grade(s) All 7 8 9 10 11 12
Number of students 224,179 111,143 72,537 9,567 21,172 6,653 3,107
Number of coefficients 106 39 41 8 10 6 2
Average validity 0.72 0.73 0.81 0.81 0.87 0.86
Overall average 0.80

Relationship of Star Reading Scores to Scores on State Tests of Accountability in Reading
The No Child Left Behind (NCLB) Act of 2001 required states to develop and
employ their own accountability tests to assess students in ELA/Reading and
Math in grades 3 through 8, and one high school grade. Until 2014, most states
used their own accountability tests for this purpose. Renaissance Learning was
able to obtain accountability test scores for many students who also took Star
Reading; in such cases, it was feasible to calculate coefficients of correlation
between Star Reading scores and the state test scores. Observed concurrent and
predictive validity correlations are summarized below for the relationship between
Star Reading and state accountability test scores for grades 3–8 for tests of both
reading and language arts. Table 32 and Table 33 provide summaries from a
variety of concurrent and predictive validity coefficients, respectively, for grades
3–8. Numerous state accountability tests have been used in this research.

Table 32: Concurrent Validity Data: Star Reading Correlations (r) with State Accountability Tests,
Grades 3–8

Summary
Grade(s) All 3 4 5 6 7 8
Number of students 11,045 2,329 1,997 2,061 1,471 1,987 1,200
Number of coefficients 61 12 13 11 8 10 7
Average validity 0.72 0.73 0.73 0.71 0.74 0.73
Overall average 0.73

Table 33: Predictive Validity Data: Star Reading Scaled Scores Predicting Later Performance for
Grades 3–8 on Numerous State Accountability Tests

Summary
Grade(s) All 3 4 5 6 7 8
Number of students 22,018 4,493 2,974 4,086 3,624 3,655 3,186
Number of coefficients 119 24 19 23 17 17 19
Average validity 0.66 0.68 0.70 0.68 0.69 0.70
Overall average 0.68

For grades 3 to 8, Star Reading concurrent validity correlations by grade ranged from 0.71
to 0.74, with an overall average validity correlation of 0.73 (Table 32). For grades 3 to 8,
Star Reading predictive validity correlations by grade ranged from 0.66 to 0.70, with an
overall average validity correlation of 0.68 (Table 33).

Relationship of Star Reading Scores to Scores on Multi-State Consortium Tests in Reading
In recent years, the National Governors’ Association, in collaboration with the
Council of Chief State School Officers (CCSSO), developed a proposed set of
curriculum standards in English Language Arts and Math, called the Common
Core State Standards. Forty-five states voluntarily adopted those standards;
subsequently, many states have dropped them, but 14 states continue to use
them or base their own state standards on them. Two major consortia were formed
to develop assessment systems that embodied those standards: the Smarter
Balanced Assessment Consortium (SBAC) and the Partnership for Assessment of
Readiness for College and Careers (PARCC). SBAC and PARCC end-of-year
assessments have been administered in numerous states in place of those states’
previous annual accountability assessments. Renaissance Learning was able to
obtain SBAC and PARCC scores of many students who had taken Star Reading
earlier in the same school years. Table 34 and Table 35, below, contain coefficients
of correlation between Star Reading and the consortium tests.

Table 34: Concurrent and Predictive Validity Data: Star Reading Scaled Scores Predicting Later
Performance for Grades 3–8 on the Smarter Balanced Assessment Consortium Test

Star Reading Predictive and Concurrent Correlations with Smarter Balanced Assessment Scores
Grade(s) All 3 4 5 6 7 8
Number of students 3,539 709 690 697 567 459 417
Fall Predictive 0.78 0.78 0.76 0.77 0.79 0.80
Winter Predictive 0.78 0.78 0.79 0.78 0.79 0.81
Spring Concurrent 0.79 0.82 0.80 0.70 0.79 0.81

Table 35: Concurrent and Predictive Validity Data: Star Reading Scaled Scores Correlations for
Grades 3–8 with PARCC Assessment Consortium Test Scores

Star Reading Predictive and Concurrent Correlations with PARCC Assessment Scores
Grade(s) All 3 4 5 6 7 8
Number of students 22,134 1,770 3,950 3,843 4,370 4,236 3,965
Predictive 0.82 0.85 0.82 0.81 0.83 0.80
Concurrent 0.83 0.82 0.78 0.79 0.80 0.77

The average of the concurrent correlations was approximately 0.79 for SBAC
and 0.80 for PARCC. The average predictive correlation was 0.78 for the SBAC
assessments, and 0.82 for PARCC.

Meta-Analysis of the Star Reading Validity Data


Meta-analysis is a statistical procedure for combining results from different
sources or studies. When applied to a set of correlation coefficients that estimate
test validity, meta-analysis combines the observed correlations and sample sizes
to yield estimates of overall validity. In addition, standard errors and confidence
intervals can be computed for overall validity estimates as well as within-grade
validity estimates. To conduct a meta-analysis of the Star Reading validity data,
789 correlations reported in the Star Reading Technical Manual were combined
and analyzed using a fixed-effects model for meta-analysis (see Hedges and
Olkin, 1985, for a methodology description).
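The fixed-effects computation can be illustrated with a short sketch. This is not Renaissance's code; it shows one common implementation (combining correlations on Fisher's z scale, weighted by n − 3), consistent with the Hedges and Olkin approach cited above. The function name and example inputs are hypothetical.

```python
import numpy as np

def fixed_effect_validity(rs, ns):
    """Combine validity correlations under a fixed-effects model.

    rs: observed correlations from individual studies; ns: their sample sizes.
    Correlations are averaged on Fisher's z scale with weights of n - 3
    (the inverse sampling variance of z), then transformed back to r.
    Returns the pooled estimate and a 95% confidence interval.
    """
    rs, ns = np.asarray(rs, dtype=float), np.asarray(ns, dtype=float)
    z = np.arctanh(rs)                      # Fisher r-to-z transformation
    w = ns - 3.0                            # inverse variance weights
    z_bar = np.sum(w * z) / np.sum(w)       # weighted mean on the z scale
    se = 1.0 / np.sqrt(np.sum(w))           # standard error of the pooled z
    lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
    return np.tanh(z_bar), np.tanh(lo), np.tanh(hi)

# Illustrative call with made-up study results (not the coefficients in Table 36)
pooled_r, ci_lower, ci_upper = fixed_effect_validity(rs=[0.72, 0.81, 0.77],
                                                     ns=[1500, 900, 2400])
```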

The results are displayed in Table 36. The table lists correlations within each
grade, as well as results from combining data from all twelve grades. For each
set of results, the table gives an estimate of the true validity, a standard error, and
the lower and upper limits of a 95 percent confidence interval for the expected
validity coefficient. Using the 789 correlation coefficients, the overall estimate of
the validity of Star Reading is 0.79, with a standard error of 0.001. The 95 percent
confidence interval allows one to conclude that the true validity coefficient for Star
Reading is approximately 0.79. The probability of observing the 789 correlations
reported in Table 28 through Table 35, if the true validity were zero, would be
virtually zero. Because the 789 correlations were obtained with widely different
tests, and among students from twelve different grades, these results provide
strong support for the validity of Star Reading as a measure of reading skills.

Table 36: Results of the Meta-Analysis of Star Reading Correlations with Other Tests

                         Effect Size                          95% Confidence Level
Grade    Validity Estimate    Standard Error    Lower Limit    Upper Limit    Total Correlations    Total N
1 0.70 0.00 0.69 0.70 18 78,022
2 0.78 0.00 0.78 0.78 32 196,114
3 0.79 0.00 0.79 0.79 131 628,336
4 0.79 0.00 0.79 0.79 125 594,712
5 0.79 0.00 0.79 0.79 123 518,411
6 0.79 0.00 0.79 0.79 106 330,475
7 0.79 0.00 0.79 0.79 98 276,218
8 0.79 0.00 0.79 0.79 98 225,704
9 0.78 0.01 0.78 0.79 19 27,952
10 0.82 0.01 0.81 0.82 21 34,913
11 0.74 0.01 0.73 0.74 15 32,798
12 0.86 0.02 0.85 0.87 3 3,146
All Grades 0.79 0.00 0.79 0.79 789 2,946,801

Additional Validation Evidence for Star Reading


This section provides summaries of additional validation data along with tables of
results. Data from four sources are presented here: a longitudinal study of correlations
with the SAT9, a concurrent validity study in England, a study of Star Reading’s construct
validity as a measure of reading comprehension, and a study linking Star Reading to oral
reading fluency.

A Longitudinal Study: Correlations with SAT9


Sadusky and Brem (2002) conducted a study to determine the effects of
implementing Reading Renaissance (RR)1 at a Title I school in the southwest
from 1997–2001. This was a retrospective longitudinal study. Incidental to the
study, they obtained students’ Star Reading posttest scores and SAT9 end-of-year
Total Reading scores from each year and calculated correlations between them.
Students’ test scores were available for multiple years, spanning grades 2–6. Data
on gender, ethnic group, and Title I eligibility were also collected.

1. Reading Renaissance is a supplemental reading program that uses Star Reading and
Accelerated Reader.

Table 37 displays the observed correlations for the overall group. Table 38 displays
the same correlations, broken out by ethnic group.

Overall correlations by year ranged from 0.66–0.73. Sadusky and Brem concluded
that “Star results can serve as a moderately good predictor of SAT9 performance
in reading.”

Enough Hispanic and white students were identified in the sample to calculate
correlations separately for those two groups. Within each ethnic group, the
correlations were similar in magnitude, as Table 38 shows. This supports the
assertion that Star Reading is valid for multiple student ethnicities.

Table 37: Correlations of the Star Posttest with the SAT9 Total Reading Scores
1998–2002a

Year Grades N Correlation


1998 3–6 44 0.66
1999 2–6 234 0.69
2000 2–6 389 0.67
2001 2–6 361 0.73
a. All correlations significant, p < 0.001.

Table 38: Correlations of the Star Posttest with the SAT9 Total Reading
Scores, by Ethnic Group, 1998–2002a

Hispanic White
Year Grade N Correlation N Correlation
1998 3–6 7 (n.s.) 0.55 35 0.69
1999 2–6 42 0.64 179 0.75
2000 2–6 67 0.74 287 0.71
2001 2–6 76 0.71 255 0.73
a. All correlations significant, p < 0.001, unless otherwise noted.

Concurrent Validity: An International Study of Correlations with Reading Tests in England
NFER, the National Foundation for Educational Research, conducted a study
of the concurrent validity of both Star Reading and Star Math in 16 schools in
England in 2006 (Sewell, Sainsbury, Pyle, Keogh and Styles, 2007). English
primary and secondary students in school years 2–9 (equivalent to US grades

1–8) took both Star Reading and one of three age-appropriate forms of the
Suffolk Reading Scale 2 (SRS2) in the fall of 2006. Scores on the SRS2 included
traditional scores, as well as estimates of the students’ Reading Age (RA), a
scale that is roughly equivalent to the Grade Equivalent (GE) scores used in the
US. Additionally, teachers conducted individual assessments of each student’s
attainment in terms of curriculum levels, a measure of developmental progress
that spans the primary and secondary years in England.

Correlations with all three measures are displayed in Table 39, by grade and
overall. As the table indicates, the overall correlation between Star Reading
and Suffolk Reading Scaled Scores was 0.91, the correlation with Reading Age
was 0.91, and the correlation with teacher assessments was 0.85. Within-form
correlations with the SRS ability estimate ranged from 0.78–0.88, with a median
correlation of 0.84, and ranged from 0.78–0.90 on Reading Age, with a median of
0.85.

Table 39: Correlations of Star Reading with Scores on the Suffolk Reading
Scale and Teacher Assessments in a Study of 16 Schools in England

                                  Suffolk Reading Scale                       Teacher Assessments
School Yearsᵃ    Test Form      N        SRS Scoreᵇ    Reading Age          N        Assessment Levels
2–3 SRS1A 713 0.84 0.85 n/a n/a
4–6 SRS2A 1,255 0.88 0.90 n/a n/a
7–9 SRS3A 926 0.78 0.78 n/a n/a
Overall 2,694 0.91 0.91 2,324 0.85
a. UK school year values are 1 greater than the corresponding US school grade. Thus, Year 2
corresponds to Grade 1, etc.
b. Correlations with the individual SRS forms were calculated with within-form raw scores. The
overall correlation was calculated with a vertical Scaled Score.

Construct Validity: Correlations with a Measure of Reading Comprehension


The Degrees of Reading Power (DRP) test is widely recognized as a measure
of reading comprehension. Yoes (1999) conducted an analysis to link the Star
Reading Rasch item difficulty scale to the item difficulty scale of DRP. As part
of the study, nationwide samples of students in grades 3, 5, 7, and 10 took two
tests each (leveled forms of both the DRP and of Star Reading calibration tests).
The forms administered were appropriate to each student’s grade level. Both
tests were administered in paper-and-pencil format. All Star Reading test forms
consisted of 44 items, a mixture of vocabulary-in-context and extended passage
comprehension item types. The grade 3 DRP test form (H-9) contained 42 items, and the DRP
forms for the remaining grades (5, 7, and 10) each contained 70 items.

Star Reading and DRP test score data were obtained on 273 students at grade 3,
424 students at grade 5, 353 students at grade 7, and 314 students at grade 10.

Item-level factor analysis of the combined Star and DRP response data indicated
that the tests were essentially measuring the same construct at each of the four
grades. Eigenvalues from the factor analysis of the tetrachoric correlation matrices
tended to verify the presence of an essentially unidimensional construct. In
general, the eigenvalue associated with the first factor was very large in relation to
the eigenvalue associated with the second factor. Overall, these results confirmed
the essential unidimensionality of the combined Star Reading and DRP data.
Since DRP is an acknowledged measure of reading comprehension, the factor
analysis data support the claim that Star Reading likewise measures reading
comprehension.

Subsequent to the factor analysis, the Star Reading item difficulty parameters
were transformed to the DRP difficulty scale, so that scores on both tests could
be expressed on a common scale. Star Reading scores on that scale were then
calculated using the methods of Item Response Theory. Table 40 below shows
the correlations between Star Reading and DRP reading comprehension scores
overall and by grade.

Table 40: Correlations between Star Reading and DRP Test Scores, Overall
and by Grade
                                 Test Form                          Number of Items
Grade    Sample Size    Star Calibration    DRP                Star    DRP       Correlation
3 273 321 H-9 44 42 0.84
5 424 511 H-7 44 70 0.80
7 353 623 H-6 44 70 0.76
10 314 701 H-2 44 70 0.86
Overall 1,364 0.89
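The IRT scoring step described above, estimating an ability (and hence a score) from item responses once item difficulties are on a common scale, can be sketched for the Rasch model as follows. This is a minimal illustration, not the operational scoring algorithm; the function name and inputs are hypothetical.

```python
import numpy as np

def rasch_ability(responses, difficulties, iterations=20):
    """Maximum-likelihood Rasch ability estimate for one examinee.

    responses: 0/1 item scores; difficulties: item difficulties on the common
    (e.g., DRP-linked) scale. Uses Newton-Raphson on the Rasch log-likelihood;
    assumes the response pattern has at least one right and one wrong answer.
    """
    r = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    theta = 0.0
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))   # Rasch probability of success
        gradient = np.sum(r - p)                 # first derivative of the log-likelihood
        information = np.sum(p * (1.0 - p))      # negative second derivative
        theta += gradient / information
    return theta
```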

In summary, using item factor analysis Yoes (1999) showed that Star
Reading items measure the same underlying construct as the DRP: reading
comprehension. The overall correlation of 0.89 between the DRP and Star
Reading test scores corroborates that. Furthermore, correcting that correlation
coefficient for the effects of less than perfect reliability yields a corrected
correlation of 0.96. Thus, both at the item level and at the test score level, Star
Reading was shown to measure essentially the same construct as the DRP.
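The correction for attenuation referred to above is the standard formula. Writing $r_{xy}$ for the observed correlation and $r_{xx}$, $r_{yy}$ for the two tests' reliabilities (the specific reliability values used are not reported here), the corrected coefficient is

$$r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}$$

so an observed correlation of 0.89 rises to approximately 0.96 when $\sqrt{r_{xx}\,r_{yy}} \approx 0.93$.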

Investigating Oral Reading Fluency and Developing the Estimated Oral Reading Fluency Scale
During the fall of 2007 and winter of 2008, 32 schools across the United States
that were then using both Star Reading and DIBELS oral reading fluency
(DORF) for interim assessments participated in a research study to evaluate the
relationship of Star Reading scores to oral reading fluency. Below are highlights of
the methodology and results of the study.

A single-group design provided data for both evaluation of concurrent validity and the
linking of the two score scales. For the linking analysis, an equipercentile
methodology was used. Analysis was done independently for each of grades
1–4. To evaluate the extent to which the linking accurately approximated student
performance, 90 percent of the sample was used to calibrate the linking model,
and the remaining 10 percent were used for cross-validating the results. The 10
percent were chosen by a simple random function.
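A simplified sketch of an equipercentile linking is shown below. It is not the procedure used in the study (operational linkings typically presmooth the score distributions and handle ties more carefully), and the data are made up, with values loosely based on the grade 2 statistics reported below.

```python
import numpy as np

def equipercentile_link(x_cal, y_cal, x_new):
    """Map scores on scale x to scale y so linked scores share percentile ranks."""
    x_sorted = np.sort(np.asarray(x_cal, dtype=float))
    y_sorted = np.sort(np.asarray(y_cal, dtype=float))
    # Percentile rank of each new x score within the calibration x distribution
    pr = np.searchsorted(x_sorted, x_new, side="right") / len(x_sorted)
    # Score on the y scale that has the same percentile rank
    return np.quantile(y_sorted, np.clip(pr, 0.0, 1.0))

# Hypothetical paired data with a correlation near 0.84 (illustrative only)
rng = np.random.default_rng(0)
z = rng.normal(size=4000)
star = 275 + 125 * z                                          # made-up Scaled Scores
wcpm = 72 + 34 * (0.84 * z + np.sqrt(1 - 0.84**2) * rng.normal(size=4000))
calib = rng.random(4000) < 0.90                               # 90% calibration split
est_wcpm = equipercentile_link(star[calib], wcpm[calib], star[~calib])
```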

The 32 schools in the sample came from 9 states: Alabama, Arizona, California,
Colorado, Delaware, Illinois, Michigan, Tennessee, and Texas. This represented a
broad range of geographic areas, and resulted in a large number of students (N =
12,220). The distribution of students by grade was as follows:
• 1st grade: 2,001
• 2nd grade: 4,522
• 3rd grade: 3,859
• 4th grade: 1,838

The sample was composed of 61 percent students of European ancestry, 21 percent of African
ancestry, and 11 percent of Hispanic ancestry, with the remaining 7 percent of Native
American, Asian, or other ancestry.

Students were individually assessed using the DORF (DIBELS Oral Reading
Fluency) benchmark passages. The students read the three benchmark passages
under standardized conditions. The raw score for each passage was computed as the number of
words read correctly within the one-minute limit (words correct per minute, WCPM). The final
score for each student was the median WCPM across the three benchmark passages, and this was
the score used for
analysis. Each student also took a Star Reading assessment within two weeks of
the DORF assessment.

Descriptive statistics for each grade in the study on Star Reading Scaled Scores and
DORF WCPM (words correctly read per minute) are found in Table 41.

Correlations between the Star Reading Scaled Score and DORF WCPM at
all grades were significant (p < 0.01) and diminished consistently as grades

increased. Figure 5 shows the scatterplot of observed DORF WCPM and Star Reading Scaled
Scores, with the equipercentile linking function overlaid. The equipercentile linking
function appeared linear; however, deviations at the tails of the distribution for higher-
and lower-performing students were observed. The root mean square errors of linking for
grades 1–4 were found to be 14, 19, 22, and 25 WCPM, respectively.

Table 41: Descriptive Statistics and Correlations between Star Reading Scale Scores and
          DIBELS Oral Reading Fluency for the Calibration Sample

                    Star Reading Scale Score         DORF WCPM
Grade    N          Mean          SD                 Mean       SD        Correlation
1 1,794 172.90 98.13 46.05 28.11 0.87
2 4,081 274.49 126.14 72.16 33.71 0.84
3 3,495 372.07 142.95 90.06 33.70 0.78
4 1,645 440.49 150.47 101.43 33.46 0.71

Figure 5: Scatterplot of Observed DORF WCPM and SR Scale Scores for Each
Grade with the Grade Specific Linking Function Overlaid

Cross-Validation Study Results


The 10 percent of students randomly selected from the original sample were used
to provide evidence of the extent to which the models based on the calibration
samples were accurate. The cross-validation sample was intentionally kept out of
the calibration of the linking estimation, and the results of the calibration sample
linking function were then applied to the cross-validation sample.

Table 42 provides descriptive information on the cross-validation sample. Means and
standard deviations for DORF WCPM and Star Reading Scaled Score for
each grade were of a similar magnitude to the calibration sample. Table 43
provides results of the correlation between the observed DORF WCPM scores
and the estimated WCPM from the equipercentile linking. All correlations were
similar to results in the calibration sample. The average differences between the
observed and estimated scores and their standard deviations are reported in
Table 43 along with the results of one sample t-test evaluating the plausibility of
the mean difference being significantly different from zero. At all grades the mean
differences were not significantly different from zero, and standard deviations of
the differences were very similar to the root mean square error of linking from the
calibration study.
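The cross-validation check described above, correlating observed with linking-estimated WCPM and testing whether the mean difference departs from zero, can be outlined as follows; the arrays are placeholders, not study data.

```python
import numpy as np
from scipy import stats

# Placeholder observed and linking-estimated WCPM for a holdout sample
observed = np.array([52.0, 71.0, 88.0, 64.0, 95.0, 58.0, 80.0, 67.0])
estimated = np.array([55.0, 68.0, 90.0, 61.0, 99.0, 60.0, 77.0, 70.0])

corr = np.corrcoef(observed, estimated)[0, 1]     # agreement between the two sets of scores
diff = observed - estimated
t_stat, p_value = stats.ttest_1samp(diff, 0.0)    # H0: the mean difference is zero
rmse = np.sqrt(np.mean(diff ** 2))                # comparable to the RMSE of linking
```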

Table 42: Descriptive Statistics and Correlations between Star Reading Scale
Scores and DIBELS Oral Reading Fluency for the Cross-Validation
Sample

                    Star Reading Scale Score         DORF WCPM
Grade    N          Mean          SD                 Mean       SD
1 205 179.31 100.79 45.61 26.75
2 438 270.04 121.67 71.18 33.02
3 362 357.95 141.28 86.26 33.44
4 190 454.04 143.26 102.37 32.74

Table 43: Correlation between Observed WCPM and Estimated WCPM Along
with the Mean and Standard Deviation of the Differences between
Them

Grade    N       Correlation    Mean Difference    SD of Difference    t-test on Mean Difference
1 205 0.86 –1.62 15.14 t(204) = –1.54, p = 0.13
2 438 0.83 0.23 18.96 t(437) = 0.25, p = 0.80
3 362 0.78 –0.49 22.15 t(361) = –0.43, p = 0.67
4 190 0.74 –1.92 23.06 t(189) = –1.15, p = 0.25

Classification Accuracy of Star Reading


Accuracy for Predicting Proficiency on a State Reading Assessment
Star Reading test scores have been linked statistically to numerous state reading
assessment scores. The linked values have been employed to use Star Reading
to predict student proficiency in reading on those state tests. One example of this
is a linking study conducted using a multi-state sample of students’ scores on the
PARCC consortium assessment2. The table below presents classification accuracy
statistics for grades 3 through 8.

Table 44: Classification diagnostics for predicting students’ reading proficiency on the PARCC
consortium assessment from earlier Star Reading scores

Grade
Measure 3 4 5 6 7 8
Overall classification accuracy 86% 87% 86% 86% 86% 83%
Sensitivity 64% 73% 73% 69% 73% 70%
Specificity 93% 93% 90% 91% 91% 89%
Observed proficiency rate (OPR) 26% 29% 27% 24% 28% 29%
Projected proficiency rate (PPR) 22% 26% 26% 23% 27% 28%
Proficiency status projection error –5% –3% 0% –1% –1% –1%
Area under the ROC curve 0.91 0.93 0.91 0.92 0.92 0.90

As the table shows, classification accuracy ranged from 83 to 87%, depending on grade.
Area Under the Curve (AUC) was at least 0.90 for all grades. Specificity
was especially high, and the projected proficiency rates were very close to the
observed proficiency rates at all grades.

Numerous other reports of linkages between Star Reading and state accountability
tests have been conducted. Reports are available at
[Link]

Accuracy for Identifying At-Risk Students


In many settings, Star Reading is used to identify students considered “at risk”
for reading difficulties requiring intervention, often long in advance of the state
accountability assessment that will be used to classify students at the end of the
school year. This section summarizes two studies done to evaluate the validity of
cut scores based on Star Reading as predictors of “at risk” status later in the school
year. In such cases, correlation coefficients are of less interest than classification
accuracy statistics, such as overall accuracy of classification, sensitivity and
specificity, false positives and false negatives, positive and negative predictive
power, receiver operating characteristic (ROC) curves, and a summary statistic
called AUC (Area Under the Curve).3 Summaries of the methodology and results of
the two studies are given below.

2. Renaissance Learning (2016). Relating Star Reading™ and Star Math™ to the Colorado
Measure of Academic Success (CMAS) (PARCC Assessments) performance.
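These statistics can all be derived from a 2×2 classification table. The sketch below is not Renaissance's code; it assumes lower Star Reading Scaled Scores indicate greater risk and computes overall accuracy, sensitivity, specificity, and AUC via the rank-sum identity.

```python
import numpy as np
from scipy.stats import rankdata

def screening_diagnostics(scores, poor_outcome, cut):
    """Classification accuracy statistics for a screener.

    scores:       screener scores (lower = more risk); scores <= cut are flagged "at risk"
    poor_outcome: 1 if the student later showed the poor outcome (e.g., not proficient)
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(poor_outcome)
    flagged = scores <= cut
    tp = np.sum(flagged & (y == 1))              # true positives
    fn = np.sum(~flagged & (y == 1))             # false negatives
    tn = np.sum(~flagged & (y == 0))             # true negatives
    fp = np.sum(flagged & (y == 0))              # false positives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(scores)
    # AUC via the Mann-Whitney identity: the probability that a randomly chosen
    # at-risk student scores below a randomly chosen not-at-risk student
    ranks = rankdata(scores)
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    auc = (np.sum(ranks[y == 0]) - n0 * (n0 + 1) / 2) / (n0 * n1)
    return accuracy, sensitivity, specificity, auc
```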

Brief Description of the Current Sample and Procedure

Initial Star Reading classification analyses were performed using state assessment
data from Arkansas, Delaware, Illinois, Michigan, Mississippi, and Kansas.
Collectively these states cover most regions of the country (Central, Southwest,
Northeast, Midwest, and Southeast). Both the Classification Accuracy and Cross
Validation study samples were drawn from an initial pool of 79,045 matched
student records covering grades 2–11.

A secondary analysis using data from a single state assessment was then
performed. The sample used for this analysis was 42,771 matched Star Reading
and South Dakota Test of Education Progress records of students in grades 3–8.

An ROC analysis was used to compare the performance data on Star Reading
to performance data on state achievement tests, with “at risk” identification as
the criterion. The Star Reading Scaled Scores used for analysis originated from
assessments 3–11 months before the state achievement tests were administered.
Selection of cut scores was based on the graph of sensitivity and specificity versus
the Scaled Score. For each grade, the Scaled Score chosen as the cut point was
equal to the score where sensitivity and specificity intersected. The classification
analyses, cut points and outcome measures are outlined in Table 45. Area Under
the Curve (AUC) values were all greater than 0.80. Descriptive notes for other
values represented in the table are provided in the table footnote.

3. For descriptions of ROC curves, AUC, and related classification accuracy statistics,
refer to Pepe, Janes, Longton, Leisenring, & Newcomb (2004) and Zhou, Obuchowski, &
McClish (2002).
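The cut-point rule described above, choosing the Scaled Score at which sensitivity and specificity intersect, can be sketched as a simple search over candidate cut scores. As with the previous example, this assumes lower scores indicate greater risk and is illustrative only.

```python
import numpy as np

def balanced_cut(scores, poor_outcome):
    """Return the screener cut score where sensitivity and specificity are closest."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(poor_outcome)
    best_cut, best_gap = None, np.inf
    for cut in np.unique(scores):
        flagged = scores <= cut
        sensitivity = np.mean(flagged[y == 1])     # true positive rate at this cut
        specificity = np.mean(~flagged[y == 0])    # true negative rate at this cut
        if abs(sensitivity - specificity) < best_gap:
            best_cut, best_gap = cut, abs(sensitivity - specificity)
    return best_cut
```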

Table 45: Classification Accuracy in Predicting Proficiency on State Achievement Tests in Seven Statesᵃ

                                     Initial Analysis      Secondary Analysis
Statisticᵇ                           Value                 Value
False Positive Rate                  21%                   18%
False Negative Rate                  23%                   22%
Sensitivity                          76%                   78%
Specificity                          76%                   82%
Overall Classification Rate          76%                   81%

AUC (ROC)                            Grade    AUC          Grade    AUC
                                     2        0.816
                                     3        0.839        3        0.869
                                     4        0.850        4        0.882
                                     5        0.841        5        0.881
                                     6        0.833        6        0.883
                                     7        0.829        7        0.896
                                     8        0.843        8        0.879
                                     9        0.847
                                     10       0.858
                                     11       0.840
a. Arkansas, Delaware, Illinois, Kansas, Michigan, Mississippi, and South Dakota.
b. The false positive rate is equal to the proportion of students incorrectly labeled “at-risk.” The
false negative rate is equal to the proportion of students incorrectly labeled not “at-risk.”
Likewise, sensitivity refers to the proportion of correct positive predictions while specificity refers
to the proportion of negatives that are correctly identified (e.g., student will not meet a particular
cut score).

Disaggregated Validity and Classification Data


In some cases, there is a need to verify that tests such as Star Reading are valid for
different demographic groups. For that purpose, the data must be disaggregated and separate
analyses performed for each group. Table 46 shows the disaggregated classification accuracy
data for ethnic subgroups, along with the disaggregated validity data.

Table 46: Disaggregated Classification and Validity Data

Classification Accuracy in Predicting Proficiency on State Achievement Tests in 6 States
(Arkansas, Delaware, Illinois, Kansas, Michigan, and Mississippi), by Race/Ethnicity

                               White,          Black,                          Asian/Pacific    American Indian/
                               non-Hispanic    non-Hispanic    Hispanic        Islander         Alaska Native
                               (n = 17,567)    (n = 8,962)     (n = 1,382)     (n = 231)        (n = 111)
False Positive Rate            31%             44%             36%             17%              12%
False Negative Rate            38%             12%             12%             24%              41%
Sensitivity                    62%             88%             88%             76%              59%
Specificity                    87%             56%             64%             83%              88%
Overall Classification Rate    81%             67%             73%             82%              78%
AUC (ROC)
Grade AUC Grade AUC Grade AUC Grade AUC Grade AUC
2 n/a 2 0.500 2 n/a 2 n/a 2 n/a
3 0.863 3 0.828 3 0.868 3 0.913 3 0.697
4 0.862 4 0.823 4 0.837 4 0.869 4 0.888
5 0.853 5 0.832 5 0.839 5 0.855 5 0.919
6 0.849 6 0.806 6 0.825 6 0.859 6 0.846
7 0.816 7 0.784 7 0.866 7 0.904 7 0.900
8 0.850 8 0.827 8 0.812 8 0.961 8 1.000
9 1.000 9 0.848 9 n/a 9 n/a 9 n/a
10 0.875 10 0.831 10 0.833 10 n/a 10 n/a
11 0.750 11 1.000 11 n/a 11 n/a 11 n/a

Evidence of Technical Accuracy for Informing Screening and Progress Monitoring Decisions
Many school districts use tiered models such as Response to Intervention (RTI)
or Multi-Tiered Systems of Support (MTSS) to guide instructional decision making
and improve outcomes for students. These models represent a more proactive,
data-driven approach for better serving students as compared with prior decision-
making practices, including processes to:
• Screen all students to understand where each is in the progression of learning in
reading, math, or other disciplines
• Identify at-risk students for intervention at the earliest possible moment
• Intervene early for students who are struggling or otherwise at-risk of falling
behind; and
• Monitor student progress in order to make decisions as to whether they are
responding adequately to the instruction/intervention

Assessment data are central to both screening and progress monitoring, and
Star Reading is widely used for both purposes. This section includes technical
information about Star Reading’s ability to accurately screen students according
to risk and to help educators make progress monitoring decisions. Much of this
information has been submitted to and reviewed by the Center on Response
to Intervention [Link] and/or the National Center on Intensive
Intervention [Link], two technical assistance groups
funded by the US Department of Education.

For several years running, Star Reading has enjoyed favorable technical reviews
for its use in informing screening and progress monitoring decisions by the CRTI
and NCII, respectively. The most recent reviews by CRTI indicate that Star
Reading has a “convincing” level of evidence (the highest rating awarded) in the
core screening categories, including classification accuracy, reliability, and validity.
CRTI also notes that the extent of the technical evidence is “Broad” (again, the
highest rating awarded) and notes that not only is the overall evidence compelling,
but there are disaggregated data as well that shows Star Reading works equally
well among subgroups. The most recent reviews by NCII indicate that there is
full “convincing” evidence of Star Reading’s psychometric quality for progress
monitoring purposes, including reliability, validity, reliability of the slope, and
validity of the slope. Furthermore, they find fully “convincing” evidence that Star
Reading is sufficiently sensitive to student growth, has adequate alternate forms,
and provides data-based guidance to educators on end-of-year benchmarks and
when an intervention should be changed, among other categories. Readers may
find additional information on Star Reading on those sites and should note that the
reviews are updated on a regular basis, as their review standards are adjusted and
new technical evidence for Star Reading and other assessments is evaluated.

Screening
According to the Center on Response to Intervention, “Screening is conducted
to identify or predict students who may be at risk for poor learning outcomes.
Universal screening assessments are typically brief, conducted with all students at
a grade level, and followed by additional testing or short-term progress monitoring
to corroborate students’ risk status.”4

Most commonly, screening is conducted with all students at the beginning of the
year and then another two to four times throughout the school year. Star Reading
is widely used for this purpose. In this section, the technical evidence supporting
its use to inform screening decisions is summarized.

4. [Link]

Organizations of RTI/MTSS experts such as the Center on Response to Intervention and the
RTI Action Network5 are generally consistent in how
measurement tools should be evaluated for their appropriateness as screeners.
Key categories include the following:

1. Validity and reliability. Data on Star Reading’s reliability were presented in the
“Reliability and Measurement Precision” chapter of this manual. A wide array
of validity evidence has been presented in this chapter, above; detailed tables
of correlational data can be found in “Appendix B: Detailed Evidence of Star
Reading Validity”.

2. Practicality and efficiency. Screening measures should not require much teacher or
student time. Because most students can complete a Star Reading
test in 15–20 minutes or less, and because it is group administered and scored
automatically, Star Reading is an exceptionally efficient general outcomes
measure for reading.

3. Classification accuracy metrics including sensitivity, specificity, and overall
predictive accuracy. These are arguably the most important indicators,
addressing the main purpose of screening: When a brief screening tool
indicates a student either is or is not at risk of later reading difficulties, how
often is it accurate, and what types of errors are made?

It is common to use high-stakes indicators such as state summative assessments as
criterion measures for classification accuracy evaluation. Star Reading is linked
to virtually every state summative assessment in the US as well as the ACT and
SAT college entrance exams. The statistical linking of the Star Reading scale
with these other measures’ scales, combined with Star Reading growth norms
(discussed in the Norming chapter of this manual) empowers Star Reading reports
and data extracts to make predictions throughout the school year about future
student performance. These predictions inform educator screening decisions in
schools using an RTI/MTSS framework. (Educators are also free to use norm-
referenced scores such as Percentile Ranks to inform screening decisions.)

Star Reading’s classification accuracy results from several recent predictive studies are
summarized in Table 47. Each study evaluated the extent to which
Star Reading accurately predicted whether a student achieved a specific
performance level on another reading or English Language Arts measure. The
specific performance level (cut point) varies by assessment and grade. Cut points
are set by assessment developers and sponsors, which in the case of state
summative exams usually means the state department of education and/or state
board of education. State assessments generally have between three and five

5. [Link]

performance levels, and the cut point used in these analyses refers to the level the
state has determined indicates meeting grade level reading or English Language
Arts standards. For instance, the cut point on California’s CAASPP is Level 3,
also known as “Standard Met.” On Louisiana’s LEAP 2025 the cut point is at the
“Mastery” level. In the case of ACT and SAT, the cut point established by the
developers (ACT and College Board, respectively) indicates an estimated level of
readiness for success in college.

Table 47: Summary of classification accuracy metrics from recent studies linking Star Reading with
summative reading and English Language Arts measures

Average result across all grades

Assessment    Grade/s Covered    Date Study Completed    Study Sample Size    Overall Classification Accuracy    Sensitivity    Specificity    Area Under ROC Curve
ACT English (college 11 4/22/2016 14,248 80% 76% 82% 0.87
readiness)
ACT Reading 11 4/22/2016 14,228 83% 62% 90% 0.86
(college readiness)
ACT Aspire 3–10 6/1/2017 44,877 84% 81% 84% 0.92
California 3–8 10/30/2015 51,835 84% 86% 82% 0.92
Assessment of
Student Performance
and Progress
(CAASPP) (Smarter
Balanced)
Florida Standards 3–8 6/30/2015 41,178 84% 84% 83% 0.92
Assessments (FSA)
Georgia Milestones 3–8 7/1/2017 44,436 87% 79% 90% 0.94
Illinois Partnership 3–10 7/13/2016 27,415 86% 70% 91% 0.91
for Assessment
of Readiness
for College and
Careers (PARCC)
Assessments
Louisiana Educational 3–8 12/1/2017 33,815 84% 90% 69% 0.90
Assessment Program
(LEAP 2025)
Maine Educational 3–8 7/1/2017 945 83% 78% 86% 0.93
Assessment (MEA)
Mississippi Academic 3–8 2/1/2017 13,590 84% 80% 87% 0.92
Assessment Program
(MAAP)

Missouri Assessment 3–8 3/14/2017 30,626 85% 83% 87% 0.96


Program (MAP)
Grade-Level Tests

North Carolina 3–8 2/16/2015 396,075 81% 83% 78% 0.89
READY End-of-
Grade (EOG)
Ohio State Tests 3–8 12/20/2016 27,487 85% 83% 87% 0.93
Pennsylvania’s 3–8 12/19/2016 7,383 85% 91% 72% 0.92
System of School
Assessment (PSSA)
SAT (college 11
entrance)
South Carolina 3–8 12/5/2016 10,011 86% 85% 86% 0.94
College-and Career-
Ready Assessments
(SC READY)
State of Texas 3–8 7/1/2017 3,915 83% 71% 88% 0.90
Assessments of
Academic Readiness
(STAAR)
Wisconsin Forward 3–8 12/22/2016 39,605 88% 73% 93% 0.94
Exam
Notes:
• Some tests, such as the Smarter Balanced (indicated above for California) and PARCC (indicated above for Illinois) are
used in multiple states, so those results may apply to other states not listed here.
• Overall classification accuracy refers to the percentage of correct classifications.
• Sensitivity refers to the rate at which Star Reading identifies students as being at-risk who demonstrate a poor learning
outcome at a later point in time. Sensitivity can be thought of as the true positive rate. Screening tools with high sensitivity
help ensure that students who truly need intervention will be identified to receive it.
• Specificity refers to the rate at which Star Reading identifies students as being not at-risk who perform satisfactorily at a
later point in time. Specificity can be thought of as a true negative rate. Screening tools with high specificity help ensure
that scarce resources are not invested in students who do not require extra assistance.
• Area under the ROC (Receiver Operating Characteristic) curve is a powerful indicator of overall accuracy. The ROC curve is
a plot of the true positive rate (sensitivity) against the false positive rate (1-specificity) for the full range of possible screener
(Star Reading) cut points. The area under ROC Curve (AUC) is an overall indication of the diagnostic accuracy of the
curve. AUC values range between 0 and 1 with 0.5 indicating a chance level of accuracy. The Center for Response to
Intervention considers results at or above 0.85 to be an indication of convincing evidence of classification accuracy.6

Note that many states tend not to use the same assessment system for more than
a few consecutive years, and Renaissance endeavors to keep the Star Reading
classification reporting as up to date as possible. Those interested in reviewing the
full technical reports for these or other state assessments are encouraged to visit

6. [Link]
rating-system

[Link] and search by state name for the Star Reading linking reports (e.g., “Wisconsin linking”).

Progress Monitoring
According to the National Center on Intensive Intervention, “progress monitoring is
used to assess a student’s performance, to quantify his or her rate of improvement
or responsiveness to intervention, to adjust the student’s instructional program
to make it more effective and suited to the student’s needs, and to evaluate the
effectiveness of the intervention.”7

In an RTI/MTSS context, progress monitoring involves frequent assessment—usually occurring
once every 1–4 weeks—and often involves only those students
who are receiving additional instruction after being identified as at-risk via
the screening process. Ultimately, educators use progress monitoring data to
determine whether a student is responding adequately to the instruction, or
whether adjustments need to be made to the instructional intensity or methods.
The idea is to get to a decision quickly, with as little testing as possible, so that
valuable time is not wasted on ineffective approaches. Educators make these
decisions by comparing the student’s performance against a goal set by the educator.
Goals should be “reasonable yet ambitious”8 as recommended by Shapiro (2008),
and Star Reading offers educators a variety of guidance to set normative or
criterion-referenced goals that meet these criteria.

The RTI Action Network, National Center on Intensive Intervention, and other
organizations offering technical assistance to schools implementing RTI/MTSS
models are generally consistent in encouraging educators to select assessments
for progress monitoring that have certain characteristics.

7. [Link]
8. Shapiro, E. S. (2008). Best practices in setting progress-monitoring goals for academic skill
improvement. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology V (pp.
141-157). Bethesda, MD: National Association of School Psychologists.

A summary of those characteristics and relevant information about Star Reading is provided below.

1. Evidence of psychometric quality.

a. Reliability and validity. Summaries of the available evidence supporting Star
Reading’s reliability and validity are presented in the chapter on “Reliability and
Measurement Precision” and throughout this Validity chapter.
b. Reliability of the slope. Because progress monitoring decisions
often involve the student’s rate of progress over multiple test
administrations, the characteristics of the student’s slope of
improvement, or trend line, are also important. A study was conducted
in 2017 by Renaissance Learning to evaluate reliability of slope for at-
risk students who were being progress monitored during the 2016–17
school year. Specifically, the sample included 218,689 students who
began the year below the 30th Percentile Rank in Star Reading
and were assessed 10 or more times during the school year, with a
minimum of 140 days between first and last test.
Every student’s Star Reading test records were sorted in chronological order. Each
test record was coded as either an odd- or even-numbered test. Slopes were estimated
separately for each student’s odd-numbered tests and even-numbered tests using
ordinary least squares regression. Then, the odd and even slopes were correlated.
The table below summarizes the Pearson correlation coefficients by grade, indicating
a consistently strong association between even- and odd-numbered test slopes.

Table 48: Star Reading Reliability of the Slope Coefficients by Grade, 1–12

Grade n Coefficient
1 14,179 0.76
2 43,978 0.93
3 52,670 0.94
4 37,862 0.93
5 31,326 0.93
6 16,990 0.94
7 9,683 0.94
8 7,786 0.94
9 2,483 0.94
10 1,549 0.94
11 799 0.94
12 384 0.95
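A sketch of the odd/even split-half procedure described above follows. It is not the code used in the 2017 study; the data structure and function name are assumptions.

```python
import numpy as np

def slope_reliability(students):
    """Split-half (odd/even) reliability of progress-monitoring slopes.

    students: iterable of (days, scores) pairs, one per student, where `days`
    are days since the first test and `scores` are Scaled Scores in
    chronological order (each student needs at least four tests here).
    """
    odd_slopes, even_slopes = [], []
    for days, scores in students:
        days = np.asarray(days, dtype=float)
        scores = np.asarray(scores, dtype=float)
        order = np.arange(len(scores))
        for parity, out in ((0, odd_slopes), (1, even_slopes)):   # 1st, 3rd, ... vs 2nd, 4th, ...
            mask = order % 2 == parity
            slope, _ = np.polyfit(days[mask], scores[mask], 1)    # OLS slope for this half
            out.append(slope)
    return np.corrcoef(odd_slopes, even_slopes)[0, 1]             # Pearson r between halves
```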

2. Produce a sufficient number of forms. Because Star Reading is computer adaptive and
its item bank comprises more than six thousand items, there are, at a minimum, several
hundred alternate forms for a student at a given ability level. This should be more than
sufficient for even the most aggressive progress monitoring testing schedule.

A variety of grade-specific evidence is available to demonstrate the extent to which Star
Reading can reliably produce consistent scores across repeated administrations of the same
or similar tests to the same individual or group. These include:

a. Generic reliability, defined as the proportion of test score variance that is
attributable to true variation in the trait or construct the test measures.

b. Alternate forms reliability, defined as the correlation between test scores on
repeated administrations to the same examinees.

Grade-level results are summarized in the “Reliability and Measurement Precision” chapter.

3. Practicality and efficiency. As mentioned above, most students complete Star Reading in
15–20 minutes. It is auto-scored and can be group administered, requiring very little
educator involvement, making it an efficient progress monitoring solution.

4. Specify criteria for adequate growth and benchmarks for end-of-year performance levels.
Goal-setting decisions are handled by local educators, who know their students best and are
familiar with the efficacy and intensity of the instructional supports that will be offered.
That said, publishers of assessments used for progress monitoring are expected to provide
empirically based guidance to educators on setting goals.

Star Reading provides guidance to inform goal setting using a number of different metrics,
including the following:

a. Student Growth Percentile. SGP describes a student’s velocity (slope) relative to a
national sample of academic peers—those
students in the same grade with a similar score history. SGPs work
like Percentile Ranks (1–99 scale) but once an SGP goal has been
set, it is converted to a Scaled Score goal at the end date specified
by the teacher. An SGP-defined goal can be converted into an
average weekly increase in a Scaled Score metric, if educators prefer
to use that. Many teachers select either SGP 50 (indicating typical
or expected growth) as minimum acceptable growth, or something
indicating accelerated growth, such as 65 or 75. A helpful feature of
SGP is that it can be used as a “reality check” for any goal, whether it
be in an SGP metric or something else (e.g., Scaled Score, Percentile
Rank). SGP estimates the likelihood that the student will achieve a
level of growth or later performance. For example, a goal associated

Star Assessments™ for Reading


Technical Manual 91
Validity
Differential Item Functioning

with an SGP of 75 indicates that only about 25 percent of the student’s academic peers
would be expected to achieve that level of growth.

b. State test proficiency. As described in the Screening section, the fact that Star
Reading is linked to virtually every state assessment enables
educators to select values on the Star scale that are approximately
equivalent to states’ defined proficiency level cut points for each
grade.

c. Percentile Rank and Scaled Score. Educators may also enter custom goals using
Percentile Rank or Scaled Score metrics.

Additional Research on Star Reading as a Progress Monitoring Tool

A study by Bulut & Cormier (2018) evaluated Star Reading as a progress monitoring tool,
concluding:
• Although relatively little research exists on using computer-adaptive measures for
progress monitoring as opposed to curriculum-based measurement probes, the study concluded
it was possible to use Star Reading for progress monitoring purposes.
• Sufficiently reliable progress monitoring slopes could be generated in as few as five
Star Reading administrations.
• The duration of Star Reading progress monitoring (i.e., over how many weeks it should be
conducted) is a function of the amount of typical growth by grade in relation to
measurement error. For earlier grades (when student rates of growth are greatest), that
amount of time could be as little as six weeks. For middle grades, 20 weeks should be
sufficient.
• These two findings challenge popular rules of thumb about progress monitoring frequency
and duration (most of which are derived from CBM probe studies), which often involve weekly
testing over periods of time that are selected due to popular convention rather than
empirical evidence.
• Using Theil-Sen regression procedures to estimate slope as opposed to OLS could reduce
the influence of outlier scores, and thus provide a more accurate picture of student growth.
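The robustness point in the last bullet can be illustrated with a small comparison of OLS and Theil-Sen slopes on a hypothetical progress-monitoring record containing one aberrant score (SciPy's scipy.stats.theilslopes provides an equivalent estimator).

```python
import numpy as np

def theil_sen_slope(days, scores):
    """Median of all pairwise slopes; resistant to occasional outlying scores."""
    days = np.asarray(days, dtype=float)
    scores = np.asarray(scores, dtype=float)
    slopes = [(scores[j] - scores[i]) / (days[j] - days[i])
              for i in range(len(days)) for j in range(i + 1, len(days))
              if days[j] != days[i]]
    return np.median(slopes)

# Hypothetical record: roughly ten weekly tests with one aberrant score at week 5
days = np.arange(0, 70, 7)
scores = np.array([510, 514, 519, 523, 450, 530, 534, 538, 542, 547])
ols_slope = np.polyfit(days, scores, 1)[0]       # distorted by the aberrant score
robust_slope = theil_sen_slope(days, scores)     # closer to the underlying weekly trend
```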

Differential Item Functioning


Ensuring that an assessment is not biased against different demographic
subgroups that take the assessment is a fundamental aspect of showing test
fairness and providing validity evidence to support the interpretations and uses of
the assessment. One strategy that is often used as part of evaluating test fairness
is a strategy known as differential item functioning (DIF). DIF occurs when two or

more demographic subgroups perform differently on an item after controlling for performance
on the test (Holland & Thayer, 1988; Zumbo, 2007). In other words,
for students of similar ability, an item that displays DIF may appear to favor one
group of students based on demographics such as gender and/or race/ethnicity.

There are many different methods that one can use to investigate items for DIF,
including item response theory methods, observed score methods, and a variety
of nonparametric approaches (Zumbo, 2007). The method that Star Reading
uses to evaluate items for DIF is a method known as logistic regression (Rogers
& Swaminathan, 1993; Swaminathan & Rogers, 1990; Swaminathan, 1994). With
this approach, student item responses are regressed on student ability estimates
from Star Reading as well as their subgroup membership and the student ability
and subgroup membership interaction. To conduct a DIF analysis, a reference
group and a focal group are defined. For instance, male is the reference group for
gender while female is the focal group. Similarly, Caucasian is the reference group
for race/ethnicity with the minority race/ethnic groups being focal groups. Separate
models are run for DIF for male versus female, black versus white, Hispanic
versus white, Asian versus white, and Native American versus white.

Items are flagged for DIF using a blended approach that employs a chi-square test of
statistical significance to determine whether DIF is present and then assesses whether any
evidence of DIF is practically significant using the Nagelkerke R² statistic (Nagelkerke,
1991), a common effect size measure used in DIF investigations with logistic regression
(Jodoin & Gierl, 2001). Using the Nagelkerke R² statistic, items are categorized as
exhibiting negligible DIF if the null hypothesis is not rejected or the R² statistic is less
than 0.035, moderate DIF if the null hypothesis is rejected and the R² statistic is greater
than or equal to 0.035 and less than 0.070, or large DIF if the null hypothesis is rejected
and the R² statistic is greater than or equal to 0.070 (Jodoin & Gierl, 2001).
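A sketch of the logistic-regression DIF screen for a single item is shown below. It is not Renaissance's implementation; it follows the general Swaminathan and Rogers approach of comparing a baseline model (ability only) with a model adding group membership and the ability-by-group interaction, and uses the change in Nagelkerke R² between the two models as the effect size, as in Jodoin and Gierl (2001). Function and variable names are assumptions.

```python
import numpy as np
import statsmodels.api as sm

def dif_logistic(item_correct, ability, focal):
    """Logistic-regression DIF check for one item.

    item_correct: 0/1 responses to the item; ability: test ability estimates;
    focal: 1 for the focal group (e.g., female), 0 for the reference group.
    Returns the 2-df likelihood-ratio chi-square and the change in Nagelkerke R^2.
    """
    y = np.asarray(item_correct, dtype=float)
    theta = np.asarray(ability, dtype=float)
    g = np.asarray(focal, dtype=float)

    base = sm.Logit(y, sm.add_constant(theta)).fit(disp=0)               # ability only
    full_x = sm.add_constant(np.column_stack([theta, g, theta * g]))     # + group, interaction
    full = sm.Logit(y, full_x).fit(disp=0)
    null = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)                 # intercept only

    chi2 = 2 * (full.llf - base.llf)                                     # LR test, df = 2
    n = len(y)

    def nagelkerke(llf):                                                 # Nagelkerke (1991) pseudo R^2
        cox_snell = 1 - np.exp(2 * (null.llf - llf) / n)
        return cox_snell / (1 - np.exp(2 * null.llf / n))

    delta_r2 = nagelkerke(full.llf) - nagelkerke(base.llf)
    return chi2, delta_r2   # flag if chi2 is significant and delta_r2 >= 0.035
```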

There are a couple of points in the Star Reading assessment development cycle
when items are evaluated for DIF. The first time point is when an item is included
as a field test item as part of Star Reading’s item calibration process. During
item calibration, new assessment items are tried out with different groups of
students to make sure that items have appropriate statistical and psychometric
properties before they are used operationally and count towards a student’s
score. The second time point is when the full item bank of operational test
items is recalibrated for scale maintenance to check whether the statistical and
psychometric properties of the items have remained similar after the items become
operational.

It is important to point out that just because an item is flagged for DIF against one
or more subgroups does not necessarily mean that the item is biased. There are
many possible explanations why an item may be statistically flagged for DIF. All

items that are statistically flagged as having non-negligible DIF are marked for a
bias and sensitivity review by the Content team. This review process consists of
several subject matter experts with diverse perspectives and different backgrounds
looking at and reviewing each item to see if there is any content in the item that
may be biased against a particular subgroup and might explain why the item was
statistically flagged for DIF. Items identified as being biased for any reason are
removed from the item bank and do not appear on the Star Reading test. The
statistical flagging of items for DIF as well as the bias and sensitivity review by the
Content team helps ensure test fairness and that the items that appear on Star
Reading do not favor any group of students that may take the test.

As shown in Figure 6, only 2% of the approximately 6,000 items in the Star Reading item
bank showed any evidence of DIF when Star Reading was recalibrated in 2021.

Figure 6: Summary of Star Reading Items with DIF

Table 49 shows the DIF results by reference and focal groups from the various DIF analyses.
These results suggest that, of the thousands of items analyzed, very few were flagged for
DIF: 1.00% of items were categorized with non-negligible DIF for female versus male, 0.04%
for Asian versus white, 0.34% for black versus white, 0.94% for Hispanic versus white, and
0.00% for Native American versus white. These results provide evidence of the fairness of
the Star Reading test for the different demographic subgroups that take the assessment. As
previously noted, items found to be biased in the review process are removed from
operational use.


Table 49: Percentage of Items for Different DIF Analyses


DIF Analysis Percent of Items Showing DIF
Female versus Male 1.00%
Asian versus White 0.04%
Black versus White 0.34%
Hispanic versus White 0.94%
Native American versus White 0.00%

Summary of Star Reading Validity Evidence


The validity data presented in this technical manual include evidence of Star Reading’s
concurrent, predictive, and construct validity, as well as classification accuracy
statistics and strong measures of association with non-traditional
reading measures such as oral reading fluency. Exploratory and confirmatory
factor analyses provided evidence that Star Reading measures a unidimensional
construct, consistent with the assumption underlying its use of the Rasch model.
The Meta-Analysis section showed the average uncorrected correlation between
Star Reading and all other reading tests to be 0.79. (Many meta-analyses adjust
the correlations for range restriction and for attenuation due to less-than-perfect reliability;
had we done that here, the average correlation would have exceeded 0.85.)
Correlations with specific measures of reading ability were often higher than this
average. For example, Yoes (1999) found within-grade correlations with DRP
averaging 0.81. When these data were combined across grades, the correlation
was 0.89. The latter correlation may be interpreted as an estimate of the overall
construct validity of Star Reading as a measure of reading comprehension. Yoes
also reported that results of item factor analysis of DRP and Star Reading items
yielded a single common unidimensional factor. This provides strong support for
the claim that Star Reading is a measure of reading comprehension.

International data from the UK show even stronger correlations between Star
Reading and widely used reading measures there: overall correlations of 0.91
with the Suffolk Reading Scale, median within-form correlations of 0.84, and a
correlation of 0.85 with teacher assessments of student reading.

Finally, the data showing the relationship between the current, standards-based
Star Reading Enterprise test and scores on specific state accountability tests
and on the SBAC and PARCC Common Core consortium tests show that the
correlations with these summative measures are consistent with the meta-analysis
findings.

Norming

Two distinct kinds of norms are described in this chapter: test score norms and
growth norms. The former refers to distributions of test scores themselves. The
latter refers to distributions of changes in test scores over time; such changes are
generally attributed to growth in the attribute that is measured by a test. Hence
distributions of score changes over time may be called “growth norms.”

Background
National norms for Star Reading version 1 were first collected in 1996. Substantial
changes introduced in Star Reading version 2 necessitated the development of
new norms in 1999. Those norms were used until new norms were developed
in 2008. Since 2008, Star Reading norms have been updated four times (2014,
2017, 2022, and 2024). The 2024 norms went live in Star Reading in the 2024–
2025 school year. This chapter describes the development of the 2024 norms.

From 1996 through mid-2011, Star Reading was primarily a measure of reading
comprehension comprising short vocabulary-in-context items and longer passage
comprehension items. The current version of Star Reading, introduced in June
2011, is a standards-based assessment that measures a wide variety of skills and
instructional standards, as well as reading comprehension. The 2024 norms are
based on the current standards-based version of Star Reading.

The 2024 Star Reading Norms


Prior to development of the 2024 Star Reading norms, a new reporting scale was
developed, called the Unified scale. The Unified scale is a linear transformation of
Star Reading’s Rasch ability scale to a new integer scale that is also used in Star
Early Literacy. The Star Unified scale makes it possible to report performance on
both of those Star assessments on the same scale.

New U.S. norms for Star Reading assessments were introduced at the start of the
2017–18 school year. Separate early fall and late spring norms were developed for
grades Kindergarten through 12. Before the introduction of the 2017 Star Reading
norms, the reference populations for grades Kindergarten through 3 consisted
only of students taking Star Reading; students who only took Star Early Literacy
were excluded from the Star Reading norms, and vice versa. Consequently,
previous Star Reading and Star Early Literacy norms for this grade range were
not completely representative of the full range of literacy development in those
grades. To address this, the concept of “Star Early Learning” was introduced. That
concept acknowledges the overlap of literacy development content between the
Star Reading and Early Literacy assessments, and encompasses in the normative
reference group all students in each of grades K–3 who have taken either the
Reading assessment, the Early Literacy assessment, or both.

The norms introduced in 2024 are based on test scores of K–12 students taking
either the Reading assessment, or the Early Literacy one, or both. These norms
are based on the use of the Unified scale, which allowed performance on both Star
Early Literacy and Star Reading to be measured on the same scale. Norms for
students in pre-K are based on students taking Star Early Literacy in this grade.
Pre-K norms are not available for Star Reading because students do not typically
take Star Reading in this grade.

The 2024 Star Reading norms are based on Star Reading and Star Early Literacy
test data collected over the course of the 2022–2023 school year. Students
participating in the norming study took assessments between August 1, 2022,
and June 30, 2023. Students took the Star Reading or Early Literacy tests under
normal test administration conditions. No specific norming test was developed
and no deviations were made from the usual test administration. Thus, students
in the norming sample took Star Reading or Star Early Literacy tests as they are
administered in everyday use.

Sample Characteristics
During the norming period, a total of 1,837,495 US students in grades Pre-K–3
took Star Early Literacy, while 8,878,035 US students in grades K–12 took Star
Reading tests on servers hosted by Renaissance Learning. The
first step in sampling was to select representative fall and spring student samples:
students who had tested in the fall, in the spring, or in both the fall and spring of
the 2022–2023 school year.

Because these norming data were convenience samples drawn from the Star
Reading and Star Early Literacy customer base, steps were taken to ensure
the resulting norms were nationally representative of grades K–12 US student
populations with regard to certain important characteristics. A post-stratification
procedure was used to adjust the sample’s proportions to the approximate national
proportions on three key variables: geographic region, district socio-economic
status, and district/school size. These three variables were chosen because they
had previously been used in Star norming studies to draw nationally representative
samples, are known to be related to test scores, and were readily available for the
schools in the Renaissance hosted database.

The final norming sample size for grades K–12, after selecting only students with
test scores in either the fall or the spring or both fall and spring in the norming year,
was 6,405,124 students. There were 4,687,579 students in the fall norming sample
and 4,282,841 students in the spring norming sample; 2,565,296 students were
included in both norming samples. These students came from 22,978 schools
across the 50 states and the District of Columbia.

Table 50 and Table 51 provide a breakdown of the number of students participating
per grade in the fall and in the spring, respectively.

Table 50: Numbers of Students per Grade in the Fall Norms Sample

Grade N Grade N Grade N Grade N
K 255,676 4 523,369 8 437,350 12 104,390
1 383,509 5 535,258 9 239,576 Total 4,687,579
2 464,467 6 469,788 10 199,291
3 506,667 7 424,219 11 144,019

Table 51: Numbers of Students per Grade in the Spring Norms Sample

Grade N Grade N Grade N Grade N
K 364,368 4 445,880 8 341,036 12 51,831
1 464,805 5 433,896 9 199,448 Total 4,282,841
2 520,221 6 384,418 10 163,670
3 462,500 7 344,065 11 106,703

National estimates of US student population characteristics were obtained from
two entities: the National Center for Educational Statistics (NCES) and Agile
Education Marketing Data.
X National population estimates of students’ demographics (ethnicity and gender)
in grades K–12 were obtained from NCES; these estimates were from the
2019–2020 school year for private schools and the 2022–2023 school year
for public schools, the most recent data available. National estimates of race/
ethnicity were computed using the NCES data based on single race/ethnicity
and also a multiple race category. The NCES data reflect the most recent
census data from the US Census Bureau.
X National estimates of school-related characteristics were obtained from NCES
and Agile Education Marketing data. The Agile database contains the most
recent data on schools, some of which may not be reflected in the NCES data.

Table 52 shows national percentages of children in grades K–12 by region, school/
district enrollment, district socio-economic status, location, and school type (public
versus non-public), along with the corresponding percentages in the fall and in the
spring norming samples. Estimates of geographic region were based on the four
broad areas identified by the National Educational Association as Northeastern,
Midwestern, Southeastern, and Western regions. The specific states in each
region are shown below.

Geographic region
Using the categories established by the National Center for Education Statistics
(NCES), students were grouped into four geographic regions as defined below:
Northeast, Southeast, Midwest, and West.

Northeast:
Connecticut, District of Columbia, Delaware, Massachusetts, Maryland, Maine,
New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont

Southeast:
Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North
Carolina, South Carolina, Tennessee, Virginia, West Virginia

Midwest:
Iowa, Illinois, Indiana, Kansas, Minnesota, Missouri, North Dakota, Nebraska,
Ohio, South Dakota, Michigan, Wisconsin

West:
Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, New Mexico,
Nevada, Oklahoma, Oregon, Texas, Utah, Washington, Wyoming

School size
Based on total school enrollment, schools were classified into one of three school
size groups: small schools had under 200 students enrolled, medium schools had
between 200–499 students enrolled, and large schools had 500 or more students
enrolled.

Socioeconomic status as indexed by the percent of school students with free
and reduced lunch
Schools were classified into one of four classifications based on the percentage of
students in the school who had free or reduced student lunch. The classifications
were coded as follows:
X High socioeconomic status (0%–24%)
X Above-median socioeconomic status (25%–49%)
X Below-median socioeconomic status (50%–74%)
X Low socioeconomic status (75%–100%)

No students were sampled from the schools that did not report the percent of
school students with free and reduced lunch.

The norming sample also included private schools, Catholic schools, students with
disabilities, and English Language Learners.

Table 52: Sample Characteristics Along with National Population Estimates and Sample Estimates

National Estimates   Fall Norming Sample   Spring Norming Sample
Region Midwest 20.7% 18.1% 17.7%
Northeast 17.8% 13.8% 15.0%
Southeast 25.8% 24.2% 23.8%
West 35.7% 43.9% 43.5%
School Enrollment < 200 5.6% 5.0% 5.1%
200–499 29.7% 35.9% 36.8%
≥ 500 64.8% 59.1% 58.1%
District Socioeconomic Status Low 24.1% 30.1% 28.9%
Below Median 25.4% 26.8% 26.3%
Above Median 26.3% 25.5% 25.2%
High 24.2% 17.6% 19.6%
Location Rural 19.7% 24.1% 22.3%
Suburban 38.9% 34.9% 38.8%
Town 10.5% 13.4% 12.7%
Urban 30.8% 27.6% 26.2%
School Type Public 92.5% 95.1% 94.0%
Non-Public 7.5% 4.9% 6.0%

Table 53 provides information on the demographic characteristics of students in
the sample and national percentages provided by NCES. No weighting was done
on the basis of these demographic variables; they are provided to help describe
the sample of students and the schools they attended. Because Star assessment
users do not universally enter individual student demographic information such
as gender and ethnicity/race, some students were missing demographic data; the
sample summaries in Table 53 are based on only those students for whom gender
and ethnicity information were available. In addition to the student demographics
shown, an estimated 5.4% of the students in the norming sample were gifted
and talented (G&T), as approximated by the number of gifted students in public
schools in the Agile Education Marketing Data. School type was defined to be
either public (including charter schools) or non-public (private, Catholic).

Table 53: Student Demographics and School Information: National Estimates and Sample
Percentages

National Estimate   Fall Norming Sample   Spring Norming Sample
Gender Public Female 48.6% 49.5% 49.5%
Male 51.4% 50.5% 50.5%
Non-Public Female – 50.5% 50.8%
Male – 49.5% 49.2%
Race/Ethnicity Public American Indian 0.9% 2.0% 2.1%
Asian 5.8% 6.6% 6.3%
Black 15.1% 14.6% 14.1%
Hispanic 28.7% 34.1% 34.6%
White 44.6% 39.7% 39.9%
Multiple Race^a 5.0% 3.0% 3.0%
Non-Public American Indian 0.6% 0.7% 0.6%
Asian 7.3% 7.6% 7.8%
Black 9.5% 8.2% 7.6%
Hispanic 12.1% 11.5% 26.4%
White 65.1% 66.5% 51.4%
Multiple Race^a 5.3% 5.4% 6.2%
a. Students identified as belonging to two or more races.

Test Administration
All students took the current version of the Star Reading or Early Literacy tests
under normal administration procedures. Some students in the normative sample took
the assessment two or more times within the norming windows; scores from their
initial test administration in the fall and the last test administration in the spring
were used for computing the norms.

Data Analysis
Student test records were compiled from the complete database of Star Reading
and Early Literacy Renaissance Place users. Data were from a single school year
from August 2022 to June 2023. Students’ Unified scale Rasch scores on their
first Star Reading or Early Literacy test taken during the first month of the school
year based on grade placement were used to compute norms for the fall; students’
Rasch scores on the last Star Reading or Early Literacy test taken during the 8th
and 9th months of the school year were used to compute norms for the spring.
Interpolation was used to estimate norms for times of the year between the first
month in the fall and the last month in the spring. The norms were based on the
distribution of Rasch scores for each grade.
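The interpolation step is not spelled out in detail in this chapter. The sketch below
shows one simple possibility, assuming the fall norms are anchored at the first month of
the school year and the spring norms at the ninth month, with straight-line interpolation
in between; the function name and month numbering are illustrative assumptions rather
than the operational procedure.

# Minimal sketch: interpolate a norm table entry between fall and spring anchor points.
# Assumes fall norms anchored at month 1 and spring norms at month 9 of the school year;
# values in between are estimated by linear interpolation.
def interpolate_norm(fall_cut, spring_cut, month, fall_month=1, spring_month=9):
    """Return the interpolated scale-score value for the given month of the school year."""
    if month <= fall_month:
        return fall_cut
    if month >= spring_month:
        return spring_cut
    frac = (month - fall_month) / (spring_month - fall_month)
    return fall_cut + frac * (spring_cut - fall_cut)

# Example: the grade 4 median was 990 in fall and 1019 in spring (Table 54);
# an estimate for month 5 of the school year:
mid_year_median = interpolate_norm(990, 1019, month=5)  # 1004.5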

As noted above, a post-stratification procedure was used to approximate the
national proportions on key characteristics. Post-stratification weights from the
regional, district socio-economic status, and school size strata were computed and
applied to each student’s Unified Rasch ability estimate. Norms were developed
based on the weighted Rasch ability estimates and then transformed to Unified
scaled scores.1 Table 54 provides descriptive statistics for each grade with respect
to the normative sample performance, in the Unified scaled score units.
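As a concrete, hedged illustration of the post-stratification step, the sketch below
forms a weight for each stratum (a combination of region, district socio-economic status,
and school size) as the ratio of the national proportion to the sample proportion and
attaches that weight to each student's Rasch ability estimate. The data structures,
stratum labels, and proportions are hypothetical, and the operational weighting may
differ in detail.

# Minimal post-stratification sketch: weight = national proportion / sample proportion
# for each stratum; each student's Rasch ability estimate then carries the weight of
# his or her stratum when the weighted norms are computed.
from collections import Counter

def post_stratification_weights(students, national_props):
    """students: list of dicts with 'stratum' (region, ses, size) and 'rasch' keys.
    national_props: dict mapping stratum -> national proportion of students."""
    counts = Counter(s["stratum"] for s in students)
    n = len(students)
    weights = {stratum: national_props[stratum] / (count / n)
               for stratum, count in counts.items()}
    return [(s["rasch"], weights[s["stratum"]]) for s in students]

# Hypothetical example with two strata:
students = [
    {"stratum": ("Midwest", "Low SES", "Small"), "rasch": -0.4},
    {"stratum": ("West", "High SES", "Large"), "rasch": 1.2},
    {"stratum": ("West", "High SES", "Large"), "rasch": 0.7},
]
national = {("Midwest", "Low SES", "Small"): 0.5, ("West", "High SES", "Large"): 0.5}
weighted = post_stratification_weights(students, national)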

Table 54: Descriptive Statistics for Weighted Scaled Scores by Grade for the Norming Sample in the
Unified Scale

        Fall Unified Scaled Scores                     Spring Unified Scaled Scores
Grade   N        Mean   Standard Deviation   Median    N        Mean   Standard Deviation   Median
K 255,676 684 73 687 364,368 797 85 797
1 383,509 764 83 756 464,805 857 91 863
2 464,467 864 93 872 520,221 927 87 940
3 506,667 929 87 941 462,500 970 84 983
4 523,369 977 82 990 445,880 1007 80 1019
5 535,258 1010 77 1020 433,896 1033 76 1043
6 469,788 1038 74 1046 384,418 1055 76 1064
7 424,219 1057 75 1066 344,065 1071 78 1080
8 437,350 1075 75 1083 341,036 1087 78 1094
9 239,576 1082 78 1090 199,448 1090 81 1099
10 199,291 1093 80 1102 163,670 1100 82 1110
11 144,019 1100 81 1110 106,703 1105 83 1116
12 104,390 1102 86 1113 51,831 1101 88 1112

1. As part of the development of the Star Early Learning Unified scale, Star Early Literacy
Rasch scores were equated to the Star Reading Rasch scale. This resulted in a downward
extension of the latter scale that encompasses the full range of both Star Early Literacy and
Reading performance. This extended Rasch scale was employed to put all students’ scores
on the same scale for purposes of norms development.


Growth Norms
Student achievement typically is thought of in terms of status: a student’s
performance at one point in time. However, this ignores important information
about a student’s learning trajectory—how much students are growing over a
period of time. When educators are able to consider growth information—the
amount or rate of change over time—alongside current status, a richer picture
of the student emerges, empowering educators to make better instructional
decisions.

To facilitate deeper understanding of achievement, Renaissance Learning
maintains growth norms for Star Assessments that provide insight both on growth
to date and likely growth in the future. Growth norms are currently available
for Star Math, Star Reading, and Star Early Literacy, and may be available for
additional Star adaptive assessments in the coming years.

The growth model used by Star Assessments is Student Growth Percentile
(Betebenner, 2009). SGPs were developed by Dr. Damian Betebenner, originally
in partnership with several state departments of education.2 It should be noted
that the initial development of SGP involved annual state summative tests with
reasonably constrained testing periods within each state. Because Star tests may
be taken at multiple times throughout the year, a number of adaptations to the
original model were made. For more information about Star Reading SGPs, please
refer to this overview: [Link]

SGPs are norm-referenced estimates that compare a student’s growth to that
of his or her academic peers nationwide. Academic peers are defined as those
students in the same grade with a similar score history. SGPs are generated via a
process that uses quantile regression to provide a measure of how much a student
changed from one Star testing window to the next relative to other students with
similar score histories. SGPs range from 1–99 and are interpreted similarly to
Percentile Ranks, with 50 indicating typical or expected growth. For instance, an
SGP score of 37 means that a student grew as much or more than 37 percent of
her academic peers.
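To make the quantile-regression idea concrete, the sketch below estimates an SGP for
one student from a reference sample of prior and current scores. It fits a simple linear
quantile regression at each percentile from 1 to 99 and reports the highest percentile
whose predicted score the student's observed score meets or exceeds. The operational
model conditions on up to two prior scores and uses more flexible functional forms, so
this sketch, including its function names and simulated data, is only an approximation
of the general approach.

# Minimal SGP sketch: fit a quantile regression of current score on prior score at each
# percentile 1-99, then report the highest percentile whose predicted value does not
# exceed the student's observed current score.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def student_growth_percentile(prior, current, student_prior, student_current):
    X = np.column_stack([np.ones_like(prior), prior])  # intercept + prior score
    x_student = np.array([1.0, student_prior])
    sgp = 1
    for pct in range(1, 100):
        fit = QuantReg(current, X).fit(q=pct / 100.0)
        predicted = float(x_student @ fit.params)
        if student_current >= predicted:
            sgp = pct
    return sgp

# Hypothetical reference sample of fall and spring Unified scale scores:
rng = np.random.default_rng(0)
prior = rng.normal(980, 80, size=2000)
current = prior + rng.normal(30, 25, size=2000)
print(student_growth_percentile(prior, current, student_prior=980, student_current=1035))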

The Star Reading SGP package also produces a range of future growth estimates.
Those are mostly hidden from users but are presented in goal setting and
related applications to help users understand what typical or expected growth
looks like for a given student. They are particularly useful for setting future goals
and understanding the likelihood of reaching future benchmarks, such as likely
achievement of proficient on an upcoming state summative assessment.

At present, the Star Reading SGP growth norms are based on a sample of
23,376,700 matched student records from the 2016–2017, 2017–2018, and
2018–2019 school years across grades 1–12. The sample included 9,778,703
unique students across all three school years. Table 55 below provides a
summary of the number of students and tests that were used when computing
the SGP growth norms.

2. Core SGP documentation and source code are publicly available at
[Link]

Table 55: Number of Students and Number of Tests Used in Computing SGP
Growth Norms
Grade Students Tests Grade Students Tests
1 761,260 1,691,537 8 875,682 2,035,964
2 1,142,750 2,903,543 9 466,481 994,943
3 1,234,147 3,132,872 10 380,663 792,217
4 1,247,517 3,137,316 11 261,521 512,893
5 1,260,547 3,127,739 12 180,836 322,067
6 1,060,365 2,574,598 Total 9,778,703 23,376,700
7 906,934 2,151,011

Score Definitions

This chapter enumerates all of the scores reported by Star Reading, including
scaled scores, norm-referenced scores, and criterion-referenced scores.

Types of Test Scores


Star Reading software provides three broad types of test scores that measure
student performance in different ways:
X Criterion-referenced scores describe a student’s performance relative to a
specific content domain or to a standard. Such scores may be expressed either
on a continuous score scale or as a classification. An example of a criterion-
referenced score on a continuous scale is a percent-correct score, which
expresses what proportion of test questions the student can answer correctly
in the content domain. An example of a criterion-referenced classification is a
proficiency category on a standards-based assessment: the student may be
said to be “proficient” or not, depending on whether the student’s score equals,
exceeds, or falls below a specific criterion (the “standard”) used to define
“proficiency” on the standards-based test. The criterion-referenced score
reported by Star Reading is the Instructional Reading Level, which compares
a student’s test performance to the 1995 updated vocabulary lists that are
based on the EDL’s Core Vocabulary list. The Instructional Reading Level is
the highest grade level at which the student is estimated to comprehend 80
percent of the text written at that level.
X Norm-referenced scores compare a student’s test results to the results of other
students who have taken the same test. In this case, scores provide a relative
measure of student achievement compared to the performance of a group of
students at a given time. Percentile Ranks and Grade Equivalents are the two
primary norm-referenced scores available in Star Reading software. Both of
these scores are based on a comparison of a student’s test results to the data
collected during the 2017 national norming program.
X Scaled scores are the fundamental scores used to summarize students’
performance on Star Reading tests. Upon completion of the test, the testing
software calculates a single-valued Star Reading Unified scale score or
Star Reading Enterprise scale score. The Unified scale score is a linear
transformation of the Rasch estimate, while the Enterprise scale is a non-linear
transformation of the Rasch ability estimate as described below.


Enterprise Scale Scores


For Star Reading, the “Enterprise” scale scores are the same scores that have
been reported continuously since Star Reading Version 1 was introduced in 1996.
1996. Because Version 1 was not based on item response theory, its scores
were expressed on an ad hoc vertical (developmental) scale related to the
student’s reading grade level; scale scores ranged from 50 to 1350. The use of
item response theory was introduced into Star Reading Version 2. Beginning with
that version, Star software calculated students’ scores on the Rasch IRT ability
scale. To maintain continuity with the non-IRT score scale used in Version 1, the
Rasch ability scores were converted to scores on the original scale by means
of an equipercentile equating transformation. At that time, the range of reported
Enterprise scale scores was extended to 0 to 1400.

Unified Scale Scores


Many users of Star Reading use Star Early Literacy to assess their students until
they are ready to take Star Reading itself. Until recently, Star Reading and Star
Early Literacy used different score scales, making it difficult to monitor growth
as students transitioned from one assessment to the other. To ameliorate that
disparity in the two tests’ score scales, Renaissance developed a single score
scale that applies to both assessments: the Unified score scale. That development
began with equating the two tests’ underlying Rasch ability scales; the result was
the “unified Rasch scale”, which is a downward extension of the Rasch scale used
in all Star Reading versions since the introduction of version 2. The next step was
to develop an integer scale based on the unified Rasch scale, with scale scores
anchored to important points on the original Enterprise score scales of both tests.
The end result was a reported score scale that extends from 200 to 1400: Star
Early Literacy Unified scale scores range from 200 to 1100; Star Reading Unified
scale scores range from 600 to 1400. An added benefit of the Unified scale is
an improvement in certain properties of the scale scores: Scores on both tests
are much less variable from grade to grade; measurement error is likewise less
variable; and Unified score reliability is slightly higher than that of the Enterprise
scores.

Grade Equivalent (GE)


A Grade Equivalent (GE) indicates the grade placement of students for whom a
particular score is typical. If a student receives a GE of 3.4, this means that the
student scored as well on Star Reading as did the typical student who is 40%
through grade 3 in the norming sample. It does not necessarily mean that the
student can read independently at a third-grade level, only that they obtained a
Scaled Score as high as the average third-grade student who is 40% through the
school year.

GE scores are often misinterpreted as though they convey information about what
a student knows or can do—that is, as if they were criterion-referenced scores.
To the contrary, GE scores are norm-referenced. Star Reading Grade Equivalents
range from 0.0 to 12.9+. The minimum GE score is 0.0, and scores at that floor
may be displayed as either 0 or < K.

The scale divides the academic year into 10 equal segments and is expressed as
a decimal with the integer denoting the grade level and the tenths value indicating
10% segments through the school year. For example, if a student obtained a GE
of 4.6 on a Star Reading assessment, this would suggest that the student was
performing similarly to the average student in the fourth grade that is 60% through
the school year. Because Star Reading norms are based on fall and spring score
data only, the 10% incremental GE scores are derived through interpolation by
fitting a curve to the grade-by-grade medians. Table 56 on page 118 contains the
Scaled Score to GE conversions.
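The curve-fitting details are not reproduced here. The sketch below illustrates the
general idea using straight-line interpolation between grade-level medians, taking the
fall medians from Table 54 and treating them, for simplicity, as anchored at the start of
each grade. Because the operational procedure fits a curve to both fall and spring
medians, the result differs slightly from the published conversions in Table 56; the
function and variable names are illustrative.

# Illustrative GE lookup: interpolate between grade-level median scaled scores to
# express a Unified scaled score as a decimal grade equivalent.
import numpy as np

# Fall median Unified scaled scores for grades 1-5 (Table 54), treated as anchored
# at grade placements 1.0-5.0 for simplicity.
grade_points = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
median_scores = np.array([756, 872, 941, 990, 1020])

def grade_equivalent(scaled_score):
    return float(np.interp(scaled_score, median_scores, grade_points))

print(round(grade_equivalent(955), 1))  # roughly 3.3 under these simplifications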

Comparing the Star Reading Test with Conventional Tests

Because the Star Reading test adapts to the reading level of the student being tested,
Star Reading GE scores are more consistently accurate across the achievement
spectrum than those provided by conventional test instruments. Grade Equivalent
scores obtained using conventional (non-adaptive) test instruments are less accurate
when a student’s grade placement and GE score differ markedly. It is not uncommon
for a fourth-grade student to obtain a GE score of 8.9 when using a conventional test
instrument. However, this does not necessarily mean that the student is performing at
a level typical of an end-of-year eighth grader; more likely, it means that the student
answered all, or nearly all, of the items correctly and thus performed beyond the range
of the fourth-grade test.

Star Reading Grade Equivalent scores are more consistently accurate—even
as a student’s achievement level deviates from the level of grade placement. A
student may be tested on any level of material, depending upon his or her actual
performance on the test; students are tested on items of an appropriate level
of difficulty, based on their individual level of achievement. Thus, a GE score of
7.6 indicates that the student’s score can be appropriately compared to that of a
typical seventh grader who has completed 60% of the school year (with the same
caveat as before—it does not mean that the student can actually handle seventh-
grade reading material).


Estimated Oral Reading Fluency (Est. ORF)


Estimated Oral Reading Fluency (Est. ORF) is an estimate of a student’s ability
to read words quickly and accurately in order to comprehend text efficiently.
Students with oral reading fluency demonstrate accurate decoding, automatic
word recognition, and appropriate use of the rhythmic aspects of language (e.g.,
intonation, phrasing, pitch, and emphasis).

Est. ORF is reported as an estimated number of words a student can read
correctly within a one-minute time span on grade-level-appropriate text. Grade-
level text was defined to be connected text in a comprehensible passage form
that has a readability level within the range of the first half of the school year. For
instance, an Est. ORF score of 60 for a second-grade student would be interpreted
as meaning the student is expected to read 60 words correctly within one minute
on a passage with a reading grade level between 2.0 and 2.5. Therefore, when
this estimate is compared to an observed score on a specific passage which has
a fixed level of readability, there might be noticeable differences as the Est. ORF
provides an estimate across a range of readability levels.

The Est. ORF score was computed using the results of a large-scale research
study investigating the linkage between the Star Reading scores and estimates of
oral reading fluency on a range of passages with grade-level-appropriate difficulty.
An equipercentile linking was done between Star Reading scores and oral reading
fluency, providing an estimate of the oral reading fluency for each Scaled Score
unit in Star Reading for grades 1–4 independently. Results of the analysis can be
found in “Additional Validation Evidence for Star Reading” on page 74. A table of
selected Star Reading Scaled Scores and corresponding Estimated ORF values
can be found in “Appendix B: Detailed Evidence of Star Reading Validity” on page
136.
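Equipercentile linking matches score points that cut off the same proportion of students
in the two distributions. The sketch below shows the basic idea for a single grade under
simplifying assumptions (continuous interpolation of percentiles); the sample data, score
values, and function name are hypothetical and do not reproduce the operational linking
study.

# Minimal equipercentile linking sketch: for a given Star Reading scaled score, find its
# percentile rank in the Star distribution, then return the oral reading fluency value
# (words correct per minute) at the same percentile of the ORF distribution.
import numpy as np

def equipercentile_link(star_scores, orf_scores, star_value):
    star_sorted = np.sort(star_scores)
    orf_sorted = np.sort(orf_scores)
    # Percentile rank (0-100) of the Star score within the Star distribution.
    pr = 100.0 * np.searchsorted(star_sorted, star_value, side="right") / len(star_sorted)
    # ORF value at the same percentile of the ORF distribution.
    return float(np.percentile(orf_sorted, pr))

# Hypothetical grade 2 samples:
rng = np.random.default_rng(1)
star = rng.normal(880, 90, size=5000)
orf = rng.normal(72, 35, size=5000).clip(min=0)
print(equipercentile_link(star, orf, star_value=900))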

Instructional Reading Level (IRL)


The Instructional Reading Level is a criterion-referenced score that indicates
the highest reading level at which the student can effectively be taught. In other
words, IRLs tell you the reading level at which students can recognize words and
comprehend written instructional material with some assistance. A sixth-grade
student with an IRL of 4.0, for example, would be best served by instructional
materials prepared at the fourth-grade level. IRLs are represented by either
numbers or letters indicating a particular grade. Number codes represent IRLs
for grades 1.0–12.9. IRL letter codes include PP (Pre-Primer), P (Primer, grades
.1–.9), and PHS (Post-High School, grades 13.0+).

As a construct, instructional reading levels have existed in the field of reading
education for over seventy years. During this time, a variety of assessment
instruments have been developed using different measurement criteria that
teachers can use to estimate IRL. Star Reading software determines IRL scores
relative to 1995 updated vocabulary lists that are based on the Educational
Development Laboratory’s (EDL) A Revised Core Vocabulary (1969). The
Instructional Reading Level is defined as the highest reading level at which
the student can read at 90–98 percent word recognition (Gickling & Havertape,
1981; Johnson, Kress & Pikulski, 1987; McCormick, 1999) and with 80 percent
comprehension or higher (Gickling & Thompson, 2001). Although Star Reading
does not directly assess word recognition, Star Reading uses the student’s
Rasch ability scores, in conjunction with the Rasch difficulty parameters of graded
vocabulary items, to determine the proportion of items a student can comprehend
at each grade level.
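As a hedged illustration of that calculation, the sketch below applies the Rasch model to
hypothetical difficulty parameters for each grade's vocabulary items and reports the
highest grade level at which the student's expected proportion correct is at least 80
percent. The difficulty values, grade structure, and function names are assumptions made
only for the example.

# Minimal IRL sketch: under the Rasch model, the probability of a correct response is
# 1 / (1 + exp(-(theta - b))). For each grade level, average the probabilities over that
# grade's item difficulties; the IRL is the highest grade at which the expected proportion
# correct is at least 0.80.
import math

def expected_proportion_correct(theta, difficulties):
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties) / len(difficulties)

def instructional_reading_level(theta, items_by_grade):
    """items_by_grade: dict mapping grade level -> list of Rasch item difficulties."""
    irl = None
    for grade in sorted(items_by_grade):
        if expected_proportion_correct(theta, items_by_grade[grade]) >= 0.80:
            irl = grade
    return irl

# Hypothetical graded vocabulary difficulties (logits):
items = {1: [-3.0, -2.5, -2.8], 2: [-2.0, -1.6, -1.8], 3: [-1.0, -0.7, -0.9],
         4: [-0.2, 0.1, 0.0], 5: [0.8, 1.0, 0.9]}
print(instructional_reading_level(theta=0.5, items_by_grade=items))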

Special IRL Scores


If a student’s performance on Star Reading indicates an IRL below the first grade,
Star Reading software will automatically assign an IRL score of Primer (P) or
Pre-Primer (PP). Because the kindergarten-level test items are designed so that
even readers of very early levels can understand them, a Primer or Pre-Primer
IRL means that the student is essentially a non-reader. There are, however, other
unusual circumstances that could cause a student to receive an IRL of Primer
or Pre-Primer. Most often, this happens when a student simply does not try or
purposely answers questions incorrectly.

When Star Reading software determines that a student can answer 80 percent
or more of the grade 13 items in the Star Reading test correctly, the student is
assigned an IRL of Post-High School (PHS). This is the highest IRL that anyone
can obtain when taking the Star Reading test.

Understanding IRL and GE Scores


One strength of Star Reading software is that it provides both criterion-referenced
and norm-referenced scores. As such, it provides more than one frame of
reference for describing a student’s current reading performance. The two frames
of reference differ significantly, however, so it is important to understand the two
estimates and their development when making interpretations of Star Reading
results.

The Instructional Reading Level (IRL) is a criterion-referenced score. It provides
an estimate of the grade level of written material with which the student can most
effectively be taught. While the IRL, like any test result, is simply an estimate, it
provides a useful indication of the level of material on which the student should
be receiving instruction. For example, if a student (regardless of current grade
placement) receives a Star Reading IRL of 4.0, this indicates that the student can
most likely learn without experiencing too many difficulties when using materials
written to be on a fourth-grade level.

The IRL is estimated based on the student’s pattern of responses to the Star Reading
items. A given student’s IRL is the highest grade level of items at which it is estimated
that the student can correctly answer at least 80 percent of the items.

In effect, the IRL references each student’s Star Reading performance to the difficulty
of written material appropriate for instruction. This is a valuable piece of information in
planning the instructional program for individuals or groups of students.

The Grade Equivalent (GE) is a norm-referenced score. It provides a comparison of
a student’s performance with that of other students around the nation. If a student
receives a GE of 4.0, this means that the student scored as well on the Star Reading
test as did the typical student at the beginning of grade 4. It does not mean that the
student can read books that are written at a fourth-grade level—only that he or she
reads as well as fourth-grade students in the norms group.

In general, IRLs and GEs will differ. These differences are caused by the fact
that the two score metrics are designed to provide different information. That is,
IRLs estimate the level of text that a student can read with some instructional
assistance; GEs express a student’s performance in terms of the grade level for
which that performance is typical. Usually, a student’s GE score will be higher than
the IRL.

The score to be used depends on the information desired. If a teacher or educator
wishes to know how a student’s Star Reading score compares with that of other
students across the nation, either the GE or the Percentile Rank should be used.
If the teacher or educator wants to know what level of instructional materials a
student should be using for ongoing classroom schooling, the IRL is the preferred
score. Again, both scores are estimates of a student’s current level of reading
achievement. They simply provide two ways of interpreting this performance—
relative to a national sample of students (GE) or relative to the level of written
material the student can read successfully (IRL).

Percentile Rank (PR)


Percentile Rank is a norm-referenced score that indicates the percentage of
students in the same grade and at the same point of time in the school year who
obtained scores lower than the score of a particular student. In other words,
Percentile Ranks show how an individual student’s performance compares to that
of his or her same-grade peers on the national level. For example, a Percentile
Rank of 85 means that the student is performing at a level that exceeds 85 percent
of other students in that grade at the same time of the year. Percentile Ranks
simply indicate how a student performed compared to the others who took Star
Reading tests as a part of the national norming program. The range of Percentile
Ranks is 1–99.

The Percentile Rank scale is not an equal-interval scale. For example, for a
student with a grade placement of 1.7, a Unified Scaled Score of 896 corresponds
to a PR of 80, and a Unified Scaled Score of 931 corresponds to a PR of 90. Thus,
a difference of 35 Scaled Score points represents a 10-point difference in PR.
However, for students at the same 1.7 grade placement, a Unified Scaled Score
of 836 corresponds to a PR of 50, and a Unified Scaled Score of 853 corresponds
to a PR of 60. While there is now only a 17-point difference in Scaled Scores,
there is still a 10-point difference in PR. For this reason, PR scores should not
be averaged or otherwise algebraically manipulated. NCE scores are much more
appropriate for these activities.

Table 59 on page <OV> and Table 57 on page 122 contain an abridged version
of the Scaled Score to Percentile Rank conversion table that the Star Reading
software uses. The actual table includes data for all of the monthly grade
placement values from 0.0–12.9.

This table can be used to estimate PR values for tests that were taken when the
grade placement value of a student was incorrect (see “Types of Test Scores” on
page 105 for more information). If the error is caught right away, one always has
the option of correcting the grade placement for the student and then having the
student retest.
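As a small illustration of that use of the table, the sketch below looks up a PR for a
given scaled score within one grade column, taking advantage of the fact that each entry
is the lowest scaled score for its percentile. The few grade 4 (first month) cut scores
shown are taken from Table 57; the function name is illustrative.

# Look up a Percentile Rank from one grade column of the conversion table. Each entry
# is the lowest Unified scaled score for that grade and percentile, so the PR is the
# highest percentile whose cut score the student's score meets or exceeds.
def percentile_rank(scaled_score, cuts_by_pr):
    """cuts_by_pr: dict mapping PR -> lowest scaled score for that PR."""
    eligible = [pr for pr, cut in cuts_by_pr.items() if scaled_score >= cut]
    # With only a few cuts provided, scores below all cuts default to PR 1.
    return max(eligible) if eligible else 1

# A few grade 4 (first month) entries from Table 57:
grade4_cuts = {25: 934, 50: 989, 75: 1031}
print(percentile_rank(1000, grade4_cuts))  # 50, since 1000 >= 989 but < 1031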

Normal Curve Equivalent (NCE)


Normal Curve Equivalents (NCEs) are scores that have been scaled in such a way
that they have a normal distribution, with a mean of 50 and a standard deviation
of 21.06 in the normative sample for a given test. Because they range from 1–99,
they appear similar to Percentile Ranks, but they have the advantage of being
based on an equal interval scale. That is, the difference between two successive
scores on the scale has the same meaning throughout the scale. NCEs are useful
for purposes of statistically manipulating norm-referenced test results, such as
when interpolating test scores, calculating averages, and computing correlation
coefficients between different tests. For example, in Star Reading score reports,
average Percentile Ranks are obtained by first converting the PR values to NCE
values, averaging the NCE values, and then converting the average NCE back to
a PR.
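Because the NCE scale is a normalizing transformation of the PR scale with mean 50 and
standard deviation 21.06, both the conversions and the PR-averaging procedure just
described can be sketched directly from the normal distribution; the function names
below are illustrative.

# NCE/PR conversions and PR averaging via NCE, using a normal distribution with
# mean 50 and standard deviation 21.06.
from statistics import NormalDist

_NCE_DIST = NormalDist(mu=50, sigma=21.06)

def pr_to_nce(pr):
    return _NCE_DIST.inv_cdf(pr / 100.0)

def nce_to_pr(nce):
    return 100.0 * _NCE_DIST.cdf(nce)

def average_pr(pr_values):
    """Average PRs by converting to NCE, averaging, and converting back to PR."""
    mean_nce = sum(pr_to_nce(p) for p in pr_values) / len(pr_values)
    return nce_to_pr(mean_nce)

# PRs of 10, 50, and 90 correspond to NCEs of about 23, 50, and 77; their NCE average
# is 50, so the averaged PR is 50.
print(round(pr_to_nce(90), 1), round(average_pr([10, 50, 90])))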

Table 61 on page <?> provides the NCEs corresponding to integer PR values
and facilitates the conversion of PRs to NCEs. Table 62 on page <OV> provides the
conversions from NCE to PR. The NCE values are given as a range of scores that
convert to the corresponding PR value.

Student Growth Percentile (SGP)


Student Growth Percentiles (SGPs) are a norm-referenced quantification of
individual student growth derived using quantile regression techniques. An SGP
compares a student’s growth to that of his or her academic peers nationwide with
a similar achievement history on Star assessments. Academic peers are students
who
X are in the same grade,
X had the same scores on the current test and (up to) two prior tests from
different windows of testing time, and
X took the most recent test and the first prior test on the same dates.1
SGPs provide a measure of how a student changed from one Star testing window
to the next relative to other students with similar starting Star Reading scores.
SGPs range from 1–99 and interpretation is similar to that of Percentile Rank
scores; lower numbers indicate lower relative growth and higher numbers show
higher relative growth. For example, an SGP of 70 means that the student’s
growth from one test window to another exceeds the growth of 70% of students
nationwide in the same grade with a similar Star Reading score history. All
students, no matter their starting Star score, have an equal chance to demonstrate
growth at any of the 99 percentiles.

SGPs are often used to indicate whether a student’s growth is more or less than
can be expected. For example, without an SGP, a teacher would not know if a
Scaled Score increase of 100 points represents good, not-so-good, or average
growth. This is because students of differing achievement levels in different grades
grow at different rates relative to the Star Reading scale. For example, a high-
achieving second-grader grows at a different rate than a low-achieving second-
grader. Similarly, a high-achieving second-grader grows at a different rate than a
high-achieving eighth-grader.

SGPs can be aggregated to describe typical growth for groups of students—for
example, a class, grade, or school as a whole—by calculating the group’s median,
or middle, growth percentile. No matter how SGPs are aggregated, whether at the
class, grade, or school level, the statistic and its interpretation remain the same.
For example, if the students in one class have a median SGP of 62, that particular
group of students, on average, achieved higher growth than their academic peers.

1. We collect data for our growth norms during three different time periods: fall, winter, and
spring. More information about these time periods is provided on page 113.


SGP is calculated for students who have taken at least two tests (a current test
and a prior test) within at least two different testing windows (Fall, Winter, or
Spring).

If a student has taken more than one test in a single test window, the SGP
calculation is based off the following tests:
X The current test is always the last test taken in a testing window.
X The test used as the prior test depends on what testing window it falls in:
X Fall window: The first test taken in the Fall window is used.
X Winter window: The test taken closest to January 15 in the Winter window
is used.
X Spring window: The last test taken in the Spring window is used.

[Figure: SGP test windows. The test windows are Fall (8/1–11/30), Winter (12/1–3/31),
and Spring (4/1–7/31). Depending on whether the most recent test falls in the current
school year or in a prior school year, SGPs may be calculated for the following window
pairs: Fall–Spring, Fall–Winter, Winter–Spring, Spring–Fall, Spring–Spring, and
Fall–Fall. Up to a third test is used in the calculation when one is available.]

Test window dates are fixed and may not correspond to the beginning/ending dates of
your school year. Students will only have SGPs calculated if they have taken at least
two tests, and the date of the most recent test must be within the past 18 months.

Lexile® Measures
In cooperation with MetaMetrics®, since August 2014, users of Star Reading
have had the option of including Lexile measures on certain Star Reading score
reports. Reported Lexile measures will range from BR400L to 1825L. (The “L”
suffix identifies the score as a Lexile measure. Where it appears, the “BR” prefix
indicates a score that is below 0 on the Lexile scale; such scores are typical of
beginning readers.)


Lexile Measures of Students and Books: Measures of Student Reading
Achievement and Text Readability

The ability to read and comprehend written text is important for academic success.
Students may, however, benefit most from reading materials that match their
reading ability/achievement: reading materials that are neither too easy nor too
hard so as to maximize learning. To facilitate students’ choices of appropriate
reading materials, measures commonly referred to as readability measures are
used in conjunction with students’ reading achievement measures.

A text readability measure can be defined as a numeric scale, often derived
analytically, that takes into account text characteristics that influence text
comprehension or readability. An example of a readability measure is an age-
level estimate of text difficulty. Among text characteristics that can affect text
comprehension are sentence length and word difficulty.

A person’s reading measure is a numeric score obtained from a reading
achievement test, usually a standardized test such as Star Reading. A person’s
reading score quantifies his/her reading achievement level at a particular point in
time.

Matching a student with text/books that target a student’s interest and level of
reading achievement is a two-step process: first, a student’s reading achievement
score is obtained by administering a standardized reading achievement test;
second, the reading achievement score serves as an entry point into the
readability measure to determine the difficulty level of text/books that would best
support independent reading for the student. Optimally, a readability measure
should match students with books that they are able to read and comprehend
independently without boredom or frustration: books that are engaging yet slightly
challenging to students based on the students’ reading achievement and grade
level.

Renaissance Learning’s (RLI) readability measure is known as the Advantage/
TASA Open Standard for Readability (ATOS). The ATOS for Text readability formula
was developed through extensive research by RLI in conjunction with Touchstone
Applied Science Associates, Inc. (TASA), now called Questar Assessment, Inc.
A great many school libraries use ATOS book levels to index readability of their
books. ATOS book levels, which are derived from ATOS for Books measures,
express readability as grade levels; for example, an ATOS readability measure of 4.2
means that the book is at a difficulty level appropriate for students reading at a level
typical of students in the 4th grade, 2nd month. To match students to books at an
appropriate level, the widely used Accelerated Reader system uses ATOS measures
of readability and student’s Grade Equivalent (GE) scores on standardized reading
tests such as Star Reading.

Star Assessments™ for Reading


Technical Manual 114
Score Definitions
Types of Test Scores

Another widely-used system for matching readers to books at appropriate difficulty
levels is The Lexile Framework® for Reading, developed by MetaMetrics, Inc. The
Lexile scale is a common scale for both text measure (readability or text difficulty)
and reader measure (reading achievement scores); in the Lexile Framework, both
text difficulty and person reading ability are measured on the same scale. Unlike
ATOS for Books, the Lexile Framework expresses a book’s reading difficulty level
(and students’ reading ability levels) on a continuous scale ranging from below
0 to 1825 or more. Because some schools and school libraries use the Lexile
Framework to index the reading difficulty levels of their books, there was a need to
provide users of Star Reading with a student reading ability score compatible with
the Lexile Framework.

In 2014, MetaMetrics, Inc., developed a means to translate Star Reading scale
scores into equivalent Lexile measures of student reading ability. To do so, more
than 200 MetaMetrics reading test items that had already been calibrated on
the Lexile scale were administered in small numbers as unscored scale anchor
items at the end of Star Reading tests. More than 250,000 students in grades 1
through 12 took up to 6 of those items as part of their Star Reading tests in April
2014. MetaMetrics’ analysis of the Star Reading and Lexile anchor item response
data yielded a means of transforming Star Reading’s underlying Rasch scores
into equivalent Lexile scores. That transformation, in turn, was used to develop a
concordance table listing the Lexile equivalent of each unique Star Reading scale
score.

In some cases, a range of text/book reading difficulty in which a student can read
independently or with minimal guidance is desired. At Renaissance, we define the
range of reading difficulty level that is neither too hard nor too easy as the Zone of
Proximal Development (ZPD). The ZPD range allows, potentially, optimal learning
to occur because students are engaged and appropriately challenged by reading
materials that match their reading achievement and interest. The ZPD range is
simply an approximation of the range of reading materials that is likely to benefit
the student most. ZPD ranges are not absolute and teachers should also use their
objective judgment to help students select reading books that enhance learning.

In a separate linking procedure, MetaMetrics compared the ATOS readability
measures of thousands of books to the Lexile measures of the same books.
Analysis of those data yielded a table of equivalence between ATOS reading grade
levels and Lexile readability measures. That equivalence table supports matching
students to books regardless of whether a book’s readability is measured using
the Renaissance Learning ATOS system or the Lexile Framework created by
MetaMetrics. Additionally, it supports translating ATOS ZPD ranges into equivalent
ZPD ranges expressed on the Lexile scale.


Special Star Reading Scores


Most of the scores provided by Star Reading software are common measures of
reading performance. Star Reading software also determines the Zone of Proximal
Development.

Zone of Proximal Development (ZPD)


The Zone of Proximal Development (ZPD) defines the readability range from which
students should be selecting books in order to ensure sufficient comprehension
and therefore achieve optimal growth in reading skills without experiencing
frustration. Star Reading software uses Grade Equivalents to derive a student’s
ZPD score. Specifically, it relates the Grade Equivalent estimate of a student’s
reading ability with the range of most appropriate readability levels to use for
reading practice. Table 60 on page 127 shows the relationship between GEs and
ZPD scores.

The Zone of Proximal Development is especially useful for students who use
Accelerated Reader, which provides readability levels on over 180,000 trade books
and textbooks. Renaissance Learning developed the ZPD ranges according to
Vygotskian theory, based on an analysis of Accelerated Reader book reading data
from 80,000 students in the 1996–1997 school year. More information is available
in The research foundation for Accelerated Reader goal-setting practices (2006),
which is published by Renaissance Learning ([Link]).

Grade Placement
Star Reading software uses students’ grade placement—grade and percent of the
school year completed—when determining norm-referenced scores. The values of
PR (Percentile Rank) and NCE (Normal Curve Equivalent) are based not only on
what Scaled Score the student achieved, but also on the grade placement of the
student at the time of the test. For example, a second-grader that is 80% through
the school year with a Unified Scaled Score of 957 would have a PR of 74, while
a third-grader that is 80% through the school year with the same Unified Scaled
Score would have a PR of 38. Thus, it is crucial that student records indicate the
proper grade placement when students take a Star Reading test.

Indicating the Appropriate Grade Placement

Grade Placement is shown as a decimal, with the integer being the student’s
grade and the decimal being the percentage of the school year that has passed
when a student completed an assignment or test; for example, a GP of 3.25 would
represent a third-grade student who is one-quarter of the way through the school
year. For purposes of this calculation, a “school year” is the span between First
Day for Students and Last Day for Students, which are entered as part of the new
school year setup process.
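The arithmetic behind that example can be sketched as follows, assuming the First Day
and Last Day for Students are known; the dates and function name below are hypothetical.

# Minimal grade placement sketch: GP = grade + fraction of the school year elapsed,
# where the school year spans First Day for Students through Last Day for Students.
from datetime import date

def grade_placement(grade, first_day, last_day, test_day):
    elapsed = (test_day - first_day).days
    year_length = (last_day - first_day).days
    fraction = min(max(elapsed / year_length, 0.0), 1.0)
    return round(grade + fraction, 2)

# Hypothetical school year: a third grader tested about one quarter of the way through.
print(grade_placement(3, date(2022, 8, 22), date(2023, 6, 2), date(2022, 10, 31)))  # 3.25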

Compensating for Incorrect Grade Placements

Teachers cannot make retroactive corrections to a student’s grade placement by
editing the grade assignments in a student’s record or by adjusting the increments
for the summer months after students have tested. In other words, Star Reading
software cannot go back in time and correct scores resulting from erroneous grade
placement information. Thus, it is extremely important for the test administrator to
make sure that the proper grade placement procedures are being followed.

Conversion Tables

Table 56: Unified Scaled Score to Grade Equivalent Conversions

Grade Equivalent   Unified Scaled Score Low   Unified Scaled Score High
0.0 600 692
0.1 693 701
0.2 702 709
0.3 710 718
0.4 719 726
0.5 727 735
0.6 736 744
0.7 745 753
0.8 754 762
0.9 763 771
1.0 772 780
1.1 781 789
1.2 790 798
1.3 799 807
1.4 808 816
1.5 817 824
1.6 825 833
1.7 834 841
1.8 842 849
1.9 850 857
2.0 858 865
2.1 866 873
2.2 874 881
2.3 882 888
2.4 889 896
2.5 897 903
2.6 904 910
2.7 911 916
2.8 917 923
2.9 924 929
3.0 930 935
3.1 936 941
3.2 942 947
3.3 948 952
3.4 953 958
3.5 959 963
3.6 964 968
3.7 969 973
3.8 974 977
3.9 978 982
4.0 983 986
4.1 987 990
4.2 991 994
4.3 995 998
4.4 999 1002
4.5 1003 1005
4.6 1006 1009
4.7 1010 1012
4.8 1013 1015
4.9 1016 1018
5.0 1019 1021
5.1 1022 1024
5.2 1025 1026
5.3 1027 1029
5.4 1030 1031
5.5 1032 1034
5.6 1035 1036
5.7 1037 1038
5.8 1039 1040
5.9 1041 1042
6.0 1043 1045
6.1 1046 1046
6.2 1047 1048
6.3 1049 1050
6.4 1051 1052
6.5 1053 1054
6.6 1055 1056
6.7 1057 1057
6.8 1058 1059
6.9 1060 1061
7.0 1062 1062
7.1 1063 1064
7.2 1065 1065
7.3 1066 1067
7.4 1068 1068
7.5 1069 1070
7.6 1071 1071
7.7 1072 1073
7.8 1074 1074
7.9 1075 1076
8.0 1077 1077
8.1 1078 1079
8.2 1080 1080
8.3 1081 1082
8.4 1083 1083
8.5 1084 1084
8.6 1085 1086
8.7 1087 1087
8.8 1088 1088
8.9 1089 1089
9.0 1090 1091
9.1 1092 1092
9.2 1093 1093
9.3 1094 1094
9.4 1095 1095
9.5 1096 1097
9.6 1098 1098
9.7 1099 1099
9.8 1100 1100
9.9 1101 1101
10.0 1102 1102
10.1 1103 1103
10.2 1104 1104
10.3 1105 1105
10.4 1106 1106
10.5 1107 1107
10.6 1108 1108
10.7 1109 1109
10.8 1110 1110
10.9 1111 1111
11.0 1112 1112
11.1 1113 1113
11.2 1114 1114
11.3 1115 1115
11.4 1116 1116
11.5 1117 1117
11.6 1118 1118
11.7 1119 1119
11.8 1120 1120
11.9 1121 1121
12.0 1122 1122
12.1 1123 1123
12.2 1124 1124
12.3 1125 1125
12.4 1126 1126
12.5 1127 1127
12.6 1128 1128
12.7 1129 1129
12.8 1130 1130
12.9 1131 1131
13.0 1132 1400

Table 57: Unified Scaled Score to Percentile Rank Conversions^a
a. Each entry is the lowest Scaled Score for that grade and percentile.

Grade Placement (First Month of School Year)
PR K 1 2 3 4 5 6 7 8 9 10 11 12
1 ― 600 600 600 600 600 600 600 600 600 600 600 600
2 ― 601 ― 723 770 814 851 870 889 891 889 895 882
3 ― ― 614 752 790 838 876 894 912 915 918 924 914
4 ― 614 676 763 807 856 892 911 929 933 937 944 936
5 ― 620 704 772 821 870 905 923 942 946 952 960 953
6 ― 631 725 781 833 882 916 933 953 957 964 972 966
7 ― 640 737 789 843 891 924 942 961 966 974 982 977
8 ― 648 744 797 853 900 932 950 969 974 982 990 986
9 ― 654 748 805 862 907 939 957 975 981 989 997 994
10 ― 660 752 812 869 914 945 963 981 987 995 1004 1001
11 ― 665 756 818 876 920 951 968 987 992 1000 1010 1007
12 ― 670 759 824 882 925 956 974 992 997 1006 1015 1014
13 600 674 763 830 887 930 960 978 996 1001 1011 1020 1019
14 601 679 766 836 893 934 964 982 1000 1006 1015 1024 1024
15 604 684 770 841 898 938 969 986 1004 1010 1020 1029 1028
16 607 687 773 846 903 943 972 990 1008 1014 1023 1033 1032
17 610 691 777 850 907 946 976 994 1011 1017 1027 1036 1036
18 613 694 780 855 912 950 979 997 1015 1021 1031 1040 1040
19 616 698 783 859 916 953 982 1000 1018 1024 1034 1043 1043
20 619 701 787 863 919 957 986 1003 1021 1027 1038 1046 1047
21 622 704 790 868 923 960 988 1005 1024 1030 1041 1049 1050
22 624 707 793 871 926 963 991 1008 1026 1033 1043 1052 1053
23 627 709 796 875 929 965 994 1011 1029 1036 1046 1055 1055
24 630 712 799 879 932 968 996 1014 1032 1038 1049 1057 1058
25 633 714 802 882 934 971 999 1017 1034 1041 1051 1060 1061
26 635 716 805 885 937 973 1001 1019 1037 1043 1054 1063 1064
27 637 719 808 888 940 976 1003 1021 1039 1046 1056 1065 1066
28 640 721 811 891 943 978 1005 1024 1041 1048 1058 1068 1069
29 643 723 813 893 945 980 1007 1026 1044 1050 1061 1070 1071
30 646 725 816 896 948 983 1010 1028 1046 1052 1064 1072 1073
31 648 726 819 899 950 985 1012 1030 1048 1054 1066 1075 1076
32 650 728 821 901 953 987 1014 1032 1050 1056 1068 1077 1078
33 652 729 824 904 955 989 1016 1034 1052 1058 1070 1079 1080
34 654 731 827 906 957 991 1018 1036 1054 1061 1072 1080 1082
35 656 732 830 909 959 994 1020 1038 1056 1063 1074 1082 1084
36 658 734 833 911 961 995 1022 1040 1058 1065 1076 1084 1086
37 660 736 836 914 963 997 1024 1042 1060 1067 1078 1086 1088
38 662 737 839 916 966 999 1026 1044 1062 1069 1080 1088 1090
39 663 738 842 919 968 1001 1027 1046 1064 1071 1082 1090 1092
40 665 740 845 921 970 1002 1029 1047 1066 1073 1084 1092 1094
41 666 741 848 923 972 1004 1031 1049 1067 1074 1086 1093 1095
42 668 743 850 925 974 1006 1033 1051 1069 1076 1087 1095 1097
43 671 744 853 927 976 1007 1034 1053 1071 1078 1089 1097 1099
44 673 746 856 929 978 1009 1036 1055 1073 1080 1091 1099 1101
45 676 747 858 931 980 1011 1038 1056 1074 1081 1093 1101 1103
46 679 749 861 932 981 1012 1039 1058 1076 1083 1095 1103 1105
47 681 750 863 934 983 1014 1041 1060 1078 1085 1097 1104 1107
48 683 752 866 936 985 1016 1043 1062 1079 1087 1098 1106 1108
49 685 754 869 938 987 1017 1044 1063 1081 1088 1100 1108 1110
50 686 755 871 940 989 1019 1046 1065 1082 1090 1102 1110 1112
51 688 757 874 942 991 1021 1047 1067 1083 1092 1103 1112 1114
52 690 758 876 944 993 1022 1049 1068 1085 1093 1105 1113 1115
53 692 760 878 946 994 1024 1050 1070 1087 1095 1107 1115 1117
54 693 762 881 948 996 1026 1052 1072 1088 1097 1109 1116 1119
55 695 764 883 950 998 1027 1054 1073 1090 1098 1110 1118 1121
56 697 766 886 952 999 1029 1055 1075 1091 1100 1112 1120 1122
57 699 768 888 954 1001 1030 1057 1076 1093 1101 1114 1121 1124
58 700 770 890 956 1002 1032 1058 1078 1095 1103 1115 1123 1126
59 702 772 893 958 1004 1034 1060 1079 1096 1105 1117 1124 1127
60 704 775 895 960 1005 1035 1062 1081 1098 1107 1118 1126 1129
61 705 777 897 962 1007 1037 1063 1082 1099 1108 1120 1128 1131
62 707 780 899 963 1008 1038 1065 1083 1101 1110 1122 1129 1133
63 709 782 901 965 1010 1040 1067 1085 1103 1112 1123 1131 1134
64 710 784 904 967 1012 1041 1068 1087 1104 1114 1125 1133 1136
65 712 787 906 969 1013 1043 1070 1088 1106 1115 1127 1134 1138
66 714 789 909 971 1015 1045 1072 1090 1108 1117 1129 1136 1139
67 716 791 911 973 1017 1046 1073 1092 1110 1119 1130 1138 1141
68 717 793 914 974 1019 1048 1075 1093 1112 1120 1132 1140 1142
69 719 795 916 977 1020 1050 1076 1095 1114 1122 1134 1141 1144
70 721 797 919 978 1022 1051 1078 1097 1115 1124 1136 1142 1145
71 723 800 921 980 1024 1053 1080 1099 1117 1126 1137 1144 1147
72 725 802 924 983 1025 1055 1081 1100 1118 1127 1139 1145 1149
73 726 804 926 985 1027 1057 1083 1102 1120 1129 1141 1147 1150
74 728 806 928 987 1029 1058 1084 1104 1122 1131 1142 1149 1152
75 730 809 931 989 1031 1060 1086 1106 1124 1133 1144 1150 1154
76 732 812 933 992 1033 1062 1088 1108 1126 1135 1146 1152 1156
77 735 816 936 994 1035 1064 1090 1110 1127 1137 1147 1154 1158
78 737 820 938 996 1037 1066 1092 1112 1129 1139 1149 1156 1160
79 739 824 941 998 1039 1068 1094 1114 1132 1141 1151 1158 1163
80 741 829 944 1000 1040 1070 1096 1116 1134 1143 1153 1160 1165
81 743 833 946 1003 1042 1072 1098 1118 1136 1144 1155 1162 1167
82 745 838 949 1005 1045 1074 1100 1120 1138 1146 1157 1165 1170
83 747 842 952 1007 1047 1076 1102 1122 1141 1149 1160 1167 1172
84 750 847 955 1009 1049 1078 1104 1124 1143 1151 1162 1170 1175
85 753 851 958 1012 1051 1080 1107 1127 1145 1153 1165 1172 1177
86 756 855 962 1014 1054 1082 1109 1129 1147 1155 1168 1175 1180
87 759 859 965 1017 1056 1085 1112 1132 1149 1158 1171 1178 1183
88 763 863 968 1020 1059 1088 1115 1135 1152 1161 1174 1181 1186
89 767 868 972 1023 1062 1090 1117 1138 1154 1164 1177 1184 1189
90 771 873 975 1027 1065 1093 1120 1141 1157 1167 1180 1187 1192
91 776 879 979 1030 1068 1097 1123 1144 1160 1171 1184 1190 1196
92 783 885 984 1034 1072 1100 1126 1147 1164 1175 1187 1194 1200
93 788 892 989 1038 1075 1104 1130 1151 1168 1179 1192 1198 1204
94 795 899 995 1042 1080 1109 1135 1154 1173 1184 1196 1203 1208
95 801 908 1001 1047 1084 1114 1140 1159 1178 1188 1201 1208 1214
96 809 918 1008 1053 1090 1119 1146 1165 1184 1194 1207 1214 1220
97 820 931 1016 1060 1097 1125 1152 1172 1191 1201 1214 1221 1228
98 838 947 1028 1069 1107 1135 1160 1182 1200 1210 1223 1231 1238
99 868 973 1044 1084 1121 1149 1175 1196 1215 1224 1238 1247 1255
a. Each entry is the lowest Scaled Score for that grade and percentile.



Table 58: Percentile Rank to Normal Curve Equivalent Conversions


PR NCE PR NCE PR NCE PR NCE
1 1.0 26 36.5 51 50.5 76 64.9
2 6.7 27 37.1 52 51.1 77 65.6
3 10.4 28 37.7 53 51.6 78 66.3
4 13.1 29 38.3 54 52.1 79 67.0
5 15.4 30 39.0 55 52.6 80 67.7
6 17.3 31 39.6 56 53.2 81 68.5
7 18.9 32 40.1 57 53.7 82 69.3
8 20.4 33 40.7 58 54.2 83 70.1
9 21.8 34 41.3 59 54.8 84 70.9
10 23.0 35 41.9 60 55.3 85 71.8
11 24.2 36 42.5 61 55.9 86 72.8
12 25.3 37 43.0 62 56.4 87 73.7
13 26.3 38 43.6 63 57.0 88 74.7
14 27.2 39 44.1 64 57.5 89 75.8
15 28.2 40 44.7 65 58.1 90 77.0
16 29.1 41 45.2 66 58.7 91 78.2
17 29.9 42 45.8 67 59.3 92 79.6
18 30.7 43 46.3 68 59.9 93 81.1
19 31.5 44 46.8 69 60.4 94 82.7
20 32.3 45 47.4 70 61.0 95 84.6
21 33.0 46 47.9 71 61.7 96 86.9
22 33.7 47 48.4 72 62.3 97 89.6
23 34.4 48 48.9 73 62.9 98 93.3
24 35.1 49 49.5 74 63.5 99 99.0
25 35.8 50 50.0 75 64.2
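
The values in Table 58 are consistent with the conventional definition of Normal Curve Equivalents, NCE = 50 + 21.06z, where z is the normal deviate corresponding to the percentile rank. The sketch below assumes that conventional definition (it is not a published Renaissance formula) and closely reproduces the tabled values.

from statistics import NormalDist

def pr_to_nce(percentile_rank):
    """Convert a Percentile Rank (1-99) to a Normal Curve Equivalent.

    Assumes the conventional definition NCE = 50 + 21.06 * z, where z is the
    standard normal deviate for the percentile rank.
    """
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return round(50 + 21.06 * z, 1)

# Spot checks against Table 58: PR 1 -> 1.0, PR 26 -> 36.5, PR 75 -> 64.2
for pr in (1, 26, 50, 75, 99):
    print(pr, pr_to_nce(pr))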



Table 59: Normal Curve Equivalent to Percentile Rank Conversion


NCE Range NCE Range NCE Range NCE Range
Low High PR Low High PR Low High PR Low High PR
1.0 4.0 1 36.1 36.7 26 50.3 50.7 51 64.6 65.1 76
4.1 8.5 2 36.8 37.3 27 50.8 51.2 52 65.2 65.8 77
8.6 11.7 3 37.4 38.0 28 51.3 51.8 53 65.9 66.5 78
11.8 14.1 4 38.1 38.6 29 51.9 52.3 54 66.6 67.3 79
14.2 16.2 5 38.7 39.2 30 52.4 52.8 55 67.4 68.0 80
16.3 18.0 6 39.3 39.8 31 52.9 53.4 56 68.1 68.6 81
18.1 19.6 7 39.9 40.4 32 53.5 53.9 57 68.7 69.6 82
19.7 21.0 8 40.5 40.9 33 54.0 54.4 58 69.7 70.4 83
21.1 22.3 9 41.0 41.5 34 54.5 55.0 59 70.5 71.3 84
22.4 23.5 10 41.6 42.1 35 55.1 55.5 60 71.4 72.2 85
23.6 24.6 11 42.2 42.7 36 55.6 56.1 61 72.3 73.1 86
24.7 25.7 12 42.8 43.2 37 56.2 56.6 62 73.2 74.1 87
25.8 26.7 13 43.3 43.8 38 56.7 57.2 63 74.2 75.2 88
26.8 27.6 14 43.9 44.3 39 57.3 57.8 64 75.3 76.3 89
27.7 28.5 15 44.4 44.9 40 57.9 58.3 65 76.4 77.5 90
28.6 29.4 16 45.0 45.4 41 58.4 58.9 66 77.6 78.8 91
29.5 30.2 17 45.5 45.9 42 59.0 59.5 67 78.9 80.2 92
30.3 31.0 18 46.0 46.5 43 59.6 60.1 68 80.3 81.7 93
31.1 31.8 19 46.6 47.0 44 60.2 60.7 69 81.8 83.5 94
31.9 32.6 20 47.1 47.5 45 60.8 61.3 70 83.6 85.5 95
32.7 33.3 21 47.6 48.1 46 61.4 61.9 71 85.6 88.0 96
33.4 34.0 22 48.2 48.6 47 62.0 62.5 72 88.1 91.0 97
34.1 34.7 23 48.7 49.1 48 62.6 63.1 73 91.1 95.4 98
34.8 35.4 24 49.2 49.7 49 63.2 63.8 74 95.5 99.0 99
35.5 36.0 25 49.8 50.2 50 63.9 64.5 75



Table 60: Grade Equivalent to ZPD Conversions

ZPD Range ZPD Range ZPD Range


GE Low High GE Low High GE Low High
0.0 0.0 1.0 4.4 3.2 4.9 8.8 4.6 8.8
0.1 0.1 1.1 4.5 3.2 5.0 8.9 4.6 8.9
0.2 0.2 1.2 4.6 3.2 5.1 9.0 4.6 9.0
0.3 0.3 1.3 4.7 3.3 5.2 9.1 4.6 9.1
0.4 0.4 1.4 4.8 3.3 5.2 9.2 4.6 9.2
0.5 0.5 1.5 4.9 3.4 5.3 9.3 4.6 9.3
0.6 0.6 1.6 5.0 3.4 5.4 9.4 4.6 9.4
0.7 0.7 1.7 5.1 3.5 5.5 9.5 4.7 9.5
0.8 0.8 1.8 5.2 3.5 5.5 9.6 4.7 9.6
0.9 0.9 1.9 5.3 3.6 5.6 9.7 4.7 9.7
1.0 1.0 2.0 5.4 3.6 5.6 9.8 4.7 9.8
1.1 1.1 2.1 5.5 3.7 5.7 9.9 4.7 9.9
1.2 1.2 2.2 5.6 3.8 5.8 10.0 4.7 10.0
1.3 1.3 2.3 5.7 3.8 5.9 10.1 4.7 10.1
1.4 1.4 2.4 5.8 3.9 5.9 10.2 4.7 10.2
1.5 1.5 2.5 5.9 3.9 6.0 10.3 4.7 10.3
1.6 1.6 2.6 6.0 4.0 6.1 10.4 4.7 10.4
1.7 1.7 2.7 6.1 4.0 6.2 10.5 4.8 10.5
1.8 1.8 2.8 6.2 4.1 6.3 10.6 4.8 10.6
1.9 1.9 2.9 6.3 4.1 6.3 10.7 4.8 10.7
2.0 2.0 3.0 6.4 4.2 6.4 10.8 4.8 10.8
2.1 2.1 3.1 6.5 4.2 6.5 10.9 4.8 10.9
2.2 2.1 3.1 6.6 4.2 6.6 11.0 4.8 11.0
2.3 2.2 3.2 6.7 4.2 6.7 11.1 4.8 11.1
2.4 2.2 3.2 6.8 4.3 6.8 11.2 4.8 11.2
2.5 2.3 3.3 6.9 4.3 6.9 11.3 4.8 11.3
2.6 2.4 3.4 7.0 4.3 7.0 11.4 4.8 11.4
2.7 2.4 3.4 7.1 4.3 7.1 11.5 4.9 11.5
2.8 2.5 3.5 7.2 4.3 7.2 11.6 4.9 11.6
2.9 2.5 3.5 7.3 4.4 7.3 11.7 4.9 11.7
3.0 2.6 3.6 7.4 4.4 7.4 11.8 4.9 11.8
3.1 2.6 3.7 7.5 4.4 7.5 11.9 4.9 11.9
3.2 2.7 3.8 7.6 4.4 7.6 12.0 4.9 12.0

3.3 2.7 3.8 7.7 4.4 7.7 12.1 4.9 12.1
3.4 2.8 3.9 7.8 4.5 7.8 12.2 4.9 12.2
3.5 2.8 4.0 7.9 4.5 7.9 12.3 4.9 12.3
3.6 2.8 4.1 8.0 4.5 8.0 12.4 4.9 12.4
3.7 2.9 4.2 8.1 4.5 8.1 12.5 5.0 12.5
3.8 2.9 4.3 8.2 4.5 8.2 12.6 5.0 12.6
3.9 3.0 4.4 8.3 4.5 8.3 12.7 5.0 12.7
4.0 3.0 4.5 8.4 4.5 8.4 12.8 5.0 12.8
4.1 3.0 4.6 8.5 4.6 8.5 12.9 5.0 12.9
4.2 3.1 4.7 8.6 4.6 8.6 13.0 5.0 13.0
4.3 3.1 4.8 8.7 4.6 8.7



Table 61: Scaled Score to Instructional Reading Level Conversionsa

Unified Scaled Score


IRL Low High
Pre-Primer (PP): < 0.0 600 839
Primer (P): 0.0–0.9 839 856
1.0 856 861
1.1 861 865
1.2 865 870
1.3 870 874
1.4 874 879
1.5 879 883
1.6 883 888
1.7 888 893
1.8 893 897
1.9 897 902
2.0 902 906
2.1 906 911
2.2 911 916
2.3 916 921
2.4 921 925
2.5 925 930
2.6 930 935
2.7 935 940
2.8 940 945
2.9 945 949
3.0 949 955
3.1 955 960
3.2 960 965
3.3 965 971
3.4 971 976
3.5 976 981
3.6 981 987
3.7 987 992
3.8 992 998
3.9 998 1003
4.0 1003 1007
4.1 1007 1012
4.2 1012 1016
4.3 1016 1020
4.4 1020 1025
4.5 1025 1029
4.6 1029 1033
4.7 1033 1038
4.8 1038 1042
4.9 1042 1046
5.0 1046 1048
5.1 1048 1051
5.2 1051 1053
5.3 1053 1055
5.4 1055 1057
5.5 1057 1059
5.6 1059 1061
5.7 1061 1064
5.8 1064 1066
5.9 1066 1068
6.0 1068 1070
6.1 1070 1072
6.2 1072 1074
6.3 1074 1076
6.4 1076 1079
6.5 1079 1081
6.6 1081 1083
6.7 1083 1085
6.8 1085 1087
6.9 1087 1089
7.0 1089 1091
7.1 1091 1094
7.2 1094 1096
7.3 1096 1098
7.4 1098 1100
7.5 1100 1102
7.6 1102 1104
7.7 1104 1107
7.8 1107 1109
7.9 1109 1111
8.0 1111 1113
8.1 1113 1115
8.2 1115 1117
8.3 1117 1119
8.4 1119 1122
8.5 1122 1124
8.6 1124 1126
8.7 1126 1128
8.8 1128 1130
8.9 1130 1132
9.0 1132 1135
9.1 1135 1137
9.2 1137 1139
9.3 1139 1141
9.4 1141 1143
9.5 1143 1145
9.6 1145 1147
9.7 1147 1150
9.8 1150 1152
9.9 1152 1154
10.0 1154 1156
10.1 1156 1158
10.2 1158 1160
10.3 1160 1162
10.4 1162 1165
10.5 1165 1167
10.6 1167 1169
10.7 1169 1171
10.8 1171 1173
10.9 1173 1175
11.0 1175 1177
11.1 1177 1180
11.2 1180 1182
11.3 1182 1184
11.4 1184 1186
11.5 1186 1188
11.6 1188 1190
11.7 1190 1193
11.8 1193 1195
11.9 1195 1197
12.0 1197 1199
12.1 1199 1202
12.2 1202 1204
12.3 1204 1207
12.4 1207 1209
12.5 1209 1212
12.6 1212 1214
12.7 1214 1217
12.8 1217 1219
12.9 1219 1222
Post-High School (PHS) 1222 1400
a. The figures in this table only apply to individual students, not groups.



Table 62: Relating Star Early Literacy Unified Scale Scores to Star Reading GE Scores and ZPD
Ranges
SEL Literacy Classification   Star Early Literacy Unified Scaled Score (Low, High)   Star Reading GE   ZPD Range   Recommended Assessment(s)
Emergent Reader 200 685 NA NA Star Early Literacy
686 692 0.0 0.0–1.0
693 701 0.1 0.1–1.1
702 709 0.2 0.2–1.2
710 718 0.3 0.3–1.3
719 726 0.4 0.4–1.4
727 735 0.5 0.5–1.5
736 744 0.6 0.6–1.6
745 753 0.7 0.7–1.7
754 762 0.8 0.8–1.8
763 771 0.9 0.9–1.9
772 780 1.0 1.0–2.0
Transitional Reader 781 789 1.1 1.1–2.1
SS = 786
790 798 1.2 1.2–2.2
799 807 1.3 1.3–2.3
808 816 1.4 1.4–2.4
817 824 1.5 1.5–2.5 Star Early Literacy and Star Reading
825 833 1.6 1.6–2.6
834 841 1.7 1.7–2.7
842 849 1.8 1.8–2.8
Probable Reader 850 857 1.9 1.9–2.9
SS = 852
858 865 2.0 2.0–3.0 Star Reading
866 873 2.1 2.1–3.1
874 881 2.2 2.1–3.1
882 888 2.3 2.2–3.2
889 896 2.4 2.2–3.2
897 903 2.5 2.3–3.3
904 910 2.6 2.4–3.4
911 916 2.7 2.4–3.4

917 923 2.8 2.5–3.5
924 929 2.9 2.5–3.5
930 935 3.0 2.6–3.6
936 941 3.1 2.6–3.7
942 947 3.2 2.7–3.8
948 952 3.3 2.7–3.8
953 958 3.4 2.8–3.9
959 963 3.5 2.8–4.0
964 968 3.6 2.8–4.1
969 973 3.7 2.9–4.2
974 977 3.8 2.9–4.3
978 982 3.9 3.0–4.4

Appendix A: Estimated Oral Reading Fluency

Table 63: Estimated Oral Reading Fluency (Est. ORF) Given in Words Correct
per Minute (WCPM) by Grade for Selected Star Reading Scale Score
Units (SR SS)

Grade

SR SS 1 2 3 4
50 0 4 0 8
100 29 30 32 31
150 41 40 43 41
200 55 52 52 47
250 68 64 60 57
300 82 78 71 69
350 92 92 80 80
400 111 106 97 93
450 142 118 108 104
500 142 132 120 115
550 142 152 133 127
600 142 175 147 137
650 142 175 157 145
700 142 175 167 154
750 142 175 170 168
800 142 175 170 184
850–1400 142 175 170 190
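
Because Table 63 reports Est. ORF values only at selected Scale Score points, an application that needs an estimate for an intermediate score must interpolate. The sketch below uses simple linear interpolation between adjacent tabled points for the grade 2 column; the interpolation rule and the function name are illustrative assumptions, not a documented Renaissance procedure.

from bisect import bisect_right

# Excerpt of the grade 2 column of Table 63: (SR Scaled Score, Est. ORF in WCPM).
GRADE2_ORF = [(50, 4), (100, 30), (150, 40), (200, 52), (250, 64),
              (300, 78), (350, 92), (400, 106), (450, 118), (500, 132),
              (550, 152), (600, 175)]

def estimate_orf(scaled_score, table=GRADE2_ORF):
    """Linearly interpolate an Est. ORF value between the tabled anchor points.

    Interpolating between anchors is an illustrative assumption; the manual
    publishes values only at the selected Scale Scores shown in Table 63.
    """
    scores = [s for s, _ in table]
    if scaled_score <= scores[0]:
        return table[0][1]
    if scaled_score >= scores[-1]:
        return table[-1][1]
    i = bisect_right(scores, scaled_score)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (scaled_score - x0) / (x1 - x0)

print(estimate_orf(225))  # halfway between 52 and 64 -> 58.0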

Appendix B: Detailed Evidence of Star Reading Validity

The Validity chapter of this technical manual emphasizes summaries of Star Reading validity evidence and recent evidence, which comes primarily from the 34-item, standards-based version of the assessment introduced in 2011. However, the earlier evidence, including evidence related to the 25-item versions of Star Reading, is all part of the cumulative technical support for the validity and usefulness of Star Reading. Much of that cumulative evidence is presented in this appendix to ensure that the historical continuity of research supporting Star Reading validity is not lost. The material that follows covers the following topics:
• Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement
• Relationship of Star Reading Scores to Scores on State Tests of Accountability in Reading
• Relationship of Star Reading Enterprise Scores to Scores on Previous Versions
• Data from Post-Publication Studies
• Linking Star and State Assessments: Comparing Student- and School-Level Data
• Classification Accuracy and Screening Data Reported to the National Center on Response to Intervention (NCRTI)

Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement
During the Star Reading 2.0 norming study, schools submitted their students' Star Reading results along with data on how those students performed on several standardized tests of reading achievement. These data included test results for more than 12,000 students from such tests as the California Achievement Test (CAT), the Comprehensive Test of Basic Skills (CTBS), the Iowa Test of Basic Skills (ITBS), the Metropolitan Achievement Test (MAT), the Stanford Achievement Test (SAT9), and several statewide tests.

Computing the correlation coefficients was a two-step process. First, where necessary, data were placed onto a common scale. If Scaled Scores were available, they could be correlated with Star Reading 2.0 Scaled Scores. However, since Percentile Ranks (PRs) are not on an equal-interval scale, when PRs were reported for the other tests, they were converted into Normal Curve Equivalents (NCEs). Second, the Scaled Scores or NCE scores were used to compute the Pearson product-moment correlation coefficients.
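
The sketch below illustrates that two-step procedure: percentile ranks from the external test are first converted to NCEs (because PRs are not on an equal-interval scale), and the Pearson product-moment correlation is then computed against Star Reading Scaled Scores. The score values are invented solely for illustration.

from statistics import NormalDist, correlation  # correlation requires Python 3.10+

def pr_to_nce(pr):
    """Convert a percentile rank to a Normal Curve Equivalent (NCE = 50 + 21.06z)."""
    return 50 + 21.06 * NormalDist().inv_cdf(pr / 100)

# Invented example data: Star Reading Scaled Scores and the same students'
# percentile ranks on an external reading test.
star_scaled_scores = [612, 655, 701, 748, 803, 861]
external_prs = [8, 17, 30, 46, 62, 81]

# Step 1: place the external scores on an equal-interval scale (PR -> NCE).
external_nces = [pr_to_nce(pr) for pr in external_prs]

# Step 2: compute the Pearson product-moment correlation coefficient.
r = correlation(star_scaled_scores, external_nces)
print(round(r, 2))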

In an ongoing effort to gather evidence for the validity of Star Reading scores, research on score validity has continued beyond the original validity data gathered during initial development. Numerous other studies have investigated the correlations between Star Reading tests and other external measures, yielding both concurrent and predictive validity estimates.

Table 64, Table 65, Table 66, and Table 67 present the correlation coefficients between scores on the Star Reading test and each of the other tests for which data were received. Table 64 and Table 65 display "concurrent validity" data; that is, correlations between Star Reading test scores and scores on other tests administered within a two-month window. The dates of administration ranged from spring 1999 to spring 2013. More recently, data have become available for analyses of the predictive validity of Star Reading. Predictive validity estimates the extent to which scores on the Star Reading test predict scores on criterion measures given at a later point in time, operationally defined as more than two months between the Star test (predictor) and the criterion test; it thus reflects the linear relationship between Star scores and later scores on tests covering a similar academic domain. Predictive correlations are attenuated both by time, because students gain skills in the interim between testing occasions, and by differences between the tests' content specifications. Table 66 and Table 67 present predictive validity coefficients.
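
As a concrete illustration of the operational definitions above, the sketch below classifies a Star (predictor) and criterion score pair as concurrent or predictive according to whether more than two months separate the two administrations. The two-month rule comes from the text; treating two months as 61 days is an assumption made only for this example.

from datetime import date

def classify_validity_pair(star_date: date, criterion_date: date) -> str:
    """Label a Star (predictor) / criterion test pair per the rule in the text:
    administrations within two months of each other are treated as concurrent;
    a gap of more than two months makes the pair predictive.
    "Two months" is approximated here as 61 days, which is an assumption.
    """
    gap_days = abs((criterion_date - star_date).days)
    return "concurrent" if gap_days <= 61 else "predictive"

print(classify_validity_pair(date(2012, 9, 15), date(2013, 4, 20)))  # predictive
print(classify_validity_pair(date(2013, 3, 1), date(2013, 4, 10)))   # concurrent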

The tables are presented in two parts. Table 64 and Table 66 display validity
coefficients for grades 1–6, and Table 65 and Table 67 display the validity
coefficients for grades 7–12. The bottom of each table presents a grade-by-
grade summary, including the total number of students for whom test data were
available, the number of validity coefficients for that grade, and the average value
of the validity coefficients.

The within-grade average concurrent validity coefficients for grades 1–6 varied from 0.72 to 0.80, with an overall average of 0.74. The within-grade average concurrent validity coefficients for grades 7–12 ranged from 0.65 to 0.76, with an overall average of 0.72. Predictive validity coefficients ranged from 0.69 to 0.72 in grades 1–6, with an average of 0.71; in grades 7–12 they ranged from 0.72 to 0.87, with an average of 0.80. The other validity coefficient within-grade averages (for Star Reading 2.0 with external tests administered prior to spring 1999, Table 68 and Table 69) varied from 0.60 to 0.77; the overall average was 0.72.


The process of establishing the validity of a test is laborious, and it usually takes
a significant amount of time. As a result, the validation of the Star Reading test is
an ongoing activity, with the goal of establishing evidence of the test’s validity for a
variety of settings and students. Star Reading users who collect relevant data are
encouraged to contact Renaissance Learning.

Because correlation coefficients are available for many different test editions, forms, and dates of administration, many of the tests have several validity coefficients associated with them. Data were omitted from the tabulations if (a) the quality of the test data could not be verified or (b) the sample size was very small. In general, these correlation coefficients reflect very well on the validity of the Star Reading test as a tool for reading placement; in fact, the correlations are similar in magnitude to the validity coefficients of these measures with each other. These validity results, combined with the supporting evidence of reliability and the small SEM estimates for the Star Reading test, provide a quantitative demonstration of how well the instrument performs as a measure of reading achievement.

Table 64: Concurrent Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Spring 1999–Spring 2013, Grades 1–6a

1 2 3 4 5 6

Test Form Date Score n r n r n r n r n r n r


Arkansas Augmented Benchmark Examination (AABE)
AABE S 08 SS – – – – 2,858 0.78* 2,588 0.73* 1,897 0.73* 1,176 0.75*
AIMSweb
R-CBM S 12 correct 15 0.65* 72 0.28* 41 0.17 44 0.48* – – – –
California Achievement Test (CAT)
CAT S 99 SS 93 0.80* 36 0.67* – – 34 0.72* 146 0.76* – –
CAT/5 F 10–11 SS 68 0.79* 315 0.72* 410 0.69* 302 0.71* 258 0.71* 196 0.69*
Canadian Achievement Test (CAT)
CAT/2 F 10–11 – – – – 21 0.80* 31 0.84* 23 0.75* – –
Colorado Student Assessment Program (CSAP)
CSAP S 06 SS – – – – 82 0.75* 79 0.83* 93 0.68* 280 0.80*



Comprehensive Test of Basic Skills (CTBS)
CTBS/4 S 99 NCE – – – – – – 18 0.81* – – – –
CTBS/A-19/20 S 99 SS – – – – – – – – – – 8 0.91*
Delaware Student Testing Program (DSTP) – Reading
DSTP S 05 SS – – – – 104 0.57* – – – – – –

DSTP S 06 SS – – 158 0.68* 126 0.43* 141 0.62* 157 0.59* 75 0.66*
Dynamic Indicators of Basic Early Literacy Skills (DIBELS) – Oral Reading Fluency
DIBELS F 05 WCPM – – 59 0.78* – – – – – – – –
DIBELS W 06 WCPM 61 0.87* 55 0.75* – – – – – – – –
DIBELS S 06 WCPM 67 0.87* 63 0.71* – – – – – – – –
DIBELS F 06 WCPM – – 515 0.78* 354 0.81* 202 0.72* – – – –
DIBELS W 07 WCPM 208 0.75* 415 0.73* 175 0.69* 115 0.71* – – – –
DIBELS S 07 WCPM 437 0.81* 528 0.70* 363 0.66* 208 0.54* – – – –
DIBELS F 07 WCPM – – 626 0.79* 828 0.73* 503 0.73* 46 0.73* – –
Florida Comprehensive Assessment Test (FCAT)
FCAT S 06 SS – – – – – – 41 0.65* – – – –
FCAT S 06–08 SS – – – – 10,169 0.76* 8,003 0.73* 5,474 0.73* 1,188 0.67*
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS – – – – 3,641 0.83* 3,025 0.84* 2,439 0.83* 145 0.81*
Gates–MacGinitie Reading Test (GMRT)
GMRT/2nd Ed S 99 NCE – – 21 0.89* – – – – – – – –
GMRT/L-3rd S 99 NCE – – 127 0.80* – – – – – – – –
Idaho Standards Achievement Test (ISAT)
ISAT S 07–09 SS – – – – 3,724 0.75* 2,956 0.74* 2,485 0.74* 1,309 0.75*
Illinois Standards Achievement Test – Reading
ISAT S 05 SS – – 106 0.71* 594 0.76* – – 449 0.70* – –
ISAT S 06 SS – – – – 140 0.80* 144 0.80* 146 0.72 – –



Iowa Test of Basic Skills (ITBS)
ITBS–Form K S 99 NCE 40 0.75* 36 0.84* 26 0.82* 28 0.89* 79 0.74* – –
ITBS–Form L S 99 NCE – – – – 18 0.70* 29 0.83* 41 0.78* 38 0.82*
ITBS–Form M S 99 NCE – – – – 158 0.81* – – 125 0.84* – –
ITBS–Form K S 99 SS – – 58 0.74* – – 54 0.79* – – – –
ITBS–Form L S 99 SS – – – – 45 0.73* – – – – 50 0.82*
Kansas State Assessment Program (KSAP)
KSAP S 06–08 SS – – – – 4,834 0.61* 4,045 0.61* 3,332 0.63* 1,888 0.65*
Kentucky Core Content Test (KCCT)
KCCT S 08–10 SS – – – – 10,776 0.60* 8,885 0.56* 7,147 0.53* 5,003 0.57*
Metropolitan Achievement Test (MAT)
MAT–7th Ed. S 99 NCE – – – – – – 46 0.79* – – – –
MAT–6th Ed. S 99 Raw – – – – 8 0.58* – – 8 0.85* – –
MAT–7th Ed. S 99 SS – – – – 25 0.73* 17 0.76* 21 0.76 23 0.58*
Michigan Educational Assessment Program (MEAP) – English Language Arts
MEAP F 04 SS – – – – – – 155 0.81* – – – –
MEAP F 05 SS – – – – 218 0.76* 196 0.80* 202 0.80* 207 0.69*
MEAP F 06 SS – – – – 116 0.79* 132 0.69* 154 0.81* 129 0.66*
Michigan Educational Assessment Program (MEAP) – Reading
MEAP F 04 SS – – – – – – 155 0.80* – – – –
MEAP F 05 SS – – – – 218 0.77* 196 0.78* 202 0.81* 207 0.68*
MEAP F 06 SS – – – – 116 0.75* 132 0.70* 154 0.82* 129 0.70*
Mississippi Curriculum Test (MCT2)
MCT2 S 02 SS – – – – – – 155 0.80* – – – –
MCT2 S 03 SS – – – – 218 0.77* 196 0.78* 202 0.81* 207 0.68*
MCT2 S 08 SS – – – – 3,821 0.74* 3,472 0.73* 2,915 0.71* 2,367 0.68*
Missouri Mastery Achievement Test (MMAT)
MMAT S 99 NCE – – – – – – – – 26 0.62* – –



New Jersey Assessment of Skills and Knowledge (NJ ASK)
NJ ASK S 13 SS – – – – 1,636 0.79* 1,739 0.80* 1,486 0.82* 440 0.77*
New York State Assessment Program
NYSTP S 13 SS – – – – 185 0.78* – – – – – –
North Carolina End-of-Grade (NCEOG) Test
S 99 SS – – – – – – – – 85 0.79* – –
NCEOG S 06–08 SS – – – – 2,707 0.80* 2,234 0.77* 1,752 0.77* 702 0.77*
Ohio Achievement Assessment (OAA)
OAA S 13 SS – – – – 1,718 0.72* 1,595 0.71* 1,609 0.77* 1,599 0.76*
Oklahoma Core Curriculum Test (OCCT)
OCCT S 06 SS – – – – 78 0.62* 92 0.58* 46 0.52* 80 0.60*
OCCT S 13 SS – – – – 153 0.79* 66 0.79* 72 0.80* 64 0.72*
South Dakota State Test of Educational Progress (DSTEP)
DSTEP S 08–10 SS – – – – 2,072 0.78* 1,751 0.77* 1,409 0.80* 906 0.78*
Stanford Achievement Test (SAT)
SAT 9th Ed. S 99 NCE 68 0.79* – – 26 0.44* – – – – 86 0.65*
SAT 9th Ed. S 99 SS 11 0.89* 18 0.89* 67 0.79* 66 0.79* 72 0.80* 64 0.72*
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS – – – – 8,567 0.79* 7,902 0.78* 7,272 0.76* 5,697 0.78*
Tennessee Comprehensive Assessment Program (TCAP)
TCAP S 11 SS – – – – 62 0.66* 56 0.59* – – – –
TCAP S 12 SS – – – – 91 0.79* 118 0.21* 81 0.64* – –
TCAP S 13 SS – – – – 494 0.73* 441 0.66* 426 0.77* – –
TerraNova
TerraNova S 99 SS – – 61 0.72* 117 0.78* – – – – – –
Texas Assessment of Academic Skills (TAAS)
TAAS S 99 NCE – – – – – – – – – – 229 0.66*
Transitional Colorado Assessment Program (TCAP)
TCAP S 12–13 SS – – – – 3,144 0.78* 3,200 0.82* 3,186 0.81* 3,106 0.83*



West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS – – – – 2,949 0.76* 7,537 0.77* 5,666 0.76* 2,390 0.75*
Woodcock Reading Mastery (WRM)
S 99 – – – – – – – – 7 0.68* 7 0.66*
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE F 06–10 SS 8,649 0.78* 7,537 0.77* 5,666 0.76* 2,390 0.75*

Summary
Grade(s) All 1 2 3 4 5 6
Number of students 255,538 1,068 3,629 76,942 66,400 54,173 31,686
Number of coefficients 195 10 18 47 47 41 32
Average validity – 0.80 0.73 0.72 0.72 0.74 0.72
Overall average 0.74

a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.

Table 65: Concurrent Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Spring 1999–Spring 2013, Grades 7–12a

7 8 9 10 11 12

Test Form Date Score n r n r n r n r n r n r


Arkansas Augmented Benchmark Examination (AABE)
AABE S 08 SS 318 0.79* 278 0.76* – – – – – – – –
California Achievement Test (CAT)
CAT/5 S 99 NCE – – – – 59 0.65* – – – – – –
CAT/5 S 99 SS 124 0.74* 131 0.76* – – – – – – – –
CAT/5 F 10–11 SS 146 0.75* 139 0.79* 92 0.64* 81 0.82* 48 0.79* 39 0.73*
Colorado Student Assessment Program (CSAP)
CSAP S 06 SS 299 0.84* 185 0.83* – – – – – – – –
Delaware Students Testing Program (DSTP) – Reading
DSTP S 05 SS – – – – – – 112 0.78* – – – –
DSTP S 06 SS 150 0.72* – – – – – – – – – –


Florida Comprehensive Assessment Test (FCAT)
FCAT S 06 SS – – 74 0.65* – – – – – – – –
FCAT S 06–08 SS 1,119 0.74* 618 0.76* – – – – – – – –
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS 158 0.83* 111 0.81* – – – – – – – –
Idaho Standards Achievement Test (ISAT)
ISAT S 06–08 SS 851 0.78* 895 0.71* – – – – – – – –
Illinois Standards Achievement Test (ISAT) – Reading
ISAT S 05 SS – – 157 0.73* – – – – – – – –
ISAT S 06 SS 140 0.70* – – – – – – – – – –
Iowa Test of Basic Skills (ITBS)
ITBS–K S 99 NCE – – – – 67 0.78* – – – – – –
ITBS–L S 99 SS 47 0.56* – – 65 0.64* – – – – – –
Kansas State Assessment Program (KSAP)
KSAP S 06–08 SS 1,147 0.70* 876 0.71* – – – – – – – –
Kentucky Core Content Test (KCCT)
KCCT S 08–10 SS 2,572 0.56* 1,198 0.56* – – – – – – – –
Michigan Educational Assessment Program – English Language Arts
MEAP F 04 SS 154 0.68* – – – – – – – – – –
MEAP F 05 SS 233 0.72* 239 0.70* – – – – – – – –
MEAP F 06 SS 125 0.79* 152 0.74* – – – – – – – –
Michigan Educational Assessment Program – Reading
MEAP–R F 04 SS 154 0.68* – – – – – – – – – –
MEAP–R F 05 SS 233 0.72* 239 0.70* – – – – – – – –
MEAP–R F 06 SS 125 0.79* 152 0.74* – – – – – – – –
Mississippi Curriculum Test (MCT2)
MCT2 S 03 SS 372 0.70* – – – – – – – – – –
MCT2 S 08 SS 1,424 0.69* 1,108 0.72* – – – – – – – –
Missouri Mastery Achievement Test (MMAT)
MMAT S 99 NCE – – 29 0.78* 19 0.71* – – – – – –


North Carolina End-of-Grade (NCEOG) Test
NCEOG S 06–08 SS 440 0.76* 493 0.74* – – – – – – – –
New Jersey Assessment of Skills and Knowledge (NJ ASK)
NJ ASK S 13 SS 595 0.78* 589 0.70* – – – – – – – –
Northwest Evaluation Association Levels Test (NWEA)
NWEA-Achieve S 99 NCE – – 124 0.66* – – – – – – – –
South Dakota State Test of Educational Progress (DSTEP)
DSTEP S 08–10 SS 917 0.78* 780 0.77* – – – – – – – –
Stanford Achievement Test (SAT)
SAT–9th Ed. S 99 NCE 50 0.65* 50 0.51* – – – – – – – –
SAT–9th Ed. S 99 SS 70 0.70* 68 0.80* – – – – – – – –
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS 5,062 0.75* 4,651 0.75* – – – – – – – –
Test Achievement and Proficiency (TAP)
TAP S 99 NCE – – – – 6 0.42 13 0.80* 7 0.6 – –
Texas Assessment of Academic Skills (TAAS)
TAAS S 99 NCE – – – – – – 43 0.60* – – – –
Transitional Colorado Assessment Program (TCAP)
TCAP S 12–13 SS 3,165 0.83* 3,106 0.83* 1,466 0.72* – – – – – –
West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS 1,612 0.76 1,396 0.75 – – – – – – – –
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE F 06–10 SS 1,811 0.81 1,886 0.77 – – 506 0.79 – – – –
Wide Range Achievement Test 3 (WRAT3)
WRAT3 S 99 – – 17 0.81* – – – – – – – –

Summary
Grade(s) All 7 8 9 10 11 12
Number of students 48,789 25,032 21,134 1,774 755 55 39
Number of coefficients 74 30 29 7 5 2 1
Average validity – 0.74 0.73 0.65 0.76 0.70 0.73
Overall average 0.72
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.

Star Assessments™ for Reading


Technical Manual 144
Appendix B: Detailed Evidence of Star Reading Validity
Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement

Table 66: Predictive Validity Data: Star Reading 2 Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 1–6a

1 2 3 4 5 6
Test Form Dateb Score n r n r n r n r n r n r
AIMSweb
R-CBM S 12 correct 60 0.14 156 0.38* 105 0.11 102 0.52* – – – –
Arkansas Augmented Benchmark Examination (AABE)
AABE F 07 SS – – – – 5,255 0.79* 5,208 0.77* 3,884 0.75* 3,312 0.75*
Colorado Student Assessment Program (CSAP)
CSAP F 04 – – – – – 82 0.72* 79 0.77* 93 0.70* 280 0.77*
Delaware Student Testing Program (DSTP) – Reading
DSTP S 05 – – – – – 189 0.58* – – – – – –
DSTP W 05 – – – – – 120 0.67* – – – – – –
DSTP S 05 – – – – – 161 0.52* 191 0.55* 190 0.62* – –
DSTP F 05 – – – 253 0.64* 214 0.39* 256 0.62* 270 0.59* 242 0.71*
DSTP W 05 – – – 275 0.61* 233 0.47* 276 0.59* 281 0.62* 146 0.57*
Florida Comprehensive Assessment Test (FCAT)
FCAT F 05 – – – – – – – 42 0.73* – – 409 0.67*
FCAT W 07 – – – – – – – – – – – 417 0.76*
FCAT F 05–07 SS – – – – 25,192 0.78* 21,650 0.75* 17,469 0.75* 9,998 0.73*
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS – – – – 6,788 0.78* 5,894 0.80* 5,374 0.80* 616 0.74*
Idaho Standards Achievement Test (ISAT)
ISAT F 08–10 SS – – – – 8,219 0.77* 8,274 0.77* 7,537 0.76* 5,742 0.77*
Illinois Standards Achievement Test (ISAT) – Reading
ISAT–R F 05 – – – – – 450 0.73* – – 317 0.68* – –
ISAT–R W 05 – – – – – 564 0.76* – – 403 0.68* – –
ISAT–R F 05 – – – – – 133 0.73* 140 0.74* 145 0.66* – –
ISAT–R W 06 – – – – – 138 0.76* 145 0.77* 146 0.70* – –
Iowa Assessment
IA F 12 SS – – – – 1,763 0.61* 1,826 0.61* 1,926 0.59* 1,554 0.64*
IA W 12 SS – – – – 548 0.60* 661 0.62* 493 0.64* 428 0.65*
IA S 12 SS – – – – 1,808 0.63* 1,900 0.63* 1,842 0.65* 1,610 0.63*

Kentucky Core Content Test (KCCT)
KCCT F 07–09 SS – – – – 16,521 0.62* 15,143 0.57* 12,549 0.53* 9,091 0.58*
Michigan Educational Assessment Program (MEAP) – English Language Arts
MEAP–EL F 04 – – – – – 193 0.60* 181 0.70* 170 0.75* 192 0.66*
MEAP–EL W 05 – – – – – 204 0.68* 184 0.74* 193 0.75* 200 0.70*
MEAP–EL S 05 – – – – – 192 0.73* 171 0.73* 191 0.71* 193 0.62*
MEAP–EL F 05 – – – – – 111 0.66* 132 0.71* 119 0.77* 108 0.60*
MEAP–EL W 06 – – – – – 114 0.77* – – 121 0.75* 109 0.66*
Michigan Educational Assessment Program (MEAP) – Reading
MEAP–R F 04 – – – – – 193 0.60* 181 0.69* 170 0.76* 192 0.66*
MEAP–R W 05 – – – – – 204 0.69* 184 0.74* 193 0.78* 200 0.70*
MEAP–R S 05 – – – – – 192 0.72* 171 0.72* 191 0.74* 193 0.62*
MEAP–R F 05 – – – – – 111 0.63* 132 0.70* 119 0.78* 108 0.62*
MEAP–R W 06 – – – – – 114 0.72* – – 121 0.75* 109 0.64*
Mississippi Curriculum Test (MCT2)
MCT2 F 01 – – – 86 0.57* 95 0.70* 97 0.65* 78 0.76* – –
MCT2 F 02 – – – 340 0.67* 337 0.67* 282 0.69* 407 0.71* 442 0.72*
MCT2 F 07 SS – – – – 6,184 0.77* 5,515 0.74* 5,409 0.74* 4,426 0.68*
North Carolina End–of–Grade (NCEOG) Test
NCEOG F 05–07 SS – – – – 6,976 0.81* 6,531 0.78* 6,077 0.77* 3,255 0.77*
New York State Assessment Program
NYSTP S 13 SS – – – – 349 0.73* – – – – – –
Ohio Achievement Assessment
OAA S 13 SS – – – – 28 0.78* 41 0.52* 29 0.79* 30 0.75*
Oklahoma Core Curriculum Test (OCCT)
OCCT F 04 – – – – – – – – – 44 0.63* – –
OCCT W 05 – – – – – – – – – 45 0.66* – –
OCCT F 05 – – – – – 89 0.59* 90 0.60* 79 0.69* 84 0.63*
OCCT W 06 – – – – – 60 0.65* 40 0.67* – – – –

South Dakota State Test of Educational Progress (DSTEP)
DSTEP F 07–09 SS – – – – 3,909 0.79* 3,679 0.78* 3,293 0.78* 2,797 0.79*
Star Reading
Star–R F 05 – 16,982 0.66* 42,601 0.78* 46,237 0.81* 44,125 0.83* 34,380 0.83* 23,378 0.84*
Star–R F 06 – 25,513 0.67* 63,835 0.78* 69,835 0.81* 65,157 0.82* 57,079 0.83* 35,103 0.83*
Star–R F 05 – 8,098 0.65* 20,261 0.79* 20,091 0.81* 18,318 0.82* 7,621 0.82* 5,021 0.82*
Star–R F 05 – 8,098 0.55* 20,261 0.72* 20,091 0.77* 18,318 0.80* 7,621 0.80* 5,021 0.79*
Star–R S 06 – 8,098 0.84* 20,261 0.82* 20,091 0.83* 18,318 0.83* 7,621 0.83* 5,021 0.83*
Star–R S 06 – 8,098 0.79* 20,261 0.80* 20,091 0.81* 18,318 0.82* 7,621 0.82* 5,021 0.81*
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS – – – – 6,132 0.81* 5,744 0.80* 5,327 0.79* 5,143 0.79*
Tennessee Comprehensive Assessment Program (TCAP)
TCAP S 11 SS – – – – 695 0.68* 602 0.72* 315 0.61* – –
TCAP S 12 SS – – – – 763 0.70* 831 0.33* 698 0.65* – –
TCAP S 13 SS – – – – 2,509 0.67* 1,897 0.63* 1,939 0.68* 431 0.65*
West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS – – – – 2,828 0.80* 3,078 0.73* 3,246 0.73* 3,214 0.73*
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE S 05–09 SS 15,706 0.75* 15,569 0.77* 13,980 0.78* 10,641 0.78*

Summary
Grade(s) All 1 2 3 4 5 6
Number of students 1,227,887 74,887 188,434 313,102 289,571 217,416 144,477
Number of coefficients 194 6 10 49 43 47 39
Average validity – 0.69 0.72 0.70 0.71 0.72 0.71
Overall average 0.71
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
b. Dates correspond to the term and year of the predictor scores. With some exceptions, criterion scores were obtained during
the same academic year. In some cases, data representing multiple years were combined. These dates are reported as a
range (e.g. Fall 05–Fall 07).

Star Assessments™ for Reading


Technical Manual 147
Appendix B: Detailed Evidence of Star Reading Validity
Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement

Table 67: Predictive Validity Data: Star Reading 2 Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 7–12a

7 8 9 10 11 12
Test Form Dateb Score n r n r n r n r n r n r
Arkansas Augmented Benchmark Examination (AABE)
AABE F 07 SS 2,418 0.74* 1,591 0.75* – – – – – – – –
Colorado Student Assessment Program (CSAP)
CSAP F 05 – 299 0.83* 185 0.83* – – – – – – – –
Delaware Student Testing Program (DSTP) – Reading
DSTP S 05 – 100 0.75* 143 0.63* – – 48 0.66* – – – –
DSTP F 05 – 273 0.69* 247 0.70* 152 0.73* 97 0.78* – – – –
DSTP W 05 – – – 61 0.64* 230 0.64* 145 0.71* – – – –
Florida Comprehensive Assessment Test (FCAT)
FCAT F 05 – 381 0.61* 387 0.62* – – – – – – – –
FCAT W 07 – 342 0.64* 361 0.72* – – – – – – – –
FCAT F 05–07 SS 8,525 0.72* 6,216 0.72* – – – – – – – –
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS 586 0.75* 653 0.78* – – – – – – – –
Idaho Standards Achievement Test (ISAT)
ISAT F 05–07 SS 4,119 0.76* 3,261 0.73* – – – – – – – –
Illinois Standards Achievement Test (ISAT) – Reading
ISAT F 05 – 173 0.51* 158 0.66* – – – – – – – –
Iowa Assessment
IA F 12 SS 1,264 0.60* 905 0.63* – – – – – – – –
IA W 12 SS 118 0.66* 72 0.67* – – – – – – – –
IA S 12 SS 1,326 0.68* 1,250 0.66* – – – – – – – –
Kentucky Core Content Test (KCCT)
KCCT F 07–09 SS 4,962 0.57* 2,530 0.58* – – – – – – – –
Michigan Educational Assessment Program (MEAP) – English Language Arts
MEAP F 04 – 181 0.71* 88 0.85* – – – – – – – –
MEAP W 05 – 214 0.73* 212 0.73* – – – – – – – –
MEAP S 05 – 206 0.75* 223 0.69* – – – – – – – –
MEAP F 05 – 114 0.66* 126 0.66* – – – – – – – –
MEAP W 06 – 114 0.64* 136 0.71* – – – – – – – –
MEAP S 06 – – – 30 0.80* – – – – – – – –
Michigan Educational Assessment Program (MEAP) – Reading
MEAP–R F 04 – 181 0.70* 88 0.84* – – – – – – – –
MEAP–R W 05 – 214 0.72* 212 0.73* – – – – – – – –
MEAP–R S 05 – 206 0.72* 223 0.69* – – – – – – – –
MEAP–R F 05 – 116 0.68* 138 0.66* – – – – – – – –
MEAP–R W 06 – 116 0.68* 138 0.70* – – – – – – – –
MEAP–R S 06 – – – 30 0.81* – – – – – – – –
Mississippi Curriculum Test (MCT2)
MCT2 F 02 – 425 0.68* – – – – – – – – – –
MCT2 F 07 SS 3,704 0.68* 3,491 0.73* – – – – – – – –
North Carolina End–of–Grade (NCEOG) Test
NCEOG F 05–07 SS 2,735 0.77* 2,817 0.77* – – – – – – – –
Ohio Achievement Assessment
OAA S 13 SS 53 0.82* 38 0.66* – – – – – – – –
South Dakota State Test of Educational Progress (DSTEP)
DSTEP F 07–09 SS 2,236 0.79* 2,073 0.78* – – – – – – – –
Star Reading
Star–R F 05 – 17,370 0.82* 9,862 0.82* 2,462 0.82* 15,277 0.85* 1,443 0.83* 596 0.85*
Star–R F 06 – 22,177 0.82* 19,152 0.82* 4,087 0.84* 2,624 0.85* 2,930 0.85* 2,511 0.86*
Star–R F 05 – 5,399 0.81* 641 0.76* 659 0.89* 645 0.88* 570 0.90* – –
Star–R F 05 – 5,399 0.79* 641 0.76* 659 0.83* 645 0.83* 570 0.87* – –
Star–R S 06 – 5,399 0.82* 641 0.83* 659 0.87* 645 0.88* 570 0.89* – –
Star–R S 06 – 5,399 0.80* 641 0.83* 659 0.85* 645 0.85* 570 0.86*
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS 4,716 0.77* 4,507 0.76* – – – – – – – –
Tennessee Comprehensive Assessment Program (TCAP)
TCAP S 13 SS 332 0.81* 233 0.74* – – – – – – – –
West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS 2,852 0.71* 2,636 0.74* – – – – – – – –
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE S 05–09 SS 6,399 0.78* 5,500 0.78* 401 0.78*

Summary
Grade(s) All 7 8 9 10 11 12
Number of students 224,179 111,143 72,537 9,567 21,172 6,653 3,107
Number of coefficients 106 39 41 8 10 6 2
Average validity – 0.72 0.73 0.81 0.81 0.87 0.86
Overall average 0.80
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
b. Dates correspond to the term and year of the predictor scores. With some exceptions, criterion scores were obtained during
the same academic year. In some cases, data representing multiple years were combined. These dates are reported as a
range (e.g. Fall 05–Fall 07).

Star Assessments™ for Reading


Technical Manual 150
Appendix B: Detailed Evidence of Star Reading Validity
Relationship of Star Reading Scores to Scores on Other Tests of Reading Achievement

Table 68: Other External Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Prior to Spring 1999, Grades 1–6a

1 2 3 4 5 6

Test Form Date Score n r n r n r n r n r n r


American Testronics
Level C-3 Spr 98 Scaled – – 20 0.71* – – – – – – – –
California Achievement Test (CAT)
/4 Spr 98 Scaled – – 16 0.82* – – 54 0.65* – – 10 0.88*
/5 Spr 98 Scaled – – – – 40 0.82* 103 0.85* – – – –
/5 Fall 98 NCE 40 0.83* – – – – – – – – – –
/5 Fall 98 Scaled – – – – 39 0.85* – – – – – –
Comprehensive Test of Basic Skills (CTBS)
A-15 Fall 97 NCE – – – – – – – – – – 24 0.79*
/4 Spr 97 Scaled – – – – – – – – 31 0.61* – –
/4 Spr 98 Scaled – – – – – – 6 0.49 68 0.76* – –
A-19/20 Spr 98 Scaled – – – – – – – – 10 0.73* – –
A-15 Spr 98 Scaled – – – – – – – – – – 93 0.81*
A-16 Fall 98 NCE – – – – – – – – – – 73 0.67*
Degrees of Reading Power (DRP)
Spr 98 – – – – 8 0.71* – – 25 0.72* 23 0.38
Gates-MacGinitie Reading Test (GMRT)
2nd Ed., D Spr 98 NCE – – – – – – – – – – 47 0.80*
L-3rd Spr 98 NCE – – 31 0.69* 27 0.62* – – – – – –
L-3rd Fall 98 NCE 60 0.64* – – 66 0.83* – – – – – –
Indiana Statewide Testing for Educational Progress (ISTEP)
Fall 98 NCE – – – – 19 0.80* – – – – 21 0.79*
Iowa Test of Basic Skills (ITBS)
Form K Spr 98 NCE – – – – 88 0.74* 17 0.59* – – 21 0.83*
Form L Spr 98 NCE – – – – 50 0.84* – – – – 57 0.66*
Form M Spr 98 NCE – – 68 0.71* – – – – – – – –
Form K Fall 98 NCE – – 67 0.66* 43 0.73* 67 0.74* 28 0.81* – –
Form L Fall 98 NCE – – – – – – 27 0.88* 6 0.97* 37 0.60*
Form M Fall 98 NCE – – 65 0.81* – – 53 0.72* – – – –



Metropolitan Achievement Test (MAT)
7th Ed. Spr 98 NCE – – – – – – 29 0.67* 22 0.68* 17 0.86*
6th Ed Spr 98 Raw – – – – – – 6 0.91* – – 5 0.67
7th Ed. Spr 98 Scaled – – 48 0.75* – – – – 30 0.79* – –
7th Ed. Fall 98 NCE – – – – – – – – – – 49 0.75*
Metropolitan Readiness Test (MRT)
Spr 96 NCE – – – – 5 0.81 – – – – – –
Spr 98 NCE 4 0.63 – – – – – – – – – –
Missouri Mastery Achievement Test (MMAT)
Spr 98 Scaled – – – – 12 0.44 – – 14 0.75* 24 0.62*
New York State Pupil Evaluation Program (P&P)
Spr 98 – – – – – – 13 0.92* – – – –
North Carolina End of Grade Test (NCEOG)
Spr 98 Scaled – – – – – – – – 53 0.76* – –
NRT Practice Achievement Test (NRT)
Practice Spr 98 NCE – – 56 0.71* – – – – – – – –
Stanford Achievement Test (Stanford)
9th Ed. Spr 97 Scaled – – – – – – – – 68 0.65* – –
7th Ed. Spr 98 Scaled 11 0.73* 7 0.94* 8 0.65 15 0.82* 7 0.87* 8 0.87*
8th Ed. Spr 98 Scaled 8 0.94* 8 0.64 6 0.68 11 0.76* 8 0.49 7 0.36
9th Ed. Spr 98 Scaled 13 0.73* 93 0.73* 19 0.62* 314 0.74* 128 0.72* 62 0.67*
4th Ed. 3/V Spr 98 Scaled 14 0.76* – – – – – – – – – –
9th Ed. Fall 98 NCE – – – – 45 0.89* – – 35 0.68* – –
9th Ed. Fall 98 Scaled – – 88 0.60* 25 0.79* – – 196 0.73* – –
9th Ed. 2/SA Fall 98 Scaled – – – – 103 0.69* – – – – – –
Tennessee Comprehensive Assessment Program (TCAP)
Spr 98 Scaled – – 30 0.75* – – – – – – – –



TerraNova
Fall 97 Scaled – – – – – – – – 56 0.70* – –
Spr 98 NCE – – – – 76 0.63* – – – – – –
Spr 98 Scaled – – 94 0.50* 55 0.79* 299 0.75* 86 0.75* 23 0.59*
Fall 98 NCE – – – – – – – – – – 126 0.74*
Fall 98 Scaled – – – – – – 14 0.70* – – 15 0.77*
Wide Range Achievement Test 3 (WRAT3)
Fall 98 – – – – – – – – – – 10 0.89*
Wisconsin Reading Comprehension Test
Spr 98 – – – – – – 63 0.58* – – – –

Summary
Grade(s) All 1 2 3 4 5 6
Number of students 4,289 150 691 734 1,091 871 752
Number of coefficients 95 7 14 19 16 18 21
Average validity – 0.75 0.72 0.73 0.74 0.73 0.71
Overall average 0.73

a. Sample sizes are in the columns labeled “n.”


* Denotes correlation coefficients that are statistically significant at the 0.05 level.


Table 69: Other External Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Prior to Spring 1999, Grades 7–12a

7 8 9 10 11 12
Test Form Date Score n r n r n r n r n r n r
California Achievement Test (CAT)
/4 Spr 98 Scaled – – 11 0.75* – – – – – – – –
/5 Spr 98 NCE 80 0.85* – – – – – – – – – –
Comprehensive Test of Basic Skills (CTBS)
/4 Spr 97 NCE – – 12 0.68* – – – – – – – –
/4 Spr 98 NCE 43 0.84* – – – – – – – – – –
/4 Spr 98 Scaled 107 0.44* 15 0.57* 43 0.86* – – – – – –
A-16 Spr 98 Scaled 24 0.82* – – – – – – – – – –
Explore (ACT Program for Educational Planning, 8th Grade)
Fall 97 NCE – – – – 67 0.72* – – – – – –
Fall 98 NCE – – 32 0.66* – – – – – – – –
Iowa Test of Basic Skills (ITBS)
Form K Spr 98 NCE – – – – 35 0.84* – – – – – –
Form K Fall 98 NCE 32 0.87* 43 0.61* – – – – – – – –
Form K Fall 98 Scaled 72 0.77* 67 0.65* 77 0.78* – – – – – –
Form L Fall 98 NCE 19 0.78* 13 0.73* – – – – – – – –
Metropolitan Achievement Test (MAT)
7th Ed. Spr 97 Scaled 114 0.70* – – – – – – – – – –
7th Ed. Spr 98 NCE 46 0.84* 63 0.86* – – – – – – – –
7th Ed. Spr 98 Scaled 88 0.70* – – – – – – – – – –
7th Ed. Fall 98 NCE 50 0.55* 48 0.75* – – – – – – – –
Missouri Mastery Achievement Test (MMAT)
Spr 98 Scaled 24 0.62* 12 0.72* – – – – – – – –
North Carolina End of Grade Test (NCEOG)
Spr 97 Scaled – – – – – – 58 0.81* – – – –
Spr 98 Scaled – – – – 73 0.57* – – – – – –
PLAN (ACT Program for Educational Planning, 10th Grade)
Fall 97 NCE – – – – – – – – 46 0.71* – –
Fall 98 NCE – – – – – – 104 0.53* – – – –
Preliminary Scholastic Aptitude Test (PSAT)
Fall 98 Scaled – – – – – – – – 78 0.67* – –

Stanford Achievement Test (Stanford)
9th Ed. Spr 97 Scaled – – – – – – – – – – 11 0.90*
7th Ed. Spr 98 Scaled – – 8 0.83* – – – – – – – –
8th Ed. Spr 98 Scaled 6 0.89* 8 0.78* 91 0.62* – – 93 0.72* – –
9th Ed. Spr 98 Scaled 72 0.73* 78 0.71* 233 0.76* 32 0.25 64 0.76* – –
4th Ed. 3/V Spr 98 Scaled – – – – – – 55 0.68* – – – –
9th Ed. Fall 98 NCE 92 0.67* – – – – – – – – – –
9th Ed. Fall 98 Scaled – – – – 93 0.75* – – – – 70 0.75*
Stanford Reading Test
3rd Ed. Fall 97 NCE – – – – 5 0.81 24 0.82* – – – –
TerraNova
Fall 97 NCE 103 0.69* – – – – – – – – – –
Spr 98 Scaled – – 87 0.82* – – 21 0.47* – – – –
Fall 98 NCE 35 0.69* 32 0.74* – – – – – – – –
Test of Achievement and Proficiency (TAP)
Spr 97 NCE – – – – – – – – 36 0.59* – –
Spr 98 NCE – – – – – – 41 0.66* – – 43 0.83*
Texas Assessment of Academic Skills (TAAS)
Spr 97 TLI – – – – – – – – – – 41 0.58*
Wide Range Achievement Test 3 (WRAT3)
Spr 98 9 0.35 – – – – – – – – – –
Fall 98 – – – – 16 0.80* – – – – – –
Wisconsin Reading Comprehension Test
Spr 98 – – – – – – 63 0.58* – – – –

Summary
Grade(s) All 7 8 9 10 11 12
Number of students 3,158 1,016 529 733 398 317 165
Number of coefficients 60 18 15 10 8 5 4
Average validity – 0.71 0.72 0.75 0.60 0.69 0.77
Overall average 0.71

a. Sample sizes are in the columns labeled “n.”


* Denotes correlation coefficients that are statistically significant at the 0.05 level.


Relationship of Star Reading Scores to Scores on State Tests of Accountability in Reading
In the US, following the passage of the No Child Left Behind Act (NCLB) in 2001, all states moved to comprehensive tests of grade-level standards for purposes of accountability. This created interest in the degree to which Star Reading test scores are related to state accountability test scores. The following section provides information about the validity of Star scores relative to state tests of the NCLB era, presenting both concurrent and predictive validity results (defined earlier) for a variety of state accountability tests.

Table 70 and Table 71 provide a variety of concurrent and predictive validity coefficients, respectively, for grades 3–8. Numerous state accountability tests have been used in this research.

Table 70: Concurrent Validity Data: Star Reading 2 Correlations (r) with State Accountability Tests,
Grades 3–8a

3 4 5 6 7 8

Date Score n r n r n r n r n r n r
Colorado Student Assessment Program
Spr 06 Scaled 82 0.75* 79 0.83* 93 0.68* 280 0.80* 299 0.84* 185 0.83*
Delaware Student Testing Program—Reading
Spr 05 Scaled 104 0.57* – – – – – – – – – –
Spr 06 Scaled 126 0.43* 141 0.62* 157 0.59* 75 0.66* 150 0.72 – –
Florida Comprehensive Assessment Test
Spr 06 SSS – – 41 0.65* – – – – – – 74 0.65*
Illinois Standards Achievement Test—Reading
Spr 05 Scaled 594 0.76* – – 449 0.70* – – – – 157 0.73*
Spr 06 Scaled 140 0.80* 144 0.80* 146 0.72* – – 140 0.70* – –
Michigan Educational Assessment Program—English Language Arts
Fall 04 Scaled – – 155 0.81* – – – – 154 0.68* – –
Fall 05 Scaled 218 0.76* 196 0.80* 202 0.80* 207 0.69* 233 0.72* 239 0.70*
Fall 06 Scaled 116 0.79* 132 0.69* 154 0.81* 129 0.66* 125 0.79* 152 0.74*
Michigan Educational Assessment Program—Reading
Fall 04 Scaled – – 155 0.80* – – – – 156 0.68* – –
Fall 05 Scaled 218 0.77* 196 0.78* 202 0.81* 207 0.68* 233 0.71* 239 0.69*
Fall 06 Scaled 116 0.75* 132 0.70* 154 0.82* 129 0.70* 125 0.86* 154 0.72*
Mississippi Curriculum Test
Spr 02 Scaled 148 0.62* 175 0.66* 81 0.69* – – – – – –
Spr 03 Scaled 389 0.71* 359 0.70* 377 0.70* 364 0.72* 372 0.70* – –
Oklahoma Core Curriculum Test
Spr 06 Scaled 78 0.62* 92 0.58* 46 0.52* 80 0.60* – – – –

Summary
Grades All 3 4 5 6 7 8
Number of students 11,045 2,329 1,997 2,061 1,471 1,987 1,200
Number of coefficients 61 12 13 11 8 10 7
Average validity – 0.72 0.73 0.73 0.71 0.74 0.73
Overall validity 0.73
a. Sample sizes are in the columns labeled “n.”
* Denotes correlation coefficients that are statistically significant (p < 0.05).


Table 71: Predictive Validity Data: Star Reading Scaled Scores Predicting Later Performance for
Grades 3–8 on Numerous State Accountability Testsa

3 4 5 6 7 8
Predictor Criterion
Date Dateb n r n r n r n r n r n r
Colorado Student Assessment Program
Fall 05 Spr 06 82 0.72* 79 0.77* 93 0.70* 280 0.77* 299 0.83* 185 0.83*
Delaware Student Testing Program—Reading
Fall 04 Spr 05 189 0.58* – – – – – – – – – –
Win 05 Spr 05 120 0.67* – – – – – – – – – –
Spr 05 Spr 06 161 0.52* 191 0.55* 190 0.62* – – 100 0.75* 143 0.63*
Fall 05 Spr 06 214 0.39* 256 0.62* 270 0.59* 242 0.71* 273 0.69* 247 0.70*
Win 05 Spr 06 233 0.47* 276 0.59* 281 0.62* 146 0.57* – – 61 0.64*
Florida Comprehensive Assessment Test
Fall 05 Spr 06 – – 42 0.73* – – 409 0.67* 381 0.61* 387 0.62*
Win 07 Spr 07 – – – – – – 417 0.76* 342 0.64* 361 0.72*
Illinois Standards Achievement Test—Reading
Fall 04 Spr 05 450 0.73* – – 317 0.68* – – – – – –
Win 05 Spr 05 564 0.76* – – 403 0.68* – – – – – –
Fall 05 Spr 06 133 0.73* 140 0.74* 145 0.66* – – 173 0.51* 158 0.66*
Win 06 Spr 06 138 0.76* 145 0.77* 146 0.70* – – – – – –
Michigan Educational Assessment Program—English Language Arts
Fall 04 Fall 05P 193 0.60* 181 0.70* 170 0.75* 192 0.66* 181 0.71* 88 0.85*
Win 05 Fall 05P 204 0.68* 184 0.74* 193 0.75* 200 0.70* 214 0.73* 212 0.73*
Spr 05 Fall 05P 192 0.73* 171 0.73* 191 0.71* 193 0.62* 206 0.75* 223 0.69*
Fall 05 Fall 06P 111 0.66* 132 0.71* 119 0.77* 108 0.60* 114 0.66* 126 0.66*
Win 06 Fall 06P 114 0.77* – – 121 0.75* 109 0.66* 114 0.64* 136 0.71*
Spr 06 Fall 06P – – – – – – – – – – 30 0.80*
Michigan Educational Assessment Program—Reading
Fall 04 Fall 05P 193 0.60* 181 0.69* 170 0.76* 192 0.66* 181 0.70* 88 0.84*
Win 05 Fall 05P 204 0.69* 184 0.74* 193 0.78* 200 0.70* 214 0.72* 212 0.73*
Spr 05 Fall 05P 192 0.72* 171 0.72* 191 0.74* 193 0.62* 206 0.72* 223 0.69*
Fall 05 Fall 06P 111 0.63* 132 0.70* 119 0.78* 108 0.62* 116 0.68* 138 0.66*
Win 06 Fall 06P 114 0.72* – – 121 0.75* 109 0.64* 116 0.68* 138 0.70*
Spr 06 Fall 06P – – – – – – – – – – 30 0.81*
Mississippi Curriculum Test
Fall 01 Spr 02 95 0.70* 97 0.65* 78 0.76* – – – – – –
Fall 02 Spr 03 337 0.67* 282 0.69* 407 0.71* 442 0.72* 425 0.68* – –
Oklahoma Core Curriculum Test
Fall 04 Spr 05 – – – – 44 0.63* – – – – – –
Win 05 Spr 05 – – – – 45 0.66* – – – – – –
Fall 05 Spr 06 89 0.59* 90 0.60* 79 0.69* 84 0.63* – – – –
Win 06 Spr 06 60 0.65* 40 0.67* – – – – – – – –

Summary
Grades All 3 4 5 6 7 8
Number of students 22,018 4,493 2,974 4,086 3,624 3,655 3,186
Number of coefficients 119 24 19 23 17 17 19
Average validity – 0.66 0.68 0.70 0.68 0.69 0.70
Overall validity 0.68
a. Grade given in the column signifies the grade within which the Predictor variable was given (as some validity estimates span
contiguous grades).
b. P indicates a criterion measure was given in a subsequent grade from the predictor.
* Denotes significant correlation (p < 0.05).


Relationship of Star Reading Enterprise Scores to Scores on Previous Versions
The 34-item version of Star Reading represents a significant departure from
previous versions of Star. It is not a replacement for earlier versions; instead, it
presents an alternative approach to reading assessment. Unlike previous Star
Reading versions, which were primarily designed as measures only of reading
comprehension, the 34-item version of Star Reading, simply referred to as Star
Reading, is a standards-based assessment which measures a wide variety of
reading skills. In addition to this substantial change in content from previous
versions, Star Reading tests are also longer, and as a result have greater
measurement precision and reliability.

Star Reading was released for use in June 2011. In the course of its development,
Star Reading was administered to thousands of students who also took previous
versions. The correlations between Star Reading and previous versions of
Star Reading provide validity evidence of their own. To the extent that those
correlations are high, they would provide evidence that the current Star Reading
and previous versions are measuring the same or highly similar underlying
attributes, even though they are dissimilar in content and measurement precision.
Table 72 displays data on the correlations between Star Reading and scores on
two previous versions: classic versions of Star Reading (which includes versions
2.0 through 4.3) and Star Reading Progress Monitoring (version 4.4). Both of those
Star Reading versions are 25-item versions that are highly similar to one another,
differing primarily in terms of the software that delivers them; for all practical
purposes, they may be considered alternate forms of Star Reading.


Table 72: Correlations of Star Reading with Scores on Star Reading Classic
and Star Reading Progress Monitoring Tests
            Star Reading Classic Versions    Star Reading Progress Monitoring Version
Grade       N          r                     N          r
1 810 0.73 539 0.87
2 1,762 0.81 910 0.85
3 2,830 0.81 1,140 0.83
4 2,681 0.81 1,175 0.82
5 2,326 0.80 919 0.82
6 1,341 0.85 704 0.84
7 933 0.76 349 0.81
8 811 0.80 156 0.85
9 141 0.76 27 0.75
10 107 0.79 20 0.84
11 84 0.87 6 0.94
12 74 0.78 5 0.64
All Grades Combined 13,979 0.87 5,994 0.88

Figure 7: Scatterplot of Star Reading and Star Reading Progress Monitoring Test Scores for 5,994 Students Tested in June and July 2011


Data from Post-Publication Studies


Subsequent to publication of Star Reading 2.0 in 1999, additional external validity
data became available, both from users of the assessment and from special
studies conducted by Renaissance Learning and others. This section provides a
summary of results of a doctoral dissertation examining the relationship of Star
Reading to scores on a leading nationally standardized reading assessment, the
Stanford Achievement Test (SAT9), and a major state reading test, the California
Standards Test (CST).

Predictive Validity: Correlations with SAT9 and the California Standards Tests
A doctoral dissertation (Bennicoff-Nan, 2002) studied the validity of Star Reading
as a predictor of students’ scores in a California school district on the California
Standards Test (CST) and the Stanford Achievement Tests, Ninth Edition (SAT9),
the reading accountability tests mandated by the State of California. At the time of
the study, those two tests were components of the California Standardized Testing
and Reporting Program. The study involved analysis of test scores of more than
1,000 school children in four grades in a rural central California school district; 83
percent of students in the district were eligible for free and reduced lunch and 30
percent were identified as having limited English proficiency.

Bennicoff-Nan’s dissertation addressed a number of different research questions.
For purposes of this technical manual, we are primarily interested in the
correlations of Star Reading 2 with SAT9 and CST scores. Those correlations are
displayed by grade in Table 73.

Table 73: Correlations of Star Reading 2.0 Scores with SAT9 and California
Standards Test Scores, by Grade
Grade   SAT9 Total Reading   CST English and Language Arts
3 0.82 0.78
4 0.83 0.81
5 0.83 0.79
6 0.81 0.78

In summary, the average correlation between Star Reading and SAT9 was 0.82.
The average correlation with CST was 0.80. These values are evidence of the
validity of Star Reading for predicting performance on both norm-referenced
reading tests such as the SAT9, and criterion-referenced accountability measures
such as the CST. Bennicoff-Nan concluded that Star Reading was “a time and
labor effective” means of progress monitoring in the classroom, as well as
suitable for program evaluation and monitoring student progress toward state
accountability goals.

Linking Star and State Assessments: Comparing Student- and School-Level Data
With an increasing emphasis on end-of-year summative state tests, many
educators seek informative and efficient means of gauging student performance
on state standards—especially those hoping to make instructional decisions
before the year-end assessment date.

For many teachers, this is an informal process in which classroom assessments
are used to monitor student performance on state standards. While this may be
helpful, such assessments may be technically inadequate when compared to more
standardized measures of student performance.

Recently, the assessment scale associated with Star Reading has been linked
to the scales used by virtually every state summative reading or ELA test in the
US. Linking Star Reading assessments to state tests allows educators to reliably
predict student performance on their state assessment using Star Reading scores.
More specifically, it places teachers in a position to identify
• which students are on track to succeed on the year-end summative state test, and
• which students might need additional assistance to reach proficiency.

Educators using Star Reading assessments can access Star Performance
Reports that display students’ Pathway to Proficiency. These reports
indicate whether individual students or groups of students (by class, grade, or
demographic characteristics) are likely to be on track to meet a particular state’s
criteria for reading proficiency. In other words, these reports allow instructors to
evaluate student progress toward proficiency and make data-based instructional
decisions well in advance of the annual state tests. Additional reports automatically
generated by Star Reading help educators screen for later difficulties and progress
monitor students’ responsiveness to interventions.

An overview of two methodologies used for linking Star Reading to state assessments is provided in the following section.

Methodology Comparison
Recently, Renaissance Learning has developed linkages between Star Reading
Scaled Scores and scores on the accountability tests of a number of states.
Depending on the kind of data available for such linking, these linkages have been
accomplished using one of two different methods. One method used student-
level data, where both Star and state test scores were available for the same
students. The other method used school-level data; this method was applied when
approximately 100% of students in a school had taken Star Reading, but individual
students’ state test scores were not available.

Student-Level Data

Linking scores between distinct assessments using individual data is common
when student-level data are readily available for both assessments. In this
case, the distribution of standardized scores on one test (e.g. percentile ranks)
may be compared to the distribution of standardized scores on another test in an
effort to establish concordance. Recently, the release of individual state test data
for linking purposes allowed for the comparison of Star assessments to state test
scores for several states. Star test comparison scores were obtained within an
eight-week window around the median state test date (+/–4 weeks).

Typically, states classify students into one of three, four, or five performance levels
on the basis of cut scores (e.g. Below Basic, Basic, Proficient, or Advanced).
After each testing period, a distribution of students falling into each of these
categories will always exist (e.g. 30% in Basic, 25% in Proficient, etc.). Because
Star data were available for the same students who completed the state test,
the distributions could be linked via equipercentile linking analysis (see Kolen
& Brennan, 2004) to scores on the state test. This process creates tables of
approximately equivalent scores on each assessment, allowing for the lookup of
Star scale scores that correspond to the cut scores for different performance levels
on the state test. For example, if 20% of students were “Below Basic” on the state
test, the lowest Star cut score would be set at a score that partitioned only the
lowest 20% of scores.
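
The sketch below illustrates the cut-score lookup step just described, under simplifying assumptions; it is not the full equipercentile procedure of Kolen and Brennan (2004), and the matched-sample scores and the 20% figure are placeholders.

```python
# Illustrative sketch of the cut-score lookup described above: if 20% of the
# matched students fell below the state test's "Basic" cut, the corresponding
# Star Reading cut is taken as the score that partitions the lowest 20% of the
# Star distribution. Scores and the 20% figure are hypothetical.
import numpy as np

star_scores = np.array([212, 260, 305, 318, 344, 390, 412, 455, 470, 515])
pct_below_basic_on_state_test = 20.0  # percent of the same students below the state cut

star_cut = np.percentile(star_scores, pct_below_basic_on_state_test)
print(f"Approximate Star Reading cut score for that performance level: {star_cut:.0f}")
```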

School-Level Data

While using student-level data is still common, obstacles associated with individual
data often lead to a difficult and time-consuming process of obtaining and
analyzing data. In light of the time-sensitive needs of schools, obtaining student-
level data is not always an option. As an alternative, school-level data may be
used in a similar manner. These data are publicly available, thus making the
linking process more efficient.

School-level data were analyzed for some of the states included in the student-
level linking analysis. In an effort to increase sample size, the school-level data
presented here represent “projected” Scaled Scores. Each Star score was
projected to the midpoint of the state test administration window using decile-based
growth norms. The growth norms are both grade- and subject-specific and
are based on the growth patterns of more than one million students using Star
assessments over a three-year period. Again, the linking process used for school-
level data is very similar to the previously described process—the distribution of
state test scores is compared to the distribution of projected Star scores, and
equivalent cut scores for the Star assessments are derived from the observed
distribution of state test scores (the key difference being that these comparisons
are made at the group level).
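
The following sketch illustrates the idea of a "projected" Scaled Score under stated assumptions. The decile-specific weekly growth rates are hypothetical placeholders, not Renaissance's published growth norms, which are grade- and subject-specific and derived from more than one million students.

```python
# A hedged sketch of projecting a Star Scaled Score to the midpoint of the
# state test administration window. The weekly growth rates by starting decile
# are hypothetical illustrations only.
from datetime import date

HYPOTHETICAL_WEEKLY_GROWTH_BY_DECILE = {1: 2.4, 2: 2.2, 3: 2.0, 4: 1.9, 5: 1.8,
                                        6: 1.7, 7: 1.6, 8: 1.5, 9: 1.4, 10: 1.3}

def project_scaled_score(observed_score, test_date, state_test_midpoint, decile):
    """Project an observed Scaled Score forward to the state test window midpoint."""
    weeks = (state_test_midpoint - test_date).days / 7.0
    return observed_score + weeks * HYPOTHETICAL_WEEKLY_GROWTH_BY_DECILE[decile]

# Example: a score of 455 observed in mid-January, projected to a late-April midpoint.
print(project_scaled_score(455, date(2011, 1, 15), date(2011, 4, 20), decile=6))
```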

Accuracy Comparisons
Accuracy comparisons between student- and school-level data are particularly
important given the marked resource differences between the two methods. These
comparisons are presented for three states1 in Table 74, Table 75, and Table
76. With few exceptions, results of linking using school-level data were nearly
identical to student-level data on measures of specificity, sensitivity, and overall
accuracy. McLaughlin and Bandeira de Mello (2002) employed similar methods in
their comparison of NAEP scores and state assessment results, and this method
has been used several times since then (McLaughlin & Bandeira de Mello, 2003;
Bandeira de Mello, Blankenship, & McLaughlin, 2009; Bandeira de Mello et al., 2008).

In a similar comparison study using group-level data, Cronin et al. (2007) observed
cut score estimates comparable to those requiring student-level data.

1. Data were available for Arkansas, Florida, Idaho, Kansas, Kentucky, Mississippi, North
Carolina, South Dakota, and Wisconsin; however, only North Carolina, Mississippi, and
Kentucky are included in the current analysis.


Table 74: Number of Students Included in Student-Level and School-Level Linking Analyses by State, Grade, and Subject
Reading
State Grade Student School
NC 3 2,707 4,923
4 2,234 4,694
5 1,752 2,576
6 702 2,604
7 440 2,530
8 493 1,814
MS 3 3,821 6,786
4 3,472 7,915
5 2,915 8,327
6 2,367 7,861
7 1,424 6,133
8 1,108 4,004
KY 3 10,776 2,625
4 8,885 4,010
5 7,147 4,177
6 5,003 2,848
7 2,572 2,778
8 1,198 1,319


Table 75: Comparison of School-Level and Student-Level Classification Diagnostics for Reading/Language Arts

Sensitivitya Specificityb False + Ratec False – Rated Overall Rate

State Grade Student School Student School Student School Student School Student School
NC 3 89% 83% 75% 84% 25% 16% 11% 17% 83% 83%
4 90% 81% 69% 80% 31% 20% 10% 19% 82% 81%
5 90% 77% 69% 83% 31% 17% 10% 23% 81% 80%
6 85% 85% 75% 75% 25% 25% 15% 15% 81% 81%
7 84% 76% 77% 82% 23% 18% 16% 24% 80% 79%
8 83% 79% 74% 74% 26% 26% 17% 21% 79% 76%
MS 3 66% 59% 86% 91% 14% 9% 34% 41% 77% 76%
4 71% 68% 87% 88% 13% 12% 29% 32% 79% 79%
5 70% 68% 84% 85% 16% 15% 30% 32% 78% 78%
6 67% 66% 84% 84% 16% 16% 33% 34% 77% 77%
7 63% 66% 88% 86% 12% 14% 37% 34% 79% 79%
8 69% 72% 86% 85% 14% 15% 31% 28% 79% 80%
KY 3 91% 91% 49% 50% 51% 50% 9% 9% 83% 83%
4 90% 86% 46% 59% 54% 41% 10% 14% 81% 80%
5 88% 81% 50% 65% 50% 35% 12% 19% 79% 77%
6 89% 84% 53% 63% 47% 37% 11% 16% 79% 79%
7 86% 81% 56% 66% 44% 34% 14% 19% 77% 76%
8 89% 84% 51% 63% 49% 37% 11% 16% 79% 78%
a. Sensitivity refers to the proportion of correct positive predictions.
b. Specificity refers to the proportion of negatives that are correctly identified (e.g. student will not meet a particular cut score).
c. False + rate refers to the proportion of students incorrectly identified as “at-risk.”
d. False – rate refers to the proportion of students incorrectly identified as not “at-risk.”
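
For readers who want to reproduce diagnostics like those in Table 75, the sketch below shows how sensitivity, specificity, the false positive and false negative rates, and the overall classification rate follow from a simple 2 × 2 cross-tabulation of predicted versus actual proficiency status. The counts are hypothetical; "positive" here means a student flagged as at risk of not reaching proficiency, consistent with the footnote definitions.

```python
# Minimal sketch with hypothetical counts: classification diagnostics from a
# 2x2 table of predicted vs. actual status, where "positive" = flagged at risk.
def classification_diagnostics(true_pos, false_pos, true_neg, false_neg):
    sensitivity = true_pos / (true_pos + false_neg)    # at-risk students correctly flagged
    specificity = true_neg / (true_neg + false_pos)    # not-at-risk students correctly cleared
    false_pos_rate = 1 - specificity                   # incorrectly labeled "at risk"
    false_neg_rate = 1 - sensitivity                   # incorrectly labeled not "at risk"
    overall = (true_pos + true_neg) / (true_pos + false_pos + true_neg + false_neg)
    return sensitivity, specificity, false_pos_rate, false_neg_rate, overall

# Hypothetical counts for one grade:
print(classification_diagnostics(true_pos=180, false_pos=60, true_neg=620, false_neg=40))
```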


Table 76: Comparison of Differences Between Achieved and Forecasted Performance Levels in
Reading/Language Arts (Forecast % – Achieved %)
State Grade Student School Student School Student School Student School
NC Level I Level II Level III Level IV
3 –6.1% –1.1% 2.0% 1.1% 3.6% –0.8% 0.4% 0.9%
4 –3.9% –2.0% –0.1% 1.3% 4.3% 0.4% –0.3% 0.2%
5 –5.1% –1.9% –0.7% 2.4% 8.1% –0.7% –2.3% 0.2%
6 –2.1% 0.2% 0.8% –0.4% 3.2% –11.5% –2.0% 11.7%
7 –6.4% –0.9% 2.9% –0.4% 6.3% –0.7% –2.8% 2.0%
8 –4.9% –3.0% 3.0% 0.4% 5.1% 2.3% –3.1% 0.3%
MS Minimal Basic Proficient Advanced
3 5.2% 14.1% 3.9% 0.5% –6.1% –13.4% –3.0% –1.2%
4 5.6% 10.9% 0.2% –3.1% –3.0% –5.9% –2.8% –1.8%
5 4.2% 12.6% 0.4% –6.7% –2.7% –7.2% –1.9% 1.3%
6 1.9% 6.2% 2.0% –1.5% –3.8% –7.1% 0.0% 2.4%
7 5.3% 7.0% 1.1% –2.8% –6.3% –5.3% –0.2% 1.0%
8 6.8% 5.5% –1.7% –2.8% –4.6% –4.3% –0.5% 1.5%
KY Novice Apprentice Proficient Distinguished
3 –3.5% –1.4% 0.8% –1.4% 6.4% 3.1% –3.7% –0.3%
4 –0.5% –0.3% –2.5% 2.9% 6.8% –2.1% –3.9% –0.5%
5 –1.6% 1.0% –2.3% 3.7% 9.1% –2.9% –5.3% –1.8%
6 –1.5% 1.9% –3.6% –1.1% 7.3% 0.0% –2.3% –0.8%
7 –0.9% 0.6% –2.5% 2.5% 6.6% –1.7% –3.3% –1.4%
8 –0.1% 1.0% –5.1% 1.1% 8.1% –3.0% –2.9% 0.8%

Classification Accuracy and Screening Data: NCRTI


The National Center on Response to Intervention (NCRTI) is a federally funded
project whose mission includes reviewing the technical adequacy of assessments as
screening tools for use in schools adopting multi-tiered systems of support (commonly
known as RTI, or response to intervention). In the July 2011 review, Star Reading
earned strong ratings on NCRTI’s technical criteria.

When evaluating the validity of screening tools, NCRTI considered several factors:
• classification accuracy
• validity
• disaggregated validity and classification data for diverse populations


NCRTI ratings include four qualitative labels: convincing evidence, partially
convincing evidence, unconvincing evidence, or data unavailable/inadequate.
Please refer to Table 77 for descriptions of these labels as provided by NCRTI,
as well as the scores assigned to Star Reading in each of the categories. Further
descriptive information is provided within the following tables.

Table 77: NCRTI Screening Indicator Descriptions

Indicator: Classification Accuracy
  Description: Classification accuracy refers to the extent to which a screening tool is able to accurately classify students into “at risk for reading disability” and “not at risk for reading disability” categories (often evidenced by AUC values greater than 0.85).
  Star Reading Score: Convincing Evidence

Indicator: Validity
  Description: Validity refers to the extent to which a tool accurately measures the underlying construct that it is intended to measure (often evidenced by coefficients greater than 0.70).
  Star Reading Score: Convincing Evidence

Indicator: Disaggregated Validity and Classification Data for Diverse Populations
  Description: Data are disaggregated when they are calculated and reported separately for specific subgroups.
  Star Reading Score: Convincing Evidence

Aggregated Classification Accuracy Data

Receiver Operating Characteristic (ROC) Curves as defined by NCRTI:

“Receiver Operating Characteristic (ROC) curves are a useful way to interpret
sensitivity and specificity levels and to determine related cut scores. ROC
curves are a generalization of the set of potential combinations of sensitivity
and specificity possible for predictors.” (Pepe, Janes, Longton, Leisenring, &
Newcomb, 2004)

“ROC curve analyses not only provide information about cut scores, but also
provide a natural common scale for comparing different predictors that are
measured in different units, whereas the odds ratio in logistic regression analysis
must be interpreted according to a unit increase in the value of the predictor, which
can make comparison between predictors difficult.” (Pepe, et al., 2004)

“An overall indication of the diagnostic accuracy of a ROC curve is the area
under the curve (AUC). AUC values closer to 1 indicate the screening measure
reliably distinguishes among students with satisfactory and unsatisfactory reading
performance, whereas values at .50 indicate the predictor is no better than
chance.” (Zhou, Obuchowski, & McClish, 2002)

Brief Description of the Current Sample and Procedure


Initial Star Reading classification analyses were performed using state assessment
data from Arkansas, Delaware, Illinois, Michigan, Mississippi, and Kansas.
Collectively these states cover most regions of the country (Central, Southwest,
Northeast, Midwest, and Southeast). Both the Classification Accuracy and Cross
Validation study samples were drawn from an initial pool of 79,045 matched
student records covering grades 2–11. The sample used for this analysis was 49
percent female and 28 percent male, with 44 percent not responding. Twenty-
eight percent of students were White, 14 percent were Black, and 2 percent were
Hispanic. Lastly, 0.4 percent were Asian or Pacific Islander and 0.2 were American
Indian or Alaskan Native. Ethnicity data were not provided for 55.4 percent of the
sample.

A secondary analysis using data from a single state assessment was then
performed. The sample used for this analysis was 42,771 matched Star Reading
and South Dakota Test of Education Progress records. The sample covered
grades 3–8 and was 28 percent female and 28 percent male. Seventy-one percent
of students were White and 26 percent were American Indian or Alaskan Native.
Lastly, 1 percent were Black, 1 percent were Hispanic, and 0.7 percent were
Asian or Pacific Islander.

An ROC analysis was used to compare the performance data on Star Reading to
performance data on state achievement tests. The Star Reading Scaled Scores
used for analysis originated from assessments 3–11 months before the state
achievement test was administered. Selection of cut scores was based on the
graph of sensitivity and specificity versus the Scaled Score. For each grade, the
Scaled Score chosen as the cut point was equal to the score where sensitivity
and specificity intersected. The classification analyses, cut points and outcome
measures are outlined in Table 78. When collapsed across ethnicity, AUC values
were all greater than 0.80. Descriptive notes for other values represented in the
table are provided in the table footnote.
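
A hedged sketch of this kind of ROC analysis appears below: it computes AUC for Star Scaled Scores predicting a binary proficiency outcome and selects the cut point where sensitivity and specificity are approximately equal. The scores and outcomes are placeholders, and the snippet is illustrative rather than a reproduction of the analyses summarized in Table 78.

```python
# Illustrative ROC sketch (placeholder data, not the analyses reported here):
# Star Scaled Scores predict a binary outcome (proficient on the later state
# test); the cut point is taken where sensitivity and specificity intersect.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

scaled_scores = np.array([220, 260, 295, 310, 355, 400, 430, 475, 520, 610])
proficient    = np.array([  0,   0,   0,   1,   0,   1,   1,   1,   1,   1])

auc = roc_auc_score(proficient, scaled_scores)
fpr, tpr, thresholds = roc_curve(proficient, scaled_scores)
specificity = 1 - fpr

# Choose the threshold where sensitivity (tpr) and specificity are closest.
best = np.argmin(np.abs(tpr - specificity))
print(f"AUC = {auc:.3f}, approximate cut score = {thresholds[best]:.0f}")
```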

Table 78: Classification Accuracy in Predicting Proficiency on State Achievement Tests in Seven Statesa

                  Initial Analysis   Secondary Analysis
Statistic         Valueb             Value
False Positive Rate 0.2121 0.1824
False Negative Rate 0.2385 0.2201
Sensitivity 0.7615 0.7799
Specificity 0.7579 0.8176
Positive Predictive Power 0.4423 0.5677
Negative Predictive Power 0.9264 0.9236
Overall Classification Rate 0.7586 0.8087
AUC (ROC) Grade AUC Grade AUC
2 0.816
3 0.839 3 0.869
4 0.850 4 0.882
5 0.841 5 0.881
6 0.833 6 0.883
7 0.829 7 0.896
8 0.843 8 0.879
9 0.847
10 0.858
11 0.840
Base Rate 0.20 0.24
Cut Point Grade Cut Score Grade Cut Score
2 228
3 308 3 288
4 399 4 397
5 488 5 473
6 540 6 552
7 598 7 622
8 628 8 727
9 708
10 777
11 1,055
a. Arkansas, Delaware, Illinois, Kansas, Michigan, Mississippi, and South Dakota.
b. The false positive rate is equal to the proportion of students incorrectly labeled “at-risk.” The
false negative rate is equal to the proportion of students incorrectly labeled not “at-risk.”
Likewise, sensitivity refers to the proportion of correct positive predictions while specificity refers
to the proportion of negatives that are correctly identified (e.g., student will not meet a particular
cut score).


Aggregated Validity Data


Table 79 provides aggregated validity values as well as concurrent and predictive
validity evidence for Star Reading. Median validity coefficients ranged from 0.68
to 0.84.

Table 79: Overall Concurrent and Predictive Validity Evidence for Star Reading
Type of Validity   Grade   Test   N (Range)   Coefficient Range   Median
Predictive 3–6 CST 1,000+ 0.78–0.81 0.80
Predictive 2–6 SAT9 44–389 0.66–0.73 0.68
Concurrent 1–8 Suffolk Reading Scale 2,694 0.78–0.88 0.84
Construct 3, 5, 7, 10 DRP 273–424 0.76–0.86 0.82
Concurrent 1–4 DIBELS Oral Reading Fluency 12,220 0.71–0.87 0.81
Predictive 1–6 State Achievement Tests 74,877–200,929 0.68–0.82 0.79
Predictive 7–12 State Achievement Tests 3,107–64,978 0.81–0.86 0.82
Concurrent 3–8 State Achievement Tests 1,200–2,329 0.71–0.74 0.73
Predictive 3–8 State Achievement Tests 2,974–4,493 0.66–0.70 0.68

Disaggregated Validity and Classification Data


Table 80 shows the disaggregated classification accuracy data for ethnic
subgroups and also the disaggregated validity data.

Table 80: Disaggregated Classification and Validity Data

Classification Accuracy in Predicting Proficiency on State Achievement Tests in 6 States (Arkansas,
Delaware, Illinois, Kansas, Michigan, and Mississippi): by Race/Ethnicity

                          White, non-Hispanic   Black, non-Hispanic   Hispanic      Asian/Pacific Islander   American Indian/Alaska Native
                          (n = 17,567)          (n = 8,962)           (n = 1,382)   (n = 231)                (n = 111)
False Positive Rate 0.3124 0.4427 0.3582 0.1710 0.1216
False Negative Rate 0.3762 0.1215 0.1224 0.2368 0.4054
Sensitivity 0.6238 0.8785 0.8776 0.7632 0.5946
Specificity 0.8676 0.5573 0.6418 0.8290 0.8784
Positive Predictive Power 0.5711 0.5031 0.6103 0.4677 0.7097
Negative Predictive Power 0.8909 0.8999 0.8913 0.9467 0.8125
Overall Classification Rate 0.8139 0.6658 0.7337 0.8182 0.7838

AUC (ROC) Grade AUC Grade AUC Grade AUC Grade AUC Grade AUC
2 n/a 2 0.500 2 n/a 2 n/a 2 n/a
3 0.863 3 0.828 3 0.868 3 0.913 3 0.697
4 0.862 4 0.823 4 0.837 4 0.869 4 0.888
5 0.853 5 0.832 5 0.839 5 0.855 5 0.919
6 0.849 6 0.806 6 0.825 6 0.859 6 0.846
7 0.816 7 0.784 7 0.866 7 0.904 7 0.900
8 0.850 8 0.827 8 0.812 8 0.961 8 1.000
9 1.000 9 0.848 9 n/a 9 n/a 9 n/a
10 0.875 10 0.831 10 0.833 10 n/a 10 n/a
11 0.750 11 1.000 11 n/a 11 n/a 11 n/a
Base Rate 0.2203 0.3379 0.3900 0.1645 0.333
Cut Scores   Grade Cut Score   Grade Cut Score   Grade Cut Score   Grade Cut Score   Grade Cut Score
2 228 2 228 2 228 2 228 2 228
3 308 3 308 3 308 3 308 3 308
4 399 4 399 4 399 4 399 4 399
5 488 5 488 5 488 5 488 5 488
6 540 6 540 6 540 6 540 6 540
7 598 7 598 7 598 7 598 7 598
8 628 8 628 8 628 8 628 8 628
9 708 9 708 9 708 9 708 9 708
10 777 10 777 10 777 10 777 10 777
11 1,055 11 1,055 11 1,055 11 1,055 11 1,055

Disaggregated Validity
Type of Validity   Age or Grade   Test or Criterion   n (range)   Coefficient Range   Median
Predictive (White) 2–6 SAT9 35–287 0.69–0.75 0.72
Predictive (Hispanic) 2–6 SAT9 7–76 0.55–0.74 0.675

References

Allington, R., & McGill-Franzen, A. (2003). Use students’ summer-setback months to raise minority achievement. Education Digest, 69(3), 19–24.

Bennicoff-Nan, L. (2002). A correlation of computer adaptive, norm referenced, and criterion referenced achievement tests in elementary reading. Unpublished doctoral dissertation, The Boyer Graduate School of Education, Santa Ana, CA.

Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.

Betebenner, D. W. (2010). New directions for student growth models. Retrieved from the National Center for the Improvement of Educational Assessment website: [Link].aspx?fileticket=UssiNoSZks8%3D&tabid=4421&mid=10564

Betebenner, D. W., & Iwaarden, A. V. (2011a). SGP: An R package for the calculation and visualization of student growth percentiles & percentile growth trajectories [Computer software manual]. (R package version 0.4-0.0, available at [Link])

Betebenner, D. W. (2011b). A technical overview of the student growth percentile methodology: Student growth percentiles and percentile growth projections/trajectories. The National Center for the Improvement of Educational Assessment. Retrieved from [Link]/njsmart/performance/SGP_Technical_Overview.pdf

Borman, G. D., & Dowling, N. M. (2004). Testing the Reading Renaissance program theory: A multilevel analysis of student and classroom effects on reading achievement. University of Wisconsin-Madison.

Bracey, G. (2002). Summer loss: The phenomenon no one wants to deal with. Phi Delta Kappan, 84(1), 12–13.

Bryk, A., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage Publications.

Bulut, O., & Cormier, D. C. (2018). Validity evidence for progress monitoring with Star Reading: Slope estimates, administration frequency, and number of data points. Frontiers in Education, 3(68), 1–12.

Campbell, D., & Stanley, J. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally & Company.

Cook, T., & Campbell, D. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin Company.

Deno, S. (2003). Developments in curriculum-based measurement. Journal of Special Education, 37(3), 184–192.

Diggle, P., Heagerty, P., Liang, K., & Zeger, S. (2002). Analysis of longitudinal data (2nd ed.). Oxford: Oxford University Press.

Duncan, T., Duncan, S., Strycker, L., Li, F., & Alpert, A. (1999). An introduction to latent variable growth curve modeling: Concepts, issues, and applications. Mahwah, NJ: Lawrence Erlbaum Associates.

Gickling, E. E., & Havertape, S. (1981). Curriculum-based assessment (CBA). Minneapolis, MN: School Psychology Inservice Training Network.

Gickling, E. E., & Thompson, V. E. (2001). Putting the learning needs of children first. In B. Sornson (Ed.), Preventing early learning failure. Alexandria, VA: ASCD.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Holmes, C. T., & Brown, C. L. (2003). A controlled evaluation of a total school improvement process, School Renaissance. University of Georgia. Available online: [Link]

Johnson, M. S., Kress, R. A., & Pikulski, J. J. (1987). Informal reading inventories. Newark, DE: International Reading Association.

Kirk, R. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). New York: Brooks/Cole Publishing Company.

Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

McCormick, S. (1999). Instructing students who have literacy problems (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Meyer, B., & Rice, G. E. (1984). The structure of text. In P. D. Pearson (Ed.), Handbook of reading research (pp. 319–352). New York: Longman.

Moskowitz, D. S., & Hershberger, S. L. (Eds.). (2002). Modeling intraindividual variability with repeated measures data: Methods and applications. Mahwah, NJ: Lawrence Erlbaum Associates.

Neter, J., Kutner, M., Nachtsheim, C., & Wasserman, W. (1996). Applied linear statistical models (4th ed.). New York: WCB McGraw-Hill.

Pedhazur, E., & Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2).

Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14).

Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research & Evaluation, 10(13).

Wyse, A. E., & Hao, S. (2012). An evaluation of item response theory classification accuracy and consistency indices. Applied Psychological Measurement, 36(7), 602–624.

Index

Conditional standard error of measurement


A (CSEM), 49
Accessibility and test accommodations, 11 Construct validity, correlations with a measure of
Access levels, 13 reading comprehension, 64, 76
Adaptive Branching, 3, 7, 12 Content development, 15
Additional validity evidence, 74 ATOS graded vocabulary list, 18
Administering the test, 14 Educational Development Laboratory’s core
Agile Education Marketing Data, 98 vocabulary list, 18
Alternate form reliability, 52 Content specification
Answering test questions, 6 Star Reading, 15
Area Under the Curve (AUC), 82 Star Reading Progress Monitoring, 19
ATOS, 19 Content validity, 64
ATOS graded vocabulary list, 18 Conversion tables, 118
AUC. See Area Under the Curve (AUC) Cronbach’s alpha, 48, 51
Authentic text passage item specifications, 20 Cross-validation study results, 80
CSEM. See Conditional standard error of
measurement (CSEM)
B
Bayesian-modal Item Response Theory (IRT), 44
D
Data analysis, 101
C Data encryption, 12
Calibration Decision accuracy, 57
of Star Reading items for use in version 2, 31 Decision consistency, 57
of supplemental items for use in version 4.3, Definitions of scores, 105
39 Description of the program, 1
California Achievement Test (CAT), 136 DIBELS oral reading fluency (DORF), 78, 172
California Standards Test (CST), 162 Differential Item Functioning, 92
Common Core State Standards (CCSS), 72 DORF (DIBELS oral reading fluency), 78
Comparing the Star Reading test with classical Dynamic calibration, 15, 16, 41, 42
tests, 107
Compensating for incorrect grade placements, 117
Comprehensive Test of Basic Skills (CTBS), 136 E
Computer-adaptive test design, 43 Educational Development Laboratory, core
Concurrent validity, correlations with reading tests vocabulary list, 18
in England, 75 EIRF. See Empirical item response functions
conditional standard error of measurement (EIRF)
(CSEM), 57 Emergent Reader, 133
Empirical item response functions (EIRF), 37, 38


England, 75 efficiency in use of student time, 25


Estimated Oral Reading Fluency (Est. ORF), 78, item components, 26
108 language conventions, 26
Extended time limits, 11 level of difficulty, cognitive load, content
External validity, 68 differentiation, and presentation, 24
level of difficulty, readability, 23
metadata requirements and goals, 27
Star Reading, 19
G Item difficulty, 36
GE. See Grade Equivalent (GE) Item discrimination, 36
Goal setting, 91 Item presentation, 34
Grade Equivalent (GE), 106, 110 Item response function (IRF), 36, 38
Grade placement, 116 item response theory (IRT), 57
compensating for incorrect grade
Item Response Theory (IRT), 36, 50
placements, 117 difficulty scale, 32
indicating appropriate grade placement, 116 difficulty scale, parameter, 39
Growth estimates, 103 Item retention, rules, 38
Growth norms, 103, 165 Item specifications
authentic text passages, 20
vocabulary-in-context items, 19
I
Improvements to the Star Reading test
in versions 3 RP and higher, 5 J
Indicating appropriate grade placement, 116 JAWS screen reader, 11
Individualized tests, 12
Instructional Reading Level (IRL), 108, 109
Internal validity, 65
Investigating Oral Reading Fluency and K
developing the Est. ORF (Estimated Oral Kuder-Richardson Formula 20 (KR-20), 51
Reading Fluency) scale, 78
Iowa Test of Basic Skills (ITBS), 136
Item calibration, 31
sample description, 32 L
Length of test, 7
sample description, item difficulty, 36
Lexile® Measures, 113
sample description, item discrimination, 36
Lexile Framework® for Reading, 115
sample description, item presentation, 34
of students and books, 114
sample description, item response function,
Linking Star and state assessments, 163
36
school-level data, 164
Item development, 15
student-level data, 164
authentic text passage item specifications, 20
Longitudinal study, correlations with SAT9, 74
vocabulary-in-context item specifications, 19
Item development specifications
accuracy of content, 26
adherence to skills, 23
balanced items, bias and fairness, 25


M R
Maximum likelihood IRT estimation, 31 Rasch model, 36
Measurement precision, 48 Reader
Metropolitan Achievement Test (MAT), 136 Emergent, 133
Probable, 133
Transitional, 133
Receiver Operating Characteristic (ROC) curves,
N 169
National Center for Educational Statistics (NCES), Reliability, 48
98 alternate-form, 52
National Center on Response to Intervention coefficient, 48, 60
(NCRTI), 136, 168, 169 definition, 48
disaggregated validity and classification data, split-half, 48, 51
83 standard error of measurement (SEM), 54
Normal Curve Equivalent (NCE), 111 test-retest, 48, 52
Norming, 96 Renaissance learning progressions for reading, 27
data analysis, 101 Repeating a test, 9
growth norms, 103 ROC analysis, 82
sample characteristics, 97 Rudner’s index, 57
test administration, 101 Rules for item retention, 38
test score norms, 96

O S
Sample characteristics, norming, 97
Oral Reading Fluency, 78
SAT9, 74
Scale calibration, 31, 39
Scaled Score (SS), 40
P Score definitions, 105
Password entry, 13 conversion tables, 118
Pathway to Proficiency, 163 grade placement, 116
Percentile Rank (PR), 110 special scores, 116
Permissions, 13 types of test scores, 105
Post-publication studies, 162 Scores
Post-publication study data Estimated Oral Reading Fluency (Est. ORF),
correlations with a measure of reading 78, 108
comprehension, 76 Grade Equivalent (GE), 106, 109, 110
correlations with reading tests in England, 75 Instructional Reading Level (IRL), 108, 109
correlations with SAT9, 74 Lexile® Measures, 113
investigating Oral Reading Fluency and Normal Curve Equivalent (NCE), 111
developing the Est. ORF (Estimated Oral Percentile Rank (PR), 110
Reading Fluency) scale, 78 special IRL (Instructional Reading Level),
PR. See Percentile Rank (PR) 109
Practice session, 6 Student Growth Percentile (SGP), 112
Program description, 1 test scores, 105
Purpose of the program, 1 Zone of Proximal Development (ZPD), 116


Security. See Test security Normal Curve Equivalent (NCE), 111


SEM. See Standard error of measurement (SEM) Percentile Rank (PR), 110
SGP. See Student Growth Percentile (SGP) special IRL (Instructional Reading Level),
Special scores 109
Zone of Proximal Development (ZPD), 116 Student Growth Percentile (SGP), 112
Split-application model, 12
Split-half reliability, 48, 51
SS. See Scaled Score (SS)
Standard error of measurement (SEM), 54, 138 U
Understanding GE scores, 109
Stanford Achievement Test (SAT9), 136, 162
Understanding IRL scores, 109
Star Early Learning, 96
Universal skills pool, 28
Star Reading, compared with classical tests, 107
Unlimited time, 12
State assessments, linked to Star, 163
User permissions, 13
methodology comparison, 163
school-level data, 164
student-level data, 164
State tests, 71 V
Student Growth Percentile (SGP), 112 Validation evidence
Summary of validity data, 95 additional, 74
Validity
concurrent, 75
T construct, 76
cross-validation study results, 80
Test administration, 101
definition, 64
Test administration procedures, 14
external evidence, 68
Test items, time limits, 10
internal evidence, 65
Test monitoring, 13
investigating Oral Reading Fluency and
Test repetition, 9
developing the Estimated Oral Reading
Test-retest reliability, 48, 52
Fluency (Est. ORF) scale, 78
Test score norms, 96
longitudinal study, 74
Test security, 12
summary of validity data, 95
access levels and user permissions, 13
Vocabulary-in-context item specifications, 19
data encryption, 12
Vocabulary lists, 18
individualized tests, 12
split-application model, 12
test monitoring and password entry, 13
Time limits, 10 W
extended time limits, 11 WCAG 2.1 AA, 11
unlimited, 12 Words correctly read per minute (WCPM), 78
Types of test scores, 105
comparing the Star Reading test with
classical tests, 107
Estimated Oral Reading Fluency (Est. ORF), Z
78, 108 Zone of Proximal Development (ZPD), 116
Grade Equivalent (GE), 106, 110
Instructional Reading Level (IRL), 108, 109

About Renaissance
Renaissance® transforms data about how students learn into instruments of empowerment for classroom
teachers, enabling them to guide all students to achieve their full potentials. Through smart, data-driven
educational technology solutions that amplify teachers’ effectiveness, Renaissance helps teachers
teach better, students learn better, and school administrators lead better. By supporting teachers in the
classroom but not supplanting them, Renaissance solutions deliver tight learning feedback loops: between
teachers and students, between assessment of skills mastery and the resources to spur progress, and
between each student’s current skills and future academic growth.

© Copyright 2024 Renaissance Learning, Inc. All rights reserved. (800) 338-4204 [Link]
All logos, designs, and brand names for Renaissance’s products and services are trademarks of Renaissance Learning, Inc., and its subsidiaries,
registered, common law, or pending registration in the United States. All other product and company names should be considered the property of their
respective companies and organizations. R43843.240827
