Star Reading
Technical Manual
Renaissance Learning
PO Box 8036
Wisconsin Rapids, WI 54495-8036
Telephone: (800) 338-4204
(715) 424-3636
Outside the US: 1.715.424.3636
Fax: (715) 424-4242
Email (general/technical questions): support@[Link]
Email (international support): worldsupport@[Link]
Website: [Link]
Copyright Notice
Copyright © 2024 by Renaissance Learning, Inc. All Rights Reserved.
This publication is protected by US and international copyright laws. It is unlawful to duplicate or reproduce
any copyrighted material without authorization from the copyright holder. This document may be reproduced
only by staff members in schools that have a license for Star Reading software. For more information, contact
Renaissance Learning, Inc., at the address above.
All logos, designs, and brand names for Renaissance’s products and services, including but not limited to
Accelerated Reader, Accelerated Reader Bookfinder, AR, AR Bookfinder, AR Bookguide, Accelerated Math,
Freckle, Lalilo, myIGDIs, myON, myON Classics, myON News, Renaissance, Renaissance Growth Alliance,
Renaissance Growth Platform, Renaissance Learning, Renaissance Place, Renaissance Smart Start,
Renaissance-U, Star Assessments, Star 360, Star CBM, Star Reading, Star Math, Star Early Literacy, Star
Custom, Star Spanish, Schoolzilla, and Renaissance are trademarks of Renaissance Learning, Inc., and its
subsidiaries, registered, common law, or pending registration in the United States. All other product and company
names should be considered the property of their respective companies and organizations.
Macintosh is a trademark of Apple Inc., registered in the U.S. and other countries.
METAMETRICS®, the METAMETRICS® logo and tagline, LEXILE®, the LEXILE® logo, POWERV®,
QUANTILE®, and the QUANTILE® logo are registered trademarks of MetaMetrics, Inc. in the United States and
abroad. Copyright © 2024 MetaMetrics, Inc. All rights reserved.
6/2024 SRRP
Contents
Introduction......................................................... 1
Star Reading: Screening and Progress-Monitoring Assessment............................1
Star Reading Purpose.............................................................................................1
Design of Star Reading...........................................................................................2
Three Generations of Star Reading Assessments...........................................2
Overarching Design Considerations.................................................................3
Improvements Specific to Star Reading Versions 3 and Higher................5
Test Interface..........................................................................................................6
Practice Session.....................................................................................................6
Adaptive Branching/Test Length.............................................................................7
Test Length.......................................................................................................7
Test Repetition........................................................................................................9
Item Time Limits......................................................................................................9
Accessibility and Test Accommodations ........................................................11
Unlimited Time................................................................................................12
Test Security.........................................................................................................12
Split-Application Model...................................................................................12
Individualized Tests........................................................................................12
Data Encryption..............................................................................................12
Access Levels and Capabilities......................................................................13
Test Monitoring/Password Entry.....................................................................13
Final Caveat...................................................................................................13
Test Administration Procedures............................................................................14
Reliability Coefficients....................................................................................60
Standard Error of Measurement.....................................................................61
Decision Accuracy and Consistency...............................................................63
Validity................................................................. 64
Content Validity.....................................................................................................64
Construct Validity..................................................................................................64
Internal Evidence: Evaluation of Unidimensionality of Star Reading....................65
External Evidence: Relationship of Star Reading Scores to Scores
on Other Tests of Reading Achievement..........................................................68
Relationship of Star Reading Scores to Scores on State Tests
of Accountability in Reading..............................................................................71
Relationship of Star Reading Scores to Scores on Multi-State Consortium
Tests in Reading...............................................................................................72
Meta-Analysis of the Star Reading Validity Data...................................................73
Additional Validation Evidence for Star Reading...................................................74
A Longitudinal Study: Correlations with SAT9................................................74
Concurrent Validity: An International Study of Correlations with
Reading Tests in England...........................................................................75
Construct Validity: Correlations with a Measure of Reading
Comprehension..........................................................................................76
Investigating Oral Reading Fluency and Developing the Estimated
Oral Reading Fluency Scale......................................................................78
Cross-Validation Study Results......................................................................80
Classification Accuracy of Star Reading...............................................................81
Accuracy for Predicting Proficiency on a State Reading Assessment............81
Accuracy for Identifying At-Risk Students.......................................................81
Brief Description of the Current Sample and Procedure..........................82
Disaggregated Validity and Classification Data..............................................83
Evidence of Technical Accuracy for Informing Screening
and Progress Monitoring Decisions...........................................................84
Screening.......................................................................................................85
Progress Monitoring.......................................................................................89
Additional Research on Star Reading as a Progress Monitoring Tool.....92
Differential Item Functioning.................................................................................92
Summary of Star Reading Validity Evidence.........................................................95
Norming................................................................ 96
Background...........................................................................................................96
The 2024 Star Reading Norms.......................................................................96
Sample Characteristics.........................................................................................97
Geographic region..........................................................................................99
School size.....................................................................................................99
Socioeconomic status as indexed by the percent of school
students with free and reduced lunch........................................................99
Test Administration..............................................................................................101
Data Analysis......................................................................................................101
Growth Norms.....................................................................................................103
References............................................................ 174
Index..................................................................... 177
The lengthier Star Reading serves similar purposes, but tests a greater breadth of
reading skills appropriate to each grade level. While the Star Reading test provides
accurate normed data like traditional norm-referenced tests, it is not intended to
be used as a “high-stakes” test. Generally, states are required to use high-stakes
assessments to document growth, adequate yearly progress, and mastery of
state standards. These high-stakes tests are also used to report end-of-period
performance to parents and administrators or to determine eligibility for promotion
or placement. Star Reading is not intended for these purposes. Rather, because
of the high correlation between the Star Reading test and high-stakes instruments,
classroom teachers can use Star Reading scores to fine-tune instruction while
there is still time to improve performance before the regular test cycle. At the same
time, school- and district-level administrators can use Star Reading to predict
performance on high-stakes tests. Furthermore, Star Reading results can easily be
disaggregated to identify and address the needs of various groups of students.
The Star Reading test’s repeatability and flexible administration provide specific
advantages for everyone responsible for the education process:
X For students, Star Reading software provides a challenging, interactive, and
brief test that builds confidence in their reading ability.
X For teachers, the Star Reading test facilitates individualized instruction by
identifying children who need remediation or enrichment most.
X For principals, the Star Reading software provides regular, accurate reports on
performance at the class, grade, building, and district level.
X For district administrators and assessment specialists, it provides a wealth of
reliable and timely data on reading growth at each school and districtwide. It
also provides a valid basis for comparing data across schools, grades, and
special student populations.
Star Reading is similar in many ways to the Star Reading Progress Monitoring
version, but with some enhanced features, including additional reports and
expanded benchmark management.
The second generation consisted of Star Reading versions 2 through 4.4, including
the current Star Reading Progress Monitoring version. This second generation
differed from the first in three major respects: It replaced classical test theory with
Item Response Theory (IRT) as the psychometric foundation for adaptive item
selection and scoring; its test length was fixed at twenty-five items (rather than
the variable length of version 1); and its content included a second item type: the
original vocabulary-in-context items were augmented in grades 3–12 by the use
of longer, authentic text passages for the last 5 items of each test. The second
generation versions differed from one another primarily in terms of the size of their
item banks, which grew to over 2000 items in version 4.4. Like the first generation
of Star Reading tests, the second generation continued to measure a single
construct: reading comprehension.
The third generation is represented by the current version of Star Reading. This
is the first version of Star Reading to be designed as a standards-based test; its
items are organized into 5 blueprint domains, 10 skill sets, 36 general skills, and
over 470 discrete skills—all designed to align to national and state curriculum
standards in reading and language arts, including the Common Core State
Standards. Like the second generation of Star Reading tests, the third generation
Star uses fixed-length adaptive tests. Its tests are longer than the second
generation test—34 items in length—both to facilitate broader standards coverage
and to improve measurement precision and reliability.
Another fundamental Star Reading design decision involved the choice of the
content and format of items for the test. Many types of stimulus and response
procedures were explored, researched, discussed, and prototyped. These
procedures included the traditional reading passage followed by sets of literal or
inferential questions, previously published extended selections of text followed by
open-ended questions requiring student-constructed answers, and several cloze-
type procedures for passage presentation. While all of these procedures can be
used to measure reading comprehension and overall reading achievement, the
vocabulary-in-context format was selected as the primary item format for the first
generation Star Reading assessments. This decision was made for interrelated
reasons of efficiency, breadth of construct coverage, and objectivity and simplicity
of scoring.
Four fundamental arguments support the use of the original Star Reading design
for obtaining quick and reliable estimates of reading comprehension and reading
achievement:
The current third-generation tests expand the breadth of item formats and content
beyond that of the previous versions. Each test consists of 34 items; of these, the
first 10 are vocabulary-in-context items, while the last 24 items spiral their content
to include standards-based material from all five blueprint domains.
The introduction of the 34-item Star Reading version does not replace the
previous version or make it obsolete. The previous version continues to be
available as “Star Reading Progress Monitoring,” the familiar 25-item measure of
reading comprehension. Star Reading thus gives users a choice between a brief
assessment focusing on reading comprehension alone, or a longer, standards-
based assessment which assures that a broad range of different reading skills,
appropriate to student grade level and performance, are included in each
assessment.
For these reasons, the Star Reading test design and item format provide a valid
procedure for assessing a student’s reading comprehension. Data and information
presented in this manual reinforce this.
Test Interface
The Star Reading test interface was designed to be both simple and effective.
Students can use either the mouse or the keyboard to answer questions.
X If using the keyboard, students press one of the four number keys (1, 2, 3, and
4) and then press the Enter key (or the return key on Macintosh computers).
X If using the mouse, students click the answer of choice and then click Next to
enter the answer.
X On a tablet, students tap their answer choice; then, they tap Next.
Practice Session
Star Reading software includes a provision for a brief practice test preceding the
test itself. The practice session allows students to get comfortable with the test
interface and to make sure that they know how to operate it properly. As soon
as a student has answered three practice questions correctly, the program takes
the student into the actual test. As long as they possess the requisite 100-word
vocabulary, even the lowest-level readers should be able to answer the sample
questions correctly. If the student has not successfully answered three items by
the end of the practice session, Star Reading will halt the testing session and tell
the student to ask the teacher for help. It may be that the student cannot read at
even the most basic level, or it may be that the student needs help operating the
interface, in which case the teacher should help the student through the practice
session the next time. Before beginning the next test with the student, the program
will recommend that the teacher assist the student during the practice.
Once a student has successfully passed a practice session, the student will not
be presented with practice items again on a test of the same type taken within the
next 180 days.
In order to minimize student frustration, the first administration of the Star Reading
test begins with items that have a difficulty level that is below what a typical
student at a given grade can handle—usually one or two grades below grade
placement. On the average, about 85 percent of students will be able to answer
the first item correctly. Teachers can override this typical value by entering an
even lower Estimated Instructional Reading Level for the student. On the second
and subsequent administrations, the Star Reading test again begins with items
that have a difficulty level lower than the previously demonstrated reading ability.
Students generally have an 85 percent chance of answering the first item correctly
on second and subsequent tests.
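As a point of reference, and only as an illustration under the Rasch model that Star Reading uses for scoring (described later in this manual), an 85 percent chance of answering the first item correctly corresponds to a start item roughly 1.7 logits easier than the student's ability; the actual start rule is grade-based, as described above.

P(\text{correct}) = \frac{1}{1 + e^{-(\theta - b)}} = 0.85
\quad\Longrightarrow\quad
\theta - b = \ln\!\frac{0.85}{0.15} \approx 1.73 \text{ logits}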
Test Length
Once the testing session is underway, the Star Reading test administers 34 items
(the Star Reading Progress Monitoring test administers 25 items) of varying
difficulty based on the student’s responses; this is sufficient information to obtain a
reliable Scaled Score and to determine the student’s Instructional Reading Level.
The length of time needed to complete a Star Reading test varies across students.
Table 1 provides an overview of the testing time by grade for the students who
took the full-length 34-item version of Star Reading during the 2018–2019 school
year. The results of the analysis of test completion time indicate that half or more
of students completed the test in less than 20 minutes, depending on grade, and
even in the slowest grade (grade K) 95% of students finished their Star Reading
test in less than 34 minutes.
Table 1: Average and Percentiles of Total Time to Complete the 34-item Star Reading Assessment
During the 2018–2019 School Year
Table 2: Average and Percentiles of Total Time to Complete the 25-item Star Reading Progress
Monitoring Assessment During the 2017–2018 and 2018–2019 School Years
Test Repetition
Star Reading score data can be used for multiple purposes such as screening,
placement, planning instruction, benchmarking, and outcomes measurement. The
frequency with which the assessment is administered depends on the purpose for
assessment and how the data will be used. Renaissance Learning recommends
assessing students only as frequently as necessary to get the data needed.
Schools that use Star for screening purposes typically administer it two to five
times per year. Teachers who want to monitor student progress more closely or
use the data for instructional planning may use it more frequently. Star Reading
may be administered monthly for progress monitoring purposes, and as often as
weekly when needed.
Star Reading keeps track of the questions presented to each student from test
session to test session and will not ask the same question more than once in any
120-day period.
Item Time Limits
Item time limits are set so that more than 90 percent of students can complete each item within the normal time limits.
Star Reading provides the option of extended time limits for selected students
who, in the judgment of the test administrator, require more than the standard
amount of time to read and answer the test questions.
Table 3 shows the Star Reading Progress Monitoring version’s test time-out limits
for individual items. These time limits are based on a student’s grade level.
Grade  Question Type                          Standard Time Limit (seconds/item)  Extended Time Limit (seconds/item)
K–2    Practice                               60                                  180
       Test, questions 1–25 a                 60                                  180
       Skill Test—Practice (Calibration)      60                                  180
       Skill Test—Test (Calibration)          60                                  180
3–12   Practice                               60                                  180
       Test, questions 1–20 a                 45                                  135
       Test, questions 21–25 b                90                                  270
       Skill Test—Practice (Calibration)      60                                  180
       Skill Test—Test (Calibration)          90                                  270
a. Vocabulary-in-context items.
b. Authentic text/passage comprehension items.
These time-out values are based on latency data obtained during item validation.
Very few vocabulary-in-context items at any grade had latencies longer than 30
seconds, and almost none (fewer than 0.3 percent) had latencies of more than
45 seconds. Thus, the time-out limit was set to 45 seconds for most students and
increased to 60 seconds for the very young students. Longer time limits were
allowed for the lengthier authentic text passages items.
Table 4 shows time limits for the 34-item Star Reading version’s test questions:
At all grades, regardless of the extended time limit setting, when a student has
only 15 seconds remaining for a given item, a time-out warning appears, indicating
that he or she should make a final selection and move on. Items that time out
are counted as incorrect responses unless the student has the correct answer
selected when the item times out. If the correct answer is selected at that time, the
item will be counted as a correct response.
If a student doesn’t respond to an item, the item times out and briefly gives
the student a message describing what has happened. Then the next item is
presented. The student does not have an opportunity to take the item again. If a
student doesn’t respond to any item, all items are scored as incorrect.
Unlimited Time
Beginning with the 2022–23 school year, a new preference has been added: the
Accommodations Preference. Among other things, this preference allows teachers
to give students virtually unlimited time to answer questions: 15 minutes for both
practice questions and test questions. When this preference is set, the student will
not see a time-out warning when there are 15 seconds left; however, if there is no
activity at all from the student within 15 minutes of a question first being presented,
the student will be shown a dialog box. The student will have 60 seconds to close
the dialog box and return to the test. If the student does not close the dialog box
within 60 seconds, the student’s current progress on the test will be saved and the
test will be ended (and can be resumed the same way as a paused test).
Test Security
Star Reading software includes a number of security features to protect the
content of the test and to maintain the confidentiality of the test results.
Split-Application Model
When students log into Star Reading, they do not have access to the same
functions that teachers, administrators, and other personnel can access. Students
are allowed to take the test, but no other features available in Star Reading are
available to them; therefore, they have no access to confidential information.
When teachers and administrators log in, they can manage student and class
information, set preferences, and create informative reports about student test
performance.
Individualized Tests
Using Adaptive Branching, every Star Reading test consists of items chosen
from a large number of items of similar difficulty based on the student’s estimated
ability. Because each test is individually assembled based on the student's past
and present performance, identical sequences of items are rare. This feature,
while motivated chiefly by psychometric considerations, contributes to test security
by limiting the impact of item exposure.
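A minimal sketch of this kind of selection logic appears below. It is illustrative only: the item structure, function name, and the rule of choosing at random among the few items closest in difficulty to the current ability estimate are assumptions, not the published Star Reading algorithm. The sketch also excludes items the student has seen within the 120-day window described under "Test Repetition."

import random

def select_next_item(item_bank, ability_estimate, recently_seen_ids, pool_size=5):
    """Choose the next item adaptively (illustrative sketch, not the actual algorithm).

    Items presented within the last 120 days are excluded, and the next item is
    drawn at random from the few remaining items whose Rasch difficulty is closest
    to the student's current ability estimate; the randomization limits item exposure.
    """
    eligible = [item for item in item_bank if item["id"] not in recently_seen_ids]
    eligible.sort(key=lambda item: abs(item["difficulty"] - ability_estimate))
    return random.choice(eligible[:pool_size])

# Hypothetical item bank and a current ability estimate of 0.4 logits.
bank = [{"id": i, "difficulty": d} for i, d in enumerate([-1.0, -0.2, 0.3, 0.5, 0.9, 1.4])]
print(select_next_item(bank, ability_estimate=0.4, recently_seen_ids={2}))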
Data Encryption
A major defense against unauthorized access to test content and student test
scores is data encryption. All of the items and export files are encrypted.
Renaissance also allows you to restrict students’ access to certain computers. This
prevents students from taking Star Reading tests from unauthorized computers
(such as home computers). For more information, see [Link]
com/setup/22509.
The security of the Star Reading data is also protected by each person’s user
name (which must be unique) and password. User names and passwords
identify users, and the program only allows them access to the data and features
permitted by their primary position and the user permissions
that they have been granted. Personnel who log in to Renaissance (teachers,
administrators, or staff) must enter a user name and password before they
can access the data and create reports. Parents who are granted access to
Renaissance must also log in with a user name and password before they can
access information about their children. Without an appropriate user name and
password, personnel and parents cannot use the Star Reading software.
Final Caveat
While Star Reading software can do much to provide specific measures of test
security, the most important line of defense against unauthorized access or misuse
of the program is the user’s responsibility. Teachers and test monitors need to be
careful not to leave the program running unattended and to monitor all testing to
prevent students from cheating, copying down questions and answers, or performing
“print screens” during a test session. Taking these simple precautionary steps will
help maintain Star Reading’s security and the quality and validity of its scores.
The Test Administration Manual included with the Star Reading product describes
the standard test orientation procedures that teachers should follow to prepare
their students for the Star Reading test. These instructions are intended for
use with students of all ages; however, the Star Reading test should only be
administered to students who have a reading vocabulary of at least 100 words.
The instructions were successfully field-tested with students ranging from grades
1–8. It is important to use these same instructions with all students before they
take the Star Reading test.
For information regarding the development of Star Reading items, see “Item
Development Specifications: Star Reading” on page 19. Before inclusion
in the Star Reading item bank, all items are reviewed to ensure they meet the
content specifications for Star Reading item development. Items that do not
meet the specifications are either discarded or revised for recalibration. All new
item development adheres to the content specifications and all items have been
calibrated using the dynamic calibration method.
The first stage of expanded Star Reading development was to identify the set of skills
to be assessed. Multiple resources were consulted to determine the set of skills
most appropriate for assessing the reading development of K–12 US students.
The resources include but are not limited to:
X Reading Next—A Vision for Action and Research in Middle and High School
Literacy: A Report to Carnegie Corporation of New York © 2004 by Carnegie
Corporation of New York. [Link]
[Link].
X NCTE Principles of Adolescent Literacy Reform, A Policy Research Brief,
Produced by The National Council of Teachers of English, April 2006.
[Link]
The development of the skills list included iterative reviews by reading and
assessment experts and psychometricians specializing in educational assessment.
See Table 5 for the Star Reading Blueprint Skills List. The skills list is organized
into five blueprint domains:
X Word Knowledge and Skills
X Comprehension Strategies and Constructing Meaning
X Analyzing Literary Text
X Understanding Author’s Craft
X Analyzing Argument and Evaluating Text
Assessment items, once written, edited, and reviewed, are field tested and
calibrated to estimate their Rasch difficulty parameters and goodness of fit to the
model. Field testing and calibration are conducted in a single step. This dynamic
calibration method is done by embedding new items in appropriate, random
positions within the Star assessments to collect the item response data needed
for psychometric evaluation and calibration analysis. Following these analyses,
each assessment item—along with both traditional and Item Response Theory
(IRT) analysis information (including fit plots) and information about the test level,
form, and item identifier—is stored in an item statistics database. A panel of
content reviewers then examines each item within the proper context, to determine
whether the item meets all criteria for use in an operational assessment.
Table 5: Star Reading Assessment Organization: Star Reading Blueprint Domains, Skill Sets, and
Skills
Beginning with new test items introduced in version 4.3, Star Reading item
developers have used ATOS instead of the EDL word list. ATOS is a system for
evaluating the reading level of continuous text; it contains over 125,000 words in
its graded vocabulary list. This readability formula was developed by Renaissance
Learning, Inc., and designed by leading readability experts. ATOS is the first
formula to include statistics from actual student book reading.
The current item bank for Star Reading contains over 6,000 items.
2. To answer the question, the student selects the word from the answer choices
that best completes the sentence. The correct answer option is the word that
appropriately fits both the semantics and the syntax of the sentence. All of
the incorrect answer options either fit the syntax of the sentence or relate to
the meaning of something in the sentence. They do not, however, meet both
conditions.
3. The answer blanks are generally located near the end of the context sentence
to minimize the amount of rereading required.
4. The sentence provides sufficient context clues for students to determine the
appropriate answer choice. However, the length of each sentence varies
according to the guidelines shown in Table 6.
5. Typically, the words that provide the context clues in the sentence are below
the level of the actual test word. However, due to a limited number of available
words, not all of the questions at or below grade 2 meet this criterion—but even
at these levels, no context words are above the grade level of the item.
6. The correct answer option is a word selected from the appropriate grade
level of the item set. Incorrect answer choices are words at the same
grade level or one grade below. Through vocabulary-in-context test items,
Star Reading requires students to rely on background information, apply
vocabulary knowledge, and use active strategies to construct meaning from the
assessment text. These cognitive tasks are consistent with what researchers
and practitioners describe as reading comprehension.
To answer the item correctly, the student needs to have a general understanding of
the context and content of the passage, not merely an understanding of the specific
content of the sentence.
The first authentic passages in Star Reading were extracted from children’s and
young adult literature, from nonfiction books, and from newspapers, magazines,
and encyclopedias. Passages were selected from combinations of three primary
categories for school-age children: popular fiction, classic fiction, and nonfiction.
Overall Flesch-Kincaid readability estimates of the source materials were used as
initial estimates of grade-level difficulty.
After the grade-level difficulty of a passage was estimated, the passage was
searched for occurrences of Educational Development Laboratory (EDL) words
at the same grade level difficulty. When an EDL word was found that, if replaced
with a blank space, would make the passage a good cloze passage, the passage
was extracted for use as an authentic text passage test item. Approximately 600
authentic text passage items were initially developed.
Each of the items in the resulting pool was then rated according to several criteria
in order to determine which items were best suited for inclusion in the tryout and
calibration. Three educators rated each item on the following criteria:
X Grade-level appropriateness of the text
X Cohesiveness of the passage
X Suitability of the passage for its grade level in terms of vocabulary
X Suitability of the passage for its grade level in terms of content density
To ensure a variety of authentic text passage items on the test, each passage was
also placed in one of the following categories, according to Meyer and Rice:
Replacement passages and newly created items intended for use in versions 4.3
and later were extracted primarily from Accelerated Reader (AR) books. (Updated
content specifications were used for writing the new and replacement Star
Reading items in version 4.3.) Target words were selected in advance (based on
the average ATOS level of target words within a range of difficulty levels). Texts of
AR books, beginning with those that had the fewest quiz requests, were run through a
text-analysis tool to find instances where the target words were used. This was done
to decrease the possibility that students had already encountered an excerpt.
Consideration was also given to including some passages from the public domain. When
necessary, original long items were written. In all cases, passages excerpted or
adapted are attributed in “Item and Scale Calibration” on page 31.
Each of the authentic text passage items is written to the following specifications:
1. Each authentic text passage test item consists of a paragraph. The second half
of the paragraph contains a sentence with a blank indicating a missing word.
Four possible answers are shown beneath the sentence.
2. To answer the question, the student selects the word from the list of answer
choices that best completes the sentence based on the context of the
paragraph. The correct answer choice is the word that appropriately fits
both the semantics and the syntax of the sentence, and the meaning of the
paragraph. All of the incorrect answer choices either fit the syntax of the
sentence or relate to the meaning of the paragraph.
3. The paragraph provides sufficient context clues for students to determine the
appropriate answer choice. Average sentence length within the paragraphs is
8–16 words depending on the item’s grade level. Total passage length ranges
from 27–107 words, based on the average reading speed of each grade level,
as shown in Table 7.
4. Answer choices for authentic text passage items are EDL Core Vocabulary or
ATOS words selected from vocabulary levels at or below that of the correct
response. The correct answer for a passage is a word at the targeted level of
the item. Incorrect answers are words or appropriate synonyms at the same
EDL or ATOS vocabulary level or one grade below.
Adherence to Skills
Star Reading assesses more than 600 grade-specific skills within the Renaissance
Core Progress for Reading Learning Progression. Item development is skill-specific.
Each item in the item bank is developed for and clearly aligned to one skill. An item
meets the alignment criteria if the knowledge and skill required to correctly answer
the item match the intended knowledge and skill being assessed. Answering an item
correctly does not require reading skill knowledge beyond the expected knowledge
for the skill being assessed. Star Reading items include only the information and text
needed to assess the skill.
The reading level and grade level for individual words are determined by ATOS.
Item stems and answer choices present several challenges to accurately
determining reading level. Items may contain discipline-specific vocabulary that is
typically above grade level but may still be appropriate for the item; examples
include words such as summary, paragraph, or organized. Answer choices
may be incomplete sentences for which it is difficult to get an accurate reading
grade level. These factors are taken into account when determining reading level.
Item stems and answer choices that are complete sentences are written for the
intended grade level of the item. The words in answer choices and stems that are
not complete sentences are within the designated grade-level range. Reading
comprehension is not complicated by unnecessarily difficult sentence structure
and/or vocabulary.
Items and passages are written at grade level. Table 8 indicates the GLE range,
item word count range, maximum passage word count range, and sentence length
range.
One exception exists for the reading skill use context clues. For those items, the
target word will be one grade level above the designated grade of the item.
Grade  GLE Range  Item Word Count  Maximum Sentence Length  Number of Words 1 Grade Above (per 100)            Number of Unrecognized Words
K      –          Less than 30     < 10                     0                                                  As a rule, the only unrecognized words will be
                                                                                                               names, common derivatives, etc.
1      –          30               10                       0
2      1.8–2.7    40               Up to 12                 0
3      2.8–3.7    Up to 55         Up to 12                 0
4      3.8–4.7    Up to 70         Up to 14                 0
5      4.8–5.7    Up to 80         Up to 14                 In grade 5 and above, only 1 and only when needed.
6      5.8–6.7    Up to 80         Up to 14                 1
7      6.8–7.7    Up to 90         Up to 16                 1
8      7.8–8.7    Up to 90         Up to 16                 1
9      8.8–9.7    Up to 90         Up to 16                 1
10–12  9.8–10.7   Up to 100        Up to 16                 1
Accuracy of Content
Concepts and information presented in items are accurate, up-to-date, and
verifiable. This includes, but is not limited to, references, dates, events, and
locations.
Language Conventions
Grammar, usage, mechanics, and spelling conventions in all Star Reading items
adhere to the rules and guidelines in the approved content reference books.
Merriam-Webster’s 11th Edition is the reference for pronunciation and spelling.
The Chicago Manual of Style 17th Edition is the anchor reference for grammar,
mechanics, and usage.
Item Components
In addition to the guidelines outlined above, there are criteria that apply to
individual item components. The guidelines for passages are addressed above.
Specific considerations regarding stem and distractors are listed below.
The skills assessed by Star Reading are drawn from an overarching pool of skills known as the universal skills pool.
The universal skills pool contains the full range of skills reflected in state content
standards from all 50 US states and the District of Columbia from early literacy to
high-school level analysis and critique. The universal skills pool continues to grow
and evolve as state standards change and are updated. Learning progressions
are created by mapping the skills in the universal skills pool to different content
standards. Learning progressions define coherent and continuous pathways in
which students acquire knowledge and skills and present the knowledge and skills
in teachable orders that can be used to inform instructional decisions.
The first learning progression created for Star Reading was the Renaissance
Core Progress for Reading Learning Progression, which identifies a continuum of
reading skills that span from early literacy through high-school level analysis and
critique. It was developed in consultation with leading experts in early literacy and
reading by reviewing research and curricular documents and standards, including
the National Assessment of Education Progress (NAEP) Reading framework,
Texas Essential Knowledge and Skills, and state reading standards. The
Renaissance Core Progress for Reading Learning Progression is supported by
calibration data and psychometric analyses and is regularly refined and updated.
Item calibration data from Star Reading consistently show a strong correlation
between the rank ordering of skills in the Renaissance Core Progress for Reading
Learning Progression and the difficulty estimates of the Star Reading items written
to measure those skills.
Figure 1 illustrates the relationship between the sequential order of skills in the
Renaissance Core Progress for Reading Learning Progression and the average
difficulty of the Star Early Literacy and Star Reading items measuring that skill on
the Star Reading Unified scale. Each skill is represented by a single data point
with skills in each learning progression domain represented by different color
points. The figure shows that skills that are ordered later in the Renaissance Core
Progression for Reading Learning Progression are often more difficult than skills
that are represented earlier in the progression.
When a student completes a Star Reading assessment, the program uses that
student’s performance to place the student at the appropriate point in the learning
progression designated for that school. This learning progression is usually the
state-specific learning progression for the state in which the school is located.
Locating students in the learning progression helps teachers to identify the skills
that students are likely to have already learned and the skills they are ready
to learn next. It also indicates whether students are meeting the grade-level
performance expectations established by state content standards.
Background
Star Reading was initially published in 1996, and quickly became one of the first
applications of computerized adaptive testing (CAT) to educational assessment
at the primary and secondary school levels. Unlike other early CAT applications,
the initial version of Star Reading was not based on item response theory (IRT).
Instead, it was an instance of stratified adaptive testing (Weiss, 1973¹). The items
in its item bank were sorted into grade levels (strata) based on their vocabulary
levels. Examinees started the test at the stratum corresponding to their school
grade; an algorithm branched them to easier or more difficult levels, contingent on
their performance.
IRT was introduced in Version 2 of Star Reading. At that time, hundreds of new
test items were developed, and both the new and the original items from Version
1 were calibrated as to difficulty on a vertical scale using the Rasch model. Star
Reading uses the calibrated Rasch difficulty of the test items as the basis for
adaptive item selection. And it uses the Rasch difficulty of the items administered
to a student, along with the pattern of right and wrong answers, to calculate a
maximum likelihood estimate of the location of the student on the Rasch scale.
To provide continuity with the non-IRT score scale of Version 1, equipercentile
equating was used to transform the Rasch scores to the original Star Reading
score scale.
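The maximum likelihood step described above can be illustrated with a short, generic sketch. It is not Renaissance's implementation; the difficulties and responses are hypothetical, and it assumes the response pattern contains at least one correct and one incorrect answer (otherwise the maximum likelihood estimate is unbounded).

import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_ability(difficulties, responses, iterations=25):
    """Newton-Raphson maximum likelihood estimate of ability from 0/1 responses."""
    theta = 0.0
    for _ in range(iterations):
        probabilities = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probabilities))  # score function
        information = sum(p * (1.0 - p) for p in probabilities)          # test information
        theta += gradient / information
    return theta

# Hypothetical 5-item response pattern on items of known Rasch difficulty.
difficulties = [-1.2, -0.4, 0.3, 0.9, 1.5]
responses = [1, 1, 1, 0, 0]
print(round(estimate_ability(difficulties, responses), 2))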
Version 2’s Rasch model-based scale of item difficulty and student ability has
continued in use in all subsequent versions of Star Reading. This chapter begins
by presenting technical details of the development of that Rasch scale. Later, it
will describe improvements that have been made to the method of calibrating the
Rasch difficulty of new items. Finally, it will present details of the development of
a new scale for reporting Star Reading test scores—the Unified Score Scale, first
introduced in the 2017–2018 school year.
1. Weiss, D. J. (1973). The stratified adaptive computerized ability test (Research Report 73-3).
Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program. [Link]
This section describes the research that supported the IRT calibration of the Star
Reading 2 item bank, as well as the linkage of Star Reading 2 scores to the original
Star Reading 1 score scale. This research took place in two stages: item calibration
and score scale calibration. These are described in their respective sections
below.
Sample Description
The data collection phase of the Star Reading 2 calibration study began with a
total item pool of over 2100 items. A nationally representative sample of students
tested these items. A total of 27,807 students from 247 schools participated in the
item calibration study. Table 9 provides the numbers of students in each grade who
participated in the study.
                                                       Students
                                                       National %  Sample %
Geographic Region                  Northeast           20%         16%
                                   Midwest             24%         34%
                                   Southeast           24%         25%
                                   West                32%         25%
District Socioeconomic Status      Low: 31–100%        30%         28%
                                   Average: 15–30%     29%         26%
                                   High: 0–14%         31%         32%
                                   Non-Public          10%         14%
School Type & District Enrollment  Public: < 200       17%         15%
                                   Public: 200–499     19%         21%
                                   Public: 500–2,000   27%         25%
                                   Public: > 2,000     28%         24%
                                   Non-Public          10%         14%
Ethnic Group                       Asian               3%          3%
                                   Black               15%         13%
                                   Hispanic            12%         9%
                                   Native American     1%          1%
                                   White               59%         63%
                                   Unclassified        9%          10%
Item Presentation
For the calibration research study, seven levels of test booklets were constructed
corresponding to varying grade levels. Because reading ability and vocabulary
growth are much more rapid in the lower grades, only one grade was assigned
per test level for the first four levels of the test (through grade 4). As grade level
increases, there is more variation among both students and school curricula,
so a single test can cover more than one grade level. Grades were assigned to
test levels after extensive consultation with reading instruction experts as well as
considering performance data for items as they functioned in the Star Reading
1 test. Items were assigned to grade levels such that the resulting test forms
sampled an appropriate range of reading ability typically represented at or near the
targeted grade levels.
Grade levels corresponding to each of the seven test levels are shown in the first two
columns of Table 12. Students answered a set number of questions at their current
grade level, as well as a number of questions one grade level above and one grade
level below their grade level. Anchor items were included to support vertically scaling
the test across the seven test levels. Table 12 breaks down the composition of test
forms at each test level in terms of types and number of test questions, as well as the
number of calibration test forms at each level.
Table 12: Calibration Test Forms Design by Test Level, Star Reading 2
Calibration Study—Spring 1998
Test Level  Grade Levels  Items per Form  Anchor Items per Form  Unique Items per Form  Number of Test Forms
A           1             44              21                     23                     14
B           2             44              21                     23                     11
C           3             44              21                     23                     11
D           4             44              21                     23                     11
E           5–6           44              21                     23                     14
F           7–9           44              21                     23                     14
G           10–12         44              21                     23                     15
Each of the calibration test forms within a test level consisted of a set of 21
anchor items which were common across all test forms within a test level. Anchor
items consisted of items: a) on grade level, b) one grade level above, and c) one
grade level below the targeted grade level. The use of anchor items facilitated
equating of both test forms and test levels for purposes of data analysis and the
development of the overall score scale.
In addition to the anchor items, each form contained a set of 23 items that were unique
to that specific test form (within a level). Items were selected for a specific test
level based on Star Reading 1 grade level assignment, EDL vocabulary grade
designation, or expert judgment. To avoid problems with positioning effects
resulting from the placement of items within each test booklet form, items were
shuffled within each test form. This created two variations of each test form such
that items appeared in different sequential positions within each “shuffled” test
form. Since the final items would be administered as part of a computer-adaptive
test, it was important to remove any effects of item positioning from the calibration
data so that each item could be administered at any point during the test.
The number of field test forms constructed for each of the seven test levels is
shown in the last column of Table 12 (varying from 11–15 forms per level).
Calibration test forms were spiraled within a classroom such that each student
received a test form essentially at random. This design ensured that no more
than two or three students in any classroom attempted any particular tryout item.
Additionally, it ensured a balance of student ability across the various tryout forms.
Typically, 250–300 students at the designated grade level of the test item received
a given question on their test.
It is important to note that some performance data already existed for the majority
of the questions in the Star Reading 2 calibration study. All of the questions from
the Star Reading 1 item bank were included, as were many items that were
previously field tested, but were not included in the Star Reading 1 test.
Following extensive quality control checks, the Star Reading 2 calibration research
item response data were analyzed, by level, using both traditional item analysis
techniques and IRT methods. For each test item, the following information was derived
using traditional psychometric item analysis techniques:
X The number of students who attempted to answer the item
X The number of students who did not attempt to answer the item
X The percentage of students who answered the item correctly (a traditional
measure of difficulty)
X The percentage of students who selected each answer choice
X The correlation between answering the item correctly and the total score
(a traditional measure of item discrimination)
X The correlation between the endorsement of an alternative answer and the
total score
Item Difficulty
The difficulty of an item, in traditional item analysis, is the percentage of students
who answer the item correctly. This is typically referred to as the “p-value” of the
item. Low p-values (such as 15 percent) indicate that the item is difficult since only
a small percentage of students answered it correctly. High p-values (such as 90
percent) indicate that almost all students answered the item correctly, and thus the
item is easy. It should be noted that the p-value only has meaning for a particular
item relative to the characteristics of the sample of students who responded to it.
Item Discrimination
The traditional measure of the discrimination of an item is the correlation between
the “score” on the item (correct or incorrect) and the total test score. Items that
correlate well with total test score also tend to correlate well with one another and
produce a test that has more reliable scores (more internally consistent). For the
correct answer, the higher the correlation between item score and total score, the
better the item is at discriminating between low scoring and high scoring students.
Such items generally will produce optimal test performance. When the correlation
between the correct answer and total test score is low (or negative), it typically
indicates that the item is not performing as intended. The correlation between
endorsing incorrect answers and total score should generally be low since there
should not be a positive relationship between selecting an incorrect answer and
scoring higher on the overall test.
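The two traditional statistics just described, the p-value and the item-total correlation, can be computed directly from a matrix of scored responses. The sketch below is a generic illustration with made-up data; it is not the analysis code used in the calibration study.

import numpy as np

def traditional_item_statistics(scores):
    """scores: array of shape (n_students, n_items) with entries 0 (wrong) or 1 (right)."""
    p_values = scores.mean(axis=0)        # difficulty: proportion answering each item correctly
    total_scores = scores.sum(axis=1)     # each student's total test score
    discriminations = np.array([
        np.corrcoef(scores[:, j], total_scores)[0, 1]  # item-total (point-biserial) correlation
        for j in range(scores.shape[1])
    ])
    return p_values, discriminations

# Tiny made-up data set: 6 students answering 3 items.
scores = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])
p_values, discriminations = traditional_item_statistics(scores)
print(p_values.round(2), discriminations.round(2))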
IRT attempts to model quantitatively what happens when a student with a specific
level of ability attempts to answer a specific question. IRT calibration places the
item difficulty and student ability on the same scale; the relationship between them
can be represented graphically in the form of an item response function (IRF),
which describes the probability of answering an item correctly as a function of the
student’s ability and the difficulty of the item.
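Although the manual describes the IRF graphically, under the Rasch model used to calibrate Star Reading items the IRF takes the standard logistic form:

P\!\left(X_{ij} = 1 \mid \theta_i, b_j\right) = \frac{e^{\,\theta_i - b_j}}{1 + e^{\,\theta_i - b_j}}

where \theta_i is the ability of student i and b_j is the difficulty of item j. When \theta_i = b_j the probability is exactly 0.5, which is why each item's difficulty is the point on the ability scale at which the expected percent correct is 50.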
Figure 2 is a plot of three item response functions: one for an easy item, one for
a more difficult one, and one for a very difficult item. Each plot is a continuous
S-shaped (ogive) curve. The horizontal axis is the scale of student ability, ranging
from very low ability (–5.0 on the scale) to very high ability (+5.0 on the scale). The
vertical axis is the percent of students expected to answer each of the three items
correctly at any given point on the ability scale. Notice that the expected percent
correct increases as student ability increases, but varies from one item to another.
In Figure 2, each item’s difficulty is the scale point where the expected percent
correct is exactly 50. These points are depicted by vertical lines going from the 50
percent point to the corresponding locations on the ability scale. The easiest item
has a difficulty scale value of about –1.67; this means that students located at –1.67
on the ability scale have a 50-50 chance of answering that item right. The scale
values of the other two items are approximately +0.20 and +1.25, respectively.
Calibration of test items estimates the IRT difficulty parameter for each test
item and places all of the item parameters onto a common scale. The difficulty
parameter for each item is estimated, along with measures to indicate how well the
item conforms to (or “fits”) the theoretical expectations of the presumed IRT model.
Also plotted in Figure 2 are “empirical item response functions (EIRF)”: the actual
percentages of correct responses of groups of students to all three items. Each
group is represented as a small triangle, circle, or diamond. Each of those geometric
symbols is a plot of the percent correct against the average ability level of the group.
Ten groups’ data are plotted for each item; the triangular points represent the groups
responding to the easiest item. The circles and diamonds, respectively, represent
the groups responding to the moderate and to the most difficult item.
Figure 2: Example of Item Statistics Database Presentation of Information
For purposes of the Star Reading 2 calibration research, two different “fit”
measures (both unweighted and weighted) were computed. Additionally, if the IRT
model is functioning well, then the EIRF points should approximate the (estimated)
theoretical IRF. Thus, in addition to the traditional item analysis information, the
following IRT-related information was determined for each item administered
during the calibration research analyses:
X The IRT item difficulty parameter
X The unweighted measure of fit to the IRT model
X The weighted measure of fit to the IRT model
X The theoretical and empirical IRF plots
Items were eliminated when they met one or more of the following criteria:
X Item-total correlation (item discrimination) was < 0.30
X An incorrect answer option had a high correlation with the total test score
X Sample size of students attempting the item was less than 300
X The traditional item difficulty indicated that the item was too difficult or too easy
X The item did not appear to fit the Rasch model
For Star Reading version 2, after each content reviewer had designated certain
items for elimination, their recommendations were combined and a second review
was conducted to resolve issues where there was not uniform agreement among
all reviewers.
Of the initial 2100+ items administered in the Star Reading 2 calibration research
study, 1,409 were deemed of sufficient quality to be retained for further analyses.
Traditional item-level analyses were conducted again on the reduced data set that
excluded the eliminated items. IRT calibration was also performed on the reduced
data set and all test forms and levels were equated based on the information
provided by the embedded anchor items within each test form. This resulted in
placing the IRT item difficulty parameters for all items onto a single scale spanning
grades 1–12.
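One common way to carry out this kind of anchor-based equating under the Rasch model is to shift one form's difficulties by the mean difference observed on the shared anchor items ("mean/mean" linking). The sketch below illustrates that general idea with hypothetical values; it is an assumption about method, not a description of the actual analysis performed for Star Reading 2.

def linking_constant(anchor_difficulties_base, anchor_difficulties_new):
    """Shift that places the new form's difficulty scale onto the base form's scale."""
    mean_base = sum(anchor_difficulties_base) / len(anchor_difficulties_base)
    mean_new = sum(anchor_difficulties_new) / len(anchor_difficulties_new)
    return mean_base - mean_new

def rescale(difficulties, shift):
    """Apply the linking shift to a form's item difficulties."""
    return [b + shift for b in difficulties]

# Hypothetical anchor items calibrated separately on two forms.
anchor_on_base_form = [-0.8, -0.2, 0.5, 1.1]
anchor_on_new_form = [-1.1, -0.5, 0.2, 0.8]   # same items, on the new form's local scale
shift = linking_constant(anchor_on_base_form, anchor_on_new_form)
print(rescale([-0.3, 0.4, 1.6], shift))       # unique new-form items, rescaled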
Table 13 summarizes the final analysis information for the test items included
in the calibration test forms by test level (A–G). As shown in the table, the item
placements in test forms were appropriate: the average percentage of students
correctly answering items is relatively constant across test levels. Note, however,
that the average scaled difficulty of the items increases across successive levels
of the calibration tests, as does the average scaled ability of the students who
answered questions at each test level. The median point-biserial correlation, as
shown in the table, indicates that the test items were performing well.
Table 13: Calibration Test Item Summary Information by Test Level, Star Reading 2 Calibration
Study—Spring 1998
This continuous Rasch score scale is very different from the Scaled Score metric
used in Star Reading version 1. Star Reading version 1 scaled scores ranged
from 50–1,350, in integer units. The relationship of those scaled scores to the IRT
ability scale introduced in Star Reading version 2 was expected to be direct, but
not necessarily linear. For continuity between Star Reading 1 and Star Reading 2
scoring, it was desirable to be able to report Star Reading 2 scores on the same
scale used in Star Reading 1. To make that possible, a scale linking study was
undertaken in conjunction with Star Reading 2 norming. At every grade from
1–12, a portion of the norming sample was asked to take both versions of the Star
Reading test: versions 1 and 2. The test score data collected in the course of the
linking study were used to link the two scales, providing a conversion table for
transforming Star Reading 2 ability scores into equivalent Star Reading 1 Scaled
Scores.
From around the country and spanning all 12 grades, 4,589 students participated
in the linking study. Linking study participants took both Star Reading 1 and Star
Reading 2 tests within a few days of each other. The order in which they took the
two test versions was counterbalanced to account for the effects of practice and
fatigue. Test score data collected were edited for quality assurance purposes,
and 38 cases with anomalous data were eliminated from the linking analyses;
the linking was accomplished using data from 4,551 cases. The linking of the two
score scales was accomplished by means of an equipercentile equating involving
all 4,551 cases, weighted to account for differences in sample sizes across
grades. The resulting table of 99 sets of equipercentile equivalent scores was then
smoothed using a monotonic spline function, and that function was used to derive
a table of Scaled Score equivalents corresponding to the entire range of IRT
ability scores observed in the norming study. These Star Reading 2 Scaled Score
equivalents range from 0–1400; the same scale has been used for all subsequent
Star Reading versions, from version 3 to the present.
Summary statistics of the test scores of the 4,551 cases included in the linking
analysis are listed in Table 14. The table lists actual Star Reading 1 Scaled
Score means and standard deviations, as well as the same statistics for Star
Reading 2 IRT ability estimates and equivalent Scaled Scores calculated using
the conversion table from the linking study. Comparing the Star Reading 1 Scaled
Score means to the IRT ability score means illustrates how different the two
metrics are.
Comparing the Star Reading 1 Scaled Score means to the Star Reading 2
Equivalent Scale Scores in the rightmost two columns of Table 14 illustrates how
successful the scale linking was.
Table 14: Summary Statistics of Star Reading 1 and 2 Scores from the Linking Study, by Grade—
Spring 1999 (N = 4,551 Students)
Data from the linking study made it clear that Star Reading 2 software measures
ability levels extending beyond the minimum and maximum Star Reading 1 Scaled
Scores. In order to retain the superior bandwidth of Star Reading 2 software,
extrapolation procedures were used to extend the Scaled Score range below 50
and above 1,350; the range of reported scale scores for Star Reading versions 2
and later is 0 to 1400 for the Enterprise Scale. The Unified Scale reports scores
that range from 600 to 1400.
Both traditional and IRT item analyses are conducted on the item response data
collected. The traditional analyses yield proportion-correct statistics, as well as
biserial and point-biserial correlations between scores on the new items and actual
scores on the Star Reading tests. The IRT analyses differ from those used in
the calibration of Star Reading 2 items, in that the relationships between scores
on each new item and the actual Star Reading scores are used to calibrate the
Rasch difficulty parameters.
For dynamic calibration, a minimum of 1,000 responses per item is the data
collection target. In practice, because of the very large number of Star Reading
tests administered each year, the average number of students responding to
each new test item is typically several times the target. The calibration analysis
proceeds one item at a time, using SAS/STAT™ software to estimate the threshold
(difficulty) parameter of every new item by calculating the non-linear regression
of each new item score (0 or 1) on the Star Reading Rasch ability estimates.
The accuracy of the non-linear regression approach has been corroborated by
conducting parallel analyses using Winsteps software. In tests, the two methods
yielded virtually identical results.
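The sketch below illustrates the general idea of estimating a single new item's Rasch difficulty from operational ability estimates and 0/1 item scores. The function names, the simulated responses, and the use of SciPy's bounded scalar optimizer are assumptions made for this illustration; it is not the SAS/STAT or Winsteps procedure referenced above.

import numpy as np
from scipy.optimize import minimize_scalar

def rasch_p(theta, b):
    """Rasch model: probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def estimate_difficulty(theta, responses):
    """Find the difficulty b that best fits the observed 0/1 responses (maximum likelihood)."""
    def neg_log_likelihood(b):
        p = np.clip(rasch_p(theta, b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    result = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
    return result.x

# Simulated field-test data for one new item (about 1,000 responses, the stated target).
rng = np.random.default_rng(42)
abilities = rng.normal(0.0, 1.2, size=1000)   # operational Rasch ability estimates
true_b = 0.8
answers = (rng.random(1000) < rasch_p(abilities, true_b)).astype(int)
print(round(estimate_difficulty(abilities, answers), 2))  # should land near 0.8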
Table 15 summarizes the final analysis information for the 854 new test items
introduced in Star Reading Version 4.3, in 2007, by the target grades tagged to
each item. Since that time, several thousand more Star Reading items have gone
through dynamic calibration; currently the Star Reading operational item bank
contains more than 6,000 items.
Table 15: Calibration Test Item Summary Information by Test Item Grade Level, Star Reading 4.3
Calibration Study–Fall 2007
During a Star Reading test, a student may be “routed” to items at the lowest
reading level or to items at higher reading levels within the overall pool of items,
depending on the student’s unfolding performance during the testing session. In
general, when an item is answered correctly, the student is then given a more
difficult item. When an item is answered incorrectly, the student is then given
an easier item. Item difficulty here is defined by results of the Star Reading item
calibration studies.
Students who have not taken a Star Reading test within six months initially receive
an item whose difficulty level is relatively easy for students at the examinee’s
grade level. The selection of an item that is a bit easier than average minimizes
any effects of initial anxiety that students may have when starting the test and
helps the student settle into the test. These starting
points vary by grade level and were based on research conducted as part of the
national item calibration study.
When a student has taken a Star Reading test within the last 120 days, the difficulty of
the first item depends on that student’s previous Star Reading test score information.
After the administration of the initial item, and after the student has entered an answer,
Star Reading software estimates the student’s reading ability. The software then
selects the next item randomly from among all of the items available that closely
match the student’s estimated reading ability.
Randomization of items with difficulty values near the student’s adjusted reading
ability allows the program to avoid overexposure of test items. Items that have
been administered to the same student within the past 120 days are not available
for administration. The large numbers of items available in the item pools,
however, ensure that this constraint has negligible impact on the quality of each
Star Reading computer-adaptive test.
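A minimal sketch of this selection logic is shown below: rank the calibrated items by closeness to the current ability estimate, drop items the student has seen within the past 120 days, and choose at random among the nearest candidates. The difficulty window, the number of candidates, and the data structures are illustrative assumptions, not the operational algorithm.

import random
from dataclasses import dataclass

@dataclass
class Item:
    item_id: int
    difficulty: float  # Rasch difficulty from the calibration studies

def select_next_item(item_bank, ability_estimate, recently_seen_ids,
                     window=0.25, n_candidates=10):
    """Return a randomly chosen item with difficulty near the student's estimated ability."""
    eligible = [it for it in item_bank if it.item_id not in recently_seen_ids]
    # Rank eligible items by closeness to the current ability estimate ...
    eligible.sort(key=lambda it: abs(it.difficulty - ability_estimate))
    # ... then randomize among the closest candidates to limit item exposure.
    candidates = [it for it in eligible[:n_candidates]
                  if abs(it.difficulty - ability_estimate) <= window] or eligible[:n_candidates]
    return random.choice(candidates)

bank = [Item(i, d) for i, d in enumerate([-2.0, -1.2, -0.4, 0.0, 0.3, 0.7, 1.1, 1.8, 2.5])]
print(select_next_item(bank, ability_estimate=0.5, recently_seen_ids={4}))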
This approach to scoring enables Star Reading to provide Scaled Scores that
are statistically consistent and efficient. Accompanying each Scaled Score is an
associated measure of the degree of uncertainty, called the conditional standard
error of measurement (CSEM). The CSEM values for the Star Reading test are
unique for each student. CSEM values are dependent on the particular items the
student received and on the student’s performance on those items.
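The sketch below shows one common way such a conditional SEM can be computed under the Rasch model, as the reciprocal square root of the test information for the specific items a student received. It is a textbook illustration under that assumption, not necessarily Renaissance's production scoring routine.

import numpy as np

def rasch_csem(ability_estimate, item_difficulties):
    """CSEM of a maximum-likelihood Rasch ability estimate: 1 / sqrt(test information)."""
    p = 1.0 / (1.0 + np.exp(-(ability_estimate - np.asarray(item_difficulties))))
    test_information = np.sum(p * (1.0 - p))   # item information summed over administered items
    return 1.0 / np.sqrt(test_information)

# Two students with the same ability estimate but different administered items:
print(round(rasch_csem(0.5, [-0.5, 0.0, 0.4, 0.6, 1.0]), 2))   # well-targeted items -> smaller CSEM
print(round(rasch_csem(0.5, [-3.0, -2.5, 2.8, 3.2, 3.5]), 2))  # poorly targeted items -> larger CSEM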
Scaled Scores are expressed on a common scale that spans all grade levels
covered by Star Reading (grades K–12). Because of this common scale, Scaled
Scores are directly comparable with each other, regardless of grade level. Other
scores, such as Percentile Ranks and Grade Equivalents, are derived from the
Scaled Scores.
A common scale was needed that could be used to report scores on both tests.
Such a scale, the Unified Score Scale, was developed and introduced in the
2017–2018 school year as an optional alternative scale for reporting achievement
on both tests. The Unified Scale became the default scale for reporting test results
starting in the 2022–2023 school year.
The Unified Score Scale is derived from the Star Reading Rasch scale of ability
and difficulty, which was first introduced with the development of Star Reading
Version 2.
The unified Star Early Learning scale was developed by performing the following
steps:
X The Rasch scale used by Star Early Literacy was linked (transformed) to the
Star Reading Rasch scale.
X A linear transformation of the transformed Rasch scale was developed that
spans the entire range of knowledge and skills measured by both Star Early
Literacy and Star Reading.
1. The Rasch scale used by Star Early Literacy was linked to the Star Reading
Rasch scale.
In this step, a linear transformation of the Star Early Literacy Rasch scale to
the Rasch scale used by Star Reading was developed, using a method for
linear equating of IRT (item response theory) scales described by Kolen and
Brennan (2004, pages 162–165).
2. Because Rasch scores are expressed as decimal fractions, and may be either
negative or positive, a more user-friendly scale score was developed that uses
positive integer numbers only. A linear transformation of the extended Star
Reading Rasch scale was developed that spans the entire range of knowledge
and skills measured by both Star Early Literacy and Star Reading. The
transformation formula is as follows:
Unified Scale Score = INT (42.93 * Star Reading Rasch Score + 958.74)
where the Star Reading Rasch score has been extended downwards to values
as low as –20.00.
The transformation was anchored at two points:
X The minimum Star Early Literacy (SEL) scale score of 300 was set equal to 200
on the Unified Scale.
X A Star Reading (SR) scale score of 0 was set equal to 600 on the Unified Scale.
The Unified Scale uses integer scale scores. Scale scores from 200 to 1400
correspond respectively to the lowest current SEL scale score of 300 and a point
slightly higher than the highest current SR scale score of 1400.
Further details of the transformation of SEL Rasch scores to the SR Rasch scale
may be found in the 2018 edition of the Star Early Literacy Technical Manual.
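Applying the transformation given above is straightforward; the sketch below wraps it in a small helper. The clipping of results to the 600–1400 range reported for Star Reading Unified scores is an assumption made for this illustration, as is the helper's name.

def rasch_to_unified(star_reading_rasch_score: float) -> int:
    """Unified Scale Score = INT(42.93 * Rasch score + 958.74), per the formula above."""
    unified = int(42.93 * star_reading_rasch_score + 958.74)
    # Star Reading Unified scores are reported in the 600-1400 range; whether
    # out-of-range values are clipped is an assumption for this sketch.
    return max(600, min(1400, unified))

print(rasch_to_unified(0.0))    # a Rasch score of 0 maps to roughly 958
print(rasch_to_unified(-8.36))  # very low Rasch scores floor at the minimum reported score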
Table 16 contains a table of selected Star Reading Rasch ability scores and their
equivalents on the Star Reading and Unified Score scales.
Table 16: Some Star Reading Rasch Scores and Their Equivalents on the Star
Reading and Unified Score Scales
In a computerized adaptive test such as Star Reading, content varies from one
administration to another, and it also varies with each student’s performance.
Another feature of computerized adaptive tests based on Item Response Theory
(IRT) is that the degree of measurement error can be expressed for each student’s
test individually.
The Star Reading tests provide two ways to evaluate the reliability of scores:
reliability coefficients, which indicate the overall precision of a set of test scores,
and conditional standard errors of measurement (CSEM), which provide an
index of the degree of error in an individual test score. A reliability coefficient is a
summary statistic that reflects the average amount of measurement precision in a
specific examinee group or in a population as a whole. In Star Reading, the CSEM
is an estimate of the measurement error in each individual test score. While a reliability
coefficient is a single value that applies to the test in general, the magnitude of the
CSEM may vary substantially from one person’s test score to another’s.
The reliability and measurement error presentation is divided into two sections
below: First is a section describing the reliability coefficients, standard errors of
measurement, and decision accuracy and consistency indices for the 34-item
Star Reading tests. Second, another brief section presents reliability coefficients,
standard errors of measurement, and decision accuracy and consistency indices
for the 25-item Star Reading progress monitoring tests.
Generic reliability is estimated as

reliability = 1 – (σ²error / σ²total)

where σ²error is the variance of the errors of measurement and σ²total is the
variance of test scores. In Star Reading, the variance of the test scores is easily
calculated from Scaled Score data. The variance of the errors of measurement
may be estimated from the conditional standard error of measurement (CSEM)
statistics that accompany each of the IRT-based test scores, including the Scaled
Scores, as depicted below:

σ²error = (1/n) Σ (CSEMi)²
where the summation is over the squared values of the reported CSEM for
students i = 1 to n. In each Star Reading test, CSEM is calculated along with the
IRT ability estimate and Scaled Score. Squaring and summing the CSEM values
yields an estimate of total squared error; dividing by the number of observations
yields an estimate of mean squared error, which in this case is tantamount to error
variance. “Generic” reliability is then estimated by calculating the ratio of error
variance to Scaled Score variance, and subtracting that ratio from 1.
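The sketch below carries out this computation on simulated data: error variance is estimated as the mean of the squared CSEMs, and generic reliability is one minus the ratio of error variance to observed score variance. The simulated score and CSEM distributions are placeholders chosen only to be of the same general magnitude as the values reported below.

import numpy as np

def generic_reliability(scaled_scores, csems):
    score_variance = np.var(scaled_scores, ddof=1)   # variance of reported Scaled Scores
    error_variance = np.mean(np.square(csems))       # mean squared CSEM ~ error variance
    return 1.0 - error_variance / score_variance

rng = np.random.default_rng(1)
scores = rng.normal(900, 120, size=10_000)           # simulated Unified scale scores
csems = rng.normal(17, 2, size=10_000)               # simulated CSEMs near the reported average
print(round(generic_reliability(scores, csems), 3))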
Using this technique with the Star Reading 2018–2019 school year data resulted
in the generic reliability estimates shown in Table 17 and Table 18 on page
53. Because this method is not susceptible to error variance introduced by
repeated testing, multiple occasions, and alternate forms, the resulting estimates
of reliability are generally higher than the more conservative alternate forms
reliability coefficients. These generic reliability coefficients are, therefore, plausible
upper-bound estimates of the internal consistency reliability of the Star Reading
computer-adaptive test.
Generic reliability estimates for scores on the Unified score scale are shown
in Table 17; Table 18 lists the reliability estimates for the older Star Reading
“Enterprise” scale scores. Results in Table 17 indicate that the overall reliability of
the Unified scale scores was about 0.98. Coefficients ranged from a low of 0.94
in grade 5 to a high of 0.97 in grade K. Results based on the Enterprise Scale in
Table 18 are slightly lower overall.
As both tables show, Star Reading reliability is quite high, grade by grade and
overall. Star Reading also demonstrates high test-retest consistency as shown
in the rightmost columns of the same tables. Star Reading’s technical quality
for an interim assessment is on a virtually equal footing with the highest-quality
summative assessments in use today.
Split-Half Reliability
While generic reliability does provide a plausible estimate of measurement
precision, it is a theoretical estimate, as opposed to traditional reliability
coefficients, which are more firmly based on item response data. Traditional
internal consistency reliability coefficients such as Cronbach’s alpha and Kuder-
Richardson Formula 20 (KR-20) are not meaningful for adaptive tests. However,
an estimate of internal consistency reliability can be calculated using the split-half
method.
A split-half reliability coefficient is calculated in three steps. First, the test is divided
into two halves, and scores are calculated for each half. Second, the correlation
between the two resulting sets of scores is calculated; this correlation is an
estimate of the reliability of a half-length test. Third, the resulting reliability value
is adjusted, using the Spearman-Brown formula, to estimate the reliability of the
full-length test.
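A minimal sketch of these three steps is shown below. Because Star Reading is adaptive, the operational split-half analysis would score each half with an ability estimate rather than a raw sum; the fixed-form simulation and odd/even split used here are simplifying assumptions meant only to illustrate the correlation and Spearman-Brown steps.

import numpy as np

def split_half_reliability(item_scores_by_student):
    """item_scores_by_student: 2-D array (students x items) of 0/1 item scores."""
    odd = item_scores_by_student[:, 0::2].sum(axis=1)    # score on one half
    even = item_scores_by_student[:, 1::2].sum(axis=1)   # score on the other half
    r_half = np.corrcoef(odd, even)[0, 1]                # reliability of a half-length test
    return 2 * r_half / (1 + r_half)                     # Spearman-Brown full-length estimate

rng = np.random.default_rng(2)
ability = rng.normal(size=5_000)
difficulties = rng.normal(0, 1, 34)                      # a 34-item simulated test
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulties)))
responses = (rng.random((5_000, 34)) < p_correct).astype(int)
print(round(split_half_reliability(responses), 3))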
Results indicated that the overall split-half reliability of the Unified scores was 0.98.
The coefficients ranged from a low of 0.94 in grades 4 to 8 to a high of 0.96 in
grade 1. On the Enterprise Scale, the overall split-half reliability of the Enterprise
scores was 0.97. The coefficients ranged from a low of 0.92 in grades 4 and 5 to a
high of 0.95 in grades K, 1, and 12. These reliability estimates are quite consistent
across grades 1-12, and quite high, again a result of the measurement efficiency
inherent in the adaptive nature of the Star Reading test.
The alternate form reliability study provided estimates of Star Reading reliability
using a variation of the test-retest method. In the traditional approach to test-retest
reliability, students take the same test twice, with a short time interval, usually a
few days, between administrations. In contrast, the Star Reading alternate form
reliability study administered two different tests by excluding from the second test
any items the student had encountered in the first test. All other aspects
of the two tests were identical. The correlation coefficient between the scores on
the two tests was taken as the reliability estimate.
The alternate form reliability estimates for the Star Reading test were calculated
using both the Star Reading Unified scaled scores and the Enterprise scaled
scores. Checks were made for valid test data on both test administrations and to
remove cases of apparent motivational discrepancies.
Table 17 and Table 18 include overall and within-grade alternate form reliability, along
with an indication of the average number of days between testing occasions. The
average number of days between testing occasions ranged from 91–130 days.
Results indicated that the overall reliability of the scores on the Unified scale was
about 0.93. The alternate form coefficients ranged from a low of 0.73 in grade K
to a high of 0.87 in grade 9. Results for the Enterprise scale were similar to those
of the Unified Scale with an overall reliability of 0.93; its alternate form coefficients
ranged from a low of 0.76 in grade K to a high of 0.88 in grades 8, 9, and 10.
Table 17: Reliability Estimates from the Star Reading 2018–2019 Data on the Unified Scale
Table 18: Reliability Estimates from the Star Reading 2018–2019 Data on the Enterprise Scale
The increased length of the current version of Star Reading, combined with its
increased breadth of skills coverage and enhanced technical quality, was expected
to result in improved measurement precision; this is reflected in slightly higher
reliability, for both internal consistency and alternate forms, as
shown in the tables above. For comparison, see Table 22 on page 60 and Table
23 on page 61.
The Star Reading tests differ from traditional tests in at least two respects with
regard to the standard error of measurement. First, Star Reading software
computes the SEM for each individual student based on his or her performance,
unlike most traditional tests that report the same SEM value for every examinee.
Each administration of Star Reading yields a unique “conditional” SEM (CSEM)
that reflects the amount of information estimated to be in the specific combination
of items that a student received in his or her individual test. Second, because
the Star Reading test is adaptive, the CSEM will tend to be lower than that of
a conventional test, particularly at the highest and lowest score levels, where
conventional tests’ measurement precision is weakest. Because the adaptive
testing process attempts to provide equally precise measurement, regardless of
the student’s ability level, the average CSEMs for the IRT ability estimates are very
similar for all students.
Table 19 and Table 20 contain two different sets of estimates of Star Reading
measurement error: conditional standard error of measurement (CSEM) and
global standard error of measurement (SEM). Conditional SEM was just described;
the estimates of CSEM in Table 19 and Table 20 are the average CSEM values
observed for each grade.
The global SEM is computed from the reliability coefficient and the standard
deviation of the observed scores:

SEM = SQRT(1 – ρ) σx

where ρ is the reliability coefficient and σx is the standard deviation of the
Scaled Scores.
Table 19 and Table 20 summarize the distribution of CSEM values for the 2018–
2019 data, overall and by grade level. The overall average CSEM on the Unified
scale across all grades was 17 scaled score units and ranged from a low of 16 in
grades 1–3 to a high of 17 in grades K and 4–12 (Table 19). The average CSEM
based on the Unified scale is similar across all grades. The overall average unified
scale score global SEM was 18, slightly higher than the average CSEM. Table 20
shows the average CSEM values on the Enterprise Star Reading scale. Although
the adaptive testing process attempts to provide equally precise measurement,
regardless of the student’s ability level, and the average CSEMs for the IRT ability
estimates are very similar for all students, the transformation of the Star Reading
IRT ability estimates into equivalent Scaled Enterprise Scores is not linear and the
resulting SEMs in the Enterprise Scaled Score metric are less similar.
The overall average CSEM on the Enterprise scale across all grades was 54
scaled score units and ranged from a low of 20 in kindergarten to a high of 71
in grade 8. Unlike the Unified scale, the Enterprise Scale CSEM values vary by
grade and increased with grade until grade 8. The global SEMs for the Enterprise
scale scores were higher at each grade, and overall, than the average CSEMs; the
overall average SEM was 56. This is attributable to the nonlinear transformation of
the Star Reading IRT ability estimates into equivalent Enterprise Scaled Scores.
The Unified scale, in contrast, is based on a linear transformation of the IRT ability
estimates; it eliminates the issues of variable and large CSEM values that are an
artifact of the Enterprise Scaled Score nonlinear transformation.
Table 19: Standard Error of Measurement for the 2018–2019 Star Reading
Data on the Unified Scale
Table 20: Standard Error of Measurement for the 2018–2019 Star Reading
Data on the Enterprise Scale
Decision accuracy and consistency are computed from an N × C matrix P of
expected classification probabilities, with one row per examinee and one column
per performance category:

P = [ p̂ic ], i = 1, …, N; c = 1, …, C

with the expected probability p̂ic in the matrix estimated as

p̂ic = ϕ(κic, κi(c+1), θ̂i, σθ̂i)

where ϕ(a, b, μ, σ) is the area from a to b under a normal curve with a mean of
μ and a standard deviation of σ, θ̂i is examinee i's IRT ability estimate, σθ̂i is the
corresponding CSEM for the ability estimate θ̂i, and κic and κi(c+1) are cut scores,
with κi1 = –∞, κi2 being the cut score separating performance categories 1 and 2,
κi3 being the cut score separating performance categories 2 and 3, and so on, with
the last cut score κi(C+1) = ∞. The W matrix of weights has the same dimensions
as P; the weight wic equals 1 if the student was classified into performance level
category c based on their ability estimate and 0 otherwise.
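The sketch below turns these definitions into numbers: each row of P is the set of normal-curve areas between adjacent cut scores, accuracy averages the expected probability of the category each student was actually assigned to (the element-wise product of W and P), and consistency averages the probability that two independent administrations would agree (the sum of squared row entries of P). The cut scores, CSEM value, and simulated abilities are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def expected_probabilities(theta_hat, csem, cuts):
    """One row of P: probability of falling in each category, given theta_hat and its CSEM."""
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    return np.diff(norm.cdf(bounds, loc=theta_hat, scale=csem))

def accuracy_and_consistency(theta_hats, csems, cuts):
    P = np.array([expected_probabilities(t, s, cuts) for t, s in zip(theta_hats, csems)])
    observed_cat = np.digitize(theta_hats, cuts)          # category implied by the point estimate
    W = np.eye(len(cuts) + 1)[observed_cat]               # indicator matrix of observed categories
    accuracy = np.mean(np.sum(W * P, axis=1))
    consistency = np.mean(np.sum(P ** 2, axis=1))
    return accuracy, consistency

rng = np.random.default_rng(3)
thetas, sems = rng.normal(0, 1, 2_000), np.full(2_000, 0.35)
print(accuracy_and_consistency(thetas, sems, cuts=np.array([-1.28, -0.67, -0.25])))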
Results indicate that decision accuracy and consistency were quite high overall
and across grades. For PR10, decision accuracy ranged from a low of 0.95 to
a high of 0.99, while decision consistency ranged from 0.93 to 0.99. For PR25,
decision accuracy ranged from a low of 0.93 to a high of 0.97, while decision
consistency ranged from 0.90 to 0.96. For PR40, decision accuracy ranged from a
low of 0.92 to a high of 0.95, while decision consistency ranged from 0.89 to 0.93.
Decision accuracy when using all three benchmarks together ranged from a low
of 0.81 to a high of 0.93, while decision consistency ranged from a low of 0.74 to
a high of 0.89. These are high levels of decision accuracy and consistency when
making classification decisions based on each individual benchmark or all three
benchmarks together, and support using Star Reading in RTI/MTSS frameworks.
Table 21: Decision Accuracy and Consistency for Different Benchmarks Based on 2018–2019 Star
Reading Tests
                     Decision Accuracy                               Decision Consistency
Grade      N         PR10   PR25   PR40   All 3 Benchmarks           PR10   PR25   PR40   All 3 Benchmarks
K 50,000 0.99 0.97 0.95 0.92 0.99 0.96 0.93 0.89
1 1,000,000 0.99 0.96 0.94 0.89 0.98 0.94 0.92 0.85
2 1,000,000 0.97 0.95 0.94 0.86 0.96 0.93 0.91 0.81
3 1,000,000 0.97 0.94 0.94 0.84 0.94 0.92 0.90 0.79
4 1,000,000 0.97 0.94 0.92 0.83 0.95 0.92 0.89 0.77
5 1,000,000 0.96 0.93 0.92 0.82 0.95 0.90 0.89 0.75
6 1,000,000 0.95 0.93 0.92 0.81 0.94 0.90 0.89 0.74
7 1,000,000 0.96 0.93 0.92 0.81 0.94 0.90 0.89 0.75
8 1,000,000 0.96 0.93 0.92 0.81 0.94 0.90 0.89 0.74
9 500,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
10 500,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
11 200,000 0.95 0.93 0.93 0.81 0.93 0.90 0.90 0.74
12 200,000 0.95 0.93 0.93 0.81 0.93 0.90 0.91 0.74
Overall 9,450,000 0.96 0.94 0.93 0.83 0.95 0.91 0.90 0.77
Reliability Coefficients
Table 22 and Table 23 show the reliability estimates of the Star Reading progress
monitoring test on both the Unified scale and the Enterprise scale using data from
the 2017–2018 and 2018–2019 school years.
Table 22: Reliability Estimates from the 2017–2018 and 2018–2019 Star
Reading Progress Monitoring Tests on the Unified Scale
Table 23: Reliability Estimates from the 2017–2018 and 2018–2019 Star
Reading Progress Monitoring Tests on the Enterprise Scale
The progress monitoring Star Reading reliability estimates are also quite high and
consistent across grades 1–12, for a test composed of only 25 items.
Overall, these coefficients also compare very favorably with the reliability
estimates provided for other published reading tests, which typically contain far
more items than the 25-item Star Reading progress monitoring tests. The Star
Reading progress monitoring test’s high reliability with minimal testing time is
a result of careful test item construction and an effective and efficient adaptive-
branching procedure.
Table 24: Estimates of 2017–2018 and 2018–2019 Star Reading Progress Monitoring Measurement
Precision by Grade and Overall, on the Unified Scale
Table 25: Estimates of 2017–2018 and 2018–2019 Star Reading Progress Monitoring Measurement
Precision by Grade and Overall, on the Enterprise Scale
Table 26: Decision Accuracy and Consistency for Different Benchmarks Based on 2017–2018 and
2018–2019 Star Reading Progress Monitor Tests
                     Decision Accuracy                               Decision Consistency
Grade      N         PR10   PR25   PR40   All 3 Benchmarks           PR10   PR25   PR40   All 3 Benchmarks
1 10,000 0.96 0.92 0.91 0.79 0.95 0.88 0.87 0.73
2 30,000 0.94 0.91 0.92 0.77 0.92 0.87 0.88 0.70
3 30,000 0.94 0.90 0.89 0.75 0.92 0.86 0.85 0.67
4 30,000 0.93 0.89 0.89 0.73 0.91 0.84 0.85 0.64
5 28,500 0.93 0.88 0.90 0.71 0.90 0.83 0.86 0.63
6 14,000 0.90 0.88 0.92 0.71 0.86 0.84 0.88 0.62
7 10,000 0.91 0.90 0.92 0.74 0.87 0.85 0.89 0.65
8 10,000 0.92 0.90 0.93 0.76 0.89 0.87 0.90 0.68
9 1,800 0.91 0.91 0.93 0.76 0.87 0.87 0.91 0.68
10 1,450 0.91 0.92 0.93 0.77 0.88 0.89 0.90 0.70
11 730 0.93 0.92 0.93 0.78 0.90 0.89 0.90 0.71
12 480 0.93 0.92 0.20 0.78 0.91 0.89 0.89 0.72
Overall 166,960 0.93 0.90 0.90 0.74 0.91 0.85 0.87 0.66
Test validity was long described as the degree to which a test measures what
it is intended to measure. A more current description is that a test is valid to
the extent that there are evidentiary data to support specific claims as to what
the test measures, the interpretation of its scores, and the uses for which
it is recommended or applied. Evidence of test validity is often indirect and
incremental, consisting of a variety of data that in the aggregate are consistent
with the theory that the test measures the intended construct(s), or is suitable for
its intended uses and interpretations of its scores. Determining the validity of a test
involves the use of data and other information both internal and external to the test
instrument itself.
Content Validity
One touchstone is content validity, which is the relevance of the test questions
to the attributes or dimensions intended to be measured by the test—namely
reading comprehension, reading vocabulary, and related reading skills, in the case
of the Star Reading assessments. The content of the item bank and the content
balancing specifications that govern the administration of each test together form
the foundation for “content validity” for the Star Reading assessments. These
content validity issues were discussed in detail in “Content and Item Development”
and were an integral part of the test items that are the basis of Star Reading today.
Construct Validity
Construct validity, which is the overarching criterion for evaluating a test, investigates
the extent to which a test measures the construct(s) that it claims to be assessing.
Establishing construct validity involves the use of data and other information external
to the test instrument itself. For example, Star Reading claims to provide an estimate
of a child’s reading comprehension and achievement level. Therefore, demonstration
of Star Reading’s construct validity rests on the evidence that the test provides such
estimates. There are a number of ways to demonstrate this.
For instance, in a study linking Star Reading Version 1 and the Degrees of
Reading Power comprehension assessment, a raw correlation of 0.89 was
observed between the two tests. Adjusting that correlation for attenuation due to
unreliability yielded a corrected correlation of 0.96 between the two assessments,
indicating that the constructs measured by the different tests are essentially
indistinguishable.
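The correction for attenuation referred to here divides the observed correlation by the square root of the product of the two tests' reliabilities. The sketch below shows the arithmetic; the reliability values used are placeholders for illustration, not the values from the study.

def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for measurement error in both tests."""
    return r_observed / (reliability_x * reliability_y) ** 0.5

print(round(disattenuated_correlation(0.89, 0.90, 0.95), 2))  # observed 0.89 -> corrected ~0.96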
Since reading ability varies significantly within and across grade levels and
improves as a student’s grade placement increases, scores within Star Reading
should demonstrate these anticipated internal relationships; in fact, they do.
Additionally, scores for Star Reading should correlate highly with other accepted
procedures and measures that are used to determine reading achievement and
reading comprehension; this is external construct validity. This section deals
with both internal and external evidence of the validity of Star Reading as an
assessment of reading comprehension and reading skills.
Star Reading is an application of item response theory (IRT); each test item’s
difficulty has been calibrated using the Rasch model. One of the assumptions of
the Rasch model is unidimensionality: that a test measures only a single construct
such as reading comprehension in the case of Star Reading. To evaluate whether
Star reading measures a single construct, factor analyses were conducted. Factor
analysis is a statistical technique used to determine the number of dimensions or
constructs that a test measures. Both exploratory and confirmatory factor analyses
were conducted across grades K to 12.
To begin, a large sample of student Star Reading data was assembled. The overall
sample consisted of 286,000 student records. That sample was divided into 2
sub-samples. The first sub-sample, consisting of 26,000 cases, was used for
exploratory factor analysis; the second sub-sample, 260,000 cases, was reserved
for confirmatory factor analyses that followed the initial exploratory analysis.
Within each sub-sample, each student’s 34 Star Reading item responses were
divided into subsets of items aligned to each of the 5 blueprint domains. Tests
administered in grades 4–12 included items from all five domains. Tests given in
grades K–3 included items from just 4 domains; no items measuring analyzing
argument and evaluating text were administered in these grades. For each
student, separate Rasch ability estimates (subtest scores) were calculated from
each domain-specific subset of item responses. A Bayesian sequential procedure
developed by Owen (1969, 1975) was used for the subtest scoring. The number of
items included in each subtest ranged from 2 to 18, following the Star Reading test
blueprints, which specify different numbers of items per domain, depending on the
student’s grade level.
The results of the CFA analyses are summarized in Table 27. As that table
indicates, sample sizes ranged from 18,723 to 20,653; because the chi-square
(Χ2) test is not a reliable test of model fit when sample sizes are large, fit indices
are presented. The comparative fit index (CFI) and Tucker-Lewis Index (TLI) are
shown; for these indices, values are either 1 or very close to 1, indicating strong
evidence of a single construct/dimension. In addition, the root mean square error
of approximation (RMSEA), and the standardized root mean square residual
(SRMR) are presented. RMSEA and SRMR values less than 0.08 indicate good
fit. Cutoffs for the indices are presented in Hu and Bentler, 1999. Overall, the CFA
results strongly support a single underlying construct in Star Reading.
Table 27: Summary of the Goodness-of-Fit of the CFA Models for Star Reading by Grade
The EFA analyses were conducted using the factor analysis procedure in R, while
the CFA analyses were conducted using R with the lavaan package (Rosseel, 2012).
Table 30 and Table 31 present summaries of 300 coefficients of correlation between
Star Reading and other measures administered at points in time at least two months
later than Star Reading; more than 1.45 million students' test scores are represented
in these two tables.
Predictive validity coefficients ranged from 0.69–0.72 in grades 1–6, with an
average of 0.71. In grades 7–12 the predictive validity coefficients ranged from
0.72–0.87 with an average of 0.80.
In general, these correlation coefficients reflect very well on the validity of the Star
Reading test as a tool for placement, achievement and intervention monitoring in
Reading. In fact, the correlations are similar in magnitude to the validity coefficients of
these measures with each other. These validity results, combined with the supporting
evidence of reliability and minimization of SEM estimates for the Star Reading test,
provide a quantitative demonstration of how well this innovative instrument in reading
achievement assessment performs.
Table 28: Concurrent Validity Data: Star Reading Correlations (r) with External Tests Administered
Spring 1999–Spring 2013, Grades 1–6
Summary
Grade(s) All 1 2 3 4 5 6
Number of students 255,538 1,068 3,629 76,942 66,400 54,173 31,686
Number of coefficients 195 10 18 47 41 32
Average validity 0.80 0.73 0.72 0.72 0.74 0.72
Overall average 0.74
Table 29: Concurrent Validity Data: Star Reading Correlations (r) with External Tests Administered
Spring 1999–Spring 2013, Grades 7–12
Summary
Grade(s) All 7 8 9 10 11 12
Number of students 48,789 25,032 21,134 1,774 755 55 39
Number of coefficients 74 30 29 7 5 2 1
Average validity 0.74 0.73 0.65 0.76 0.70 0.73
Overall average 0.72
Table 30: Predictive Validity Data: Star Reading Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 1–6
Summary
Grade(s) All 1 2 3 4 5 6
Number of students 1,227,887 74,887 188,434 313,102 289,571 217,416 144,477
Number of coefficients 194 6 10 49 43 47 39
Average validity 0.69 0.72 0.70 0.71 0.72 0.71
Overall average 0.71
Table 31: Predictive Validity Data: Star Reading Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 7–12
Summary
Grade(s) All 7 8 9 10 11 12
Number of students 224,179 111,143 72,537 9,567 21,172 6,653 3,107
Number of coefficients 106 39 41 8 10 6 2
Average validity 0.72 0.73 0.81 0.81 0.87 0.86
Overall average 0.80
Table 32: Concurrent Validity Data: Star Reading Correlations (r) with State Accountability Tests,
Grades 3–8
Summary
Grade(s) All 3 4 5 6 7 8
Number of students 11,045 2,329 1,997 2,061 1,471 1,987 1,200
Number of coefficients 61 12 13 11 8 10 7
Average validity 0.72 0.73 0.73 0.71 0.74 0.73
Overall average 0.73
Table 33: Predictive Validity Data: Star Reading Scaled Scores Predicting Later Performance for
Grades 3–8 on Numerous State Accountability Tests
Summary
Grade(s) All 3 4 5 6 7 8
Number of students 22,018 4,493 2,974 4,086 3,624 3,655 3,186
Number of coefficients 119 24 19 23 17 17 19
Average validity 0.66 0.68 0.70 0.68 0.69 0.70
Overall average 0.68
Table 34: Concurrent Predictive Validity Data: Star Reading Scaled Scores Predicting Later
Performance for Grades 3–8 on Smarter Balanced Assessment Consortium Test
Star Reading Predictive and Concurrent Correlations with Smarter Balanced Assessment Scores
Grade(s) All 3 4 5 6 7 8
Number of students 3,539 709 690 697 567 459 417
Fall Predictive 0.78 0.78 0.76 0.77 0.79 0.80
Winter Predictive 0.78 0.78 0.79 0.78 0.79 0.81
Spring Concurrent 0.79 0.82 0.80 0.70 0.79 0.81
Table 35: Concurrent and Predictive Validity Data: Star Reading Scaled Scores Correlations for
Grades 3–8 with PARCC Assessment Consortium Test Scores
Star Reading Predictive and Concurrent Correlations with PARCC Assessment Scores
Grade(s) All 3 4 5 6 7 8
Number of students 22,134 1,770 3,950 3,843 4,370 4,236 3,965
Predictive 0.82 0.85 0.82 0.81 0.83 0.80
Concurrent 0.83 0.82 0.78 0.79 0.80 0.77
The average of the concurrent correlations was approximately 0.79 for SBAC
and 0.80 for PARCC. The average predictive correlation was 0.78 for the SBAC
assessments, and 0.82 for PARCC.
The results are displayed in Table 36. The table lists correlations within each
grade, as well as results from combining data from all twelve grades. For each
set of results, the table gives an estimate of the true validity, a standard error, and
the lower and upper limits of a 95 percent confidence interval for the expected
validity coefficient. Using the 789 correlation coefficients, the overall estimate of
the validity of Star Reading is 0.79, with a standard error of 0.001. The 95 percent
confidence interval allows one to conclude that the true validity coefficient for Star
Reading is approximately 0.79. The probability of observing the 789 correlations
reported in Table 28 through Table 35 if the true validity were zero, would be
virtually zero. Because the 789 correlations were obtained with widely different
tests, and among students from twelve different grades, these results provide
strong support for the validity of Star Reading as a measure of reading skills.
Table 36: Results of the Meta-Analysis of Star Reading Correlations with Other Tests
1. Reading Renaissance is a supplemental reading program that uses Star Reading and
Accelerated Reader.
Sadusky and Brem obtained students' Star Reading posttest scores and SAT9
Total Reading scores from each year and calculated correlations between them.
Students' test scores were available for multiple years, spanning grades 2–6. Data
on gender, ethnic group, and Title I eligibility were also collected.
Table 37 displays the observed correlations for the overall group. Table 38 displays
the same correlations, broken out by ethnic group.
Overall correlations by year ranged from 0.66–0.73. Sadusky and Brem concluded
that “Star results can serve as a moderately good predictor of SAT9 performance
in reading.”
Enough Hispanic and white students were identified in the sample to calculate
correlations separately for those two groups. Within each ethnic group, the
correlations were similar in magnitude, as Table 38 shows. This supports the
assertion that Star Reading is valid for multiple student ethnicities.
Table 37: Correlations of the Star Posttest with the SAT9 Total Reading Scores
1998–2002a
Table 38: Correlations of the Star Posttest with the SAT9 Total Reading
Scores, by Ethnic Group, 1998–2002a
Hispanic White
Year Grade N Correlation N Correlation
1998 3–6 7 (n.s.) 0.55 35 0.69
1999 2–6 42 0.64 179 0.75
2000 2–6 67 0.74 287 0.71
2001 2–6 76 0.71 255 0.73
a. All correlations significant, p < 0.001, unless otherwise noted.
Students in 16 schools in England (school Years 2–9, equivalent to US grades
1–8) took both Star Reading and one of three age-appropriate forms of the
Suffolk Reading Scale 2 (SRS2) in the fall of 2006. Scores on the SRS2 included
traditional scores, as well as estimates of the students’ Reading Age (RA), a
scale that is roughly equivalent to the Grade Equivalent (GE) scores used in the
US. Additionally, teachers conducted individual assessments of each student’s
attainment in terms of curriculum levels, a measure of developmental progress
that spans the primary and secondary years in England.
Correlations with all three measures are displayed in Table 39, by grade and
overall. As the table indicates, the overall correlation between Star Reading
and Suffolk Reading Scaled Scores was 0.91, the correlation with Reading Age
was 0.91, and the correlation with teacher assessments was 0.85. Within-form
correlations with the SRS ability estimate ranged from 0.78–0.88, with a median
correlation of 0.84, and ranged from 0.78–0.90 on Reading Age, with a median of
0.85.
Table 39: Correlations of Star Reading with Scores on the Suffolk Reading
Scale and Teacher Assessments in a Study of 16 Schools in England
                                     Suffolk Reading Scale                    Teacher Assessments
School Years a    Test Form      N       SRS Score b    Reading Age       N       Assessment Levels
2–3 SRS1A 713 0.84 0.85 n/a n/a
4–6 SRS2A 1,255 0.88 0.90 n/a n/a
7–9 SRS3A 926 0.78 0.78 n/a n/a
Overall 2,694 0.91 0.91 2,324 0.85
a. UK school year values are 1 greater than the corresponding US school grade. Thus, Year 2
corresponds to Grade 1, etc.
b. Correlations with the individual SRS forms were calculated with within-form raw scores. The
overall correlation was calculated with a vertical Scaled Score.
Star Reading and DRP test score data were obtained on 273 students at grade 3,
424 students at grade 5, 353 students at grade 7, and 314 students at grade 10.
Item-level factor analysis of the combined Star and DRP response data indicated
that the tests were essentially measuring the same construct at each of the four
grades. Eigenvalues from the factor analysis of the tetrachoric correlation matrices
tended to verify the presence of an essentially unidimensional construct. In
general, the eigenvalue associated with the first factor was very large in relation to
the eigenvalue associated with the second factor. Overall, these results confirmed
the essential unidimensionality of the combined Star Reading and DRP data.
Since DRP is an acknowledged measure of reading comprehension, the factor
analysis data support the claim that Star Reading likewise measures reading
comprehension.
Subsequent to the factor analysis, the Star Reading item difficulty parameters
were transformed to the DRP difficulty scale, so that scores on both tests could
be expressed on a common scale. Star Reading scores on that scale were then
calculated using the methods of Item Response Theory. Table 40 below shows
the correlations between Star Reading and DRP reading comprehension scores
overall and by grade.
Table 40: Correlations between Star Reading and DRP Test Scores, Overall
and by Grade
                               Test Form                       Number of Items
Grade    Sample Size     Star Calibration      DRP          Star      DRP      Correlation
3 273 321 H-9 44 42 0.84
5 424 511 H-7 44 70 0.80
7 353 623 H-6 44 70 0.76
10 314 701 H-2 44 70 0.86
Overall 1,364 0.89
In summary, using item factor analysis Yoes (1999) showed that Star
Reading items measure the same underlying construct as the DRP: reading
comprehension. The overall correlation of 0.89 between the DRP and Star
Reading test scores corroborates that. Furthermore, correcting that correlation
coefficient for the effects of less than perfect reliability yields a corrected
correlation of 0.96. Thus, both at the item level and at the test score level, Star
Reading was shown to measure essentially the same construct as the DRP.
The 32 schools in the sample came from 9 states: Alabama, Arizona, California,
Colorado, Delaware, Illinois, Michigan, Tennessee, and Texas. This represented a
broad range of geographic areas, and resulted in a large number of students (N =
12,220). The distribution of students by grade was as follows:
X 1st grade: 2,001
X 2nd grade: 4,522
X 3rd grade: 3,859
X 4th grade: 1,838
Students were individually assessed using the DORF (DIBELS Oral Reading
Fluency) benchmark passages. The students read the three benchmark passages
under standardized conditions. The raw score for each passage was computed as
the number of words read correctly within the one-minute limit (words correctly
read per minute, WCPM). The final score for each student was the median WCPM
across the three benchmark passages, and was the score used for
analysis. Each student also took a Star Reading assessment within two weeks of
the DORF assessment.
Descriptive statistics for each grade in the study on Star Reading Scaled Scores and
DORF WCPM are found in Table 41.
Correlations between the Star Reading Scaled Score and DORF WCPM at
all grades were significant (p < 0.01) and diminished consistently as grade level
increased.
Figure 5: Scatterplot of Observed DORF WCPM and SR Scale Scores for Each
Grade with the Grade Specific Linking Function Overlaid
Table 42: Descriptive Statistics and Correlations between Star Reading Scale
Scores and DIBELS Oral Reading Fluency for the Cross-Validation
Sample
                    Star Reading Scale Score          DORF WCPM
Grade      N        Mean           SD                 Mean        SD
1 205 179.31 100.79 45.61 26.75
2 438 270.04 121.67 71.18 33.02
3 362 357.95 141.28 86.26 33.44
4 190 454.04 143.26 102.37 32.74
Table 43: Correlation between Observed WCPM and Estimated WCPM Along
with the Mean and Standard Deviation of the Differences between
Them
Table 44: Classification diagnostics for predicting students’ reading proficiency on the PARCC
consortium assessment from earlier Star Reading scores
Grade
Measure 3 4 5 6 7 8
Overall classification accuracy 86% 87% 86% 86% 86% 83%
Sensitivity 64% 73% 73% 69% 73% 70%
Specificity 93% 93% 90% 91% 91% 89%
Observed proficiency rate (OPR) 26% 29% 27% 24% 28% 29%
Projected proficiency rate (PPR) 22% 26% 26% 23% 27% 28%
Proficiency status projection error –5% –3% 0% –1% –1% –1%
Area under the ROC curve 0.91 0.93 0.91 0.92 0.92 0.90
Numerous other reports of linkages between Star Reading and state accountability
tests have been conducted. Reports are available at
[Link]
2. Renaissance Learning (2016). Relating Star Reading™ and Star Math™ to the Colorado
Measure of Academic Success (CMAS) (PARCC Assessments) performance.
Initial Star Reading classification analyses were performed using state assessment
data from Arkansas, Delaware, Illinois, Michigan, Mississippi, and Kansas.
Collectively these states cover most regions of the country (Central, Southwest,
Northeast, Midwest, and Southeast). Both the Classification Accuracy and Cross
Validation study samples were drawn from an initial pool of 79,045 matched
student records covering grades 2–11.
A secondary analysis using data from a single state assessment was then
performed. The sample used for this analysis was 42,771 matched Star Reading
and South Dakota Test of Education Progress records of students in grades 3–8.
An ROC analysis was used to compare the performance data on Star Reading
to performance data on state achievement tests, with “at risk” identification as
the criterion. The Star Reading Scaled Scores used for analysis originated from
assessments 3–11 months before the state achievement tests were administered.
Selection of cut scores was based on the graph of sensitivity and specificity versus
the Scaled Score. For each grade, the Scaled Score chosen as the cut point was
equal to the score where sensitivity and specificity intersected. The classification
analyses, cut points and outcome measures are outlined in Table 45. Area Under
the Curve (AUC) values were all greater than 0.80. Descriptive notes for other
values represented in the table are provided in the table footnote.
3. For descriptions of ROC curves, AUC, and related classification accuracy statistics,
refer to Pepe, Janes, Longton, Leisenring, & Newcomb (2004) and Zhou, Obuchowski, &
McClish (2002).
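The sketch below illustrates this kind of ROC analysis on simulated data: sensitivity and specificity are computed across candidate cut points, and the cut score is taken where the two curves intersect. The use of scikit-learn's roc_curve, the score distributions, and the at-risk base rate are assumptions made for this illustration only.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
n = 5_000
at_risk = rng.random(n) < 0.25                                  # "at risk" on the state test
star_scores = np.where(at_risk, rng.normal(350, 120, n), rng.normal(550, 130, n))

# roc_curve treats higher scores as more "positive", so feed it the negated score
# (lower Star Reading scores indicate greater risk).
fpr, tpr, thresholds = roc_curve(at_risk, -star_scores)
sensitivity, specificity = tpr, 1 - fpr
cut_index = np.argmin(np.abs(sensitivity - specificity))        # point where the curves cross
print("cut score:", round(-thresholds[cut_index]),
      "AUC:", round(roc_auc_score(at_risk, -star_scores), 2))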
Assessment data are central to both screening and progress monitoring, and
Star Reading is widely used for both purposes. This chapter includes technical
information about Star Reading’s ability to accurately screen students according
to risk and to help educators make progress monitoring decisions. Much of this
information has been submitted to and reviewed by the Center on Response
to Intervention [Link] and/or the National Center on Intensive
Intervention [Link] two technical assistance groups
funded by the US Department of Education.
For several years running, Star Reading has enjoyed favorable technical reviews
for its use in informing screening and progress monitoring decisions by the CRTI
and NCII, respectively. The most recent reviews by CRTI indicate that Star
Reading has a “convincing” level of evidence (the highest rating awarded) in the
core screening categories, including classification accuracy, reliability, and validity.
CRTI also notes that the extent of the technical evidence is “Broad” (again, the
highest rating awarded) and notes that not only is the overall evidence compelling,
but there are disaggregated data as well showing that Star Reading works equally
well across subgroups. The most recent reviews by NCII indicate that there is
full “convincing” evidence of Star Reading’s psychometric quality for progress
monitoring purposes, including reliability, validity, reliability of the slope, and
validity of the slope. Furthermore, they find fully “convincing” evidence that Star
Reading is sufficiently sensitive to student growth, has adequate alternate forms,
and provides data-based guidance to educators on end-of-year benchmarks and
when an intervention should be changed, among other categories. Readers may
find additional information on Star Reading on those sites and should note that the
reviews are updated on a regular basis, as review standards are adjusted and
new technical evidence for Star Reading and other assessments is evaluated.
Screening
According to the Center on Response to Intervention, “Screening is conducted
to identify or predict students who may be at risk for poor learning outcomes.
Universal screening assessments are typically brief, conducted with all students at
a grade level, and followed by additional testing or short-term progress monitoring
to corroborate students’ risk status.”4
Most commonly, screening is conducted with all students at the beginning of the
year and then another two to four times throughout the school year. Star Reading
is widely used for this purpose. In this section, the technical evidence supporting
its use to inform screening decisions is summarized.
4. [Link]
1. Validity and reliability. Data on Star Reading’s reliability were presented in the
“Reliability and Measurement Precision” chapter of this manual. A wide array
of validity evidence has been presented in this chapter, above; detailed tables
of correlational data can be found in “Appendix B: Detailed Evidence of Star
Reading Validity”.
5. [Link]
Each state summative assessment reports student achievement in terms of
performance levels, and the cut point used in these analyses refers to the level the
state has determined indicates meeting grade level reading or English Language
Arts standards. For instance, the cut point on California’s CAASPP is Level 3,
also known as “Standard Met.” On Louisiana’s LEAP 2025 the cut point is at the
“Mastery” level. In the case of ACT and SAT, the cut point established by the
developers (ACT and College Board, respectively) indicates an estimated level of
readiness for success in college.
Table 47: Summary of classification accuracy metrics from recent studies linking Star Reading with
summative reading and English Language Arts measures
Note that many states tend not to use the same assessment system for more than
a few consecutive years, and Renaissance endeavors to keep the Star Reading
classification reporting as up to date as possible. Those interested in reviewing the
full technical reports for these or other state assessments are encouraged to visit
the Renaissance website.6
6. [Link]
Progress Monitoring
According to the National Center on Intensive Intervention, “progress monitoring is
used to assess a student’s performance, to quantify his or her rate of improvement
or responsiveness to intervention, to adjust the student’s instructional program
to make it more effective and suited to the student’s needs, and to evaluate the
effectiveness of the intervention.”7
The RTI Action Network, National Center on Intensive Intervention, and other
organizations offering technical assistance to schools implementing RTI/MTSS
models are generally consistent in encouraging educators to select assessments
for progress monitoring that have certain characteristics.
7. [Link]
8. Shapiro, E. S. (2008). Best practices in setting progress-monitoring goals for academic skill
improvement. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology V (pp.
141-157). Bethesda, MD: National Association of School Psychologists.
Grade n Coefficient
1 14,179 0.76
2 43,978 0.93
3 52,670 0.94
4 37,862 0.93
5 31,326 0.93
6 16,990 0.94
7 9,683 0.94
8 7,786 0.94
9 2,483 0.94
10 1,549 0.94
11 799 0.94
12 384 0.95
There are many different methods that one can use to investigate items for DIF,
including item response theory methods, observed score methods, and a variety
of nonparametric approaches (Zumbo, 2007). The method that Star Reading
uses to evaluate items for DIF is a method known as logistic regression (Rogers
& Swaminathan, 1993; Swaminathan & Rogers, 1990; Swaminathan, 1994). With
this approach, student item responses are regressed on student ability estimates
from Star Reading as well as their subgroup membership and the student ability
and subgroup membership interaction. To conduct a DIF analysis, a reference
group and a focal group are defined. For instance, male is the reference group for
gender while female is the focal group. Similarly, Caucasian is the reference group
for race/ethnicity with the minority race/ethnic groups being focal groups. Separate
models are run for DIF for male versus female, black versus white, Hispanic
versus white, Asian versus white, and Native American versus white.
Items are flagged for DIF using a blended approach that employs a chi-square
test of statistical significance to determine whether DIF is present and then
assesses whether any evidence of DIF is practically significant using the Nagelkerke R2
statistic (Nagelkerke, 1991), a common effect size measure used in DIF investigations with
logistic regression (Jodoin & Gierl, 2001). Using the Nagelkerke R2 statistic, items
are categorized as exhibiting negligible DIF if the null hypothesis is not rejected or
the R2 statistic is less than 0.035, moderate DIF if the null hypothesis is rejected
and the R2 statistic is greater than or equal to 0.035 and less than 0.070, or large
DIF if the null hypothesis is rejected and the R2 statistic is greater than or equal to
0.070 (Jodoin & Gierl, 2001).
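The sketch below illustrates this flagging logic for a single item: a logistic regression with ability only is compared to one that adds group membership and the ability-by-group interaction, DIF is tested with the resulting chi-square statistic, and the effect size is the change in Nagelkerke R2 between the two models. The use of statsmodels, the simulated data, and the 0.05 significance level are assumptions made for this illustration.

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def nagelkerke_r2(model, n):
    cox_snell = 1 - np.exp(2 * (model.llnull - model.llf) / n)
    return cox_snell / (1 - np.exp(2 * model.llnull / n))

def dif_screen(item_score, ability, focal_group):
    n = len(item_score)
    X1 = sm.add_constant(np.column_stack([ability]))                              # ability only
    X2 = sm.add_constant(np.column_stack([ability, focal_group, ability * focal_group]))
    m1 = sm.Logit(item_score, X1).fit(disp=0)
    m2 = sm.Logit(item_score, X2).fit(disp=0)
    lr_stat = 2 * (m2.llf - m1.llf)                  # chi-square test for uniform + nonuniform DIF
    p_value = chi2.sf(lr_stat, df=2)
    delta_r2 = nagelkerke_r2(m2, n) - nagelkerke_r2(m1, n)
    if p_value >= 0.05 or delta_r2 < 0.035:
        return "negligible DIF"
    return "moderate DIF" if delta_r2 < 0.070 else "large DIF"

rng = np.random.default_rng(5)
theta = rng.normal(size=3_000)
group = rng.integers(0, 2, 3_000)                    # 0 = reference group, 1 = focal group
y = (rng.random(3_000) < 1 / (1 + np.exp(-(theta - 0.2)))).astype(int)
print(dif_screen(y, theta, group))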
There are a couple of points in the Star Reading assessment development cycle
when items are evaluated for DIF. The first time point is when an item is included
as a field test item as part of Star Reading’s item calibration process. During
item calibration, new assessment items are tried out with different groups of
students to make sure that items have appropriate statistical and psychometric
properties before they are used operationally and count towards a student’s
score. The second time point is when the full item bank of operational test
items is recalibrated for scale maintenance to check whether the statistical and
psychometric properties of the items have remained similar after the items become
operational.
It is important to point out that just because an item is flagged for DIF against one
or more subgroups does not necessarily mean that the item is biased. There are
many possible explanations why an item may be statistically flagged for DIF. All
items that are statistically flagged as having non-negligible DIF are marked for a
bias and sensitivity review by the Content team. This review process consists of
several subject matter experts with diverse perspectives and backgrounds
reviewing each item to see if there is any content in the item that
may be biased against a particular subgroup and might explain why the item was
statistically flagged for DIF. Items identified as being biased for any reason are
removed from the item bank and do not appear on the Star Reading test. The
statistical flagging of items for DIF as well as the bias and sensitivity review by the
Content team helps ensure test fairness and that the items that appear on Star
Reading do not favor any group of students that may take the test.
As shown in Figure 6 only 2% of about 6,000 items in the Star Reading item bank
showed any evidence of DIF when Star Reading was recalibrated in 2021.
Figure 6: Summary of Star Reading Items with DIF
Table 49 shows the DIF results by reference and focal groups from the various
DIF analyses. These results suggest that, of the thousands of items analyzed, very
few items were flagged for DIF: 1.00% of items were categorized with non-negligible
DIF for male versus female, 0.04% for Asian versus white, 0.34% for black versus
white, 0.94% for Hispanic versus white, and 0.00% for Native
American versus white. These results provide evidence of the fairness of the Star
Reading test for different demographic subgroups that take the assessment. As
previously noted, any items that show DIF are removed from operational use.
International data from the UK show even stronger correlations between Star
Reading and widely used reading measures there: overall correlations of 0.91
with the Suffolk Reading Scale, median within-form correlations of 0.84, and a
correlation of 0.85 with teacher assessments of student reading.
Finally, the data showing the relationship between the current, standards-based
Star Reading Enterprise test and scores on specific state accountability tests
and on the SBAC and PARCC Common Core consortium tests show that the
correlations with these summative measures are consistent with the meta-analysis
findings.
Two distinct kinds of norms are described in this chapter: test score norms and
growth norms. The former refers to distributions of test scores themselves. The
latter refers to distributions of changes in test scores over time; such changes are
generally attributed to growth in the attribute that is measured by a test. Hence
distributions of score changes over time may be called “growth norms.”
Background
National norms for Star Reading version 1 were first collected in 1996. Substantial
changes introduced in Star Reading version 2 necessitated the development of
new norms in 1999. Those norms were used until new norms were developed
in 2008. Since 2008, Star Reading norms have been updated four times (2014,
2017, 2022, and 2024). The 2024 norms went live in Star Reading in the 2024–
2025 school year. This chapter describes the development of the 2024 norms.
From 1996 through mid-2011, Star Reading was primarily a measure of reading
comprehension comprising short vocabulary-in-context items and longer passage
comprehension items. The current version of Star Reading, introduced in June
2011, is a standards-based assessment that measures a wide variety of skills and
instructional standards, as well as reading comprehension. The 2024 norms are
based on the current standards-based version of Star Reading.
New U.S. norms for Star Reading assessments were introduced at the start of the
2017–18 school year. Separate early fall and late spring norms were developed for
grades Kindergarten through 12. Before the introduction of the 2017 Star Reading
norms, the reference populations for grades Kindergarten through 3 consisted
only of students taking Star Reading; students who only took Star Early Literacy
were excluded from the Star Reading norms, and vice versa. Consequently,
previous Star Reading and Star Early Literacy norms for this grade range were
not completely representative of the full range of literacy development in those
grades. To address this, the concept of “Star Early Learning” was introduced.
The norms introduced in 2024 are based on test scores of K–12 students taking
Star Reading, Star Early Literacy, or both. These norms
are based on the use of the Unified scale, which allowed performance on both Star
Early Literacy and Star Reading to be measured on the same scale. Norms for
students in pre-K are based on students taking Star Early Literacy in this grade.
Pre-K norms are not available for Star Reading because students do not typically
take Star Reading in this grade.
The 2024 Star Reading norms are based on Star Reading and Star Early Literacy
test data collected over the course of the 2022–2023 school year. Students
participating in the norming study took assessments between August 1, 2022,
and June 30, 2023. Students took the Star Reading or Early Literacy tests under
normal test administration conditions. No specific norming test was developed
and no deviations were made from the usual test administration. Thus, students
in the norming sample took Star Reading or Star Early Literacy tests as they are
administered in everyday use.
Sample Characteristics
During the norming period, a total of 1,837,495 US students in grades Pre-K–3
took Star Early Literacy while 8,878,035 US students in grades K–12 took Star
Reading tests, using Renaissance servers hosted by Renaissance Learning. The
first step in sampling was to select representative fall and spring student samples:
students who had tested in the fall, in the spring, or in both the fall and spring of
the 2022–2023 school year.
Because these norming data were convenience samples drawn from the Star
Reading and Star Early Literacy customer base, steps were taken to ensure
the resulting norms were nationally representative of the US K–12 student
population with regard to certain important characteristics. A post-stratification
procedure was used to adjust the sample’s proportions to the approximate national
proportions on three key variables: geographic region, district socio-economic
status, and district/school size. These three variables were chosen because they
had previously been used in Star norming studies to draw nationally representative
samples, are known to be related to test scores, and were readily available for the
schools in the Renaissance-hosted database.
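As a minimal illustration of post-stratification (not the operational weighting
procedure), the sketch below adjusts a convenience sample so that its joint
distribution over the three stratification variables matches approximate national
proportions. The column names are hypothetical.

    import pandas as pd

    def poststratify(sample: pd.DataFrame, national: pd.DataFrame) -> pd.DataFrame:
        """sample: one row per student with hypothetical columns 'region', 'ses',
        and 'size'; national: one row per cell with the same columns plus the
        approximate national proportion 'nat_prop'."""
        cells = ['region', 'ses', 'size']
        samp_prop = (sample.groupby(cells).size() / len(sample)).rename('samp_prop')
        out = sample.merge(samp_prop.reset_index(), on=cells).merge(national, on=cells)
        # Each student is weighted by (national share of the cell) / (sample share)
        out['weight'] = out['nat_prop'] / out['samp_prop']
        return out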
The final norming sample size for grades K–12, after selecting only students with
test scores in the fall, the spring, or both during the norming year, was
6,405,124 students. There were 4,687,579 students in the fall norming sample
and 4,282,841 students in the spring norming sample; 2,565,296 students were
included in both norming samples. These students came from 22,978 schools
across the 50 states and the District of Columbia.
Table 50: Numbers of Students per Grade in the Fall Norms Sample
Table 51: Numbers of Students per Grade in the Spring Norms Sample
Geographic region
Using the categories established by the National Center for Education Statistics
(NCES), students were grouped into four geographic regions as defined below:
Northeast, Southeast, Midwest, and West.
School size
Based on total school enrollment, schools were classified into one of three school
size groups: small schools had fewer than 200 students enrolled, medium schools
had 200–499 students enrolled, and large schools had 500 or more students
enrolled.
No students were sampled from schools that did not report the percentage of
students eligible for free and reduced-price lunch.
The norming sample also included private schools, Catholic schools, students with
disabilities, and English Language Learners.
Table 52: Sample Characteristics Along with National Population Estimates and Sample Estimates
Table 53: Student Demographics and School Information: National Estimates and Sample
Percentages
                                         National      Fall Norming    Spring Norming
                                         Estimate      Sample          Sample
Gender           Public      Female      48.6%         49.5%           49.5%
                             Male        51.4%         50.5%           50.5%
                 Non-Public  Female      –             50.5%           50.8%
                             Male        –             49.5%           49.2%
Race/Ethnicity   Public      American Indian   0.9%    2.0%            2.1%
                             Asian       5.8%          6.6%            6.3%
                             Black       15.1%         14.6%           14.1%
                             Hispanic    28.7%         34.1%           34.6%
                             White       44.6%         39.7%           39.9%
                             Multiple Race(a)   5.0%   3.0%            3.0%
                 Non-Public  American Indian   0.6%    0.7%            0.6%
                             Asian       7.3%          7.6%            7.8%
                             Black       9.5%          8.2%            7.6%
                             Hispanic    12.1%         11.5%           26.4%
                             White       65.1%         66.5%           51.4%
                             Multiple Race(a)   5.3%   5.4%            6.2%
a. Students identified as belonging to two or more races.
Test Administration
All students took the current versions of the Star Reading or Star Early Literacy
tests under normal administration procedures. Some students in the normative sample took
the assessment two or more times within the norming windows; scores from their
initial test administration in the fall and the last test administration in the spring
were used for computing the norms.
Data Analysis
Student test records were compiled from the complete database of Star Reading
and Early Literacy Renaissance Place users. Data were from a single school year
from August 2022 to June 2023. Students’ Unified scale Rasch scores on their
first Star Reading or Early Literacy test taken during the first month of the school
year based on grade placement were used to compute norms for the fall; students’
Rasch scores on the last Star Reading or Early Literacy test taken during the 8th
and 9th months of the school year were used to compute norms for the spring.
Interpolation was used to estimate norms for times of the year between the first
month in the fall and the last month in the spring. The norms were based on the
distribution of Rasch scores for each grade.
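A simplified sketch of these selection rules is shown below; the field names are
hypothetical, and the operational norming code is more involved (for example, in
its handling of sample weights and grade placement).

    import pandas as pd

    def fall_spring_scores(tests: pd.DataFrame):
        """tests: one row per test with hypothetical columns student_id, date,
        school_month (1-10), and rasch (Unified-scale Rasch score)."""
        fall = (tests[tests.school_month == 1]              # first month of the school year
                .sort_values('date')
                .groupby('student_id', as_index=False).first())
        spring = (tests[tests.school_month.isin([8, 9])]    # 8th and 9th months
                  .sort_values('date')
                  .groupby('student_id', as_index=False).last())
        # Fall and spring norms are then derived from the grade-level distributions
        # of fall['rasch'] and spring['rasch'], with interpolation for the months
        # in between.
        return fall, spring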
Table 54: Descriptive Statistics for Weighted Scaled Scores by Grade for the Norming Sample in the
Unified Scale
1. As part of the development of the Star Early Learning Unified scale, Star Early Literacy
Rasch scores were equated to the Star Reading Rasch scale. This resulted in a downward
extension of the latter scale that encompasses the full range of both Star Early Literacy and
Reading performance. This extended Rasch scale was employed to put all students’ scores
on the same scale for purposes of norms development.
Growth Norms
Student achievement typically is thought of in terms of status: a student’s
performance at one point in time. However, this ignores important information
about a student’s learning trajectory—how much students are growing over a
period of time. When educators are able to consider growth information—the
amount or rate of change over time—alongside current status, a richer picture
of the student emerges, empowering educators to make better instructional
decisions.
The Star Reading SGP package also produces a range of future growth estimates.
Those estimates are largely hidden from users but surface in goal-setting and
related applications to help users understand what typical or expected growth
looks like for a given student. They are particularly useful for setting future goals
and understanding the likelihood of reaching future benchmarks, such as likely
achievement of proficient on an upcoming state summative assessment.
At present, the Star Reading SGP growth norms are based on a sample of
23,376,700 matched student records from the 2016–2017, 2017–2018, and 2018–
2019 school years across grades 1–12. The sample included 9,778,703 unique
students across all three school years. Table 55 below provides a summary of the
number of students and tests that were used when computing the SGP growth
norms.
Table 55: Number of Students and Number of Tests Used in Computing SGP
Growth Norms
Grade    Students     Tests
1        761,260      1,691,537
2        1,142,750    2,903,543
3        1,234,147    3,132,872
4        1,247,517    3,137,316
5        1,260,547    3,127,739
6        1,060,365    2,574,598
7        906,934      2,151,011
8        875,682      2,035,964
9        466,481      994,943
10       380,663      792,217
11       261,521      512,893
12       180,836      322,067
Total    9,778,703    23,376,700
This chapter enumerates all of the scores reported by Star Reading, including
scaled scores, norm-referenced, and criterion-referenced scores.
A Grade Equivalent (GE) score of 3.4, for example, indicates that the student
obtained a Scaled Score as high as that of the average third-grade student who is
40% through the school year.
GE scores are often misinterpreted as though they convey information about what
a student knows or can do—that is, as if they were criterion-referenced scores.
To the contrary, GE scores are norm-referenced. Star Reading Grade Equivalents
range from 0.0–12.9+. The minimum reported GE score is 0.0, shown as either 0
or < K; performance that would fall below 0.0 is reported as < K.
The scale divides the academic year into 10 equal segments and is expressed as
a decimal with the integer denoting the grade level and the tenths value indicating
10% segments through the school year. For example, if a student obtained a GE
of 4.6 on a Star Reading assessment, this would suggest that the student was
performing similarly to the average fourth-grade student who is 60% through
the school year. Because Star Reading norms are based on fall and spring score
data only, the 10% incremental GE scores are derived through interpolation by
fitting a curve to the grade-by-grade medians. Table 56 on page 118 contains the
Scaled Score to GE conversions.
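As an illustration only, the sketch below interpolates a GE from a set of
grade-by-grade median Scaled Scores. The medians shown are placeholders, not the
published norms, and the operational procedure fits a smooth curve rather than
straight line segments.

    import numpy as np

    # Placeholder fall/spring median Unified Scaled Scores keyed by grade placement
    ge_points = [(1.0, 780), (1.8, 840), (2.0, 855), (2.8, 910), (3.0, 925), (3.8, 980)]
    gps, medians = zip(*ge_points)

    def grade_equivalent(scaled_score: float) -> float:
        """Interpolate a GE along the grade-by-grade median curve (medians must
        be increasing for np.interp)."""
        return round(float(np.interp(scaled_score, medians, gps)), 1)

    print(grade_equivalent(930))  # about 3.1 under these placeholder medians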
Because the Star Reading test adapts to the reading level of the student being tested,
Star Reading GE scores are more consistently accurate across the achievement
spectrum than those provided by conventional test instruments. Grade Equivalent
scores obtained using conventional (non-adaptive) test instruments are less accurate
when a student’s grade placement and GE score differ markedly. It is not uncommon
for a fourth-grade student to obtain a GE score of 8.9 when using a conventional test
instrument. However, this does not necessarily mean that the student is performing at
a level typical of an end-of-year eighth grader; more likely, it means that the student
answered all, or nearly all, of the items correctly and thus performed beyond the range
of the fourth-grade test.
The Est. ORF score was computed using the results of a large-scale research
study investigating the linkage between the Star Reading scores and estimates of
oral reading fluency on a range of passages with grade-level-appropriate difficulty.
An equipercentile linking was done between Star Reading scores and oral reading
fluency, providing an estimate of the oral reading fluency for each Scaled Score
unit in Star Reading for grades 1–4 independently. Results of the analysis can be
found in “Additional Validation Evidence for Star Reading” on page 74. A table of
selected Star Reading Scaled Scores and corresponding Estimated ORF values
can be found in “Appendix B: Detailed Evidence of Star Reading Validity” on page
136.
When Star Reading software determines that a student can answer 80 percent
or more of the grade 13 items in the Star Reading test correctly, the student is
assigned an IRL of Post-High School (PHS). This is the highest IRL that anyone
can obtain when taking the Star Reading test.
For example, if a student (regardless of actual grade placement) receives a Star Reading IRL of 4.0, this indicates that the student can
most likely learn without experiencing too many difficulties when using materials
written to be on a fourth-grade level.
The IRL is estimated based on the student’s pattern of responses to the Star Reading
items. A given student’s IRL is the highest grade level of items at which it is estimated
that the student can correctly answer at least 80 percent of the items.
In effect, the IRL references each student’s Star Reading performance to the difficulty
of written material appropriate for instruction. This is a valuable piece of information in
planning the instructional program for individuals or groups of students.
In general, IRLs and GEs will differ. These differences are caused by the fact
that the two score metrics are designed to provide different information. That is,
IRLs estimate the level of text that a student can read with some instructional
assistance; GEs express a student’s performance in terms of the grade level for
which that performance is typical. Usually, a student’s GE score will be higher than
the IRL.
Percentile Ranks (PRs) simply indicate how a student performed compared to the
others who took Star Reading tests as part of the national norming program. The
range of Percentile Ranks is 1–99.
The Percentile Rank scale is not an equal-interval scale. For example, for a
student with a grade placement of 1.7, a Unified Scaled Score of 896 corresponds
to a PR of 80, and a Unified Scaled Score of 931 corresponds to a PR of 90. Thus,
a difference of 35 Scaled Score points represents a 10-point difference in PR.
However, for students at the same 1.7 grade placement, a Unified Scaled Score
of 836 corresponds to a PR of 50, and a Unified Scaled Score of 853 corresponds
to a PR of 60. While there is now only a 17-point difference in Scaled Scores,
there is still a 10-point difference in PR. For this reason, PR scores should not
be averaged or otherwise algebraically manipulated. NCE scores are much more
appropriate for these activities.
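Because the PR scale is not equal-interval, scores intended for averaging are
converted to NCEs first. The standard transformation sets NCE = 21.06z + 50,
where z is the normal deviate corresponding to the PR; a minimal sketch:

    from statistics import NormalDist

    def pr_to_nce(pr: float) -> float:
        """Convert a Percentile Rank (1-99) to a Normal Curve Equivalent."""
        z = NormalDist().inv_cdf(pr / 100.0)
        return 50.0 + 21.06 * z

    # PRs of 50, 80, and 90 correspond to NCEs of about 50.0, 67.7, and 77.0,
    # which is why equal PR differences do not represent equal score differences.
    print([round(pr_to_nce(p), 1) for p in (50, 80, 90)])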
Table 57 on page 122 and Table 59 contain an abridged version
of the Scaled Score to Percentile Rank conversion table that the Star Reading
software uses. The actual table includes data for all of the monthly grade
placement values from 0.0–12.9.
This table can be used to estimate PR values for tests that were taken when the
grade placement value of a student was incorrect (see “Types of Test Scores” on
page 105 for more information). If the error is caught right away, one always has
the option of correcting the grade placement for the student and then having the
student retest.
An additional table provides conversions from NCE to PR; the NCE values are given
as a range of scores that convert to the corresponding PR value.
SGPs are often used to indicate whether a student’s growth is more or less than
can be expected. For example, without an SGP, a teacher would not know if a
Scaled Score increase of 100 points represents good, not-so-good, or average
growth. This is because students of differing achievement levels in different grades
grow at different rates relative to the Star Reading scale. For example, a high-
achieving second-grader grows at a different rate than a low-achieving second-
grader. Similarly, a high-achieving second-grader grows at a different rate than a
high-achieving eighth-grader.
1. We collect data for our growth norms during three different time periods: fall, winter, and
spring. More information about these time periods is provided on page 113.
SGP is calculated for students who have taken at least two tests (a current test
and a prior test) within at least two different testing windows (Fall, Winter, or
Spring).
If a student has taken more than one test in a single test window, the SGP
calculation is based on the following tests (a brief sketch of these selection
rules follows the test-window summary below):
X The current test is always the last test taken in a testing window.
X The test used as the prior test depends on what testing window it falls in:
X Fall window: The first test taken in the Fall window is used.
X Winter window: The test taken closest to January 15 in the Winter window
is used.
X Spring window: The last test taken in the Spring window is used.
Window combinations used for SGP calculation*:
X Fall–Winter, Winter–Spring, Spring–Fall, Spring–Spring, Fall–Fall, Fall–Spring
X Prior school year: Fall–Winter, Winter–Spring, Spring–Fall, Spring–Spring, Fall–Fall

* Test window dates are fixed and may not correspond to the beginning/ending dates of your school year. Students will only have SGPs calculated if they have
taken at least two tests, and the date of the most recent test must be within the past 18 months.

If more than one test was taken in a prior test window, the test used to calculate SGP is:

Test Window       Test Used as the Prior Test
Fall Window       First test taken
Winter Window     Test taken closest to January 15
Spring Window     Last test taken
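A minimal sketch of these selection rules follows; the field names are
hypothetical, and this is not the operational SGP code.

    from datetime import date

    def select_current(tests_in_window):
        """The current test is always the last test taken in the window."""
        return max(tests_in_window, key=lambda t: t['date'])

    def select_prior(tests_in_window, window, jan15):
        """window: 'Fall', 'Winter', or 'Spring'; jan15: January 15 of the
        relevant school year, used only for the Winter rule."""
        if window == 'Fall':
            return min(tests_in_window, key=lambda t: t['date'])      # first test taken
        if window == 'Winter':
            return min(tests_in_window, key=lambda t: abs((t['date'] - jan15).days))
        return max(tests_in_window, key=lambda t: t['date'])          # Spring: last test taken

    winter_tests = [{'date': date(2023, 1, 5)}, {'date': date(2023, 1, 20)}]
    print(select_prior(winter_tests, 'Winter', date(2023, 1, 15)))    # picks the Jan 20 test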
Lexile® Measures
In cooperation with MetaMetrics®, since August 2014, users of Star Reading
have had the option of including Lexile measures on certain Star Reading score
reports. Reported Lexile measures will range from BR400L to 1825L. (The “L”
suffix identifies the score as a Lexile measure. Where it appears, the “BR” prefix
indicates a score that is below 0 on the Lexile scale; such scores are typical of
beginning readers.)
The ability to read and comprehend written text is important for academic success.
Students may, however, benefit most from reading materials that match their
reading ability/achievement: reading materials that are neither too easy nor too
hard so as to maximize learning. To facilitate students’ choices of appropriate
reading materials, measures commonly referred to as readability measures are
used in conjunction with students’ reading achievement measures.
Matching a student with text/books that target a student’s interest and level of
reading achievement is a two-step process: first, a student’s reading achievement
score is obtained by administering a standardized reading achievement test;
second, the reading achievement score serves as an entry point into the
readability measure to determine the difficulty level of text/books that would best
support independent reading for the student. Optimally, a readability measure
should match students with books that they are able to read and comprehend
independently without boredom or frustration: books that are engaging yet slightly
challenging to students based on the students’ reading achievement and grade
level.
In some cases, a range of text/book reading difficulty in which a student can read
independently or with minimal guidance is desired. At Renaissance, we define the
range of reading difficulty level that is neither too hard nor too easy as the Zone of
Proximal Development (ZPD). The ZPD range allows, potentially, optimal learning
to occur because students are engaged and appropriately challenged by reading
materials that match their reading achievement and interest. The ZPD range is
simply an approximation of the range of reading materials that is likely to benefit
the student most. ZPD ranges are not absolute and teachers should also use their
objective judgment to help students select reading books that enhance learning.
The Zone of Proximal Development is especially useful for students who use
Accelerated Reader, which provides readability levels on over 180,000 trade books
and textbooks. Renaissance Learning developed the ZPD ranges according to
Vygotskian theory, based on an analysis of Accelerated Reader book reading data
from 80,000 students in the 1996–1997 school year. More information is available
in The research foundation for Accelerated Reader goal-setting practices (2006),
which is published by Renaissance Learning ([Link]
[Link]).
Grade Placement
Star Reading software uses students’ grade placement—grade and percent of the
school year completed—when determining norm-referenced scores. The values of
PR (Percentile Rank) and NCE (Normal Curve Equivalent) are based not only on
what Scaled Score the student achieved, but also on the grade placement of the
student at the time of the test. For example, a second-grader who is 80% through
the school year with a Unified Scaled Score of 957 would have a PR of 74, while
a third-grader who is 80% through the school year with the same Unified Scaled
Score would have a PR of 38. Thus, it is crucial that student records indicate the
proper grade placement when students take a Star Reading test.
Grade Placement is shown as a decimal, with the integer being the student’s
grade and the decimal being the percentage of the school year that has passed
when a student completed an assignment or test; for example, a GP of 3.25 would
represent a third-grade student who is one-quarter of the way through the school
year. For purposes of this calculation, a “school year” is the span between First
Day for Students and Last Day for Students, which are entered as part of the new
school year setup process.
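A minimal sketch of this calculation, assuming the First Day and Last Day values
from the school-year setup are available as dates (the function and argument names
are hypothetical):

    from datetime import date

    def grade_placement(grade: int, first_day: date, last_day: date, test_date: date) -> float:
        """GP = grade + fraction of the school year elapsed on the test date."""
        year_len = (last_day - first_day).days
        elapsed = (test_date - first_day).days
        frac = min(max(elapsed / year_len, 0.0), 1.0)   # clamp to the school year
        return round(grade + frac, 2)

    # A third grader tested about one-quarter of the way through the year -> GP of 3.25
    print(grade_placement(3, date(2022, 8, 22), date(2023, 6, 2), date(2022, 10, 31)))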
Table 62: Relating Star Early Literacy Unified Scale Scores to Star Reading GE Scores and ZPD Ranges

SEL Literacy            Unified Scaled Score     Star Reading
Classification          Low      High            GE      ZPD Range    Recommended Assessment(s)
Emergent Reader         200      685             NA      NA           Star Early Literacy
                        686      692             0.0     0.0–1.0
                        693      701             0.1     0.1–1.1
                        702      709             0.2     0.2–1.2
                        710      718             0.3     0.3–1.3
                        719      726             0.4     0.4–1.4
                        727      735             0.5     0.5–1.5
                        736      744             0.6     0.6–1.6
                        745      753             0.7     0.7–1.7
                        754      762             0.8     0.8–1.8
                        763      771             0.9     0.9–1.9
                        772      780             1.0     1.0–2.0
Transitional Reader     781      789             1.1     1.1–2.1
(SS = 786)              790      798             1.2     1.2–2.2
                        799      807             1.3     1.3–2.3
                        808      816             1.4     1.4–2.4
                        817      824             1.5     1.5–2.5      Star Early Literacy and
                        825      833             1.6     1.6–2.6      Star Reading
                        834      841             1.7     1.7–2.7
                        842      849             1.8     1.8–2.8
Probable Reader         850      857             1.9     1.9–2.9
(SS = 852)              858      865             2.0     2.0–3.0      Star Reading
                        866      873             2.1     2.1–3.1
                        874      881             2.2     2.1–3.1
                        882      888             2.3     2.2–3.2
                        889      896             2.4     2.2–3.2
                        897      903             2.5     2.3–3.3
                        904      910             2.6     2.4–3.4
                        911      916             2.7     2.4–3.4
                        917      923             2.8     2.5–3.5
                        924      929             2.9     2.5–3.5
                        930      935             3.0     2.6–3.6
                        936      941             3.1     2.6–3.7
                        942      947             3.2     2.7–3.8
                        948      952             3.3     2.7–3.8
                        953      958             3.4     2.8–3.9
                        959      963             3.5     2.8–4.0
                        964      968             3.6     2.8–4.1
                        969      973             3.7     2.9–4.2
                        974      977             3.8     2.9–4.3
                        978      982             3.9     3.0–4.4
Table 63: Estimated Oral Reading Fluency (Est. ORF) Given in Words Correct
per Minute (WCPM) by Grade for Selected Star Reading Scale Score
Units (SR SS)
SR SS       Grade 1    Grade 2    Grade 3    Grade 4
50          0          4          0          8
100         29         30         32         31
150         41         40         43         41
200         55         52         52         47
250         68         64         60         57
300         82         78         71         69
350         92         92         80         80
400         111        106        97         93
450         142        118        108        104
500         142        132        120        115
550         142        152        133        127
600         142        175        147        137
650         142        175        157        145
700         142        175        167        154
750         142        175        170        168
800         142        175        170        184
850–1400    142        175        170        190
The Validity chapter of this technical manual places its emphasis on summaries
of Star Reading validity evidence, and on recent evidence that comes primarily
from the 34-item, standards-based version of the assessment, which was
introduced in 2011. However, the abundance of earlier evidence, and evidence
related to the 25-item Star Reading versions, is all part of the accumulation of
technical support for the validity and usefulness of Star Reading. Much of that
cumulative evidence is presented in this appendix, to ensure that the historical
continuity of research in support of Star Reading validity is not lost. The material
that follows touches on the following topics:
X Relationship of Star Reading Scores to Scores on Other Tests of Reading
Achievement
X Relationship of Star Reading Scores to Scores on State Tests of Accountability
in Reading
X Relationship of Star Reading Enterprise Scores to Scores on Previous
Versions
X Data from Post-Publication Studies
X Linking Star and State Assessments: Comparing Student- and School-Level
Data
X Classification Accuracy and Screening Data Reported to The National Center
on Response to Intervention (NCRTI)
When Scaled Scores on the other tests were available, they could be correlated with Star Reading 2.0 Scaled Scores. However,
since Percentile Ranks (PRs) are not on an equal-interval scale, when PRs were
reported for the other tests, they were converted into Normal Curve Equivalents
(NCEs). Scaled Scores or NCE scores were then used to compute the Pearson
product-moment correlation coefficients.
In an ongoing effort to gather evidence for the validity of Star Reading scores,
continual research on score validity has been undertaken. In addition to original
validity data gathered at the time of initial development, numerous other studies
have investigated the correlations between Star Reading tests and other external
measures. In addition to gathering concurrent validity estimates, predictive validity
estimates have also been investigated.
Table 64, Table 65, Table 66, and Table 67 present the correlation coefficients
between the scores on the Star Reading test and each of the other tests for which
data were received. Table 64 and Table 65 display “concurrent validity” data; that
is, correlations between Star Reading test scores and other tests administered
within a two-month time period. The date of administration ranged from spring
1999–spring 2013. More recently, data have become available for analyses
regarding the predictive validity of Star Reading. Predictive validity provides an
estimate of the extent to which scores on the Star Reading test predicted scores
on criterion measures given at a later point in time, operationally defined as more
than 2 months between the Star test (predictor) and the criterion test. Predictive
validity provides an estimate of the linear relationship between Star scores and
scores on tests covering a similar academic domain. Predictive correlations are
attenuated by time, because students gain skills in the interim between testing
occasions, and by differences between the tests’ content
specifications. Table 66 and Table 67 present predictive validity coefficients.
The tables are presented in two parts. Table 64 and Table 66 display validity
coefficients for grades 1–6, and Table 65 and Table 67 display the validity
coefficients for grades 7–12. The bottom of each table presents a grade-by-
grade summary, including the total number of students for whom test data were
available, the number of validity coefficients for that grade, and the average value
of the validity coefficients.
The within-grade average concurrent validity coefficients for grades 1–6 varied
from 0.72–0.80, with an overall average of 0.74. The within-grade average
concurrent validity for grades 7–12 ranged from 0.65–0.76, with an overall average
of 0.72. Predictive validity coefficients ranged from 0.69–0.72 in grades 1–6, with
an average of 0.71. In grades 7–12 the predictive validity coefficients ranged
from 0.72–0.87 with an average of 0.80. The other validity coefficient within-grade
averages (for Star Reading 2.0 with external tests administered prior to spring
1999, Table 68 and Table 69) varied from 0.60–0.77; the overall average was 0.72.
The process of establishing the validity of a test is laborious, and it usually takes
a significant amount of time. As a result, the validation of the Star Reading test is
an ongoing activity, with the goal of establishing evidence of the test’s validity for a
variety of settings and students. Star Reading users who collect relevant data are
encouraged to contact Renaissance Learning.
Since correlation coefficients are available for many different test editions, forms,
and dates of administration, many of the tests have several validity coefficients
associated with them. Data were omitted from the tabulations if (a) test data quality
could not be verified or (b) the sample size was very small. In general, these
correlation coefficients reflect very well on the validity of the Star Reading test as a
tool for placement in Reading. In fact, the correlations are similar in magnitude to
the validity coefficients of these measures with each other. These validity results,
combined with the supporting evidence of reliability and minimization of SEM
estimates for the Star Reading test, provide a quantitative demonstration of how
well this innovative instrument in reading achievement assessment performs.
Table 64: Concurrent Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Spring 1999–Spring 2013, Grades 1–6a
                                      Grade 1       Grade 2       Grade 3       Grade 4       Grade 5       Grade 6
Test          Form   Date    Score    n      r      n      r      n      r      n      r      n      r      n      r
DSTP S 06 SS – – 158 0.68* 126 0.43* 141 0.62* 157 0.59* 75 0.66*
Dynamic Indicators of Basic Early Literacy Skills (DIBELS) – Oral Reading Fluency
DIBELS F 05 WCPM – – 59 0.78* – – – – – – – –
DIBELS W 06 WCPM 61 0.87* 55 0.75* – – – – – – – –
DIBELS S 06 WCPM 67 0.87* 63 0.71* – – – – – – – –
DIBELS F 06 WCPM – – 515 0.78* 354 0.81* 202 0.72* – – – –
DIBELS W 07 WCPM 208 0.75* 415 0.73* 175 0.69* 115 0.71* – – – –
DIBELS S 07 WCPM 437 0.81* 528 0.70* 363 0.66* 208 0.54* – – – –
DIBELS F 07 WCPM – – 626 0.79* 828 0.73* 503 0.73* 46 0.73* – –
Florida Comprehensive Assessment Test (FCAT)
FCAT S 06 SS – – – – – – 41 0.65* – – – –
FCAT S 06–08 SS – – – – 10,169 0.76* 8,003 0.73* 5,474 0.73* 1,188 0.67*
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS – – – – 3,641 0.83* 3,025 0.84* 2,439 0.83* 145 0.81*
Gates–MacGinitie Reading Test (GMRT)
GMRT/2nd Ed S 99 NCE – – 21 0.89* – – – – – – – –
GMRT/L-3rd S 99 NCE – – 127 0.80* – – – – – – – –
Idaho Standards Achievement Test (ISAT)
ISAT S 07–09 SS – – – – 3,724 0.75* 2,956 0.74* 2,485 0.74* 1,309 0.75*
Illinois Standards Achievement Test – Reading
ISAT S 05 SS – – 106 0.71* 594 0.76* – – 449 0.70* – –
ISAT S 06 SS – – – – 140 0.80* 144 0.80* 146 0.72 – –
Summary
Grade(s)                  All        1        2        3         4         5         6
Number of students        255,538    1,068    3,629    76,942    66,400    54,173    31,686
Number of coefficients    195        10       18       47        47        41        32
Average validity          –          0.80     0.73     0.72      0.72      0.74      0.72
Overall average           0.74
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
Table 65: Concurrent Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Spring 1999–Spring 2013, Grades 7–12a
Summary
Grade(s)                  All       7         8         9        10       11      12
Number of students        48,789    25,032    21,134    1,774    755      55      39
Number of coefficients    74        30        29        7        5        2       1
Average validity          –         0.74      0.73      0.65     0.76     0.70    0.73
Overall average           0.72
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
Table 66: Predictive Validity Data: Star Reading 2 Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 1–6a
                                        Grade 1       Grade 2       Grade 3       Grade 4       Grade 5       Grade 6
Test          Form   Date(b)   Score    n      r      n      r      n      r      n      r      n      r      n      r
AIMSweb
R-CBM S 12 correct 60 0.14 156 0.38* 105 0.11 102 0.52* – – – –
Arkansas Augmented Benchmark Examination (AABE)
AABE F 07 SS – – – – 5,255 0.79* 5,208 0.77* 3,884 0.75* 3,312 0.75*
Colorado Student Assessment Program (CSAP)
CSAP F 04 – – – – – 82 0.72* 79 0.77* 93 0.70* 280 0.77*
Delaware Student Testing Program (DSTP) – Reading
DSTP S 05 – – – – – 189 0.58* – – – – – –
DSTP W 05 – – – – – 120 0.67* – – – – – –
DSTP S 05 – – – – – 161 0.52* 191 0.55* 190 0.62* – –
DSTP F 05 – – – 253 0.64* 214 0.39* 256 0.62* 270 0.59* 242 0.71*
DSTP W 05 – – – 275 0.61* 233 0.47* 276 0.59* 281 0.62* 146 0.57*
Florida Comprehensive Assessment Test (FCAT)
FCAT F 05 – – – – – – – 42 0.73* – – 409 0.67*
FCAT W 07 – – – – – – – – – – – 417 0.76*
FCAT F 05–07 SS – – – – 25,192 0.78* 21,650 0.75* 17,469 0.75* 9,998 0.73*
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS – – – – 6,788 0.78* 5,894 0.80* 5,374 0.80* 616 0.74*
Idaho Standards Achievement Test (ISAT)
ISAT F 08–10 SS – – – – 8,219 0.77* 8,274 0.77* 7,537 0.76* 5,742 0.77*
Illinois Standards Achievement Test (ISAT) – Reading
ISAT–R F 05 – – – – – 450 0.73* – – 317 0.68* – –
ISAT–R W 05 – – – – – 564 0.76* – – 403 0.68* – –
ISAT–R F 05 – – – – – 133 0.73* 140 0.74* 145 0.66* – –
ISAT–R W 06 – – – – – 138 0.76* 145 0.77* 146 0.70* – –
Iowa Assessment
IA F 12 SS – – – – 1,763 0.61* 1,826 0.61* 1,926 0.59* 1,554 0.64*
IA W 12 SS – – – – 548 0.60* 661 0.62* 493 0.64* 428 0.65*
IA S 12 SS – – – – 1,808 0.63* 1,900 0.63* 1,842 0.65* 1,610 0.63*
Kentucky Core Content Test (KCCT)
KCCT F 07–09 SS – – – – 16,521 0.62* 15,143 0.57* 12,549 0.53* 9,091 0.58*
Michigan Educational Assessment Program (MEAP) – English Language Arts
MEAP–EL F 04 – – – – – 193 0.60* 181 0.70* 170 0.75* 192 0.66*
MEAP–EL W 05 – – – – – 204 0.68* 184 0.74* 193 0.75* 200 0.70*
MEAP–EL S 05 – – – – – 192 0.73* 171 0.73* 191 0.71* 193 0.62*
MEAP–EL F 05 – – – – – 111 0.66* 132 0.71* 119 0.77* 108 0.60*
MEAP–EL W 06 – – – – – 114 0.77* – – 121 0.75* 109 0.66*
Michigan Educational Assessment Program (MEAP) – Reading
MEAP–R F 04 – – – – – 193 0.60* 181 0.69* 170 0.76* 192 0.66*
MEAP–R W 05 – – – – – 204 0.69* 184 0.74* 193 0.78* 200 0.70*
MEAP–R S 05 – – – – – 192 0.72* 171 0.72* 191 0.74* 193 0.62*
MEAP–R F 05 – – – – – 111 0.63* 132 0.70* 119 0.78* 108 0.62*
MEAP–R W 06 – – – – – 114 0.72* – – 121 0.75* 109 0.64*
Mississippi Curriculum Test (MCT2)
MCT2 F 01 – – – 86 0.57* 95 0.70* 97 0.65* 78 0.76* – –
MCT2 F 02 – – – 340 0.67* 337 0.67* 282 0.69* 407 0.71* 442 0.72*
MCT2 F 07 SS – – – – 6,184 0.77* 5,515 0.74* 5,409 0.74* 4,426 0.68*
North Carolina End–of–Grade (NCEOG) Test
NCEOG F 05–07 SS – – – – 6,976 0.81* 6,531 0.78* 6,077 0.77* 3,255 0.77*
New York State Assessment Program
NYSTP S 13 SS – – – – 349 0.73* – – – – – –
Ohio Achievement Assessment
OAA S 13 SS – – – – 28 0.78* 41 0.52* 29 0.79* 30 0.75*
Oklahoma Core Curriculum Test (OCCT)
OCCT F 04 – – – – – – – – – 44 0.63* – –
OCCT W 05 – – – – – – – – – 45 0.66* – –
OCCT F 05 – – – – – 89 0.59* 90 0.60* 79 0.69* 84 0.63*
OCCT W 06 – – – – – 60 0.65* 40 0.67* – – – –
South Dakota State Test of Educational Progress (DSTEP)
DSTEP F 07–09 SS – – – – 3,909 0.79* 3,679 0.78* 3,293 0.78* 2,797 0.79*
Star Reading
Star–R F 05 – 16,982 0.66* 42,601 0.78* 46,237 0.81* 44,125 0.83* 34,380 0.83* 23,378 0.84*
Star–R F 06 – 25,513 0.67* 63,835 0.78* 69,835 0.81* 65,157 0.82* 57,079 0.83* 35,103 0.83*
Star–R F 05 – 8,098 0.65* 20,261 0.79* 20,091 0.81* 18,318 0.82* 7,621 0.82* 5,021 0.82*
Star–R F 05 – 8,098 0.55* 20,261 0.72* 20,091 0.77* 18,318 0.80* 7,621 0.80* 5,021 0.79*
Star–R S 06 – 8,098 0.84* 20,261 0.82* 20,091 0.83* 18,318 0.83* 7,621 0.83* 5,021 0.83*
Star–R S 06 – 8,098 0.79* 20,261 0.80* 20,091 0.81* 18,318 0.82* 7,621 0.82* 5,021 0.81*
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS – – – – 6,132 0.81* 5,744 0.80* 5,327 0.79* 5,143 0.79*
Tennessee Comprehensive Assessment Program (TCAP)
TCAP S 11 SS – – – – 695 0.68* 602 0.72* 315 0.61* – –
TCAP S 12 SS – – – – 763 0.70* 831 0.33* 698 0.65* – –
TCAP S 13 SS – – – – 2,509 0.67* 1,897 0.63* 1,939 0.68* 431 0.65*
West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS – – – – 2,828 0.80* 3,078 0.73* 3,246 0.73* 3,214 0.73*
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE S 05–09 SS 15,706 0.75* 15,569 0.77* 13,980 0.78* 10,641 0.78*
Summary
Grade(s)                  All          1         2          3          4          5          6
Number of students        1,227,887    74,887    188,434    313,102    289,571    217,416    144,477
Number of coefficients    194          6         10         49         43         47         39
Average validity          –            0.69      0.72       0.70       0.71       0.72       0.71
Overall average           0.71
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
b. Dates correspond to the term and year of the predictor scores. With some exceptions, criterion scores were obtained during
the same academic year. In some cases, data representing multiple years were combined. These dates are reported as a
range (e.g. Fall 05–Fall 07).
Table 67: Predictive Validity Data: Star Reading 2 Correlations (r) with External Tests Administered
Fall 2005–Spring 2013, Grades 7–12a
                                        Grade 7       Grade 8       Grade 9       Grade 10      Grade 11      Grade 12
Test          Form   Date(b)   Score    n      r      n      r      n      r      n      r      n      r      n      r
Arkansas Augmented Benchmark Examination (AABE)
AABE F 07 SS 2,418 0.74* 1,591 0.75* – – – – – – – –
Colorado Student Assessment Program (CSAP)
CSAP F 05 – 299 0.83* 185 0.83* – – – – – – – –
Delaware Student Testing Program (DSTP) – Reading
DSTP S 05 – 100 0.75* 143 0.63* – – 48 0.66* – – – –
DSTP F 05 – 273 0.69* 247 0.70* 152 0.73* 97 0.78* – – – –
DSTP W 05 – – – 61 0.64* 230 0.64* 145 0.71* – – – –
Florida Comprehensive Assessment Test (FCAT)
FCAT F 05 – 381 0.61* 387 0.62* – – – – – – – –
FCAT W 07 – 342 0.64* 361 0.72* – – – – – – – –
FCAT F 05–07 SS 8,525 0.72* 6,216 0.72* – – – – – – – –
Florida Comprehensive Assessment Test (FCAT 2.0)
FCAT 2.0 S 13 SS 586 0.75* 653 0.78* – – – – – – – –
Idaho Standards Achievement Test (ISAT)
ISAT F 05–07 SS 4,119 0.76* 3,261 0.73* – – – – – – – –
Illinois Standards Achievement Test (ISAT) – Reading
ISAT F 05 – 173 0.51* 158 0.66* – – – – – – – –
Iowa Assessment
IA F 12 SS 1,264 0.60* 905 0.63* – – – – – – – –
IA W 12 SS 118 0.66* 72 0.67* – – – – – – – –
IA S 12 SS 1,326 0.68* 1,250 0.66* – – – – – – – –
Kentucky Core Content Test (KCCT)
KCCT F 07–09 SS 4,962 0.57* 2,530 0.58* – – – – – – – –
Michigan Educational Assessment Program (MEAP) – English Language Arts
MEAP F 04 – 181 0.71* 88 0.85* – – – – – – – –
MEAP W 05 – 214 0.73* 212 0.73* – – – – – – – –
MEAP S 05 – 206 0.75* 223 0.69* – – – – – – – –
MEAP F 05 – 114 0.66* 126 0.66* – – – – – – – –
MEAP W 06 – 114 0.64* 136 0.71* – – – – – – – –
MEAP S 06 – – – 30 0.80* – – – – – – – –
Michigan Educational Assessment Program (MEAP) – Reading
MEAP–R F 04 – 181 0.70* 88 0.84* – – – – – – – –
MEAP–R W 05 – 214 0.72* 212 0.73* – – – – – – – –
MEAP–R S 05 – 206 0.72* 223 0.69* – – – – – – – –
MEAP–R F 05 – 116 0.68* 138 0.66* – – – – – – – –
MEAP–R W 06 – 116 0.68* 138 0.70* – – – – – – – –
MEAP–R S 06 – – – 30 0.81* – – – – – – – –
Mississippi Curriculum Test (MCT2)
MCT2 F 02 – 425 0.68* – – – – – – – – – –
MCT2 F 07 SS 3,704 0.68* 3,491 0.73* – – – – – – – –
North Carolina End–of–Grade (NCEOG) Test
NCEOG F 05–07 SS 2,735 0.77* 2,817 0.77* – – – – – – – –
Ohio Achievement Assessment
OAA S 13 SS 53 0.82* 38 0.66* – – – – – – – –
South Dakota State Test of Educational Progress (DSTEP)
DSTEP F 07–09 SS 2,236 0.79* 2,073 0.78* – – – – – – – –
Star Reading
Star–R F 05 – 17,370 0.82* 9,862 0.82* 2,462 0.82* 15,277 0.85* 1,443 0.83* 596 0.85*
Star–R F 06 – 22,177 0.82* 19,152 0.82* 4,087 0.84* 2,624 0.85* 2,930 0.85* 2,511 0.86*
Star–R F 05 – 5,399 0.81* 641 0.76* 659 0.89* 645 0.88* 570 0.90* – –
Star–R F 05 – 5,399 0.79* 641 0.76* 659 0.83* 645 0.83* 570 0.87* – –
Star–R S 06 – 5,399 0.82* 641 0.83* 659 0.87* 645 0.88* 570 0.89* – –
Star–R S 06 – 5,399 0.80* 641 0.83* 659 0.85* 645 0.85* 570 0.86*
State of Texas Assessments of Academic Readiness Standards Test (STAAR)
STAAR S 12–13 SS 4,716 0.77* 4,507 0.76* – – – – – – – –
Tennessee Comprehensive Assessment Program (TCAP)
TCAP S 13 SS 332 0.81* 233 0.74* – – – – – – – –
West Virginia Educational Standards Test 2 (WESTEST 2)
WESTEST 2 S 12 SS 2,852 0.71* 2,636 0.74* – – – – – – – –
Wisconsin Knowledge and Concepts Examination (WKCE)
WKCE S 05–09 SS 6,399 0.78* 5,500 0.78* 401 0.78*
Summary
Grade(s)                  All        7          8         9        10        11       12
Number of students        224,179    111,143    72,537    9,567    21,172    6,653    3,107
Number of coefficients    106        39         41        8        10        6        2
Average validity          –          0.72       0.73      0.81     0.81      0.87     0.86
Overall average           0.80
a. * Denotes correlation coefficients that are statistically significant at the 0.05 level.
b. Dates correspond to the term and year of the predictor scores. With some exceptions, criterion scores were obtained during
the same academic year. In some cases, data representing multiple years were combined. These dates are reported as a
range (e.g. Fall 05–Fall 07).
Table 68: Other External Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Prior to Spring 1999, Grades 1–6a
Summary
Grade(s)                  All      1       2       3       4        5       6
Number of students        4,289    150     691     734     1,091    871     752
Number of coefficients    95       7       14      19      16       18      21
Average validity          –        0.75    0.72    0.73    0.74     0.73    0.71
Overall average           0.73
Table 69: Other External Validity Data: Star Reading 2 Correlations (r) with External Tests
Administered Prior to Spring 1999, Grades 7–12a
                               Grade 7       Grade 8       Grade 9       Grade 10      Grade 11      Grade 12
Test/Form      Date   Score    n      r      n      r      n      r      n      r      n      r      n      r
California Achievement Test (CAT)
/4 Spr 98 Scaled – – 11 0.75* – – – – – – – –
/5 Spr 98 NCE 80 0.85* – – – – – – – – – –
Comprehensive Test of Basic Skills (CTBS)
/4 Spr 97 NCE – – 12 0.68* – – – – – – – –
/4 Spr 98 NCE 43 0.84* – – – – – – – – – –
/4 Spr 98 Scaled 107 0.44* 15 0.57* 43 0.86* – – – – – –
A-16 Spr 98 Scaled 24 0.82* – – – – – – – – – –
Explore (ACT Program for Educational Planning, 8th Grade)
Fall 97 NCE – – – – 67 0.72* – – – – – –
Fall 98 NCE – – 32 0.66* – – – – – – – –
Iowa Test of Basic Skills (ITBS)
Form K Spr 98 NCE – – – – 35 0.84* – – – – – –
Form K Fall 98 NCE 32 0.87* 43 0.61* – – – – – – – –
Form K Fall 98 Scaled 72 0.77* 67 0.65* 77 0.78* – – – – – –
Form L Fall 98 NCE 19 0.78* 13 0.73* – – – – – – – –
Metropolitan Achievement Test (MAT)
7th Ed. Spr 97 Scaled 114 0.70* – – – – – – – – – –
7th Ed. Spr 98 NCE 46 0.84* 63 0.86* – – – – – – – –
7th Ed. Spr 98 Scaled 88 0.70* – – – – – – – – – –
7th Ed. Fall 98 NCE 50 0.55* 48 0.75* – – – – – – – –
Missouri Mastery Achievement Test (MMAT)
Spr 98 Scaled 24 0.62* 12 0.72* – – – – – – – –
North Carolina End of Grade Test (NCEOG)
Spr 97 Scaled – – – – – – 58 0.81* – – – –
Spr 98 Scaled – – – – 73 0.57* – – – – – –
PLAN (ACT Program for Educational Planning, 10th Grade)
Fall 97 NCE – – – – – – – – 46 0.71* – –
Fall 98 NCE – – – – – – 104 0.53* – – – –
Preliminary Scholastic Aptitude Test (PSAT)
Fall 98 Scaled – – – – – – – – 78 0.67* – –
Stanford Achievement Test (Stanford)
9th Ed. Spr 97 Scaled – – – – – – – – – – 11 0.90*
7th Ed. Spr 98 Scaled – – 8 0.83* – – – – – – – –
8th Ed. Spr 98 Scaled 6 0.89* 8 0.78* 91 0.62* – – 93 0.72* – –
9th Ed. Spr 98 Scaled 72 0.73* 78 0.71* 233 0.76* 32 0.25 64 0.76* – –
4th Ed. Spr 98 Scaled – – – – – – 55 0.68* – – – –
3/V
9th Ed. Fall 98 NCE 92 0.67* – – – – – – – – – –
9th Ed. Fall 98 Scaled – – – – 93 0.75* – – – – 70 0.75*
Stanford Reading Test
3rd Ed. Fall 97 NCE – – – – 5 0.81 24 0.82* – – – –
TerraNova
Fall 97 NCE 103 0.69* – – – – – – – – – –
Spr 98 Scaled – – 87 0.82* – – 21 0.47* – – – –
Fall 98 NCE 35 0.69* 32 0.74* – – – – – – – –
Test of Achievement and Proficiency (TAP)
Spr 97 NCE – – – – – – – – 36 0.59* – –
Spr 98 NCE – – – – – – 41 0.66* – – 43 0.83*
Texas Assessment of Academic Skills (TAAS)
Spr 97 TLI – – – – – – – – – – 41 0.58*
Wide Range Achievement Test 3 (WRAT3)
Spr 98 9 0.35 – – – – – – – – – –
Fall 98 – – – – 16 0.80* – – – – – –
Wisconsin Reading Comprehension Test
Spr 98 – – – – – – 63 0.58* – – – –
Summary
Grade(s)                  All      7        8       9       10      11      12
Number of students        3,158    1,016    529     733     398     317     165
Number of coefficients    60       18       15      10      8       5       4
Average validity          –        0.71     0.72    0.75    0.60    0.69    0.77
Overall average           0.71
Table 70: Concurrent Validity Data: Star Reading 2 Correlations (r) with State Accountability Tests,
Grades 3–8a
                      Grade 3       Grade 4       Grade 5       Grade 6       Grade 7       Grade 8
Date       Score      n      r      n      r      n      r      n      r      n      r      n      r
Colorado Student Assessment Program
Spr 06 Scaled 82 0.75* 79 0.83* 93 0.68* 280 0.80* 299 0.84* 185 0.83*
Delaware Student Testing Program—Reading
Spr 05 Scaled 104 0.57* – – – – – – – – – –
Spr 06 Scaled 126 0.43* 141 0.62* 157 0.59* 75 0.66* 150 0.72 – –
Florida Comprehensive Assessment Test
Spr 06 SSS – – 41 0.65* – – – – – – 74 0.65*
Illinois Standards Achievement Test—Reading
Spr 05 Scaled 594 0.76* – – 449 0.70* – – – – 157 0.73*
Spr 06 Scaled 140 0.80* 144 0.80* 146 0.72* – – 140 0.70* – –
Michigan Educational Assessment Program—English Language Arts
Fall 04 Scaled – – 155 0.81* – – – – 154 0.68* – –
Fall 05 Scaled 218 0.76* 196 0.80* 202 0.80* 207 0.69* 233 0.72* 239 0.70*
Fall 06 Scaled 116 0.79* 132 0.69* 154 0.81* 129 0.66* 125 0.79* 152 0.74*
Michigan Educational Assessment Program—Reading
Fall 04 Scaled – – 155 0.80* – – – – 156 0.68* – –
Fall 05 Scaled 218 0.77* 196 0.78* 202 0.81* 207 0.68* 233 0.71* 239 0.69*
Fall 06 Scaled 116 0.75* 132 0.70* 154 0.82* 129 0.70* 125 0.86* 154 0.72*
Mississippi Curriculum Test
Spr 02 Scaled 148 0.62* 175 0.66* 81 0.69* – – – – – –
Spr 03 Scaled 389 0.71* 359 0.70* 377 0.70* 364 0.72* 372 0.70* – –
Oklahoma Core Curriculum Test
Spr 06 Scaled 78 0.62* 92 0.58* 46 0.52* 80 0.60* – – – –
Summary
Grades                    All       3        4        5        6        7        8
Number of students        11,045    2,329    1,997    2,061    1,471    1,987    1,200
Number of coefficients    61        12       13       11       8        10       7
Average validity          –         0.72     0.73     0.73     0.71     0.74     0.73
Overall validity          0.73
a. Sample sizes are in the columns labeled “n.”
* Denotes correlation coefficients that are statistically significant (p < 0.05).
Table 71: Predictive Validity Data: Star Reading Scaled Scores Predicting Later Performance for
Grades 3–8 on Numerous State Accountability Testsa
                                   Grade 3       Grade 4       Grade 5       Grade 6       Grade 7       Grade 8
Predictor     Criterion
Date          Date(b)              n      r      n      r      n      r      n      r      n      r      n      r
Colorado Student Assessment Program
Fall 05 Spr 06 82 0.72* 79 0.77* 93 0.70* 280 0.77* 299 0.83* 185 0.83*
Delaware Student Testing Program—Reading
Fall 04 Spr 05 189 0.58* – – – – – – – – – –
Win 05 Spr 05 120 0.67* – – – – – – – – – –
Spr 05 Spr 06 161 0.52* 191 0.55* 190 0.62* – – 100 0.75* 143 0.63*
Fall 05 Spr 06 214 0.39* 256 0.62* 270 0.59* 242 0.71* 273 0.69* 247 0.70*
Win 05 Spr 06 233 0.47* 276 0.59* 281 0.62* 146 0.57* – – 61 0.64*
Florida Comprehensive Assessment Test
Fall 05 Spr 06 – – 42 0.73* – – 409 0.67* 381 0.61* 387 0.62*
Win 07 Spr 07 – – – – – – 417 0.76* 342 0.64* 361 0.72*
Illinois Standards Achievement Test—Reading
Fall 04 Spr 05 450 0.73* – – 317 0.68* – – – – – –
Win 05 Spr 05 564 0.76* – – 403 0.68* – – – – – –
Fall 05 Spr 06 133 0.73* 140 0.74* 145 0.66* – – 173 0.51* 158 0.66*
Win 06 Spr 06 138 0.76* 145 0.77* 146 0.70* – – – – – –
Michigan Educational Assessment Program—English Language Arts
Fall 04 Fall 05P 193 0.60* 181 0.70* 170 0.75* 192 0.66* 181 0.71* 88 0.85*
Win 05 Fall 05P 204 0.68* 184 0.74* 193 0.75* 200 0.70* 214 0.73* 212 0.73*
Spr 05 Fall 05P 192 0.73* 171 0.73* 191 0.71* 193 0.62* 206 0.75* 223 0.69*
Fall 05 Fall 06P 111 0.66* 132 0.71* 119 0.77* 108 0.60* 114 0.66* 126 0.66*
Win 06 Fall 06P 114 0.77* – – 121 0.75* 109 0.66* 114 0.64* 136 0.71*
Spr 06 Fall 06P – – – – – – – – – – 30 0.80*
Michigan Educational Assessment Program—Reading
Fall 04 Fall 05P 193 0.60* 181 0.69* 170 0.76* 192 0.66* 181 0.70* 88 0.84*
Win 05 Fall 05P 204 0.69* 184 0.74* 193 0.78* 200 0.70* 214 0.72* 212 0.73*
Spr 05 Fall 05P 192 0.72* 171 0.72* 191 0.74* 193 0.62* 206 0.72* 223 0.69*
Fall 05 Fall 06P 111 0.63* 132 0.70* 119 0.78* 108 0.62* 116 0.68* 138 0.66*
Win 06 Fall 06P 114 0.72* – – 121 0.75* 109 0.64* 116 0.68* 138 0.70*
Spr 06 Fall 06P – – – – – – – – – – 30 0.81*
Mississippi Curriculum Test
Fall 01 Spr 02 95 0.70* 97 0.65* 78 0.76* – – – – – –
Fall 02 Spr 03 337 0.67* 282 0.69* 407 0.71* 442 0.72* 425 0.68* – –
Oklahoma Core Curriculum Test
Fall 04 Spr 05 – – – – 44 0.63* – – – – – –
Win 05 Spr 05 – – – – 45 0.66* – – – – – –
Fall 05 Spr 06 89 0.59* 90 0.60* 79 0.69* 84 0.63* – – – –
Win 06 Spr 06 60 0.65* 40 0.67* – – – – – – – –
Summary
Grades                    All       3        4        5        6        7        8
Number of students        22,018    4,493    2,974    4,086    3,624    3,655    3,186
Number of coefficients    119       24       19       23       17       17       19
Average validity          –         0.66     0.68     0.70     0.68     0.69     0.70
Overall validity          0.68
a. Grade given in the column signifies the grade within which the Predictor variable was given (as some validity estimates span
contiguous grades).
b. P indicates a criterion measure was given in a subsequent grade from the predictor.
* Denotes significant correlation (p < 0.05).
Star Reading was released for use in June 2011. In the course of its development,
Star Reading was administered to thousands of students who also took previous
versions. The correlations between Star Reading and previous versions of
Star Reading provide validity evidence of their own. To the extent that those
correlations are high, they would provide evidence that the current Star Reading
and previous versions are measuring the same or highly similar underlying
attributes, even though they are dissimilar in content and measurement precision.
Table 72 displays data on the correlations between Star Reading and scores on
two previous versions: classic versions of Star Reading (which includes versions
2.0 through 4.3) and Star Reading Progress Monitoring (version 4.4.) Both of those
Star Reading versions are 25-item versions that are highly similar to one another,
differing primarily in terms of the software that delivers them; for all practical
purposes, they may be considered alternate forms of Star Reading.
Table 72: Correlations of Star Reading with Scores on Star Reading Classic
and Star Reading Progress Monitoring Tests
                         Star Reading             Star Reading
                         Classic Versions         Progress Monitoring Version
Grade                    N         r              N         r
1                        810       0.73           539       0.87
2                        1,762     0.81           910       0.85
3                        2,830     0.81           1,140     0.83
4                        2,681     0.81           1,175     0.82
5                        2,326     0.80           919       0.82
6                        1,341     0.85           704       0.84
7                        933       0.76           349       0.81
8                        811       0.80           156       0.85
9                        141       0.76           27        0.75
10                       107       0.79           20        0.84
11                       84        0.87           6         0.94
12                       74        0.78           5         0.64
All Grades Combined      13,979    0.87           5,994     0.88
Table 73: Correlations of Star Reading 2.0 Scores with SAT9 and California
Standards Test Scores, by Grade
Grade     SAT9 Total Reading     CST English and Language Arts
3         0.82                   0.78
4         0.83                   0.81
5         0.83                   0.79
6         0.81                   0.78
In summary, the average correlation between Star Reading and SAT9 was 0.82.
The average correlation with CST was 0.80. These values are evidence of the
validity of Star Reading for predicting performance on both norm-referenced
reading tests such as the SAT9, and criterion-referenced accountability measures
such as the CST. Bennicoff-Nan concluded that Star Reading was “a time and
labor effective” means of progress monitoring in the classroom, as well as
suitable for program evaluation and monitoring student progress toward state
accountability goals.
Recently the assessment scale associated with Star Reading has been linked
to the scales used by virtually every state summative reading or ELA test in the
US. Linking Star Reading assessments to state tests allows educators to reliably
predict student performance on their state assessment using Star Reading scores.
More specifically, it places teachers in a position to identify
X which students are on track to succeed on the year-end summative state test,
and
X which students might need additional assistance to reach proficiency.
Methodology Comparison
Recently, Renaissance Learning has developed linkages between Star Reading
Scaled Scores and scores on the accountability tests of a number of states.
Depending on the kind of data available for such linking, these linkages have been
accomplished using one of two different methods. One method used student-
level data, where both Star and state test scores were available for the same
students. The other method used school-level data; this method was applied when
approximately 100% of students in a school had taken Star Reading, but individual
students’ state test scores were not available.
Student-Level Data
Typically, states classify students into one of three, four, or five performance levels
on the basis of cut scores (e.g. Below Basic, Basic, Proficient, or Advanced).
After each testing period, a distribution of students falling into each of these
categories will always exist (e.g. 30% in Basic, 25% in Proficient, etc.). Because
Star data were available for the same students who completed the state test,
the distributions could be linked via equipercentile linking analysis (see Kolen
& Brennan, 2004) to scores on the state test. This process creates tables of
approximately equivalent scores on each assessment, allowing for the lookup of
Star scale scores that correspond to the cut scores for different performance levels
on the state test. For example, if 20% of students were “Below Basic” on the state
test, the lowest Star cut score would be set at a score that partitioned only the
lowest 20% of scores.
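As a simplified illustration of this linking step (not the full equipercentile
procedure of Kolen & Brennan), the sketch below places a Star cut score at the
quantile of the matched Star score distribution that corresponds to the proportion
of students below the state-test cut; the function name is hypothetical.

    import numpy as np

    def star_cut_score(star_scores, prop_below_state_cut):
        """star_scores: Star Scaled Scores for the students who also took the
        state test; prop_below_state_cut: e.g. 0.20 if 20% scored Below Basic."""
        return float(np.quantile(np.asarray(star_scores), prop_below_state_cut))

    # If 20% of students were Below Basic on the state test, the lowest Star cut
    # score is placed near the 20th percentile of the matched Star distribution.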
School-Level Data
While student-level data are still commonly used, individual records are often
difficult and time-consuming to obtain and analyze, and given schools'
time-sensitive needs, student-level data are not always an option. As an
alternative, school-level data may be used in a similar manner. These data are
publicly available, which makes the linking process more efficient.
School-level data were analyzed for some of the states included in the student-
level linking analysis. In an effort to increase sample size, the school-level data
presented here represent “projected” Scaled Scores. Each Star score was
projected to the mid-point of the state test administration window using decile-
based growth norms. The growth norms are both grade- and subject-specific and
are based on the growth patterns of more than one million students using Star
assessments over a three-year period. The linking process for school-level data
is very similar to the process described above: the distribution of projected
Star scores is compared to the observed distribution of state test scores, and
equivalent cut scores are created for the Star assessments. The key difference
is that these comparisons are made at the group level.
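A minimal sketch of the projection step is shown below, assuming invented decile cut points and weekly growth rates; the actual decile-based growth norms are grade- and subject-specific and are derived from the multi-year growth data described above.

import bisect
from datetime import date

# Invented placeholders: weekly Scaled Score growth by starting-score decile and
# the decile cut points for one grade. The real decile-based growth norms are
# grade- and subject-specific and are not reproduced here.
HYPOTHETICAL_WEEKLY_GROWTH_BY_DECILE = [6.0, 5.5, 5.0, 4.6, 4.2, 3.9, 3.6, 3.3, 3.0, 2.7]
HYPOTHETICAL_DECILE_CUTS = [350, 420, 480, 530, 580, 630, 690, 760, 850]

def project_score(scaled_score: float, test_date: date, window_midpoint: date) -> float:
    """Project a Star Scaled Score to the mid-point of the state test
    administration window using a decile-specific weekly growth rate."""
    decile = bisect.bisect_right(HYPOTHETICAL_DECILE_CUTS, scaled_score)
    weeks_elapsed = (window_midpoint - test_date).days / 7.0
    return scaled_score + weeks_elapsed * HYPOTHETICAL_WEEKLY_GROWTH_BY_DECILE[decile]

# A fall Star score projected forward to a spring state-test window mid-point.
print(round(project_score(545, date(2023, 9, 20), date(2024, 4, 15)), 1))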
Accuracy Comparisons
Accuracy comparisons between student- and school-level data are particularly
important given the marked resource differences between the two methods. These
comparisons are presented for three states1 in Table 74, Table 75, and Table
76. With few exceptions, results of linking using school-level data were nearly
identical to student-level data on measures of specificity, sensitivity, and overall
accuracy. McLaughlin and Bandeira de Mello (2002) employed similar methods in
their comparison of NAEP scores and state assessment results, and this method
has been used several times since then (McLaughlin & Bandeira de Mello, 2003;
Bandeira de Mello, Blankenship, & McLaughlin, 2009; Bandeira et al., 2008).
In a similar comparison study using group-level data, Cronin et al. (2007) observed
cut score estimates comparable to those requiring student-level data.
1. Data were available for Arkansas, Florida, Idaho, Kansas, Kentucky, Mississippi, North
Carolina, South Dakota, and Wisconsin; however, only North Carolina, Mississippi, and
Kentucky are included in the current analysis.
Table 75: Comparison of School-Level and Student-Level Classification Diagnostics for
Reading/Language Arts

                  Sensitivity (a)     Specificity (b)     False+ Rate (c)     False– Rate (d)     Overall Accuracy
State   Grade     Student   School    Student   School    Student   School    Student   School    Student   School
NC 3 89% 83% 75% 84% 25% 16% 11% 17% 83% 83%
4 90% 81% 69% 80% 31% 20% 10% 19% 82% 81%
5 90% 77% 69% 83% 31% 17% 10% 23% 81% 80%
6 85% 85% 75% 75% 25% 25% 15% 15% 81% 81%
7 84% 76% 77% 82% 23% 18% 16% 24% 80% 79%
8 83% 79% 74% 74% 26% 26% 17% 21% 79% 76%
MS 3 66% 59% 86% 91% 14% 9% 34% 41% 77% 76%
4 71% 68% 87% 88% 13% 12% 29% 32% 79% 79%
5 70% 68% 84% 85% 16% 15% 30% 32% 78% 78%
6 67% 66% 84% 84% 16% 16% 33% 34% 77% 77%
7 63% 66% 88% 86% 12% 14% 37% 34% 79% 79%
8 69% 72% 86% 85% 14% 15% 31% 28% 79% 80%
KY 3 91% 91% 49% 50% 51% 50% 9% 9% 83% 83%
4 90% 86% 46% 59% 54% 41% 10% 14% 81% 80%
5 88% 81% 50% 65% 50% 35% 12% 19% 79% 77%
6 89% 84% 53% 63% 47% 37% 11% 16% 79% 79%
7 86% 81% 56% 66% 44% 34% 14% 19% 77% 76%
8 89% 84% 51% 63% 49% 37% 11% 16% 79% 78%
a. Sensitivity refers to the proportion of correct positive predictions.
b. Specificity refers to the proportion of negatives that are correctly identified (e.g. student will not meet a particular cut score).
c. False + rate refers to the proportion of students incorrectly identified as “at-risk.”
d. False – rate refers to the proportion of students incorrectly identified as not “at-risk.”
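For reference, the following sketch computes the diagnostics defined in notes a–d, plus overall accuracy, from a hypothetical 2×2 table of predicted versus observed at-risk status; the counts are invented for illustration.

def classification_diagnostics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's diagnostics from a 2x2 table of predicted vs. observed
    at-risk status. "Positive" means below the proficiency cut (i.e., at-risk):
    tp = correctly flagged, fp = flagged but proficient,
    tn = correctly not flagged, fn = at-risk but not flagged."""
    return {
        "sensitivity": tp / (tp + fn),            # note a: correct positive predictions
        "specificity": tn / (tn + fp),            # note b: negatives correctly identified
        "false_positive_rate": fp / (fp + tn),    # note c: incorrectly flagged as at-risk
        "false_negative_rate": fn / (tp + fn),    # note d: at-risk students who were missed
        "overall_accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a single grade within one state.
print(classification_diagnostics(tp=420, fp=110, tn=890, fn=55))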
Table 76: Comparison of Differences Between Achieved and Forecasted Performance Levels in
Reading/Language Arts (Forecast % – Achieved %)

                  Performance Level 1    Performance Level 2    Performance Level 3    Performance Level 4
State   Grade     Student   School       Student   School       Student   School       Student   School
NC                Level I                Level II               Level III              Level IV
3 –6.1% –1.1% 2.0% 1.1% 3.6% –0.8% 0.4% 0.9%
4 –3.9% –2.0% –0.1% 1.3% 4.3% 0.4% –0.3% 0.2%
5 –5.1% –1.9% –0.7% 2.4% 8.1% –0.7% –2.3% 0.2%
6 –2.1% 0.2% 0.8% –0.4% 3.2% –11.5% –2.0% 11.7%
7 –6.4% –0.9% 2.9% –0.4% 6.3% –0.7% –2.8% 2.0%
8 –4.9% –3.0% 3.0% 0.4% 5.1% 2.3% –3.1% 0.3%
MS Minimal Basic Proficient Advanced
3 5.2% 14.1% 3.9% 0.5% –6.1% –13.4% –3.0% –1.2%
4 5.6% 10.9% 0.2% –3.1% –3.0% –5.9% –2.8% –1.8%
5 4.2% 12.6% 0.4% –6.7% –2.7% –7.2% –1.9% 1.3%
6 1.9% 6.2% 2.0% –1.5% –3.8% –7.1% 0.0% 2.4%
7 5.3% 7.0% 1.1% –2.8% –6.3% –5.3% –0.2% 1.0%
8 6.8% 5.5% –1.7% –2.8% –4.6% –4.3% –0.5% 1.5%
KY Novice Apprentice Proficient Distinguished
3 –3.5% –1.4% 0.8% –1.4% 6.4% 3.1% –3.7% –0.3%
4 –0.5% –0.3% –2.5% 2.9% 6.8% –2.1% –3.9% –0.5%
5 –1.6% 1.0% –2.3% 3.7% 9.1% –2.9% –5.3% –1.8%
6 –1.5% 1.9% –3.6% –1.1% 7.3% 0.0% –2.3% –0.8%
7 –0.9% 0.6% –2.5% 2.5% 6.6% –1.7% –3.3% –1.4%
8 –0.1% 1.0% –5.1% 1.1% 8.1% –3.0% –2.9% 0.8%
When evaluating the validity of screening tools, NCRTI considered several factors:
•  classification accuracy
•  validity
•  disaggregated validity and classification data for diverse populations
“ROC curve analyses not only provide information about cut scores, but also
provide a natural common scale for comparing different predictors that are
measured in different units, whereas the odds ratio in logistic regression analysis
must be interpreted according to a unit increase in the value of the predictor, which
can make comparison between predictors difficult.” (Pepe et al., 2004)
“An overall indication of the diagnostic accuracy of a ROC curve is the area
under the curve (AUC). AUC values closer to 1 indicate the screening measure
reliably distinguishes among students with satisfactory and unsatisfactory reading
performance, whereas values at .50 indicate the predictor is no better than
chance.” (Zhou, Obuchowski, & McClish, 2002)
Initial Star Reading classification analyses were performed using state assessment
data from Arkansas, Delaware, Illinois, Michigan, Mississippi, and Kansas.
Collectively these states cover most regions of the country (Central, Southwest,
Northeast, Midwest, and Southeast). Both the Classification Accuracy and Cross
Validation study samples were drawn from an initial pool of 79,045 matched
student records covering grades 2–11. The sample used for this analysis was 49
percent female and 28 percent male, with 44 percent not responding. Twenty-
eight percent of students were White, 14 percent were Black, and 2 percent were
Hispanic. Lastly, 0.4 percent were Asian or Pacific Islander and 0.2 percent were American
Indian or Alaskan Native. Ethnicity data were not provided for 55.4 percent of the
sample.
A secondary analysis using data from a single state assessment was then
performed. The sample used for this analysis was 42,771 matched Star Reading
and South Dakota Test of Education Progress records. The sample covered
grades 3–8 and was 28 percent female and 28 percent male. Seventy-one percent
of students were White and 26 percent were American Indian or Alaskan Native.
Lastly, 1 percent were Black, 1 percent were Hispanic, and 0.7 percent were
Asian or Pacific Islander.
An ROC analysis was used to compare the performance data on Star Reading to
performance data on state achievement tests. The Star Reading Scaled Scores
used for analysis originated from assessments 3–11 months before the state
achievement test was administered. Selection of cut scores was based on the
graph of sensitivity and specificity versus the Scaled Score. For each grade, the
Scaled Score chosen as the cut point was equal to the score where sensitivity
and specificity intersected. The classification analyses, cut points and outcome
measures are outlined in Table 78. When collapsed across ethnicity, AUC values
were all greater than 0.80. Descriptive notes for other values represented in the
table are provided in the table footnote.
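The sketch below illustrates this kind of cut-point selection with synthetic data: sensitivity and specificity are computed for each candidate Scaled Score cut, the cut where the two curves intersect is chosen, and AUC is approximated with the Mann–Whitney formulation. It is an illustration under assumed data, not the code used in these analyses.

import numpy as np

def choose_cut_point(star_scores, not_proficient):
    """star_scores    : Star Scaled Scores, one per student
       not_proficient : True if the student fell below proficient on the state test
       Returns (cut_score, sensitivity, specificity, auc)."""
    scores = np.asarray(star_scores, dtype=float)
    positive = np.asarray(not_proficient, dtype=bool)   # "positive" = at-risk
    best = None
    for cut in np.unique(scores):
        flagged = scores < cut                          # flagged as at-risk by this cut
        sensitivity = flagged[positive].mean() if positive.any() else 0.0
        specificity = (~flagged[~positive]).mean() if (~positive).any() else 0.0
        gap = abs(sensitivity - specificity)
        if best is None or gap < best[0]:
            best = (gap, float(cut), float(sensitivity), float(specificity))
    # AUC via the Mann-Whitney formulation: the probability that a randomly chosen
    # at-risk student scores below a randomly chosen not-at-risk student.
    pos, neg = scores[positive], scores[~positive]
    auc = (pos[:, None] < neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
    return best[1], best[2], best[3], float(auc)

# Synthetic illustration: at-risk students tend to earn lower Scaled Scores.
rng = np.random.default_rng(1)
at_risk = rng.random(2000) < 0.3
scaled_scores = np.where(at_risk, rng.normal(450, 120, 2000), rng.normal(650, 120, 2000))
print(choose_cut_point(scaled_scores, at_risk))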
Table 79: Overall Concurrent and Predictive Validity Evidence for Star Reading

Type of Validity   Grade        Test                           N (Range)          Coefficient Range   Coefficient Median
Predictive         3–6          CST                            1,000+             0.78–0.81           0.80
Predictive         2–6          SAT9                           44–389             0.66–0.73           0.68
Concurrent         1–8          Suffolk Reading Scale          2,694              0.78–0.88           0.84
Construct          3, 5, 7, 10  DRP                            273–424            0.76–0.86           0.82
Concurrent         1–4          DIBELS Oral Reading Fluency    12,220             0.71–0.87           0.81
Predictive         1–6          State Achievement Tests        74,877–200,929     0.68–0.82           0.79
Predictive         7–12         State Achievement Tests        3,107–64,978       0.81–0.86           0.82
Concurrent         3–8          State Achievement Tests        1,200–2,329        0.71–0.74           0.73
Predictive         3–8          State Achievement Tests        2,974–4,493        0.66–0.70           0.68
Disaggregated Validity

Type of Validity        Age or Grade   Test or Criterion   n (Range)   Coefficient Range   Coefficient Median
Predictive (White)      2–6            SAT9                35–287      0.69–0.75           0.72
Predictive (Hispanic)   2–6            SAT9                7–76        0.55–0.74           0.675
Bulut, O., & Cormier, D. C. (2018). Validity evidence for progress monitoring
with Star Reading: Slope estimates, administration frequency, and
number of data points. Frontiers in Education, 3(68), 1–12.
Diggle, P., Heagerty, P., Liang, K., & Zeger, S. (2002). Analysis of longitudinal
data (2nd ed.). Oxford: Oxford University Press.
Duncan, T., Duncan, S., Strycker, L., Li, F., & Alpert, A. (1999). An
introduction to latent variable growth curve modeling: Concepts, issues,
and applications. Mahwah, NJ: Lawrence Erlbaum Associates.
Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking (2nd
ed.). New York: Springer.
Meyer, B., & Rice, G. E. (1984). The structure of text. In P.D. Pearson (Ed.),
Handbook on reading research (pp. 319–352). New York: Longman.
Neter, J., Kutner, M., Nachtsheim, C., & Wasserman, W. (1996). Applied
linear statistical models (4th ed.). New York: WCB McGraw-Hill.
M
Maximum likelihood IRT estimation, 31
Measurement precision, 48
Metropolitan Achievement Test (MAT), 136

N
National Center for Educational Statistics (NCES), 98
National Center on Response to Intervention (NCRTI), 136, 168, 169
    disaggregated validity and classification data, 83
Normal Curve Equivalent (NCE), 111
Norming, 96
    data analysis, 101
    growth norms, 103
    sample characteristics, 97
    test administration, 101
    test score norms, 96

O
Oral Reading Fluency, 78

P
Password entry, 13
Pathway to Proficiency, 163
Percentile Rank (PR), 110
Permissions, 13
Post-publication studies, 162
Post-publication study data
    correlations with a measure of reading comprehension, 76
    correlations with reading tests in England, 75
    correlations with SAT9, 74
    investigating Oral Reading Fluency and developing the Est. ORF (Estimated Oral Reading Fluency) scale, 78
PR. See Percentile Rank (PR)
Practice session, 6
Program description, 1
Purpose of the program, 1

R
Rasch model, 36
Reader
    Emergent, 133
    Probable, 133
    Transitional, 133
Receiver Operating Characteristic (ROC) curves, 169
Reliability, 48
    alternate-form, 52
    coefficient, 48, 60
    definition, 48
    split-half, 48, 51
    standard error of measurement (SEM), 54
    test-retest, 48, 52
Renaissance learning progressions for reading, 27
Repeating a test, 9
ROC analysis, 82
Rudner’s index, 57
Rules for item retention, 38

S
Sample characteristics, norming, 97
SAT9, 74
Scale calibration, 31, 39
Scaled Score (SS), 40
Score definitions, 105
    conversion tables, 118
    grade placement, 116
    special scores, 116
    types of test scores, 105
Scores
    Estimated Oral Reading Fluency (Est. ORF), 78, 108
    Grade Equivalent (GE), 106, 109, 110
    Instructional Reading Level (IRL), 108, 109
    Lexile® Measures, 113
    Normal Curve Equivalent (NCE), 111
    Percentile Rank (PR), 110
    special IRL (Instructional Reading Level), 109
    Student Growth Percentile (SGP), 112
    test scores, 105
    Zone of Proximal Development (ZPD), 116