Dyslexia Treatment Efficacy Review
https://doi.org/10.3758/s13428-021-01549-x
Abstract
Poor response to treatment is a defining characteristic of reading disorder. In the present systematic review and meta-analysis, we found that the overall average effect size for treatment efficacy was modest, with a mean standardized difference of 0.38. Small true effects, combined with the difficulty of recruiting large samples, seriously challenge researchers planning to test treatment efficacy in dyslexia, and potentially in other learning disorders. Nonetheless, most published studies claim effectiveness, generally based on liberal use of multiple testing. This inflates the risk that most statistically significant results are associated with overestimated effect sizes. To enhance power, we propose the strategic use of repeated measurements with mixed-effects modelling. This novel approach would enable us to estimate both individual parameters and population-level effects more reliably. We suggest assessing a reading outcome not once, but three times at pre-treatment and three times at post-treatment. Such a design would require only modest additional effort compared to current practices. Based on this, we performed a priori design analyses via ad hoc simulation studies. Results showed that the novel design may allow one to reach adequate power even with low sample sizes of 30–40 participants (i.e., 15–20 participants per group) for a typical effect size of d = 0.38. Nonetheless, more conservative assumptions are warranted for various reasons, including a high risk of publication bias in the extant literature. Our considerations can be extended to intervention studies on other types of neurodevelopmental disorders.
but also inflates the risk of overestimating effect sizes. This risk can be defined a priori as the “exaggeration ratio” that indicates how much an effect size will be overestimated on average in comparison with a plausible true effect size given that statistical significance is reached (e.g., Gelman & Carlin, 2014; see also Altoè et al., 2020). Unfortunately, researchers who test treatment efficacy in learning disorders frequently encounter this problem. In this field, recruiting large samples is difficult for different reasons. First, children with learning disorders represent only a subset of the general population. While this subset is epidemiologically relevant, it is small in absolute terms, totalling no more than 5–10% of all children (cf. DSM-5; APA, 2013). Second, studies on treatment efficacy require considerable compliance from children and their families, and several hours of their time, making it even more difficult to perform treatment studies with large samples.
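In formal terms (our notation, following the Type M error logic of Gelman and Carlin, 2014): given a plausible true effect $d_{\mathrm{true}}$ and an estimate $\hat{d}$,

$$\text{exaggeration ratio} = \mathrm{E}\!\left(|\hat{d}| \;\middle|\; p < \alpha\right) / \, d_{\mathrm{true}},$$

that is, the average absolute estimate across the statistically significant results, relative to the true effect.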
Calculating power—and the exaggeration ratio—in this field may not be easy. It depends on several factors, including what inferential analysis is performed, the pretest–posttest reading score correlation, and how the effect size is calculated. Concerning the inferential analysis, the best choice is perhaps the use of linear models/ANCOVA on post-treatment scores, testing the effect of group and covarying the pre-treatment scores (e.g., Gelman, Hill, & Vehtari, 2020; Van Breukelen, 2006). Note that covarying pre-treatment scores here serves to increase power by controlling for the individual baselines, not to correct for initial group differences. Other methods, such as testing the group × time interaction or comparing gain scores, are also appropriate and lead to unbiased estimates, but they may have slightly less power (e.g., Dimitrov & Rumrill, 2003; Van Breukelen, 2006). The pretest–posttest score correlation is importantly related to the reliability of the outcome reading measure, which greatly affects power, as will be discussed in detail later.

The calculation of the effect size is not trivial, and different formulae have been proposed. Reading scores, the skill targeted by intervention in most reading treatments, are quantitative continuous measures (e.g., reading time/speed, error rates). Thus, treatment efficacy can be expressed as a standardized difference between pre- and post-intervention scores. Morris (2008) suggests calculating the pre-to-post change in the treated group minus the pre-to-post change in the control group, all divided by the pooled standard deviation calculated from pre-test scores.
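In symbols, with T and C denoting the treated and control groups, the effect size just described can be written as follows (our notation; Morris's original formulation also includes a small-sample bias correction factor, omitted here for simplicity):

$$d = \frac{(M_{\mathrm{post},T} - M_{\mathrm{pre},T}) - (M_{\mathrm{post},C} - M_{\mathrm{pre},C})}{SD_{\mathrm{pre}}}$$

where $SD_{\mathrm{pre}}$ is the standard deviation of the pre-test scores pooled across the two groups.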
Most published randomized controlled trials investigating treatment efficacy in dyslexia are seriously underpowered under plausible assumptions. Reviewing the 22 published randomized controlled trials in the meta-analysis by Galuschka et al. (2014), we found 32 and 20 as the mean and median number of participants per group, respectively. However, assuming a real effect size of d = 0.27 (i.e., the unweighted average effect size across all treatment approaches calculated from Galuschka et al., 2014), a pretest–posttest correlation between reading scores of r = .80 (which may reasonably reflect the reliability of a reading measure used within a special population, see below), setting a conventional critical α = .05, and running the ANCOVA as suggested in the previous paragraph, about 79 participants per group (i.e., a minimum sample size of 158) are needed to reach a statistical power of 80% and to limit the exaggeration ratio to about 1.10. This and the following calculations were obtained via simulation using the R software (R Core Team, 2020), with 10,000 iterations. Further details on our simulations will be provided below in Study 2. The R code has been made publicly available (see the Open Practice Statement section).

With only 20 participants per group (i.e., the median number of participants per group in the studies reviewed by Galuschka et al., 2014), the power is only 28%, and the exaggeration ratio is 1.83. This means that, under the assumptions outlined above, an effect size—calculated with the formula suggested by Morris (2008)—associated with statistical significance would be on average nearly twice as large as the true effect size. Using methods less powerful than linear models/ANCOVA with pre-treatment scores as a covariate may lead to even larger overestimations of effects. For example, simply comparing post-treatment scores using an ANOVA/t test (and ignoring pre-treatment scores) leads to much worse results, with only 13% power and an exaggeration ratio of 3.01. In brief, randomized controlled trials in this field are not only unlikely to reach statistical significance (even if the treatment is effective), but are also at risk of overestimating effect sizes.
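The figures just reported can be reproduced with a short simulation. The following R sketch is a minimal reconstruction of the logic described above, not the authors' published code (see their Open Practice Statement). It assumes d = 0.27, a pretest–posttest correlation of r = .80, α = .05, and ANCOVA on post-treatment scores:

```r
# A priori design analysis for the pretest-posttest-control design (a sketch).
# Assumptions taken from the text: true d = 0.27, pre-post correlation r = .80,
# ANCOVA on post-treatment scores covarying pre-treatment scores, alpha = .05.
set.seed(1)

simulate_once <- function(n_per_group, d = 0.27, rho = 0.80) {
  n     <- 2 * n_per_group
  group <- rep(c(0, 1), each = n_per_group)       # 0 = control, 1 = treated
  pre   <- rnorm(n)                               # standard normal pre-test scores
  post  <- rho * pre + sqrt(1 - rho^2) * rnorm(n) # post-test scores correlated r = rho
  post  <- post + d * group                       # add the effect for the treated only
  fit   <- lm(post ~ pre + group)                 # ANCOVA as a linear model
  p     <- summary(fit)$coefficients["group", "Pr(>|t|)"]
  # Morris (2008)-style effect size: difference in mean pre-to-post gains,
  # divided by the pre-test standard deviation (pooling simplified here)
  d_obs <- (mean(post[group == 1] - pre[group == 1]) -
            mean(post[group == 0] - pre[group == 0])) / sd(pre)
  c(p = p, d_obs = d_obs)
}

design_analysis <- function(n_per_group, n_iter = 10000, d = 0.27) {
  res <- replicate(n_iter, simulate_once(n_per_group, d))
  sig <- res["p", ] < .05
  c(power        = mean(sig),                        # proportion of significant results
    exaggeration = mean(abs(res["d_obs", sig])) / d) # mean |d| when significant / true d
}

design_analysis(20)  # approx. 28% power, exaggeration ratio approx. 1.8
design_analysis(79)  # approx. 80% power, exaggeration ratio approx. 1.1
```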
Despite their lack of statistical power, most published studies with dyslexic participants claim that their treatments are effective. We reviewed the 22 randomized controlled trials included in Galuschka et al.’s (2014) meta-analysis. In their titles or abstracts, 17 studies claimed improved reading following the remediation program, four studies stated that results were inconsistent or that improvements occurred in cognitive aspects related to but different from reading, and one study concluded that the results failed to demonstrate any treatment-related improvement. This “optimism” is particularly worrisome considering the low statistical power of most of these studies.

One of the reasons behind the above confidence may be the liberal use of uncorrected multiple testing. We reviewed the 22 studies included in the meta-analysis by Galuschka et al. (2014), and we found that they tested four different outcomes of reading on average (median = 3). To quantify treatment efficacy, Galuschka et al. (2014) appropriately combined all reading outcomes within the same “group comparison” in each study (e.g., in a study with a treated vs a control group, all reading outcomes were combined into a single effect). Conversely, virtually all studies analysed reading outcomes separately. In these studies, the presence of an isolated significant comparison in one reading outcome, at one post-treatment time point, was typically used to support a claim about the efficacy of a specific treatment, even in the presence of statistically non-significant findings regarding other outcomes. This does not necessarily mean that most results are false positives, but that most claims may be supported by overestimated effects.
Reliability is crucial, and repeated measurements can be the key to increase power

Collecting several reading outcomes raises the problem of correcting p values for multiple comparisons if the outcomes are analysed separately. However, combining effects from multiple measurements can be a key to obtaining more precise estimates, and thus to increasing power to some extent. In meta-analyses, effect sizes are combined across studies, providing more precise estimates than possible in single studies. Similarly, effects calculated from multiple outcomes could be combined even within a single study, providing more precise estimates than possible from a single measurement.

An additional way to increase precision is using highly reliable measures. The formulae by Morris (2008) show how a higher pretest–posttest correlation (ρ, a proxy of a measure’s stability/reliability) reduces the effect size variance. Unsurprisingly, a higher pretest–posttest correlation increases power for ANCOVA on post-treatment scores (covarying pre-treatment scores, as recommended by Van Breukelen, 2006, for pretest–posttest-controlled studies), or for any other analysis in which pre-treatment scores are included, such as testing the group × time interaction. This will be shown in Study 2 via simulation. In brief, a stronger pretest–posttest correlation means smaller measurement error, and thus more precise estimates of the effect and more powerful statistical tests.
The test–retest correlation is generally very high for reading measures, but how can it be determined precisely? One may consider the test–retest correlation calculated in the normative population if a standardized test battery is used. However, this correlation may differ when calculated in special populations such as dyslexics. Specifically, the test–retest correlation of reading scores in dyslexia may be smaller than that of the normative population because of the shrinkage of the reading score range (dyslexic participants perform at the lower tail of the distribution). That a correlation decreases as the range of one variable is reduced can easily be shown via simulation. For example, we can simulate two correlated sets of scores with r = .95 test–retest reliability from a hypothetical population. To simulate the selection of a dyslexic subgroup, we can then select the cases whose average scores are one standard deviation (SD) below the population mean. In this subgroup, the test–retest correlation drops to r = .77. Unsurprisingly, two recent randomized controlled trials, reviewed in Study 1, reported the test–retest correlation (Wang, 2017; Wang, Liu, & Xu, 2019), and specified that this measure was .94 according to the test battery, but only .81 and .78, respectively, when calculated in their own samples of children with dyslexia. In addition, Cirino et al. (2002) reported that the test–retest correlation among standard scores from major reading batteries ranged between .46 and .92, with a median of about .70, in a sample of 78 children with reading disability. This is below the reliability levels generally reported by standardized batteries (for example, 13 out of 40 studies that we reviewed in our meta-analysis in Study 1 reported test–retest correlations from the normative samples of the standardized batteries that they used, and the range of values was from .71 to .96).
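This shrinkage can be reproduced in a few lines of R (a sketch of the simulation just described; the exact value will vary slightly across runs):

```r
# Range restriction: selecting a low-performing subgroup shrinks test-retest r.
library(MASS)
set.seed(1)

# Test and retest scores with r = .95 in the full (hypothetical) population
scores <- mvrnorm(n = 1e6, mu = c(0, 0),
                  Sigma = matrix(c(1, .95, .95, 1), nrow = 2))

# "Dyslexic" subgroup: average of the two scores at least 1 SD below the mean
dys <- scores[rowMeans(scores) < -1, ]

cor(scores[, 1], scores[, 2])  # ~ .95 in the full population
cor(dys[, 1], dys[, 2])        # drops to ~ .77 in the selected subgroup
```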
Whatever the pretest–posttest correlation, statistical power can be enhanced by adding more information from repeated measurements. This could be done by assessing reading performance not once, but several times at pretest and several times at posttest. Dyslexia (and learning disorders in general) represents an ideal case because the ability of interest can be assessed using relatively simple tasks. In research, reading is generally assessed using word lists and non-word lists, measuring speed/time and/or accuracy. For such reading tasks, several parallel versions could be easily created ad hoc and administered. Parameters such as word frequency, length, and orthographic complexity are relatively easy to control, and should be equated across parallel versions. In addition, since reading tasks are similar to everyday life reading requests, it can be assumed that any practice effect induced by the repeated measurements remains negligible.

The idea investigated in this paper is to exploit the advantage of single-case experimental designs, where repeated measurements are used to estimate with precision the individual baseline (and change) of a measure (e.g., Krasny-Pacini & Evans, 2018), while at the same time keeping the focus on the population-level effect. In our Study 2, we hypothesized a scenario in which reading performance is repeatedly assessed at pre-treatment and at post-treatment for all children, showing how it leads to superior power as compared to the traditional design with any reading outcome measured once at pre-treatment and once at post-treatment.

Aims of the present investigation

The present article includes two studies. Study 1 is a systematic review and meta-analysis that updates and extends the investigation by Galuschka et al. (2014) to provide a picture of the latest developments in this field. Results by Galuschka et al. (2014) were formalized and used as a set of Bayesian informed priors in our own analysis. In recent years, with several new treatments for children with reading problems, the number of published studies has grown considerably. In addition, in recent years several journals have improved their statistical standards and practices to some extent (Giofrè, Cumming, Fresc, Boedker, & Tressoldi, 2017). Study 2 is based on the results of our meta-analysis and examines how statistical power can be improved when studying treatment efficacy in dyslexia. We provide examples of different a priori design analyses conducted via ad hoc simulation, showing how power varies with sample size, test–retest correlation, and the study design. We show the advantage of a design based on collecting and combining several outcome measurements at any single time point, at both pre-treatment and post-treatment, over the traditional study design collecting a single measure per outcome at each time point. (Note that according to our suggestion, multiple closely spaced successive measurements would need to be taken at each spaced-out follow-up measurement point.)
Study 1: A systematic review and meta-analysis of the recent findings

We conducted a systematic review and meta-analysis of studies assessing treatment efficacy published in the past eight years (January 2013 through June 2020). This time span was chosen to update the work of Galuschka et al. (2014), which included studies published until 2013. Our search and inclusion criteria were very similar to those of Galuschka et al. (2014). However, we identified more studies in our quantitative synthesis than did Galuschka et al. (2014), suggesting that there was a surge of interest in this field in the past few years. It should be noted that new treatment approaches emerged in these recent studies as compared to those reviewed by Galuschka et al. (2014), including methodologies inspired by new neuropsychological perspectives. Our primary aim was to provide an overview of the recently published literature with a focus on methodological and statistical practices.

Method

Literature search and inclusion criteria

Articles published from 2013 through June 2020 were reviewed. The searched databases were APA PsycInfo, Scopus, and PubMed, as they were expected to include virtually all relevant literature. No further search of the grey literature was conducted, as we crucially aimed to review the characteristics of the published literature. The search keys used by Galuschka et al. (2014) in their search seemed appropriate, thus we used the same: (“dyslexia” OR “developmental reading disorder” OR “developmental dyslexia” OR “developmental reading disability” OR “reading disorder” OR “word blindness” OR “spelling disorder” OR “developmental spelling disorder” OR “specific spelling disorder”) AND (“treatment” OR “therapy” OR “therapeutics” OR “training” OR “remediation”); that is, at least one term in the first bracket combined with at least one term in the second bracket. The above terms were searched in title, abstract, and keywords.

We considered any study reporting quantitative data concerning treatment efficacy on individuals with dyslexia/reading disorder. The following criteria had to be met for inclusion: (a) the treatment approach can be of any type, but it must aim to improve reading performance as its ultimate goal; (b) the manuscript must be written in English or any other language understood by one of the authors (including Spanish, French, Italian, Portuguese, and Hungarian); (c) participants must either be clinically diagnosed with developmental dyslexia (or reading disability or reading disorder) or have a profile compatible with a reading disorder as reported by the authors; in the latter case, participants must have reading performance either below the 25th percentile or one standard deviation below the population mean, as assessed using standardized tests in their mother tongue; (d) participants must be described as having normal intelligence or an IQ not below 70 (if reported); (e) any comorbidity or co-occurring condition is acceptable, but it must be compatible with dyslexia status (e.g., deafness, neurological conditions, intellectual disability in one or more participants, or low socioeconomic status as a predominant condition of the entire sample, are exclusion criteria); (f) the study must include at least one control group comprising individuals with dyslexia, who must be either untreated, on a waiting list, or an active control (e.g., a placebo condition; no comparisons between alternative/competing treatments were considered); (g) group allocation must be randomized; however, studies which did not explicitly mention whether the allocation was randomized were still included and labelled as “unclear”; the analyses were later performed both with and without these studies; (h) participants’ reading ability must be assessed at least twice, including before (pre-test score) and after treatment (post-test score).

The PRISMA flow diagram summarizing the literature search and the selection process is reported in Fig. 1. The full-text eligibility assessment was conducted by two independent reviewers. The inter-judge agreement was good, Cohen’s κ = .77. Disagreements were resolved via discussion with a third reviewer.
Coding of the studies

Two authors coded all studies and double-checked the entries. A different author further checked the final dataset. For each study, basic information including title, authors, and year of publication was coded. The dataset included as many rows as effect sizes. An effect size was defined as the standardized mean difference between reading scores in a treatment vs control group at the post-test (or follow-up), controlling for the pre-test scores. Effect sizes concerning follow-up assessments were coded if available but analysed separately. For most studies, more than one effect size could be calculated (e.g., because more than one reading outcome was used, or because there were multiple treated groups or control groups). Therefore, the dataset could include several observations per study.

For the calculation of the effect size, descriptive statistics of the reading scores (i.e., mean, standard deviation, and number of participants on which they were calculated) were coded for both the treated and control group, at both pre- and post-test (or follow-up). Descriptive statistics were coded from tables or text where possible, or derived and approximated from figures. Where no descriptive statistics were reported, we coded any alternative detail that allowed us to estimate the effect size and its variance (e.g., standardized model coefficients, effect sizes reported by the authors). If no such details were available, the authors were directly contacted.

In addition to the effect sizes, a series of sample and methodological details were coded. Sample details included the mean age of participants or age range, gender distribution, and mean IQ (where reported). Methodological details included the type of reading outcome (characters [for Chinese participants], words, pseudowords, text reading, or lexical decision), treatment approach, duration of the intervention in weeks, duration of each session in minutes, and total number of sessions. The “group comparison within study” was also coded. This is an identifier of the treated-vs-control group comparison. It served to distinguish among partially independent effects within the same study when more than one treated group or more than one control group was reported.

Treatment approaches were coded following the categories used by Galuschka et al. (2014), who followed the National Institute of Child Health and Human Development (2000) review, where possible. These include phonemic awareness instruction, phonics instruction, reading fluency training, reading comprehension training, auditory training, medical treatment, and coloured overlays. In our review, however, new approaches were introduced. These included: brain stimulation treatment (e.g., using transcranial direct current stimulation [tDCS] to stimulate reading-related brain areas); action video game trainings; visual/visual-attentional trainings with a neuropsychological approach; working memory training; modelling (a Bandura-inspired approach); the reading acceleration program (a training aiming to improve eye movements, which is particularly used in non-alphabetic languages); vergence training; multisensory stimulation approaches; and mixed approaches (i.e., treatments that combine elements from several other approaches).
Concerning statistical analyses, we coded how treatment efficacy on reading measures was tested, whether any correction for multiple comparisons was adopted, and whether power analysis was mentioned as the rationale for sample size. Two reviewers independently coded these aspects and subsequently resolved discrepancies through discussion.

Analytic strategy

The analytic strategy followed the recommendations by Borenstein, Hedges, Higgins, & Rothstein (2009). The R software (R Core Team, 2020) was used to perform all analyses. All plots were drawn with the “ggplot2” package (Wickham, 2016) of R. Meta-analytic estimates and meta-regressions were computed using random-effects models. A random-effects modelling approach was chosen because it allows us to better account for the expectably large heterogeneity in the effect size across studies (Borenstein et al., 2009). This approach assumes that the effect sizes are sampled from a normally distributed population of effect sizes, rather than all reflecting the same true effect size. To determine the heterogeneity across studies, we looked at the estimated standard deviation of the true effects across studies (known as τ; Borenstein et al., 2009).

Where the descriptive statistics were available, the effect size and its variance were calculated using the formula recommended by Morris (2008) for “pretest–posttest-control group” designs. This consists of the mean post-test vs pre-test gain in the treatment group minus the post-test vs pre-test gain in the control group, divided by the pooled pre-test standard deviation. Where the descriptive statistics were not available (5% of the effects in our dataset), we used the effect sizes as reported by the authors (provided that they represented the difference in pre-post gain in the treated group minus the control group), but the variance was still calculated using the Morris (2008) formula. As the pretest–posttest correlation (ρ) was never reported, we assumed it to be .80 for the calculation of the effect variance. Any alternative value between .50 and .90 affected the point estimates negligibly, but it obviously affected the estimated precision of the effects, and thus the heterogeneity (which was estimated to be higher for higher ρ). Effect sizes obtained from scores expressing performance negatively (e.g., reading times, errors) were sign-inverted for consistency.

Most studies reported more than one effect, and in many cases also more than one group comparison within study (i.e., a comparison between a treated-vs-control pair of groups). These dependencies imply that effect sizes within the same study and within the same comparison provide partially redundant information, which must be accounted for. Therefore, we adopted a multilevel modelling approach, as implemented in the “brms” package of R (Bürkner, 2017). In our case, the multilevel structure was: Study > Group comparison within study > Effect size. In addition, but only to present the forest and funnel plots of the effects, and for simplicity in assessing the publication bias (see below), we combined the effect sizes within the same group comparison using the formulas for non-independent outcomes suggested by Borenstein et al. (2009; pp. 227–228). To compute the variance of a combined effect, the between-effect correlation was assumed to be r = .70. A sensitivity analysis showed that any alternative correlation between .30 and .90 had negligible effects on the point estimates.

Concerning moderators, we tested the age of participants (categorized as children [mean age below 18 years] or adults [mean age above 18 years]) and treatment intensity (in terms of total number of sessions and duration of treatment in weeks). They were tested via meta-regressions. Treatment approach, on the contrary, was not tested as a moderator, because there were very few studies for each single approach.

Bayesian estimation and definition of prior knowledge

A Bayesian approach to data analysis was adopted. We chose it because it enabled us to formalize and include prior information from the meta-analysis by Galuschka et al. (2014). For a full account of the advantages of a Bayesian approach, see Kruschke and Liddell (2018) and McElreath (2016). All meta-analytic models were fitted using the “brms” package of R (Bürkner, 2017), which uses the Markov chain Monte Carlo (MCMC) Bayesian estimation method implemented in the STAN programming language (Stan Development Team, 2018). All models presented below were run with four MCMC chains, each with 5000 iterations, for a total of 10,000 post-burn-in effective iterations in each model. For any purpose of model comparison and statistical inference, the widely applicable information criterion (WAIC; Watanabe, 2010) was used (smaller values of WAIC indicate a better-fitting model; McElreath, 2016). In examining any model coefficient, the mean value of its posterior distribution was taken as the point estimate, while its 95% Bayesian credible interval (BCI) was computed with the percentile method.

We defined prior knowledge from Galuschka et al. (2014). Prior distributions were defined only for the analysis concerning the pretest–posttest comparison, for the following parameters: the overall mean effect size; the heterogeneity across studies and across group comparisons within study; and the estimated mean effect sizes of the specific treatment approaches that were considered both in our meta-analysis and in Galuschka et al. (2014). A “prior” indicates the probability distribution of an unknown parameter of interest (e.g., an effect size), before computing any analysis on the new data at hand.
To define the prior distributions for the overall mean effect size and its heterogeneity across studies, we reran a meta-analytic model including all 22 studies reviewed by Galuschka et al. (2014). For brevity, we have reported all details on the prior definition in the Supplemental material, Part 1. Here we report the prior distribution only for the mean effect size. It was set as a Student’s t distribution with three degrees of freedom (a standard of the “brms” package of R) with M = .30, SD = .06. The central prior value for the τ coefficient of heterogeneity was .08, both at the study level and at the “group comparison within study” level (see details in Supplemental material, Part 1). Uninformed default priors were used for the moderators in the meta-regressions.
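As an illustration, a multilevel meta-analytic model with priors of this kind could be specified in “brms” roughly as follows. This is a sketch under our reading of the text, not the authors' published code: the data frame `dat` (one row per effect size `d`, with standard error `se_d` and identifiers `study` and `comparison`) is hypothetical, and the scale of the heterogeneity prior is an assumption of ours, since only its centre (.08) is stated above:

```r
library(brms)

priors <- c(
  set_prior("student_t(3, 0.30, 0.06)", class = "Intercept"),  # mean effect size prior
  set_prior("student_t(3, 0.08, 0.05)", class = "sd")          # heterogeneity (tau); the
)                                                              # scale 0.05 is illustrative

fit <- brm(
  d | se(se_d) ~ 1 + (1 | study / comparison),  # Study > comparison within study > effect
  data   = dat,
  prior  = priors,
  chains = 4, iter = 5000                       # 4 chains x 2500 post-warmup = 10,000 draws
)

summary(fit)  # posterior mean and 95% credible interval of the average effect
```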
Assessment of publication bias

We used the precision-effect test and precision-effect estimate with standard errors (PET-PEESE), because it represents a less bad option among other conventional meta-analytic approaches (Stanley, 2017). However, assessing the publication bias with a limited number of studies, nearly all of them presenting small samples, and with predictably high heterogeneity, is difficult, and any result must be taken with caution.

The PET-PEESE method consists of two conditional meta-regressions in which the standard error (PET) or the variance (PEESE) is entered as a moderator of the effect size. The regression coefficient is interpreted as evidence of the bias, whereas the model intercept can be interpreted as the bias-adjusted effect size estimate. The PET method is used first. It assumes a constant publication bias at all levels of precision, which is correct if the true effect is null. If the estimated effect size remains significant, however, the PEESE method is recommended, which assumes a larger publication bias for less precise studies by using a quadratic model (Stanley, 2017).
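In regression form (our notation), with $d_i$ the observed effect sizes and $SE_i$ their standard errors:

$$\text{PET:}\quad d_i = \beta_0 + \beta_1\, SE_i + \varepsilon_i \qquad\qquad \text{PEESE:}\quad d_i = \beta_0 + \beta_1\, SE_i^2 + \varepsilon_i$$

In both models, $\beta_1$ captures the association between effect size and imprecision that publication bias induces, while the intercept $\beta_0$ estimates the effect size of a hypothetical, perfectly precise study, and thus serves as the bias-adjusted estimate.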
To avoid the further complication of multilevel modelling, the PET-PEESE meta-regressions were applied to the effect sizes combined by group comparison within study, and these were treated as independent. The formula for non-independent outcomes suggested by Borenstein et al. (2009) was used to combine the effect sizes (see above in the Analytic strategy section). Therefore, the PET-PEESE method was applied to the data shown in the forest (Fig. 2) and funnel (Fig. 3) plots. Furthermore, uninformed default priors were used for all model parameters when assessing the publication bias.

Results

Overview and characteristics of the studies

A total of 40 studies met the criteria for being included in the quantitative analysis. Assignment of participants was explicitly randomized in 35 studies, and unclear in the remaining five studies. Thirty-one studies included one group comparison, and nine included two group comparisons. The latter subset included six studies presenting two treated groups compared against the same control group, and three studies presenting two treated groups each compared against a different, matched control group. All studies presented a pretest–posttest comparison; seven studies also reported a follow-up. Concerning the pretest–posttest comparisons, a total of 190 effect sizes were estimated. This meant an average of 4.8 outcomes per study and an average of 3.9 outcomes per group comparison within study.

An estimated total of 1862 participants were involved across the 40 studies, including 1103 treated and 759 control participants. The total number of groups was 90, including 49 treated and 43 control groups. The median and mean number of participants per group were 15 and 20.2 (treatment group: median = 15, mean = 22.5; control group: median = 15, mean = 17.7).

The estimated grand mean of the age of participants was 11.5 years (the range of estimated mean ages of the samples was 7.7–25.9 years). Thirty-six studies had participants in the developmental age range (mean age between 7 and 14 years), and four studies had young adult participants (mean age between 22 and 26 years).

The median treatment session duration was 35 minutes, ranging between 15 and 450 minutes (including two studies with self-paced sessions). The median number of sessions per treatment was 18, ranging between 2 and 225 sessions. The median treatment duration was 5 weeks, ranging between 1 and 30 weeks.

Meta-analytic estimates

The following analyses refer only to the pretest–posttest comparisons, except where indicated otherwise.

The overall meta-analytic estimate of the effect size, computed with multilevel modelling on 40 studies and a total of 49 group comparisons within study, was a medium standardized difference of d = 0.38 [95% BCI: 0.31, 0.46]. The estimated heterogeneity was substantial: across studies, τ = 0.12 [0.02, 0.24]; across group comparisons within study, τ = 0.17 [0.06, 0.27]. This means that, while the average effect size is estimated as 0.38, the true effect sizes across studies are estimated to range mostly between 0.14 and 0.62. The mean meta-analytic estimate obtained after excluding the five studies for which randomization was unclear remained virtually the same, d = 0.35 [0.28, 0.42].

The descriptive forest plot is shown in Fig. 2. A funnel plot is shown in Fig. 3.

Fig. 2 Descriptive forest plot of the effect sizes combined by group comparison within study. Note. Effect sizes within the same group comparison were combined assuming a between-effect correlation of r = .70 (alternative r values do not affect point estimates but affect their CIs)

Fig. 3 Funnel plot of the effect sizes combined by comparison within study

The following treatment approaches were used in at least three different studies: phonemic awareness instruction, phonics instruction, mixed, brain stimulation, visual-attentional/neuropsychological, action video games, reading acceleration program, and working memory. Meta-analytic estimates calculated separately by these treatment approaches can be found in Supplemental material, Part 2. These estimates vary between .20 and .61; however, given the large heterogeneity of the effects, the very limited number of studies for each approach, and the resulting large BCIs, such estimates must be taken with caution. Each of the following remaining approaches was used in fewer than three studies: medical treatment, modelling, music training, reading fluency training, vergence.

There was no evidence in favour of the age class of the sample moderating the treatment efficacy, as shown in a multilevel meta-regression, ΔWAIC = +0.6. The estimated treatment efficacies confirmed that the effect was negligible: for children, d = 0.37 [0.30, 0.46]; for adults, d = 0.40 [0.13, 0.68]. There was no evidence in favour of a role of the number of sessions, ΔWAIC = +0.8, |B| < 0.001, or a role of the duration of treatment in weeks, ΔWAIC = −0.2, B = −0.004, as moderators of the treatment efficacy.
Finally, we analysed the pre-test vs follow-up comparisons, which could be estimated from seven studies. Since all but one of these studies included only one group comparison, we entered only studies as the random effect. Uninformed default priors were used for all model parameters in this case. The meta-analytic estimate was d = 0.38 [0.23, 0.61]. Curiously, this is the same estimate as for the pretest–posttest comparison (but with larger uncertainty). The heterogeneity across studies was again large, τ = 0.14 [0.00, 0.46].

corrected for all comparisons (i.e., also for testing of multiple ANOVAs). Finally, out of the 33 studies that tested multiple reading outcomes, only two used multivariate ANOVA (MANOVA) to handle such multiplicity.

Fig. 4 Illustration of the data that may emerge a from a traditional design measuring the individual reading performance only once at pre-test and at post-test and b from a design with repeated measurements and estimation with uncertainty of the individual parameters (error bars)

that reading performance would be assessed more than once at pre-treatment and at post-treatment.

Method

All analyses described below were performed using R software and STAN, and the code is publicly available online.

Conducting design analysis via simulation

For simplicity, in our examples we assumed that reading scores would always be normally distributed (when measured both across different participants and within the same participant). Design analysis for a simple 2 (Group: treated vs control) × 2 (Time: pretest vs posttest) design, with one reading outcome measured once per time point, requires assuming the two most important parameters: the effect size and the pretest–posttest correlation. The pretest–posttest correlation was always set by generating paired arrays of correlated normally distributed scores for the pretest and posttest observations with the desired Pearson’s r. This correlation could be simulated in alternative ways, for example by determining the within- vs between-participant error variances. Nonetheless, we opted to directly generate correlated scores because a pretest–posttest correlation parameter is easier to formalize and to compare with values from the existing literature (e.g., based on the reliability scores of the test batteries). As the scores are sampled from a standard normal distribution, the standardized effect size was simply added to the post-treatment scores of the treated (and not of the control) group. However, one could easily simulate data on their real scale by linearly transforming all scores, or even generate non-normally distributed scores (e.g., correlated non-normally distributed scores can be simulated using the “semTools” package of R; Jorgensen, Pornprasertmanit, Schoemann, & Rosseel, 2020). Participants were assigned randomly to the treated or control group in all subsequent examples.
Response to treatment can be assumed to vary across individuals. To model this, the effect size can be sampled for each treated participant instead of being fixed. Assuming a normal distribution of effect sizes, this can be centred on a meta-analytic estimate (e.g., around 0.38 in our case), and have a plausible standard deviation (SD) that indicates how much the response to treatment may vary across participants. For example, sampling from N(0.38, 0.20) means that the large majority (about 95%) of treated participants would benefit from an effect that varies between about .00 and nearly .80 across individuals. This seemed plausible, so we used this distribution in all examples below. Nonetheless, we found that even an effect fixed to .38 for all treated participants led to virtually the same power levels. Simulating a treatment efficacy that varies across participants may be interesting if one plans to investigate individual differences in response to treatment, however, as we will comment in the Discussion section.

Finally, an additional term could be added to the post-treatment scores of all participants to simulate a practice effect. Unless the practice effect is assumed to vary across participants, however, it will be practically negligible for both statistical power and the effect size calculation. In addition, the practice effect in everyday-like reading tasks (e.g., reading a list of words) is likely negligible in children with dyslexia. Therefore, we did not consider it in the following examples.

Once a simulated dataset has been generated, the selected statistical analysis must be performed. In our case, we performed ANCOVA/linear models to test the effect of group on post-treatment scores, covarying pre-treatment scores (as suggested by Van Breukelen, 2006, but see also Gelman et al., 2020). However, one may use any alternative statistical method of choice, for example testing the group × time interaction, or using Bayesian estimation methods and inferential criteria. At this point, p value corrections should be applied to the simulated results if multiple reading outcomes are collected (in our simulated examples below, however, we assumed examining only one outcome). Concerning the estimation of the effect size, we used the formula suggested by Morris (2008).

To conduct the design analysis, the entire process described so far must be repeated over several iterations (we opted for 10,000 iterations) for each of a series of alternative sample sizes. At each iteration, both the inferential criterion (e.g., the p value from the ANCOVA) and the observed effect size (i.e., the effect size calculated on the simulated data) must be recorded. The entire process ends when the sample size leads to the desired level of power (e.g., more than 80% of iterations ending with p < .05 or with a Bayes factor > 3) and/or to a desired level of the exaggeration ratio (e.g., the observed effect size associated with statistical significance being, on average, no more than 10–15% larger than the true effect size).
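As an illustration of this iterative scheme, the sketch below (ours, not the authors' published code) scans a grid of candidate group sizes under the single-outcome design, with d = 0.38 and r = .80; a smaller number of iterations than the 10,000 used in the text is set here for speed:

```r
# Scanning candidate sample sizes until power and exaggeration ratio are adequate.
set.seed(1)

power_and_er <- function(n_per_group, d = 0.38, rho = 0.80, n_iter = 2000) {
  res <- replicate(n_iter, {
    g    <- rep(0:1, each = n_per_group)
    pre  <- rnorm(2 * n_per_group)
    post <- rho * pre + sqrt(1 - rho^2) * rnorm(2 * n_per_group) + d * g
    fit  <- lm(post ~ pre + g)                      # ANCOVA on post-treatment scores
    d_o  <- (mean(post[g == 1] - pre[g == 1]) -
             mean(post[g == 0] - pre[g == 0])) / sd(pre)
    c(summary(fit)$coefficients["g", "Pr(>|t|)"], d_o)
  })
  sig <- res[1, ] < .05
  c(power = mean(sig), exaggeration = mean(abs(res[2, sig])) / d)
}

ns  <- seq(10, 80, by = 10)        # candidate numbers of participants per group
out <- sapply(ns, power_and_er)
colnames(out) <- ns
round(out, 2)  # under these assumptions, power exceeds .80 at roughly 40 per group
```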
Use of repeated measurements

For the example concerning repeated measurements, we assumed that a reading outcome would be collected not once, but three times at pre-treatment and three times at post-treatment, which seems feasible in an experimental setting. Due to the structure of the data, ANCOVA/linear models covarying pre-treatment scores can no longer be performed in this case, unless scores are averaged by participant and by time (which we discourage for reasons explained below in the Discussion). Therefore, we assessed treatment efficacy by testing the group × time interaction, which is still an appropriate choice (e.g., Dimitrov & Rumrill, 2003). Since repeated measurements were examined, we used mixed-effects models, fitted using the “lme4” package in R (Bates, Maechler, Bolker, & Walker, 2015). We entered group (treatment vs control) and time (pretest vs posttest) as the fixed effects of interest, participants as random effects (with random intercepts), and reading scores (without averaging) as the response variable. Random slopes were not fitted in this case, because the limited number of repeated measurements at each time point meant that they could not be estimated accurately. When we fitted them in a separate simulation, however, the power was not affected to any visible extent. Nonetheless, fitting random slopes is recommended for larger numbers of repeated measurements (e.g., five or more).
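A sketch of this analysis on simulated data is shown below. All variable names and parameter values are ours for illustration; the correlation of about .80 between repeated measurements arises from the ratio of the stable individual variance to the total variance:

```r
library(lme4)
set.seed(1)

# Hypothetical long-format dataset: 20 participants per group, three
# measurements at pre-treatment and three at post-treatment (6 rows each).
n <- 40; d <- 0.38
id      <- factor(rep(1:n, each = 6))
group   <- rep(c("control", "treated"), each = (n / 2) * 6)
time    <- rep(rep(c("pre", "post"), each = 3), times = n)
ability <- rep(rnorm(n, sd = 0.9), each = 6)              # stable individual level
score   <- ability + rnorm(n * 6, sd = sqrt(0.19)) +      # error: r between measures ~ .81
           d * (group == "treated") * (time == "post")    # treatment effect
simdat  <- data.frame(id, group, time, score)

# Mixed-effects model: random intercepts per participant; the group x time
# interaction is the test of treatment efficacy.
fit <- lmer(score ~ group * time + (1 | id), data = simdat)
summary(fit)
# lme4 reports no p values; one may use a likelihood-ratio test against the
# model without the interaction, or the lmerTest package, as the criterion.
```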
The calculation of the empirical effect size in this case is not trivial. For simplicity, we applied the formula suggested by Morris (2008) despite multiple observations being collected for each participant at each time point. Alternatively, one could calculate the effect size on the data averaged by participant and by time. In the latter case, however, the observed effect size would be somewhat inflated, because averaging reduces the measurement error, thus decreasing the standard deviation.

Using Bayes factor instead of p value

None of the 40 studies that we reviewed in our meta-analysis used Bayesian methods. However, such methods are becoming increasingly popular in the social sciences. A fully Bayesian approach should encompass the definition of informed priors, the consideration of posterior distributions, and the discussion of the phenomenon at hand in probabilistic terms and in light of the prior expectations (e.g., Kruschke & Liddell, 2018; McElreath, 2016). Covering the complexity and the advantages of this approach, however, goes beyond the scope of the present article. Rather, we aimed to show how the design analysis for treatment efficacy in dyslexia is affected by using a more simplified inferential procedure based on the Bayes factor (BF). Specifically, we used the popular “BayesFactor” package in R (Morey & Rouder, 2018) to fit Bayesian linear models and calculate the BF, with default settings.

Using the BF as the inferential criterion does not affect the procedure for the simulation of the design analysis as described above, but it opens new inferential scenarios. Specifically, defining H0 as the null hypothesis (e.g., the group × time interaction is null, or the effect of group on post-treatment scores is null) and H1 as the hypothesis that the treatment efficacy is non-zero, the BF can either support H1, support H0, or remain indecisive. Popular interpretive thresholds for the BF are the following: BF > 3 supports H1 with moderate (or stronger) evidence; BF < 1/3 supports H0 with moderate (or stronger) evidence; a BF between 1/3 and 3 leaves one with inconclusive results or just anecdotal evidence (e.g., Raftery, 1995; Schönbrodt & Wagenmakers, 2018).
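For the classical design, the BF for the group effect can be computed by comparing a full model against a covariate-only model (a sketch with the package defaults; variable names and values are ours):

```r
library(BayesFactor)
set.seed(1)

# Simulated classical design: 20 participants per group, d = 0.38, r = .80
n     <- 40
group <- factor(rep(c("control", "treated"), each = n / 2))
pre   <- rnorm(n)
post  <- 0.8 * pre + sqrt(1 - 0.64) * rnorm(n) + 0.38 * (group == "treated")
dat   <- data.frame(group, pre, post)

# BF for the effect of group on post-treatment scores, covarying pre-treatment
bf_full <- lmBF(post ~ group + pre, data = dat)  # H1: group effect plus covariate
bf_null <- lmBF(post ~ pre, data = dat)          # H0: covariate only
bf_full / bf_null  # BF > 3: support for H1; BF < 1/3: support for H0
```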
Results

In a first analysis, we systematically examined how power and the exaggeration ratio vary with the pretest–posttest correlation (from .60 to .90) at different sample sizes (i.e., numbers of participants per group), using the traditional pretest–posttest-controlled design. We assumed a true effect of d = 0.38, which is in line with our meta-analytic results, and we used ANCOVA covarying pre-treatment scores, with a critical α = .05, for statistical inference. The results are shown in Fig. 5. As can be seen, a higher pretest–posttest correlation of scores crucially enhances power.

Fig. 5 Design analysis showing a power and b exaggeration ratio for different numerosity of groups and pretest–posttest correlations, for a treatment with a standardized effect size of d = 0.38, using a traditional experimental design (i.e., reading is measured once at pre-treatment and once at post-treatment for each participant). Note. The horizontal dashed lines represent in panel (a) the acceptable level of power = .80, and in panel (b) the perfect equivalence between estimated and true effect size (exaggeration ratio = 1.0) and the acceptable level of exaggeration ratio = 1.1

In a second analysis, we focused on the use of repeated measurements, with an outcome collected three times at each time point. Again, we assumed an effect size of d = 0.38, and we varied the correlation between repeated measurements from .60 to .90. The latter parameter now describes the correlation between any pair of reading measures collected on the same participant (regardless of whether they were collected at pretest or posttest). Figure 6 shows the results. As can be seen, power is clearly enhanced compared to the previous example. Specifically, for r = .80, Fig. 5 suggested that nearly 40 participants per group were needed for the power to exceed 80% (i.e., when only two groups are being compared, the total sample size needed is about 80), whereas Fig. 6 suggests that the same level of power could be reached with only about 15–20 participants per group using repeated measurements and mixed-effects models under the proposed scenario (i.e., with two groups, the sample size needed is 30–40 participants).

Fig. 6 Design analysis showing a power and b exaggeration ratio for different numerosity of groups and different correlations between repeated measurements, for a treatment with a standardized effect size of d = 0.38, using a repeated-measurement experimental design with three distinct measures of reading at pre-treatment and three measures at post-treatment for each participant. Note. The horizontal dashed lines represent in panel (a) the acceptable level of power = .80, and in panel (b) the perfect equivalence between estimated and true effect size (exaggeration ratio = 1.0) and the acceptable level of exaggeration ratio = 1.1
In a third analysis, we repeated the design analysis with both the traditional design and the suggested repeated measurements approach, but now using the BF instead of the p value as the inferential criterion, and fitting models with the “BayesFactor” package in R. Once again, we set the effect size to d = 0.38, and a repeated measures correlation of r = .80. As its default setting, a Cauchy distribution with scale = 0.50 was used as the prior for the standardized fixed effects (meaning that the prior was centred on zero, with half of its distribution beyond d < −0.50 or d > 0.50). Since it was unclear how mixed-effects linear models (and specifically the random effects) would be estimated using the “BayesFactor” package, we averaged scores by participant and by time in the case of repeated measurements, and we always used linear models on post-treatment scores, covarying pre-treatment scores (using frequentist methods as in the first two examples, we found that this alternative affected power negligibly).

Fig. 7 Design analysis for testing treatment efficacy using the Bayes factor (BF). True effect size is set as d = 0.38; correlation between repeated measurements is set as r = .80. Panel A refers to the classical design with a single measurement at pretest and posttest; panel B refers to an alternative design with three measurements of reading at pre-treatment and three measurements of reading at post-treatment for each participant

Figure 7 shows the results of the design analysis using the BF for the traditional design (panel A) and for the repeated measures design that we proposed, with three reading measurements at pre-treatment and at post-treatment (panel B). Power was defined as the probability of supporting H1 with BF > 3.

Unsurprisingly, Fig. 7 shows that the repeated measurement design was more powerful than the traditional design. Using the BF, however, did not increase power as compared to using frequentist methods (Figs. 5 and 6). This simply suggests that BF > 3 is, roughly speaking, a stricter criterion than p < .05. An interpretive advantage of using the BF may be that, when H1 fails to be supported, it is possible to distinguish explicitly the case in which H1 can be rejected from the case in which the results remain indecisive. In Fig. 7, the “risk” of wrongly supporting H0 when the treatment is actually effective was around 5% for sample sizes up to 100 (i.e., up to 50 participants per group), but only in the classical design (panel A). In all other cases, regardless of the sample size, the predominant interpretive risk is that of remaining indecisive (BF between 1/3 and 3).

Discussion

The first aim of this paper was to estimate the average effect size of treatments for dyslexia by performing an updated meta-analysis. The second aim was to provide recommendations on how to increase power when testing treatment efficacy in dyslexia, highlighting the importance of conducting a priori design analyses.
Concerning the first aim, the overall meta-analytic estimate for the pretest–posttest effect was d = 0.38, with a narrow 95% CI, which is encouraging. For the pretest–follow-up effect, the point estimate was the same. This is a larger effect size than the meta-analytic estimates of around 0.20–0.30 found by Galuschka et al. (2014), but it still qualifies as a modest effect.

A notable difference between our results and those of the previous meta-analysis is the estimated precision of the effects. Our forest plot (Fig. 2) has error bars nearly 50% shorter, on average, than those presented by Galuschka et al. (2014), despite even smaller average sample sizes in our case. This is due to our calculation of the effect sizes using the formula proposed by Morris (2008), which incorporates the (assumed) pretest–posttest correlation in the estimated precision. As Galuschka et al. (2014) did not incorporate such information, their CIs are equivalent to assuming zero correlation between pretest and posttest scores, which may not be a realistic assumption. Although this is unlikely to affect the meta-analytic point estimates of the effect size to a large degree, it clearly affects the estimated standard errors, and therefore the confidence intervals, significance levels, and the estimated heterogeneity of the effect size across studies.

The 40 studies that we reviewed had a median participant number of 15 per group, corresponding to a median sample size of 30 for any single treatment-vs-control group comparison. This number is even smaller than the median across the 22 studies reviewed previously by Galuschka et al. (2014). It is in line with studies generally published in cognitive psychology, and only slightly larger than the average sample size in neuroimaging studies (Szucs & Ioannidis, 2017, 2020). Nonetheless, we showed that with this median sample size, most studies in this field are seriously underpowered under any plausible assumptions. Furthermore, the PET-PEESE meta-regression method suggested that publication bias was likely. However, the relatively limited number of studies, the fact that nearly all of them had small sample sizes, and the substantial heterogeneity meant that the latter analysis may be unreliable (much like any other conventional approach of this kind; Stanley, 2017). The bias-adjusted estimate was deflated by nearly 50% (PET method) or less than 10% (PEESE method) vis-à-vis the non-adjusted estimate.

Despite their average statistical power being low, most studies that we reviewed in the present meta-analysis claimed that the proposed treatment was effective. Specifically, when we examined all titles and abstracts, we found that 33 out of 40 studies (83%) claimed that the treatment proposed was an effective remedy for participants with reading impairments. This may be because, as shown by our systematic review, most studies tested several outcomes without correcting p values for all multiple comparisons. This practice may result in many false positive findings among the statistically significant outcomes.

From a theoretical point of view, our review revealed that a variety of new treatment approaches have been used in recent years. A prominent one was the neuropsychologically inspired approach, including visual/visual-attentional trainings and action video games. Specifically, some studies focused on increasing visual-attentional skills that allegedly underpin reading ability, such as skills linked with the magnocellular pathway (e.g., Stein, 2018). Randomized controlled trials focusing on direct brain stimulation also emerged. Our results, however, suggested that these novel approaches were not more effective than the traditional ones on average (see Supplemental material, Part 2).

Concerning the second aim, we showed that under plausible assumptions, the sample size required to obtain a power of 80% is about double the median sample size in published randomized controlled trials. Nonetheless, it could be considerably reduced through a strategic use of repeated measurements. By plausible assumptions we meant a standardized effect size of d = 0.38, which corresponds to the overall mean estimate that resulted from our meta-analysis, and a pretest–posttest correlation of around r = .80. In fact, a more conservative effect size assumption (i.e., between 0.20 and 0.30) should be preferred by those who plan to assess treatment efficacy, both for caution if any novel approach is examined, and because our meta-analysis suggested the presence of a publication bias in the previous literature. Furthermore, in all our examples we assumed that only one reading outcome would be tested, thus without the need to correct the p value for multiple tests. Note that any p value correction, by requiring stricter inferential criteria, would further reduce power and increase the exaggeration ratio.

Given the importance of the test–retest/pretest–posttest correlation parameters, they should always be considered when testing treatment efficacy. Concerning the test–retest correlation, the actual reliability of standardized reading measures may be higher than .80 in normative populations. However, as explained in the Introduction, there are reasons to think that this value may not be as high as .80 in children with dyslexia. Calculating such a parameter from one’s own experimental sample is an option, but since sample sizes are generally small, any estimate will probably be imprecise. It is worth noting that the classical test–retest correlation may reflect not only the stability in measuring the underlying construct of interest (i.e., reading ability), but also task-specific features. Considering the correlation among different parallel versions of a task is better than considering the test–retest correlation for the same version. An even better approach could be measuring the latent reading ability factor using a variety of different tasks. In this case, structural equation modelling, although typically requiring larger samples, may be the appropriate analytical approach. This is an avenue for future research.
while keeping the focus on the population-level effect. It is well known that increasing the number of observations may increase both statistical power and precision, even with the same number of participants (e.g., Maxwell, Delaney, & Kelley, 2018). Nonetheless, none of the 40 studies that we reviewed here adopted the design that we proposed, and only one (Wolff, 2014) did something similar, by combining different reading outcomes in a latent factor using structural equation modelling (which typically requires large sample sizes, however).

We suggest that repeated measurements should be collected using different versions of the same task. In fact, while practice effects for everyday-like reading requests, such as those posed by classical reading tasks (e.g., reading word lists), are likely negligible for children with dyslexia, we cannot exclude that practice may become an issue when repeatedly presenting the exact same stimuli. Luckily, creating different versions of reading tasks is relatively easy, because the material generally consists of simple verbal stimuli (controlled for a few important parameters such as word frequency, length, and orthographic complexity).

Concerning data analysis, we suggest using mixed-effects models, with participants as random effects, to exploit the information available with the recommended repeated measures design. The population-level effect (i.e., the average pretest–posttest gain) can be examined by considering the fixed-effects part of the model. However, the random-effects part can be of interest as well. Examining random slopes (which we did not do here, for simplicity) represents an ideal approach for precisely estimating how individuals may respond differently to treatment. In fact, previous studies have suggested that response to treatment in dyslexia likely reflects individual differences (e.g., Aravena, Tijms, Snellings, & van der Molen, 2016; Zhao et al., 2019). Accurately investigating the variability in the random slopes, however, may require many repeated measurements of a reading outcome per time point, which is something that can be explored in a simulated ad hoc design analysis. An easier alternative to mixed-effects models could be fitting linear models on the scores averaged by participant and by time. However, this latter option loses the information on intra-individual variability.
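For instance, with the data in long format, such a model can be fitted with lme4 (Bates et al., 2015). The sketch below uses illustrative variable names and coding, not the exact specification of our scripts:

    # Minimal sketch: d has one row per participant x measurement, with
    # time coded 0 = pretest, 1 = posttest, and group 0 = control, 1 = treated
    library(lme4)

    # Random intercepts per participant; the time:group fixed effect
    # estimates the population-level treatment effect (differential gain)
    m1 <- lmer(score ~ time * group + (1 | id), data = d)

    # Adding random slopes for time also models individual differences
    # in the pretest-posttest gain
    m2 <- lmer(score ~ time * group + (time | id), data = d)
    summary(m2)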
Using repeated measurements opens further questions and areas of investigation. It raises the issue of how reading performance varies intra-individually. How do reading measurements vary over time in the short and medium term? For example, do circadian oscillations affect reading performance, as they have been shown to affect other cognitive abilities (e.g., Hahn et al., 2012), especially in children with dyslexia? The ideal temporal spacing of repeated measurements within the same time point is a matter for future research.

Concerning the inferential criteria, we briefly examined how the use of a Bayesian criterion to quantify evidence may affect the design analysis. Using the BF with default parameters did not help increase power as compared to traditional frequentist methods. It may have the interpretive advantage of distinguishing between an uncertain outcome and a case in which H0 is supported by the BF. If the study is adequately powered for an effect size of interest, however, failing to reject H0 should imply rejecting H1 even under a frequentist approach. Furthermore, a proper design analysis with the BF should also be conducted under a null hypothesis scenario (e.g., Schönbrodt & Wagenmakers, 2018), to check whether H0 would be consistently supported, should it be true, under the chosen assumptions. In any case, if Bayesian inference is used, we recommend adopting a fully Bayesian approach, including an explicit formalization of the priors and the consideration of the posterior probability of the effect size (e.g., McElreath, 2016), rather than using a simplified criterion such as the BF calculated with the default parameters set by the software.
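Such a fully Bayesian version of the model above can be written, for example, in brms (Bürkner, 2017), with the priors made explicit instead of being left at software defaults. The prior values here are placeholders that a real design analysis would need to justify:

    # Minimal sketch: the same mixed-effects model, fully Bayesian, with
    # explicit (illustrative) priors rather than package defaults
    library(brms)

    priors <- c(
      prior(normal(0, 1), class = b),    # fixed effects, incl. time:group
      prior(exponential(1), class = sd)  # random-effect standard deviations
    )

    mb <- brm(score ~ time * group + (time | id),
              data = d, prior = priors, cores = 4)

    # Inference can then rest on the posterior of the effect itself
    fixef(mb)["time:group", ]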
From a methodological point of view, we recommend adopting a simulation approach to design analysis. Although there are tools for the analytical calculation of power for a variety of statistical methods, simulation allows maximum flexibility. This flexibility is crucial because, as discussed, both power and the exaggeration ratio depend on several aspects, which may be difficult to control analytically. Via simulation, one can perform design analysis under several alternative ad hoc scenarios, for example assuming individual heterogeneity in response to treatment, varying the correlations among the outcome variables (or among the predictors, in non-experimental settings), adopting alternative criteria for inference (e.g., p value vs Bayes factor), and considering not only power, but also other parameters such as the risk of overestimation (exaggeration ratio; Altoè et al., 2020; Gelman & Carlin, 2014) or the risk of supporting the wrong hypothesis (especially if Bayesian inference is used).
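To make this concrete, the sketch below runs a stripped-down version of such a design analysis for the recommended design (two groups, three pretest and three posttest measurements, a plausible true effect of d = 0.38). It is a simplified stand-in for our Study 2 scripts available via the Open Practices Statement, with illustrative parameter values:

    # Minimal a priori design analysis via simulation: power and
    # exaggeration ratio for a mixed-effects analysis of a 3 pre + 3 post
    # measurement design; all parameter values are illustrative
    library(lme4)
    library(lmerTest)  # provides p values for the fixed effects

    set.seed(1)
    n_per_group <- 20    # participants per group
    k           <- 3     # measurements per time point
    d_true      <- 0.38  # plausible true standardized effect
    rho         <- 0.80  # pre-post correlation of true reading ability
    n_iter      <- 1000  # use more (e.g., 10,000) for stable estimates

    res <- replicate(n_iter, {
      n     <- 2 * n_per_group
      id    <- factor(rep(1:n, each = 2 * k))
      group <- rep(c(0, 1), each = n_per_group * 2 * k)
      time  <- rep(rep(c(0, 1), each = k), times = n)
      # Stable individual ability plus measurement noise; main effects of
      # time and group are set to zero for simplicity
      ability <- rep(rnorm(n, 0, sqrt(rho)), each = 2 * k)
      score   <- ability + d_true * group * time +
                 rnorm(n * 2 * k, 0, sqrt(1 - rho))
      fit <- lmer(score ~ time * group + (1 | id),
                  data = data.frame(score, time, group, id))
      p <- coef(summary(fit))["time:group", "Pr(>|t|)"]
      b <- fixef(fit)["time:group"]
      c(sig = unname(p < .05), est = unname(b))
    })

    mean(res["sig", ])                            # statistical power
    mean(abs(res["est", res["sig", ] == 1])) / d_true  # exaggeration ratio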
Finally, although we stressed the importance of a priori design analysis, we would like to warn readers about its limitations. Specifically, our simulation analysis in Study 2 suggested that, under ideal conditions (i.e., enough repeated measurements of the outcome at each time point, combined with high measure reliability), high power may be reached even with very small samples. This may not always be the case, however. Data collected on small samples could still be unreliable, for unforeseen reasons. First, as mentioned above, the time spacing between repeated measurements matters. Repetitions too close in time may lead to an extremely high correlation between measurements collected within the same time point (without increasing the correlation between pre-treatment and post-treatment). In this case, repeated measurements would be redundant, thus failing to enhance precision. The recommended design is most advantageous when such correlation is not too high (ideally, when it is as high as the correlation between pre-treatment and post-treatment
scores), thus effectively serving to reduce the measurement error in the individual estimates of the underlying reading ability.
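The arithmetic behind this redundancy is simple. Assuming equicorrelated measurement errors within a time point (an assumption made here only for illustration), the error variance of a participant's occasion mean shrinks with the number of measurements k only to the extent that their mutual correlation rho_w is low:

    # Error variance of the mean of k measurements with pairwise error
    # correlation rho_w, as a factor of the single-measurement residual
    # variance: (1 + (k - 1) * rho_w) / k
    k     <- 3
    rho_w <- c(0, .4, .8)
    (1 + (k - 1) * rho_w) / k
    # 0.333 0.600 0.867 -> with rho_w = .8, three measurements barely help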
Second, if response to treatment were largely heterogeneous across individuals, no result from any small sample could be generalized to the population, regardless of the precision of the estimates within the studied sample. For these reasons, we would not recommend planning to use sample sizes below 40 (i.e., below 20 participants per group), even under ideal conditions.

Conclusions

Testing small samples is often unavoidable in studies on neurodevelopmental disorders, including randomized controlled trials assessing treatment efficacy in learning disorders. Unfortunately, plausible effect sizes in this field are also small. This is true not only because real effect sizes in psychology are generally limited (Open Science Collaboration, 2015), but also because learning disorders are characterized by poor response to treatment by definition (DSM-5; APA, 2013). This combination of small samples and small true effects leads to low statistical power. The latter not only makes it difficult to distinguish true- from false-positive results (Szucs & Ioannidis, 2017), but also leads to an increased risk of overestimating effect sizes (Gelman & Carlin, 2014), the so-called winner's curse effect (Young, Ioannidis, & Al-Ubaydli, 2008; see also Button et al., 2013). This means that low-powered studies risk either reporting effects that are statistically significant but overestimated, or reporting accurate effect size estimates that fail to reach statistical significance.

To enhance power, we suggest that researchers assessing treatment efficacy in dyslexia could exploit the advantages of estimating individual parameters with precision, as in single-case experimental designs (e.g., Krasny-Pacini & Evans, 2018), while keeping the focus on the population-level effects. Using stable and reliable reading measures that ensure a high pretest–posttest correlation is important. However, further benefit may come from collecting several measurements of reading performance both at pre-treatment and at post-treatment for all participants. We showed that, under reasonable assumptions, even three distinct measurements at pretest and three measurements at posttest, analysed using mixed-effects models, may crucially increase power.

In conclusion, increasing power when testing treatment efficacy is challenging, but small effects may still be detected reliably, even with modest samples. We chose to focus on the treatment of dyslexia because of the increasingly large body of literature in this field and because reading ability is relatively easy to assess. However, all the considerations presented here could be extended to other types of learning disorders (e.g., dyscalculia), or even to other neurodevelopmental disorders. In fact, outcomes other than reading may be more difficult or time-consuming to collect repeatedly at each time point. However, many measures in psychological research (e.g., in mathematics and other related fields) present even lower reliability than reading measures, meaning that the benefit of increasing precision by collecting several repeated observations, as suggested in the current report, might be even larger in areas outside reading research. In any case, a priori design analysis, including the discussion and formalization of all expectations about the phenomenon at hand, is fundamental. For complex designs such as those described in this article, we suggested running ad hoc simulations to maximize flexibility. Finally, whatever a priori assumptions are made, all parameters formalized in the design analysis (e.g., pretest–posttest correlations) should later be checked against the empirical data. Should any large divergence emerge, the design analysis may need to be reconsidered and adjusted retrospectively.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.3758/s13428-021-01549-x.

Acknowledgement The study was supported by a grant from MIUR (Dipartimenti di Eccellenza DM 11/05/2017 n. 262) to the Department of General Psychology, University of Padua. We would like to thank Dr Altoè and the PsicoStat group from the University of Padua for useful advice and support.

Open Practices Statement The data and R code for all main analyses presented in Study 1 and Study 2, along with additional examples, are available at: https://osf.io/npebz/?view_only=1b0445400eab40cb9fc5a9e08bfd5767 Neither of the studies was preregistered.

Funding Open access funding provided by Università degli Studi di Padova within the CRUI-CARE Agreement.

References

*References marked with an asterisk were included in the quantitative meta-analysis.

Altoè, G., Bertoldo, G., Zandonella Callegher, C., Toffalini, E., Calcagnì, A., Finos, L., & Pastore, M. (2020). Enhancing statistical inference in psychological research via prospective and retrospective design analysis. Frontiers in Psychology, 10:2893. https://doi.org/10.3389/fpsyg.2019.02893
American Psychiatric Association [APA] (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.
Aravena, S., Tijms, J., Snellings, P., & van der Molen, M. W. (2016). Predicting responsiveness to intervention in dyslexia using dynamic assessment. Learning and Individual Differences, 49, 209–215. https://doi.org/10.1016/j.lindif.2016.06.024
*Bar-kochva, I. (2016). An examination of an intervention program designed to enhance reading and spelling through the training of morphological decomposition in word recognition. Scientific Studies of Reading, 20(2), 163–172. https://doi.org/10.1080/10888438.2015.1108321
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
*Bedoin, N. (2017). Rebalancing the global and local visuo-attentional analyses to improve reading. A.N.A.E., 148, 276–294.
*Bonacina, S., Cancer, A., Lanzi, P. L., Lorusso, M. L., & Antonietti, A. (2015). Improving reading skills in students with dyslexia: The efficacy of a sublexical training with rhythmic background. Frontiers in Psychology, 6:1510. https://doi.org/10.3389/fpsyg.2015.01510
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis. West Sussex, UK: John Wiley & Sons, Ltd.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475
*Christodoulou, J. A., Cyr, A., Murtagh, J., Chang, P., Lin, J., Guarino, A. J., Hook, P., & Gabrieli, J. D. E. (2017). Impact of intensive summer reading intervention for children with reading disabilities and difficulties in early elementary school. Journal of Learning Disabilities, 50(2), 115–127. https://doi.org/10.1177/0022219415617163
Cirino, P. T., Rashid, F. L., Sevcik, R. A., Lovett, M. W., Frijters, J. C., Wolf, M., & Morris, R. D. (2002). Psychometric stability of nationally normed and experimental decoding and related measures in children with reading disability. Journal of Learning Disabilities, 35(6), 526–539. https://doi.org/10.1177/00222194020350060401
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
*Costanzo, F., Rossi, S., Varuzza, C., Varvara, P., Vicari, S., & Menghini, D. (2019). Long-lasting improvement following tDCS treatment combined with a training for reading in children and adolescents with dyslexia. Neuropsychologia, 130, 38–43. https://doi.org/10.1016/j.neuropsychologia.2018.03.016
*Costanzo, F., Varuzza, C., Rossi, S., Sdoia, S., Varvara, P., Olivieri, M., Koch, G., Vicari, S., & Meneghini, D. (2016). Evidence for reading improvement following tDCS treatment in children and adolescents with dyslexia. Restorative Neurology and Neuroscience, 34(2), 215–226. https://doi.org/10.3233/RNN-150561
*Dai, L., Zhang, C., & Liu, X. (2016). A special Chinese reading acceleration training paradigm: To enhance the reading fluency and comprehension of Chinese children with reading disabilities. Frontiers in Psychology, 7:1937. https://doi.org/10.3389/fpsyg.2016.01937
*Decker, M. M., & Buggey, T. (2014). Using video self- and peer modeling to facilitate reading fluency in children with learning disabilities. Journal of Learning Disabilities, 47(2), 167–177. https://doi.org/10.1177/0022219412450618
Dimitrov, D. M., & Rumrill, P. D. (2003). Pretest-posttest designs and measurement of change. Work, 20(2), 159–165.
*Ebrahimi, L., Pouretemad, H., Khatibi, A., & Stein, J. (2019). Magnocellular based visual motion training improves reading in Persian. Scientific Reports, 9:1142. https://doi.org/10.1038/s41598-018-37753-7
*Ferraz, E., dos Santos Goncalves, T., Freire, T., de Lima Ferreira Mattar, T., Lamonica, D. A. C., Maximino, L. P., & Crenitte, P. A. P. (2018). Effects of a phonological reading and writing remediation program in students with dyslexia: Intervention for specific learning disabilities. Folia Phoniatrica et Logopaedica, 70, 59–73. https://doi.org/10.1159/000489091
*Flaugnacco, E., Lopez, L., Terribili, C., Montico, M., Zoia, S., & Schon, D. (2015). Music training increases phonological awareness and reading skills in developmental dyslexia: A randomized control trial. PLoS ONE, 10(9):e0138715. https://doi.org/10.1371/journal.pone.0138715
*Franceschini, S., Gori, S., Ruffino, M., Viola, S., Molteni, M., & Facoetti, A. (2013). Action video games make dyslexic children read better. Current Biology, 23(6), 462–466. https://doi.org/10.1016/j.cub.2013.01.044
*Franceschini, S., Bertoni, S., Gianesini, T., Gori, S., & Facoetti, A. (2017). A different vision of dyslexia: Local precedence on global perception. Scientific Reports, 7:17462. https://doi.org/10.1038/s41598-017-17626-1
*Franceschini, S., Trevisan, P., Ronconi, L., Bertoni, S., Colmar, S., Double, K., Facoetti, A., & Gori, S. (2017). Action video games improve reading abilities and visual-to-auditory attentional shifting in English-speaking children with dyslexia. Scientific Reports, 7:5863. https://doi.org/10.1038/s41598-017-05826-8
*Frijters, J. C., Lovett, M. W., Sevcik, R. A., & Morris, R. D. (2013). Four methods of identifying change in the context of a multiple component reading intervention for struggling middle school readers. Reading and Writing, 26, 539–563. https://doi.org/10.1007/s11145-012-9418-z
Galuschka, K., Ise, E., Krick, K., & Schulte-Korne, G. (2014). Effectiveness of treatment approaches for children and adolescents with reading disabilities: A meta-analysis of randomized controlled trials. PLoS ONE, 9(8):e105843. https://doi.org/10.1371/journal.pone.0105843
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.
Giofrè, D., Cumming, G., Fresc, L., Boedker, I., & Tressoldi, P. (2017). The influence of journal submission guidelines on authors' reporting of statistics and use of open research practices. PLoS ONE, 12(4), e0175583. https://doi.org/10.1371/journal.pone.0175583
*González, F. G., Žarić, G., Tijms, J., Bonte, M., Blomert, L., & van der Molen, M. W. (2015). A randomized controlled trial on the beneficial effects of training letter-speech sound integration on reading fluency in children with dyslexia. PLoS ONE, 10(12):e0143914. https://doi.org/10.1371/journal.pone.0143914
*Gorgen, R., Huemer, S., Schulte-Korne, G., & Moll, K. (2020). Evaluation of a digital game-based reading training for German children with reading disorder. Computers & Education, 150:103834. https://doi.org/10.1016/j.compedu.2020.103834
*Gori, S., Seitz, A. R., Ronconi, L., Franceschini, S., & Facoetti, A. (2016). Multiple causal links between magnocellular-dorsal pathway deficit and developmental dyslexia. Cerebral Cortex, 26(11), 4356–4369. https://doi.org/10.1093/cercor/bhv206
Hahn, C., Cowell, J. M., Wiprzycka, U. J., Goldstein, D., Ralph, M., Hasher, L., & Zelazo, P. D. (2012). Circadian rhythms in executive function during the transition to adolescence: The effect of synchrony between chronotype and time of day. Developmental Science, 15, 408–416. https://doi.org/10.1111/j.1467-7687.2012.01137.x
*Heth, I., & Lavidor, M. (2015). Improved reading measures in adults with dyslexia following transcranial direct current stimulation treatment. Neuropsychologia, 70, 107–113. https://doi.org/10.1016/j.neuropsychologia.2015.02.022
*Horowitz-Kraus, T., Cicchino, N., Amiel, M., Holland, S. K., & Breznitz, Z. (2014). Reading improvement in English- and Hebrew-speaking children with reading difficulties after reading acceleration training. Annals of Dyslexia, 64, 183–201. https://doi.org/10.1007/s11881-014-0093-4
*Horowitz-Kraus, T., Vannest, J. J., Kadis, D., Cicchino, N., Wang, Y. Y., & Holland, S. K. (2014). Reading acceleration training changes brain circuitry in children with reading difficulties. Brain and Behavior, 4(6), 886–902. https://doi.org/10.1002/brb3.281
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2020). semTools: Useful tools for structural equation modeling. R package version 0.5-3. https://CRAN.R-project.org/package=semTools
*Kashani-Vahid, L., Taskooh, S. K., & Moradi, H. (2019). Effectiveness of 'Maghzineh' Cognitive Video Game on Reading Performance of Students with Learning Disabilities in Reading. International Serious Games Symposium (ISGS).
*Koen, B. J., Hawkins, J., Zhu, X., Jansen, B., Fan, W., & Johnson, S. (2018). The location and effects of visual hemisphere-specific stimulation on reading fluency in children with the characteristics of dyslexia. Journal of Learning Disabilities, 51(4), 399–415. https://doi.org/10.1177/0022219417711223
Krasny-Pacini, A., & Evans, J. (2018). Single-case experimental designs to assess intervention effectiveness in rehabilitation: A practical guide. Annals of Physical and Rehabilitation Medicine, 61(3), 164–179. https://doi.org/10.1016/j.rehab.2017.12.002
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206. https://doi.org/10.3758/s13423-016-1221-4
*Layes, S., Chouchani, M. S., Mecheri, S., Lalonde, R., & Rebai, M. (2019). Efficacy of a visuomotor-based intervention for children with reading and spelling disabilities: A pilot study. British Journal of Special Education, 46(3), 317–339. https://doi.org/10.1111/1467-8578.12278
*Layes, S., Lalonde, R., & Rebai, M. (2019). Effects of an adaptive phonological training program on reading and phonological processing skills in Arabic-speaking children with dyslexia. Reading & Writing Quarterly, 35(2), 103–117. https://doi.org/10.1080/10573569.2018.1515049
*Lofti, S., Rostami, R., Shokoohi-Yekta, M., Ward, R. T., Motamed-Yeganeh, N., Mathew, A. S., & Lee, H. J. (2020). Effects of computerized cognitive training for children with dyslexia: An ERP study. Journal of Neurolinguistics, 55:100904. https://doi.org/10.1016/j.jneuroling.2020.100904
*Luniewska, M., Chyl, K., Debska, A., Kacprzak, A., & Plewko, J. (2018). Neither action nor phonological video games make dyslexic children read better. Scientific Reports, 8:549. https://doi.org/10.1038/s41598-017-18878-7
*Luo, Y., Wang, J., Wu, H., Zhu, D., & Zhang, Y. (2013). Working-memory training improves developmental dyslexia in Chinese children. Neural Regeneration Research, 8(5), 452–460. https://doi.org/10.3969/j.issn.1673-5374.2013.05.009
Maxwell, S. C., Delaney, H. D., & Kelley, K. (2018). Designing experiments and analyzing data: A model comparison perspective (3rd ed.). Routledge.
*Meng, X., Lin, O., Wang, F., Jiang, Y., & Song, Y. (2014). Reading performance is enhanced by visual texture discrimination training in Chinese-speaking children with developmental dyslexia. PLoS ONE, 9(9):e108274. https://doi.org/10.1371/journal.pone.0108274
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press: Boca Raton, FL.
Morey, R. D., & Rouder, J. N. (2018). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4.2. https://CRAN.R-project.org/package=BayesFactor
Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11(2), 364–386. https://doi.org/10.1177/1094428106291059
*Nukari, J. M., Phil, L., Poutiainen, E. T., Arkkila, E. P., Haapanen, M. L., Lipsanen, J. O., & Laasonen, M. R. (2020). Both individual and group-based neuropsychological interventions of dyslexia improve processing speed in young adults: A randomized controlled study. Journal of Learning Disabilities, 53(3), 213–227. https://doi.org/10.1177/0022219419895261
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349:aac4716. https://doi.org/10.1126/science.aac4716
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria. https://www.R-project.org/
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063
Ramsay, M. W., Davidson, C., Ljungblad, M., Tjamberg, M., Brautaset, R., & Nilsson, M. (2014). Can vergence training improve reading in dyslexics? Strabismus, 22(4), 147–151. https://doi.org/10.3109/09273972.2014.971823
Schönbrodt, F. D., & Wagenmakers, E. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, 25, 128–142. https://doi.org/10.3758/s13423-017-1230-y
Shaywitz, S., Shaywitz, B., Wietecha, L., Wigal, S., McBurnett, K., Williams, D., Kronenberger, W. G., & Hooper, S. R. (2017). Effect of atomoxetine treatment on reading and phonological skills in children with dyslexia or attention-deficit/hyperactivity disorder and comorbid dyslexia in a randomized, placebo-controlled trial. Journal of Child and Adolescent Psychopharmacology, 27(1), 19–28. https://doi.org/10.1089/cap.2015.0189
Stan Development Team (2018). Stan Modeling Language Users Guide and Reference Manual, Version 2.18.0. https://mc-stan.org
Stanley, T. D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8(5), 581–591. https://doi.org/10.1177/1948550617693062
Stein, J. (2018). What is developmental dyslexia? Brain Sciences, 8(2), 26. https://doi.org/10.3390/brainsci8020026
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3), e2000797. https://doi.org/10.1371/journal.pbio.2000797
Szucs, D., & Ioannidis, J. P. A. (2020). Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage, 221(1), 117164. https://doi.org/10.1016/j.neuroimage.2020.117164
*Toste, J. R., Capin, P., Williams, K. J., Cho, E., & Vaughn, S. (2019). Replication of an experimental study investigating the efficacy of a multisyllabic word reading intervention with and without motivational beliefs training for struggling readers. Journal of Learning Disabilities, 52(1), 45–58. https://doi.org/10.1177/0022219418775114
Van Breukelen, G. (2006). ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59(9), 920–925. https://doi.org/10.1016/j.jclinepi.2006.02.007
*Wang, L. C. (2017). Effects of phonological training on the reading and reading-related abilities of Hong Kong children with dyslexia. Frontiers in Psychology, 8:1904. https://doi.org/10.3389/fpsyg.2017.01904
*Wang, L. C., Liu, D., & Xu, Z. (2019). Distinct effects of visual and auditory temporal processing training on reading and reading-related abilities in Chinese children with dyslexia. Annals of Dyslexia, 69, 166–185. https://doi.org/10.1007/s11881-019-00176-8
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
*Werth, R. (2019). What causes dyslexia? Identifying the causes and effective compensatory therapy. Restorative Neurology and Neuroscience, 37(6), 591–608. https://doi.org/10.3233/RNN-190939
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
*Wolff, U. (2014). RAN as a predictor of reading skills, and vice versa: Results from a randomized reading intervention. Annals of Dyslexia, 64, 151–165. https://doi.org/10.1007/s11881-014-0091-6
*Wolff, U. (2016). Effects of a randomized reading intervention study aimed at 9-year-olds: A 5-year follow-up. Dyslexia, 22(2), 85–100. https://doi.org/10.1002/dys.1529
*Yang, J., Peng, J., Zhang, D., Zheng, L., & Mo, L. (2017). Specific effects of working memory training on the reading skills of Chinese children with developmental dyslexia. PLoS ONE, 12(11):e0186114. https://doi.org/10.1371/journal.pone.0186114
Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science. PLoS Medicine, 5(10):e201. https://doi.org/10.1371/journal.pmed.0050201
*Zhao, J., Liu, H., Li, J., Sun, H., Liu, Z., & Gao, J. (2019). Improving sentence reading performance in Chinese children with developmental dyslexia by training based on visual attention span. Scientific Reports, 9:18964. https://doi.org/10.1038/s41598-019-55624-7

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.