Late-Emerging and Resolving Dyslexia: A Follow-Up Study from Age 3 to 14

This study focuses on the stability of dyslexia status from Grade 2 to Grade 8 in four groups: (a) no dyslexia in either grade (no-dyslexia, n = 127); (b) no dyslexia in Grade 2 but dyslexia in Grade 8 (late-emerging, n = 18); (c) dyslexia in Grade 2 but not in Grade 8 (resolving, n = 15); and (d) dyslexia in both grades (persistent-dyslexia, n = 22). We examined group differences from age 3.5 to age 14 in (a) reading, vocabulary, phonology, letter knowledge, rapid naming, IQ, verbal memory; (b) familial and environmental risk and supportive factors; and (c) parental skills in reading, phonology, rapid naming, verbal memory, and vocabulary. Our findings showed group differences both in reading and cognitive skills of children as well as their parents. Parental education, book-reading frequency, and children’s IQ, however, did not differentiate the groups. The children in the persistent-dyslexia group exhibited widespread language and cognitive deficits across development. Those in the resolving group had problems in language and cognitive skills only prior to school entry. In the late-emerging group, children showed clearly compromised rapid naming. Additionally, their parents had the most severe difficulties in rapid naming, a finding that suggests strong genetic liability. The findings show instability in the diagnosis of dyslexia. The members of the late-emerging group did not have a distinct early cognitive profile, so late-emerging dyslexia appears difficult to predict. Indeed, these children are at risk of not being identified and not receiving required support. This study suggests the need for continued monitoring of children’s progress in literacy after the early school years.

A considerable amount of research over the past decades has focused on reading disability (RD). This work has increased the understanding of the etiology and assessment of RD, as well as of the risk and protective factors for RD (for a recent review, see Snowling and Hulme 2013). Most of the studies, however, have focused on the early phases of reading development or used relatively short follow-ups. The few studies that have followed reading development for several years have suggested that, although there is high stability in RD across grades at the group level (e.g., Compton et al. 2008; Landerl and Wimmer 2008), there are also, at least in English, a considerable number of cases that move across the clinical threshold over time: children who are no longer affected (resolving RD; e.g., Catts et al. 2012) and children who do not develop RD until Grade 4 (late-emerging RD; e.g., Catts et al. 2012; Etmanskie et al. 2015; Leach et al. 2003; Lipka et al. 2006). Little research has been conducted on the characteristics of resolving and late-emerging RD, although the proportion of late-emerging cases has been reported to be approximately 40 % of all RD cases in English (Catts et al. 2012; Leach et al. 2003; Lipka et al. 2006). The scarcity of information on these resolving and late-emerging RD groups presents challenges to early identification and prevention efforts of RD.
Heikki Lyytinen is a professor at the University of Jyväskylä. Electronic supplementary material: The online version of this article (doi:10.1007/s10802-015-0003-1) contains supplementary material, which is available to authorized users.
Four previous studies have directly addressed the stability of RD classification (Catts et al. 2012; Etmanskie et al. 2015; Leach et al. 2003; Lipka et al. 2006). The studies have varied in their identification of RD. Lipka et al. (2006) identified three types of RD on the basis of reading accuracy and letter knowledge in a sample of 44 children followed from Kindergarten to Grade 4. They found that 36 % of all RD cases were late-emerging. Catts et al. (2012) followed 493 children from Kindergarten to Grade 10. They reported that approximately 25 % of all RD cases had late-emerging problems in decoding. Catts et al. (2012) also found a small group of children with resolving RD (5 % of all RD cases). Both Lipka et al. (2006) and Catts et al. (2012) used two measures for word-reading accuracy and none for reading fluency or spelling in the identification of RD. Etmanskie et al. (2015) used a combined measure of reading accuracy and reading comprehension. Leach et al. (2003), however, used two measures for reading accuracy, two for reading fluency, and two for spelling. They tested 161 fourth- and fifth-grade students and then examined retrospectively if the children had previously been identified by teachers as having reading difficulties. Of the 66 children with RD, 21 (32 %) were identified as late-emerging. Leach et al. (2003) also identified a small group of children with resolving RD (12 % of all RD cases).
All of the above studies on dyslexia stability were conducted in English. In the present study, we examined the stability of RD in Finnish, which is among the most transparent orthographies. Transparency of orthography has been found to have important effects on the development of reading skills and thus the findings in English-speaking samples may not be applicable to other, more transparent orthographies. In transparent orthographies, such as those of German or Finnish, letter-sound connections are much more consistent (Seymour et al. 2003) and children learn to decode much more quickly in these languages than children do in English (Aro and Wimmer 2003). In Finland, for example, there are only 23 fully consistent letter-sound connections to learn for accurate decoding of all Finnish words and most children are accurate decoders after just a few weeks in school. Research on the different developmental trajectories of the aforementioned RD groups in the context of transparent orthographies is, however, completely lacking, although there is correlative evidence of strong stability in reading development also in transparent languages (e.g., de Jong and van der Leij 2002; Landerl and Wimmer 2008; Parrila et al. 2005). On the other hand, the correlation coefficients are still far from unity, suggesting that different developmental trajectories may also exist in transparent orthographies.
In transparent orthographies the main characteristic of readers with RD is slow reading (e.g., de Jong and van der Leij 2003; Landerl and Wimmer 2008; Landerl et al. 1997; Zoccolotti et al. 2005) while reading accuracy has been shown to be an easy skill to acquire (Aro and Wimmer 2003; Seymour et al. 2003). Therefore, unlike previous studies on RD stability in English, we focus on reading fluency. We use multiple measures for reading speed, which are age-appropriate but otherwise very similar across ages, to keep the dyslexia criteria as consistent as possible across ages. We categorize children as (a) no dyslexia, (b) late-emerging dyslexia, (c) resolving dyslexia, and (d) persistent dyslexia based on their reading speeds in Grade 2 and in Grade 8. In addition, we validate the differences between groups by comparing their reading speed performance in tasks that were not used in the RD grouping, as well as by comparing the reading speed of the follow-up sample to the level of their classmates. Finally, we test how many children would be expected to change dyslexia group between Grade 2 and Grade 8 by chance, given the limits of measurement reliability, in order to see whether the changes in dyslexia group are truly developmental. Although we do acknowledge that reading skill is continuous, the categorical approach has clinical relevance and allows comparisons to the previous research conducted in English.
In terms of identification of risk and protective factors, as well as early identification of children at risk for dyslexia, an examination of the differences among the developmental RD groups is important. Differences in cognitive profiles are one potential explanation for the differences in the emergence of RD. Children with poor early reading skills that resolve later on may have different cognitive vulnerabilities and strengths than children with late-emerging or persistent dyslexia. In the early phases of reading development letter knowledge and phonological skills seem to be particularly important, as basic decoding requires solid knowledge of letters, sounds, and their connections (e.g., Georgiou et al. 2008; Puolakanaho et al. 2007). Later on, decoding becomes automatized in typically developing children. However, reading continues to be slow and laborious in poor readers, partly because many words are still decoded letter-by-letter (Eklund et al. 2015; Marinus and de Jong 2010). Later reading fluency is predicted by rapid naming (Lervåg and Hulme 2009; Torppa et al. 2012; van Bergen et al. 2014a, b). Relating this to the identified groups, the persistent group is expected to show problems in all reading-related cognitive skills. Yet our focus is particularly on the cognitive profiles of the two unstable groups. On the one hand, the resolving group may lag behind in acquiring phonological skills and grapheme-phoneme knowledge. On the other hand, the late-emerging group can be expected to have problems particularly in rapid naming, which would not be apparent until the demands on fluency increase.
Two previous studies on the stability of dyslexia diagnosis have also compared the groups in terms of their cognitive skills (Catts et al. 2012; Lipka et al. 2006). Catts et al. (2012) concluded that all of the RD groups showed Kindergarten-age cognitive deficits in comparison to typical readers, but the groups did not differ from each other. The late-emerging group with problems in word reading alone had difficulties particularly in phonological awareness and sentence repetition. The resolving group showed problems of phonological awareness and letter identification, but since the size of this group was small (n=11) and consisted of mixed cases with difficulties either in reading comprehension, in word reading, or in both, the findings call for replication. Similarly to Catts et al. (2012), Lipka et al. (2006) reported that their late-emerging group had difficulties in phonological awareness. It was suggested that the children's phonological skills were sufficient for the early grades, but that they started to fall behind when cognitive demands in reading increased. In the present study we include, in addition to phonological skills, several other skills that have been shown to be closely linked to reading development: letter knowledge, rapid naming, verbal short-term memory, and vocabulary (e.g., Puolakanaho et al. 2007; Snowling et al. 2003; van Bergen et al. 2014a, b). Unlike previous studies, we report performance on these skills both prior to and following school entry.
In addition to cognitive skills, other risk or supportive factors may explain developmental differences among the groups. One such factor is family risk for dyslexia, which was not examined in the previous studies on the stability of RD. The risk for dyslexia has been reported to range from fourfold (Puolakanaho et al. 2007) to tenfold (van Bergen et al. 2012) for children with family risk compared to children without such risk. Family risk has predicted children's reading development over and above children's skills in the key cognitive precursors, such as phonological awareness, rapid naming, and letter knowledge (Puolakanaho et al. 2007; Torppa et al. 2011). Furthermore, studies predicting children's skills from parents' skills have suggested that specific parental skills may be informative in assessing children's liability for dyslexia beyond their own cognitive development (Torppa et al. 2011; van Bergen et al. 2014a).
A third factor that could explain the differential developmental trajectories is the amount of environmental support. Leach et al. (2003) compared the groups in terms of print exposure, but their study did not find differences between the early-emerging and late-emerging groups. However, they used an author-recognition test, which is an indirect measure of children's print exposure. It remains possible, therefore, that a more direct evaluation of the amount of reading activities would find differences between the RD groups. In the present study we examine the amount of book reading children do with their parents and the amount they do alone. We also examine group differences in parental education. Finally, we examine gender distributions among the groups. Both gender and print exposure comparisons were motivated by consistent findings of a gender gap in literacy, which is often attributed to fewer reading activities among boys (see OECD 2010a, b).
This paper examines the following research questions: How unstable is dyslexia status between Grades 2 and 8? Do children change dyslexia status more often than the unreliability of diagnostic tests predicts? Do the four groups differ in (a) the development of reading speed, (b) the development of language and cognitive skills, (c) the amount of book reading, (d) gender, or (e) parental education and parental reading and reading-related skills?

Method

Participants
All children (n=182) were participants in the Jyväskylä Longitudinal Study of Dyslexia (JLD) (see Lyytinen et al. 2008), originally selected for one of two samples: those with family risk for dyslexia or those without it. Children at risk (n=101) had a parent and one or more other close family members with dyslexia. The parents' dyslexia status was confirmed through an extensive test battery (see Leinonen et al. 2001). All children spoke Finnish as their native language and had no mental, physical, or sensory impairments. None of the children had a standard score below 80 in both performance and verbal IQ assessed in Grade 2 (WISC-III-R; Wechsler 1991). There were 86 girls and 96 boys in the sample. In addition to the follow-up sample, their classmates' reading skills were assessed in Grade 2 (n=1356), Grade 3 (n=2575), and Grade 7 (n=1451). The classmates' data provided a reference point for typical development. All children attended mainstream public schools following the national curriculum. The JLD has received ethical consent from the ethical board of the University of Jyväskylä.

Measures
Children's Cognitive and Literacy Skills The children's cognitive and literacy skills were assessed individually by trained testers prior to school entry (from age 3.5 to 6.5) and in Grades 2, 3, and 8. Children in Finland enter Grade 1 in the fall of the year they turn 7 years.
Vocabulary At age 3.5 and 5.5, the Boston Naming Test (Kaplan et al. 1983) was used. The Finnish translation of the BNT (Laine et al. 1993; Laine et al. 1997) contains 60 pictured items which the child is asked to name. Testing is continued until six consecutive errors are incurred. The score is based on the total number of items that are spontaneously correct plus the number of items correctly identified following a semantic stimulus cue (e.g., violin: "an instrument"; tennis racket: "you play a game with it"). In Grade 2, the WISC-III (Wechsler 1991) was used.
Memory Verbal short-term memory was assessed at age 6.5 and in Grades 3 and 8 with a forward digit span test. The measure was the number of correctly repeated number sequences of 12 items.
Phonological Awareness At ages 4.5, 5.5, and 6.5, phonological awareness was measured with a composite mean of z scores from four tasks: first phoneme identification, first phoneme production, segment identification, and synthesis. Cronbach's alphas were 0.71 at age 4.5, 0.58 at age 5.5, and 0.85 at age 6.5. In Grades 3 and 8, the common unit task was used: the task was to repeat aloud a sound that was common to two different pseudowords presented via earphones (Torppa et al. 2012). The score was the number of correct responses (phoneme or letter name) out of 15 items. Cronbach's alpha was 0.81 in Grade 3 and 0.85 in Grade 8.
Rapid Naming Children were asked to name, as rapidly as possible, a matrix of 30 (age 5.5 years) or 50 objects (age 6.5 and Grades 2, 3, and 8; Denckla and Rudel 1974) made up of five different pictures of objects: a car, a house, a fish, a pencil, and a ball. All the Finnish names for these objects are two-syllable high-frequency words. Total naming time (in seconds) was used as the score.
Letter Knowledge All 29 lowercase letters (23 typically used and 6 for the rare loan words) in the Finnish alphabet were presented at ages 4.5, 5.5, and 6.5. The measure was the number of correctly named letters. Cronbach's alphas were 0.83 at age 4.5, 0.88 at age 5.5, and 0.93 at age 6.5.

IQ (Grade 2)
Four performance-quotient subtests (Picture Completion, Block Design, Object Assembly, and Coding) and five verbal-performance subtests (Similarities, Vocabulary, Comprehension, Series of numbers, and Arithmetic) of the WISC-III-R were used. The estimate of the IQ was calculated according to the manual. The Cronbach's alpha for the composite of the subtests was 0.70.
Children's Reading Fluency Reading fluency was assessed individually in Grades 2, 3, and 8 and in groups at school in Grades 2, 3, and 7. The Cronbach's alphas for the composites of reading fluency were 0.92 in Grade 2, 0.86 in Grade 3, and 0.90 in Grade 8.
Word-List Reading (Grades 2, 3, and 8) In the Lukilasse nationally standardized reading test (Häyrinen et al. 1999), participants had 2 min (Grades 2 and 3) or 1 min (Grade 8) to read aloud as many words as possible from a 90-item (Grade 2) or 105-item (Grades 3 and 8) list. The measure of the word-list reading speed was the number of correctly read words within the time limit. The inter-rater reliability was 0.99.
Text Reading (Grades 2, 3, and 8) Age-appropriate ordinary texts were selected with lengths of 124, 189, and 204 words (for Grades 2, 3, and 8, respectively). Total reading time was the measurement of text reading speed.
Pseudoword Text Reading (Grades 2, 3, and 8) Children read aloud a short text made up of 19 (Grade 2) or 38 pseudowords (Grades 3 and 8). The words and structure of the sentences resembled real Finnish in form but had no meaning. Total reading time was the measure of pseudoword reading speed.
Wordchains (Grades 2, 3, and 7) In Grades 2 and 3 the test included 79 wordchains each containing 2-4 words, and in Grade 7 it consisted of 75 wordchains each containing 4 words. The task was administered in a group context in classrooms. The child's task was to scan and mark with a pencil the boundaries in the chain where one word ends and another starts. The number of correct answers during the time limit of 2 min (Grades 2 and 3) or 3 min and 30 s (Grade 7) was used as a measure of reading speed.
Print Exposure Print exposure was assessed via parental questionnaires on the amount of book reading (Grades 2, 3, and 7) and through self-reports (Grade 7). Prior to school entry at ages 4, 5, and 6, the amount of book reading was assessed as the amount of shared book reading with parents. To produce a composite score of shared reading, we obtained parental reports of both frequency and time spent on children's reading activities in the home. Two items assessed the frequency: How often (a) the mother reads with the child, and (b) the father reads with the child. Two items covered the amount of time spent with print materials: (a) the typical duration of a reading episode (i.e., the child reads with an adult), and (b) the total time per day the child spends reading a book with an adult. Shared reading composites were derived by calculating the mean of these four item scores. In Grades 2 and 3, a composite score was derived for two items pertaining to independent book reading: (a) frequency of reading alone, and (b) the typical duration of a reading episode. Parents responded to the first item using a five-point scale (1 = not at all/seldom … 5 = several times a day) and to the second item using a three-point scale (1 = less than 15 min/episode… 3 = longer than 45 min/ episode). In Grade 7, print exposure was based on both the child's self-report and a parental report. The questions and scales were the same as in Grades 2 and 3, a total of four items. Cronbach's alpha was 0.79, 0.84, and 0.84 for Grades 2, 3, and 7, respectively.
Dyslexia Criteria in Grades 2 and 8 Dyslexia criteria were based on the following tasks: (a) word list reading, (b) text reading, and (c) pseudoword text reading. First, a cut-off criterion for deficient performance was defined for each measure, using the 10th percentile in the distribution of the children without family risk (n=81). Subsequently, a child was considered to have dyslexia if the child scored below the criterion in at least two out of three measures of reading speed. In comparison to the larger samples with classmates (n=1386 and n=1489 in Grades 2 and 7, respectively), the mean reading skill of the children with dyslexia was at the level of the 8th and the 6th percentile in Grades 2 and 7, respectively.
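The classification rule described above can be sketched in code. The following is a minimal illustration, not the authors' actual procedure: the measure names, the data layout, and the percentile interpolation method (the default of Python's `statistics.quantiles`) are assumptions. It also assumes that for timed measures "below the criterion" means a reading time slower than the 90th percentile of the reference group, whereas for word-list reading it means fewer correct words than the 10th percentile.

```python
from statistics import quantiles

# Hypothetical measure names. True means a higher raw score is better
# (number of words read correctly); False means a lower score is better
# (total reading time).
HIGHER_IS_BETTER = {
    "word_list": True,
    "text_time": False,
    "pseudoword_time": False,
}


def decile_cutoffs(reference):
    """Return the poorest-decile cut-off per measure from the reference group.

    `reference` maps each measure name to the list of scores of the
    children without family risk.
    """
    cutoffs = {}
    for measure, scores in reference.items():
        deciles = quantiles(scores, n=10)  # nine cut points: 10th to 90th percentile
        cutoffs[measure] = deciles[0] if HIGHER_IS_BETTER[measure] else deciles[-1]
    return cutoffs


def meets_criteria(child, cutoffs, min_deficits=2):
    """True if the child is deficient on at least `min_deficits` measures."""
    deficits = 0
    for measure, cut in cutoffs.items():
        score = child[measure]
        deficient = score < cut if HIGHER_IS_BETTER[measure] else score > cut
        deficits += deficient
    return deficits >= min_deficits


def stability_group(dyslexia_grade2, dyslexia_grade8):
    """Map dyslexia status at the two time points to one of the four groups."""
    if dyslexia_grade2 and dyslexia_grade8:
        return "persistent-dyslexia"
    if dyslexia_grade2:
        return "resolving"
    if dyslexia_grade8:
        return "late-emerging"
    return "no-dyslexia"
```

In use, `decile_cutoffs` would be computed separately for Grade 2 and Grade 8, and the two resulting statuses passed to `stability_group`.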
Parental Assessment The literacy skills of the parents were assessed before the child's birth. In the present study we included text reading speed because it resembles the children's tasks. When the children were between ages 3 and 6, we invited the parents for reassessments to measure their reading-related cognitive skills. Because we were not able to reassess all parents, the sample size for cognitive measures is somewhat lower than for reading speed (i.e., n=74 vs. n=100 for the at-risk group parents and n=45 vs. n=81 for the control parents with typical reading skills). Comparisons of the attendees and non-attendees revealed that the educational level and age of the parents did not differ. There were, however, differences in parental reading skills: the average reading level of the parents who decided to attend reassessments was somewhat lower than that of the non-attendees. For one of the parents in the family risk group, the text reading task was very difficult and testing was discontinued.
Text-Reading Fluency Parents read aloud two passages (218 and 128 words, respectively) as fluently and accurately as possible. A measure of reading fluency was the average reading time for the two texts. Cronbach's alpha was 0.96.
Phoneme Deletion Parents pronounced a given word without the second phoneme. The task included 16 words (e.g., kaupunki 'city' became kupunki) of 4 to 10 letters with 2 to 4 syllables. Deletion of the second phoneme yielded a pseudoword. Stimuli were presented via headphones. A new stimulus was presented after a response or after a 20-second period of silence. The number of correct responses was calculated.
Rapid Naming On each of three tasks, participants named, as rapidly as possible, a matrix of 50 items comprising objects, digits, or a mixture of digits, objects, and letters. In the parental assessment, stimuli were presented on a computer screen. Total naming time (in seconds) was used as the score. A mean composite score of the three standardized rapid naming scores was calculated for the analyses. Cronbach's alpha for the rapid naming composite was 0.86.
Verbal Short-Term Memory In the digit span subtest of WAIS-III (Wechsler 1991), participants repeated strings of digits, increasing in length, in both the forward and reverse directions. Two sets of items, one for forward, the other for reverse, were used. Scaled scores were derived from the manual.
Vocabulary In the vocabulary subtest of WAIS-III (Wechsler 1991), participants defined 35 words in their own words. Scaled scores were derived from the manual.

Results
We first classified children in four groups, according to their dyslexia status in Grades 2 and 8: (a) no dyslexia in either grade (no-dyslexia, n=127); (b) no dyslexia in Grade 2 but dyslexia in Grade 8 (late-emerging, n=18); (c) dyslexia in Grade 2 but not in Grade 8 (resolving, n=15); and (d) dyslexia in both grades (persistent-dyslexia, n=22). Because we classify children using cut-offs at two time points, which are always somewhat arbitrary (e.g. Francis et al. 2005), and because individual test scores are never 100 % reliable, we first examined how many of the Grade 2 children would be expected to have changed group by Grade 8 just by chance. To do this, we ran a simulation study using the reliability of the three reading tasks (measured by test-retest correlations between Grades 2 and 3). Our 1-year test-retest correlations reflect both measurement unreliability and trait stability. Therefore, they estimate reliability conservatively. Reading measures between Grades 2 and 3 correlated 0.80 for word list reading, 0.66 for pseudoword text reading, and 0.87 for text reading. We decided to use a test-retest correlation estimate of 0.80 (average of the test-retest correlations of the three tasks), and the true-score stability of the trait was set to one. In the simulation we set the number of cases to 100,000 and examined how many children would be observed to change group under the assumption that there are no true-score changes, so that all changes in observed scores are random changes due to unreliability of measurement. The simulation used the same identification procedure, with three measures and a 10 % cut-off, as the other analyses. The results showed that 5.6 % of children, which would correspond to ten children in our sample, change group just because of the unreliability of the measures between Grades 2 and 8. 
As 18 % of our sample (33 children) changed their diagnostic group, we conclude that, even with the conservative reliability estimate, the instability of dyslexia is not fully due to random changes across cut-offs.
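The logic of such a chance-switching simulation can be illustrated with a short sketch. This is not the authors' simulation code, and several details are assumptions made here for illustration: a single standard-normal true score shared by all three measures, independent normal measurement error, and a cut-off taken from the theoretical 10th percentile of the observed-score distribution. Because of these assumptions the sketch is not expected to reproduce the reported 5.6 % exactly.

```python
import random
from statistics import NormalDist


def simulate_chance_switch_rate(n_cases=100_000, reliability=0.80,
                                n_measures=3, min_deficits=2,
                                cutoff_percentile=0.10, seed=0):
    """Share of simulated children whose status changes by chance alone.

    True scores are identical at both time points (stability of one), so
    every status change is caused by measurement error only.
    """
    rng = random.Random(seed)
    # Observed score = sqrt(r) * true + sqrt(1 - r) * error has unit variance
    # and correlates r with a second, independent measurement of the same trait.
    true_weight = reliability ** 0.5
    error_weight = (1.0 - reliability) ** 0.5
    cutoff = NormalDist().inv_cdf(cutoff_percentile)

    def has_dyslexia(true_score):
        deficits = sum(
            true_weight * true_score + error_weight * rng.gauss(0, 1) < cutoff
            for _ in range(n_measures)
        )
        return deficits >= min_deficits

    switched = 0
    for _ in range(n_cases):
        true_score = rng.gauss(0, 1)
        # Two independent measurement occasions of the same true score.
        if has_dyslexia(true_score) != has_dyslexia(true_score):
            switched += 1
    return switched / n_cases
```

The returned proportion plays the role of the 5.6 % chance rate in the text; multiplying it by the sample size (182) gives the number of children expected to switch by chance, which can then be compared with the 33 children who actually changed group.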

Group Differences in Parental Education, Gender, and Family Risk
There were no differences between groups in IQ or parents' education (see Table 1). Children's gender was unevenly distributed in the groups, χ²(3) = 11.06, p < 0.05: there were more boys than expected in the late-emerging group (adjusted standardized residual = 2.5) and more girls in the resolving group (adjusted standardized residual = 2.4). In addition, family risk for dyslexia was unevenly distributed in the groups, χ²(3) = 19.54, p < 0.001: at-risk children were overrepresented in the persistent and late-emerging groups (adjusted standardized residual = 2.6 and 2.5, respectively) and underrepresented in the no-dyslexia group (adjusted standardized residual = −4.4).

Group Differences in Reading Speed Development
In order to examine if the groups were different in reading speed development, we conducted group comparisons using both the individually administered dyslexia criterion tasks (in Grades 2, 3, and 8) and group-administered tasks (in Grades 2, 3, and 7) that were not part of the dyslexia criterion. The group tasks were also administered to children's classmates in order to obtain a reference sample. Group comparisons were conducted with one-way ANOVAs (see Fig. 1, Table 2). Note that in Fig. 1, there are two different distributions underlying the standardization. In the left panel for the individually administered tasks, the standardization is based on the not-at-risk group's distribution. In the right panel for the group-administered reading tasks, the standardization is based on the larger sample, which includes the classmates of the follow-up group as well. Effect sizes (Cohen's d computed using pooled standard deviation) are reported in Supplementary Table 1. All dyslexia groups showed poorer performance than the no-dyslexia group across measures and time-points. The persistent group stayed at the lowest level throughout the whole follow-up period. The late-emerging group also read somewhat more slowly than the no-dyslexia group in Grades 2 and 3 and showed a descending trajectory in the follow-up to Grades 7 and 8. The resolving group, on the other hand, did not differ in reading speed from the persistent group in Grades 2 and 3, but showed fast development in the following years. We also conducted a follow-up analysis on the developmental differences in reading speed among the groups with a mixed ANOVA in which we entered grade as a within-subjects factor and dyslexia group as a between-subjects factor. There was a significant Grade × Group interaction effect, F(3, 174) = 18.50, p < 0.001, partial η² = 0.24, which indicates group differences in the rate of reading speed development between Grades 2 and 8. The clearest difference was between the resolving and late-emerging groups: the resolving group made progress in reading speed more quickly than the late-emerging group did.
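For readers unfamiliar with the effect size used here, Cohen's d with a pooled standard deviation can be computed as in the following minimal sketch. The function name and the list-based input format are illustrative assumptions; the formula itself is the standard one, d = (M_a − M_b) / s_pooled, where s_pooled weights each group's sample variance by its degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

A negative value means that the first group scored lower on average than the second; whether lower is poorer depends on the measure (for reading times, lower is better).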

Group Differences in Cognitive Skills
Next we compared the groups in terms of cognitive skills with one-way ANOVAs (see Table 3 and Supplementary Table 2, Fig. 2). The persistent-dyslexia group showed poor performance in almost all assessed cognitive skills. They performed below the no-dyslexia group in all measures except vocabulary, and phonological awareness at ages 4.5 and 5.5. The strongest effect sizes for the difference between the persistent and no-dyslexia groups were found in phonological awareness from age 6.5 onwards, in rapid naming, and in letter knowledge.
The late-emerging group had problems most clearly in rapid naming speed. They performed below the no-dyslexia group in all assessments of rapid naming and also in Grade 3 verbal short-term memory. In addition, they tended to perform below the no-dyslexia group in the other cognitive skills as well (except for vocabulary), although these medium effect size differences did not reach significance. The late-emerging group and the persistent-dyslexia group differed significantly only in phonological awareness (Grade 8), although medium to strong effect sizes in favor of the late-emerging group emerged for verbal short-term memory (age 6.5) and for letter knowledge (ages 5.5 and 6.5).
The resolving group had problems in cognitive skills but only prior to school entry: they performed below the no-dyslexia group in vocabulary (ages 3.5 and 5), phonological awareness (age 6.5), rapid naming (age 5.5), and letter knowledge (all occasions). In addition, medium effect sizes emerged for all cognitive skills assessed prior to school entry in favor of the no-dyslexia group as well as for verbal short-term memory at all time-points. The resolving group did not differ from the persistent group significantly except for phonological awareness in Grades 3 and 8. However, medium effect sizes in favor of the resolving group emerged also in rapid naming (Grades 3 and 8). On the other hand, medium effect sizes in favor of the persistent group were found in vocabulary (ages 3.5 and 5).
The resolving group and late-emerging group did not differ significantly from each other in any of the cognitive measures. However, there were medium effect sizes in favor of the late-emerging group before school age in vocabulary (age 5), verbal short-term memory (age 6.5), and letter knowledge (age 6.5). On the other hand, at school age medium effect sizes were found in favor of the resolving group in phonological awareness (Grade 3) and in rapid naming (Grade 8).

Group Differences in Book Reading
Finally, the groups were compared in the amount of book reading (the amount of shared book reading with a parent prior to school entry and the amount of book reading alone after school entry). According to the pairwise comparisons, the groups did not differ in the amount of book reading although the F-test in Grade 3 was significant. The comparisons of the effect sizes suggested that the parents of the resolving group tended to read less to their 4-year-olds than the parents of the late-emerging and persistent-dyslexia groups did (medium effect sizes). In addition, an effect size of 0.71 emerged for the comparison of the no-dyslexia group and resolving group in Grade 3, a result that indicates that the no-dyslexia children spent more time reading books than the resolving children.

Group Differences in Parental Skills
For the group comparisons in parental skills, five groups were compared instead of four because the no-dyslexia group was split into two groups: children with family risk for dyslexia and children without family risk for dyslexia. This division was made in order to provide a more detailed examination of the parental skill differences in these two separate samples. There were 3-4 children without family risk in each of the dyslexia groups whose data were omitted from this comparison, resulting in five groups: not-at-risk & no-dyslexia (n=70), at-risk & no-dyslexia (n=57), at-risk & late-emerging dyslexia (n=15), at-risk & resolving dyslexia (n=11), and at-risk & persistent-dyslexia (n=18). The group comparisons are reported in Table 4, Supplementary Table 3, and Fig. 3. The group comparisons revealed that parents in the not-at-risk & no-dyslexia group performed significantly better than parents in all other groups in reading speed and in phoneme deletion. These parents also performed better in the verbal short-term memory task than parents in all other groups did except for those in the at-risk & resolving group. There were no significant differences in parental vocabulary, although large effect sizes indicated that the parents in the at-risk & persistent-dyslexia group tended to have poorer vocabulary than those in the not-at-risk & no-dyslexia group did. The group comparisons in parents' rapid naming showed interesting differences between the family risk groups: the parents of the at-risk & late-emerging and the at-risk & persistent-dyslexia groups were slow at rapid naming, whereas the parents of the at-risk & resolving and the at-risk & no-dyslexia groups did not differ from the not-at-risk & no-dyslexia group's parents in rapid naming. The effect sizes confirmed that the parents of the at-risk & late-emerging and at-risk & persistent-dyslexia groups had slow rapid naming speed when they were compared with both no-dyslexia groups and with the at-risk & resolving group.

Discussion
The current study investigated the stability of dyslexia in a prospective study from Kindergarten to Grade 8 in the context of a transparent orthography (Finnish). The children, half of whom had a familial risk for dyslexia, were categorized as having or not having dyslexia based on reading fluency measures in Grades 2 and 8. This yielded three groups of children with reading problems at some point, referred to as resolving, late-emerging, and persistent dyslexia. The three groups were compared with each other and with a group without dyslexia in six measures: reading fluency, language and cognitive skills, parental reading fluency and cognitive skills, parental education, and the amount of book reading. The group comparisons revealed differences both in the children's reading-related cognitive development and in a set of similar skills measured in their parents. Parental education, book reading frequency, and children's IQ, on the other hand, did not differentiate the groups. Our findings indicated that reading status was not stable, because fewer than half of the RD children met the dyslexia criteria in both Grades 2 and 8. In fact, of the 55 children identified as having RD at some point, 15 (27 %) met the dyslexia criteria in Grade 2 only (resolving), and 18 (33 %) only in Grade 8 (late-emerging). An examination of reading fluency using group-administered tasks that were external to the dyslexia criteria validated the groups. The previous longitudinal studies examining the stability of dyslexia were conducted in English (Catts et al. 2012; Etmanskie et al. 2015; Leach et al. 2003; Lipka et al. 2006). In spite of differences in orthographic complexity between this study and previous ones, we found roughly the same proportion of children with late-emerging RD as Catts et al. (2012) and Leach et al. (2003). Nevertheless, the proportion of the resolving group was twice as large as that in Leach et al. (2003) and four times as large as that in Catts et al. (2012). Such cases were not reported in Lipka et al. (2006).
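The proportions reported above follow directly from the group sizes. The Python lines below simply redo that arithmetic as a check; the variable names are ours, and the persistent share (40 %) is derived here from the group size of 22 given in the abstract.

```python
# Group sizes among children who met the dyslexia criteria at some point.
rd_groups = {"resolving": 15, "late_emerging": 18, "persistent": 22}

ever_rd = sum(rd_groups.values())  # 55 children in total

# Percentage of all RD cases in each group, rounded to whole percent.
shares = {name: round(100 * n / ever_rd) for name, n in rd_groups.items()}

print(ever_rd)  # 55
print(shares)   # {'resolving': 27, 'late_emerging': 33, 'persistent': 40}
```

Because the persistent group makes up 40 % of all RD cases, fewer than half of the children with RD met the criteria in both grades.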
The letter-sound connections in Finnish, which are easier to learn than in English, could enable more children to catch up despite early cognitive difficulties. This suggestion is supported by cross-linguistic comparisons, which have shown that dyslexic children's reading is less severely impaired in low-complexity orthographies than in high-complexity ones (Landerl et al. 2012). In addition, the use of reading speed as a measure of RD and the longer follow-up period may explain why the proportion of resolving RD is higher in the present study. Although typical Finnish children are already fluent readers in the spring of Grade 2, reading fluency continues to develop rapidly. Grade 2 spring is an interesting assessment time because at this point a pedagogical shift occurs in Finnish schools: starting from Grade 3, the emphasis shifts from learning to read to learning by reading, children are expected to read more, and the demands for reading speed increase. It seems that for some children the development of reading fluency takes longer, but that they can catch up later on.
The process of identifying these groups raises a number of questions: Can these groups of children be identified early on? What intrinsic or extrinsic factors can help children to overcome reading impairments? What factors cause them to succumb to reading impairments later on? The examination of the cognitive differences between the groups showed, in line with the body of literature on early cognitive precursors of dyslexia (e.g., Puolakanaho et al. 2007; Snowling et al. 2003; van Bergen et al. 2014b), the close link of rapid naming, letter knowledge, and phonological awareness to RD. The findings for verbal short-term memory and vocabulary were not as consistent, because all RD groups showed moderate difficulty in verbal short-term memory tasks and only the resolving group had clear early vocabulary difficulties. The cognitive difficulties were limited to skills closely linked to reading. There were indeed no group differences in IQ.
The persistent-dyslexia group had early and persisting deficits across the cognitive foundations of reading, as expected. Their performance in early phonological awareness and expressive vocabulary, however, was not significantly poorer than in the typically reading group. This result, which contradicts previous studies (Catts et al. 2012; Lipka et al. 2006) on the stability of RD, can be explained by differences in the orthography and in the RD classification criteria. The role of phonological awareness in transparent orthographies has been shown to be limited to the very beginning of reading acquisition and particularly to reading accuracy (e.g., de Jong and van der Leij 2002; Landerl and Wimmer 2008). The present study adopted RD criteria that use reading fluency measures because reading accuracy approaches a ceiling even with nonword reading measures in Grade 2. The studies conducted in English, however, used mainly reading accuracy measures (e.g., Lipka et al. 2006) or reading accuracy and comprehension measures (e.g., Etmanskie et al. 2015) in their RD criteria.
Fig. 3 Parental skills by risk group and RD group
The late-emerging group differed significantly from the no-dyslexia group, especially in rapid naming, both prior to and after school entry. This finding was expected because rapid naming has been shown to be a strong predictor of reading speed measures (e.g., Puolakanaho et al. 2007; van Bergen et al. 2014a). The skills of the late-emerging group seemed to be sufficient for the early grades, but not for reaching the typical level of reading fluency in later grades. This finding is in accordance with the idea that the major bottleneck in reading development in Finnish is in reading speed and rapid naming, whereas in English the development of reading accuracy may also be problematic and is linked to phonological skills. Some of the previous studies on dyslexia stability in English (e.g., Catts et al. 2012; Etmanskie et al. 2015) included reading comprehension in their assessment. For comparison, we additionally looked at reading comprehension in the late-emerging group in Grades 2, 3, and 9. Their comprehension skills were age appropriate, which is in line with previous findings (e.g., Torppa et al. 2007) that in Finnish it is possible to attain adequate reading comprehension despite slow reading.
The resolving group showed difficulties in phonological awareness, letter identification, rapid naming, and vocabulary prior to school entry. Surprisingly, the resolving group had the lowest level of expressive vocabulary. By school age, however, the cognitive differences between the resolving and no-dyslexia groups had disappeared, suggesting that these children suffered from a developmental delay rather than from permanent cognitive deficits. Another explanation for the fast catch-up of the resolving group may be that school entry meant a clear improvement in the environmental support for reading-related skills. Although our measures of parental education and the reported amount of shared reading did not show significant group differences, more sensitive measures of the home environment might have done so.
Comparisons between the groups in terms of parental skills also revealed interesting differences. First, as expected, children with a familial risk for dyslexia were overrepresented in the dyslexia groups. Second, parental skills differed between the groups. Interestingly, the parents of the late-emerging and persistent groups had slower naming rates than the parents of the other groups, a result which matches the findings at the child level. This finding suggests that the late-emerging and persistent groups had a stronger vulnerability for developing difficulties in rapid naming and reading fluency. These findings support previous studies showing that parental rapid naming is predictive of offspring's naming and reading fluency (Torppa et al. 2011; van Bergen et al. 2014b). They thus support the notion that parental skills are informative of their offspring's liability for dyslexia (van Bergen et al. 2014b).
Finally, the gender difference between the late-emerging (22 % girls) and resolving (80 % girls) groups was striking. This finding is in line with the evidence that girls outperform boys in reading in upper grades (OECD 2010a, b) and that fewer girls have reading disabilities (Rutter et al. 2004). In the present data, dyslexia in Grade 2 was as prevalent among boys as among girls, whereas in Grade 8, 65 % of the dyslexic adolescents were male. One explanation is that girls are more motivated to do schoolwork (Li and Lerner 2011) and to read. Our related finding of no group differences in the amount of book reading, however, does not support the link between book reading and skill development. A more comprehensive measurement of print exposure, also including digital reading and school engagement, might nevertheless show different results.
It has been proposed that in the diagnosis of specific learning disabilities (DSM-5) (see Tannock 2013), diagnostic criteria should include the early onset of symptoms of the disability. The late-emerging group is interesting in relation to this proposal, because its members do not meet this criterion. As a result, they do not fulfill the proposed dyslexia criteria, despite the observation that their group mean in Grade 8 reading fluency was two standard deviations below that of the unaffected adolescents. Tannock (2013) states that the symptoms "may not become fully manifest until the learning demands exceed the individual's limited capacities" (p. 19). Based on the current findings, the early symptoms of the late-emerging group are mild and may not be evident in the early grades, even if reading is assessed more than once (e.g., in Grades 2 and 3). However, slow naming speed in children and their parents seems to be a warning sign that reading speed may develop more slowly later on.
There are certain limitations in this work that need to be considered. First, our findings call for replication, because the late-emerging and resolving groups were rather small. Second, because the data came from a family risk study (see Lyytinen et al. 2008), the sample includes a higher prevalence of dyslexia than expected in the general population. It is therefore not suited for estimating the prevalence of persistent, resolving, and late-emerging dyslexia in the population. Third, the instability of the RD definition is partially attributable to the use of a categorical approach (see Francis et al. 2005). However, our simulations showed that only 5.6 % of the changes were due to random movement across the cut-off criterion caused by unreliability of the measures. The clinical question regarding the stability of dyslexia status nevertheless supports the use of a categorical approach. Additionally, adopting such an approach allowed comparisons with previous investigations. There are also different ways of defining categories. The low achievement approach we used seems to be one of the most stable and reliable ones, although the reliability of the diagnostic tests used is also critical (Brown Waesche et al. 2011). Our reading fluency tests showed high reliability, and the group differences were also validated with external reading speed measures.
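To illustrate how measurement unreliability alone can move children across a categorical cut-off, the following Python sketch simulates two assessments of the same underlying skill. It is a generic illustration and not the simulation used in this study, whose setup is not described here. The reliability (0.90), the cut-off (10th percentile), the sample size, and the function name `simulate_status_changes` are all assumptions chosen for the example, so the output is not expected to reproduce the 5.6 % figure.

```python
import random
from statistics import NormalDist


def simulate_status_changes(n=100_000, reliability=0.90,
                            cutoff_percentile=0.10, seed=0):
    """Share of simulated children whose status differs between two
    assessments purely through measurement error (true skill held fixed).

    All parameter values are illustrative assumptions, not study values.
    """
    rng = random.Random(seed)
    # Observed score = true score + error, scaled so observed variance is 1.
    true_sd = reliability ** 0.5
    error_sd = (1.0 - reliability) ** 0.5
    # Cut-off expressed as a z-score on the observed-score scale.
    cutoff = NormalDist().inv_cdf(cutoff_percentile)

    changed = 0
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)
        # Two independent measurements of the same unchanged true score.
        below_t1 = true_score + rng.gauss(0.0, error_sd) < cutoff
        below_t2 = true_score + rng.gauss(0.0, error_sd) < cutoff
        if below_t1 != below_t2:
            changed += 1
    return changed / n


share = simulate_status_changes()
print(f"Status changes caused by measurement error alone: {share:.1%}")
```

In this sketch, raising `reliability` toward 1.0 shrinks the share of spurious status changes toward zero, which is why highly reliable fluency measures matter when a categorical definition is used.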
In conclusion, even in a language environment where children read very accurately after 2 years of reading instruction, we found that reading status was not yet stable at this age. This finding has several clinical implications. First, it is important to continue following children's literacy development beyond the early grades. Second, support needs to be provided not only for those who receive an early diagnosis, but also for those who begin lagging behind later in their development. Only continuous follow-ups can detect children who fall behind later on. If an official diagnosis is needed to access extra support, which is the case in several countries, children with late-emerging dyslexia will be deprived of the intervention and adaptations they need.