Syntactic complexity in Finnish-background EFL learners' writing at CEFR levels A1–B2

Abstract: The increasing importance of the Common European Framework of Reference (CEFR) has led to research on the linguistic characteristics of its levels, as such research would help the application of the CEFR in the design of teaching materials, courses, and assessments. This study investigated whether CEFR levels can be distinguished with reference to syntactic complexity (SC). 14- and 17-year-old Finnish learners of English (N=397) completed three writing tasks, which were rated against the CEFR levels. The ratings were analysed with multi-facet Rasch analysis and the texts were analysed with automated tools. Findings suggest that the clearest separators at the lower CEFR levels (A1–A2) were mean sentence and T-unit length, variation in sentence length, infinitive density, clauses per sentence or T-unit, and verb phrases per T-unit. For the higher levels (B1–B2) they were modifiers per noun phrase, mean clause length, complex nominals per clause, and left embeddedness. The results support previous findings that the length of, and variation in, the longer production units (sentences, T-units) are the SC indices that most clearly separate the lower CEFR levels, whereas the higher levels are best distinguished in terms of complexity at the clausal and phrasal levels.


Introduction
The Common European Framework of Reference (CEFR; Council of Europe 2001) is arguably the most influential European initiative in foreign language education. Since its introduction, the CEFR has rapidly become the framework for language education across Europe. The CEFR is seen to have general value for language learning, teaching and assessment. In particular, its 6-point scale defining levels of proficiency from basic to very advanced is now widely used to describe the level of language examinations, curricula, courses, materials, and targets for learning. The importance of the CEFR has, however, also brought attention to its limitations.
The most severe issue with the CEFR is probably that its proficiency scale (or its 50+ scales, in fact) is not adequately informed by second language acquisition (SLA) research (Hulstijn 2007, Hulstijn et al. 2010, North 2007, Wiśniewski 2017), even if the scale appears to define developmental stages in learning. A related limitation of the CEFR levels, when applying them to the design of level-specific materials, curricula, and assessments, is that they define what learners can do with the language; they do not specify which linguistic characteristics (e.g., words and structures) are required, or typically used, in particular foreign languages to perform the functions and activities described at each level.
These issues have led to calls for research on the relationship between the framework levels and the development of the linguistic aspects of proficiency. Language testers have been at the forefront of applying the CEFR and have faced the framework's limitations (e.g., Alderson 2007). To increase the validity and applicability of the CEFR levels, language testers and SLA researchers have conducted (often joint) research on the linguistic characteristics of the CEFR levels (see Bartning et al. 2010 and the studies reviewed below). Particularly the language testers interested in diagnostic assessment, that is, in predicting and understanding learners' strengths and weaknesses in their L2 skills in order to provide feedback to learners and propose action to address the identified weaknesses, have promoted such research (see Alderson 2007; Bartning et al. 2010; Huhta et al., forthcoming).
Such collaboration has many benefits. SLA researchers can use the CEFR levels as a reference point, which improves the interpretability of their findings, as such levels define informants' second or foreign language (L2) proficiency more transparently than in many previous SLA studies (Carlsen 2012). For their part, language testers can improve the validity of their assessments by grounding them better in SLA research.
The current study contributes to ongoing research on the linguistic basis of the CEFR by investigating two groups of teenage (14- and 17-year-old) Finnish-speaking learners of English as a foreign language (EFL). The study focuses on syntactic complexity (SC) in the learners' writing: how SC relates to communicative CEFR levels (i.e., writing ability as defined in those levels), and whether particular levels can be distinguished from one another in terms of SC.

Syntactic complexity in relation to CEFR levels
Syntactic complexity (SC) has been defined variously in the literature. In SLA research, the T-unit has been a central unit in SC analyses (e.g., Wolfe-Quintero et al. 1998), but several other indices have also been investigated, such as mean length of clause (e.g., Ortega 2003) or complex phrases and complex nominals per clause or T-unit (e.g., Lu 2011; for reviews, see, e.g., Wolfe-Quintero et al. 1998 and Ortega 2003). Language testers investigating the linguistic characteristics of different proficiency levels have used the same SC indices as SLA researchers (e.g., Lu 2011, Kyle and Crossley 2017).
Irrespective of how SC is defined and operationalised, it should be seen as part of a system that comprises several levels and dimensions. Bulté and Housen (2012, 2014) argue that SC is part of linguistic complexity, which, in turn, is part of absolute complexity, which concerns the number of different components of a particular linguistic feature and the relationships between those components. SC, Bulté and Housen (2012) maintain, comprises three levels: theoretical (the number of syntactic structures and their relationships), observational (how different language forms contribute to complexity at the sentence, clause, and phrase levels), and operational (quantitative indices of SC). Our study adopts Bulté and Housen's (2014: 45-46) definition of complexity "as an absolute, objective, and essentially quantitative property of language units, features, and (sub)systems thereof in terms of (i) the number and the nature of discrete parts that the unit/feature/system consists of and (ii) the number and the nature of the interconnections between the parts".
It should be noted that the indices of complexity typically used in SLA and some language testing research (e.g., mean length of T-units) are rather broad and have their limitations. Biber et al. (2020) argue that such omnibus measures cover a wide range of linguistic features and are thus not easy to interpret linguistically, and that a more detailed description of the structural, syntactic and functional features of the various linguistic elements is needed. This is an obvious limitation of such indices for attempts to develop diagnostic tests, even if the broad indices of complexity may suffice for the prediction stage in diagnostic assessment (e.g., Huhta et al., forthcoming). Furthermore, findings from multidimensional studies on register variation in speaking and writing indicate that grammatical complexity features often vary from one register to another (e.g., Biber 1992; Biber et al. 2020). Thus, findings from different studies may vary because of the different registers that the writing tasks elicited.
Since the current study is part of language testing research that aims to predict L2 learners' proficiency level from the syntactic complexity of their writing, we use traditional omnibus indices of SC. We also use data based on learner performances across several writing tasks, even though this unavoidably hides possible variation in SC due to register differences (see the Methods section for more information about the tasks).
Next, we review the literature on the relationship between SC in written L2 English and the CEFR proficiency levels. An early study by Kim (2004) explored SC in 33 scripts rated on CEFR scales. She found some SC features to distinguish levels A2 and B2: adverbial and adjective clauses per clause, clauses and dependent clauses per T-unit, dependent clauses per clause, and prepositional, participial and infinitive phrases per clause. Hawkins and Filipović (2012) and Green (2012) explored the CEFR-related Cambridge Learner Corpus and found that mean sentence length significantly differentiated all adjacent levels from A2 to C2. In addition, Green (2012) found mean noun phrase incidence and the mean number of modifiers per noun to differentiate B2 and C1, and sentence syntax similarity to distinguish C1 from C2. Verspoor et al. (2012) explored descriptive texts written by teenaged Dutch EFL learners on different topics and rated on a 5-point scale corresponding to CEFR levels A1.1, A1.2, A2, B1.1, and B1.2. They found that the distinction between simple and complex sentences was a strong proficiency level differentiator. Furthermore, sentence length differentiated the proficiency levels, and T-unit length increased from low to high proficiency levels, significantly differentiating A1.2 versus B1.1 and A2 versus B1.2. Relative clauses also increased across levels, showing clear differences between A2 and B1.1. The number of dependent clauses proved to be the only SC feature that differentiated across all adjacent levels studied. Gyllstad et al. (2014) analysed emails and stories written by 54 L1 Swedish EFL learners who were rated to represent CEFR levels A (A1-A2) or B (B1-B2). The researchers found the mean length of T-units, mean length of clauses, and clauses per T-unit to differentiate between the A and B levels. Alexopoulou et al.
(2017) explored SC in EFL learners' texts, analysing the EFCAMDAT Corpus (http://corpus.mml.cam.ac.uk/efcamdat), which is based on learners from different L1 backgrounds. They reported an increase in sentence length (across all CEFR levels), clause length (from A2 to B2), and clauses per T-unit (from A1 to B2) but did not report on the statistical significance of their findings. Barrot and Agdeppa (2021) used another corpus (ICNALE-Written; http://language.sakura.ne.jp/icnale/download.html) comprising essays written by EFL learners from 10 Asian countries. Over 5,000 essays placed at A2, B1.1, B1.2 and B2 (or above) based on learners' TOEFL and other EFL test results were investigated for 14 SC indices. They found several indices to distinguish those CEFR levels, particularly the length of clauses, sentences and T-units, and complex nominals per clause or T-unit. Martínez (2018) investigated 188 Spanish secondary-level EFL learners who wrote on an opinion topic. The students were from two grades corresponding to the A2 and B1 levels. Her study used SC indices proposed by Bulté and Housen (2014), which differ somewhat from those used in most CEFR-related SC studies. Martínez reported significant differences in the length of that-sentences, compound and complex sentence ratios, coordinate and dependent clause ratios, and noun phrases per clause. Finally, Khushik and Huhta (2020) compared teenaged EFL learners from two L1 (Finnish and Sindhi) backgrounds. Investigating one argumentative writing task and almost 30 indices of syntactic complexity, they discovered that most indices differentiated CEFR levels from A1 to B1 but that the results varied depending on the learners' L1.
Previous research on SC across CEFR levels is, thus, rather heterogeneous. The studies often focus on only a few, and different, indices, making it challenging to form an overall picture of which features differentiate CEFR levels in EFL learners' writing. The research methods in previous studies also vary considerably. For example, the number and type of the writing tasks vary, as do the conditions under which the tasks are completed. Furthermore, the small scale of some studies and the uncertain reliability of the placement of the writing samples on the CEFR levels make the specific conclusions uncertain. However, a consistent finding is that many SC indices increase as writing ability (CEFR level) improves.
The present study departs from most previous ones in at least three ways. First, it covers a wide range of SC indices to obtain a comprehensive picture of the relationship. Secondly, learners' writing skills were measured by combining the results of several writing tasks, because we investigate the SC typical of learners' writing at different proficiency levels rather than in particular tasks (see the Methods section). Thirdly, special attention was paid to the reliable placement of learners' scripts at the CEFR levels through direct double rating on the levels and the use of multi-facet Rasch analysis to mitigate unavoidable rater differences.

Goal and research questions
The study's goal was to shed light on the linguistic characteristics of the CEFR levels by focusing on syntactic complexity. The research questions were:

RQ1. To what extent is the syntactic complexity in the writing of two age groups of Finnish EFL learners related to their EFL writing ability? Which SC indices correlate most strongly with their ability, and do the two age groups differ?

RQ2. Which SC indices distinguish Finnish EFL learners at different CEFR levels, and do the two age groups differ?
To answer the RQs, we draw on a corpus of texts written by teenaged EFL learners, collected in a research project focusing on reading and writing development in L1 Finnish and L2 English (Khushik et al. XXXX). The corpus was collected from volunteer learners who completed the tasks in separate data collection sessions in their schools, supervised by researchers. The learners were given feedback on their performance, but the tasks were not used for grading purposes.

Participants
The participants represent two groups of EFL learners with Finnish as their L1: 14-year-olds in grade 8 of lower-secondary school (N=202) and 17-year-olds in grade two of the academic upper-secondary school (gymnasium; N=195).

Tasks
Both groups completed three writing tasks: one shared by both groups and two unique to each group. The shared task was designed in an earlier project focusing on L2 writing in Finland. The task was to express an opinion on one of two topics (should mobile phones be allowed at school; should boys and girls be taught in the same classes) and give reasons for the views expressed. The task design took the national curricula for EFL in secondary education into account; the researchers (university language teachers and researchers) considered the task to enable the stronger (B1-B2) students to display their writing ability while the weaker (A1-A2) students could also address the topics. The unique tasks came from the Pearson Test of English General (Pearson collaborated with the large-scale project): the two 8th graders' tasks were from the PTE B1-level test and the two gymnasium tasks from the B2-level test. The PTE tasks were retired operational tasks developed (including standard-setting to CEFR levels) by Pearson item writers. The B1 tasks were primarily descriptive, whereas the B2 tasks were similar to the shared task as they involved expressing a viewpoint and justifying it. The topics related chiefly to travelling (e.g., B1: travelling preferences between home and school; B2: opinion on cheap air travel; why a particular journey had been so unforgettable). The students were not told how their writing would be rated; they likely assumed they would be evaluated the same way their teacher(s) would evaluate them, which is known to vary, as teachers have great freedom to implement assessment in the Finnish educational system.

Ratings and rating analyses
An overlapping rating design was used that allowed the linking of all raters and tasks. Each rater was given a randomised batch of handwritten texts representing several tasks from both student groups. All texts (3 texts x 397 students, totalling 1,180 texts, as some students wrote only two texts due to absence from one data collection session) were rated by two raters from a pool of 11 raters. The raters were not told which texts were written by which age group. The raters were trained using the CEFR writing scales, the international benchmarks from the Council of Europe website, and local benchmarks from the earlier writing-focused study. The raters then assessed the texts on the CEFR scale A1-C2. The rating scale was a compilation of several scales taken verbatim from the CEFR, namely overall written production; written interaction; correspondence; notes, messages, forms; creative writing; thematic development; and coherence & cohesion. The scale thus focused on the communicative quality of the texts. We excluded the CEFR scales that explicitly address grammatical or lexical aspects of proficiency to decrease potential circularity in the data, although raters can, of course, be influenced by features in learners' writing (e.g., syntactic complexity) other than those defined in the scale.
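The logic of such an overlapping design can be illustrated with a minimal sketch. The use of simple random pairing (rather than the project's actual batch rotation), the rater IDs, and the function name are our assumptions for illustration only.

```python
import random

def assign_raters(n_texts, rater_pool, seed=0):
    """Assign two distinct raters to every text.

    Over many texts, random pairing links all raters through shared
    texts, which is what makes a joint Rasch analysis of the full
    rating matrix possible.
    """
    rng = random.Random(seed)
    return [rng.sample(rater_pool, 2) for _ in range(n_texts)]

rater_pool = [f"R{i:02d}" for i in range(1, 12)]  # pool of 11 raters (IDs invented)
plan = assign_raters(1180, rater_pool)            # 1,180 texts, two ratings each
```

Any assignment scheme will do for the later Rasch analysis as long as every text gets two distinct raters and the rater-by-text matrix remains connected.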
The ratings were coded for analysis by converting the CEFR levels to numbers (A1=1, A2=2, B1=3, B2=4). Multi-facet Rasch analysis was then conducted with Facets (Linacre 2009) on the combined grade 8 and gymnasium rating data, including all tasks and raters. Multi-facet Rasch analysis is currently the standard approach to analysing ratings in language testing (e.g., McNamara and Knoch 2012; Aryadoust, Ng and Sayama 2021), as it can adjust for differences in rater severity and task difficulty when estimating learner ability, producing an ability measure that is more accurate than, for example, an average across (raw) ratings. Furthermore, the ability measures for learners from Facets are equal-interval scale values (logit values) accompanied by parallel ability measures, called fair averages, that are on the same metric as the CEFR-based rating scale. Thus, in our study, we categorised the learners onto the levels A1-B2, for investigating whether specific SC indices differentiate CEFR levels, by rounding the fair averages to the nearest whole CEFR level (e.g., 2.25 was rounded down to 2, corresponding to A2, and 2.65 up to 3, or B1).
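The coding and rounding steps above can be sketched as follows; the clamping of out-of-range values to A1-B2 and the function names are our own assumptions.

```python
import math

# CEFR levels coded as numbers, as described in the text.
CEFR_TO_NUM = {"A1": 1, "A2": 2, "B1": 3, "B2": 4}
NUM_TO_CEFR = {v: k for k, v in CEFR_TO_NUM.items()}

def round_to_level(fair_average):
    """Round a Facets fair average to the nearest whole CEFR level."""
    nearest = math.floor(fair_average + 0.5)  # round half up
    nearest = min(max(nearest, 1), 4)         # clamp to the A1..B2 range
    return NUM_TO_CEFR[nearest]

print(round_to_level(2.25))  # A2
print(round_to_level(2.65))  # B1
```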
Our decision to combine in the analysis the three writing tasks that each learner wrote, rather than analyse them separately, was based on two related considerations. First, the study contributes to research on the linguistic characteristics of the CEFR proficiency levels (e.g., Bartning et al. 2010 and the studies reviewed above). Thus, the focus was on what characterises the writing of learners whose writing ability has been assessed to correspond to particular CEFR levels. Second, our perspective is that of language assessment, where it is common to use multiple tasks to increase the reliability and generalizability of the ability estimates. For example, van den Bergh et al. (2012: 23) state that "to measure writing skills reliably, one needs multiple assignments rated by multiple raters". Incidentally, the developers of the TOEFL iBT found that three tasks were required for obtaining adequate reliability (Chapelle 2008: 331).
Rating quality was investigated by examining the raters' Infit Mean Square values, which should usually range from 0.6 to 1.5 (e.g., Engelhard 1994). Rater fit was considered appropriate, as all Infit Mean Square statistics were smaller than 1.3. All point-biserial estimates for the raters were positive, between .27 and .65 (for 9 of the 11 raters, they exceeded .42). This suggests that the raters applied the scale in a relatively consistent way, although their severity varied. However, since Facets adjusts the ability measures by taking rater severity into account, these differences did not prevent a reliable estimation of learners' EFL writing ability, particularly when the ability measures were based on three writing tasks. After rating, the handwritten scripts were transcribed for automated analyses.
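The rater fit screen described above amounts to a simple range check on the Infit Mean Square statistics; the rater IDs and infit values below are invented for illustration.

```python
def misfitting_raters(infit_by_rater, low=0.6, high=1.5):
    """Return raters whose Infit Mean Square falls outside the
    conventional acceptable range (Engelhard 1994)."""
    return [r for r, v in sorted(infit_by_rater.items()) if not low <= v <= high]

infit = {"R01": 0.85, "R02": 1.28, "R03": 1.62}  # invented values
print(misfitting_raters(infit))  # ['R03']
```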

Modification of the texts
The scripts were slightly modified for the automated analyses. Misspelt words were corrected to allow the tools to identify words correctly, and any missing periods were added to the ends of sentences to ensure correct identification of sentence boundaries. Other punctuation, grammatical errors or incorrect word choices were not corrected (on data cleaning, see McNamara et al. 2014: 155-6). No texts were removed from the corpus in the rating and data cleaning stages.
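The sentence-boundary part of this cleaning step can be sketched as below; the exact rules the project applied are not documented here, so this is a minimal illustration (the spell correction step is not shown).

```python
def add_missing_final_period(text):
    """Append a period when a script does not end in sentence-final
    punctuation, so that automated tools detect the final sentence
    boundary correctly."""
    text = text.rstrip()
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(add_missing_final_period("I like to travel by train"))  # period added
print(add_missing_final_period("Why not?"))                   # left unchanged
```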

Linguistic analysis of learners' writing
Each script was investigated with two tools designed to analyse English texts: the L2 Syntactic Complexity Analyzer and Coh-Metrix. The L2 Syntactic Complexity Analyzer (L2SCA) is a freely available UNIX-based research tool that calculates 14 SC indices (see Table 1 and Lu 2010). L2SCA consists of three components: a parser (the Stanford parser), a procedure for counting the production units, and an SC analyser. From the many Coh-Metrix indices, we chose 16 that relate to SC (see Table 2 and Graesser et al. 2004), including the following:

Syntactic simplicity: the degree to which sentences contain fewer vs. more words and use simple vs. complex syntactic structures.

Left embeddedness: mean number of words before the main verb. Sentences with many words before the main verb are often structurally dense, syntactically ambiguous, or ungrammatical (Graesser et al. 2004) and difficult to process.

Modifiers per noun phrase: mean number of modifiers per noun phrase.

Minimal edit distance for parts of speech: a combination of semantic and syntactic dissimilarity and distance between parts of speech across sentences (McCarthy et al. 2009).

Sentence syntax similarity (adjacent sentences): degree of uniformity and consistency of the syntactic constructions.
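Although L2SCA itself performs the parsing and the counting of production units, the ratio indices it reports are simple functions of the raw counts (Lu 2010). A sketch with invented counts (a real analysis would obtain the counts from the Stanford parser output):

```python
from dataclasses import dataclass

@dataclass
class Counts:
    """Raw unit counts for one text (values here are invented)."""
    words: int
    sentences: int
    t_units: int
    clauses: int
    complex_nominals: int

def sc_indices(c):
    """Derive a few L2SCA-style ratio indices from raw unit counts."""
    return {
        "MLS": c.words / c.sentences,            # mean length of sentence
        "MLT": c.words / c.t_units,              # mean length of T-unit
        "MLC": c.words / c.clauses,              # mean length of clause
        "C/T": c.clauses / c.t_units,            # clauses per T-unit
        "CN/C": c.complex_nominals / c.clauses,  # complex nominals per clause
    }

toy = Counts(words=120, sentences=8, t_units=10, clauses=15, complex_nominals=6)
print(sc_indices(toy))
```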

Statistical analyses
Pearson correlation coefficients were used to investigate the relationship between the SC indices and the writing proficiency ratings (i.e., the learner ability measures from Facets). To examine the differences between learners placed at different CEFR levels, several MANOVAs were run on groups of independent variables (i.e., the count variables and the SC variables from L2SCA and Coh-Metrix) to investigate overall differences between CEFR levels. These were followed by univariate tests (in MANOVA) to examine differences between adjacent CEFR levels. Bonferroni correction was applied to control the familywise error rate associated with the pairwise comparison of several groups (CEFR levels).
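The Bonferroni step divides the nominal alpha by the number of pairwise contrasts; with the four levels A1-B2 there are six such contrasts. A minimal sketch (the .05 alpha is the conventional value, not stated in the text):

```python
from itertools import combinations

levels = ["A1", "A2", "B1", "B2"]
pairs = list(combinations(levels, 2))  # all pairwise level contrasts
alpha = 0.05
adjusted_alpha = alpha / len(pairs)    # each contrast tested at alpha/6

print(len(pairs))      # 6
print(adjusted_alpha)  # roughly 0.0083
```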

Results
Table 3 shows the distribution of the learners' overall writing ability across the CEFR levels, based on rounding the Facets fair average values to the nearest whole CEFR level. The ability to write in English varied considerably among the eighth graders despite their having studied the language at school since grade three. The largest proportion (43%) were at A2 and many were also at B1 (35%), but quite a few were still at A1 (18%), and only a few at B2. In contrast, almost two thirds (64%) of the gymnasium students were at B1, and the rest at A2 or B2 (16% and 20%, respectively). The higher and more homogeneous results achieved by the gymnasium students are explained by the fact that they had studied English three years longer and that gymnasia are attended mainly by the more academically oriented students.

Relationship between syntactic complexity and writing ability
To address Research Question 1 (are SC and writing ability related in Finnish EFL learners?), correlation coefficients were computed between the SC indices obtained from the two computer tools and the writing ability measures from Facets. First, we report the correlations between the number of different kinds of linguistic units in the learners' writing and their writing ability (see Table 4; Figures 1 and 2).
The number (count) of such units (words, clauses, T-units, sentences) indicates text length, which has been found to relate to ratings of L2 writing quality: longer texts are generally considered better than shorter texts and are awarded higher ability ratings. The specific reason for investigating this here was to see whether the correlations in the two age groups were equally strong. The most detailed index of text length, the number of words, correlated most strongly with writing ability in both groups (.822 in grade 8; .621 in the gymnasium). However, the counts of all other linguistic units also correlated significantly (at the p < .001 level) with ability in both groups. Other strong correlates were the number of complex nominals (.625 and .594 in grade 8 and the gymnasium, respectively) and clauses (.726 and .472, respectively). There were also differences between the groups: the largest concerned the sentence count (.573 in grade 8 but only .247 in the gymnasium) and the number of dependent clauses (.633 vs .283, respectively). Overall, the most notable difference was that the correlations for almost all count variables were clearly stronger in grade 8 (the only exceptions were coordinate phrases and complex nominals). The amount of language produced by the learners, irrespective of the unit of analysis, was thus a stronger correlate of writing ability in grade 8, whereas its importance was smaller in the more able gymnasium group.
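The correlations reported here are plain Pearson coefficients between a count (or an SC index) and the Facets ability measure. A self-contained sketch with invented data:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

word_counts = [45, 80, 120, 150, 210]  # invented text lengths
ability = [-1.2, -0.3, 0.4, 0.9, 1.8]  # invented Facets logit measures
r = pearson_r(word_counts, ability)    # strong positive correlation
```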

However, to address RQ1 more comprehensively, we investigated indices representing different aspects of SC (see Tables 1 and 2). The indices in Table 5 concern the length of production units. These are typically operationalised as the mean lengths of clauses, T-units and sentences, and as their standard deviations. Table 5 reports the correlations between the measures relating to the length of production units and the writing ability measures from Facets. All the correlations are statistically significant but low or moderate. Again, most correlations are stronger for the 8th graders, particularly those concerning sentence length (over .4 for both the mean length and the variability in sentence length) and T-unit length. However, the mean length of clauses was a stronger correlate of writing ability in the gymnasium than in grade 8. The measures of subordination and coordination differed between the groups (see Table 6). Almost all subordination measures correlated modestly with writing ability in grade 8, but no significant correlations were found for the gymnasium. The highest correlations in grade 8 were found for verb phrases per T-unit (.262), dependent clauses per clause (.224), and clauses per sentence (.220). Of the coordination measures, only the ratio of T-units per sentence had a small significant correlation with writing ability in grade 8. Particular SC structures were also related to the ratings of writing ability: the number of complex nominals per clause and per T-unit in the gymnasium, and verb phrases per T-unit in grade 8.

Syntactic complexity as a way to distinguish CEFR writing ability levels
To address Research Question 2, on whether certain syntactic complexity features distinguish specific CEFR levels, multivariate analyses of variance were used to compare SC features across the levels. Tables A and B in Appendix 1 present the means and standard deviations of the count variables for the two learner groups.
The counts were calculated with the L2SCA. As the relatively high correlations between the count variables and writing ability suggested, the numbers of words, clauses, sentences and phrases increased steadily across the levels (see Figures 1 and 2). Tables 8 and 9 summarise the results of the multivariate analyses of variance with the CEFR writing level as the independent variable and the counts of the various linguistic units as dependent variables. It should be noted that an omnibus MANOVA was first conducted on the indices reported in each table; in each case, the results were statistically significant, which then warranted the univariate analyses of each SC index, reported as the overall F- and p-values, as well as effect sizes, in Tables 8-13. Tables 8 and 9 show that, overall, all the count indices separated the CEFR levels significantly. The separation was clearer in grade 8, as indicated by the larger effect sizes, than in the gymnasium. The tables also show that the number of words learners wrote differed between almost all adjacent CEFR levels. Moreover, almost all the count variables distinguished between A1 and A2 writers on the one hand and between A2 and B1 writers on the other in grade 8. In contrast, these variables did not clearly distinguish A2 and B1 in the gymnasium but did a better job of separating B1 from B2, particularly the numbers of complex nominals, complex T-units, clauses, and sentences. Tables C and D (Appendix 1) display the means and standard deviations across the CEFR levels for the SC variables obtained from L2SCA. The tables show the mean length of the production units increasing from level to level (Figure 3). A similar trend can be seen for sentence complexity (clauses per sentence) and such structures as the number of complex nominals per clause or T-unit. Tables 10-11 report the statistical significance of the differences for the SC variables obtained from L2SCA, both overall (across CEFR levels) and between adjacent CEFR levels (see also Figure 3). The length
of the production units separated the levels significantly: sentence and T-unit length distinguished A1 vs A2, and clause length B1 vs B2. Sentence complexity increased significantly from A1 to A2. Similarly, the only significant subordination index (clauses per T-unit) distinguished between A1 and A2 but not above. Two coordination indices separated the CEFR levels overall but failed to distinguish between adjacent levels. In contrast, particular syntactic structures turned out to be significant: the number of verb phrases per T-unit distinguished A1 and A2, whereas the number of complex nominals per clause separated B1 from B2. Finally, we report the results for the somewhat different SC indices from Coh-Metrix (see Tables E and F in Appendix 1 for the means and standard deviations).
Coh-Metrix reports both the mean sentence length and its standard deviation. The tables show that both the mean and the standard deviation of sentence length generally increased from level to level. The syntactic simplicity indices showed a slight downward trend, implying that syntax becomes more complex as proficiency improves.
A similar trend can be seen for syntactic similarity. Left embeddedness and the number of modifiers per noun phrase increased slightly from level to level. The density measures displayed both downward (noun and negation density) and upward (adverbial, preposition and passive voice density) trends. Tables 12-13 report the statistical significance of the overall and between-level differences in the SC variables from Coh-Metrix (see also Figures 8, 9 and 10). The standard deviation of the mean sentence length mainly separates the three lowest levels (A1-B1). Overall syntactic simplicity decreased from the lower to the higher levels (particularly in the gymnasium), but none of the adjacent levels was separable.
The sentence syntax similarity indices distinguished the CEFR levels more clearly, but the only significant pairwise difference was found between A2 and B1 (in the gymnasium). Left embeddedness and the number of modifiers per noun phrase separated the B1 and B2 levels but not the levels below. The minimal edit distance for parts of speech separated A1 and A2 but not the levels beyond. Of the density measures, only infinitive and noun phrase density distinguished CEFR levels: the former between A1 and A2 and between A2 and B1, and the latter between A2 and B1, all in grade 8.

Discussion
The study sheds light on the linguistic characteristics of the CEFR levels by focusing on syntactic complexity in the writing of two groups of Finnish-speaking EFL learners aged 14 and 17. The groups also differed in terms of proficiency: the writing ability of the older gymnasium students was higher, since they had studied English longer. Therefore, the comparison of the A1 and A2 levels was possible only for the 8th graders, as there were no A1 writers in the gymnasium. For its part, the comparison between B1 and B2 was possible, in practice, only among the gymnasium students, since there were only eight B2 writers in grade 8.

Our RQ1 concerned the relationship between the syntactic complexity in the learners' writing and their writing ability, based on three double-rated writing tasks, and whether the results varied across the two groups. First, we found that text length (number of words, clauses, sentences, etc.) correlated strongly with ability (even over .8); the correlations were stronger in the younger group. This suggests that raw text length may be a stronger indicator of L2 writing ability in the early stages of L2 learning, after which its role diminishes, though it may not disappear, at least not before B2. As for the actual indices of SC, we found that the lengths of the production units (e.g., clauses, sentences) correlated significantly but only moderately with writing quality, and more strongly among the 8th graders. The findings confirm the expectation that simple counts of linguistic units are often quite good predictors of learners' L2 (writing) ability, including counts of such SC-related units as dependent clauses, complex nominals and complex T-units, even if there appear to be differences that relate to learners' age and/or ability.
We discuss the second RQ (whether SC separates CEFR levels) below and compare the findings with previous research. There are still relatively few studies on the relationship between SC in EFL writing and CEFR levels. Table 14 summarises the significant differences in SC between CEFR levels in both our study and previous research. A direct comparison of our findings with those reported previously is complicated, since both the SC indices investigated and the way the results are reported vary across studies.
Such caveats notwithstanding, Table 14 allows us to compare different studies and examine trends in research on SC. The present study is referred to with the letter 'A' in Table 14: A8 refers to grade 8 and AG to the gymnasium. The previous studies are numbered from one to nine (see the key after the table).

Overall, Table 14 shows that a wide range of SC indices has been found to distinguish CEFR levels. Mean sentence length is a consistent separator across the entire scale (Alexopoulou et al. 2017; Hawkins and Filipović 2012; Barrot and Agdeppa 2021). In our study, it was a significant separator of the levels in the overall analysis for both age groups, but only the A1 vs A2 pairwise comparison in grade 8 was significant. However, variation in sentence length (i.e., its standard deviation) increased significantly across A1-B1 for grade 8. T-unit length is a reasonably good separator in the A1-B1 range, whereas clause length seems to distinguish levels from A2 to B2. The current study partly concurs with these results, even though T-unit length only separated A1 from A2 (grade 8).
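The length-based indices discussed here are conceptually simple. The following is a minimal sketch of how mean sentence length and its standard deviation could be computed; it assumes naive punctuation-based sentence splitting and whitespace tokenisation, whereas the actual L2SCA and Coh-Metrix pipelines use proper tokenisers and parsers.

```python
import re
from statistics import mean, stdev

def sentence_length_stats(text):
    """Mean sentence length (in words) and its standard deviation.

    Naive sketch: sentences are split on ., ! and ?, and words on
    whitespace; real SC tools use full tokenisation and parsing.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    sd = stdev(lengths) if len(lengths) > 1 else 0.0
    return mean(lengths), sd

m, sd = sentence_length_stats(
    "I like cats. My brother has a very big black dog. We play."
)
# Sentence lengths are 3, 8 and 2 words, so m = 13/3 ≈ 4.33
```

A larger standard deviation indicates greater variation in sentence length across the text, which is the quantity reported above as separating A1-B1 in grade 8.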
Sentence-level complexity (clauses or T-units per sentence) has separated only the two lowest CEFR levels in previous research (and partly in this study, too), but other sentence-level indices designed by Bulté and Housen (2014) and employed by Martínez (2018), namely compound and complex sentence ratios, distinguished A2 from B1. In addition, Verspoor et al. (2012) reported that the proportions of simple and complex sentences separated A1 and A2.
Coh-Metrix includes general indices of syntactic simplicity, similarity and variability, but these appear not to have been investigated widely. Interestingly, Green (2012) found syntactic similarity to distinguish C1 and C2. We found the same for A2 vs B1, but only in the gymnasium. Furthermore, Khushik and Huhta (2020) found a tendency for syntactic similarity to decrease from A1 to B1, but the adjacent levels could not be significantly separated. In the present study, we found minimal edit distance to distinguish A1 vs A2 vs B1 in grade 8.
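The minimal edit distance index compares the part-of-speech tag sequences of adjacent sentences. A standard Levenshtein computation over tag sequences is sketched below; the example tag sequences are illustrative, and Coh-Metrix's exact tagging and normalisation may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: POS tags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# POS tag sequences of two adjacent sentences (tags are illustrative)
s1 = ["PRP", "VBP", "NNS"]             # "I like cats"
s2 = ["PRP", "VBP", "DT", "JJ", "NN"]  # "I have a big dog"
d = edit_distance(s1, s2)  # 3: one substitution, two insertions
```

Lower average distances indicate that consecutive sentences share similar syntactic frames, i.e., less syntactic variation across the text.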
A wide range of clause-level SC indices has been used previously. Clauses or dependent clauses per T-unit appear to distinguish levels in the A1-B2 range relatively consistently, but in our study clauses per T-unit separated only A1 vs A2. Dependent clauses per clause have also separated levels across A1-B2 in some previous research, but our study failed to replicate that. Martínez (2018), who used different SC indices from ours, found both coordinate and dependent clause ratios and noun phrases per clause to differentiate A2 and B1.
Several indices that are phrasal in nature (or perhaps borderline between phrasal and clausal) are included in Coh-Metrix, but apart from the current authors and Barrot and Agdeppa (2021), they have not been widely used in CEFR-related SC research. Barrot and Agdeppa found complex nominals per clause or T-unit to distinguish A2 vs B1 vs B2; we only found complex nominals per clause to separate B1 from B2. One of the most interesting of these indices is the number of modifiers per noun phrase, which Khushik and Huhta (2020) discovered to be the only SC index to show non-linear development from A1 to B1: it first decreased between A1 and A2 but then increased. In the current study, a comparison of A1 and A2 was possible only in the younger age group, where the value of this index indeed decreased from A1 to A2, but the difference was not significant. In the older age group, the index increased steadily from A2 onwards, and the increase was particularly pronounced between B1 and B2. Taken together, the two studies suggest that even if the number of modifiers might first decrease, it appears to increase after A2. Green's (2012) finding that this index separates C1 from B2 suggests that the trend continues even beyond B2.
Previous studies on the other phrasal-level indices have found some of them to separate certain CEFR levels. Infinitive density, in particular, seems to distinguish levels in the A1-B2 range, including in our study. Of the other such indices, left embeddedness distinguished only B1 vs B2, and adverbial phrase density only A2 vs B1.
In summary, our study sheds light on which SC indices distinguish the CEFR levels A1-B2, and we can compare these with the results of previous research. The effect sizes (Tables 10-13) indicate that the most important indices separating CEFR levels among the younger, less proficient learners were infinitive density, mean sentence length (and its standard deviation), T-unit length, and sentence syntax similarity across adjacent sentences. For the older, more proficient group, the key indices were the number of modifiers per noun phrase, mean clause length, sentence syntax similarity, edit distance, and left embeddedness. Combining these findings with those of previous research, we can tentatively conclude that the length of the longer production units (sentences and T-units) and variation in their length are among the key SC features that separate EFL writing from A1 to B1. What separates B1 from B2 and above is mainly related to complexity at the clausal and phrasal levels.

Limitations
Some limitations of the study and issues with the comparability of different studies need mentioning. In the literature review, we noted that differences across studies in the SC indices, tasks, learners' age and L1 background, and the reliability of placing writing samples on the CEFR levels all pose challenges to comparisons. Automated analyses can also be unreliable. For example, the Charniak parser (Charniak 2010) underlying Coh-Metrix is reported to achieve 89% accuracy with L1 English texts, and Crossley and McNamara (2014) estimate that the accuracy is likely lower for learner writing. Furthermore, the relatively short texts that many learners in our study wrote may not always provide sufficient data for reliable extraction of some SC features.
Our study did not investigate differences in SC between the writing tasks, as we aimed to obtain a more generalisable picture of SC by combining the results of several writing tasks, which is standard practice in language testing. This approach ignores task-related differences in SC due to register variation; however, our tasks represented only two broad registers (argumentation and narration), which partly addresses this limitation. One additional avenue for future research is, therefore, studies focusing on particular tasks and/or applying the Multidimensional Analysis paradigm, which has not yet been used in research on the linguistic basis of CEFR levels (see, e.g., Biber et al. 2020), and which has the potential to provide valuable insights into writing development, for example, for diagnostic assessment purposes (Huhta et al., forthcoming).
The number of learners in some groups in our study was relatively small (e.g., there were only eight 8th graders whose writing was estimated to be at B2). We decided to keep them in the analyses to find out whether any of the SC indices would nevertheless show significant differences between the B1 and B2 level learners in that age group. One such index was indeed found (word count; Table 8), and the number of verb phrases also came close to being a significant separator of B1 and B2 learners.
Another issue with our study, and with all CEFR-related research, is the CEFR scale itself. The scales are not ideal for rating purposes, since it is unclear how accurately they describe stages of L2 development (e.g., Hulstijn 2007) and since they describe proficiency in rather general terms, unlike scales explicitly developed for rating. Part of this issue is the uncertainty about how much attention the raters paid to SC when rating the performances, given that the scale descriptions did not directly refer to SC. It should be noted, however, that the Facets analyses indicated that the raters could use the scale systematically to distinguish learners with different levels of writing ability. Furthermore, the significant and relatively strong correlations between the learners' writing ability and the other EFL measures taken by the learners in the more extensive study of which this research was part (e.g., vocabulary, reading and dictation tests) give further credibility to the writing ability ratings.

Future research
Finally, Table 14 displays the state of the art of research on SC in written L2 English and thus suggests directions for further research. First, it shows that most research concerns the lower levels of proficiency, from A1 to B1. Hence, less is known about how SC separates B2, C1, and C2. Second, the table reflects the fact that most studies have covered only a limited set of SC indices; the gaps in the table (empty cells or cells with only one entry) are therefore often simply due to a lack of attention to the particular SC index in research. More wide-ranging studies of SC indices are needed. Furthermore, some of the studies suggest that the L1 background of language learners may affect SC in their L2 English texts: this is indicated by the different findings of Khushik and Huhta (2020) for the two L1 groups. Similarly, the current study found several differences in SC between the two age groups, even in the A2-B1 range. The fact that only one of the three writing tasks that each learner completed was the same in both groups makes it impossible to disentangle possible age and task effects. Nevertheless, a further conclusion is that both learners' age and the writing task(s) are possible sources of variation in syntactic complexity and should therefore be examined in more detail in the future. One additional direction for research is comparing the syntactic complexity of EFL learners at different CEFR levels with the SC of same-aged native English speakers. This would provide an additional perspective on SC in EFL learners' writing.

Figure 1: Error-bar charts for the essential count variables at different CEFR levels

Figure 2:

Figure 3: Error-bar charts for the mean clause, T-unit and sentence length, and clauses per sentence at different CEFR levels

Figure 4: Error-bar charts for subordination indices at different CEFR levels

Figure 7: Error-bar charts for syntactic simplicity and sentence syntax similarity indices at different CEFR levels

Table 1: Syntactic complexity indices in the L2 Syntactic Complexity Analyzer, based on Lu (2010)

Table 3: Learners' EFL writing ability in the two student groups as CEFR levels

Table 6: Correlations between measures of subordination and coordination and EFL writing ability

Table 7: Correlations between measures of syntactic similarity and simplicity and EFL writing ability

Some Coh-Metrix indices capture variation in syntactic simplicity and similarity (within paragraphs) and the number of modifiers per main word in sentences. The findings indicate that these indices relate more strongly to writing ability in the more able gymnasium group, where all the indices correlated significantly. Modifiers per noun phrase had the highest correlation (.433), but, interestingly, no significant correlation was found for grade 8. The syntactic simplicity and similarity indices all correlated over .2 with writing ability in the gymnasium, as did left embeddedness. Only the syntactic similarity measures and left embeddedness correlated with writing in grade 8, and only modestly (around .2 or lower). The negative correlations in Table 7 indicate that syntactically similar and simple writing (i.e., writing lacking variation across the text) was associated with lower writing ability.

Table 8: Count variables: summary of statistical significance of overall and between CEFR level differences in grade 8

Table 9: Count variables: summary of statistical significance of overall and between CEFR level differences in the gymnasium

Table 10: Syntactic complexity indices from L2SCA: summary of statistical significance of overall and between CEFR level differences in grade 8

Table 11: Syntactic complexity indices from L2SCA: summary of statistical significance of overall and between CEFR level differences in the gymnasium

Table 12: Syntactic complexity indices from Coh-Metrix: summary of statistical significance of overall and between CEFR level differences in grade 8

Table 13: Syntactic complexity indices from Coh-Metrix: summary of statistical significance of overall and between CEFR level differences in the gymnasium

Table 14: Summary of significant differences in syntactic complexity across CEFR levels in the current and previous studies

Table A: Descriptive statistics for the count variables across CEFR levels: grade 8

Table B: Descriptive statistics for the count variables across CEFR levels: gymnasium

Table C: Descriptive statistics for the syntactic complexity indices from L2SCA across CEFR levels: grade 8

Table D: Descriptive statistics for the syntactic complexity indices from L2SCA across CEFR levels: gymnasium

Table E: Descriptive statistics for the syntactic complexity indices from Coh-Metrix across CEFR levels: grade 8

Table F: Descriptive statistics for the syntactic complexity indices from Coh-Metrix across CEFR levels: gymnasium