Reliability Assessment of Scores from Video Recorded TGMD-3 Performances

Introduction
Fundamental motor/movement skills (FMS) are needed to manage the motor challenges of everyday life (Gallahue, Ozmun, & Goodway, 2012). Gallahue et al. (2012) defined such motor skills as balance skills (e.g., balancing on one foot), locomotor skills (e.g., walking, running, and hopping) and manipulative skills (e.g., ball handling skills). These FMS create a basis for children to learn more specific skills to participate in games or different sport activities (Gallahue et al., 2012). Children's motor competence becomes visible through their FMS performances and is positively associated with their physical activity level (Stodden et al., 2008). Therefore, it is important to follow the development and level of children's motor competence by observing their performances in different FMS. Today, as many children's motor competence and physical activity levels are low (Reilly, 2010; Roth et al., 2010), it is essential to find valid and reliable observational tools to measure children's motor competence. Psychometrically valid tools help researchers and teachers monitor change, the impact of interventions, and the impact of policies. Moreover, measurement tools are needed not only for diagnostic purposes but also to find associations and the significance of motor skills for overall development, daily wellbeing, and health (Robinson et al., 2015). This was well justified in the study by Cools, Martelaer, Samaey, and Andriens (2009), who analyzed seven different movement skill measurements. In addition, cultural comparisons also need measurement tools that are not too sensitive to cultural differences (Cools et al., 2009).
When doing research with children, ethical aspects need careful consideration. Observation as a research method is unobtrusive and in that sense much warranted. Unfortunately, the reliability of observational tools has been questioned. Earlier studies have used either video recordings or live assessments. The TGMD-2 (Ulrich, 2000) was used in the study by Slotte, Sääkslahti, Metsämuuronen, and Rintala (2015). They analyzed children's motor skills through video recordings and reported intrarater reliability for 24 children's motor skills. In their study, reliability expressed as an intraclass correlation (ICC) was 0.978 for locomotor skills and 0.995 for object-control skills. Another study, by Barnett, Minto, Lander, and Hardy (2014), also used the TGMD-2. They reported interrater reliability based on live observation for six object-control skills. Specifically, reliability for object-control skills was 0.93 (ICC), varying across individual skills from 0.71 (catch) to 0.94 (dribble). All reported values are in the acceptable range. More reliability studies are needed to provide test developers with valuable information about the characteristics of the test for future test development. For example, it cannot be assumed that the reliability values found for the TGMD-2, using either video recordings or live observations, apply as such to the TGMD-3. The TGMD-3, which was used in this study, is a process-oriented measurement in which children's FMS performances are observed and scored by a rater. The TGMD-3 is a new version of the TGMD-2 and likewise gathers observations of both locomotor and object-control FMS (called ball skills), but differs from the TGMD-2 in some individual skill components (Ulrich, 2016). In the locomotor skills, leaping is replaced with skipping, and in the ball skills, the underhand roll is replaced with the underhand throw. Moreover, the forehand strike is added, making altogether six locomotor skills and seven ball skills. As in the TGMD-2, the resulting score of each skill is based on the sum of the presence or absence of that skill's performance criteria (3-5 criteria depending on the skill). A more precise description of this tool can be found in another article (see Ulrich, 2013).
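As a numerical sketch of this scoring, assume (as in the TGMD manuals) that each criterion is marked 1 when present and 0 when absent on each of two trials and the marks are summed; the criterion values below are hypothetical:

```python
# Hypothetical criterion scores for one skill with four performance criteria.
# 1 = criterion present, 0 = criterion absent.
trial_1 = [1, 1, 0, 1]  # first trial
trial_2 = [1, 0, 0, 1]  # second trial of the same skill

# The skill's raw score is assumed here to be the sum across both trials.
skill_score = sum(trial_1) + sum(trial_2)
print(skill_score)  # 5 out of a possible 8
```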
The TGMD-3, like its earlier version, will probably be used by different professionals in practical settings such as schools (Cools et al., 2009). It will also be used for research purposes, where data must be as reliable as possible (Ulrich, 2016). Video recordings allow more detailed scrutiny and flexibility when doing assessments. Videos can also be replayed several times if needed, and replayed at slow speed when a performance criterion is difficult to observe without slow motion. Identifying the skills that are most and least challenging to score reliably from video also helps practitioners prepare for their live observations. The purpose of this study was to assess the reliability of the TGMD-3 through video-recorded performances. First, the consistency of the ratings within each of two independent assessors and, second, the consistency of the ratings between the two assessors were studied for each of the TGMD-3 individual skills. In addition, a more detailed analysis of the performance criteria that were most challenging to rate consistently was conducted.

Participants and Settings
Participants of this study were randomly selected from a larger study of children (n = 374, 3-10 years) from six elementary schools and eight day care centers/kindergartens in Central Finland who had performed the TGMD-3. Forty children's performances were used to study the intrarater reliability of the two assessors (A and B). The children scored by assessor A were 10 boys, ranging from 6-9 years (M = 7.8 ± 1.2), and 10 girls, ranging from 5-9 years (M = 7.4 ± 1.2). The children scored by assessor B were eight boys, ranging from 4-7 years (M = 6.6 ± 1.4), and 12 girls, ranging from 3-7 years (M = 6.1 ± 1.6). Another 20 children's performances (different from the previous 40 children) were randomly chosen for interrater reliability. These children were 10 boys, ranging from 4-6 years (M = 5.9 ± 0.7), and 10 girls, ranging from 5-6 years (M = 6.2 ± 0.5). Institutional approval of the research protocol and informed consent from parents were obtained prior to the study, which was approved by the university ethics committee. All children also had the right to refuse participation and to withdraw from testing at any time. None of the assessed children had a disability and/or impairment.

Procedure and Data Collection
All trials were conducted in school gymnasiums or similar locations that were suitable for the administration of the TGMD-3 according to the test instructions. In a few cases the space did not allow the full running distance specified in the test instructions. Children performed the TGMD-3 administered in pairs by a trained physical education professional (one of the authors) and a Master's student. The professionals were very familiar with administering the TGMD-2 and had used the test before, and the students (five altogether) had received a two-hour training on how to administer the test. One of the two instructed the performer and the other video recorded the performance. The camera was placed optimally (i.e., side view, frontal view or rear view) to best detect skill performance whenever the circumstances permitted. The skills were administered in the order of the scoring sheet, as depicted in Table 1. Preceding assessment, an accurate demonstration of the skill was given by the test administrator. Participants were tested in groups of 3-4 and were given one practice trial to ensure that each child understood what to do. One additional demonstration was given if a child did not seem to understand the task. Each participant then performed two trials individually for each gross motor skill.
Two physical education teachers with a Master's degree (different from the test administrators) assessed the test performances from the videos. Both teachers had a good knowledge base about children's motor skills and had assessed several hundred children's motor skills using the TGMD-3. These assessors had also participated in a two-hour training session organized by the first author to elaborate the performance criteria. They had also established 80% scoring reliability with the TGMD-3 author through electronic videos. In rating performances, the scoring system was the following: a score of 1 meant the criterion was performed accurately, and a score of 0 meant the criterion was not performed accurately or not performed at all.
To determine intrarater reliability, first, the two assessors each coded 20 children's skill performances twice, with an interval of about three months before the second coding. Second, each assessor's ability to score the performance criteria of the 13 individual skills consistently between the first and second evaluations was analyzed.
To determine interrater reliability, first, the two assessors (A and B) independently coded the same 20 children from the videos. Second, the two assessors' agreement on the scoring of the performance criteria of the 13 individual skills was analyzed.

Statistical Analysis
To determine intrarater and interrater reliability, a kappa statistic (Cohen, 1960) and a percent agreement calculation were used. As in a previous study (Barnett et al., 2014), in which the reliability of children's gross motor skills measured with the TGMD-2 was assessed, we used the magnitudes of Landis and Koch (1977) to characterize the resulting statistics: a kappa statistic of 0.20 or below was considered slight agreement; between 0.21 and 0.40, fair; between 0.41 and 0.60, moderate; and 0.61 and above, substantial. Percent agreement was also calculated for each individual skill. The significance level was set at 0.05. Data were analyzed using SPSS (version 22 for Windows).
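Both statistics can be computed directly from paired 0/1 criterion scores. The following is a minimal Python sketch (not the SPSS routine used in the study) applying the Landis and Koch (1977) cut-offs as described above; the ratings are hypothetical:

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of items on which two ratings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's (1960) kappa for two ratings of the same items.

    Returns None when chance agreement equals 1 (e.g., both raters are
    constant), because kappa is undefined in that case.
    """
    n = len(a)
    p_o = percent_agreement(a, b)                     # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)     # chance agreement
              for c in set(a) | set(b))
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Magnitude labels of Landis and Koch (1977) as applied in this study."""
    if kappa < 0.21:
        return "slight"
    if kappa < 0.41:
        return "fair"
    if kappa < 0.61:
        return "moderate"
    return "substantial"

# Hypothetical criterion scores for five performances by two raters.
rater_a = [1, 1, 0, 1, 0]
rater_b = [1, 0, 0, 1, 0]
print(percent_agreement(rater_a, rater_b))  # 0.8
print(cohens_kappa(rater_a, rater_b))       # ~0.615 -> "substantial"
```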

Results
Intra- and interrater kappa coefficients and the corresponding percentages of agreement for individual skills, the subtests of locomotor skills (LS) and ball skills (BS), and the gross motor test total score (TS) are provided in Table 1. For intrarater reliability, assessor A's and assessor B's kappa coefficients for TS were 0.75 and 0.73, which can be characterized as substantial agreement. Assessor A's and B's kappa coefficients were also substantial (range from 0.69 to 0.77) for LS and BS. Intrarater percent agreement for LS, BS and TS varied from 87% to 91%. When the individual skills were examined, all kappa values were at least moderate.

Table 1 about here
For interrater reliability, kappa coefficients for LS, BS and TS between the two assessors varied from moderate to substantial (range from 0.57 to 0.64). Percent agreement for LS, BS, and TS was 83% in each case (Table 1).
A more detailed examination of the three skills with the lowest reliability scores was performed (Table 2). For the hop, the problematic criteria were "Arms flex and swing forward to produce force" (κ = 0.13, 63%) and "Foot of non-hopping leg remains behind hopping leg" (43%). In the latter criterion, both raters scored the same number of 1s and 0s on the same criterion, so the kappa statistic could not be calculated. Likewise, for the fourth criterion, "Hops four consecutive…", assessor A scored all cases "1" in both trials and assessor B scored similarly except for one case, which again did not allow the kappa statistic to be calculated. However, the percent agreement for this criterion was high (98%).
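The situation in which kappa cannot be calculated can be illustrated arithmetically. In the extreme case, sketched below with hypothetical scores, both raters give every case the same score: chance agreement then equals observed agreement, Cohen's formula reduces to 0/0, and only percent agreement remains informative. (More generally, when one rater's scores are constant, the cross-tabulation is degenerate and statistical packages typically report kappa as not computable.)

```python
# Hypothetical ratings for one criterion: both assessors score all 20 cases 1.
a = [1] * 20
b = [1] * 20

n = len(a)
p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
p1_a, p1_b = a.count(1) / n, b.count(1) / n      # marginal proportions of 1s
p_e = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)      # chance agreement

# Cohen's kappa would be (p_o - p_e) / (1 - p_e) = 0 / 0: undefined,
# even though percent agreement is 100%.
print(p_o, p_e)
```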