Interaction of Sight and Sound in the Perception and Experience of Musical Performance

Vuoskoski et al. (2014) demonstrated that visual kinematic performance cues may be more important than auditory performance cues in terms of observers’ ratings of expressivity perceived in audiovisual excerpts of piano playing, and that visual kinematic performance cues had crossmodal effects on the perception of auditory expressivity. The present study was designed to extend these findings, and to provide additional information about the roles of sight and sound in the perception and experience of musical performance. Experiment 1 investigated the relative contributions of auditory and visual kinematic performance features to participants’ subjective emotional reactions evoked by piano performances, while Experiment 2 was designed to explore the effect of visual kinematic cues on the perception of loudness and tempo variability. Experiment 1 revealed that visual performance cues seem to be just as important as auditory performance cues in terms of the subjective emotional reaction of the observer, thus highlighting the importance of non-auditory cues for music-induced emotions. The results of Experiment 2 revealed that visual kinematic cues affected ratings of loudness variability, but not ratings of tempo variability.

Although it has been established that visual information about the performer's movements consistently enhances the appreciation of a musical performance (Platz & Kopiez, 2012), previous studies have not reliably estimated the relative contributions of visual and auditory performance cues to observers' experience. Although previous investigations have shown that the effect size of the visual component on observers' evaluations could on average be characterized as "medium" (Platz & Kopiez, 2012), it is not known how that relates to the effect size of auditory performance cues, especially across different levels of expressivity. Variations in performance features, often collectively referred to as "expressivity," are what differentiate performances of the same notated work, and serve to articulate musical structure (Clarke, 1988), communicate emotional meaning (see Juslin, 2001, for a review), and convey a sense of biological motion (Juslin, 2003). In order to investigate the relative contributions of auditory and visual performance cues to observers' evaluations, an experimental method is needed in which the expressivity conveyed by the visual and auditory components of a performance can be manipulated independently, so as to result in matched and mismatched audiovisual pairings. Such experimental designs have been successfully used to investigate the interaction of auditory and visual components in the perception of note duration (Schutz & Lipscomb, 2007), loudness (Rosenblum & Fowler, 1991), timbre (Saldaña & Rosenblum, 1993), pitch (Thompson, Graham, & Russo, 2005), and interval affect (Thompson, Russo, & Quinto, 2008), demonstrating that visual information can significantly alter the perception of various auditory features. However, the difficulty with applying such a design to a complex action such as musical performance is that musicians find it very difficult to control expressivity in one modality independent of the other (e.g., Thompson & Luck, 2012), and the temporal properties of a musical performance also vary greatly from one performance to the next.
Previous studies have attempted to tackle this issue by combining a constant auditory stimulus with visual information of actors portraying different expressive intentions (e.g., Juchniewicz, 2008; Morrison, Price, Geiger, & Cornacchio, 2009), or have settled for combining structurally incongruent auditory and visual components, thus resulting in functionally incongruent and temporally asynchronous stimuli (e.g., Krahé, Hahn, & Whitney, 2013; Petrini, McAleer, & Pollick, 2010). The former approach is problematic because of the limited validity of "faked" expressive movements, and the latter because the movements and gestures in musical performance have been found to arise from a representation of the musical structure, and thus convey meaning in association with specific musical passages (e.g., MacRitchie, Buck, & Bailey, 2013).
To address these limitations, a recent study by Vuoskoski et al. (2014) presented a novel method for creating matched and mismatched audiovisual combinations of different expressive intentions. By utilizing motion-capture animations of piano performances and time-warping algorithms, Vuoskoski et al. were able to investigate the relative contributions of auditory and visual kinematic performance cues to the perception of expressivity in a systematic and balanced way. In contrast to previous studies, the mismatched stimuli utilized by Vuoskoski et al. were temporally synchronized and structurally congruent (i.e., the visual kinematic information always represented the same composed structure as the auditory information). Vuoskoski et al. also explored potential crossmodal effects in the perception of auditory and visual expressivity, addressing the question of whether simultaneously presented visual kinematic information might alter the way in which auditory expressivity is perceived, or vice versa. They found that relative to auditory cues, visual kinematic cues actually contributed slightly more to a participant's overall evaluation of expressivity, and that there appeared to be crossmodal interactions at play in the evaluation of both auditory and visual expressivity.
Although Vuoskoski et al.'s (2014) study provides preliminary evidence for the existence of crossmodal effects in the evaluation of expressivity (as well as shedding light on the relative salience of auditory and visual kinematic performance cues), there are some limitations and questions that require further investigation. First, when considering the relative importance of auditory and visual cues from the observer's point of view, the evaluation of perceived expressivity may not capture the most salient or essential aspects of an observer's experience of a musical performance. Instead of the objective appraisal of the expressive components of a musical performance, it is arguably an observer's subjective emotional experience of the performance that ultimately determines their evaluation (cf. Hargreaves & North, 2010). Although there is some evidence to indicate that visual information might enhance emotional reactivity to musical performances (Chapados & Levitin, 2008), it is not yet known how the effect of visual performance cues relates to that of auditory performance cues with regard to the subjective emotional reaction of the observer. Furthermore, the explicit instructions used by Vuoskoski et al. to take both auditory and visual aspects of the performance into account in the evaluations of overall expressivity might have affected which aspects of the material the participants attended to (for details, see Vuoskoski et al., 2014; Experiment 1). In other words, it may be that as a result of the instructions, the participants paid more attention to visual kinematic performance features than they otherwise would.
Second, although the study by Vuoskoski et al. (2014, Experiment 2) demonstrated that visual kinematic cues can have an impact on evaluations of auditory expressivity, the exact nature of these crossmodal effects remains unclear. It is not yet established whether there are crossmodal effects at play in the perception of lower-level auditory features such as, for example, loudness. Furthermore, it is possible that the outcome reflects response bias, the participants' ratings of auditory expressivity being affected by the expressive qualities of the simultaneously presented visual kinematic information without their perception of the auditory features actually having been affected (see, e.g., Schutz & Kubovy, 2009).
The aim of the present study was therefore to extend the findings of Vuoskoski et al. (2014), and to provide new information regarding the roles of visual kinematic and auditory cues in the subjective emotional reactions evoked by musical performance, as well as investigating the possible effect of visual kinematic cues on the evaluation of specific auditory performance features. Experiment 1 was designed to investigate the relative contributions of auditory and visual kinematic performance features to participants' subjective emotional reactions, and thus to provide a more ecologically relevant account of the roles of sight and sound in an observer's experience of a musical performance. The difference between the previous Experiment 1 reported by Vuoskoski et al. (2014) and the current Experiment 1 mirrors the well-established distinction between perceived and felt emotion (see, e.g., Sloboda & Juslin, 2010). The former experiment investigated evaluations of a perceived characteristic of the performances (i.e., perceived expressivity), while the current experiment investigates the subjective emotional reactions experienced by participants (i.e., felt emotion). Previous research has suggested that visual information may have a significant impact on an observer's emotional reactions to a musical performance (Chapados & Levitin, 2008; Krahé et al., 2013; Vines et al., 2006), but the effect size of visual kinematic performance cues relative to that of auditory performance cues has yet to be investigated.
The aim of Experiment 2 was to explore the effect of visual kinematic cues on the evaluations of auditory expressivity in more detail. The two main auditory characteristics contributing to expressivity in piano performance are variations in timing and dynamics (i.e., tempo and loudness variation; e.g., Gabrielsson, 1999; Palmer, 1997), with the amount of variation being positively associated with perceived expressivity (e.g., Bhatara, Tirovolas, Marie Duan, Levy, & Levitin, 2011). Perceived expressivity is also positively associated with how much a performer moves (e.g., Davidson, 1994; Thompson & Luck, 2012). Since the size of a performer's movements reflects the physical energy used to play the notes, the kinematic visual information specifying a performer's movements might be expected to affect the perception of loudness, which is directly related to physical energy. Visual kinematic cues have previously been shown to affect loudness perception in the context of simple clapping sounds, with the size of clapping motions positively associated with perceived loudness (Rosenblum & Fowler, 1991).
By comparison, it is less obvious how temporally aligned visual kinematic information could affect the auditory perception of tempo variability. Previous research has shown the temporal resolution of the auditory modality to be superior to that of the visual modality (e.g., Burr, Banks, & Morrone, 2009; Freides, 1974; Repp & Penel, 2002), resulting in superior auditory rhythm and beat perception (e.g., Grahn, 2012). However, previous research has also shown that visual kinematic information can influence the perceived duration of notes played on a marimba (Schutz & Kubovy, 2009; Schutz & Lipscomb, 2007), and that the sensitivity to rhythmic deviations can be modulated by point-light animations of a bouncing person (Su, 2014). Nevertheless, since the movements of the pianists were temporally synchronized with the music in all of our stimuli, we hypothesize that the visual kinematic information will have an effect on the perception of loudness variability but not on the perception of tempo variability.
Experiment 1

METHOD
Participants. Nineteen participants (7 male, 12 female) aged 18-31 years (M = 23.1, SD = 4.1) were recruited from the University of Oxford community. Fourteen participants (73.7%) reported having received at least some music training on an instrument (ranging from 1 to 18 years; M = 10.6, SD = 5.0). The participants received a monetary incentive (£5) for taking part in the study. All of the experimental procedures followed the University of Oxford Policy on the Ethical Conduct of Research Involving Human Participants and received approval from the Research Ethics Committee.
Stimuli. The stimuli were obtained from a recent study by Vuoskoski et al. (2014), where the stimulus generation process is reported in some detail. However, the process is briefly outlined here, as the method of stimulus generation is crucial for the questions addressed in the study. Two pianists (one male and one female) performed Chopin's Prelude in E minor (Op. 28, No. 4) with three different levels of expression: Deadpan (reduced level of expressive intensity); Normal (normal level of expressive intensity); and Exaggerated (maximum level of expressive intensity); while their movements were captured at 120 frames per second using an 8-camera optical motion capture system (Qualisys Pro-Reflex). In addition, the MIDI output of the digital piano keyboard used in the performances was recorded, providing a complete record of the performances. To create the audio stimuli, the MIDI data were imported into GarageBand '11 (version 6.0.5), running on Mac OS X. The "Grand Piano" software instrument with 50% reverb was used to generate high-quality renditions of the performances. The segment from the beginning of measure 13 to the end of measure 20 was used to create the experimental stimuli, as this section includes the expressive climax of the piece (Sloboda & Lehmann, 2001), and should therefore allow for the greatest amount of variation in terms of expressive intensity. The duration of the resulting six performance excerpts (2 performers × 3 expressive intentions) ranged from 29 to 33 s (M = 31.3, SD = 1.5). The descriptive details of the performance excerpts (mean tempo, mean loudness, tempo and loudness variability, and the total amount of movement) are displayed in Table 1.
In order to generate audiovisual stimuli that would be incongruent in terms of their expressive intention (e.g., deadpan audio + normal movement) yet temporally synchronized, the motion capture data from each performer were temporally aligned to each of the three audio tracks of that performer using a time-warping algorithm (Verron, 2005; see also Wanderley, Vines, Middleton, McKay, & Hatch, 2005). This procedure involved the generation of timing profiles for each performance by annotating the timing of each eighth-note chord played by the left hand, producing an average resolution of 2.04 time points per s. The motion capture data were then functionalized using cubic splines. Using the annotated timing profiles for each performance, curve-stretching algorithms (see Verron, 2005, for details) were used to stretch and compress the motion capture data of a given performance so that it matched the timing profile of another performance. More specifically, the splines between each note onset were made to match the time separation of the corresponding note onsets in the other performance. Two time-warped versions of each performance were generated to match the timing profiles of the other two performances by the same performer. Note that only the movement data were time-warped while the audio data remained unaltered. Finally, the resulting splines were sampled to create time-warped motion capture data that could be used to generate point-light animations. This method has previously been used for analysis purposes (see Wanderley et al., 2005), as it enables the comparison of different performances independent of original tempo or timing variations. However, the present study (and the previous one by Vuoskoski et al., 2014) are, to the best of our knowledge, the first to use the method to generate time-warped point-light animations.
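The onset-to-onset warping described above can be illustrated with a minimal sketch. It is not the algorithm of Verron (2005): for brevity it substitutes piecewise-linear interpolation (NumPy's `np.interp`) for the cubic splines used in the study, and the function name `time_warp`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def time_warp(motion, fps, src_onsets, dst_onsets):
    """Warp motion data so that events at src_onsets land on dst_onsets.

    motion      : array (frames x channels), sampled at `fps` (assumed layout)
    src_onsets  : annotated onset times (s) in the performance being warped
    dst_onsets  : corresponding onset times (s) in the target performance
    NOTE: linear interpolation stands in for the cubic splines of the study.
    """
    motion = np.asarray(motion, dtype=float)
    src_onsets = np.asarray(src_onsets, dtype=float)
    dst_onsets = np.asarray(dst_onsets, dtype=float)

    # Time stamp of every original frame.
    src_times = np.arange(motion.shape[0]) / fps

    # Frame times of the warped output, spanning the target performance.
    n_out = int(round((dst_onsets[-1] - dst_onsets[0]) * fps)) + 1
    dst_times = dst_onsets[0] + np.arange(n_out) / fps

    # Piecewise-linear map: target time -> source time, so each
    # inter-onset segment is stretched or compressed independently.
    mapped = np.interp(dst_times, dst_onsets, src_onsets)

    # Resample every channel of the source motion at the mapped times.
    channels = [np.interp(mapped, src_times, motion[:, c])
                for c in range(motion.shape[1])]
    return np.column_stack(channels)

# Toy example: one channel whose value equals elapsed time (2 s at 10 fps).
ramp = (np.arange(21) / 10.0).reshape(-1, 1)
warped = time_warp(ramp, fps=10, src_onsets=[0, 1, 2], dst_onsets=[0, 2, 3])
```

In the toy example the first inter-onset segment is stretched to twice its length and the second is left unchanged, while the audio (not modelled here) would remain untouched, as in the study.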
Point-light animations of the original and time-warped motion capture data were generated using MATLAB and the Motion Capture Toolbox (Burger & Toiviainen, 2013). Light points, connected by lines to form a stick-figure shape, represented each pianist's hands, wrists, elbows, shoulders, head (midpoint and four markers around the head), torso (mid-shoulder and mid-torso), and hips. The keyboard was represented by two markers connected by a line (see Figure 1 for a sample frame).
The 18 animations were combined with the appropriate audio to create 6 matching (e.g., normal audio + normal video) and 12 mismatching (e.g., exaggerated audio + deadpan video) audiovisual stimuli (example stimuli can be downloaded from https://dl.dropboxusercontent.com/u/311821/Video_examples.zip). Note that the audio and video from different performers were never combined. In addition, unimodal versions of the stimuli (6 audio-only and 18 video-only stimuli) were also generated.
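The factorial structure of the audiovisual stimulus set can be made explicit with a short enumeration. This is only an illustrative sketch; the labels and the function name `build_audiovisual_stimuli` are hypothetical and are not taken from the original materials.

```python
from itertools import product

PIANISTS = ("Pianist 1", "Pianist 2")
LEVELS = ("Deadpan", "Normal", "Exaggerated")

def build_audiovisual_stimuli():
    """Cross audio and video expression levels within each pianist.

    Audio and video from different pianists are never combined, so the
    pianist label is shared by both components of a stimulus.
    """
    stimuli = []
    for pianist, audio, video in product(PIANISTS, LEVELS, LEVELS):
        stimuli.append({
            "pianist": pianist,
            "audio": audio,
            "video": video,
            "matched": audio == video,  # same expressive intention
        })
    return stimuli

stimuli = build_audiovisual_stimuli()
matched = [s for s in stimuli if s["matched"]]
mismatched = [s for s in stimuli if not s["matched"]]
print(len(stimuli), len(matched), len(mismatched))  # 18 6 12
```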
Procedure. The Max/MSP (version 5.1.9; Cycling '74) graphical programming environment (running on Mac OS X) was used to present the stimuli and to collect the data. The animations were presented with a resolution of 800 × 600 pixels and a frame rate of 30 fps. The audio was presented in WAV format through high quality headphones (Sennheiser HD 219). The participants were instructed to evaluate the intensity of their subjective emotional reactions to the performances, and were informed that a given performance might leave them cold, while another performance might move them in a profound way. The evaluations were made using a horizontal analog scale (width 278 pixels) ranging from "did not move me at all" to "moved me very strongly." The participants were instructed to base their ratings on their own emotional reactions rather than any specific aspect of the performances (such as the auditory or visual components of the stimuli), so as not to direct the participants towards perceived rather than felt emotion. The output of the scale, as a default property of the Max/MSP-object, provided data in the range 0-127. The participants were instructed to make their evaluations immediately after each excerpt had ended. The experiment started with two practice trials using audiovisual excerpts that were similar to (but not part of) the actual stimulus set, to which the participants were instructed to respond. These responses were not included in the data. The practice trials were followed by the 18 audiovisual excerpts, which were presented in a different random order for each participant. The audiovisual block was followed by two unimodal blocks (audio-only, consisting of 6 audio excerpts; and video-only, consisting of 18 video excerpts), in which evaluations of felt emotional impact were based only on what was perceived in the presented modality. The audiovisual block was always presented first, as the audiovisual condition was the main focus of interest in the current study. Furthermore, the initial exposure to the audiovisual excerpts provided participants with a relevant framework in which to view the silent point-light animations, which might have seemed strange or arbitrary if presented first. The video-only condition included both the six original animations as well as the twelve time-warped animations that had been altered to match the different audio tracks. The order in which the two unimodal blocks were presented was counterbalanced across participants. Again, the excerpts within the blocks were presented in a different random order for each participant. After the experiment, the participants completed a short questionnaire about their music training and music listening habits, and were fully debriefed.
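The block structure of a session (audiovisual block first, counterbalanced unimodal blocks, randomized excerpt order within each block) could be sketched as follows. The experiment itself was run in Max/MSP; this Python sketch is purely illustrative, the function name `build_session` is hypothetical, and alternating the unimodal block order by participant index is an assumed counterbalancing scheme, not one reported in the text.

```python
import random

def build_session(participant_index, av_stimuli, audio_stimuli,
                  video_stimuli, seed=None):
    """Return an ordered list of (block name, shuffled stimuli) pairs."""
    rng = random.Random(seed)

    def shuffled(items):
        items = list(items)
        rng.shuffle(items)  # a different random order for each call
        return items

    # The audiovisual block always comes first.
    blocks = [("audiovisual", shuffled(av_stimuli))]

    # Assumption: odd-numbered participants get the video-only block first.
    unimodal = [("audio-only", shuffled(audio_stimuli)),
                ("video-only", shuffled(video_stimuli))]
    if participant_index % 2 == 1:
        unimodal.reverse()

    return blocks + unimodal

session_a = build_session(0, range(18), range(6), range(18), seed=1)
session_b = build_session(1, range(18), range(6), range(18), seed=2)
```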

RESULTS
Emotional impact in unimodal rating conditions. In order to investigate whether the unimodal (audio-only and video-only) representations of different expressive intentions resulted in differing evaluations of felt emotional impact, repeated-measures ANOVAs were conducted to investigate the ratings obtained in the two unimodal conditions. There were two within-participant factors: Performance Condition (Deadpan, Normal, or Exaggerated) and Pianist (Pianist 1 or 2), and one between-participants factor: Block Order. The latter factor was added in order to investigate whether the presentation order of the unimodal blocks (audio-only first or video-only first) had any effect on participants' ratings. Note that the audiovisual block always preceded the two unimodal blocks. In the audio-only condition, there was a significant main effect of Performance Condition, F(2, 34) = 7.07, p < .01, η²G (generalized eta squared; Bakeman, 2005) = .17, as well as a significant main effect of Pianist, F(1, 17) = 5.84, p < .05, η²G = .04. There was no main effect of Block Order, and no interaction effects. Multiple comparisons of means (paired t-tests; p < .05 significance level adjusted using the Holm-Bonferroni method; Holm, 1979) revealed that ratings of emotional impact for the Deadpan performances were significantly lower than those for the Normal and Exaggerated performances, but that the latter two did not differ significantly from each other. A comparison of means also revealed that the performances of Pianist 2 were rated as having a stronger emotional impact on average than those of Pianist 1. The mean ratings for the three different types of performances by the two pianists are displayed in Figure 2.
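For readers unfamiliar with the Holm-Bonferroni adjustment (Holm, 1979), the step-down procedure can be sketched as below. The function name `holm_bonferroni` is hypothetical and the p-values in the usage lines are invented for illustration; they are not the values obtained in this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    The p-values are tested from smallest to largest against
    alpha / (m - rank); testing stops at the first non-significant result.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject

# Invented p-values for three pairwise comparisons.
example_1 = holm_bonferroni([0.001, 0.04, 0.03])   # [True, False, False]
example_2 = holm_bonferroni([0.01, 0.02, 0.04])    # [True, True, True]
```

In the first example the second-smallest p-value (.03) exceeds its threshold of .05 / 2 = .025, so it and every larger p-value are treated as non-significant.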
A similar repeated-measures ANOVA was conducted to analyze the ratings of felt emotional impact obtained in the video-only rating condition, with the difference that two factors regarding performance condition were included: Type of Video, and Type of Time-warp. As the video component of the mismatched stimuli had been slightly altered to fit the accompanying audio track, Type of Time-warp was included to determine whether there were any differences between the different time-warped and original animations. Type of Video and Type of Time-warp both had three levels: Deadpan, Normal, and Exaggerated. Due to a technical failure, one participant's video-only ratings were not recorded, and thus n = 18 for this analysis. The analysis revealed significant main effects of Type of Video, F(2, 32) = 29.17, p < .001, η²G = .26, Type of Time-warp, F(2, 32) = 4.32, p < .05, η²G = .01, and Pianist, F(1, 16) = 18.04, p < .001, η²G = .07. There were no main or interaction effects related to Block Order. Multiple comparisons of means revealed that the emotional impact of the Deadpan video type was rated as significantly weaker than the impact of the Normal or Exaggerated video types, as expected; but that the difference between the latter two, although in the expected direction, was not statistically significant. Multiple comparisons regarding the effect of Type of Time-warp did not reveal any significant differences between the different time-warped and original animations after the Holm-Bonferroni correction had been applied. A comparison of means confirmed that the emotional impact of the performances by Pianist 1 was evaluated as significantly stronger than that of the performances by Pianist 2. There was also a significant interaction between Type of Video and Pianist, F(2, 32) = 10.02, p < .001, η²G = .04. Multiple comparisons of means revealed that the emotional impact of the performances by Pianist 1 was rated as significantly stronger (than those of Pianist 2) only in the Normal and Exaggerated video conditions. The mean ratings given for the three different types of performance by the two pianists are shown in Figure 2.
Ratings of emotional impact in the audiovisual condition. In order to investigate the salience of the auditory [...] video types, multiple comparisons revealed that the Deadpan videos received significantly lower ratings of emotional impact than the Normal and Exaggerated videos, but that the latter two did not differ significantly from one another. There was no main effect of Pianist, and no interaction effects.
To further investigate the relative contribution of auditory and visual cues to the emotional impact evoked by the performance excerpts, a linear regression analysis was conducted. The dependent variable was the mean rating of felt emotional impact for the audiovisual stimuli, while the mean ratings of emotional impact for the audio-only and video-only stimuli were the independent variables. The two predictor variables were not significantly intercorrelated, r(16) = -.06, ns, but both were significantly correlated with the dependent variable: r(16) = .67, p < .01, for audio-only ratings, and r(16) = .62, p < .01, for video-only ratings. Audio-only and video-only ratings of emotional impact both significantly predicted felt emotional impact in the audiovisual condition, β = .71, t(17) = 7.94, p < .001, and β = .66, t(17) = 7.42, p < .001, respectively. Together they explained a significant proportion of the variance in the emotional impact felt in the audiovisual condition, R² = .88, F(2, 17) = 55.93, p < .001.
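The logic of this regression (two unimodal predictors, standardized coefficients, and R²) can be sketched with ordinary least squares in NumPy. This is an illustrative sketch only: the function name `standardized_regression` is hypothetical, the toy ratings are invented and noise-free, and the code does not reproduce the analysis software or the data of the study.

```python
import numpy as np

def standardized_regression(y, predictors):
    """Regress y on the predictors after z-scoring every variable.

    Returns the standardized coefficients (betas) and R squared.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.asarray(p, dtype=float) for p in predictors])

    # z-score the outcome and each predictor (sample SD, ddof=1).
    zy = (y - y.mean()) / y.std(ddof=1)
    zX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # With standardized variables the intercept is zero, so none is fitted.
    betas, *_ = np.linalg.lstsq(zX, zy, rcond=None)

    residuals = zy - zX @ betas
    r_squared = 1.0 - (residuals @ residuals) / (zy @ zy)
    return betas, r_squared

# Invented mean ratings for six stimuli (not data from this study).
audio_only = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
video_only = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
audiovisual = 2.0 * audio_only + 3.0 * video_only + 1.0  # exact linear mix

betas, r_squared = standardized_regression(audiovisual,
                                           [audio_only, video_only])
```

Because the toy outcome is an exact linear combination of the two predictors, R² is 1 here; with real ratings the residual variance would be non-zero.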

DISCUSSION
The results of Experiment 1 demonstrate that the audio types (Deadpan, Normal, and Exaggerated) elicited different levels of emotional impact in the audio-only condition, although only the Deadpan performances differed significantly from the other two. The effect size of audio type (η²G = .17) was notably smaller than that in a previous experiment measuring perceived expressivity (using the same stimuli; η²G = .59; Vuoskoski et al., 2014). This difference in effect size may be attributable to the more subjective and internal character of participants' own emotional reactions as compared to the more manifest and external character of the expressive intentions on which participants were asked to focus in the previous study. Indeed, previous research on music-induced emotions has found that there tends to be more interindividual variability in evaluations of felt emotion compared to evaluations of perceived emotion (e.g., Juslin, 2009).
Interestingly, the effect of video type in the video-only rating condition (η²G = .26) was somewhat larger than the effect of audio type in the audio-only condition, though there was no statistically significant difference between the Normal and Exaggerated video types in terms of their emotional impact. Although this effect size is smaller than that observed in a previous experiment investigating the perception of expressivity (η²G = .61; Vuoskoski et al., 2014), it is striking that point-light animations of pianists performing were able to evoke significantly differentiated emotional responses in participants. However, it may also be that participants' evaluations were affected by demand characteristics (e.g., Orne, 1962). When asked to evaluate the emotional impact of stimuli that clearly represent an emotional expression of some kind, it might be that even in the absence of genuine emotional reactions the participants nonetheless move the slider on the basis of perceived expressivity rather than felt emotion (cf. Konečni, 2008). This possibility is supported by the fact that three of the participants reported extremely low (or nonexistent) levels of emotional impact in response to the video-only stimuli (but not in response to the audiovisual or audio-only stimuli), perhaps reflecting a more rigorous rating strategy on their part than for the other participants. Furthermore, having already responded to an audiovisual block (which was always presented first), it is possible that the participants' unimodal ratings were influenced by previous audiovisual associations. Since the participants were exposed to both matched and mismatched combinations in the audiovisual block, it is unlikely that they would have associated a specific audio-only stimulus with a specific video component or vice versa; but a more generic association between the two modalities may nonetheless have been induced.
The results of the audiovisual rating condition revealed that both Type of Audio and Type of Video had a significant effect (η²G = .10 and .09, respectively) on the emotional impact of the piano performances. The effect sizes of audio type and video type were comparable, in contrast to the differences observed in the unimodal rating conditions. This pattern of results is somewhat different from that found for the perception of expressivity (Vuoskoski et al., 2014), where Type of Video (η²G = .29) revealed a stronger effect compared to Type of Audio (η²G = .23). Again, the overall difference in effect size may be related to the subjective and elusive character of emotional reactions as compared with perceived expressive intentions, but the difference in the relative contribution of auditory and visual modalities suggests that while visual kinematic cues may be more salient than auditory cues in communicating expressive intentions, their contribution to the emotional impact of performances is comparable to that of auditory performance cues. The results of the linear regression analysis support this conclusion, by showing that audio-only ratings and video-only ratings explain comparable proportions of the variance in the audiovisual ratings of emotional impact. As in the case of the unimodal rating blocks, it is possible that some of the participants based their ratings of emotional impact on perceived expressivity rather than their actual emotional reactions. Note, though, that this is an issue that affects all studies aiming to investigate music-induced emotions using self-report measures, and can be minimized by giving clear instructions to participants (see, e.g., Konečni, 2008). We gave our participants explicit instructions to focus on the "emotional effect that the performance has on you," and used unambiguous labels ("did not move me at all" and "moved me very strongly") to signify the extremes of the rating scale.
Finally, the contribution of either modality to the emotional impact of a performance may depend on the performer and her or his efficacy in conveying expressive intentions via body movements and auditory cues.
In the present study, the audio-only excerpts of Pianist 2 were evaluated as having a stronger emotional impact than those of Pianist 1, while the video-only ratings revealed the opposite pattern. These results are in line with the objective measures of auditory and kinematic features (see Table 1), with Pianist 2 displaying more tempo variability, and Pianist 1 displaying more movement overall. However, there was no effect of Pianist in the ratings obtained in the audiovisual condition (and no interaction effects), thus suggesting that the relative contribution of the auditory and visual modalities to the emotional impact of audiovisual performances may not be significantly affected by differences in expressive efficacy. Furthermore, it should be noted that the facial expressions of performers (which would sometimes be visible to the audience in real-life performance situations and which are eliminated in this study by the use of stick figures) may add significantly to the overall emotional impact of a musical performance.

Experiment 2
The results of Experiment 1 revealed that auditory and visual kinematic performance cues seem to account for comparable proportions of participants' subjective emotional reactions to piano performance excerpts. However, the potential crossmodal effects involved in the process remain unclear. A previous study by Vuoskoski et al. (2014) revealed that visual kinematic cues can affect the ratings of perceived auditory expressivity, but it is not yet known whether this effect reflects actual crossmodal interactions between the auditory and visual modalities, or whether instead it could be attributed, for example, to some kind of response bias. Furthermore, if the observed effects were due to crossmodal interactions, it is unclear which aspects of perceived auditory expressivity are affected by visual kinematic cues. Thus, the aim of Experiment 2 was to investigate whether visual kinematic cues might affect the perception of the key auditory features contributing to perceived expressivity, namely loudness and tempo variability. Since the aim was to obtain as reliable and consistent an evaluation of loudness and tempo variation as possible, only those participants with musical instrument training were recruited to take part in this experiment.

METHOD
Participants. Seventeen participants (7 male, 10 female) aged 18-61 years (M = 26.3, SD = 11.7) were recruited from the University of Oxford community. All of the participants had received a minimum of two years of music training on an instrument (ranging from 2 to 17 years; M = 10.2, SD = 4.9). The participants received a monetary incentive (£5) for taking part in the study.
Stimuli. The stimuli were the same as those in Experiment 1.
Procedure. The procedure was almost identical to that of Experiment 1, with the difference that instead of emotional impact, the participants were asked to evaluate the amount of loudness (dynamic) variation, and the amount of tempo variation, in the performances. They were instructed that "A performance with no variation in dynamics or timing would sound flat and mechanical, while a performance with an extreme amount of variation would have continuous changes in tempo and loudness." Both evaluations were made using horizontal visual analog scales (width 278 pixels) ranging from "No variation at all" to "An extreme amount of variation." The order in which the scales were presented was balanced across participants. The same rating scales were also used in two unimodal rating conditions. In the video-only condition, the participants were instructed to "try to imagine how the music produced by the pianists' movements would sound, and evaluate the amount of variation in the timing and dynamics of the imagined performances." The audiovisual block was always presented first, followed by the audio-only and video-only blocks. As the presentation order of the unimodal blocks had no significant effect on participants' ratings in Experiment 1, all participants in this experiment completed the unimodal blocks in the same order.
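As an illustration only, the conversion from a position on the 278-pixel visual analog scale to the 0-100 range used in the figures could be sketched as follows. The linear mapping and the function name `vas_to_percent` are assumptions; the article does not describe how the scaling was implemented.

```python
SCALE_WIDTH_PX = 278  # scale width reported in the Procedure


def vas_to_percent(position_px, width_px=SCALE_WIDTH_PX):
    """Map a horizontal position on the scale (in pixels) to a 0-100 rating.

    Assumes a linear mapping from the left end ("No variation at all")
    to the right end ("An extreme amount of variation").
    """
    if not 0 <= position_px <= width_px:
        raise ValueError("position lies outside the scale")
    return 100.0 * position_px / width_px
```

Under this assumption, a response at the midpoint of the scale (139 pixels) corresponds to a rating of 50.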

Unimodal perception of loudness and tempo variability.
In order to determine whether the perceived amount of loudness and tempo variation differed significantly between the different performance conditions, the ratings obtained in the unimodal audio-only rating condition were analysed using repeated-measures ANOVAs. The mean ratings are displayed in Figure 4. There were two within-participant factors: Type of Audio (Deadpan, Normal, or Exaggerated), and Pianist (1 or 2). One participant's audio-only ratings were not recorded due to a technical failure, and thus n = 16 for this analysis. In the ratings of the perceived amount of loudness variation, there was a significant main effect of Type of Audio; F(2, 30) = 41.01, p < .001, η²_G = .40, but no effect of Pianist nor any interaction. Multiple comparisons of means (paired t-tests, level of statistical significance adjusted using the Holm-Bonferroni method) revealed that all three audio types differed significantly from each other in terms of the perceived amount of loudness variation, with the Deadpan audio type receiving the lowest and the Exaggerated audio type the highest ratings. A similar analysis was conducted on the ratings of the amount of tempo variation. This analysis yielded a significant main effect of Type of Audio; F(2, 30) = 46.61, p < .001, η²_G = .48, but once again no effect of Pianist and no interaction effect was observed. Multiple comparisons of means revealed that all three audio types differed significantly from each other in terms of the perceived amount of tempo variation, with the Deadpan audio type receiving the lowest and the Exaggerated audio type the highest ratings.
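To make the reported statistics easier to follow, the sketch below computes the F ratio and generalized eta squared for a simplified one-way repeated-measures design. It is not the analysis used in the study, which included two or three within-participant factors. The function name `rm_anova_oneway` and the data layout are hypothetical, and the effect size follows one common formulation for fully within-participant designs (effect sum of squares divided by the sum of the effect, participant, and error sums of squares).

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA (illustrative sketch).

    data: list of rows, one per participant; each row holds that
    participant's rating in each of the k conditions.
    """
    n = len(data)        # number of participants
    k = len(data[0])     # number of conditions
    grand_mean = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_effect = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subj_means)
    ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_effect - ss_subjects

    df_effect = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_effect / df_effect) / (ss_error / df_error)
    # Generalized eta squared, assuming a purely within-participant design
    eta_sq_g = ss_effect / (ss_effect + ss_subjects + ss_error)
    return {"df": (df_effect, df_error), "F": f_ratio, "eta_sq_g": eta_sq_g}
```

With 16 participants and three levels of a factor, this sketch gives degrees of freedom of (2, 30), the same as in the F(2, 30) values reported above.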
The next step was to investigate the ratings of loudness and tempo variation obtained in the video-only condition, where the participants were instructed to base their ratings on how they imagined the music produced by the observed movements would sound. Repeated-measures ANOVAs with three within-participants factors (Type of Video, Type of Time-warp, and Pianist) were conducted to investigate whether the participants were able to consistently estimate the amount of loudness and tempo variation based on the pianists' movements alone. Type of Time-warp was included as a factor in order to see whether there were any differences between the time-warped and the original animations, since time-warping changes the timing of the movements. In the ratings of loudness variation, there were significant main effects of Type of Video; F(2, 32) = 49.35, p < .001, η²_G = .38, and Pianist; F(1, 16) = 26.29, p < .001, η²_G = .09, but no main effect of Type of Time-warp. Multiple comparisons of means revealed that all three video types differed significantly from one another in terms of loudness variability, with the Deadpan video type receiving the lowest and the Exaggerated video type the highest ratings. Furthermore, a comparison of means revealed that Pianist 1 was rated as exhibiting more loudness variation. There were also interaction effects between Type of Video and Pianist; F(2, 32) = 20.77, p < .001, η²_G = .07, and between Type of Time-warp and Pianist; F(2, 32) = 7.30, p < .01, η²_G = .02. Multiple comparisons of means revealed that Pianist 1 was rated as exhibiting more loudness variation than Pianist 2 only in the Normal and Exaggerated video types. Multiple comparisons investigating the interaction effect between Type of Time-warp and Pianist failed to reach statistical significance after the Holm-Bonferroni correction had been applied.
A similar analysis was conducted on the ratings of tempo variation obtained in the video-only condition, yielding significant main effects of Type of Video; F(2, 32) = 38.73, p < .001, η²_G = .26, Type of Time-warp; F(2, 32) = 4.94, p < .05, η²_G = .02, and Pianist; F(1, 16) = 6.18, p < .05, η²_G = .02. Multiple comparisons of means revealed that the Deadpan video type was rated as significantly lower in tempo variation than the Normal and Exaggerated video types, but that there was no statistically significant difference between the latter two. Multiple comparisons for the main effect of Type of Time-warp failed to reach statistical significance after the Holm-Bonferroni correction had been applied. A comparison of means also revealed that Pianist 1 was rated as exhibiting more tempo variation than Pianist 2, with interaction effects between Type of Video and Pianist; F(2, 32) = 3.36, p < .05, η²_G = .02, and between Type of Time-warp and Pianist; F(2, 32) = 5.78, p < .01, η²_G = .01. Multiple comparisons revealed that Pianist 1 was rated as exhibiting more tempo variation than Pianist 2 only in the case of the Exaggerated video type. Furthermore, multiple comparisons revealed that Type of Time-warp only had a significant effect on the ratings of tempo variation in the case of Pianist 2, with the videos warped to Exaggerated audio receiving higher ratings than those warped to Normal or Deadpan audio.
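The Holm-Bonferroni step-down procedure referred to in these comparisons can be sketched as follows. This is a generic illustration of the method, not the authors' analysis script; the function name `holm_bonferroni` and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    The p-values are tested from smallest to largest against
    alpha / (m - rank); testing stops at the first non-significant
    comparison, and all remaining hypotheses are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first failure
    return reject


# Three hypothetical pairwise comparisons
print(holm_bonferroni([0.01, 0.04, 0.03]))  # [True, False, False]
```

In the hypothetical example, the comparison with p = .03 would be significant at an uncorrected .05 level but not against its corrected threshold of .025, which is how a nominally significant effect can fail to reach significance once the correction is applied.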
Bimodal perception of loudness and tempo variability. In order to investigate the potential effect of visual cues on the perception of loudness variation, the ratings of loudness variation obtained in the audiovisual rating condition were analysed using a repeated-measures ANOVA. The mean values are displayed in Figure 5. There were three within-participants factors: Type of Audio, Type of Video, and Pianist. The analysis yielded significant main effects of Type of Audio; F(2, 32) = 72.69, p < .001, η²_G = .38, Type of Video; F(2, 32) = 3.71, p < .05, η²_G = .01, and Pianist; F(1, 16) = 6.70, p < .05, η²_G = .03. There were no interaction effects. Multiple comparisons of means revealed that all three audio types were rated as significantly different in terms of the amount of loudness variation, with the Deadpan audio type receiving the lowest and the Exaggerated audio type the highest ratings. For the effect of Type of Video, multiple comparisons of means revealed that there was a statistically significant difference only between the Deadpan and Normal video types, with the Deadpan video type receiving significantly lower ratings of loudness variation. A comparison of the means also revealed that Pianist 2 was rated as exhibiting more loudness variation than Pianist 1.
Finally, the potential effect of visual cues on the perception of tempo variation was investigated by conducting a similar repeated-measures ANOVA on the ratings of tempo variation (see Figure 6 for mean ratings). Once again, there were three within-participants factors: Type of Audio, Type of Video, and Pianist. The analysis yielded significant main effects of Type of Audio; F(2, 32) = 61.47, p < .001, η²_G = .45, and Pianist; F(1, 16) = 7.70, p < .05, η²_G = .03, but no effect of Type of Video, nor any interaction effects. Multiple comparisons of means revealed that all three audio types were rated as significantly different in terms of the amount of tempo variation, with the Deadpan audio type receiving the lowest and the Exaggerated audio type the highest ratings. A comparison of means also revealed that Pianist 2 was rated as exhibiting more tempo variation than Pianist 1.

DISCUSSION
The ratings of loudness and tempo variability obtained in the audio-only condition demonstrated, in line with the objective measures of loudness and tempo variability (see Table 1), that the performances produced under all three expressive conditions were evaluated as significantly different in terms of the perceived loudness and timing variation. Furthermore, there were no significant differences between the two pianists in terms of perceived loudness and tempo variability. In the silent video-only rating condition, where the participants were instructed to imagine how the music produced by the pianists' movements would sound, the participants rated all three video types as significantly different in terms of their loudness variability. Since the total amount of movement increased significantly from Deadpan to Exaggerated performances (see Table 1), this suggests that participants used the size of movements as the cue in their evaluations. This conclusion is further supported by the finding that Pianist 1, who showed more movement variation across the different performance types (see Table 1, right-hand column), was evaluated as exhibiting more loudness variation than Pianist 2 in the video-only condition. In the video-only ratings of tempo variability, the notably larger effect size for Type of Video relative to Type of Time-warp (which represented the timing model to which the animation was time-warped and matched) suggests that participants used the simple amount of movement (rather than the pattern of timing of those movements) as a cue. This finding may be explained by the limited temporal resolution of the visual modality (e.g., Freides, 1974; Welch, DuttonHurt, & Warren, 1986), as well as the strong real-world association between the size of performers' movements and the amount of tempo and loudness variation.
Although Pianist 1 was evaluated as exhibiting more loudness and tempo variation than Pianist 2 in the video-only condition, this pattern of results was reversed in the audiovisual rating condition. The audiovisual ratings revealed that Pianist 2 was evaluated as exhibiting more loudness and tempo variation than Pianist 1, a result that is in line with the objective measures of audio features (see Table 1). Interestingly, however, there was no effect of Pianist in the audio-only condition. As in the audio-only condition, all three audio types were evaluated as significantly different in terms of their loudness and tempo variability in the audiovisual rating condition. The effect of Type of Audio on the evaluations of loudness variability was comparable to that observed in the audio-only condition, but Type of Video also had a statistically significant effect. More specifically, when the different audio types were presented in combination with the Deadpan video type, they received lower ratings of loudness variability than when presented together with the Normal video type; while for the ratings of tempo variability, the effect of Type of Audio was comparable to the audio-only ratings, and showed no effect of Type of Video.
These results are consistent with the hypothesis that visual kinematic information exerts a crossmodal influence on the perception of loudness variability, but not on the perception of tempo variability. However, the pattern of crossmodal effects observed in the two experiments reported here was not entirely straightforward. The theory of optimal sensory integration (e.g., Alais & Burr, 2004; Ernst & Banks, 2002), which proposes that more weight is given to the modality that provides the more reliable sensory information, does not fully explain why the loudness variability of the Normal video type was evaluated as significantly higher than that of the Deadpan video type while the Exaggerated video type was not. An alternative account is offered by those studies that have demonstrated that when sounds and sights are perceived as originating from a common event (i.e., the unity assumption), the process of sensory integration is altered in a way that differs from the traditional understanding of optimal integration (e.g., Schutz & Kubovy, 2009). However, studies that have investigated the unity assumption using musical instrument stimuli have reported conflicting findings, either succeeding (Schutz & Kubovy, 2009) or failing (Vatakis & Spence, 2008) to find an effect of the unity assumption. Mitterer and Jesse (2010) propose that multisensory integration may actually be driven by learned co-occurrences of visual and auditory stimuli rather than their perceived common causation: using piano stimuli showing either a key stroke or the actual sound-producing hammer stroke, they demonstrated that multisensory integration was stronger in the case of key strokes. As there is a strong real-world correlation between auditory and visual cues of musical expressivity, with performers finding it difficult to retain their normal level of expression while restricting their movements (Thompson & Luck, 2012), this account may also reflect the process underlying the effects observed in the present study.
In line with this proposal, it may be that the degree of crossmodal effect observed in the perception of loudness variability varied depending on the ecological plausibility of the audio-video combinations, suggesting that only those cues that could be meaningfully paired with cues in the other modality resulted in crossmodal effects (cf. Warren, Welch, & McCarthy, 1981). This interpretation is in line with the findings of Vuoskoski et al. (2014), who observed that the more contrasting audio-visual combinations resulted in weaker crossmodal effects.
Finally, there is a need to consider the potential effect of response bias on the observed effects. It may be that only participants' evaluations of loudness variability were affected by visual cues, while their perceptions of loudness variability remained unaltered. We did not explicitly instruct the participants to base their evaluations only on the auditory modality, as we expected musically trained participants to have an established understanding of loudness and tempo variability as musical features; and asking participants to base their ratings on one modality while still attending to the other risks drawing participants' attention to the phenomenon under investigation, thus increasing the likelihood of demand characteristics. The fact that the ratings of tempo variability were not affected by the simultaneously presented visual kinematic information, and that visual information affected ratings of loudness variability only in the case of certain audio-visual pairings (across both pianists), suggests that the observed crossmodal effects cannot be explained solely in terms of response bias. However, further investigation is undoubtedly required to clarify whether visual information about a piano performance could affect the perception of loudness at a sensory level.

General Discussion
This study provides further evidence for the significance of visual kinematic cues in the perception and experience of musical performance. Although previous studies have shown that visual information can influence the emotions induced by a musical performance (e.g., Chapados & Levitin, 2008; Krahé et al., 2013; Timmers, Marolt, Camurri, & Volpe, 2006; Vines et al., 2011), they have not been able to reliably estimate the effect size of visual performance cues relative to that of auditory performance cues. The present study revealed that, in terms of the emotional impact of musical performances, the contribution of visual kinematic performance cues appears to be comparable to that of auditory performance cues. This is not to say that the effect of visual cues would be equal to that of musical cues as a whole, since there is the significant impact of the music's composed structure to consider in addition to auditory performance features. The emotions conveyed and induced by music emerge from the combination of structural and performance features, and are also affected by individual and situational factors (e.g., Scherer & Zentner, 2001). In relation to this complex range of factors, the present study was only designed to investigate the relative contributions of auditory and visual kinematic performance cues by comparing different performances (and combinations of different performances) of the same musical piece. Thus, the results of this study suggest that in terms of the effect of performance cues on observers' subjective emotional reactions to a musical performance, the visual modality appears to be just as important as the auditory modality.
The significant contribution of visual cues to our participants' emotional experiences is striking, since the effects of performance features on the perception and induction of emotion have often been considered only from an auditory perspective (see e.g., Juslin & Timmers, 2010), despite more widespread recognition of the role of visual factors in judgements of performance expressivity (e.g., Davidson, 1993, 1994; Tsay, 2013). There is some evidence to suggest that the type of emotional expression communicated via visual kinematic cues can have an effect on the type (and intensity) of emotions perceived and experienced by the observer of a musical performance (Chapados & Levitin, 2008; Krahé et al., 2013; Timmers et al., 2006; Vines et al., 2011), but more controlled and systematic investigations (e.g., within-participants rather than between-participants designs, and more systematically generated stimuli) are needed to explore this issue further. Moreover, recent findings suggest that the emotions felt by a performer also alter the way in which he or she moves, since observers seem to perceive visually and audiovisually presented (but not solely auditorily presented) violin performances as sadder when the performer was actually feeling sad, compared to when they were only expressing sadness (Van Zijl & Luck, 2013). These findings, as well as those of the present study, support the view that observers of a musical performance are able to detect very subtle yet informative cues from visual kinematic information, without necessarily attending to them in a conscious manner (Tsay, 2013).
The results of the present study also provide evidence to support the view that visual kinematic information can have an effect on the judgment of certain auditory performance cues. The results of Experiment 2 revealed that visual kinematic information had an impact on ratings of loudness variability (but not on ratings of tempo variability), suggesting that the crossmodal effects in the perception of auditory expressivity observed in a previous study (Vuoskoski et al., 2014) may be attributed to the effect of visual cues on perceived loudness (rather than tempo) variability. In order to tease out the relative contributions of timing and loudness variability, as well as the effects of visual kinematic cues, on perceived auditory expressivity in more detail, future studies could apply time-warping algorithms to MIDI data as well.
In the case of both experiments, there seemed to be a clearer difference between the Deadpan and Normal performance types than between the Normal and Exaggerated performance types. This is in line with the findings of Vuoskoski et al. (2014), as well as those of Davidson (1993) suggesting that performers may find it easier to "withhold expression from the piece than exaggerate the expressivity of a piece beyond its normal level" (Davidson, 1993, p. 109). It should also be noted that performers can differ greatly in terms of how much they move while performing, as well as how much loudness and timing variation they use when communicating their expressive intentions. Indeed, this was the case in the present study, where Pianist 1 displayed more movement variability, whereas Pianist 2 exhibited more tempo variability. Although the effects observed in this study were consistent across pianists (as evidenced by the lack of interaction effects related to Pianist), it may be that the relative contributions of auditory and visual kinematic performance cues vary across different pianists, especially in the case of more extreme performance styles. Indeed, differences in expressive efficacy, between different performers and between different instruments, may explain the contrasting findings observed in the present study and a previous study by Vines et al. (2011), where different expressive intentions led to differing emotional reactions only in the audiovisual and video-only conditions, but not in the audio-only condition. However, it might also be argued that the pianists included in this study (music students rather than professional concert pianists) utilize more conventional (i.e., less idiosyncratic) expressive devices in their performances, and thus represent the majority of musicians better than do professional concert pianists.
In conclusion, the results of the two experiments reported here demonstrate that visual information about a performer's movements not only has an impact on the intensity of emotional reactions evoked by the performance, but can also change how that performance sounds to an observer. The study has shown that visual performance cues may be just as important as auditory performance cues in terms of the subjective emotional experience of the observer, suggesting that non-auditory cues may contribute more to music-induced emotions than has previously been established. These results confirm the significant role of visual kinematic cues for audience members, and encourage further investigations into the ways in which visual information may interact with auditory information in our perception and experience of a musical performance.
Tempo variability reflects the standard deviation of the divergence from mean tempo, calculated for each eighth note. Root-mean-square energy reflects the mean loudness (and loudness variability) of the audio excerpts. RMS values were calculated for 500 millisecond segments. Amount of movement indicates the total distance travelled by the motion capture markers. *RMS values and standard deviations have been multiplied by 1000.
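The two audio measures described in this note could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes that local tempo is derived from the inter-onset interval of each eighth note, that divergence is expressed as a percentage of the mean local tempo, and that RMS is taken over non-overlapping 500 ms windows. The function names and input formats are hypothetical.

```python
import math
import statistics


def tempo_variability_percent(eighth_note_onsets):
    """SD of the percentage divergence from mean tempo, per eighth note.

    eighth_note_onsets: onset times in seconds, one per eighth note.
    """
    iois = [b - a for a, b in zip(eighth_note_onsets, eighth_note_onsets[1:])]
    # Local tempo in quarter-note beats per minute: 60 / (2 * eighth-note IOI)
    tempi = [30.0 / ioi for ioi in iois]
    mean_tempo = statistics.fmean(tempi)
    divergence = [100.0 * (t - mean_tempo) / mean_tempo for t in tempi]
    return statistics.pstdev(divergence)


def segment_rms(samples, sample_rate, segment_seconds=0.5):
    """RMS energy of consecutive non-overlapping segments (default 500 ms)."""
    size = int(sample_rate * segment_seconds)
    if size < 1:
        raise ValueError("segment is shorter than one sample")
    values = []
    for start in range(0, len(samples) - size + 1, size):
        segment = samples[start:start + size]
        values.append(math.sqrt(sum(x * x for x in segment) / size))
    return values
```

The mean and standard deviation of the list returned by `segment_rms` would then correspond to the mean RMS and its SD; a perfectly even performance would yield a tempo variability of zero under these assumptions.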

FIGURE 1. A sample frame of the point-light animations used in Experiments 1 and 2.
and visual modalities with regard to the emotional impact induced by the audiovisual performance excerpts, a repeated-measures ANOVA was conducted. There were three within-participant factors in the ANOVA: Type of Audio, Type of Video, and Pianist. The analysis yielded significant main effects of Type of Audio: F(2, 36) = 11.22, p < .001, η²_G = .10; and Type of Video: F(2, 36) = 9.12, p < .001, η²_G = .09. The mean ratings (grouped by Type of Audio and Type of Video) are displayed in Figure 3. Multiple comparisons of means (paired t-tests, p < .05 significance level adjusted using the Holm-Bonferroni method) revealed that all three types of audio were significantly different from each other, with the Deadpan condition receiving the lowest and the Exaggerated condition receiving the highest ratings of felt emotional impact. Regarding the different

FIGURE 2. The mean ratings of emotional impact (+ standard error of the mean) obtained in the unimodal audio-only and video-only conditions of Experiment 1. The ratings have been scaled to a range of 0-100.

FIGURE 3. The mean ratings of felt emotional impact (+ standard error of the mean) obtained in the audiovisual condition of Experiment 1, grouped by Type of Audio and Type of Video. The ratings have been scaled to a range of 0-100.

FIGURE 4. The mean ratings of loudness and tempo variability (+ standard error of the mean) obtained in the unimodal audio-only and video-only conditions of Experiment 2. The ratings have been scaled to a range of 0-100.

FIGURE 5. The mean ratings of loudness variability (+ standard error of the mean) obtained in the audiovisual rating condition of Experiment 2, grouped by Type of Audio and Type of Video. The ratings have been scaled to a range of 0-100.

FIGURE 6. The mean ratings of tempo variability (+ standard error of the mean) obtained in the audiovisual rating condition in Experiment 2, grouped by Type of Audio and Type of Video. The ratings have been scaled to a range of 0-100.

TABLE 1. Mean Tempo, Tempo Variation, Mean Root-mean-square Energy, and Total Amount of Movement in the Six Performance Excerpts
Performance type | Mean tempo (bpm) | Tempo variability (%) | Mean RMS (SD)* | Amount of movement (m)