Is Language Special? Anticipation Timing Accuracy of End of Turns in Known and Unknown Languages

Structured acoustic signaling between two individuals is usually organized into turns to avoid interference. Turn-taking in human communication is a precise system that enables interlocutors to interact very efficiently. Previous studies have identified criteria that allow for optimized timing within a conversation. For instance, lexico-syntax seems to be of outstanding relevance. Other aspects still under consideration in this context are prosody and rhythm, among others. In the current study, we focused on the question of whether language carries universal acoustic features which might make turn-taking in human communication uniquely efficient in contrast to, e.g., 'turn-taking' in animals. We aimed to assess how language-specific properties other than content and grammatical structure affect anticipation performance. Therefore, we contrasted the Anticipation Timing Accuracy (ATA) for mother-tongue stimuli in German, for items in six foreign languages (English, Italian, Polish, Turkish, Arabic, and Korean) and for simple sinusoidal tones. Results showed significant differences between the ATAs of the foreign language stimuli. German subjects anticipated the ends of utterances in Indo-European languages and in stress-timed languages (German, English, Arabic) significantly better than the ends of items in non-Indo-European languages and in syllable-timed languages (Italian, Polish, Turkish, Korean; restrictions apply). We conclude that interlocutors' end-of-utterance anticipation performance is influenced by language-inherent universal acoustic features.


Introduction
The efficiency of intraspecific communication is strongly influenced by the respective communication channel. The amount of information mediated in a certain amount of time may thus vary depending on whether a chemical, acoustic, or visual channel is used. Besides the sequentiality of the respective channel, the internal timeline of a dialogue between two individuals is relevant for the efficiency of communication. Whereas interlocutors may both send visual signals at the same time if a visual channel is used, this opportunity is missing if only an acoustic channel is available. Complex acoustic signals with a duration of up to several seconds require a deferred use of the communication channel in order to avoid interruptions or interference. Interlocutors therefore attend to a turn structure. In human communication, a turn is not to be confused with a sentence [1]. Rather, a turn is defined as comprising all the utterances of a speaker until the listener takes over the conversation, that is, takes the turn. Turn-taking is thus defined as the transition from one turn to the other, as the transfer of speech from one speaker to the next [2]. This turn-taking may not occur at random points in the speaker's turn but only at transition relevant places [1]. In order to achieve high temporal efficiency in communication, listeners have to anticipate these transition relevant places and thus the end of a perceived speaker's turn. Human communication thus demands a great deal of interlocutors. However, complex birdsong shows a structured succession of sound signals between two or more individuals as well, which may be categorized as chorus, duet, or antiphony [3].
As suggested by the turn-taking system developed by Sacks et al. [4], dialogue partners show accurate timing when they interact communicatively. This turn-taking system has been repeatedly discussed during the last 40 years. As a result, the projectionists' point of view seems to have been confirmed. From their perspective, the precise speaker changes occurring in everyday conversations are possible because recipients anticipate the end of the speaker's turn and, therefore, know when he or she will stop talking. But what is this anticipation process based on? It seems that lexico-syntactic characteristics of an utterance are of particular importance [5][6][7]. Additionally, the relevance of some prosodic aspects and suprasegmental features is supported by the results of a series of studies [8][9][10]. Wells and Macfarlane [8], Heldner et al. [9], and Dombrowski and Niebuhr [11] found evidence for the relevance of the last major accent and specific F0 patterns in turn-end anticipation in corpora of different languages (Swedish, English, and German). Other prosodic cues that might be important for a successful anticipation process are the boundary tone of the turn end (Barkhuysen et al. [12]: Dutch), the speaking rate, the intensity level, specific voice quality features (Gravano & Hirschberg [10]: American English), and an oscillatory speech rate (Couper-Kuhlen [13]: English; Auer et al. [14]: English, German, Italian; Beňuš et al. [15]: American English; Stivers et al. [16]: various languages; Wilson & Wilson [17]). Wilson and Wilson [17] suggested that the precise turn switches in conversation are possible because interlocutors align their cognitive cyclic patterns, that is, "an oscillatory function of readiness to initiate speech" (p. 962). They further stated that the speaker's syllable or speech rate is decisive for the frequency of the oscillation. Wagner et al. [18] likewise assumed that prosodic features of an utterance are rhythmically organized so that they follow a regular oscillation pattern. This pattern would simplify interlocutors' cognitive processing and interaction in communication and give them the opportunity to entrain their speech pattern (Inden et al. [19]: German). Thus, next speakers match their turn onset to the entrained speech rate, achieving very efficient timing in turn-taking [13,14,17]. A third group of researchers argues that several aspects, such as semantics, syntax, prosody, rhythm, gesture, context, gaze, and facial expression, are jointly important for successful turn-taking processes (Ford & Thompson [20]: American English; Selting [21]: German).
This last approach implies that human communication is based not only on vocal but also on visual cues. Interlocutors are able to communicate by gaze or gesture in parallel to speech and thus indicate interest, repair, or their intention to take over the turn. This is why the human turn-taking system is very efficient. In an electroencephalographic (EEG) study with German sentence stimuli, participants were instructed to press a button exactly at the moment an aurally presented sentence ended. A lateralized readiness potential related to the button press was observed 600 to 800 ms before the end of the sentences, indicating that participants were aware of the sentence end 600 to 800 ms before it actually occurred [22]. Although based only on aural cues, this early anticipation leaves plenty of time for the listener to prepare their own turn and allows the turn transition to be smooth. Interestingly, in a follow-up study by Wesselmeier et al. [23] using the same sentence stimuli, the readiness potential was observed as early as 1,400 ms before the end of a sentence but was disrupted for stimuli violating syntax or semantics. In those stimuli, the readiness potential occurred 900 ms before the end of the sentence. The findings of these studies suggest that upcoming speakers prepare for the end of their interlocutor's turn quite early, which supports an efficient turn-taking process [24]. In contrast to the possible parallel means of communication in humans, complex vocalization in animals may only occur sequentially. Parallel vocalization would preclude successful communication. As a consequence, complex vocalization takes much more time than human communication and is thus pressed for time.
It is still under debate which aspects carry how much weight when it comes to end-of-turn anticipation. In the current study, we aimed to find out whether language, even if it is unknown to the listener, carries cues in the form of certain signals which indicate utterance ends. Therefore, we compared the Anticipation Timing Accuracy (ATA) in six foreign languages, of which participants knew only English as their L2, and in pure sinusoidal tones varying in duration as a maximal contrast to natural linguistic stimuli. Anticipation processes are usually studied using behavioral methods. The ATA is an indicator of such conscious behavioral processes and shows how well subjects were able to anticipate the end of an utterance. The intention was to judge the importance for end-of-utterance anticipation of general acoustic features specific to natural speech other than content and grammar, such as language-universal suprasegmentals like pitch movement, final lengthening, and other potential cues.

Materials and Methods
Participants
Thirty-seven students (18 women, 19 men) of Bielefeld University with German as their native language participated in the experiment. Written consent was obtained for publication of this study. Participants had a mean age of 23.7 years (±2.9) and were right-handed with a mean lateralization quotient of 88.1 (±13.3) according to the Edinburgh Handedness Inventory [25]. According to their own accounts, participants did not suffer from any auditory or motor restrictions or diseases which could have influenced results.

Stimuli
Stimuli were either spoken sentences (161 total) or pure 450 Hz sinusoidal tones of different durations (10 total). Twenty-three spoken sentences were phrased in German and translated into English, Italian, Polish, Turkish, Arabic, and Korean by native speakers of the respective languages who were fluent in German. The spoken sentences were recorded in a sound-attenuated booth with the same native speakers who did the translation. Turkish and Arabic items were recorded with a male speaker; German, English, Italian, Polish, and Korean items were recorded with a female speaker.
All participants had good knowledge of English as their L2 (M = 10 years of school education); two subjects had had marginal contact with Italian and one with Polish. All were unfamiliar with Arabic, Turkish, and Korean. Except for German and English, all languages were judged as unknown foreign languages. The foreign language utterances were presented in order to test the influence of speech patterns and general language-specific properties, independent of semantics and syntax, on end-of-utterance anticipation. The idea was that the less participants could rely on semantic and syntactic content, the more they would have to anticipate the ends by use of the remaining language-universal suprasegmental features or even more general but typical aspects of natural speech signals. If they are not trained through everyday conversation to do so, the anticipation performance of participants should be as bad for the linguistic but incomprehensible stimuli as it is for the maximally non-linguistic sinusoidal tones, which do not contain any linguistic information at all. Other studies used low-pass filtering or hummed speech to remove semantic content and syntactic structure [6,26]. Since this technique makes the speech signal sound less like speech, we decided to use foreign language stimuli instead. These stimuli should have the same effect as hummed speech or filtered stimuli while remaining fully intact natural linguistic signals. Sentences had a mean length of 3591 ms and were of a simple syntactic structure (e.g., "Books and articles about fossils are very interesting." and "Newspapers are a source of many different kinds of information."). The ten sinusoidal tones were generated at 450 Hz. They varied in length from 2600 ms to 4400 ms, had a mean length of 3300 ms, and thus matched the length of the sentences. In total, there were 171 stimuli.
We not only checked for ATA differences between the languages but also classified them as either Indo-European (IE: German, English, Italian, Polish) or non-Indo-European (non-IE: Arabic, Turkish, Korean) [27,28]. These two groups were then compared to each other and to the sinusoidal tones.

Procedure
All items were presented to all participants in pseudo-randomized order in a within-subject design. Following randomization, the order was checked manually so that no sentence type was followed by the same type (for instance, no English utterance followed an English utterance).
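The ordering constraint described above can be illustrated with a short sketch. This is not the authors' procedure (they report a manual check after randomization); it is one possible way to generate such an order automatically, assuming the stimulus counts given in the Stimuli section. The function name `pseudo_randomize`, the retry strategy, and the seed are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def pseudo_randomize(type_counts, rng, max_tries=1000):
    """Return a list of stimulus-type labels in which no two neighbours match.

    type_counts maps a stimulus type to its number of items. Each attempt draws
    the next type at random (weighted by how many items of it remain) from all
    types except the one just presented; a dead end triggers a fresh attempt.
    """
    for _ in range(max_tries):
        remaining = Counter(type_counts)
        order = []
        while sum(remaining.values()) > 0:
            candidates = [t for t, n in remaining.items()
                          if n > 0 and (not order or t != order[-1])]
            if not candidates:
                break  # only the previous type is left: discard this attempt
            weights = [remaining[t] for t in candidates]
            chosen = rng.choices(candidates, weights=weights, k=1)[0]
            order.append(chosen)
            remaining[chosen] -= 1
        else:
            return order  # the loop finished without hitting a dead end
    raise RuntimeError("no valid order found; increase max_tries")


# Counts taken from the Stimuli section: 23 sentences per language, 10 tones.
languages = ["German", "English", "Italian", "Polish",
             "Turkish", "Arabic", "Korean"]
type_counts = {lang: 23 for lang in languages}
type_counts["tone"] = 10

order = pseudo_randomize(type_counts, random.Random(0))
```

The sketch orders stimulus types only; the individual items of each type would still have to be shuffled and assigned to the positions of that type.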
Prior to the experiment, participants were asked to fill out two questionnaires. The first asked for age, gender, course of study, and motor or sensory deficits (Appendix B). In the second, handedness was determined by the Edinburgh Handedness Inventory [25] (Appendix C). Participants practiced their task with six items (four German sentences, one foreign-language utterance, and one sinusoidal tone). Items were presented aurally via E-Prime (Psychology Software Tools, vers. 2.0) on a PC (Windows 7). Subjects listened to the items through headphones. Their task was to push a button on an external USB response box with an internal clock, using the forefinger of their right hand, at the exact moment the utterance ended [cf. 6]. If the button push occurred too early, the utterance was stopped immediately and the next stimulus began after an ISI of 1000 ms. Each item required a response in order for the subject to continue with the experiment. The external response box measured the ATA with an accuracy of about ±2 ms. The ATA was defined as the time span between the button push and the actual end of the utterance. After the procedure, participants' skills in the foreign languages used in the experiment were documented in a separate questionnaire (Appendix D). With regard to the outcome of the experiment, we asked whether it is possible to predict the ends of utterances in unknown foreign languages more accurately than the ends of sinusoidal tones.
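As a minimal illustration of the ATA definition given above, the following sketch computes the time span between a button push and the end of the utterance. The sign convention (negative values for early responses) and the function and parameter names are assumptions made here for illustration; the text defines only the time span itself, and the example times are invented.

```python
def anticipation_timing(press_ms, utterance_end_ms):
    """Return (signed_error, ata) in milliseconds.

    signed_error < 0 means the button was pressed before the utterance ended.
    ata is the unsigned time span between button press and utterance end.
    """
    signed_error = press_ms - utterance_end_ms
    return signed_error, abs(signed_error)


# Hypothetical responses to a sentence that ends 3591 ms after onset.
early = anticipation_timing(press_ms=3500, utterance_end_ms=3591)
late = anticipation_timing(press_ms=3650, utterance_end_ms=3591)
```

Under this reading, a smaller unsigned value corresponds to a more accurate anticipation of the utterance end.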

Results
The statistical analysis of the resulting ATA was done with SPSS (IBM, vers. 20) under Mac OS X. First, descriptive statistics for each sentence type were calculated. Extreme values, defined as values more than two standard deviations below or above the mean in each language, amounted to up to 3.1% of all valid responses and were excluded from the analysis. Further, four items were excluded from the analysis due to a low item-total correlation in the item reliability analysis (one German, one Turkish, one Arabic, and one Korean item). Subsequently, repeated measures ANOVAs were computed with ATA as the dependent variable and type of sentence as the independent variable, with the factor levels German, English, Italian, Polish, Turkish, Arabic, Korean, and tones. Bonferroni-corrected multiple comparisons were carried out post hoc.
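The exclusion rule for extreme values can be sketched as follows. This is an illustration rather than the SPSS procedure actually used: it assumes that the mean and the sample standard deviation are computed separately for each stimulus type, and the function names and the example data are hypothetical.

```python
from statistics import mean, stdev


def exclude_extremes(values, n_sd=2.0):
    """Keep only values within n_sd sample standard deviations of the mean."""
    m = mean(values)
    s = stdev(values)
    lower, upper = m - n_sd * s, m + n_sd * s
    return [v for v in values if lower <= v <= upper]


def exclude_by_type(ata_by_type, n_sd=2.0):
    """Apply the exclusion rule separately to each stimulus type."""
    return {stim_type: exclude_extremes(values, n_sd)
            for stim_type, values in ata_by_type.items()}


# Invented ATA values (ms) for one stimulus type; 600 is an extreme value.
sample = {"Turkish": [100, 110, 90, 105, 95, 100, 102, 98, 101, 600]}
cleaned = exclude_by_type(sample)
```

Because the criterion is applied per stimulus type, a response that is extreme for one language can still be within range for another.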
Detailed results are available in Appendix A. There were several notable differences between the ATAs related to the different languages (Figure 1). Subjects anticipated the ends of German items better than those of any other stimulus type in this category. Further, the ends of tones and of Turkish stimuli were anticipated about equally poorly, and worse than the ends of all other items. We tested the differences between the sentence types for their statistical significance. Since Mauchly's test was significant, the degrees of freedom were corrected by the Greenhouse-Geisser estimates of sphericity (ε = 0.49). Altogether, the factor levels had a highly significant impact on the ATA (F(3.42, 119.52) = 100.27, p < .001). The Bonferroni-corrected post-hoc comparisons showed that almost all item types differed significantly from one another (for p-values see Appendix A). The ATA of tones differed significantly from that of German (p < .001), Italian (p < .001), English (p < .001), and Arabic items (p < .001), as did the ATA of Turkish stimuli from that of German (p < .001), Italian (p < .001), English (p < .001), Polish (p = .001), Arabic (p < .001), and Korean items (p = .003) (see Figure 1 and Appendix A for further results). Categorizing the languages as either IE or non-IE and comparing them to the sinusoidal tones also revealed an overall highly significant effect (F(1.33, 46.65) = 98.35, p < .001). The post-hoc analysis showed a significant difference between the ATA on IE stimuli and on non-IE items (p < .001), on IE items and tones (p < .001), and on non-IE items and tones (p = .010). The ends of IE utterances were anticipated most accurately (Figure 2).
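The Greenhouse-Geisser correction mentioned above multiplies both degrees of freedom of the repeated measures ANOVA by the sphericity estimate ε. The sketch below illustrates this arithmetic. The number of subjects entering the analysis is not stated in the text; the value of 36 used here is an assumption inferred from the ratio of the reported degrees of freedom (119.52 / 3.42 and 46.65 / 1.33 are both close to 35), and the function name is hypothetical.

```python
def gg_corrected_df(k_levels, n_subjects, epsilon):
    """Greenhouse-Geisser corrected degrees of freedom for a one-way
    repeated measures ANOVA with k_levels factor levels."""
    df_effect = k_levels - 1
    df_error = df_effect * (n_subjects - 1)
    return epsilon * df_effect, epsilon * df_error


# Eight stimulus types with the reported epsilon of 0.49
# (n_subjects = 36 is an assumption, see above).
df1_types, df2_types = gg_corrected_df(k_levels=8, n_subjects=36, epsilon=0.49)

# Three groups (IE, non-IE, tones); epsilon taken as the reported 1.33 / 2.
df1_groups, df2_groups = gg_corrected_df(k_levels=3, n_subjects=36,
                                         epsilon=1.33 / 2)
```

With the rounded ε values, the sketch yields degrees of freedom close to, but not exactly matching, the reported ones; the remaining difference is consistent with rounding of ε.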

Discussion
The aim of the current experiment was to find out more about the influence of general language universals as part of a natural speech signal and of speech-specific articulatory features on utterance-end anticipation. The question was whether anticipation performance would be better for end-of-utterance detection in unknown foreign languages than for ends of sinusoidal tones. Judging from the questionnaire (Appendix D), skills in all languages used in the experiment except German and English were either completely lacking or too limited to influence the mean results in the current experiment.
Results showed that the ATAs of German, English, Italian, and Arabic items differed significantly from the ATA of tones. In contrast, there were no significant differences between the ATA of Polish, Turkish, and Korean items and the ATA of tones. It thus seems that there is something familiar in Italian and Arabic speech stimuli which enables German native speakers to anticipate the ends of utterances in these languages. In German and English, participants could rely on semantics and syntactic structure, which becomes apparent in the low ATAs. Based on this result, one can conclude that syntax and semantics have a great impact on anticipation performance. However, this does not explain the low ATAs for Italian items or the significant difference between Arabic stimuli and tones. In those languages, participants could have had access to suprasegmental characteristics of the utterances at most (and to single word semantics in the cases of Italian and Polish). However, since the ATA differences between unknown languages and tones were predominantly not significant, it can be concluded that suprasegmental characteristics alone are not sufficient for an appropriate anticipation performance concerning utterance ends if the languages are examined separately.
Grouping languages into IE utterances and non-IE items throws a somewhat different light on this outcome. Subjects anticipated the ends of IE utterances better than those of utterances from other language families. This might be due to the degree of similarity of single words or acoustic patterns in the related languages. Crucially, responses to sinusoidal tones were still significantly worse than to non-IE stimuli. Thus, if subjects did not have access to any semantic content or syntactic structure, as was the case for the non-IE items, they might have used other language-universal linguistic features. Those features must have been of an acoustic, suprasegmental, or stress-related origin. Features that might have been relevant in this context are the last major accent and specific F0 contours as identified in a number of corpus studies (Koiso et al. [29]: Japanese; Wells & Macfarlane [8]: English; Caspers [30]: Dutch; Heldner et al. [9]: Swedish). Wells and Macfarlane [8] compared the onsets of turn-competitive and non-competitive utterances by recipients in a natural conversation in English. They found that the onsets of the competitive turns were usually placed right before the last major accent [8]. Therefore, the syllable carrying the last major accent is supposed to allow recipients to anticipate the turn end, given that it holds unique phonetic features, which were identified in a second experiment. Wells and Macfarlane [8] found a certain order of prosodic features: a step up and a drift down of pitch right after the last major accent. Additionally, the last major accent is pronounced louder and lengthened in contrast to accents which do not indicate a possible place for a speaker change. These characteristics make the last major accent distinct and set it apart from other accents, making it possible for conversational partners to use it as an additional indicator to predict the end of a turn. In another corpus study conducted by Heldner et al. [9], results of earlier research concerning the relevance of F0 patterns in turn-taking could be replicated for Swedish. They detected rising and falling pitch contours before the end of the turn. Before pauses that did not represent a speaker change, they found a rather flat pitch contour. As for non-IE languages, Koiso et al. [29] found suprasegmental cues, like the duration of the final phonemes, peak F0, and peak energy, to be relevant for turn-taking in a Japanese corpus. All of these characteristics might also have been relevant for participants in the current study whenever they were not able to anticipate the utterance end by use of semantic or syntactic criteria.
A further characteristic of language that contrasts it with the sinusoidal tones is a speech-specific acoustic pattern, which might be used in end-of-utterance anticipation (Beňuš et al. [15]: American English). Auer et al. [14] define this pattern as the 'beat' of a language, which constitutes a rhythmic isochrony that is then used by interlocutors to fit their turn onsets into the beat pattern. Based on this, oscillator models attempting to explain the efficient timing of interlocutors have been and are currently being developed. The idea behind such a model is that both speaker and listener are prepared to initiate speech at the frequency of an oscillation based on the speaker's syllable rate [17]. If this were true, participants might not have been able to detect the ends of sinusoidal tones in the current study due to a missing acoustic pattern or missing syllable boundaries. In the past, there have been attempts to classify languages according to their stress pattern. The outcome was that languages tend to fall into one of three categories: they may be judged as either stress-timed, syllable-timed, or mora-timed [31]. This classification is based upon timing patterns in each language. Thus, stress-timed means that the time span between one stressed syllable and the next is always approximately of the same length, and the same applies for syllable- and mora-timed languages. Whereas German, English, and Arabic are stress-timed [31,32], the classification of the other languages used in the current study is disputed, or, rather, not trivial. Italian seems to be either stress- or syllable-timed, depending on the speaker's dialect. Taking this factor into consideration, our Italian stimuli should be classified as syllable-timed, since the native speaker came from Milan [31][32][33]. Polish is judged as definitely distinct from English [31,34] and Turkish as rather syllable-timed [35]. Korean is discussed as being either syllable-timed [36] or mora-timed [37]. Against the background of these attempts, one could expect that end-of-utterance anticipation is easier in languages with a stress pattern equal to that of the mother tongue. In the current study, this would apply to stress-timed languages. In fact, ATAs on stimuli in Arabic, which is one of the stress-timed languages judged as unknown in the current study, were significantly shorter than the ATAs on tones. ATAs on Polish, Turkish, and Korean utterances, which have a different stress pattern than German, did not differ significantly from ATAs on tones. This could be an indicator of the acoustic pattern being relevant for the anticipation of an utterance end, although the ATAs on Italian items do not seem to fit the pattern. One might conclude that an inability to make use of either syntax and semantics or a well-known acoustic pattern leads to an inadequate anticipation performance. This indicates that the acoustic pattern probably is a relevant suprasegmental characteristic in the anticipation of utterance ends.
Our results support the view that conversational partners consider several aspects when anticipating the upcoming end of a turn. This view is also taken by Ford and Thompson [20], who analyzed turn changes from a corpus of two natural conversations in English and found that 71% of all speaker changes could be attributed to syntactic and intonational completion in combination with pragmatic completion. The authors stated that a syntactic completion point is only interpreted as the end of a turn if it is further reinforced by the intonational contour and the pragmatic content of the utterance. Selting [21] agreed with this position and, based on her results from German conversations, claimed that syntactic and prosodic structures need to be considered equally in the analysis of end-of-turn projection. With the aim of improving a machine's detection of turn ends, Edlund and Heldner [38] also found that intonation patterns, as detected in turns of a Swedish Map Task corpus, helped the machine to judge whether or not a silent pause actually was a turn end. This also supports the view that several aspects are relevant for turn-end anticipation. Again, there are similar findings for Japanese [39].

Conclusions
As an answer to the research question posed in this study, language is special in the sense that it always provides information of some kind which supports efficient communication, even if only on a suprasegmental level.
We found a significant result showing that the ends of non-IE items could still be anticipated better than the ends of tones although subjects had no access to content or syntactic structure at all. In the current study, end-of-utterance anticipation was based only on aural cues. In combination with visual information, the outcome might have been even more prominent. It seems as if there is some kind of intrinsic feeling or intuition for speech which enables listeners to "gain access" to information in languages they do not know. This shows that speech and language are more than their parts, more than syntax, semantics, and prosody. They carry additional information as a whole, by way of expression, melody, and acoustic patterns, which we have not quite understood so far.
Appendix D. Questionnaire 3: The questionnaire asking subjects to indicate their language skills in the foreign languages that came up in the experiment.
Figure legend: Significant differences to sinusoidal tones are marked by black asterisks; significant differences to Arabic items are marked by red asterisks (*** = p ≤ .001, * = p ≤ .05). Error bars represent the 95% confidence interval.