A data intensive approach for characterizing speech interpersonal dynamics in natural conversations
Under the direction of Laurent Prévot (LPL) and Benoit Favre (LIS)
Jury members:
Prof. Julia Hirschberg, Columbia University
Prof. Giuseppe Riccardi, Università degli Studi di Trento
Prof. Stefan Benus, Constantine the Philosopher University
Cr. Roxane Bertrand, LPL
Director: Prof. Laurent Prévot, LPL
Co-Director: Prof. Benoit Favre, LIS
Abstract:
During a conversation, participants tend to adjust their communicative production, consciously or not, to that of their interlocutor. It is generally accepted that, under standard circumstances, this adjustment results in convergence of the two participants’ speech parameters.
The literature contains many studies describing the effects of convergence on interpersonal dynamics, but some aspects remain unclear.
The first concerns the mechanisms that govern the phenomenon in natural conversations, which are hard to study because the spontaneous speech of the participants is noisy and variable. The second is that, in this kind of conversation, it is still not well understood how participants modify their speech style over the course of the conversation (i.e., the dynamics).
In this thesis, we aim to validate previous results on acoustic-prosodic convergence and to provide novel approaches for applying a partial a posteriori filter to natural conversations and for tracking interpersonal dynamics.
We first perform a replication study on speech rate, confirming that a speaker's speech rate over the entire conversation converges toward the interlocutor's speech rate baseline (the interlocutor's average speech rate in other conversations), even when the analysis is performed on smaller subsets of the original dataset. On the other hand, we observe that convergence effects become less reliable in magnitude and significance as the size of the dataset is reduced.
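The interlocutor baseline mentioned above (a participant's average speech rate in their other conversations) can be illustrated with a minimal sketch. The data layout, the function name `baseline_rate`, and the toy values are assumptions made for illustration only; they are not taken from the thesis, and the statistical model used in the replication is not reproduced here.

```python
from statistics import mean

def baseline_rate(rates, participant, exclude_conversation):
    """Average speech rate of `participant` over all conversations
    except `exclude_conversation`.

    `rates` maps (conversation_id, participant_id) to a speech rate
    (hypothetical layout, e.g. syllables per second)."""
    others = [
        rate
        for (conversation, who), rate in rates.items()
        if who == participant and conversation != exclude_conversation
    ]
    if not others:
        raise ValueError("participant has no other conversation")
    return mean(others)

# Toy data: participant B talks to A in c1, and to others in c2 and c3.
rates = {
    ("c1", "A"): 5.0,
    ("c1", "B"): 6.0,
    ("c2", "B"): 5.0,
    ("c3", "B"): 5.4,
}

# Baseline of interlocutor B for conversation c1, computed from c2 and c3 only.
b_baseline = baseline_rate(rates, "B", exclude_conversation="c1")
```

Excluding the conversation under study keeps the baseline independent of any adaptation that takes place within that conversation.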
In the second part, we explore the dynamics of convergence effects by comparing, between speaker and interlocutor, the distances of average acoustic-prosodic features in the two halves of each conversation (intervals of the same temporal length). The results show that both energy and speech rate converge in the second half of the conversations in the corpus. In addition, we extend this approach by proposing to study natural conversations by comparing similar speech activities. This approach has the advantage of providing a posteriori control over the natural flow of the speaker and interlocutor in spontaneous conversation. We observe that comparing speaker and interlocutor within more homogeneous speech activities yields convergence effects even though the sample is much smaller than the uncontrolled dataset. Building on this idea, the thesis proposes a way to automatically tag speech activities in unlabelled data of this kind using a recent LSTM network for classification.
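The half-by-half comparison described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the representation of measurements as (time, value) pairs, the absolute difference as distance, and the names `half_means` and `convergence_delta` are hypothetical and not taken from the thesis.

```python
from statistics import mean

def half_means(samples, duration):
    """Mean feature value in each half of a conversation.

    `samples` is a list of (time_in_seconds, value) pairs for one
    participant; the conversation is split at duration / 2."""
    midpoint = duration / 2.0
    first = [value for time, value in samples if time < midpoint]
    second = [value for time, value in samples if time >= midpoint]
    return mean(first), mean(second)

def convergence_delta(speaker, interlocutor, duration):
    """First-half distance minus second-half distance.

    A positive value means the two participants are closer in the
    second half, which is read here as convergence."""
    s_first, s_second = half_means(speaker, duration)
    i_first, i_second = half_means(interlocutor, duration)
    return abs(s_first - i_first) - abs(s_second - i_second)

# Toy speech-rate measurements over a 120-second conversation.
speaker = [(5, 4.0), (20, 4.2), (70, 5.0), (100, 5.2)]
interlocutor = [(10, 6.0), (40, 6.2), (80, 5.6), (110, 5.4)]

delta = convergence_delta(speaker, interlocutor, duration=120)
```

In this toy example the distance shrinks from the first half to the second, so `delta` is positive. The same computation could be restricted to segments carrying the same speech activity label to obtain the more homogeneous comparison.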
Besides measuring distances between speaker and interlocutor, we propose a prediction-based classification paradigm to explore the positions of the speaker and interlocutor in the second half of the conversation.
Using a Random Classifier, we relate the inclusion of linguistic variables that describe the trend of the speaker's and interlocutor's speech style, together with profile information, to the increase in accuracy when predicting the speech rate variation in the second half of the conversation.
In the last part, we study the dynamics in finer-grained segments of the conversations. The goal is to predict the mean values of variables (energy, F0 range and speech rate) in the upcoming turn from the history of previous turns, which includes speech style and lexical information. The results, obtained separately with an LSTM and with an LSTM with a word embedding layer, show that using the speech style of both the interlocutor and the speaker in the previous turns reduces the prediction error for the upcoming turn compared with using only the speaker's past turns.
These results extend the landscape of convergence effects to uncontrolled datasets and offer novel approaches: a method to control the variability of natural conversations, and a prediction task paradigm for evaluating interpersonal dynamics, which consists in assessing the influence of the speaker and the interlocutor on each other's speech style.