Studying Spontaneous Speech Production
This blog post was inspired by the "Diapix Task" that is located within the "How English Varies" project on LanguageARC. To participate, click here.
Spontaneous speech is produced naturally and in the moment. It is how we converse with other people every day, and it differs from prepared language such as documents, articles, and blog posts. Spoken language, especially spontaneous or conversational speech, contains disfluencies such as pause fillers ("um" or "uh"), false starts and repetitions, fragments, and interjections. This kind of speech is valuable to linguists who want to understand how people actually use language day to day, as opposed to carefully planned, prepared, or written language. It is also valuable for developing language technologies such as speech recognition software, where the input is far more likely to be spontaneous speech than carefully prepared speech. Because it is so relevant and useful in our lives, linguists continue to look for new and effective ways to elicit spontaneous speech from participants and study it. This is a tricky task: the speech cannot be directed or influenced by the researcher, or it would no longer be spontaneous. Below is a timeline of research studies designed to capture spontaneous speech, along with their successes and drawbacks.
1983- Map Task (Brown, Anderson, Yule, & Shillcock)
This task involves two participants: an 'instruction giver', who communicates the details of a map route using locations featured on the map, and an 'instruction follower', who does not know the route and whose map contains slightly different elements. The goal of the task is for the instruction follower to correctly work out the route and end location. The task is well suited to eliciting spontaneous speech involving locations and directional vocabulary, and it gives participants the opportunity to speak freely while completing it. Generally, when both participants speak an equal amount, the instruction follower finds the target location faster than when one participant dominates the conversation (Forsyth et al., 2011). The original map task used 16 pairs of maps, but there have been variations and extensions of the task, such as increasing the number of map pairs and running it in languages other than English.
Rothwell et al., ACM Digital Library
1994- Crawford, Brown, Cooke, and Green
In this study, two pairs of people each worked together on a crossword puzzle and were instructed to talk to each other as they did so. Researchers recorded each pair, and because crossword clues span many different topics, participants produced a large quantity of varied, spontaneous speech. However, solving crossword puzzles requires skill and some degree of knowledge about many subjects, both of which vary between people; this lack of a constant in the study can lead to inconsistent results.
2002- Boston Directions Corpus
In this study, participants were given a real-world location and asked to provide directions to it for a silent partner. Researchers were able to collect a large quantity of directional speech data from the speaking participants.
2010- Cooke and Lu
This study involved both single participants talking aloud to themselves while solving a sudoku puzzle and pairs of participants communicating to solve one together. Pairs were separated by a visual barrier and subjected to the same conditions, such as background noise and the appearance of the room. According to Cooke and Lu, "easy puzzles led to less need for communication (or speech production for single participants), while more difficult exemplars produced longer pauses and less interaction (or speech produced)". As with Crawford et al. (1994), skill variation between participants may have affected the amount or type of speech produced while solving the puzzle.
2011- DiapixUK (Baker & Hazan)
This study involved two people communicating to find the differences between two versions of the same picture. Each participant had their own picture and was not allowed to show it to the other; they worked together to identify what was unique about their own picture compared with the other one. The task used 12 pairs of pictures, and participants were not required to have done this sort of task before or to have a particular skill set. The task has been performed many times, and a wide variety of speech data can be collected from it. For example, "global acoustic-phonetic characteristics, such as fundamental frequency range and mean, vowel space and speech rate, or discourse functions, such as uses of backchannels, hedges, and so forth [are able to be studied]. To also allow for segmental speech analysis, the original Diapix task had differences that were based around keywords containing a selection of vowels" (Baker & Hazan, 2011).
The participants in the original DiapixUK study in 2010 were people from the University of London who were native speakers of Southern British English. Ages ranged from 19 to 29, and participants were placed in same-gender pairs (10 pairs of female participants and 10 pairs of male participants).
DiapixUK image pairs
Diapix Task on LanguageARC
LanguageARC hosts an online version of the Diapix task that is both similar to and different from the DiapixUK task. The same picture pairs are used, but instead of two participants collaborating to identify the differences, a single participant records themselves spontaneously identifying the differences between the pictures. Speech data can still be collected, but the resulting speech is more of a monologue than a dialogue. As in the DiapixUK task, there is no preparation or rehearsal of speech in the LanguageARC Diapix task. To contribute speech to this online Diapix task, head over to the "How English Varies" project on LanguageARC.
References

Baker, R., & Hazan, V. (2011, March 18). DiapixUK: Task Materials for the Elicitation of Multiple Spontaneous Speech Dialogs. Psychonomic Society. Retrieved October 24, 2022.

Cooke, M., & Lu, Y. (2010). Spectral and Temporal Changes to Speech Produced in the Presence of Energetic and Informational Maskers. Acoustical Society of America. Retrieved October 24, 2022.

Rothwell, C. D., Shalin, V. L., & Romigh, G. D. (2021, April 1). Comparison of Common Ground Models for Human-Computer Dialogue: Evidence for Audience Design. ACM Digital Library. Retrieved October 24, 2022.