Global TIMIT Datasets
Updated: Feb 7
Global TIMIT is a project on LanguageARC that contains sentences and words from the original TIMIT dataset. This blog post explores the original TIMIT dataset and the related resources that the Linguistic Data Consortium (LDC) has distributed since.
What is TIMIT?
The original TIMIT is an acoustic-phonetic dataset developed in 1987 through a collaboration among Texas Instruments (TI), the Massachusetts Institute of Technology (MIT), and SRI International. The primary goal of the corpus design was to reflect different phonetic aspects of the English language. It consists of recordings of participants saying various words and sentences, along with time-aligned transcriptions of those recordings. There are 6,300 recorded utterances in the original TIMIT: 10 utterances spoken by each of 630 speakers. In text form, these utterances make up 2,342 sentences containing 6,099 distinct words. Although TIMIT was created about three decades ago, it is still used in areas such as speech recognition, speech synthesis, diagnosis of speech and language disorders, and linguistic phonetics.
Expanding TIMIT to Other Languages
After the original TIMIT proved useful for English, researchers wanted to create similar datasets for other languages, including Thai (the “THAIMIT” project), Mandarin Chinese (the “CHIMIT” project), the Guanzhong dialect of Mandarin, the Ga language, and others. There are also plans to create new TIMIT datasets in other languages such as Swedish, Italian, and French.
To create datasets in other languages, researchers developed a method that kept what they believed to be the successful features of the original TIMIT. They wanted to keep the basic concept of having fluent speakers of the language read a large number of sentences. The audio data from speakers would represent phonetic aspects of the target language, as well as other linguistic aspects such as semantic and pragmatic patterns. Researchers decided that recruiting over 600 speakers, as was done for the original TIMIT dataset, would be difficult, but they estimated that they could achieve similar results by having a smaller number of speakers each produce a large number of words and sentences.
Researchers also had to gather content for participants to read. They drew sentences from sources such as Wikipedia and newspapers, then screened each one for suitability based on its length and word makeup. From the sentences deemed suitable, about 3,200 were chosen for speakers to read. The transcription process posed some difficulties because of the lack of word boundaries in languages such as Mandarin and Thai.
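For readers curious how a screening step like this might look in practice, here is a minimal sketch. It is not the procedure the Global TIMIT researchers actually used: the function name `select_sentences`, the length limits, and the reading of "word makeup" as "adds at least one word not yet covered" are all assumptions made for illustration. It also splits words with a simple letter-matching pattern, which would not work as-is for languages such as Mandarin or Thai.

```python
import re


def select_sentences(candidates, min_words=5, max_words=15, target=3200):
    """Hypothetical screening pass over candidate sentences.

    Keeps a sentence only if its word count falls within
    [min_words, max_words] and it contributes at least one word that
    no previously selected sentence contains. Stops once `target`
    sentences have been chosen. All thresholds are illustrative.
    """
    seen_words = set()
    selected = []
    for sentence in candidates:
        # Naive tokenization: runs of letters, lowercased.
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        if not (min_words <= len(words) <= max_words):
            continue  # too short or too long to read comfortably
        if not (set(words) - seen_words):
            continue  # adds no new vocabulary
        selected.append(sentence)
        seen_words.update(words)
        if len(selected) >= target:
            break
    return selected


candidates = [
    "The cat sat on the mat today.",
    "Too short.",
    "The cat sat on the mat today.",
    "A quick brown fox jumps over lazy dogs.",
]
chosen = select_sentences(candidates, min_words=5, max_words=10, target=2)
```

In this toy run, the two-word sentence is dropped for length and the repeated sentence is dropped because it adds no new words, leaving the first and last candidates.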
To date, the Linguistic Data Consortium has released five TIMIT corpora in various languages, with a half dozen more in the publication pipeline to be released over the next couple of years. The streamlined data collection and corpus creation processes have reduced the cost of creating these datasets while maintaining the usefulness of the data. LDC plans to continue facilitating the creation and publication of additional TIMIT datasets for a variety of languages.
Head to the LanguageARC website to participate in Global TIMIT and other citizen science language projects. You can also find LanguageARC on Facebook, Twitter, Instagram, and YouTube. Look forward to more blog updates in the future.