You Can Help Perfect Transcriptions in the British National Corpus
Updated: Dec 15, 2022
This blog post is based on the "Perfecting the Audio BNC" project on the LanguageARC website. To participate, click here.
What is the British National Corpus (BNC)?
The British National Corpus (BNC) is a collection of about 100 million words of written and spoken British English, drawn from a variety of contexts such as phone calls, business meetings, and newspaper snippets. Work on the BNC began in 1991 and was completed in 1994, and two additional sub-corpora, BNC Sampler and BNC Baby, have also been released. Its purpose is to represent the diversity within British English, showing how language use differs by age, social class, and location, among other factors. Written sources make up about 90% of the BNC, and spoken sources make up the remaining 10%. The corpus is encoded using TEI (Text Encoding Initiative), a standard method of digitally encoding text. The BNC has inspired the creation of language corpora in other countries.
John Coleman: Mining a Year of Speech
John Coleman, a professor of Phonetics at the University of Oxford, contributed to the development of an audio edition of the BNC. In a short YouTube video, he explains why audio samples of natural speech are important to the field of linguistics. He and many other linguists believe that more researchers should make the effort to collect audio data of spoken language. However, because natural speech is impossible to control or predict, many linguists tend to build corpora from more structured, written, or scripted language instead. Coleman's motivation for contributing to the audio edition of the BNC was the desire to create a resource with audio data of natural spoken language.
Perfecting the Audio BNC
The BNC’s speech samples are transcribed and time-stamped by a computer to indicate where each word or phrase is uttered in the audio recording. However, these automatic transcriptions and time stamps are not always accurate, and human expertise is needed to perfect them. The BNC audio samples are already freely available to the public, and your contributions will help improve this resource.
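For readers curious about what "time-stamped" means in practice, here is a minimal Python sketch. It assumes each word is stored with a start and end time in seconds; the `AlignedWord` class, the `words_in_clip` function, and the sample timings are all made up for illustration and are not the BNC's actual data format.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One transcribed word with hypothetical time stamps (in seconds)."""
    text: str
    start: float
    end: float


def words_in_clip(words, clip_start, clip_end):
    """Return the words whose time span overlaps the clip [clip_start, clip_end)."""
    return [w for w in words if w.start < clip_end and w.end > clip_start]


# Invented alignment, for illustration only
alignment = [
    AlignedWord("good", 0.00, 0.30),
    AlignedWord("morning", 0.30, 0.85),
    AlignedWord("everyone", 0.90, 1.60),
]

clip_words = words_in_clip(alignment, 0.25, 0.88)
print(" ".join(w.text for w in clip_words))
```

If a time stamp is off by even a fraction of a second, a clip can pick up the wrong words or cut one off, which is the kind of mismatch human listeners are asked to catch.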
To contribute, head to the LanguageARC website, click on the "Projects" tab, and open "Perfecting the Audio BNC". There, you will evaluate how well each transcription matches its audio clip by choosing between three levels of accuracy.
Burnard, L. (n.d.). What is the BNC? British National Corpus.