This blog post is based on the project "Discovering Grammar Through Translation" on LanguageARC. To contribute, click here.
Creating an Elicitation Corpus
The "Discovering Grammar Through Translation" project on LanguageARC calls for bilingual individuals who are fluent in English and another language. Participants are given an English sentence with context and asked to translate it into another language that they speak. The translations reveal how each language expresses features such as plurality, tense, and gender, and this data helps researchers develop multilingual elicitation corpora. The sentences for the task come from an English-language corpus created at Carnegie Mellon University’s Language Technologies Institute.
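To make the idea concrete, here is a minimal sketch of how a pair of near-identical English prompts can expose a grammatical feature in the translations. The class name, field names, and the Spanish example sentences are illustrative assumptions, not the project's actual data format or tooling.

```python
from dataclasses import dataclass


@dataclass
class ElicitedPair:
    """Two English prompts that differ in one feature, plus their translations.

    Hypothetical structure for illustration only.
    """
    feature: str
    english_a: str
    english_b: str
    translation_a: str
    translation_b: str


def tokens(sentence):
    # Lowercase and strip simple punctuation so "duerme." matches "duerme".
    return [w.strip(".,!?").lower() for w in sentence.split()]


def changed_tokens(pair):
    """Return the words that appear in only one of the two translations."""
    a = tokens(pair.translation_a)
    b = tokens(pair.translation_b)
    only_a = [t for t in a if t not in b]
    only_b = [t for t in b if t not in a]
    return only_a, only_b


# Illustrative example: singular vs. plural, translated into Spanish.
pair = ElicitedPair(
    feature="number",
    english_a="The dog sleeps.",
    english_b="The dogs sleep.",
    translation_a="El perro duerme.",
    translation_b="Los perros duermen.",
)

print(changed_tokens(pair))
```

In this example every word changes between the two translations, which suggests that number is marked on the article, the noun, and the verb. A real analysis would need far more than one pair, and a simple word comparison like this would miss features expressed through word order or tone.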
Machine Translation, Citizen Scientists, and Elicitation Corpora
The demand for machine translation, or translation done by computers, is rapidly increasing. Experts in computational linguistics and software play a key part in developing machine translation technologies, but people who speak multiple languages fluently are also needed to help create the multilingual corpora that support these technologies. These individuals can take part in controlled elicitation, in which participants translate sentences from an existing corpus, such as a grammar book, into another language. Many people consider themselves bilingual or multilingual, and they can contribute as citizen scientists to projects like "Discovering Grammar Through Translation" without being experts in machine translation technology.
Machine Translation Using Elicited Corpora
Some projects, such as Nirenburg (1998), Sheremetyeva & Nirenburg (2000), and Jones & Havrilla (1998), have used controlled elicitation corpora and their translations to "teach" grammatical features such as plurality, gender, and tense to machine translation technology. In Jones & Havrilla (1998), glosses are used to annotate the elicitations and translations, which provides a way to organize the data and use it effectively when developing machine translation technology. However, creating word-for-word translations and keeping word boundaries consistent across languages can be extremely difficult when developing multilingual elicitation corpora, because languages differ in grammar and structure. Despite this, linguists continue to collect data from speakers of different languages, especially underrepresented languages, in order to understand unique linguistic features and develop machine translation technologies.
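As a rough illustration of what a glossed translation might look like as data, the sketch below stores one gloss per target-language word and collects the grammatical tags that appear. The class names, the tag abbreviations (loosely in the style of common interlinear glossing conventions), and the Spanish example are assumptions made for this post. They are not the annotation format used by Jones & Havrilla (1998) or by the LanguageARC project.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class GlossedToken:
    form: str   # the word as written in the target language
    gloss: str  # hypothetical label, e.g. "dog-M.PL"


@dataclass
class GlossedTranslation:
    english: str
    tokens: List[GlossedToken]

    def features(self) -> Set[str]:
        """Collect the upper-case grammatical tags found in the glosses."""
        tags = set()
        for tok in self.tokens:
            # Treat "-" and "." as separators between gloss parts.
            for part in tok.gloss.replace(".", "-").split("-"):
                if part.isupper():
                    tags.add(part)
        return tags


# Illustrative record: an English prompt and a glossed Spanish translation.
record = GlossedTranslation(
    english="The dogs slept.",
    tokens=[
        GlossedToken("Los", "the-M.PL"),
        GlossedToken("perros", "dog-M.PL"),
        GlossedToken("durmieron", "sleep-PST.3PL"),
    ],
)

print(sorted(record.features()))
```

Here the glosses are attached to the words of the translation rather than to the English words, so the record does not assume a one-to-one match between English and target-language word boundaries. That sidesteps the alignment problem in this toy example but does not solve it.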
Probst, K., & Levin, L. (n.d.). Challenges in Automated Elicitation of a Controlled Bilingual Corpus. Carnegie Mellon University.