Natural Language Processing (NLP) has developed the primary tools for how we interact with computers, yet these tools are available only for English and a few ‘central’, mostly European languages. This has led to a situation where many people around the world cannot interact with computers in their own language. With the rise of social media and the expansion of technologies into developing markets around the world, the Web is becoming increasingly linguistically diverse, and this creates new commercial and academic possibilities.
Deep learning has recently shifted the field towards language-independent representations of language, but it has been assumed to be unsuitable for minority and historical languages, because these methods are highly data-hungry and such languages have only limited resources. However, both historical and minority languages have many closely related languages, and constructing single models for whole language families can overcome these resource gaps.
The Cardamom project will proceed on three fronts. First, we will collect resources from social media, the Web and academic sources in order to construct the largest corpus available for these languages. Second, we will develop a new model of language that creates word embeddings taking into account not only the literal form of words, but also phonetic and diachronic information. Finally, we will apply our model in two areas: we will build natural language processing tools targeting minority languages in Europe and India, and we will examine historical language usage to understand and track language change over time. This project will widen the scope of NLP to meet the needs of the billion speakers of minority languages worldwide and will advance digital humanities by directly studying the process of language change.