Times are given in Irish Standard Time (IST), i.e., either UTC+0 or UTC+1 (Daylight Saving).

Cardamom Seminar Series #20 – Dr Ritesh Kumar (Dr Bhimrao Ambedkar University)

March 27, 2023 @ 5:00 pm - 6:00 pm IST

Field Linguistics and NLP: Can they be friends?

The Unit for Linguistic Data at the Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway is delighted to welcome Dr Ritesh Kumar, Assistant Professor at the Dr Bhimrao Ambedkar University, to be the next speaker in our seminar series. He will talk about bridging the gap between NLP and Field Linguistics.


Field Linguistics has traditionally focused on collecting specific specimens of the language under study, with the aim of discovering grammatical patterns and then producing grammatical descriptions, dictionaries, primers for children and other similar outputs. Given several practical challenges in collecting primary linguistic data from the field, including the recruitment of willing speakers, limited funding and the difficulty of eliciting relevant data, much effort has gone into designing prompts and elicitation methods that yield a large variety of grammatical structures in the language as quickly as possible. Within some traditions of linguistic fieldwork, immersion and learning the language are preferred over “elicitation” as the basis for writing grammatical descriptions, but the focus remains on producing those descriptions. Over the past couple of decades, with the increasing focus on “documenting” languages (especially endangered languages), attention has shifted from eliciting a variety of grammatical structures for descriptions to capturing linguistic data across all its domains of usage, so that the language can be preserved for posterity and used for its revitalisation by community members, researchers and other stakeholders.
Irrespective of how the data has been collected and what its final goals are, a few things are common across all of these exercises in primary data collection – (a) some kind of elicitation tool is generally used; (b) the data collected is highly diverse, both in the languages represented and in the grammatical structures and domains of use represented within each language; (c) the data is highly organised, accurately transcribed and marked with rich grammatical information (interlinear glossing) entirely by hand, which makes the process prolonged and time-consuming but yields very high-quality data; (d) given the effort required and the end goals in mind, the dataset is generally very small.

On the other hand, data collection efforts in NLP have traditionally focused on written documents and read speech in a few languages. While it is understood that the data should be representative of different domains (which mainly leads to lexical diversity), there is hardly any active effort, at least at the methodological level, to ensure grammatical diversity in the dataset, such that most, if not all, kinds of grammatical structures in a given language are included in the corpora. Whatever diversity one gets is generally incidental. It has been assumed that this imbalance of grammatical structures could be alleviated to a certain extent by collecting more naturalistic data (such as that collected from general web crawls and social media conversations). At the same time, there is growing concern about the huge imbalance in the number of languages represented in different NLP applications and resources. When a resource representing just around 2% of the world’s languages (around 100-150 out of the 6000-7000 languages spoken across the globe) is widely recognised as “massively” multilingual, we know that there is a long way to go. However, given that a majority of the languages available on the web have already been included in the Large Language Models, one of the most viable ways to scale up and include more languages in our datasets is to collect primary data from the field.

Given this, it is clear that the two fields could complement each other in data collection and possibly in the development and usage of language technologies, especially in the context of the underrepresented, underresourced, minoritised and endangered languages of the world. Field linguistics could benefit a lot from the tools and technologies provided by NLP for data collection, glossing and analysis; NLP, in turn, could make use of the field methods and resources provided by field linguists for collecting rich datasets from the field. In my talk, I will discuss some of our recent and ongoing work in this direction, wherein we are trying to build platforms and pipelines that could help researchers scale up linguistic field methods for the large-scale data collection required in NLP, and at the same time give field linguists a way to integrate NLP applications into their work.

About the Speaker:

Ritesh Kumar is an Assistant Professor of Computational Linguistics at the Department of Linguistics, Dr Bhimrao Ambedkar University, Agra and coordinator of the M.Sc program in Computational Linguistics at the Centre for Transdisciplinary Studies. He is also a Fellow of the Council for Strategic and Defence Research, New Delhi, where he established the specialised Division on Artificial Intelligence and Linguistics and has led it for several years. His research interests lie broadly at the intersection of pragmatics, sociolinguistics and computational linguistics. He has been working on the theoretical and computational modelling of politeness, impoliteness and aggression in language. His research in this field has been supported and funded by organisations like UKIERI, Microsoft Research and Facebook Research and has led to the development of tools and corpora for the automatic detection of aggression and offensive content in languages like Hindi, Bangla and Meitei. At the same time, he is deeply involved with the issues of language endangerment, documentation and revitalisation, and with technology and resource development for minoritised and endangered languages in India. He has been working on the development of language resources and technologies (such as ASR systems, POS taggers and language models) for various minoritised and endangered languages (such as Awadhi, Beda, Braj Bhasha, Magahi, Toto and Chokri, among others) and has also worked on the development of software tools and infrastructure (such as LiFE App and mScrabble) for supporting this kind of research. Recently he, along with his students, has also founded a startup, UnReaL-TecE, to undertake the development of consumer products for these languages. His research in this field has been generously funded by the Central Institute of Indian Languages, Karya Inc., Panlingua and the Ministry of Electronics and Information Technology, Government of India.


The seminar series is led by the Cardamom project team. The Cardamom project aims to close the resource gap for minority and under-resourced languages using deep-learning-based natural language processing (NLP) and by exploiting similarities between closely related languages. The project further extends this idea to historical languages, which can be considered closely related to their modern forms. It aims to provide NLP through both space and time for languages that current approaches have ignored.
