Cardamom Seminar Series #14 – Adrian Doyle (Insight/DSI, National University of Ireland, Galway)

July 25 @ 5:00 pm - 6:00 pm IST

The Cardamom Workbench for Historical and Minority Languages

The Unit for Linguistic Data at the Insight SFI Research Centre for Data Analytics / Data Science Institute, National University of Ireland Galway is delighted to welcome Adrian Doyle, a Research Associate at the Insight Centre for Data Analytics and a PhD candidate in the Moore Institute for Research in the Humanities and Social Studies at the National University of Ireland Galway, to be the next speaker in our seminar series. He will talk about the Cardamom Workbench for Historical and Minority Languages.


Many research projects in the humanities necessitate the creation of text editions or datasets by language experts whose research interests go beyond the mere production of the text itself. Once such a project has ceased to be funded, a common complaint is that no outlet exists for the text generated during the course of the research. Without further investment to produce either a print or digital edition, it may be difficult to propagate valuable texts to other scholars who wish to continue research in the same area. Standards like TEI have existed for decades with the aim of ensuring text data is reusable after its initial production. These can come with a steep learning curve, however, making them unappealing to scholars who wish to focus on linguistic aspects of research, and who cannot devote the time necessary to develop the requisite technical expertise. Moreover, where such text resources are made available online, a common concern is that ongoing costs associated with web-hosting or technical maintenance can periodically lead to downtime, and can cast doubt on the long-term availability of the text resource online.

In contrast to the text-with-no-outlet problem faced by many humanities scholars, data sparsity often impedes the research potential of computational linguists. Modern computational techniques are typically data driven, and improvements are possible only by increasing the amount of text data available. In the case of historical and minority languages, there is often an insufficient amount of text data available with which to train high-performance models. While many computer scientists lament this fact in research papers, little has been suggested to remedy the situation and, again, researchers may simply wish to move on with their own research, applying techniques to languages for which a more substantial amount of text data already exists.

In this presentation, the new Cardamom Workbench for historical and minority languages will be demonstrated. This workbench is designed to provide linguists with a simple pipeline from uploading their text to either online or print publication. An intuitive GUI will allow users to edit and annotate their text without requiring any pre-existing technical expertise. The workbench will also make some of the most advanced computational techniques accessible to humanities-based researchers who wish to perform such analyses on their own texts. Cardamom technologies are built into the workbench, which has been designed to enable the application of both common-practice techniques and more advanced methods to under-resourced languages. This enables language experts to perform tokenisation, part-of-speech tagging, morphological analysis and more without ever writing a line of code, and ensures that the resulting corpus of digital text meets modern annotation standards.

Workbench users will be able to submit their completed text for hosting online using Cardamom project web servers, ensuring long-term availability of the data. It is also envisaged that users will be enabled to develop their texts into full editions for print publication using the workbench. Finally, at the end of the demonstration, an open discussion will take place regarding the Workbench. Attendees from all backgrounds are invited to raise any questions which they may have and to provide any feedback which they feel may result in the development of a tool which best serves their needs.

About the Speaker:

Adrian Doyle is a Research Associate at the Insight Centre for Data Analytics and a PhD candidate in the Moore Institute for Research in the Humanities and Social Studies at the National University of Ireland Galway. He has been a member of the Cardamom research team since its inception. His current research focuses on natural language processing applications and optimisations for Old Irish.

He is the creator and curator of wuerzburg.ie, the online repository for the annotated digital text of the Würzburg glosses. His research interests include tokenisation, part-of-speech tagging, and dependency parsing issues specific to Old Irish, and more generally, IT applications for historical and minority languages.

He received his BA from University College Cork in 2012, after which he went on to receive an MA in Medieval Celtic Languages and Literature from University College Dublin in 2013, and an MSc from University College Cork in Information Systems for Business Performance in 2016. He has also worked as a Business Analyst and has been responsible for the creation and maintenance of database systems for industry clients.


The seminar series is led by the Cardamom project team. The Cardamom project aims to close the resource gap for minority and under-resourced languages using deep-learning-based natural language processing (NLP) and exploiting similarities of closely related languages. The project further extends this idea to historical languages, which can be considered closely related to their modern form. It aims to provide NLP through both space and time for languages that current approaches have ignored.

