Times are given in Irish Standard Time (IST), i.e., either UTC+0 or UTC+1 (Daylight Saving).

Cardamom Seminar Series #18 – Dr Koustava Goswami (Adobe Research)

January 30, 2023 @ 5:00 pm - 6:00 pm GMT

Multilingual Sentence Embedding for Low-resource Languages

The Unit for Linguistic Data at the Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway is delighted to welcome Dr Koustava Goswami, a Research Scientist at Adobe Research, as the next speaker in our seminar series. He will talk about building multilingual sentence embeddings for low-resource languages. Register here.

Abstract:

Multilingual sentence embeddings capture rich semantic information, not only for measuring similarity between texts but also for serving a broad range of downstream cross-lingual NLP and NLU tasks. State-of-the-art multilingual sentence embedding models require large parallel corpora to learn efficiently, which limits the scope of these models. To address this limitation, we introduced a novel sentence embedding framework based on an unsupervised loss function for generating effective multilingual sentence embeddings, eliminating the need for parallel corpora. We demonstrated the efficacy of an unsupervised as well as a weakly supervised variant of our framework on multilingual textual similarity, parallel sentence matching and bitext mining tasks. The proposed unsupervised sentence embedding framework outperforms even supervised state-of-the-art methods for certain under-resourced languages on the parallel sentence matching task and on a monolingual benchmark (the English SentEval tasks).

On the other hand, transformer-based large multilingual language models such as XLM-R (XLM-R Large) have performed very well in diverse semantic understanding and classification tasks for different industrial applications, including user intent classification, smart chatbots, sentiment analysis and question answering. However, fine-tuning such large pretrained architectures is resource- and compute-intensive, which limits their wide adoption in enterprise environments. We present a novel, efficient and lightweight framework based on sentence embeddings to obtain enhanced multilingual text representations for domain-specific NLU applications. Our framework combines the concepts of up-projection, alignment and meta-embeddings to enhance the textual semantic similarity knowledge of smaller sentence embedding architectures. Extensive experiments on diverse cross-lingual classification tasks show the proposed framework to be comparable to state-of-the-art large language models (in monolingual and zero-shot settings), even with lower training and resource requirements.

About the Speaker:

Dr Koustava Goswami is a Research Scientist at Adobe Research India. He is broadly interested in representation learning and natural language processing. More specifically, he is interested in developing deep learning algorithms for low-resource settings and applications.

During his PhD at the Insight Centre for Data Analytics at the University of Galway, he worked on building novel unsupervised deep models using representation learning, which can be applied to diverse low-resource languages and applications across language families. One strand of his work enhances the performance of multilingual sentence embedding models for low-resource domains in zero-shot settings by injecting better word representations. His significant contributions during his PhD include the invention of different unsupervised loss functions, including the Maximum Likelihood Clustering (MLC) loss function, and the design of a new machine learning framework called the Anchor-Learner framework. He has also worked on instructional learning using prompt engineering for few-shot model training.

During his PhD, he also held internships at Bosch Research Germany and Huawei Research Centre Ireland. His research papers have appeared in various NLP venues, including EMNLP, COLING, LREC, IEEE, Frontier Journal and NAACL.

Host:

The seminar series is led by the Cardamom project team. The Cardamom project aims to close the resource gap for minority and under-resourced languages using deep-learning-based natural language processing (NLP) and by exploiting similarities between closely related languages. The project further extends this idea to historical languages, which can be considered closely related to their modern forms. It aims to provide NLP through both space and time for languages that current approaches have ignored.

Registration link: