Abstract
Natural language processing (NLP) tasks like named entity recognition (NER) and automatic text summarization assist in the understanding of vast numbers of documents across a wide variety of domains. While large amounts of available data have led to strong performance on many tasks for high-resource languages like English, performance for less-resourced languages often lags due to data scarcity.
This dissertation addresses this performance gap by creating and curating novel benchmark datasets focused on less-resourced settings, and then examining techniques suited to less-resourced scenarios, such as data augmentation, multilingual transfer, and continual learning with knowledge distillation, on the tasks of NER and summarization.
We find that fine-tuned transfer learning models can be highly effective for these languages, often outperforming much larger language models. While data augmentation significantly benefits models trained on limited data, its impact is less pronounced for models that already benefit from multilingual transfer.
For less-resourced datasets with diverse entity type ontologies, we find that continual learning with knowledge distillation helps prevent catastrophic forgetting.
Many of these techniques also have applications in the e-commerce domain. We apply data augmentation and continual learning to higher-resourced languages in e-commerce and find that training with continual learning across different product categories is beneficial.
Overall, our results indicate that these strategies are beneficial, but that their effectiveness varies across settings.