Abstract
Natural language processing (NLP) tasks like named entity recognition (NER) and automatic text summarization assist in the understanding of vast numbers of documents across a wide variety of domains. While large amounts of available data have led to strong performance on many tasks for high-resource languages like English, performance for less-resourced languages often lags due to data scarcity.
This dissertation addresses this performance gap by creating and curating novel benchmark datasets focused on less-resourced settings, and then examining techniques suited to less-resourced scenarios, such as data augmentation, multilingual transfer, and continual learning with knowledge distillation, on the tasks of NER and summarization.
We find that fine-tuned transfer learning models can be highly effective for these languages, often outperforming much larger language models. While data augmentation significantly benefits models trained on limited data, its impact is less pronounced for models that already benefit from multilingual transfer.
For less-resourced datasets with diverse entity type ontologies, we find that continual learning with knowledge distillation helps prevent catastrophic forgetting.
Many of these techniques also have applications in the e-commerce domain. We apply data augmentation and continual learning to higher-resourced languages in e-commerce and find that training with continual learning across different product categories is beneficial.
Overall, our results indicate that these strategies are beneficial, but that their effectiveness varies across settings.