Improving the accessibility of textual historical collections via transfer learning

Over recent decades, digitization efforts in the cultural heritage sector have been nothing short of “phenomenal” in scale. These efforts have mainly focused on materials with textual content, such as books and newspapers. While making textual heritage collections accessible is an ambition that starts with digitization, it does not end with it. Machine learning is the field of research focused on automating tasks such as information extraction, classification and retrieval. The accessibility of heritage collections would greatly improve if we could apply these techniques. Unfortunately, we still face several challenges in making the textual contents of digitized heritage collections accessible. Firstly, text extracted from digitized collections by automatic Optical/Handwritten Character Recognition (OCR/HCR) methods contains errors. Secondly, collections are historical and written in multiple languages, all subject to diachronic variation. Thirdly, the manual annotations necessary to use machine learning techniques often require specialized expertise, making them costly and not scalable. Lastly, modern machine learning requires compute resources that are often prohibitive for the heritage sector. These are general challenges in Natural Language Processing (NLP, the engineering domain dealing with human language): there are many languages, they change all the time, and they are used in a variety of ways.
As a consequence, recent developments in machine learning for NLP have witnessed a growing interest in an approach which promises to address such challenges: transfer learning. Transfer learning is the adaptation of machine learning techniques trained for tasks where large resources are available in terms of data and compute, to other, related tasks, for which we only possess minimal or no data and limited compute resources. In this project, we will apply transfer learning to a set of tasks aimed at making textual heritage collections accessible.
Given its novelty, transfer learning has received little attention from the heritage sector: the proposed project could yield high gains at low risk. In particular, this project makes a first step that can inform the heritage community about the possible gains of using transfer learning and help motivate (or not) further work and grant applications in this direction.
The project will focus on specific tasks, including:

  • Named Entity Recognition (NER): a language parsing task whose aim is to detect mentions of named entities in a text, such as persons or places.
  • Event extraction: a language understanding task which aims at extracting triplets reporting factual occurrences (subject, verb, object) in order to populate a knowledge base.
  • Semantic change: a language modelling task which aims at detecting and qualifying shifts in word meaning over time.

The project will last one year, until August 2021.