Improving the accessibility of textual historical collections via transfer learning

Over recent decades, digitization efforts in the cultural heritage sector have been nothing short of “phenomenal” in scale. These efforts have mainly focused on materials with textual content, such as books and newspapers. While making textual heritage collections accessible is an ambition that starts with digitization, it does not end with it. Machine learning is the field of research focused on automating tasks such as information extraction, classification and retrieval. The accessibility of heritage collections would greatly improve if we could apply these techniques.

Unfortunately, we still face several challenges in making the textual contents of digitized heritage collections accessible. Firstly, text extracted from digitized collections by automatic Optical/Handwritten Character Recognition (OCR/HCR) methods contains errors. Secondly, collections are historical and written in multiple languages, all subject to diachronic variation. Thirdly, the manual annotations needed to use machine learning techniques often require specialized expertise, making them costly and not scalable. Lastly, modern machine learning requires prohibitive compute resources, often not available to the heritage sector. These are general challenges in Natural Language Processing (NLP, the engineering domain dealing with human language): languages change all the time, they are many, and they are used in a variety of ways.

As a consequence, recent developments in machine learning for NLP have seen a growing interest in an approach that promises to address such challenges: transfer learning. Transfer learning is the adaptation of machine learning techniques trained for tasks where large data and compute resources are available to other, related tasks for which we possess only minimal or no data and limited compute resources. In this project, we will apply transfer learning to a set of tasks aimed at making textual heritage collections accessible.

Given its novelty, transfer learning has received little attention from the heritage sector: the proposed project could therefore yield high gains at low risk. In particular, this project makes a first step that can inform the heritage community about the possible gains of using transfer learning and help motivate (or not) further work and grant applications in this direction. The project will focus on specific tasks, including:

  • Named Entity Recognition (NER): a language parsing task whose aim is to detect mentions of named entities in a text, such as persons or places.
  • Event extraction: a language understanding task which aims at extracting triplets reporting factual occurrences (subject, verb, object) in order to populate a knowledge base.
  • Semantic change: a language modelling task which aims at detecting and qualifying shifts in word meaning over time.
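
To make the last task more concrete, the sketch below shows one very simple way of quantifying a shift in word meaning: compare the words that surround a target word in two time slices of a corpus. This is only an illustrative count-based baseline, not the transfer learning approach the project will use. The function names, the window size and the toy English sentences are all assumptions invented for the example.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, target, window=2):
    """Count the words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                # Add every neighbour in the window except the target itself.
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Invented toy sentences standing in for two time slices of a corpus.
early = ["the broadcast of seed across the field", "broadcast the seed by hand"]
late = ["the radio broadcast began at noon", "a live broadcast on the radio"]

# Distance between the contexts of "broadcast" in the two slices.
shift = cosine_distance(
    context_vector(early, "broadcast"), context_vector(late, "broadcast")
)
# Sanity check: a word compared with itself in the same slice should not move.
stable = cosine_distance(context_vector(early, "the"), context_vector(early, "the"))
```

A larger distance suggests that the word keeps different company in the later slice. On real historical text such counts would be noisy (OCR errors, spelling variation), which is part of what motivates adapting models trained on larger resources instead.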

The project will last one year, until August 2021.