Date(s) - 11/04/2019
3:00 pm - 5:00 pm

From text to table. Extracting information from semi-structured data

In many Digital Humanities projects, information is extracted from sources that are at best semi-structured: they often (but not always) follow an implicit pattern which can be recovered by computational means. Theatre plays are a good example: to the human eye, the alternation of speech turns (implicitly marked by new lines and speaker names) is immediately recognizable, but a computer needs explicit instructions to recover and interpret the actual structure of a document.
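To make the theatre example concrete, here is a minimal sketch in Python of what such "explicit instructions" might look like. It assumes a hypothetical convention in which each speech turn starts on a new line with an upper-case speaker name followed by a period or colon. The function name, the pattern and the sample text are illustrative only, and real sources will rarely be this regular.

```python
import re

# Assumed convention (hypothetical): a turn opens with an upper-case
# speaker name followed by "." or ":", e.g. "ANNA. Who is there?"
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z ]+)[.:]\s*(.*)$")


def extract_turns(text):
    """Turn raw play text into a list of {speaker, speech} rows."""
    turns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match:
            # A new speaker name starts a new turn.
            turns.append({"speaker": match.group(1).strip(),
                          "speech": match.group(2)})
        elif turns:
            # Any other line continues the current speaker's turn.
            turns[-1]["speech"] = (turns[-1]["speech"] + " " + line).strip()
    return turns


# Invented sample text, not taken from any real play.
sample = """ANNA. Who is there?
It is late.
PIETER: A friend.
"""

for row in extract_turns(sample):
    print(row["speaker"], "|", row["speech"])
```

Each resulting row answers "who says what" for one turn. A single OCR error in a speaker name, or a change in how names are typeset, would already be enough to break this simple pattern.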

Making the implicit structure of documents explicit to a computer helps us to interact with big data in more meaningful ways. For example, it enables researchers to study “who says what” in the case of theatre plays. The salon explores contexts where such enrichment of historical texts produces more valuable data, but it also investigates the many hurdles: in the case of digitized sources, for example, errors in the OCR or changes over time in the formatting of the texts. We will discuss several projects that have grappled with such data processing issues.

The CREATE Salon takes place on Thursday 11 April from 3:00 to 5:00 pm at the eLab Mediastudies (BG1 room 0,16), Turfdraagsterpad 9, 1012 XT Amsterdam.


  • Alex Olieman will introduce a project that seeks to make municipal data (e.g. council meetings) more accessible via geographic and exploratory search.
  • Leon van Wissen will talk about his work for the Golden Agents project in which he (re)structured and enriched person, property and geographical information from the Amsterdam City Archive’s Transport Acts (before 1811).
  • Kaspar Beelen and Thunnis van Oort will present the Digifil project (CLARIAH 2018), which aims to automatically extract, digitise and publish data on film screenings, using weekly film listings in historical newspapers (1948-1995) as the main source of information.