On Thursday 28 May, a group of enthusiasts gathered on Zoom for a Cinema Context data sprint, aimed at cleaning and enriching programming data obtained via automatic extraction from digitised newspapers. For the occasion, a ‘Digifil Editor’ was created: a tool that structures the information participants were invited to check and correct. Besides cleaning data, the purpose of the data sprint was to evaluate the tool and think about ways to improve the process of cleaning and enriching.
After an introduction and instructions, the group worked diligently throughout the day, discussing problems via the Slack chat channel and on Zoom as they went. After the session, participants continued working on the weeks allotted to them. So far, 16 programming weeks have been delivered, totalling almost 600 data rows.
Thanks again to all participants! We are currently processing the data and the feedback and deciding on the next steps, building on the enthusiasm experienced during the meeting and hopefully leading to a follow-up meeting.
A number of recommendations and suggestions came out of this day. Some of the more general remarks:
- The Cinema Context ID property that Hay Kranen proposed to Wikidata has been accepted. So now Cinema Context has its own Wikidata property: https://www.wikidata.org/wiki/Property:P8296. See also: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Cinema_Context#Overleg. This allows for much better linking between Wikidata and Cinema Context. Three cheers for Hay and Menno. In the weeks that followed, Hay Kranen imported over 15,000 Cinema Context IDs into Wikidata!
- Hay Kranen also made a quick tool for looking up title / year in the IMDB: https://codepen.io/hay/full/rNOELJR. It allows a fast lookup of the title/year combination offered in the editor.
- Can we turn this data sprint into a recurring and/or ongoing event? How can we facilitate participants who are willing to continue working on the material?
- Can the Digifil tool be repurposed into a more generic tool for crowdsourcing film programming data?
- It would be useful to update Dutch titles in the IMDB ‘also known as’ section.
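To illustrate the kind of linking the new Wikidata property makes possible, here is a minimal sketch that builds a SPARQL query for Wikidata items carrying a Cinema Context ID (P8296). The property number comes from the link above and the endpoint is the public Wikidata Query Service; the function names, the variable names and the choice of label languages are assumptions made for this example, not part of any Cinema Context tooling. The code only builds the request and does not send it.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
# Cinema Context ID property, as accepted on Wikidata.
CINEMA_CONTEXT_PROPERTY = "P8296"


def build_query(limit=10):
    """Return a SPARQL query listing items that have a Cinema Context ID.

    The variable names (?item, ?ccId) and the label languages are
    illustrative assumptions.
    """
    return (
        "SELECT ?item ?itemLabel ?ccId WHERE {\n"
        f"  ?item wdt:{CINEMA_CONTEXT_PROPERTY} ?ccId .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(limit=10):
    """Return a GET URL that asks the query service for JSON results."""
    params = {"query": build_query(limit), "format": "json"}
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query(5))
    print(build_request_url(5))
```

The printed query can also be pasted directly into the Wikidata Query Service web interface.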
Feedback and suggestions specifically regarding the Digifil editor and the Digifil dataset:
- There needs to be a function that facilitates the interim saving of results and a way to keep track of which part of the data has been reviewed by the participant.
- Participants should ideally be able to go back to earlier results and make changes/corrections, and to copy films/links that appear in consecutive weeks.
- Participants should be able to verify whether the data was received in good order (and perhaps have a local back-up option?)
- Participants should be able to actively approve pre-filled data (with a button or checkbox) instead of just passively leaving it intact.
- The tool ought to have a function for adding cinemas, since some cinemas that were not pre-listed were apparently active in the period.
- Should we split the process into two tasks: 1) check if the correct titles are linked to the correct cinemas and 2) identify the films?
- Liliana Melgar suggested we look at the open-source crowdsourcing application Scribe: https://scribeproject.github.io/
- It would greatly facilitate the work if we could supply the more graphic ‘film ladders’ (instead of the bare listings that the Digifil system prefers) as a reference, and perhaps find a way to automate locating them in Delpher. The ‘ladders’ are more convenient for human reading and also contain a lot of contextual information that improves the chances of identifying the film.
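
Several of the suggestions above (interim saving, tracking which rows have been reviewed, and actively approving pre-filled data) come down to giving each data row an explicit review status. The following is a minimal sketch of how that could look, not a description of how the Digifil editor works: the `ProgrammeRow` class, its field names, the three status values and the JSON file layout are all assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

# Hypothetical status values: a row is only counted as reviewed once a
# participant has actively approved or corrected it.
UNREVIEWED, APPROVED, CORRECTED = "unreviewed", "approved", "corrected"
EDITABLE_FIELDS = ("week", "cinema", "title")


@dataclass
class ProgrammeRow:
    """One pre-filled programming row (field names are illustrative)."""
    week: str
    cinema: str
    title: str
    status: str = UNREVIEWED


def approve(row):
    """Mark pre-filled data as explicitly approved (the 'checkbox' case)."""
    row.status = APPROVED


def correct(row, **changes):
    """Apply corrections to a row and mark it as corrected."""
    for name, value in changes.items():
        if name not in EDITABLE_FIELDS:
            raise KeyError(f"unknown field: {name}")
        setattr(row, name, value)
    row.status = CORRECTED


def progress(rows):
    """Return (number of reviewed rows, total number of rows)."""
    reviewed = sum(1 for row in rows if row.status != UNREVIEWED)
    return reviewed, len(rows)


def save_interim(rows, path):
    """Write all rows, including their review status, to a local JSON file."""
    data = [asdict(row) for row in rows]
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_interim(path):
    """Read rows back so a participant can continue or revise earlier work."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return [ProgrammeRow(**item) for item in data]
```

With a structure like this, a participant could stop halfway through a week, reload the file later, and see at a glance which rows still need attention. The same file would double as the local back-up mentioned above.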