by Dieuwke Hupkes

When someone asks me what my field of research is, I usually tell them I am a computational linguist. Nowadays, many people use technologies invented in this field, such as Google Translate, speech recognition systems, the spelling correction in a text editor, or the T9 word predictor on a cell phone (although I seem to be one of the few people still using that). Most people are therefore aware that there are researchers working on computational models of language. Very few, however, seem to have an adequate impression of the focus or the difficulty of the field.

Maybe this is not too surprising.

First of all, contrary to what the term suggests, many computational linguists do not really practice what is traditionally considered linguistics. Knowledge about language is clearly necessary to understand the problems we are modelling, but it is very hard to incorporate into widely applicable models. In the early days of the field, many computational models were founded on linguistic principles, but this was generally not very successful, as painfully illustrated by the well-known quote attributed to Frederick Jelinek, who was working at IBM at the time:

“Every time I fire a linguist, the performance of the speech recognizer goes up.”

Statistical models, trained on large quantities of (sometimes linguistically) annotated data, outperform models that are actually based on detailed knowledge of language. This has resulted in a shift in the focus of the field: rather than dealing with all the fine-grained nuances of language, many computational linguists focus on getting a much more general picture right, and I suspect that some of us might every now and then even forget what the actual topic of our research is.
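To make this concrete, here is a minimal sketch (my own illustration, not part of the original essay) of perhaps the simplest statistical tagger: assign each word the tag it most often carries in an annotated corpus. The toy corpus and all names are hypothetical; a real system would be trained on a large treebank.

```python
from collections import Counter, defaultdict

# Toy annotated corpus of (word, tag) pairs. In practice this would
# be a treebank with hundreds of thousands of hand-annotated tokens.
training_data = [
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("a", "DET"), ("cat", "NOUN"), ("runs", "VERB"),
]

# Count how often each word occurs with each tag.
tag_counts = defaultdict(Counter)
for word, tag in training_data:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    """Predict the tag this word received most often in training.

    Unknown words fall back to a default open-class tag, a crude
    but common heuristic.
    """
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default

print([(w, most_frequent_tag(w)) for w in ["the", "cat", "flies"]])
# [('the', 'DET'), ('cat', 'NOUN'), ('flies', 'NOUN')]
```

Even this crude baseline famously tags roughly nine out of ten tokens correctly on standard English newswire corpora, precisely because it aims at the general picture rather than at the fine-grained nuances.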

I have always been very interested in the nuanced system that is language, but working within CREATE made it clear that even I have inevitably adapted to the trend of mostly looking at the bigger picture. I found myself getting impatient with people who bombarded me with questions about specific phenomena in historical Dutch that are clearly very interesting from a linguistic point of view, but too minor to affect the overall tagging accuracy. They were not part of the bigger problem and therefore not very interesting to me.

Of course, when working with a tagger that assigns the wrong tag to 30% of the words, it would not be particularly fruitful to focus on one specific construction that is tagged wrongly every now and then. It is hard to overestimate the complexity of modelling language computationally; often, getting it approximately right is already a huge accomplishment, which the people outside the field who use the tools developed within it should not forget.
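For readers outside the field: accuracy figures like the 30% error rate above are simply the fraction of tokens whose predicted tag matches a gold-standard annotation. A minimal sketch, with made-up tags:

```python
def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical example: 7 of 10 tags correct gives 70% accuracy,
# i.e. the 30% error rate mentioned above.
pred = ["DET", "NOUN", "VERB", "DET", "ADJ",
        "NOUN", "VERB", "PRON", "VERB", "NOUN"]
gold = ["DET", "NOUN", "VERB", "DET", "NOUN",
        "NOUN", "VERB", "PRON", "ADV", "ADJ"]
print(tagging_accuracy(pred, gold))  # 0.7
```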

On the other hand, I should not have been impatient: even though answering the questions did not help me improve the system, I should have realised that it is important to know the limits of a system, even if those limits cannot really be overcome right now. Not only is such knowledge indispensable for an optimal use of the tool (if my mom were more aware of the language technology her iPad uses, she probably wouldn't send emails ending with an automatically respelled name), it also points us to interesting facets of our language. During my research on POS tagging of historical Dutch I learned many interesting technical things, but I am also still asking everyone who will listen whether they know that Dutch used to have double negation (like French still does), a fact I would never have known if I hadn't paid attention to the details of my corpus.

Incidentally, the answer to this question was almost always ‘no’.