The Linguists' Search Engine is a project conceived by Dr. Philip Resnik (Department of Linguistics and UMIACS, University of Maryland) and Dr. Christiane Fellbaum (Department of Psychology, Princeton University) to increase the availability of linguistic material that can be queried quickly and easily, with the goal of making empirical methods more widespread in the field of linguistics. The Linguists' Search Engine will consist of a very large, constantly growing corpus, on the order of billions of words. This corpus will be mined from the Web and annotated using off-the-shelf natural language processing tools. It will be searchable through a web interface. For this enterprise to succeed, one necessary component is a flexible, scalable and, most importantly, robust annotation and storage framework.
The current framework consists of two stages: a UNIX pipeline of XML filters for obtaining sentences to annotate (shown in Figure 1), and a central database communicating with a distributed collection of nodes that perform the annotation (shown in Figure 2).
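To make the first stage concrete, the following is a minimal sketch of how a pipeline of XML filters might be composed. It is an illustration only, not the project's actual code: the element names `doc` and `s`, the two filters `normalize_whitespace` and `keep_long_sentences`, and the helper `run_pipeline` are all hypothetical. Each filter reads one XML document from an input stream and writes one to an output stream, which is the property that would let such filters be chained with UNIX pipes.

```python
import io
import xml.etree.ElementTree as ET


def normalize_whitespace(source, sink):
    """Hypothetical filter: collapse runs of whitespace inside each <s>."""
    root = ET.parse(source).getroot()
    for sentence in root.iter("s"):
        sentence.text = " ".join((sentence.text or "").split())
    sink.write(ET.tostring(root, encoding="unicode") + "\n")


def keep_long_sentences(source, sink, min_tokens=3):
    """Hypothetical filter: keep <s> elements with at least min_tokens tokens."""
    root = ET.parse(source).getroot()
    kept = ET.Element(root.tag)
    for sentence in root.iter("s"):
        text = (sentence.text or "").strip()
        if len(text.split()) >= min_tokens:
            ET.SubElement(kept, "s").text = text
    sink.write(ET.tostring(kept, encoding="unicode") + "\n")


def run_pipeline(xml_text, stages):
    """Chain filters in-process, much as a shell would chain them with '|'."""
    for stage in stages:
        source, sink = io.StringIO(xml_text), io.StringIO()
        stage(source, sink)
        xml_text = sink.getvalue()
    return xml_text


doc = "<doc><s>Hi.</s><s>The  corpus   grows daily.</s></doc>"
result = run_pipeline(doc, [normalize_whitespace, keep_long_sentences])
print(result, end="")
```

Run as separate processes, each filter would instead be called with `sys.stdin` and `sys.stdout`, so that the shell connects the stages. Keeping every stage to the same stream-in, stream-out contract means a filter can be added, removed or reordered without changing the others.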