The fundamental unit of interest in the Linguists' Search Engine is the sentence. Queries are evaluated in terms of whether or not they match individual sentences; even if inter-sentence relationships were allowed in the query specification, the fundamental unit we want to find and return is the sentence. Each sentence appears in the context of a document; the document information is used to manage collections of sentences, to supply the surrounding sentences as context, and to provide a pointer to the original web document in which the sentence appeared.
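For concreteness, a per-sentence record might pair the sentence text with just enough document information to play those three roles. The following is a minimal sketch; the field names are illustrative assumptions, not the actual LSE schema.

    # Sketch of a per-sentence record (field names are illustrative,
    # not the actual LSE schema).
    my $sentence = {
        text   => 'Colorless green ideas sleep furiously.',
        doc_id => 'doc-00042',   # groups sentences into a collection
        index  => 17,            # position in the document, so that
                                 #   neighboring sentences can be shown
        url    => 'http://example.org/page.html',  # pointer back to the
                                                   #   original web document
    };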
Hence the first task in constructing a corpus is obtaining sentences in the language of interest. Since the source of linguistic data is the Internet, sentences in the given language must be extracted from HTML text. The sentence extraction process uses a series of programs, arranged in the classic Unix pipeline model, that function as XML filters. The general idea is intuitive and attractive: XDOC [7], a document workbench primarily used to annotate German texts, is built around the same notion of chained XML filters, each performing a specific annotation task.
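To sketch the approach, each stage can be written as a SAX handler that echoes the event stream from standard input to standard output, adding annotations along the way; stages built this way compose naturally on the Unix command line. The skeleton below uses the PerlSAX and XML::Writer modules discussed in the next paragraph, with the annotation logic left as a placeholder.

    #!/usr/bin/perl
    # Skeleton of one XML filter stage: parse events from stdin,
    # echo them to stdout, annotating where needed. Stages compose
    # in a pipeline: stage1 < in.xml | stage2 | stage3 ...
    use strict;
    use warnings;

    package EchoHandler;
    use XML::Writer;

    sub new {
        my ($class) = @_;
        my $writer = XML::Writer->new(OUTPUT => \*STDOUT);
        return bless { writer => $writer }, $class;
    }

    sub start_element {
        my ($self, $el) = @_;
        # A real annotation stage would add attributes or elements here.
        $self->{writer}->startTag($el->{Name}, %{ $el->{Attributes} || {} });
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{writer}->endTag($el->{Name});
    }

    sub characters {
        my ($self, $chars) = @_;
        $self->{writer}->characters($chars->{Data});
    }

    sub end_document {
        my ($self) = @_;
        $self->{writer}->end();
    }

    package main;
    use XML::Parser::PerlSAX;

    XML::Parser::PerlSAX->new(Handler => EchoHandler->new)
                        ->parse(Source => { ByteStream => \*STDIN });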
However, we found that a straightforward Unix pipeline of Perl scripts, using the PerlSAX parser for input and XML::Writer for output, was too brittle: a single error introduced into the stream breaks everything downstream, and as the number of stages in the pipeline and the length of the material being processed grow, so does the chance of introducing an error into the XML stream. For example, the producers of the British National Corpus (BNC) found that 4-5% of the automatically SGML-annotated documents in their 100-million word corpus contained some sort of error requiring human correction [8]. Error rates of that kind are not acceptable for fully automated sentence extraction and annotation. For this reason, the XML pipeline was abandoned as the mechanism for the full annotation framework.
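The brittleness is easy to reproduce: a conforming XML parser must treat any well-formedness error as fatal, so a single bad stage truncates the stream for every stage after it. A minimal illustration, using XML::Parser purely to show the behavior:

    use strict;
    use warnings;
    use XML::Parser;

    # One mismatched tag, perhaps emitted by an earlier stage, makes
    # the rest of the stream unrecoverable for all later stages.
    my $broken = '<corpus><s>First sentence.</s><s>Second.</corpus>';
    eval { XML::Parser->new->parse($broken) };
    print "downstream stage dies here: $@" if $@;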
The current XML pipeline consists of five stages, shown in Figure 1.