The primary advantages of this approach over adding annotation with successive XML filters are reliability and scalability. Reliability is increased because everything is atomic at the level of the sentence: if something goes wrong while annotating one sentence, no other sentences are lost, the to-do table information remains intact, and the annotation process can restart and continue. Even if the to-do tables become corrupt, the entire collection can be rescanned to determine which sentences still need which tasks performed. This contrasts with the UNIX pipeline of XML filters, in which a large set of documents is passed through all at once; if an error is introduced anywhere in the pipeline the entire collection is unusable, and there is no way to restart when things go wrong.
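The recovery property described above can be sketched concretely. The schema below is illustrative only (the table and column names `sentences`, `todo_parse`, and `parsed` are invented for this example, not taken from the actual system): a lost or corrupt to-do table is simply rebuilt by rescanning the collection for sentences that lack the annotation in question.

```python
import sqlite3

# Hypothetical schema: a main sentences table plus one to-do table per
# annotation task. Names are illustrative, not the system's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (
        id     INTEGER PRIMARY KEY,
        text   TEXT,
        parsed INTEGER DEFAULT 0      -- 1 once a parse has been stored
    );
    CREATE TABLE todo_parse (sentence_id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO sentences (text, parsed) VALUES (?, ?)",
                 [("A sentence .", 1), ("Another sentence .", 0)])

# Recovery: if todo_parse is lost or corrupt, drop its contents and
# rescan the collection -- any unparsed sentence still needs the task.
conn.execute("DELETE FROM todo_parse")
conn.execute("INSERT INTO todo_parse SELECT id FROM sentences WHERE parsed = 0")

remaining = [row[0] for row in conn.execute("SELECT sentence_id FROM todo_parse")]
# Only the unparsed sentence (id 2) is rescheduled.
```

Because each sentence's state is recorded independently, the rescan is a pure function of the collection itself; no pipeline state needs to be reconstructed.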
The database-driven approach also offers better scalability. Although the XML model obtains simple horizontal scalability by running the pipeline on several machines simultaneously with different document sets, there remains the problem of combining those document sets into a single coherent collection for later search. With the database model, sentences are added to the database as soon as it is clear which sentences are of interest, so the problem of managing many scattered files is eliminated. In addition there is much greater control over how the sentences are annotated. In the XML pipeline model the annotation processes either run in parallel as an actual UNIX pipeline, or run in series, producing a temporary file after one or more steps. In either case, there is no provision to limit the amount of resources used by any particular node participating in the annotation process, or to restrict a node to a single kind of annotation task; furthermore, the data is tied to a particular node. With the database model, each node can be assigned to start only as many processes as it can handle, and different tasks can be assigned to different nodes. More than one node can participate in the same annotation task; the to-do table mechanism coordinates which computers perform which tasks. Data and computation are kept separate, so any node can annotate any sentence.
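The coordination role of the to-do tables can be illustrated with a minimal sketch (all names invented; a multi-node deployment on a real RDBMS would additionally need row locking such as `SELECT ... FOR UPDATE`, which SQLite does not provide): a node claims a batch of sentence ids from the to-do table for its assigned task, annotates them, and removes the claimed rows.

```python
import sqlite3

# Illustrative sketch of to-do-table coordination; schema names invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE todo_tag (sentence_id INTEGER PRIMARY KEY);
    CREATE TABLE tags (sentence_id INTEGER PRIMARY KEY, tags TEXT);
""")
conn.executemany("INSERT INTO sentences (text) VALUES (?)",
                 [("the cat sat",), ("dogs bark",), ("it rains",)])
conn.execute("INSERT INTO todo_tag SELECT id FROM sentences")

def claim_batch(conn, limit):
    """Claim up to `limit` sentences for this node, in one transaction."""
    with conn:  # claim and removal commit together, so no node sees a half-claim
        ids = [row[0] for row in conn.execute(
            "SELECT sentence_id FROM todo_tag ORDER BY sentence_id LIMIT ?",
            (limit,))]
        conn.executemany("DELETE FROM todo_tag WHERE sentence_id = ?",
                         [(i,) for i in ids])
    return ids

# A node configured for at most two concurrent annotations claims two ids;
# the third sentence stays in todo_tag for another node (or a later pass).
batch = claim_batch(conn, 2)
for sid in batch:
    (text,) = conn.execute("SELECT text FROM sentences WHERE id = ?", (sid,)).fetchone()
    dummy_tags = " ".join("X" for _ in text.split())  # stand-in annotator
    conn.execute("INSERT INTO tags VALUES (?, ?)", (sid, dummy_tags))
```

The `limit` parameter is where per-node resource control enters: each node asks only for as much work as it can handle, and different nodes can poll different to-do tables.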
Hughes and Bird describe what they call a ``Grid-Enabled'' framework for natural language engineering, in which data sources and processing components register with a resource discovery agent to provide something like GATE that is much more scalable. Their framework retains generality by using Semantic Web-style metadata to allow resources to identify each other; this adds a layer of complexity unnecessary in a more controlled setting.
The limiting factor in how well the database-driven approach scales is the time it takes to move the data to be annotated across the network relative to the time the annotation itself takes. Unlike Hughes and Bird's project, which needed to annotate large audio files, all of our annotation is done on fairly small chunks of text. Even on a 2.0 GHz Pentium 4, the Charniak parser averages several seconds per parsed sentence. Retrieval time is kept down by the fact that both the main sentences table (which holds the input data) and the to-do tables are sorted in order of the sentences' insertion into the database, so retrieving sentences for annotation does not generally involve large numbers of random lookups in the main sentences table. This also means that even when multiple nodes request sentences from different to-do tables, the number of disk blocks read at any point in time is likely to be proportional to the number of annotation tables rather than to the number of nodes or the number of sentences being retrieved. Any task whose database retrieval and network transfer time overwhelms the actual annotation time is likely trivial enough that it would be better combined with one or more other tasks.
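The access pattern described above can be seen in a small sketch (schema names again invented): because the to-do table and the sentences table share insertion order, joining them and reading in to-do order walks the sentences table roughly sequentially rather than jumping to random rows.

```python
import sqlite3

# Illustrative sketch: ordered batch retrieval from a to-do table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE todo_parse (sentence_id INTEGER PRIMARY KEY);
""")
for i in range(1, 101):
    conn.execute("INSERT INTO sentences (text) VALUES (?)", (f"sentence {i}",))
# Suppose only every other sentence still needs parsing.
conn.execute("INSERT INTO todo_parse SELECT id FROM sentences WHERE id % 2 = 0")

# Reading in to-do order yields monotonically increasing sentence ids,
# so the join touches the sentences table in insertion order.
batch = conn.execute("""
    SELECT s.id, s.text
    FROM todo_parse t JOIN sentences s ON s.id = t.sentence_id
    ORDER BY t.sentence_id LIMIT 5
""").fetchall()
# ids come back as 2, 4, 6, 8, 10 -- a forward scan, not random lookups
```

On disk this means consecutive requests tend to hit nearby blocks, which is the basis for the claim that I/O grows with the number of annotation tables rather than with the number of requesting nodes.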
It would also be feasible to add another layer of scalability by allowing multiple database nodes. In this case the simplest approach to maintaining consistency would be a top-level node that assigns incoming sentences to a particular database and tells annotation nodes which databases have sentences ready for particular tasks; this would be similar to the resource discovery agent mentioned by Hughes and Bird. In addition, most of the logic for adding and removing sentences from the to-do tables currently resides in the annotation client; moving this logic into the database would eliminate the additional SQL queries the client must otherwise issue each time sentences are requested or successfully annotated.
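One way the to-do bookkeeping could move into the database is via triggers, sketched below (table and trigger names are hypothetical): when an annotation row arrives, the database itself removes the sentence from the corresponding to-do table, so the client no longer issues that `DELETE`.

```python
import sqlite3

# Hypothetical sketch: push to-do maintenance into the database with a
# trigger, so storing an annotation implicitly clears the to-do entry.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE todo_parse (sentence_id INTEGER PRIMARY KEY);
    CREATE TABLE parses (sentence_id INTEGER PRIMARY KEY, tree TEXT);
    CREATE TRIGGER parse_done AFTER INSERT ON parses
    BEGIN
        DELETE FROM todo_parse WHERE sentence_id = NEW.sentence_id;
    END;
""")
conn.executemany("INSERT INTO todo_parse VALUES (?)", [(1,), (2,)])

# The client only stores its result; the trigger handles the bookkeeping.
conn.execute("INSERT INTO parses VALUES (1, '(S (NP ...) (VP ...))')")

left = [row[0] for row in conn.execute("SELECT sentence_id FROM todo_parse")]
# sentence 1 was cleared automatically; only sentence 2 remains
```

This halves the round trips per annotated sentence and keeps the to-do tables consistent even if a client crashes between storing a result and cleaning up.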