Linguistic Annotation

**Figure 2:** Stage 2 - a cluster of nodes performing annotation tasks.

Once the sentences are identified and have been added to the database, we are ready to annotate them in various ways. As noted before, each sentence is uniquely identified by the document in which it appears and its position within the document.

Each specific annotation is referred to as a task. Tasks can have dependencies; that is, a task may require that one or more earlier tasks have been completed. For example, a parser may require that its input be tokenized or annotated with part-of-speech tags. The interface between the annotation process and the database is simply a Perl function that takes the required inputs as parameters and returns the annotation output. Each node participating in the annotation process performs at least one task, and many different nodes may perform the same task; see Figure 2 for an example setup.
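
The notion of a task with dependencies can be sketched as follows. This is only an illustration in Python (the system itself exposes a Perl function); the `Task` structure, the `ready_tasks` helper, and the two toy annotators are hypothetical names introduced here, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Task:
    """One annotation task (hypothetical structure, not the paper's schema)."""
    name: str                      # also the name of the annotation it produces
    annotate: Callable[..., str]   # takes the required inputs, returns annotation text
    requires: List[str] = field(default_factory=list)  # annotations needed as input


def toy_tokenize(body: str) -> str:
    # Stand-in for a real tokenizer: normalizes whitespace only.
    return " ".join(body.split())


def toy_tag(tokens: str) -> str:
    # Stand-in for a real tagger: attaches a dummy tag "X" to every token.
    return " ".join(f"{tok}_X" for tok in tokens.split())


TASKS = [
    Task("tokens", toy_tokenize, requires=["body"]),
    Task("pos", toy_tag, requires=["tokens"]),
]


def ready_tasks(tasks: List[Task], present: Set[str]) -> List[str]:
    """Names of tasks not yet done whose required inputs are all present."""
    return [
        t.name
        for t in tasks
        if t.name not in present and all(r in present for r in t.requires)
    ]
```

With only the sentence body present, `ready_tasks(TASKS, {"body"})` reports that tokenization can run; tagging becomes ready once the `tokens` annotation exists.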

Some examples of existing tasks are producing part-of-speech tag sequences with Adwait Ratnaparkhi's MxPost tool [5], English constituency parses with Eugene Charniak's parser [1], and dependency parses with Dekang Lin's Minipar [4]. The formats are all quite different; the database views the annotations merely as strings of text and records whether or not a particular annotation is present. Indexing and search tools must then use the annotation in whatever format was produced. As an example of the flexibility of the system, we were able to clone the database structure in use for English sentences and annotations and start a Spanish database merely by changing the HTML-to-text module to look for Spanish documents. Since the database does not need to know about the structure of particular annotations, neither the database nor the existing Conexor Spanish dependency parser required any changes for the parser's output to be used with this framework.

Some examples of the annotation formats produced, for the sentence ``Even now, we still laughs whenever the robot punishes us with electricity and mind power,'' an unusual sentence that might be of interest to a linguist:

MxPost provides Penn Treebank-style part-of-speech tags, in this case ``Even_RB now_RB ,_, we_PRP still_RB laughs_VBZ whenever_WRB the_DT robot_NN punishes_VBZ us_PRP with_IN electricity_NN and_CC mind_NN power_NN ._.''

The Charniak parser produces Penn Treebank-style constituency parses in a bracketed Lisp S-expression format: ``(S1 (S (ADVP (RB Even) (RB now)) (, ,) (NP (PRP we)) (ADVP (RB still)) (VP (VBZ laughs) (SBAR (WHADVP (WRB whenever)) (S (NP (DT the) (NN robot)) (VP (VP (VBZ punishes) (NP (PRP us)) (PP (IN with) (NP (NN electricity)))) (CC and) (VP (VB mind) (NP (NN power))))))) (. .)))''

Minipar produces dependency parses in its own special graph format:

( 
E2      (()      U      *       ) 
1       (Even   ~ A     2       mod     (gov now)) 
2       (now    ~ N     E2      ) 
4       (we     ~ N     E2      ) 
5       (still  ~ A     E2      ) 
6       (laughs laugh N E2      ) 
E0      (()     fin C   E2      ) 
7       (whenever       ~ A     E0      wha     (gov fin)) 
8       (the    ~ Det   9       det     (gov robot)) 
9       (robot  ~ N     10      s       (gov punish)) 
10      (punishes       punish V        E0      i       (gov fin)) 
E3      (()     robot N 10      subj    (gov punish)    (antecedent 9)) 
11      (us     ~ N     10      obj     (gov punish)) 
12      (with   ~ Prep  10      mod     (gov punish)) 
13      (electricity    ~ N     12      pcomp-n (gov with)) 
15      (mind   ~ V     10      conj    (gov punish)) 
E4      (()     robot N 15      subj    (gov mind)      (antecedent 9)) 
16      (power  ~ N     15      obj     (gov mind)) 
) 
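
Because the database stores each annotation as an opaque string, any tool that consumes annotations has to interpret each format itself. The sketch below, in Python for illustration, shows what such consumer-side readers might look like for the tagged and bracketed formats shown above. The function names `parse_tagged`, `read_tree`, and `leaves` are hypothetical and are not part of MxPost, the Charniak parser, or the described system.

```python
import re
from typing import List, Tuple, Union

Tree = Union[str, list]

TAGGED = (
    "Even_RB now_RB ,_, we_PRP still_RB laughs_VBZ whenever_WRB the_DT "
    "robot_NN punishes_VBZ us_PRP with_IN electricity_NN and_CC mind_NN "
    "power_NN ._."
)
PARSE = (
    "(S1 (S (ADVP (RB Even) (RB now)) (, ,) (NP (PRP we)) (ADVP (RB still)) "
    "(VP (VBZ laughs) (SBAR (WHADVP (WRB whenever)) (S (NP (DT the) (NN robot)) "
    "(VP (VP (VBZ punishes) (NP (PRP us)) (PP (IN with) (NP (NN electricity)))) "
    "(CC and) (VP (VB mind) (NP (NN power))))))) (. .)))"
)


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split word_TAG items; rpartition keeps underscores inside the word."""
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


def read_tree(text: str) -> Tree:
    """Read one bracketed parse into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(parse())
        pos += 1  # consume the closing ")"
        return node

    return parse()


def leaves(tree: Tree) -> List[str]:
    """Collect the words of a tree, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [tree[1]]  # preterminal such as ["PRP", "we"]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


tagged = parse_tagged(TAGGED)
tree = read_tree(PARSE)
```

Both readers recover the same word sequence from their respective strings, although the two tools disagree on some tags (for instance, MxPost tags ``mind'' as NN while the parser labels it VB).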

These tasks are similar to ProcessingResources in GATE, the General Architecture for Text Engineering [2]. GATE provides a fairly heavyweight framework in which documents are converted to LanguageResources, which use an XML-based generalized annotation format; various ProcessingResources, which use the GATE API, provide annotation services. The ProcessingResources can be arranged in a pipeline but communicate using the GATE API rather than just XML. This provides a more robust but less flexible architecture. Like the database approach, and unlike the XML approach discussed earlier, GATE provides corpus management services. However, the whole framework is oriented more toward exploration and development than toward large-scale annotation, and as such does not provide easy parallelization.

The architecture places very few constraints on the kind of annotation allowed. Tasks can be defined as having one or more inputs, each of which simply comes from an unlimited-length text field in the database; the output from the task is added to a different field. Hence the database does not impose any structure or restriction on the format of the annotations, unlike GATE.
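
A minimal sketch of this field-based design, using Python's built-in `sqlite3` module for illustration. The table layout, the column names (`body`, `pos`, `parse`), and the `run_task` helper are assumptions made for this example; the actual schema is not given in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one unlimited-length TEXT column per annotation;
# NULL means the annotation is not present yet.
conn.execute(
    """CREATE TABLE sentence (
           docid INTEGER, seqid INTEGER,
           body TEXT NOT NULL, pos TEXT, parse TEXT,
           PRIMARY KEY (docid, seqid))"""
)


def run_task(conn, docid, seqid, inputs, output, tool):
    """Read the input fields, call the tool, store its string output."""
    # Column names come from trusted task configuration, not user input.
    row = conn.execute(
        f"SELECT {', '.join(inputs)} FROM sentence WHERE docid = ? AND seqid = ?",
        (docid, seqid),
    ).fetchone()
    result = tool(*row)
    conn.execute(
        f"UPDATE sentence SET {output} = ? WHERE docid = ? AND seqid = ?",
        (result, docid, seqid),
    )
    return result


conn.execute(
    "INSERT INTO sentence (docid, seqid, body) VALUES (100, 50, 'we still laughs')"
)
# Toy one-input "tagger" and two-input "parser"; any function from strings
# to a string fits, which is why no annotation format is imposed.
run_task(conn, 100, 50, ["body"], "pos",
         lambda body: " ".join(w + "_X" for w in body.split()))
run_task(conn, 100, 50, ["body", "pos"], "parse",
         lambda body, pos: f"(TOY {pos})")
```

The database never inspects the strings it stores, so swapping in a tool with an entirely different output format requires no schema change beyond adding a field for its output.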

For each task, a database table is maintained that lists the sentences waiting for that annotation task to be performed. This table is called the ``to-do table.'' A sentence can be referenced by multiple to-do tables; for example, a sentence could be ready to be parsed by two different parsers.

For example, consider a sentence with document ID (docid) 100 and sequence ID (seqid, the position of the sentence within the document) 50. Suppose we have two annotation tasks, a part-of-speech tagger and a parser, and that the parser requires its input to have part-of-speech tags while the tagger needs only the sentence body. When the sentence is loaded into the database, its docid/seqid pair is added to the to-do table for part-of-speech tagging; once the sentence has been tagged, the pair is removed from the part-of-speech tagging to-do table and added to the parsing to-do table.
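
The movement of sentence 100/50 between to-do tables can be traced with a short sketch, again in Python with `sqlite3` for illustration. The table names (`todo_pos`, `todo_parse`), the `in_use` column, and the `DEPENDENTS` map are hypothetical. As a simplification, each task here has a single prerequisite, whereas the described system checks all of a task's dependencies.

```python
import sqlite3

# Hypothetical dependency map: task -> tasks that need its output.
DEPENDENTS = {"pos": ["parse"], "parse": []}

conn = sqlite3.connect(":memory:")
for task in DEPENDENTS:
    conn.execute(
        f"CREATE TABLE todo_{task} (docid INTEGER, seqid INTEGER, "
        "in_use INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (docid, seqid))"
    )


def enqueue(task, docid, seqid):
    """Add a sentence to the to-do table of the given task."""
    conn.execute(
        f"INSERT INTO todo_{task} (docid, seqid) VALUES (?, ?)", (docid, seqid)
    )


def complete(task, docid, seqid):
    """Remove a finished sentence and queue it for the dependent tasks."""
    conn.execute(
        f"DELETE FROM todo_{task} WHERE docid = ? AND seqid = ?", (docid, seqid)
    )
    for nxt in DEPENDENTS[task]:
        enqueue(nxt, docid, seqid)


def pending(task):
    """List the (docid, seqid) pairs waiting for the given task."""
    return conn.execute(
        f"SELECT docid, seqid FROM todo_{task} ORDER BY docid, seqid"
    ).fetchall()


enqueue("pos", 100, 50)            # sentence loaded into the database
before = (pending("pos"), pending("parse"))
complete("pos", 100, 50)           # tagging finished
after = (pending("pos"), pending("parse"))
```

Here `before` shows the pair waiting only in the tagging table, and `after` shows it waiting only in the parsing table.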

To perform the annotation, a process is started that selects the first k rows of the to-do table (for some small integer k) and marks these sentences in the to-do table as being in use, so that another process does not try to perform the same annotation on the same sentences. Even if this were to happen, it would not cause an inconsistency so long as the actual annotation tool is deterministic: existing annotations are assumed not to change over the lifetime of the database, so both annotation processes would receive the same input and hence produce the same output. The inputs for the annotation task are retrieved and passed to the actual annotation tool; once the output is obtained and added to the database, the sentence is removed from the to-do table. In addition, the sentence is examined along with the dependencies for the current task to determine what other tasks the sentence is now ready for. If the annotation node is stopped, it marks the sentences it checked out as no longer being in use. If the annotation node crashes, the sentences it checked out remain marked as in use and will not be annotated until they are manually marked as not in use.
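
The check-out cycle described above might look roughly like the following sketch, written in Python with `sqlite3` for illustration. The schema, the `in_use` flag, and the function names `check_out`, `release`, and `annotate_batch` are assumptions for this example. The select-then-mark step is also not guaranteed to be atomic across separate processes, which matches the text's observation that an occasional duplicate annotation is harmless when the tool is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentence (docid INTEGER, seqid INTEGER, body TEXT, pos TEXT, "
    "PRIMARY KEY (docid, seqid))"
)
conn.execute(
    "CREATE TABLE todo_pos (docid INTEGER, seqid INTEGER, "
    "in_use INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (docid, seqid))"
)


def check_out(conn, k):
    """Select up to k sentences not in use and mark them as in use."""
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT docid, seqid FROM todo_pos WHERE in_use = 0 "
            "ORDER BY docid, seqid LIMIT ?", (k,)
        ).fetchall()
        conn.executemany(
            "UPDATE todo_pos SET in_use = 1 WHERE docid = ? AND seqid = ?", rows
        )
    return rows


def release(conn, rows):
    """Mark checked-out sentences as no longer in use."""
    with conn:
        conn.executemany(
            "UPDATE todo_pos SET in_use = 0 WHERE docid = ? AND seqid = ?", rows
        )


def annotate_batch(conn, k, tool):
    """Annotate up to k sentences; return how many were completed."""
    claimed = check_out(conn, k)
    remaining = list(claimed)
    try:
        for docid, seqid in claimed:
            (body,) = conn.execute(
                "SELECT body FROM sentence WHERE docid = ? AND seqid = ?",
                (docid, seqid),
            ).fetchone()
            output = tool(body)
            with conn:  # store the output and drop the to-do row together
                conn.execute(
                    "UPDATE sentence SET pos = ? WHERE docid = ? AND seqid = ?",
                    (output, docid, seqid),
                )
                conn.execute(
                    "DELETE FROM todo_pos WHERE docid = ? AND seqid = ?",
                    (docid, seqid),
                )
            remaining.remove((docid, seqid))
    finally:
        # Orderly stop or tool error: give back what was not finished.
        # A hard crash never reaches this line, so rows stay marked in use.
        release(conn, remaining)
    return len(claimed) - len(remaining)


def toy_tag(body):
    # Stand-in for a real, deterministic tagger.
    return " ".join(w + "_X" for w in body.split())


bodies = ["we laughs", "the robot punishes us", "mind power"]
for seqid, body in enumerate(bodies, start=1):
    conn.execute(
        "INSERT INTO sentence (docid, seqid, body) VALUES (1, ?, ?)", (seqid, body)
    )
    conn.execute("INSERT INTO todo_pos (docid, seqid) VALUES (1, ?)", (seqid,))
done = annotate_batch(conn, 2, toy_tag)
```

Queuing the sentence for dependent tasks after completion is omitted here to keep the sketch focused on the in-use marking.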

Aaron Elkiss 2003-05-14