The two-phase architecture cleanly separates corpus creation from corpus annotation: the XML pipeline gives precise control over which sentences are added to the database, while the cluster of machines performing annotation provides a scalable and reliable annotation service.
This annotation framework has so far been used to extract and annotate with constituency parses 115,075 Internet Archive documents in English comprising 3.1 million sentences, as well as over 55,000 documents comprising 328,350 sentences retrieved through AltaVista searches. Parsing was performed on two clusters: three Sun Blade 1000 workstations with 1 gigabyte of memory each, and three geographically distributed dual-processor 2.0 GHz Pentium 4 machines with 2 gigabytes of memory each. The Pentium 4 cluster parsed 150,000 to 200,000 sentences per day. In addition, 362 Spanish documents comprising 10,750 sentences were successfully extracted from Internet Archive material and annotated with constituency parses as a test of the framework's portability to other languages. The database handles this relatively small data set without difficulty; however, current tools for searching tree-structured data such as constituency parses are not efficient. Current research focuses on developing an indexing mechanism and query language to support fast search over a large number of small trees.
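To put the reported throughput in perspective, the following minimal sketch estimates how long the English material would take to parse at the stated rate. It assumes, purely for illustration, that all sentences were parsed on the Pentium 4 cluster alone, which the text does not claim; the function name `days_to_parse` and the constant names are hypothetical, and the 3.1 million figure is itself rounded.

```python
def days_to_parse(sentences, per_day_low, per_day_high):
    """Return (best_case_days, worst_case_days) for parsing `sentences`.

    The best case uses the higher daily rate, the worst case the lower one.
    """
    best = sentences / per_day_high
    worst = sentences / per_day_low
    return best, worst


# Figures reported in the text (the first is rounded in the source).
INTERNET_ARCHIVE_SENTENCES = 3_100_000
ALTAVISTA_SENTENCES = 328_350

# Reported Pentium 4 cluster throughput, in sentences per day.
RATE_LOW, RATE_HIGH = 150_000, 200_000

# Assumption: the whole English corpus is parsed on the Pentium 4 cluster only.
total = INTERNET_ARCHIVE_SENTENCES + ALTAVISTA_SENTENCES
best, worst = days_to_parse(total, RATE_LOW, RATE_HIGH)
print(f"{total:,} sentences: {best:.1f} to {worst:.1f} days")
```

Under these assumptions the English corpus corresponds to roughly 17 to 23 days of continuous parsing on the Pentium 4 cluster, which suggests that annotation throughput, not database capacity, bounds corpus growth at this scale.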