Document Extraction

Currently there are two modules for document extraction. The first uses Internet Archive crawls in conjunction with a language identification package to extract documents in the language of interest. An Internet Archive crawl is typically a 25-100 megabyte (compressed) archive of essentially random HTML documents, so using these crawls should produce a good random sample of sentences. Because the documents come from many different sources, this avoids problems one could encounter using, for example, the Wall Street Journal corpus alone. The goal is similar to that of the BNC, but Internet sources should be even more heterogeneous, reflecting a wide mixture of styles and usages.
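As a rough illustrative sketch (not the actual module), the following Python code filters documents from a crawl that has already been unpacked into a directory of HTML files, using a crude stopword-ratio heuristic in place of a real language identification package. The directory layout, the stopword list, and the acceptance threshold are all assumptions for illustration.

import os
import re
import sys

# Assumption for illustration: the crawl has been unpacked into a directory
# of HTML files. A real module would read the archive format directly and
# use a proper language identification package.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for"}
TAG_RE = re.compile(r"<[^>]+>")

def looks_like_english(text, threshold=0.04):
    """Crude language ID: fraction of tokens that are common English words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
    return hits / len(tokens) >= threshold

def extract_documents(crawl_dir):
    """Yield (path, text) for documents judged to be in the target language."""
    for name in os.listdir(crawl_dir):
        if not name.endswith(".html"):
            continue
        path = os.path.join(crawl_dir, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            text = TAG_RE.sub(" ", f.read())  # naive tag stripping
        if looks_like_english(text):
            yield path, text

if __name__ == "__main__":
    for path, _ in extract_documents(sys.argv[1]):
        print(path)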

In addition, it is possible to specify an AltaVista query and select sentences for annotation based on its results. This is useful when the preexisting annotated corpus is too small or does not contain enough examples of some particular word of interest, but it is much slower, since the sentences must be retrieved and annotated before they can be searched. By writing an AltaVista query that over-generates results for the real search of interest, a narrow cross-section of the Web can be obtained and searched. Since the results of AltaVista searches are not random, they must be kept separate from sentences obtained from Internet Archive crawls; this is accomplished by storing a tag for each document denoting its source. Our AltaVista search module uses "puf" (for "Parallel URL Fetch"), an automatic URL fetcher similar to the popular program wget, to retrieve the results of AltaVista searches.
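As a hedged sketch of the fetch-and-tag step (not puf itself), the following Python code retrieves a list of result URLs in parallel and records a source tag on each stored document so that search-derived sentences can later be kept apart from crawl-derived ones. The urls placeholder, the SOURCE_TAG value, and the record format are assumptions for illustration.

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption for illustration: the URLs would come from parsing AltaVista
# result pages; here they are just a placeholder list.
SOURCE_TAG = "altavista"  # distinguishes these documents from crawl-derived ones

def fetch(url, timeout=10):
    """Fetch one URL; return a record tagged with its source, or None on error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        return {"url": url, "source": SOURCE_TAG, "html": body}
    except OSError:
        return None

def fetch_all(urls, workers=8):
    """Fetch URLs in parallel, roughly as puf or wget would."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [doc for doc in pool.map(fetch, urls) if doc is not None]

if __name__ == "__main__":
    urls = ["http://example.com/"]  # placeholder result URLs
    for doc in fetch_all(urls):
        print(json.dumps({"url": doc["url"], "source": doc["source"]}))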

