Insertion into Database

Once we have the final set of sentences from a particular AltaVista search or Internet Archive crawl, we upload them into a SQL database. At this point each document is assigned a unique document ID, and the set of documents contained in the XML file can be given a tag identifying the Internet Archive crawl or AltaVista search that produced them. The tag is useful for search and indexing tasks: in the case of AltaVista searches, it makes it possible to restrict a search to only the relevant sub-corpus.
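The insertion step might be sketched as follows. This is only an illustration under stated assumptions, not the schema or code actually used: it substitutes SQLite (through Python's standard sqlite3 module) for whatever SQL database was deployed, and the table names, column names, function names, and tag strings are all hypothetical.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one row per document, one row per sentence.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            source_tag TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS sentences (
            sentence_id INTEGER PRIMARY KEY AUTOINCREMENT,
            doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
            position    INTEGER NOT NULL,
            text        TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_documents_tag
            ON documents(source_tag);
    """)


def insert_documents(conn, documents, source_tag):
    """Insert documents (each a list of sentence strings) under one tag.

    Returns the unique document IDs assigned by the database.
    """
    doc_ids = []
    with conn:  # commit on success, roll back on error
        for sentences in documents:
            cur = conn.execute(
                "INSERT INTO documents (source_tag) VALUES (?)",
                (source_tag,),
            )
            doc_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO sentences (doc_id, position, text) "
                "VALUES (?, ?, ?)",
                [(doc_id, i, s) for i, s in enumerate(sentences)],
            )
            doc_ids.append(doc_id)
    return doc_ids


def sentences_for_tag(conn, source_tag):
    """Return the sentences of the sub-corpus identified by source_tag."""
    rows = conn.execute(
        "SELECT s.text FROM sentences s "
        "JOIN documents d ON d.doc_id = s.doc_id "
        "WHERE d.source_tag = ? "
        "ORDER BY s.doc_id, s.position",
        (source_tag,),
    )
    return [row[0] for row in rows]
```

With such a layout, restricting a query to the results of one search reduces to filtering on the tag, for example sentences_for_tag(conn, "altavista:query-1"), where the tag string is again only a made-up example.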

Aaron Elkiss 2003-05-14