Tokenization and Word List

Not every sentence in every HTML document is necessarily of interest. This is especially true for AltaVista searches, where specific words were almost certainly given as part of the query. For such cases the pipeline includes a module that discards every sentence that does not contain at least one word from a list of "go words." A tokenizer is run first so that punctuation and other miscellany do not confuse the word-list filter. These are two separate modules. The tokenization module adds a <tokens> tag inside each <sentence> tag, holding the tokenized version of the sentence. The word-list module removes sentences that do not contain a go word, but it retains the two sentences surrounding each acceptable sentence so that context is available when the sentence is displayed. These context sentences are not annotated.
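The two steps can be sketched roughly as follows. This is a minimal illustration in Python, not the pipeline's actual modules: the function names `tokenize` and `filter_sentences`, the regular expression used for tokenization, the case-insensitive matching, and the use of the sentence's original position as its sequence number are all assumptions. "The two sentences surrounding" an acceptable sentence is read here as the one immediately before it and the one immediately after it.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens.

    Assumed rule: runs of word characters form one token, and every other
    non-space character is a token of its own.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


def filter_sentences(sentences, go_words):
    """Keep sentences containing a go word, plus their immediate neighbors.

    Returns a list of (kind, seqid, body, tokens) tuples, where kind is
    "sentence" for a match and "context" for a retained neighbor.
    Context entries carry no tokens, since they are not annotated.
    """
    go = {w.lower() for w in go_words}  # assumption: matching ignores case
    tokenized = [tokenize(s) for s in sentences]
    hits = {
        i for i, toks in enumerate(tokenized)
        if any(t.lower() in go for t in toks)
    }
    result = []
    for i, body in enumerate(sentences):
        if i in hits:
            result.append(("sentence", i + 1, body, " ".join(tokenized[i])))
        elif i - 1 in hits or i + 1 in hits:
            result.append(("context", i + 1, body, None))
        # every other sentence is discarded
    return result
```

Tokenizing before filtering means a go word such as "newsletter" still matches when it is followed directly by a comma or period in the original text.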

In the pipeline's XML output, some sentences have been discarded, others are marked as context, and a tokenized version has been added to each sentence that matched the word list:

<document ip="204.71.212.248" content_type="text/html" 
          uri="http://www.transmitter.com:80/curr2000/curr000313.html" 
          timestamp="20010915184816">
      <context seqid="1">
         <body>RF CURRENT . </body>
      </context>
      <sentence seqid="2">
         <body>Welcome to RF Current, a weekly electronic newsletter 
               focusing on Broadcast technical and F.C.C. </body>
         <tokens>Welcome to RF Current , a weekly electronic newsletter 
                 focusing on Broadcast technical and F . C . C . </tokens>
      </sentence>
      <context seqid="3">
         <body>related issues. </body>
      </context>
</document>
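A later stage that consumes this output might read it along the following lines. This is only a sketch using Python's standard `xml.etree.ElementTree` parser on a well-formed document shaped like the example above; the function name `read_document` and the returned tuple layout are hypothetical and not part of the pipeline.

```python
import xml.etree.ElementTree as ET

# A well-formed document in the shape shown above.
SAMPLE = """<document ip="204.71.212.248" content_type="text/html"
          uri="http://www.transmitter.com:80/curr2000/curr000313.html"
          timestamp="20010915184816">
      <context seqid="1">
         <body>RF CURRENT . </body>
      </context>
      <sentence seqid="2">
         <body>Welcome to RF Current, a weekly electronic newsletter
               focusing on Broadcast technical and F.C.C. </body>
         <tokens>Welcome to RF Current , a weekly electronic newsletter
                 focusing on Broadcast technical and F . C . C . </tokens>
      </sentence>
      <context seqid="3">
         <body>related issues. </body>
      </context>
</document>"""


def read_document(xml_text):
    """Return (tag, seqid, body, tokens) for each sentence or context element.

    tokens is a list of strings for annotated sentences and None for
    context sentences, which have no <tokens> child.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root:
        if elem.tag not in ("sentence", "context"):
            continue
        # Collapse the line breaks and indentation inside <body>.
        body = " ".join(elem.findtext("body", default="").split())
        tokens_text = elem.findtext("tokens")
        tokens = tokens_text.split() if tokens_text is not None else None
        rows.append((elem.tag, int(elem.get("seqid")), body, tokens))
    return rows
```

Because the tokens are separated by whitespace, a plain `split()` is enough to recover them from the <tokens> element.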

Aaron Elkiss 2003-05-14