MxTerminator

The next step in the pipeline is sentence boundary disambiguation. The sentence boundary disambiguation module wraps each sentence it encounters in a <sentence> tag pair. The tag also carries an attribute giving the sequence ID of the sentence, that is, its position in the document (1st, 2nd, 10th, 100th, and so on).
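The tagging step described above can be sketched as follows. This is a minimal illustration in Python, not the module's own code (the actual interface is written in Perl). The function name `wrap_sentences` is hypothetical, and the `seqid` attribute and `<body>` child element are taken from the sample output shown later in this section.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_sentences(sentences, start=1):
    """Wrap each sentence string in a <sentence> element with a running seqid.

    Hypothetical helper: the element layout mirrors the sample output in
    this section; the real pipeline may format its output differently.
    """
    parts = []
    for seqid, text in enumerate(sentences, start=start):
        parts.append(
            f"<sentence seqid={quoteattr(str(seqid))}>\n"
            f"   <body>{escape(text)} </body>\n"
            f"</sentence>"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    print(wrap_sentences(["RF CURRENT .", "Welcome to RF Current."]))
```

Numbering starts at 1 so that the attribute reads as "the Nth sentence in the document", and the text is escaped so that characters such as `&` do not break the markup.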

The module is a simple Perl interface that starts up and controls Adwait Ratnaparkhi's MxTerminator [6]. Unlike many sentence boundary disambiguation tools, which rely on a set of regular expressions or other rules, MxTerminator uses a statistical approach based on maximum entropy. It works extremely well overall, but one problem is that it was trained on the Wall Street Journal corpus. Ratnaparkhi claims a high degree of portability to other domains, and even to other languages that use the Roman alphabet, but the model implicitly assumes that the text being annotated actually consists of a stream of sentences. MxTerminator can therefore have difficulty with some of the peculiarities of Internet text, such as text menus and lists of links. Using the preexisting HTML markup to provide hints to MxTerminator helps to some degree; retraining MxTerminator would likely help more.

Unfortunately, MxTerminator does not include any functionality to identify and discard non-sentences. Non-sentences are unlikely to come up as the result of a search, but they still take time to annotate, can confuse annotation tools, and take up space in the database.

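At this stage the sentences are delimited. MxTerminator isn't perfect: in the example below it treats the final period of the abbreviation "F.C.C." as a sentence boundary, so a single sentence is split across sentences 2 and 3: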

<document ip="204.71.212.248" content_type="text/html" 
    uri="http://www.transmitter.com:80/curr2000/curr000313.html" 
    timestamp="20010915184816">
      <sentence seqid="1">
         <body>RF CURRENT . </body>
      </sentence>
      <sentence seqid="2">
         <body>Welcome to RF Current, a weekly electronic newsletter 
               focusing on Broadcast technical and F.C.C. </body>
      </sentence>
      <sentence seqid="3">
         <body>related issues. </body>
      </sentence>
</document>
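Errors like the one above can be located in the annotated output after the fact. The sketch below is an assumption and not part of the pipeline described here: it parses a `<document>` with Python's standard `xml.etree.ElementTree` and flags any sentence whose body starts with a lowercase letter, which is a rough hint that the preceding boundary was misplaced. The function name `suspicious_splits` is hypothetical, and the heuristic will miss some bad splits and flag some legitimate sentences.

```python
import xml.etree.ElementTree as ET


def suspicious_splits(xml_text):
    """Return (seqid, text) pairs for sentences that begin with a lowercase letter.

    Heuristic only: a lowercase first character suggests the sentence is a
    continuation of the previous one.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    for sentence in root.iter("sentence"):
        body = sentence.findtext("body", default="")
        text = " ".join(body.split())  # collapse line breaks and indentation
        if text and text[0].islower():
            flagged.append((int(sentence.get("seqid")), text))
    return flagged
```

Applied to the example document above, this flags only sentence 3 ("related issues."), the fragment produced by the split after "F.C.C.".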

Aaron Elkiss 2003-05-14