Once the raw HTML documents have been obtained, the HTML tags must be stripped and the documents must be split into sentences. The HTML-to-text module does some of the work of sentence boundary disambiguation, and MxTerminator [6] does the rest. Since HTML can give some clues as to document structure, the module can insert hints telling MxTerminator where sentence breaks should be forced: for example, header tags like <h1> force a sentence break, as do table-formatting tags and other tags that break up text. At the end of this stage, an XML file is produced with minimal markup: a start and end tag for the file, as well as tags delimiting the beginning and end of each document. The HTML-to-text module can also optionally use Eugene Ludovik's Reco language ID package; this is useful mainly for Internet Archive sources, where the HTML files are in many different languages. Since MxTerminator works reasonably well with languages that use the Latin alphabet, extracting a different language from Internet Archive files requires changing only the parameter that specifies the language of interest.
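As a rough illustration of the tag-stripping step, the following Python sketch uses the standard library's html.parser to drop tags and emit a break hint wherever a block-level tag occurs. The set of break-forcing tags, the names BreakHintExtractor and html_to_text, and the exact hint string (modeled on the sample output of this stage) are assumptions made for illustration; this is not the module's actual implementation.

```python
from html.parser import HTMLParser

# Assumed set of tags that force a sentence break. The text names only
# header tags such as <h1>, table-formatting tags, and "other tags that
# break up text".
BREAK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "br", "li",
              "div", "table", "tr", "td", "th"}

# Hint string modeled on the sample output; its exact form is an assumption.
BREAK_HINT = " . <IGNORE>. . "


class BreakHintExtractor(HTMLParser):
    """Strip tags and record a break marker at each break-forcing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text chunks; None stands for a break hint
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def _add_break(self):
        # Avoid emitting two hints in a row (e.g. </h1> followed by <p>).
        if not self.parts or self.parts[-1] is not None:
            self.parts.append(None)

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BREAK_TAGS:
            self._add_break()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BREAK_TAGS:
            self._add_break()

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore script and style contents
        if data.strip():
            self.parts.append(data)
        elif self.parts and self.parts[-1] is not None:
            self.parts.append(" ")  # keep a space between inline elements


def html_to_text(html):
    """Return the text of `html` with break hints in place of block tags."""
    parser = BreakHintExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(BREAK_HINT if p is None else p for p in parser.parts)
    return " ".join(text.split())  # normalize whitespace
```

Calling html_to_text on a page with a heading followed by a paragraph yields the heading text and the paragraph text separated by a hint, which a sentence splitter can then treat as a forced boundary.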
In other systems, a separate XML file might be produced for each document; keeping the files small can increase reliability, since the chance of introducing an XML error into any one file is decreased. However, this does not fit well with the UNIX pipeline: if a separate XML file were used for each source document, there would be an unacceptably high overhead associated with spawning the processes needed to process each file.
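To make the single-file design concrete, here is a minimal Python sketch of a pipeline stage that walks every document in one stream inside a single process. The function name iter_documents and the regular-expression approach are assumptions for illustration; the only markup assumed from the text is the <document ...> and </document> delimiters, and anything outside those delimiters (such as the file-level tags) is ignored.

```python
import re

DOC_OPEN = re.compile(r"<document\b([^>]*)>")
DOC_CLOSE = "</document>"
# Tolerates optional whitespace around "=" in attributes.
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def iter_documents(stream):
    """Yield (attributes, body) for each <document> block in a text stream.

    `stream` is any iterable of text chunks, for example sys.stdin.
    Attribute values are assumed not to contain '>' or '"'.
    """
    buffer = ""
    for chunk in stream:
        buffer += chunk
        while True:
            close_at = buffer.find(DOC_CLOSE)
            if close_at == -1:
                break  # document not complete yet; read more input
            block = buffer[:close_at]
            buffer = buffer[close_at + len(DOC_CLOSE):]
            opened = DOC_OPEN.search(block)
            if opened is None:
                continue  # stray closing tag; skip it
            attrs = dict(ATTR.findall(opened.group(1)))
            yield attrs, block[opened.end():].strip()
```

A stage built this way can be run as one process in a pipeline, iterating with `for attrs, body in iter_documents(sys.stdin)`, so the process start-up cost is paid once per file rather than once per document.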
A typical example of the markup produced at this stage is shown below; the <IGNORE> markers are present to force sentence breaks.
<document uri="http://www.transmitter.com:80/curr2000/curr000313.html" ip="204.71.212.248" timestamp="20010915184816" content_type ="text/html"> . <IGNORE>. . RF CURRENT . <IGNORE>. . Welcome to RF Current, a weekly electronic newsletter focusing on Broadcast technical and F.C.C. related issues. This newsletter is part of The RF Page @ www.transmitter.com , a web site devoted to TV Broadcast RF engineering. For more information see the What is... guide to the R.F. Page site. . <IGNORE>. . Issues are dated each Monday, although recently I've needed an extra day or two to complete each issue. Articles may be posted earlier if time permits or if there is a major, breaking story. . </document>
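The text does not say how later stages consume the forced-break markers. As one possible reading, the Python sketch below splits a document body at each <IGNORE> marker and discards the free-standing periods that surround it, leaving the text segments between forced breaks. The function name split_on_hints and the rule for dropping periods are assumptions based only on the sample above.

```python
HINT_TAG = "<IGNORE>"


def split_on_hints(body):
    """Split a document body into the segments between forced breaks.

    Free-standing "." tokens at the edges of each segment are treated as
    part of the hint and removed; periods attached to words are kept.
    """
    segments = []
    for piece in body.split(HINT_TAG):
        tokens = piece.split()
        while tokens and tokens[0] == ".":
            tokens.pop(0)
        while tokens and tokens[-1] == ".":
            tokens.pop()
        if tokens:
            segments.append(" ".join(tokens))
    return segments
```

Applied to the body of the sample document, this would separate the heading "RF CURRENT" from the paragraphs that follow it, while leaving abbreviations such as "F.C.C." untouched for the sentence splitter to handle.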