Korean/English Machine Translation: Semantics and Morphology

Terri Paik
Senior Computer Science Major
15 October 1997

I. Introduction to Machine Translation

This summer I worked with Dr. Bonnie Dorr (Department of Computer Science and UM Institute for Advanced Computer Studies) on her project in Korean/English Machine Translation. Machine translation (MT) is the process of having a computer translate text from one language into another. While this would be a difficult task for a pair like English and French, it becomes much more complex for a pair like English and Korean - two languages that have little in common. Dr. Dorr and her associates have developed a simple Korean analyzer, but the existing analyzer recognizes only simple sentences and a limited number of words. The current project is expanding the system so that it can handle more linguistic phenomena and thousands of new words.

As mentioned above, Korean is very different from English. Korean is a head-final language, which means that a preposition comes at the end of a prepositional phrase, a verb comes at the end of a verb phrase, etc. It is also a synthetic language, which means that morphemes are attached to root words to form complex words. Thus, the sentence "I go to the store" becomes

    nai.ga si.jaq.ei gan.da.
    I-nom store-to go-present tense-indic

"Father ate dinner" becomes

    a.be.ji.ga je.nyeg.xl dx.syess.da.
    father-nom dinner-acc eat-honorific-past tense-indic

MT comprises several stages, each critical to the success of the system. First, the sentence is broken up into morphemes using a morphology lexicon that lists all possible morphemes, and a set of morphology rules that specify how the surface representations of the morphemes change as they are combined. The morphology lexicon identifies the meaning, part of speech, and other features of each morpheme. Second, the parser parses this information into a syntax tree.
Third, the composer uses a Korean LCS lexicon to compose the tree into a Lexical Conceptual Structure (LCS), i.e., a low-level semantic representation of the surface sentence. Fourth, the syntactic generator looks up the English words in an English LCS lexicon and generates the English sentence. My part in this project was to develop the morphology lexicon (used in stage 1) and the LCS lexicon (used in stage 3).

II. Building a Morphology Lexicon

The morphology lexicon was based on a corpus of about 100 sentences. Every morpheme in the corpus became an entry in the lexicon, grouped by its class, or part of speech. I started out with about twelve classes, but later sentences introduced new phenomena that required splitting classes and creating new ones. The lexicon now comprises 22 classes: VERBS, NOUNS, ADJECTIVES, ADVERBS, NUMBERS, DETERMINERS, CONJUNCTIONS, 6 verb suffix classes, 5 noun suffix classes, 3 adjective suffix classes, and number suffixes. The format for an entry is as follows:

    morpheme /CONTINUATION CLASS "features"

The continuation class tells the recognizer which class(es) can attach to the morpheme. End means that no other morpheme can attach, i.e. the word must end.

To demonstrate how the morphology lexicon works, I have provided a bare-bones sample lexicon (page 4) based on the "I go to the store" and "Father ate dinner" sentences. This lexicon only illustrates the concept of how individual morphemes are identified; it does not follow the recognizer's actual syntax, and the real Korean lexicon looks very different. The first section (ALTERNATION ...) lists each continuation class and names the set of classes that it encompasses. Then come the morphemes, arranged by class name (LEXICON ...). A 0 entry means "none of the above": the recognizer takes it when no other morpheme in the class matches.

To illustrate, let us trace the path of nai.ga si.jaq.ei gan.da (I go to the store). The recognizer starts with nai.ga. It first looks up LEXICON INITIAL (a reserved class name) and sees that it should continue with Begin.
Begin, according to the alternation list, encompasses VERBS and NOUNS. So the recognizer searches through VERBS and then NOUNS to find a morpheme that could match nai.ga. It sees nai, which is followed by CASE. It stores the features from nai in a temporary bag and searches CASE. Under CASE, it sees +ga, which is followed by End. Since this matches nai.ga successfully, the recognizer dumps out all the features that it has collected: ((root nai) (gloss I) (person first) (cat noun) (case nom)).

Next up is si.jaq.ei, and once again we begin with VERBS and NOUNS. The recognizer spots si.jaq and follows the path to POSTPOSITIONS. Voila, it sees +ei and End. Out comes ((root si.jaq) (gloss store) (cat noun post) (pgloss to)).

Finally, we come to gan.da. This case is somewhat trickier because it involves a morphological transformation. Not finding gan.da or gan among VERBS and NOUNS, the recognizer chooses ga and follows the path to HONOR-VERB. +X.sI does not match +n.da, so it takes the 0 option to TENSE-VERB. There it finds +NUn.da /End, declares a match, and outputs ((root ga) (gloss go) (cat verb) (honorific -) (tense present) (mood indic)).

+NUn.da matches +n.da because uppercase letters indicate parts of the morpheme that may change at the surface according to context. All such changes are specified in the morphology rules; there is a rule stating that +NUn.da becomes +n.da when it follows a verb ending in a. An analogy might be the English plural morpheme +Es, where one rule states that E is realized as e when it follows nouns ending in x or s. I will not elaborate on how the morphology rules work because they were not a major part of my summer research.

Hopefully, the reader can now trace the second sentence, a.be.ji.ga je.nyeg.xl dx.syess.da, through the lexicon. Further examples could show off the more advanced capabilities of the morphology lexicon, e.g. handling ambiguity, and using the ALTERNATION section to design more complex paths.
However, the two sentences illustrate the basic operation of the recognizer with the morphology lexicon.

I found the morphology lexicon to be the most enjoyable and interesting part of my summer research. Having learned Korean from my parents, I had never analyzed gan.da as ga + NUn.da. The field of Korean linguistics is not well established, so I had to start almost from scratch. The challenge was in determining which morphemes to group together as a class, and how the classes should connect to one another so that the structure was consistent for every sentence - not just every sentence in the corpus, but future sentences as well.

One limitation of the recognizer is that it does not follow continuation classes across word boundaries. This was a problem because some Korean phrases are logically one morpheme but are split across words. An example of this is +ei dai.han, which translates into the English preposition about. dai.han does not make sense by itself, so it was difficult to assign a meaning and part of speech to it. Furthermore, the representation of dai.han had to be consistent with ei so that, when put together, they meant about. I resolved this by creating a verb dai.ha meaning "to be about", which could then pick up the relative complementizer +n. Now, +ei dai.han is approximately equivalent to about:

    a.be.ji.ei dai.han i.ya.gi
    father-dat be about-rel story
    (the story that is about Father)

The morphology stage is actually relatively simple. The "dirty work" of machine translation gets done in stages 2 and 3 - parsing the sentence and composing it into Lexical Conceptual Structure (LCS). Thankfully, I did not have to do this. However, I did build the LCS lexicon used in stage 3, which I will discuss in the next section.
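The trace described in this section can be sketched in a few lines of Python. This is only an illustration of the continuation-class idea, not the team's recognizer: the function name recognize, the dictionary layout, and the choice to return every complete analysis are my assumptions. Because the morphology rules are not covered here, each entry stores its surface spelling directly (for example, lexical +NUn.da is written as n.da, which in this toy holds only because the single verb ends in a). Only the entries needed for the first sample sentence are included.

```python
# Toy recognizer driven by continuation classes (illustrative sketch only).
# Assumption: surface spellings are stored directly in place of the
# morphology rules, e.g. lexical +NUn.da appears here as "n.da".

ALTERNATIONS = {
    "Begin": ["VERBS", "NOUNS"],
    "CASE": ["CASE"],
    "POSTPOSITIONS": ["POSTPOSITIONS"],
    "HONOR-VERB": ["HONOR-VERB"],
    "TENSE-VERB": ["TENSE-VERB"],
    "VERB-END": ["VERB-END"],
}

# class name -> list of (surface form, continuation class, features).
# An empty surface form plays the role of the 0 entry.
LEXICON = {
    "VERBS": [
        ("ga", "HONOR-VERB", (("root", "ga"), ("gloss", "go"))),
    ],
    "NOUNS": [
        ("nai", "CASE", (("root", "nai"), ("gloss", "I"), ("person", "first"))),
        ("si.jaq", "POSTPOSITIONS", (("root", "si.jaq"), ("gloss", "store"))),
    ],
    "CASE": [
        (".ga", "End", (("cat", "noun"), ("case", "nom"))),
        (".xl", "End", (("cat", "noun"), ("case", "acc"))),
    ],
    "POSTPOSITIONS": [
        (".ei", "End", (("cat", "noun post"), ("pgloss", "to"))),
        ("", "CASE", ()),
    ],
    "HONOR-VERB": [
        ("", "TENSE-VERB", (("cat", "verb"), ("honorific", "-"))),
    ],
    "TENSE-VERB": [
        ("n.da", "End", (("tense", "present"), ("mood", "indic"))),
        ("", "VERB-END", (("tense", "no-tense"),)),
    ],
    "VERB-END": [
        (".da", "End", (("mood", "indic"),)),
    ],
}


def recognize(word, continuation="Begin", features=()):
    """Return the feature tuple of every path that consumes all of `word`."""
    if continuation == "End":
        # A path succeeds only if nothing of the word is left over.
        return [features] if word == "" else []
    analyses = []
    for class_name in ALTERNATIONS[continuation]:
        for surface, next_class, new_features in LEXICON[class_name]:
            if word.startswith(surface):
                rest = word[len(surface):]
                analyses.extend(recognize(rest, next_class, features + new_features))
    return analyses


if __name__ == "__main__":
    for word in "nai.ga si.jaq.ei gan.da".split():
        print(word, recognize(word))
```

Returning all analyses rather than stopping at the first match is a deliberate simplification: it keeps the function short and leaves room for ambiguous words, at the cost of not mirroring the step-by-step search order described in the trace.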
    ALTERNATION /Begin VERBS NOUNS
    ALTERNATION /VERBS VERBS
    ALTERNATION /NOUNS NOUNS
    ALTERNATION /POSTPOSITIONS POSTPOSITIONS
    ALTERNATION /CASE CASE
    ALTERNATION /VERB-END VERB-END
    ALTERNATION /TENSE-VERB TENSE-VERB
    ALTERNATION /HONOR-VERB HONOR-VERB
    ALTERNATION /End End

    LEXICON INITIAL
    0 /Begin ""

    ;;; noun suffixes
    LEXICON CASE
    +ga /End "(cat noun) (case nom)"
    +xl /End "(cat noun) (case acc)"

    LEXICON POSTPOSITIONS
    +ei /End "(cat noun post) (pgloss to)"
    0 /CASE ""

    ;;; verb suffixes
    LEXICON HONOR-VERB
    +X.sI /TENSE-VERB "(cat verb) (honorific +)"
    0 /TENSE-VERB "(cat verb) (honorific -)"

    LEXICON TENSE-VERB
    +Ass /VERB-END "(tense past)"
    +NUn.da /End "(tense present) (mood indic)"
    0 /VERB-END "(tense no-tense)"

    LEXICON VERB-END
    +da /End "(mood indic)"

    ;;; root words
    LEXICON VERBS
    ga /HONOR-VERB "(root ga) (gloss go)"
    dxL /HONOR-VERB "(root dxl) (gloss eat)"

    LEXICON NOUNS
    nai /CASE "(root nai) (gloss I) (person first)"
    a.be.ji /POSTPOSITIONS "(root a.be.ji) (gloss father)"
    si.jaq /POSTPOSITIONS "(root si.jaq) (gloss store)"
    je.nyeg /POSTPOSITIONS "(root je.nyeg) (gloss dinner)"

Extremely simplified example of a morphology lexicon based on the two sample sentences:
    "nai.ga si.jaq.ei gan.da" : I go to the store
    "a.be.ji.ga je.nyeg.xl dx.syess.da" : Father ate dinner

III. Building an LCS Lexicon

Lexical Conceptual Structure (LCS) is a language-independent representation of the meaning of a sentence. Thus it serves as an intermediate language in the MT process. Whereas human translators would translate Korean directly into English, machines translate the Korean into LCS, and then translate the LCS into English. This makes it easier to expand the system for additional languages. Theoretically, if a Spanish-English MT system exists, we only need to write the Korean-LCS portion to perform MT for Korean-English and Korean-Spanish.

Linguists have determined that sentences center around verbs.
Since the verb primarily determines the structure of the sentence, the LCS lexicon focuses on verbs. My task was to build the LCS lexicon for Korean verbs. This was more challenging than the morphology lexicon. Although I was not able to include all the verbs in the language, I reached my summer's goal of 1000 verbs. I estimate the total number of Korean verbs to be 2500-3000, so there are still plenty more.

Understandably, LCS is very difficult to read. A one-line sentence in English might become a page or more in LCS. Fortunately, I did not have to write out the LCS representation for each verb by hand. Dr. Dorr has an acquisition program that generates the LCS for a verb, given its Levin class and thematic grids.

Beth Levin is a linguist who established a classification system for English verbs. The classification is based on both semantics and syntax. For example, Hit Verbs (18.1) are similar in meaning to Swat Verbs (18.2). Hit Verbs include kick, strike, and tap. Swat Verbs include peck, punch, and slug. The difference? For one, the Hit Verbs make sense in the context "I ____ the stick against the fence" but the Swat Verbs do not. Thus they have different thematic grids.

The thematic grids specify what arguments need to be present for the clause to be well-formed. For example, the verb conceal requires an agent (John) and a theme (money), and may take a possessional modifier (from his wife) and a locational modifier (in/under the desk). The thematic grid notation for conceal would be:

    _ag_th,mod-poss(from),mod-loc()

The underscore in front of ag and th indicates that those arguments are obligatory, and the comma in front of mod-poss and mod-loc indicates that those arguments are optional. The from in parentheses means that the possessional modifier must be preceded by the preposition from. Empty parentheses mean that a preposition precedes the modifier but it could be one of many.
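To make the notation concrete, here is a small Python sketch that splits a thematic grid into its arguments. It is not part of Dr. Dorr's acquisition program; the names Arg and parse_grid, and the convention of using None for "no parentheses" and an empty string for "empty parentheses", are assumptions made for this illustration. It also assumes role names consist only of lowercase letters and hyphens, as in the examples above.

```python
import re
from typing import List, NamedTuple, Optional


class Arg(NamedTuple):
    role: str              # e.g. "ag", "th", "mod-poss"
    obligatory: bool       # True after "_", False after ","
    prep: Optional[str]    # None: no parentheses; "": any preposition


# One argument: a marker, a role name, and optional parentheses.
ARG = re.compile(r"([_,])([a-z-]+)(?:\(([a-z]*)\))?")
# A whole grid is one or more arguments with nothing in between.
GRID = re.compile(r"(?:[_,][a-z-]+(?:\([a-z]*\))?)+")


def parse_grid(grid: str) -> List[Arg]:
    """Parse a grid such as '_ag_th,mod-poss(from),mod-loc()'."""
    if not GRID.fullmatch(grid):
        raise ValueError(f"not a thematic grid: {grid!r}")
    return [
        Arg(role=m.group(2), obligatory=(m.group(1) == "_"), prep=m.group(3))
        for m in ARG.finditer(grid)
    ]


if __name__ == "__main__":
    for arg in parse_grid("_ag_th,mod-poss(from),mod-loc()"):
        print(arg)
```

Validating the whole string before extracting the pieces keeps malformed grids from being silently half-parsed, which matters when a lexicon file runs to thousands of hand-edited lines.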
The complete format for each LCS lexicon entry is:

    classnum#thematic grid#Korean verb#English translation#

Most English verbs have more than one thematic grid. Even a simple verb like bake needs separate thematic grids for John baked Mary a cake (_ag_ben_th), John baked a cake for Mary (_ag_th,ben(for)), A baker bakes (_ag), The cake baked (_th), The oven bakes the cake (_instr_th), etc. Although the LCS lexicon for English verbs lists about 5,000 unique verbs, it is about 12,000 lines long (one line per thematic grid). The entries are sorted by class, then by thematic grid, then alphabetically by verb.

Dr. Dorr suggested that I start from the English verb lexicon and build the Korean lexicon by translating the English verbs. However, I thought that this method would leave out many Korean verbs that do not have one-to-one correspondences with English verbs. So I obtained a Korean verb dictionary and started adding the verbs alphabetically. Progress was slow because I had to change gears for each verb. I soon discovered that a good number of verbs were accumulating at the bottom of my file under Unresolved; I could not find classes for them. Levin's classification scheme is based on English verbs, and some Korean verbs are quite incompatible with this scheme. Examples are gag.o.ha, which is best described as "to be willing to face / to be determined despite", and ggu, which means "to dream" when the noun "dream" is the theme. One cannot ggu anything other than a dream, yet the noun "dream" must be present. The verb ggu alone does not have meaning.

Problems like these prompted me to adopt Dr. Dorr's original approach: translating the English verbs. For each class, I first had to figure out what distinguished the class from similar classes.
Then I looked up each English verb in my Korean-English dictionary (even the simplest verbs), found the sense that matched the current class, and then looked up the Korean verb(s) on the other side to make sure that I assigned the best possible English translation to the verb. Then I checked the thematic grid and modified it for the Korean verb. Although this was much faster than adding from the Korean dictionary, there were still verbs that took days of fretting before I could resolve them. If I needed to delete or add an argument to the thematic grid for the Korean verb, it would no longer belong to the same class. In the case of deleting, I usually changed the argument type to optional and left the verb in the class. In the case of adding or substituting, I often had to search for another class that was semantically different but had the correct thematic grid and LCS.

I also resolved some of the incompatible verbs. For example, I assigned ggu to Engender Verbs (beget, spawn, etc.) and translated it "to dream", but I tagged the entry so that "dream" must be present as the theme:

    27#_ag_th#ggu#dream#!!-ingly = 0 (th = dream)

On the other hand, gag.o.ha is still in the Unresolved pile.

My 1000-verb lexicon consists of the completed classes 9-28, and varying amounts for classes 29-55. My successor will need to complete classes 29-55 and then go through the Korean verb dictionary and add what is missing. I have left my comments with the file for as smooth a transition as possible.

I would like to thank Dr. Dorr for allowing me this opportunity. MT has sparked my interest in natural-language processing and I am now considering pursuing it as a career. I feel privileged to have been part of the team this summer, and I am just beginning to appreciate the challenges involved in each step of the translation process. I learned a great deal about MT and Korean linguistic phenomena, and I feel that I have made a meaningful contribution to what may be the first Korean MT project.
-------
1. This report uses the Korean MT team's home-brew transliteration. While traditional transliterations are more phonologically accurate, our transliteration simplifies the morphology rules.
2. Nominative case marker.
3. Indicative marker.
4. Accusative case marker.
5. Dative case marker.
6. Relative complementizer.