Korean Machine Translation

Jun Yang
jun@glue.umd.edu

Abstract: This paper describes a project on Korean machine translation. The project was completed in one semester, so it covers only part of the work that Korean machine translation requires. Its goal was to produce two files: one containing rules of the Korean language, and the other containing Korean dictionary entries. Both files are intended for PC-KIMMO, a program for doing computational phonology and morphology that is typically used to build morphological parsers for natural language processing systems.

1 Introduction

PC-KIMMO is a program for doing computational phonology and morphology. It is typically used to build morphological parsers for natural language processing systems. PC-KIMMO is described in the book "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). KGEN is an auxiliary program for PC-KIMMO. The KGEN program was developed by Nathan Miles as a part-time project.

The phonological component of PC-KIMMO is based on a rule formalism called two-level phonology. A typical two-level rule looks like this:

    y:i => @:C __ +:0

PC-KIMMO cannot directly use rules written in this high-level notation. Two-level rules must first be translated into finite state tables such as this:

        @ y + @
        C i 0 @
    1:  2 0 1 1
    2:  2 3 2 1
    3.  0 0 1 0

The finite state tables can then be used as rules in PC-KIMMO. KGEN produces these tables: it takes two-level rules as input and outputs the corresponding finite state tables.

2 Two-Level Rule

PC-KIMMO rules are written in the two-level format, so the first task is to convert each Korean rule into a two-level rule.
The Korean rule originally given was:

    l --> 0 / __ + n

It means that when a syllable that starts with "n" comes after a syllable that ends with "l", the "l" becomes "0" (that is, it disappears). The KIMMO format for this rule is:

    l:0 => __ +:. n:n

The character "." is used as the morphological delimiter for Korean.

3 Finite State Table

KGEN accepts as input a file of two-level rules and produces as output a PC-KIMMO rules file that contains the finite state tables. The KGEN input file contains three sections: subset specifications, feasible pairs, and rules.

The subsets section of the KGEN input file is optional. It declares the subset names and the alphabetic characters they specify. For example, to declare a subset for the vowels (a, e, i, o, and u), write:

    SUBSET V a e i o u

where "V" is the subset name you create.

The pairs section declares all feasible pairs used in the description. This includes both default correspondences (such as a:a and b:b) and special correspondences (such as y:i and s:0). The pairs section is obligatory. Here is an example:

    PAIRS a e i o u
          a e i o u
    PAIRS a k l n p p s t u u
          i 0 0 0 o w 0 l l 0

A rule is declared with the keyword RULE. The rule must be written all on one line; for example,

    RULE l:0 => __ +:. n:n

The environment line must be one or more underline characters. White space (spaces and tabs, but not new lines) may be used freely to improve readability.

4 Dictionary Entry

The output of KGEN supplies the rules of the Korean language for KIMMO. In addition to the rules, KIMMO needs the Korean dictionary entries that accompany the morphological rules. For example, here are the entries corresponding to "RULE l:0 => __ +:. n:n":

    ROOTS:
    kel /ENDING "(cat v) (root kel-ta) (gloss hang)"
    kil /ENDING "(cat v) (root kil-ta) (gloss be_long)"
    kal /ENDING "(cat v) (root kal-ta) (gloss grind)"
    tal /ENDING "(cat v) (root tal-ta) (gloss attach connect)"

    ENDINGS:
    +nikka /End "(gloss since)"

where "cat v" means that the category is verb, "root kel-ta" means that the root is "kel-ta", "gloss hang" means that the word means "hang", and "+nikka" means that "nikka" can replace the ending of the word ("ta" in this case), adding its own meaning ("since" in this case). So "kel-nikka" becomes "ke-nikka" because of the rule, and it literally means "hang since". According to Korean grammar, however, "hang since" is actually read as "since hang". I will not discuss this detail of Korean grammar, since it was not part of this project. By adding entries like these, I have provided the dictionary entries for KIMMO.

5 Conclusions

Because PC-KIMMO and KGEN were already available, working on Korean machine translation was easier than I expected. In the future, I may take on harder work, such as studying how KIMMO and KGEN are implemented and improving them if possible.

APPENDIX 1: Morphology Rules

Here is KOREAN.TXT from Jun:

!; KOREAN.RUL 4-DEC-96
!; Tables generated by KGEN
!; By Jun S Yang
!
;NULL 0
;ANY @
;BOUNDARY #
!

;SECTION 1: Subsets

SUBSET C b c d f g h j k l m n p q r s t v w x y z ; consonants
SUBSET V a e i o u ; vowels

;SECTION 2: Feasible Pairs

; Consonant defaults
PAIRS b c d f g h j k l m n p q r s t v w x y z
      b c d f g h j k l m n p q r s t v w x y z

; Vowel defaults
PAIRS a e i o u
      a e i o u

; Special correspondences
PAIRS + + + + a k l n p p s t u u
      . i k u i 0 0 0 o w 0 l l 0

;SECTION 3: Rule Syntax

; Rule 1
RULE l:0 => ___ +:. n

; Rule 2
RULE p:o => __ +:. V
RULE p:w => __ +:u V

; Rule 3
RULE t:l => __ +:. V

; Rule 4
RULE u:l => l __ +:. a
RULE u:l => l __ +:. e

; Rule 5
RULE s:0 => __ +:. V

; Rule 6
RULE k:0 => C +:. ___ a:i
RULE l:0 => C +:. ___ u l

; Rule 7
RULE l:l => C +:u ___ o
RULE u:0 => l +:. ___ l o

; Rule 8
RULE w:w => C +:k ___ a
RULE l:l => C +:i ___ a n g

; Rule 9
RULE n:0 => C +:. ___ u n

; Rule 10
RULE n:n => C +:i ___ a

END

APPENDIX 2: Kimmo Automata

Here is KOREAN.RUL from Jun:

; KOREAN.RUL 4-DEC-96
; Tables generated by KGEN
; By Jun S Yang

ALPHABET b c d f g h j k l m n p q r s t v w x y z a e i o u + .
NULL 0
ANY @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u

RULE "defaults" 1 31
    b c d f g h j k l m n p q r s t v w x y z a e i o u + + + + @
    b c d f g h j k l m n p q r s t v w x y z a e i o u . i k u @
1:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

RULE "defaults" 1 11
    a k l n p p s t u u @
    i 0 0 0 o w 0 l l 0 @
1:  1 1 1 1 1 1 1 1 1 1 1

RULE " l:0 => ___ +:. n" 3 4
    l + n @
    0 . n @
1:  2 1 1 1
2.  0 3 0 0
3.  0 0 1 0

RULE " p:o => __ +:. V" 3 4
    p + V @
    o . V @
1:  2 1 1 1
2.  0 3 0 0
3.  0 0 1 0

RULE " p:w => __ +:u V" 3 4
    p + V @
    w u V @
1:  2 1 1 1
2.  0 3 0 0
3.  0 0 1 0

RULE " t:l => __ +:. V" 3 4
    t + V @
    l . V @
1:  2 1 1 1
2.  0 3 0 0
3.  0 0 1 0

RULE " u:l => l __ +:. a" 4 5
    u l + a @
    l l . a @
1:  0 2 1 1 1
2:  3 2 1 1 1
3.  0 0 4 0 0
4.  0 0 0 1 0

RULE " u:l => l __ +:. e" 4 5
    u l + e @
    l l . e @
1:  0 2 1 1 1
2:  3 2 1 1 1
3.  0 0 4 0 0
4.  0 0 0 1 0

RULE " s:0 => __ +:. V" 3 4
    s + V @
    0 . V @
1:  2 1 1 1
2.  0 3 0 0
3.  0 0 1 0

RULE " k:0 => C +:. ___ a:i" 4 5
    k C + a @
    0 C . i @
1:  0 2 1 1 1
2:  0 2 3 1 1
3:  4 2 1 1 1
4.  0 0 0 1 0

RULE " l:0 => C +:. ___ u l" 5 6
    l C + u l @
    0 C . u l @
1:  0 2 1 1 2 1
2:  0 2 3 1 2 1
3:  4 2 1 1 2 1
4.  0 0 0 5 0 0
5.  0 0 0 0 2 0

RULE " l:l => C +:u ___ o" 4 5
    l C + o @
    l C u o @
1:  0 2 1 1 1
2:  0 2 3 1 1
3:  4 2 1 1 1
4.  0 0 0 1 0

RULE " u:0 => l +:. ___ l o" 5 5
    u l + o @
    0 l . o @
1:  0 2 1 1 1
2:  0 2 3 1 1
3:  4 2 1 1 1
4.  0 5 0 0 0
5.  0 0 0 1 0

RULE " w:w => C +:k ___ a" 4 5
    w C + a @
    w C k a @
1:  0 2 1 1 1
2:  0 2 3 1 1
3:  4 2 1 1 1
4.  0 0 0 1 0

RULE " l:l => C +:i ___ a n g" 6 7
    l C + a n g @
    l C i a n g @
1:  0 2 1 1 2 2 1
2:  0 2 3 1 2 2 1
3:  4 2 1 1 2 2 1
4.  0 0 0 5 0 0 0
5.  0 0 0 0 6 0 0
6.  0 0 0 0 0 2 0

RULE " n:0 => C +:. ___ u n" 5 6
    n C + u n @
    0 C . u n @
1:  0 2 1 1 2 1
2:  0 2 3 1 2 1
3:  4 2 1 1 2 1
4.  0 0 0 5 0 0
5.  0 0 0 0 2 0

RULE " n:n => C +:i ___ a" 4 5
    n C + a @
    n C i a @
1:  0 2 1 1 1
2:  0 2 3 1 1
3:  4 2 1 1 1
4.  0 0 0 1 0

END

APPENDIX 3: Conversion of Morphology Rules into Kimmo Rule Format

This appendix shows all the morphology rules that were converted into Kimmo rule format. For example, the first rule is:

    l --> 0 / ___ + n

The Kimmo format for this rule is:

    l:0 => ___ +:. n:n

Note that I used the character "." since this is supposed to be a morphological delimiter in Korean.

In addition, we will have to work on the Korean dictionary entries that accompany the morphological rules. For example, here are the entries corresponding to the first Korean rule:

    ROOTS:
    kel /ENDING "(cat v) (root kel-ta) (gloss hang)"
    kil /ENDING "(cat v) (root kil-ta) (gloss be_long)"
    kal /ENDING "(cat v) (root kal-ta) (gloss grind)"
    tal /ENDING "(cat v) (root tal-ta) (gloss attach connect)"

    ENDINGS:
    +nikka /End "(gloss since)"

------------------------------

2.2. Irregular verbs

1. l --> 0 / ___ + n

   a) nol-ta ('play', 'take a rest')
      aitul-i       cip-eyse  no-nikka    nemwu  sikkulepta
      children-Nom  house-in  play-since  very   noisy
      (It's very noisy since the children are playing in the house.)
   b) kel-ta ('hang')
      ke-nikka
      hang-since
   c) kil-ta ('is long')
      ki-nikka
      is long-since
   d) kal-ta ('grind')
      ka-nikka
      grind-since
   e) tal-ta ('attach' 'connect')
   f) pel-ta ('earn' 'get')
      ton-ul     pe-nikka
      money-Acc  earn-since
   g) mwul-ta ('bite')
   h) sal-ta ('live' 'stay alive')
   i) cwul-ta ('diminish' 'lessen')
   j) nul-ta ('increase')
   k) mel-ta ('is far away')

2.
p --> o / __ + V(owel)
p --> wu / __ + V

   a) tep-ta ('is hot')
      nalssi-ka    te-wu-myen  swuyeng-ul    ha-ca
      weather-Nom  is hot-if   swimming-Acc  do-Propositive
      (If the weather is hot, let's go swimming.)
   b) kop-ta ('is beautiful' 'is elegant')
      kop-ase --> kowu-ase --> ko-wase
      (Here, you can see that after /p/ is changed to /wu/, the vowels contract, resulting in /wase/ rather than /wuase/. Such contractions are very common.)
      kop-ase --> ko-wase
      is beautiful-because
   c) komap-ta ('be grateful')
      komap-ase --> komawase
      is grateful-because
   d) nwup-ta ('lie')
      nwup-e --> nwu-we
      lie-and
   e) cwup-ta ('pick up')
   f) chwup-ta ('is cold')
   g) kip-ta ('sew/mend')

3. t --> l / __ + V(owel)

   a) mwut-ta ('inquire' 'ask')
      mwut-ese --> mwulese
      ask-and
      John-eykey  mwul-ese  hwakin-haca
          -to     ask-and   confirm-Propositive
      (Let's ask John and confirm it.)
   b) ket-ta ('walk')
      ket-ese --> kel-ese
   c) kit-ta ('draw' (water))
      mwul-ul    kil-ese   ka-ca
      water-Acc  draw-and  go-Propositive
      (Let's draw water (from a well) and go.)
   d) sit-ta ('load')
      sit-ese --> sil-ese
   e) tut-ta ('listen' 'hear')
      tut-ese --> tul-ese

4. u --> l / l __ + a (or e)

   a) kilu-ta ('raise' 'breed')
      kilu-e --> kill-e
      Mary-nun  so-lul   kill-e     ton-ul     pelessta
          -Top  cow-Acc  breed-and  money-Acc  made
      (Mary bred cows and made money.)
   b) nalu-ta ('carry')
      nalu-a --> nall-a
   c) kwulu-ta ('roll')
      kwulu-e --> kwull-e
   d) hulu-ta ('flow' 'stream')
      hulu-e --> hull-e
      flow-and
   e) nwulu-ta ('press')
      nwulu-e --> nwull-e
      press-and
   f) ccilu-ta ('poke')
      ccilu-e --> ccill-e
   g) kolu-ta ('choose')
      kolu-a --> koll-a
      choose-and

5. s --> 0 / __ + V(owel)

   a) cis-ta ('build')
      cis-umyen --> ci-umyen
      cip-ul     ci-umyen       na-nun  kot   isa-lul     hal-keta
      house-Acc  build-if/when  I-Top   soon  moving-Acc  do-will
      (If (they) build the house, I will move there very soon.)
   b) kus-ta ('draw' (a line))
      kus-umyen --> ku-umyen
   c) is-ta ('connect')
      is-ese --> i-ese
      connect-and
   d) pwus-ta ('pour')
      pwus-umyen --> pwu-umyen
   e) ces-ta ('stir')
      ces-ese --> ce-ese
      stir-and

6.
ka --> i / C(onsonant) + ___
      Nominative case marker: ka/i, kkeyse (-honorific)
   lul --> ul / C + ___
      Accusative case marker: lul/ul
      Genitive case marker: uy

7. Postpositions: lo --> ulo / C + ___
   (Exception: ulo --> lo / l + ___ )

   Instrumental: lo/ulo
   Reason or Source: lo/ulo, ey
   Status: lo/ulo
   Resultative: lo/ulo
   Locative
      (a) (default): ey
      (b) [+Animate] (=Dative): ey-key, hanthey, kkey (-honorific)
      (c) Source: ey-se, ey-key-se, hanthey-se, pwuthe (-beginning point)
     *(d) Direction: lo/ulo
      (e) Ending point: kkaci
      (f) Eventive: eyse
   Temporal
      (a) Eventive: ey
      (b) Beginning point: eyse, pwuthe
      (c) Ending point: kkaci
   Measure: ey

8. Postpositions: wa --> kwa / C + ___
                  lang --> ilang / C + ___

   Comitative: wa/kwa, lang/ilang, hako

9. Delimiter: nun --> un / C + ___

   Topic: nun/un
      cf: Topic marking of the subject is required when the sentence is classified as a depictive statement.
   Only: man
   Too: to
   Even: cocha, mace
   Each: mata

10. Postpositions: na --> ina / C + ___

   Unselective or Emphasis: na/ina
   Amount (only in interrogative sentences): na/ina

   ** When a delimiter is attached to the subject/object NP, the nominative/accusative case marker on the NP must be deleted. (Case III.(2) above is an exception.)

      - John-un /*John-i-un     Mary-to/ *Mary-ul-to    coahanta
            -Top      -Nom-Top      -too      -Acc-too  like
        `John likes Mary, too.' (`John, even Mary likes him.')

   ** Delimiters can be attached to adverbs/verbs as well as nouns.

11. Others

   (1) Comparative: pota, kathi, chelem
   (2) Coordinate conjunction
      (a) And: wa/kwa, lang/ilang, hako
      (b) Or: na/ina

======================================================================

Verbal Endings

We undertook an analysis of verbal endings from (Ihm et al., 1988). In the description below, unless specified otherwise, the initial vowel of the verbal ending (= `u' or `e') is deleted when the verbal stem ends with a vowel. (Verbs whose stem ends with an `-o'/`-wu' sound are exceptions to this generalization.)
Also, the initial vowel `e' of the verbal ending changes to `a' when the last syllable of the verbal stem includes an `-a' or `-o' sound.

I. Terminative Endings

   *Terminative endings represent the mood type of sentences. They are classified based on the speech level. Speech level is mainly determined by the hearer's age, social status, etc. relative to the speaker's.

   (1) Declarative
      (a) [super high]: `upnita'
      (b) [high]: `eyo', `ciyo'
      (c) [mid-low]: `so', `ne'
      (d) [low]: `ta', `e'
   (2) Interrogative
      (a) [super high]: `upnikka'
      (b) [high]: `eyo', `nayo'
      (c) [mid-low]: `swu', `na'
      (d) [low]: `ni', `nya'
   (3) Imperative
      (a) [super high]: `useyyo'
      (b) [high]: `eyo'
      (c) [mid-low]: `key'
      (d) [low]: `ela', `e'
   (4) Propositive
      (a) [super high]: `siciyo'
      (b) [high]: `upsita', `ciyo'
      (c) [mid-low]: `use'
      (d) [low]: `ca'

II. Adnominal Endings (-involving either a relative clause or a complex NP clause)

   (1) Present tense
      (a) `un': for adjectival verbs
      (b) `nun': otherwise

      - kem-un      koyangi
        black-Adnm  cat
        `A cat which is black'/`a black cat'
      - chayk-ul  ilk-nun    salam
        book-Acc  read-Adnm  man
        `A man who is reading a book'
   (2) Future tense: `ul'
   (3) Past tense
      (a) (ess)`ten': implying reminiscence
      (b) `un': used only with non-adjectival verbs

III. Adverbial Endings

   (1) reason or cause: `se' `nikka' ---> for, as
   (2) weak contrast: `nuntey' ---> while
   (3) conditional: `umyen' `ketun' ---> if, when
   (4) purpose: `lyeko' `le' ---> in order to (do), for (do)ing
   (5) prerequisite: `(e/a)ya' `(e/a)yaman' ---> only when, only if
   (6) goal: `tolok' ---> so that (one) may/can (do)
   (7) concurrence: `umyense' `umye'
   (8) contrast: `ciman' ---> although (cf: coordinate conjunction `ciman')
   (9) separate action: `ta(ka)'
   (10) greater degree: `ulswulok' ---> the more (....the more)
   (11) immediate sequence: `ca' `camaca' ---> as soon as

IV. Nominal Endings -`um' and `ki'

   (1) [+tense]: `um'
   (2) [-tense]: `ki'

   *`um' must be accompanied by a case marker.
   *`um' generally occurs with a factive predicate, and `ki' occurs with a nonfactive predicate.

V. Coordinate Conjunction

   (1) and: `ko', `se'
      *`se' is used when the first conjunct precedes the second one in time sequence or when the first conjunct is subordinate to the second one. (cf: `kose')
   (2) or: `kena'
   (3) but: `ciman', `una'

VI. (Quotative) Complementizer: `ko'

   *`ko' must be preceded by terminative endings

======================================================================

Korean Auxiliary Verbs

Our analysis of Korean auxiliary verbs from (Ihm et al., 1988) revealed that such verbs are classified primarily into two groups: one corresponds to an aspectual specification, and the other corresponds to the representation of a state which is different from the present.

I. Aspectual Specification

   *V = main verb stem

   (1a) V-e `peli-ta': completion
   (1b) V-e `nay-ta': accomplishment
   (1c) V-ko `mal-ta': perfective(?) (-something is done at last)
   (2a) V-e `noh-ta': completion + duration
   (2b) V-e `twu-ta': duration
   (2c) V-e `kaci-ko': duration(?) (-must be used in the form of `conjunction')
   (3a) V-kon `ha-ta': habitual
   (3b) V-e `tay-ta': repetition(?)
   (4) V-ko `iss-ta': progressive

II. The State differing from the present

   (1) V-ko `siph-ta': hope (-want/hope to V)
   (2) V-na `siph-ta': (speaker's) guess
   (3) V-nunka `ha-ta': (speaker's) guess
       V-na `ha-ta'
   (4) V-un `tus-siph-ta': (speaker's) expectation/guess
       V-ul `tus-siph-ta'
   (5) V-un `tus-ha-ta': (speaker's) expectation/guess
       V-ul `tus-ha-ta'
   (6) V-eya `ha-ta': obligation (-have to V)
   (7) V-un `cheyha-ta': pretense (-pretend to V)
   (8) V-ul `ppenha-ta': almost (-almost did something, but no success/completion) ---> past tense required
   (9) V-ulye(ko) `ha-ta': is about to V
   (10) V-koca `ha-ta': volition
   (11) V-ulkka `ha-ta': plan (-not decisive)
   (12) V-ul `manha-ta': is worthwhile to V
   (13) V-na `po-ta': (speaker's) guess ---> tense marker is not allowed

III. Others

   (1) V-e `cwu-ta': of benefit (-did something for others)
   (2) V-e `po-ta': trial
   (3) V-ci `anh-ta': negation (-does not V)
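
A note on reading the state tables in Appendix 2: the following is a minimal Python sketch of how the table for RULE " l:0 => ___ +:. n" could be stepped through by hand. It is not PC-KIMMO code. The names COLUMNS, TABLE, align and accepts are hypothetical, and the column matching is simplified: any pair that is not a named column is treated as @:@, which is enough for this rule because it uses no subsets such as C or V.

```python
# Hypothetical sketch; none of these names come from PC-KIMMO or KGEN.

# State table for RULE " l:0 => ___ +:. n", copied from Appendix 2.
# Each column is a lexical:surface pair; "@:@" stands for any other pair.
COLUMNS = ["l:0", "+:.", "n:n", "@:@"]
TABLE = {
    1: [2, 1, 1, 1],
    2: [0, 3, 0, 0],
    3: [0, 0, 1, 0],
}
# Rows written "1:" are final states; rows written "2." and "3." are not.
FINAL_STATES = {1}


def align(lexical, surface):
    """Pair two equal-length strings character by character ("0" is null)."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")
    return [f"{lex}:{sur}" for lex, sur in zip(lexical, surface)]


def accepts(pairs):
    """Run the table over a list of pairs, starting in state 1."""
    state = 1
    for pair in pairs:
        # Simplification: exact column match first, otherwise fall back to @:@.
        col = COLUMNS.index(pair) if pair in COLUMNS else COLUMNS.index("@:@")
        state = TABLE[state][col]
        if state == 0:  # transition 0 means the rule fails at this pair
            return False
    return state in FINAL_STATES


if __name__ == "__main__":
    print(accepts(align("kel+nikka", "ke0.nikka")))  # l deleted before +n
    print(accepts(align("kel+ta", "ke0.ta")))        # l deleted before +t
    print(accepts(align("kel+ta", "kel.ta")))        # l kept before +t
```

Under these assumptions the first and third calls succeed and the second fails: the table only restricts where l:0 may appear, which is what the "=>" operator expresses.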