Korean/English Machine Translation: Semantics and Morphology
Terri Paik
Senior Computer Science Major
15 October 1997


I.  Introduction to Machine Translation

This summer I worked with Dr. Bonnie Dorr (Department of Computer
Science and UM Institute for Advanced Computer Studies) on her project
in Korean/English Machine Translation.  Machine translation (MT) is the
process of having a computer translate text of one language into another
language.  While this would be a difficult task for a pair like English
and French, it becomes much more complex for a pair like English and
Korean - two languages that have little in common.  Dr. Dorr and her
associates have developed a simple Korean analyzer, but the existing
analyzer only recognizes simple sentences and a limited number of words.
The current project is expanding the system so that it can handle
more linguistic phenomena and thousands of new words.

As mentioned above, Korean is very different from English.  Korean
is a head-final language, which means that the counterpart of a
preposition (a postposition) comes at the end of its phrase, a verb
comes at the end of a verb phrase, etc.  It is also a synthetic
language, which means that morphemes are attached to root words to
form complex words.  Thus, the sentence "I go to the store" becomes

nai.ga si.jaq.ei gan.da.
I-nom store-to go-present tense-indic

"Father ate dinner" becomes
a.be.ji.ga je.nyeg.xl dx.syess.da.
father-nom dinner-acc eat-honorific-past tense-indic

MT comprises several stages, each critical to the success of
the system.  First, the sentence is broken up into morphemes using a
morphology lexicon that lists all possible morphemes, and a set of
morphology rules that specify how the surface representation of the
morphemes changes as they are combined.  The morphology lexicon identifies
each morpheme as to meaning, part of speech, and other features.  Second,
the parser parses this information into a syntax tree.  Third, the
composer uses a Korean LCS lexicon to compose the tree into a Lexical
Conceptual Structure (LCS), i.e., a low-level semantic representation
of the surface sentence.  Fourth, the syntactic generator looks up the
English words in an English LCS lexicon and generates the English
sentence.  My part in this project was to develop the morphology
lexicon (used in stage 1) and the LCS lexicon (used in stage 3). 


II.  Building a Morphology Lexicon

The morphology lexicon was based on a corpus of about 100
sentences.   Every morpheme in the corpus became an entry in the lexicon
grouped by its class, or part of speech.  I started out with about
twelve classes, but later sentences introduced new phenomena that
required splitting classes and creating new ones.  The lexicon now
comprises 22 classes: VERBS, NOUNS, ADJECTIVES, ADVERBS, NUMBERS,
DETERMINERS, CONJUNCTIONS, 6 verb suffix classes, 5 noun suffix classes,
3 adjective suffix classes, and number suffixes.  The format for the
entry is as follows:

morpheme /CONTINUATION-CLASS "features"

The continuation class lets the recognizer know what class(es) can attach
to the morpheme.  End means that no other morpheme can attach, i.e., the
word must end.
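
The entry format can be read mechanically.  The Python sketch below is
not part of the project's software; it assumes the simplified layout
used in the sample lexicon of this report (a morpheme, a slash-prefixed
continuation class, and a feature string in straight double quotes).
The names Entry and parse_entry are hypothetical, and the real Korean
lexicon may use a different syntax.

```python
import re
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    morpheme: str                     # lexical form; "0" is the empty morpheme
    continuation: str                 # class that may attach next, or "End"
    features: List[Tuple[str, str]]   # (name, value) pairs, in file order


# Assumed layout:  morpheme /CLASS "(name value) (name value) ..."
ENTRY_RE = re.compile(r'^(\S+)\s+/(\S+)\s+"(.*)"$')
FEATURE_RE = re.compile(r'\(([^()\s]+)\s+([^()]+)\)')


def parse_entry(line: str) -> Entry:
    """Split one lexicon line into morpheme, continuation class, features."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    morpheme, continuation, feature_text = match.groups()
    features = [(name, value.strip())
                for name, value in FEATURE_RE.findall(feature_text)]
    return Entry(morpheme, continuation, features)


if __name__ == "__main__":
    print(parse_entry('+ga /End "(cat noun) (case nom)"'))
    print(parse_entry('0 /Begin ""'))
```

A feature value may contain a space, as in (cat noun post); the sketch
keeps everything after the feature name as a single value.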

To demonstrate how the morphology lexicon works, I have
provided a bare-bones sample lexicon (at the end of this section) based
on the "I go to the store" and "Father ate dinner" sentences.  This
lexicon only illustrates the concept of how individual morphemes are
identified; it is syntactically incorrect and the actual Korean lexicon
looks very different.

The first section (ALTERNATION ...) lists each continuation
class and names the set of classes that it encompasses.  Then come
the morphemes, grouped by class under LEXICON <class> headings.  A 0
entry means none of the above: it matches no text and simply passes
control to its continuation class.  To illustrate, let us trace the
path of nai.ga si.jaq.ei gan.da (I go to the store).

The recognizer starts with nai.ga.  It first looks up LEXICON
INITIAL (a reserved class name) and sees that it should continue with
Begin.  Begin, according to the alternation list, encompasses VERBS and
NOUNS.  So the recognizer searches through VERBS and then NOUNS to find a
morpheme that could match nai.ga.  It sees nai, which is followed by CASE.
It stores the features from nai in a temporary bag and searches CASE.
Under CASE, it sees +ga which is followed by End.  Since this matches
nai.ga successfully, the recognizer will dump out all the features that
it has collected: ((root nai) (gloss I) (person first) (cat noun)
(case nom)). 

Next up is si.jaq.ei, and once again we begin with VERBS and
NOUNS.  The recognizer spots si.jaq, and follows the path to
POSTPOSITIONS.  Voila, it sees +ei and End.  Out comes ((root si.jaq)
(gloss store) (cat noun post) (pgloss to)). 

Finally, we come to gan.da.  This case is somewhat trickier
because it involves a morphological transformation.  Not finding gan.da
or gan among VERBS and NOUNS, the recognizer chooses ga and follows the path to
HONOR-VERB.  +X.sI does not match +n.da, so it takes the 0 option to
TENSE-VERB.  There it finds +NUn.da /End, declares a match, and
outputs ((root ga) (gloss go) (cat verb) (honorific -) (tense present)
(mood indic)).  +NUn.da matches +n.da because uppercase letters indicate
parts of the morpheme that may change at the surface according to
context.  All such changes are specified in the morphology rules;
there is a rule stating that +NUn.da becomes +n.da when it follows a
verb ending in a.  An analogy for English might be the English plural
morpheme +Es, where one rule states that E is realized as e when it
follows nouns ending in x or s.  I will not elaborate on how the
morphology rules work because they were not a major part of my summer
research.

Hopefully, the reader can now trace the second sentence, a.be.ji.ga
je.nyeg.xl dx.syess.da, through the lexicon.  Further examples could
show off the more advanced capabilities of the morphology lexicon,
e.g. handling ambiguity, and using the ALTERNATION section to design
more complex paths.  However, the two sentences illustrate the basic
operation of the recognizer with the morphology lexicon. 
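
The trace above can be reproduced with a small depth-first search.  The
Python sketch below is an illustration only, not the project's
recognizer.  It encodes the sample lexicon by hand, and the function
surface_forms is a hypothetical stand-in for the morphology rules: it
implements only the one rule stated in this report (+NUn.da surfaces
as n.da after a verb ending in a), assumes that other suffixes attach
as a new syllable after a dot, and gives up on any other morpheme
containing uppercase letters.  As a result it handles the first
sentence but not dx.syess.da.

```python
from typing import List, Tuple

Features = List[Tuple[str, str]]

# Sample lexicon: class -> [(lexical form, continuation class, features)]
LEXICON = {
    "INITIAL": [("0", "Begin", [])],
    "CASE": [
        ("+ga", "End", [("cat", "noun"), ("case", "nom")]),
        ("+xl", "End", [("cat", "noun"), ("case", "acc")]),
    ],
    "POSTPOSITIONS": [
        ("+ei", "End", [("cat", "noun post"), ("pgloss", "to")]),
        ("0", "CASE", []),
    ],
    "HONOR-VERB": [
        ("+X.sI", "TENSE-VERB", [("cat", "verb"), ("honorific", "+")]),
        ("0", "TENSE-VERB", [("cat", "verb"), ("honorific", "-")]),
    ],
    "TENSE-VERB": [
        ("+Ass", "VERB-END", [("tense", "past")]),
        ("+NUn.da", "End", [("tense", "present"), ("mood", "indic")]),
        ("0", "VERB-END", [("tense", "no-tense")]),
    ],
    "VERB-END": [("+da", "End", [("mood", "indic")])],
    "VERBS": [
        ("ga", "HONOR-VERB", [("root", "ga"), ("gloss", "go")]),
        ("dxL", "HONOR-VERB", [("root", "dxl"), ("gloss", "eat")]),
    ],
    "NOUNS": [
        ("nai", "CASE", [("root", "nai"), ("gloss", "I"), ("person", "first")]),
        ("a.be.ji", "POSTPOSITIONS", [("root", "a.be.ji"), ("gloss", "father")]),
        ("si.jaq", "POSTPOSITIONS", [("root", "si.jaq"), ("gloss", "store")]),
        ("je.nyeg", "POSTPOSITIONS", [("root", "je.nyeg"), ("gloss", "dinner")]),
    ],
}

# Only Begin expands to other classes; every other class names itself.
ALTERNATIONS = {"Begin": ["VERBS", "NOUNS"]}


def surface_forms(lexical: str, stem: str) -> List[str]:
    """Hypothetical stand-in for the morphology rules (see lead-in)."""
    if lexical == "0":
        return [""]                       # empty morpheme consumes nothing
    if lexical == "+NUn.da":
        return ["n.da"] if stem.endswith("a") else []
    if any(ch.isupper() for ch in lexical):
        return []                         # rule not described in this report
    if lexical.startswith("+"):
        return ["." + lexical[1:]]        # assumed: suffix starts a new syllable
    return [lexical]                      # root word


def recognize(word: str) -> List[Features]:
    """Return every feature list whose path through the lexicon spells word."""
    analyses: List[Features] = []

    def walk(class_name: str, consumed: str, bag: Features) -> None:
        if class_name == "End":
            if consumed == word:
                analyses.append(bag)
            return
        for lexicon_name in ALTERNATIONS.get(class_name, [class_name]):
            for lexical, continuation, features in LEXICON.get(lexicon_name, []):
                for surface in surface_forms(lexical, consumed):
                    if word.startswith(consumed + surface):
                        walk(continuation, consumed + surface, bag + features)

    walk("INITIAL", "", [])
    return analyses


def show(features: Features) -> str:
    """Format a feature list the way the recognizer output is written here."""
    return "(" + " ".join(f"({name} {value})" for name, value in features) + ")"


if __name__ == "__main__":
    for word in "nai.ga si.jaq.ei gan.da".split():
        for analysis in recognize(word):
            print(word, show(analysis))
```

Because the search collects every successful path instead of stopping
at the first one, an ambiguous word would simply yield more than one
feature list.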

I found the morphology lexicon to be the most enjoyable and
interesting part of my summer research.  Having learned Korean from
my parents, I had never analyzed gan.da as ga + NUn.da.  The field of
linguistics in Korean is not well-established, so I had to start almost
from scratch.  The challenge was in determining which morphemes to group
together as a class, and how the classes should connect to one another
so that the structure was consistent for every sentence - not just every
sentence in the corpus, but for future sentences as well. 

One limitation of the recognizer is that it does not follow
continuation classes across word boundaries.  This was a problem because
some Korean phrases are logically one morpheme, but are split across
words.  An example of this is +ei dai.han which translates into the
English preposition about.  dai.han does not make sense by itself, so it
was difficult to assign a meaning and part of speech to it.  Furthermore,
the representation of dai.han had to be consistent with ei so that
somehow when put together, they meant about.  I resolved this by
creating a verb dai.ha meaning "to be about", which could then pick up
the relative complementizer +n.  Now, +ei dai.han is approximately
equivalent to about:

a.be.ji.ei dai.han i.ya.gi
father-dat be about-rel story
(the story that is about Father)

The morphology stage is actually relatively simple.  The "dirty
work" of machine translation gets done in stages 2 and 3 - parsing the
sentence and composing it into Lexical Conceptual Structure (LCS). 
Thankfully, I did not have to do this.  However, I did build the LCS
lexicon used in stage 3, which I will discuss in the next section. 


ALTERNATION /Begin VERBS NOUNS
ALTERNATION /VERBS VERBS
ALTERNATION /NOUNS NOUNS
ALTERNATION /POSTPOSITIONS POSTPOSITIONS
ALTERNATION /CASE CASE
ALTERNATION /VERB-END VERB-END
ALTERNATION /TENSE-VERB TENSE-VERB
ALTERNATION /HONOR-VERB HONOR-VERB
ALTERNATION /End End

LEXICON INITIAL
0 /Begin ""

;;; noun suffixes
LEXICON CASE
+ga /End "(cat noun) (case nom)"
+xl /End "(cat noun) (case acc)"

LEXICON POSTPOSITIONS
+ei /End "(cat noun post) (pgloss to)"
0 /CASE ""

;;; verb suffixes
LEXICON HONOR-VERB
+X.sI /TENSE-VERB "(cat verb) (honorific +)"
0 /TENSE-VERB "(cat verb) (honorific -)"

LEXICON TENSE-VERB
+Ass /VERB-END "(tense past)"
+NUn.da /End "(tense present) (mood indic)"
0 /VERB-END "(tense no-tense)"

LEXICON VERB-END
+da /End "(mood indic)"

;;; root words
LEXICON VERBS
ga /HONOR-VERB "(root ga) (gloss go)"
dxL /HONOR-VERB "(root dxl) (gloss eat)"

LEXICON NOUNS
nai /CASE "(root nai) (gloss I) (person first)"
a.be.ji /POSTPOSITIONS "(root a.be.ji) (gloss father)"
si.jaq /POSTPOSITIONS "(root si.jaq) (gloss store)"
je.nyeg /POSTPOSITIONS "(root je.nyeg) (gloss dinner)"


  Extremely simplified example of a morphology lexicon based on the two
sample sentences:
"nai.ga si.jaq.ei gan.da" : I go to the store
"a.be.ji.ga je.nyeg.xl dx.syess.da" : Father ate dinner


III.  Building an LCS Lexicon

Lexical Conceptual Structure (LCS) is a language-independent
representation of the meaning of a sentence.  Thus it serves as an
intermediate language in the MT process.  Whereas human translators would
translate Korean directly into English, machines translate the Korean
into LCS, and then translate the LCS into English.  This makes it easier
to expand the system for additional languages.  Theoretically, if a
Spanish-English MT system exists, we only need to write the Korean-LCS
portion to perform MT for Korean-English and Korean-Spanish. 

Linguists have determined that sentences center around verbs. 
Since the verb primarily determines the structure of the sentence,
the LCS lexicon focuses on verbs.  My task was to build the LCS lexicon
for Korean verbs.  This was more challenging than the morphology lexicon.
Although I was not able to include all the verbs in the language, I
reached my summer's goal of 1000 verbs.  I estimate the total number
of Korean verbs to be 2500-3000, so there are still plenty more. 

Understandably, LCS is very difficult to read.  A one-line
sentence in English might become a page or more in LCS.  Fortunately,
I did not have to write out the LCS representation for each verb by
hand.  Dr. Dorr has an acquisition program that generates the LCS for
a verb, given its Levin class and thematic grids.  Beth Levin is a
linguist who established a classification system for all English verbs. 
The classification is based on both semantics and syntax.  For example,
Hit Verbs (18.1) are similar in meaning to Swat Verbs (18.2).  Hit Verbs
include kick, strike, and tap.  Swat Verbs include peck, punch, and slug.
The difference?  For one, the Hit Verbs make sense in the context "I ____
the stick against the fence" but the Swat Verbs do not.  Thus they have
different thematic grids.

The thematic grids specify what arguments need to be present
for the clause to be well-formed.  For example, the verb conceal requires
an agent (John) and a theme (money), and may take a possessional modifier
(from his wife) and a locational modifier (in/under the desk).  The
thematic grid notation for conceal would be:

_ag_th,mod-poss(from),mod-loc() 

The underscore in front of ag and th indicates that those arguments
are obligatory, and the comma in front of mod-poss and mod-loc indicates
that those arguments are optional.  The from in parentheses means that
the possessional modifier must be preceded by the preposition from.
Empty parentheses mean that a preposition precedes the modifier but it
could be one of many.  The complete format for each LCS lexicon entry
is:

classnum#thematic grid#Korean verb#English translation#
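
The notation is regular enough to parse with a few lines of code.  The
Python sketch below is an illustration under stated assumptions, not
the acquisition program: the names Argument, LcsEntry, parse_grid, and
parse_lcs_entry are hypothetical, role names are assumed to consist of
lowercase letters and hyphens, and anything after the fourth # (such
as the tag on the ggu entry quoted later in this report) is kept as an
uninterpreted string because its syntax is not described here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Argument:
    role: str                    # e.g. "ag", "th", "mod-poss"
    obligatory: bool             # True for "_role", False for ",role"
    preposition: Optional[str]   # None: no parentheses; "": any preposition


@dataclass(frozen=True)
class LcsEntry:
    class_number: str            # kept as text, since classes look like 18.1
    grid: Tuple[Argument, ...]
    korean: str
    english: str
    extra: str                   # uninterpreted text after the fourth "#"


# One argument: marker, role name, optional "(preposition)".
ARG_RE = re.compile(r"([_,])([a-z-]+)(?:\(([^()]*)\))?")


def parse_grid(grid: str) -> List[Argument]:
    """Parse a thematic grid such as '_ag_th,mod-poss(from),mod-loc()'."""
    grid = grid.strip()
    arguments = []
    position = 0
    for match in ARG_RE.finditer(grid):
        if match.start() != position:
            break                # unparseable text between arguments
        position = match.end()
        marker, role, preposition = match.groups()
        arguments.append(Argument(role, marker == "_", preposition))
    if position != len(grid):
        raise ValueError(f"cannot parse thematic grid: {grid!r}")
    return arguments


def parse_lcs_entry(line: str) -> LcsEntry:
    """Parse 'classnum#thematic grid#Korean verb#English translation#...'."""
    fields = line.strip().split("#", 4)
    if len(fields) != 5:
        raise ValueError(f"expected four '#'-terminated fields: {line!r}")
    class_number, grid, korean, english, extra = fields
    return LcsEntry(class_number, tuple(parse_grid(grid)),
                    korean, english, extra.strip())


if __name__ == "__main__":
    for argument in parse_grid("_ag_th,mod-poss(from),mod-loc()"):
        print(argument)
```

Distinguishing None from the empty string keeps the difference between
an argument with no preposition and one that takes an unspecified
preposition.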

Most English verbs have more than one thematic grid.  Even a
simple verb like bake needs separate thematic grids for John baked Mary
a cake (_ag_ben_th), John baked a cake for Mary (_ag_th,ben(for)),
A baker bakes (_ag), The cake baked (_th), The oven bakes the
cake (_instr_th), etc.  Although the LCS lexicon for English verbs
lists about 5,000 unique verbs, it is about 12,000 lines long (one
line per thematic grid).   The entries are sorted by class, then by
thematic grid, then alphabetically by verb. 

Dr. Dorr suggested that I start from the English verb lexicon
and build the Korean lexicon by translating the English verbs.  However,
I thought that that method would leave out many Korean verbs that do not
have one-to-one correspondences with English verbs.  So I obtained a
Korean verb dictionary and started adding the verbs alphabetically. 
Progress was slow because I had to change gears for each verb.  I soon
discovered that a good number of verbs were accumulating at the bottom
of my file under Unresolved; I could not find classes for them. 
Levin's classification scheme is based on English verbs, and some Korean
verbs are quite incompatible with this scheme.  Examples are gag.o.ha,
which is best described as "to be willing to face / to be determined
despite", and "ggu", which means "to dream" when the noun "dream"
is the theme.  One cannot "ggu" anything other than a dream, yet the noun
"dream" must be present.  The verb "ggu" alone does not have meaning.

Problems like these prompted me to adopt Dr. Dorr's original
approach: translating the English verbs.  For each class, I first had
to figure out what distinguished the class from similar classes. 
Then I looked up each English verb in my Korean-English dictionary
(even the most simple verbs), found the sense that matched the current
class, and then looked up the Korean verb(s) on the other side to make
sure that I assigned the best possible English translation to the verb. 
Then I checked the thematic grid and modified it for the Korean verb. 
Although this was much faster than adding from the Korean dictionary,
there were still verbs that took days of fretting to resolve.
If I needed to delete or add an argument to the thematic grid for the
Korean verb, it would no longer belong to the same class.  In the case of
deleting, I usually changed the argument type to optional and left it
in the class.  In the case of adding or substituting, I often had to
search for another class that was semantically different but had the
correct thematic grid and LCS. 

I also resolved some of the incompatible verbs.  For example, I
assigned "ggu" to Engender Verbs (beget, spawn, etc.) and translated it
as "to dream", but I tagged the entry so that "dream" must be present as
the theme:

27#_ag_th#ggu#dream#!!-ingly = 0 (th = dream)

On the other hand, gag.o.ha is still in the Unresolved pile.  My
1000-verb lexicon consists of the completed classes 9-28 and a varying
number of verbs from classes 29-55.  My successor will need to complete classes
29-55 and then go through the Korean verb dictionary and add what is
missing.  I have left my comments with the file for as smooth a
transition as possible.

I would like to thank Dr. Dorr for allowing me this opportunity.
MT has sparked my interest in natural-language processing and I am now
considering pursuing it as a career.  I feel privileged to have been part
of the team this summer, and I am just beginning to appreciate the
challenges involved in each step of the translation process.   I
learned a great deal about MT and Korean linguistic phenomena, and I
feel that I have made a meaningful contribution to what may be the
first Korean MT project.   


-------
1.  This report uses the Korean MT team's home-brew transliteration.
While traditional transliterations are more phonologically accurate,
our transliteration simplifies the morphology rules. 
2.  Nominative case marker.
3.  Indicative marker.
4.  Accusative case marker.
5.  Dative case marker.
6.  Relative complementizer.