Thursday, September 11, 2014

Learning Apertium

   So, having decided to try to infer grammar rules for English-Japanese translation, I figured the first thing I should do is get comfortable with Apertium and with the kinds of documents and formatting I would be working with. From Apertium's documentation:
The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).
 "A few rules" means several thousand lines of opaque code that looks something like:
<rule>
   <pattern>
     <pattern-item n="nom"/>
   </pattern>
   <action>
     <out>
       <lu>
         <clip pos="1" side="tl" part="lem"/>
         <clip pos="1" side="tl" part="a_nom"/>
         <clip pos="1" side="tl" part="nbr"/>
       </lu>
     </out>
   </action>
</rule>
 but no one said it would be easy (or pretty), so I dove in and ran through the toy example they gave on their wiki. Here's what I learned:

- You need two monolingual (morphological) dictionaries, one for each language. Each defines the language's alphabet (though I'm not yet clear how that will work for kanji), the symbols that denote the POS tags used in the dictionary, a section they call "paradigms" that defines the stems, endings and POS tags of word classes, and a final section they call "main" that houses the lemmas, mapping the base form of each word to its proper paradigm. Below is an example taken from their tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
   <alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet>
   <sdefs>
      <sdef n="n"/>
      <sdef n="sg"/>
      <sdef n="pl"/>
      <sdef n="vblex"/>
      <sdef n="p1"/>
      <sdef n="pri"/>
   </sdefs>
   <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
      <e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e>
   </pardef>
   <pardef n="vid/eti__vblex">
      <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e>
   </pardef>
   <section id="main" type="standard">
      <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
      <e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e>
   </section>
</dictionary>
  • This dictionary has exactly two words in Serbo-Croatian: "gramofon" and "videti", which mean "gramophone" and "to see", respectively. The sdefs section defines the POS tag symbols, and the names can be whatever we want, provided they stay consistent across all the dictionaries. The attribute 'n' simply means "name" and appears on every symbol definition, not just the noun.
  • The paradigm definition for "gramofon" define three cases: a singular noun that doesn't change from its base form, a plural noun that adds "-i" to the end of its base form and a plural noun that adds "-e" to its base form.
  • The paradigm definition for "videti" has an ending "-im" that would replace "-ite" from the base form for a singular, first person, present indicative conjugation. Also, the stem "vid-" is defined in the pardef header.
  • In the "main" section, we have two lemmas, one for each word where you can see the base form and what paradigm it maps to.
- You need a bilingual dictionary to describe the mappings between the words of the two languages. Again, taken from the tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
   <alphabet/>
   <sdefs>
      <sdef n="n"/>
      <sdef n="sg"/>
      <sdef n="pl"/>
      <sdef n="vblex"/>
      <sdef n="p1"/>
      <sdef n="pri"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
      <e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e>
   </section>
</dictionary>

  • Note that exactly the same symbols are defined. The symbols have to be defined consistently across dictionaries for the mappings to work.
  • The "main" section contains the actual word mappings. In this case, Serbo-Croatian is on the left and English is on the right for each entry.
- Finally, the "few rules" document, the Transfer Rules. This is where things get messy. I'll post the example, then describe it.

<?xml version="1.0" encoding="UTF-8"?>
<transfer>
   <section-def-cats>
      <def-cat n="nom">
         <cat-item tags="n.*"/>
      </def-cat>
      <def-cat n="vrb">
         <cat-item tags="vblex.*"/>
      </def-cat>
      <def-cat n="prpers">
         <cat-item lemma="prpers" tags="prn.*"/>
      </def-cat>
   </section-def-cats>

   <section-def-attrs>
      <def-attr n="nbr">
         <attr-item tags="sg"/>
         <attr-item tags="pl"/>
      </def-attr>
      <def-attr n="a_nom">
         <attr-item tags="n"/>
      </def-attr>
      <def-attr n="temps">
         <attr-item tags="pri"/>
      </def-attr>
      <def-attr n="pers">
         <attr-item tags="p1"/>
      </def-attr>
      <def-attr n="a_verb">
         <attr-item tags="vblex"/>
      </def-attr>
      <def-attr n="tipus_prn">
         <attr-item tags="prn.subj"/>
         <attr-item tags="prn.obj"/>
      </def-attr>
   </section-def-attrs>

   <section-def-vars>
      <def-var n="number"/>
   </section-def-vars>

   <section-rules>
      <rule>
         <pattern>
            <pattern-item n="nom"/>
         </pattern>
         <action>
           <out>
              <lu>
                 <clip pos="1" side="tl" part="lem"/>
                 <clip pos="1" side="tl" part="a_nom"/>
                 <clip pos="1" side="tl" part="nbr"/>
              </lu>
           </out>
         </action>
      </rule>
      <rule>
         <pattern>
           <pattern-item n="vrb"/>
         </pattern>
         <action>
            <out>
               <lu>
                  <clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="a_verb"/>
                   <clip pos="1" side="tl" part="temps"/>
               </lu>
            </out>
         </action>
      </rule>
      <rule>
         <pattern>
           <pattern-item n="vrb"/>
         </pattern>
         <action>
            <out>
               <lu>
                  <lit v="prpers"/>
                  <lit-tag v="prn"/>
                  <lit-tag v="subj"/>
                  <clip pos="1" side="tl" part="pers"/>
                  <clip pos="1" side="tl" part="nbr"/>
              </lu>
              <b/>
              <lu>
                  <clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="a_verb"/>
                  <clip pos="1" side="tl" part="temps"/>
              </lu>
            </out>
         </action>
      </rule>
   </section-rules>

</transfer>
  • First, there are two ways to group our grammatical symbols: categories and attributes.
    • Categories are used for matching POS symbols; for example, n.* matches any noun reading.
    • Attributes group symbols into sets that a rule can choose from. For example, sg and pl both indicate number, so they are grouped under the attribute "nbr" (number).
  • Secondly, in <section-def-vars> we define global variables that the rules can use to store values (like a matched number tag) while they run.
  • Thirdly come the rules themselves. Each matches a pattern of categories, then performs the action attached to that pattern. Explaining what each tag means here without context would be tough, but the full tutorial is on the Apertium wiki.
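To make the third rule concrete, here is a rough Python analogue (the semantics are my own reading, not the actual transfer engine): on matching a verb, it outputs a subject pronoun built from the literal lemma and tags plus the verb's person and number clips, a blank, and then the verb with its lemma, category and tense.

```python
def parse(lu):
    """Split 'see<vblex><pri><p1><sg>' into its lemma and tag list."""
    lemma, _, rest = lu.partition("<")
    return lemma, rest.rstrip(">").split("><")

def vrb_rule(tl_lu):
    """Mimic the third rule: emit 'prpers<prn><subj>...' then the verb."""
    lemma, tags = parse(tl_lu)
    pers = [t for t in tags if t in ("p1",)]       # def-attr "pers"
    nbr = [t for t in tags if t in ("sg", "pl")]   # def-attr "nbr"
    temps = [t for t in tags if t in ("pri",)]     # def-attr "temps"
    pron = "prpers" + "".join(f"<{t}>" for t in ["prn", "subj"] + pers + nbr)
    verb = lemma + "".join(f"<{t}>" for t in ["vblex"] + temps)
    return pron + " " + verb  # the <b/> element becomes the blank between them

print(vrb_rule("see<vblex><pri><p1><sg>"))
# prpers<prn><subj><p1><sg> see<vblex><pri>
```

The generation stage would then turn that output into "I see". As I understand it, that is the whole point of this rule: Serbo-Croatian drops the subject pronoun, while English requires it, so the rule has to invent one from the verb's person and number.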
So, with a basic, if shaky, understanding of building dictionaries and rules, my next goal is to figure out the kanji/kana alphabet issue (listing every possible kanji in the alphabet would be daunting, to say the least) and to start sifting through prebuilt dictionaries and transfer rules for useful rules I can test against.

Thesis Topic Discussion

   As I started the literature review for my master's thesis on machine translation of English-Japanese text, I was struck by how there are essentially two schools of thought in MT. One is the old-fashioned, rule-based approach, where we feed the source-language sentence through a huge set of rules that morph its grammar into the target language's, then use a bilingual dictionary to translate the words. The other is the more modern, "big data" approach, where you take large amounts of text that has been translated into two languages and run an analysis on the words and word ordering to find the most probable translation. The latter is what Google does with its Google Translate service, and for similar languages it gives a decent, quick-and-dirty translation of small amounts of text (maybe a full sentence). The former was the approach taken by the early MT pioneers and was rather unsuccessful, largely due to the hardware constraints of the time, but also because it takes a lot of human expertise to code the rules properly, and then a good amount of processing time compared to the probabilistic approach.

   What bothered me, though, was that few of the papers I read ever used both approaches at the same time. The probabilistic method is fast and relatively accurate for single words and short phrases, while the rule-based method does a better job with precision. I always felt the real shortcoming of the big data approach was that it keeps track of nothing beyond a few words to the left or right of the word being processed, very much like taking the phrase "I love killing time reading novels" and only ever looking at "I love killing". It doesn't take long to lose track of what you are translating when you only focus on the immediate phrase.

   So, my initial thought was to take a probabilistic system, such as the one used by the Kyoto Free Translation Task, and add some form of translation memory, keeping track not only of previously translated segments to improve performance, but also of traces to improve pronoun translation for null-subject languages (languages where pronouns are not required in some situations). Then my advisor sent me a paper she came across by some researchers at Prompsit, a Spanish translation service, who used a hybrid approach, combining the rule-based software Apertium with the statistical software Moses, to generate phrase pairs by inferring rules from a parallel corpus. In other words, automatically generating rules that normally require a language expert to write. This was one of the few hybrid papers I had read that took one of the main drawbacks of rule-based translation and attempted to solve it head-on. It led me to more hybrid approaches, mostly using Apertium. However, Apertium has no Japanese rule bases at all. If I can glean grammar rules from a corpus in a way similar to what Prompsit did for English-French translation, it may be a step toward better automatic translation. And so the challenge begins!