Thursday, September 11, 2014

Learning Apertium

So, having decided to try to infer grammar rules for English-Japanese translation, I figured the first thing I should do is get comfortable with Apertium and figure out what kind of documents and formatting I would be working with. From Apertium's documentation:
"The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff)."
 "A few rules" means several thousand lines of opaque code that looks something like:
<rule>
   <pattern>
     <pattern-item n="nom"/>
   </pattern>
   <action>
     <out>
       <lu>
         <clip pos="1" side="tl" part="lem"/>
         <clip pos="1" side="tl" part="a_nom"/>
         <clip pos="1" side="tl" part="nbr"/>
       </lu>
     </out>
   </action>
</rule>
 but no one said it would be easy (or pretty), so I dove in and ran through the toy example they gave on their wiki. Here's what I learned:

- You need two monolingual (morphological) dictionaries, one for each language. Each one defines the language's alphabet (though I'm not clear yet how that will relate to kanji), the symbols that denote the POS tag(s) used in the dictionary, a section of "paradigms" that define the ending(s) and POS tag(s) for a class of words, and a last section called "main" that houses the entries giving each word's base form and stem and mapping it to its proper paradigm. Below is an example taken from their tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
   <alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet>
   <sdefs>
      <sdef n="n"/>
      <sdef n="sg"/>
      <sdef n="pl"/>
      <sdef n="vblex"/>
      <sdef n="p1"/>
      <sdef n="pri"/>
   </sdefs>
   <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
      <e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e>
   </pardef>
   <pardef n="vid/eti__vblex">
      <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e>
   </pardef>
   <section id="main" type="standard">
      <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
      <e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e>
   </section>
</dictionary>
  • This dictionary has exactly two words in Serbo-Croatian: "gramofon" and "videti", which mean "gramophone" and "to see", respectively. The sdefs section defines the tag symbols, which can be whatever we want them to be, provided they remain consistent throughout all the dictionaries. The attribute 'n' holds the name of every symbol, not just nouns.
  • The paradigm definition for "gramofon" defines three forms: a singular noun that doesn't change from its base form, a plural noun that adds "-i" to the end of its base form and a plural noun that adds "-e" to its base form.
  • The paradigm definition for "videti" has an ending "-im" that replaces "-eti" from the base form for the first person singular, present indicative conjugation. The slash in the pardef name "vid/eti__vblex" marks where the stem "vid-" ends; the stem itself is given in the <i> element of the entry in "main".
  • In the "main" section, we have two entries, one for each word, where you can see the base form (lm), the stem and the paradigm it maps to.
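As a rough illustration of how stems and paradigms combine, here is a minimal Python sketch that expands a trimmed copy of the dictionary above into surface forms and their analyses. This is not how Apertium's own tools compile a dictionary; it only mimics the stem + ending bookkeeping. The function names expand and side_to_text are made up for this example, and writing tags as <n><sg> after the lemma is a display choice made here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the tutorial dictionary (alphabet and sdefs left out for brevity).
DIX = """
<dictionary>
   <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
      <e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e>
   </pardef>
   <pardef n="vid/eti__vblex">
      <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e>
   </pardef>
   <section id="main" type="standard">
      <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
      <e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e>
   </section>
</dictionary>
"""


def side_to_text(elem):
    """Render an <l> or <r> element as text, writing each <s n="x"/> as <x>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "s":
            parts.append("<" + child.get("n") + ">")
        parts.append(child.tail or "")
    return "".join(parts)


def expand(dix_xml):
    """Return {surface form: [analyses]} for every stem/paradigm combination."""
    root = ET.fromstring(dix_xml)

    # Each paradigm becomes a list of (surface ending, analysis ending) pairs.
    paradigms = {}
    for pardef in root.iter("pardef"):
        pairs = []
        for p in pardef.iter("p"):
            pairs.append((side_to_text(p.find("l")), side_to_text(p.find("r"))))
        paradigms[pardef.get("n")] = pairs

    # Glue the stem from <i> onto both sides of every pair in its paradigm.
    forms = {}
    for entry in root.find("section").findall("e"):
        stem = entry.find("i").text
        for left, right in paradigms[entry.find("par").get("n")]:
            forms.setdefault(stem + left, []).append(stem + right)
    return forms


for surface, analyses in expand(DIX).items():
    print(surface, "->", analyses)
```

Running it lists four surface forms: "gramofon", "gramofoni", "gramofone" and "vidim", with "vidim" mapping back to videti<vblex><pri><p1><sg>.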
- You need a bilingual dictionary to describe the mappings between the words of the two languages. Again, taken from the tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
   <alphabet/>
   <sdefs>
      <sdef n="n"/>
      <sdef n="sg"/>
      <sdef n="pl"/>
      <sdef n="vblex"/>
      <sdef n="p1"/>
      <sdef n="pri"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
      <e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e>
   </section>
</dictionary>

  • Note the exact same symbols are defined. You have to consistently define the symbols in order to have them map properly.
  • The "main" section contains the actual word mappings. In this case, Serbo-Croatian is on the left and English is on the right for each entry.
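To make the mapping concrete, here is a small Python sketch of a lookup against the bilingual dictionary above. It assumes that an entry matches the start of an analysed word (lemma plus first tag) and that any remaining tags are carried over unchanged; that is my simplified reading, not a description of Apertium's actual lookup code. The names load_pairs and translate are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the bilingual dictionary (alphabet and sdefs left out).
BIDIX = """
<dictionary>
   <section id="main" type="standard">
      <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
      <e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e>
   </section>
</dictionary>
"""


def side_to_text(elem):
    """Render an <l> or <r> element as text, writing each <s n="x"/> as <x>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "s":
            parts.append("<" + child.get("n") + ">")
        parts.append(child.tail or "")
    return "".join(parts)


def load_pairs(bidix_xml):
    """Map each left side (source language) to its right side (target language)."""
    root = ET.fromstring(bidix_xml)
    return {side_to_text(p.find("l")): side_to_text(p.find("r"))
            for p in root.iter("p")}


def translate(lexical_form, pairs):
    """Swap the longest matching left-side prefix for its right side.

    Tags after the matched prefix are kept as they are (an assumption of
    this sketch). Returns None when no entry matches.
    """
    for left in sorted(pairs, key=len, reverse=True):
        if lexical_form.startswith(left):
            return pairs[left] + lexical_form[len(left):]
    return None


pairs = load_pairs(BIDIX)
print(translate("gramofon<n><pl>", pairs))
print(translate("videti<vblex><pri><p1><sg>", pairs))
```

The first call prints gramophone<n><pl> and the second prints see<vblex><pri><p1><sg>, which shows why the symbol names have to agree across the dictionaries: the tags are passed straight through.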
- Finally, the "few rules" document, the Transfer Rules. This is where things get messy. I'll post the example, then describe it.

<?xml version="1.0" encoding="UTF-8"?>
<transfer>
   <section-def-cats>
      <def-cat n="nom">
         <cat-item tags="n.*"/>
      </def-cat>
      <def-cat n="vrb">
         <cat-item tags="vblex.*"/>
      </def-cat>
      <def-cat n="prpers">
         <cat-item lemma="prpers" tags="prn.*"/>
      </def-cat>
   </section-def-cats>

   <section-def-attrs>
      <def-attr n="nbr">
         <attr-item tags="sg"/>
         <attr-item tags="pl"/>
      </def-attr>
      <def-attr n="a_nom">
         <attr-item tags="n"/>
      </def-attr>
      <def-attr n="temps">
         <attr-item tags="pri"/>
      </def-attr>
      <def-attr n="pers">
         <attr-item tags="p1"/>
      </def-attr>
      <def-attr n="a_verb">
         <attr-item tags="vblex"/>
      </def-attr>
      <def-attr n="tipus_prn">
         <attr-item tags="prn.subj"/>
         <attr-item tags="prn.obj"/>
      </def-attr>
   </section-def-attrs>

   <section-def-vars>
      <def-var n="number"/>
   </section-def-vars>

   <section-rules>
      <rule>
         <pattern>
            <pattern-item n="nom"/>
         </pattern>
         <action>
            <out>
               <lu>
                  <clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="a_nom"/>
                  <clip pos="1" side="tl" part="nbr"/>
               </lu>
            </out>
         </action>
      </rule>
      <rule>
         <pattern>
            <pattern-item n="vrb"/>
         </pattern>
         <action>
            <out>
               <lu>
                  <clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="a_verb"/>
                  <clip pos="1" side="tl" part="temps"/>
               </lu>
            </out>
         </action>
      </rule>
      <!-- Same pattern as the previous rule; this version also outputs a subject pronoun before the verb. -->
      <rule>
         <pattern>
            <pattern-item n="vrb"/>
         </pattern>
         <action>
            <out>
               <lu>
                  <lit v="prpers"/>
                  <lit-tag v="prn"/>
                  <lit-tag v="subj"/>
                  <clip pos="1" side="tl" part="pers"/>
                  <clip pos="1" side="tl" part="nbr"/>
               </lu>
               <b/>
               <lu>
                  <clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="a_verb"/>
                  <clip pos="1" side="tl" part="temps"/>
               </lu>
            </out>
         </action>
      </rule>
   </section-rules>

</transfer>
  • First, there are two ways to group our grammatical symbols: categories and attributes.
    • Categories are used in rule patterns to match words by their tags; for example, n.* matches nouns.
    • Attributes group symbols into types that a rule can choose from. For example, sg and pl both indicate number, so they are grouped under the attribute "nbr".
  • Second, in <section-def-vars> we define global variables, which are used to store attribute values and pass them between rules.
  • Third are the rules themselves. Each rule matches a certain pattern, then performs the action attached to that pattern. Trying to explain what each tag means here without context would be tough, but you can check out the full tutorial on the Apertium wiki.
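To get a feel for what the rules do, here is a minimal Python sketch of the noun rule and the pronoun-adding verb rule from the example. It is a loose model under stated assumptions, not Apertium's transfer engine: a category like n.* is treated as a simple prefix match on the tag list, each clip picks the first tag that belongs to the named attribute, and the multi-tag attribute tipus_prn is left out. All function names here are hypothetical.

```python
import re

# Attribute definitions copied from <section-def-attrs>: name -> tags it may hold.
# (tipus_prn is omitted because this sketch only handles single-tag items.)
ATTRS = {
    "nbr": ["sg", "pl"],
    "a_nom": ["n"],
    "temps": ["pri"],
    "pers": ["p1"],
    "a_verb": ["vblex"],
}


def parse(lexical_form):
    """Split 'see<vblex><pri>' into ('see', ['vblex', 'pri'])."""
    lemma = lexical_form.split("<", 1)[0]
    tags = re.findall(r"<([^>]+)>", lexical_form)
    return lemma, tags


def matches_category(tags, cat_tags):
    """Check a cat-item pattern such as 'n.*' against a tag list.

    Simplification: a trailing '*' means "any tags may follow"; without it
    the tag list must match exactly.
    """
    want = cat_tags.split(".")
    if want[-1] == "*":
        want = want[:-1]
        return tags[:len(want)] == want
    return tags == want


def clip(lexical_form, part):
    """Mimic <clip part="...">: return the lemma, or the tag for an attribute."""
    lemma, tags = parse(lexical_form)
    if part == "lem":
        return lemma
    for tag in tags:
        if tag in ATTRS[part]:
            return "<" + tag + ">"
    return ""  # the word carries no tag for this attribute


def noun_rule(lu):
    """First rule: output lemma, noun tag and number."""
    return "".join(clip(lu, part) for part in ("lem", "a_nom", "nbr"))


def verb_rule(lu):
    """Third rule: output a subject pronoun, a blank, then the verb."""
    pronoun = "prpers<prn><subj>" + clip(lu, "pers") + clip(lu, "nbr")
    verb = "".join(clip(lu, part) for part in ("lem", "a_verb", "temps"))
    return pronoun + " " + verb


print(noun_rule("gramophone<n><pl>"))
print(verb_rule("see<vblex><pri><p1><sg>"))
```

The verb rule prints prpers<prn><subj><p1><sg> see<vblex><pri>: the person and number are moved onto a new pronoun, and the verb keeps only its lemma, POS tag and tense.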
So, with a basic, if shaky, understanding of building dictionaries and rules, my next goal will be to figure out the kanji/kana alphabet issue (listing all the possible kanji in the alphabet would be daunting, to say the least) and to start sifting through prebuilt dictionaries and transfer rules to find some useful rules that I can test on.
