As I started my literature review for my master's thesis on machine translation of English-Japanese text, I was struck by how there are essentially two schools of thought in MT. One is the old-fashioned, rule-based approach: feed the source sentence through a large set of grammar rules that morph it into the target language's grammar, then use a bilingual dictionary to translate the words. The other is the more modern, "big data" approach: take large amounts of text that have been translated into two languages and run a statistical analysis over the words and word order to find the most probable translation. The latter is what Google does with its Google Translate service, and for similar languages it gives a decent, quick-and-dirty translation of small amounts of text (maybe a full sentence). The former was the approach taken by the early MT pioneers, and it was rather unsuccessful, largely due to the hardware constraints of the time, but also because it takes a lot of human expertise to code the rules properly and then a good amount of time to process, compared to the probabilistic approach.
What bothered me, though, was that few of the papers I read ever used both approaches at the same time. The probabilistic method is fast and reasonably accurate for single words and short phrases, while the rule-based method does a better job with precision. I always felt the real shortcoming of the big data approach is that it keeps track of nothing beyond a few words to the left or right of the word currently being processed; it is very much like taking the phrase "I love killing time reading novels" and looking only at "I love killing". It doesn't take long to lose track of what it is you are translating when you only focus on the immediate phrase.
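To make that limitation concrete, here is a toy sketch (my own illustration, not from any particular MT system) of the sliding trigram "window" such a model actually sees. Function and variable names are mine:

```python
# Toy illustration of how little context a low-order n-gram model sees.
sentence = "I love killing time reading novels".split()

def ngrams(tokens, n):
    """Return every contiguous n-word window of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for window in ngrams(sentence, 3):
    print(" ".join(window))
# The very first window is "I love killing" -- the idiom "killing time"
# is split across windows, so no single window sees the whole expression.
```

A model scoring each window independently has no way to tell this sentence apart from one that is genuinely about killing.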
So, my initial thought was to take a probabilistic system, such as the one used by the Kyoto Free Translation Task, and add some form of translation memory: it would keep track not only of previously translated segments to improve performance, but also of traces to improve pronoun translation for null-subject languages (languages, like Japanese, where pronouns are not required in some situations). Then my advisor sent me a paper she had come across from some researchers at Prompsit, a Spanish translation service, who used a hybrid approach: they combined the rule-based software Apertium with the statistical software Moses to generate phrase pairs by inferring rules from a parallel corpus. In other words, they automatically generated rules that would normally require a language expert to write. This was one of the few hybrid papers I had read that took one of the main drawbacks of rule-based translation and attacked it head-on. That led me to more hybrid approaches, mostly using Apertium. However, Apertium has absolutely no Japanese rule bases. If I can glean grammar rules from a corpus in a way similar to what Prompsit did for English-French translation, it may be a step toward better automatic translation. And so the challenge begins!
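As a rough sketch of the raw material such rule inference starts from (this is my own simplified version of standard phrase extraction, not the actual Prompsit/Apertium pipeline; all names and the example alignment are mine), one can pull out phrase pairs whose word-alignment links stay entirely inside the pair:

```python
# Toy phrase-pair extraction from one word-aligned sentence pair.
# alignment: set of (src_index, tgt_index) links between the two sentences.
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to this source span
            tgt_idx = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_idx:
                continue
            j1, j2 = min(tgt_idx), max(tgt_idx)
            # consistency: nothing inside the target span may link outside
            # the source span, otherwise the pair is not self-contained
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "the white house".split()
tgt = "la casa blanca".split()
links = {(0, 0), (1, 2), (2, 1)}  # white<->blanca, house<->casa (reordered)
print(extract_phrase_pairs(src, tgt, links))
```

Pairs like ("white house", "casa blanca") surface here with their reordering intact, and it is from many such pairs that a noun-adjective swap rule could be generalized without a linguist writing it by hand.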