myMT
About MT and SMT
Machine Translation (MT) is translation performed entirely by a
computer. Traditionally, MT systems were based on rules describing the
syntax of human languages. Such systems were costly to implement and
adapt, and improving output quality required writing ever more rules,
which undermined their cost-effectiveness and made a satisfactory
solution unlikely.
Statistical Machine Translation (SMT) was then proposed as an
alternative approach: the computer builds a language model and a
translation model by "learning" typical phrase structures and
translations from the customer's existing corpus of previously
translated documents. This approach is promising because it builds on
each customer's typical terminology and style, and it is entirely
automated. However, it suffers from some structural weaknesses, in
particular with regard to syntactic accuracy.
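The way the two models work together is usually written in the classic
"noisy-channel" form shown below. This is the textbook formulation, not
a description of our own configuration, and the symbols are introduced
here for illustration only: f is the source sentence, e is a candidate
translation, P(f | e) is the translation model and P(e) is the language
model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The second equality follows from Bayes' rule, since P(f) is constant
for a given source sentence. In practice, phrase-based systems tend to
generalise this product into a weighted combination of several model
scores.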
Future solutions therefore probably lie in a hybrid model combining
statistical machine translation with syntactic analysis (as well as
other approaches such as crowd-sourcing). Some such solutions have
already been built, but they still need improvement. Nevertheless,
machine translation is now back on the scene and will probably become a
lasting answer to high translation costs and tight deadlines.
myMT: Simple Shift's machine translation solution
Simple Shift's machine translation solution is currently built on the
Moses open-source software, to which we have added in-house corpus
processing tools:
- A converter which supports the following Office formats: doc, docx,
  xls, xlsx, ppt and pptx, as well as PDF (provided the PDF is not
  image-only), HTML, plain text (TXT) and XLIFF (an OASIS standard used
  in many Computer-Assisted Translation tools);
- A segmenter which cuts a document into sentences, producing the
  format required for the SMT system's training corpus;
- An aligner which automatically pairs up the corresponding source and
  target sentences to build the SMT training corpus as a TMX file (TMX
  is a LISA standard for translation memories);
- An extractor which separates the TMX content into training, tuning
  and evaluation files;
- A post-processing tool which turns Moses's output back into XLIFF
  format.
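To make the extraction step more concrete, here is a minimal Python
sketch of how a TMX file could be split into training, tuning and
evaluation sets. It is an illustration only, not our actual tool: the
function names (read_tmx_pairs, split_corpus, write_parallel), the
random split and the set sizes are all assumptions made for this
example. The output layout (one sentence per line, one file per
language) is the usual plain-text parallel corpus form.

```python
import random
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs found in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Collapse whitespace so each segment stays on a single line.
                segments[lang] = " ".join("".join(seg.itertext()).split())
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def split_corpus(pairs, tune_size, eval_size, seed=0):
    """Shuffle the pairs and split them into (train, tune, evaluation)."""
    if tune_size + eval_size >= len(pairs):
        raise ValueError("corpus too small for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    tune = shuffled[:tune_size]
    evaluation = shuffled[tune_size:tune_size + eval_size]
    train = shuffled[tune_size + eval_size:]
    return train, tune, evaluation


def write_parallel(pairs, prefix, src_lang, tgt_lang):
    """Write one sentence per line to <prefix>.<src_lang> and <prefix>.<tgt_lang>."""
    with open(f"{prefix}.{src_lang}", "w", encoding="utf-8") as src_file, \
         open(f"{prefix}.{tgt_lang}", "w", encoding="utf-8") as tgt_file:
        for source, target in pairs:
            src_file.write(source + "\n")
            tgt_file.write(target + "\n")
```

The three sets are kept disjoint so that the tuning and evaluation
sentences are never seen during training. A typical call would read the
pairs once, split them, then call write_parallel three times with
prefixes such as "train", "tune" and "eval" (hypothetical names).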
The corpus preparation phase is absolutely critical when setting up an
SMT system, but other factors must also be considered, in particular
the hardware architecture, which must guarantee reasonable system
performance, and some internal parameter settings in the Moses
application.
Simple Shift has set up several SMT systems for international
organizations and has acquired extensive experience of SMT systems,
backed up by specialized training with the main author of the Moses
source code.
We are currently researching ways to improve the quality of SMT output
by building various types of hybrid systems and improving our corpus
processing tools. As more international organizations and private
companies take an interest in this field, Simple Shift is set to expand
its MT activities and expertise.