
Simple Shift  
Language Engineering









About MT and SMT

Machine Translation (MT) is translation performed entirely by a computer. Traditionally, MT systems were based on rules describing the syntax of human languages. Such systems were costly to implement and adapt, and improving output quality required writing ever more rules, which undermined their cost-effectiveness and reduced the chances of ever reaching a satisfactory solution.

Statistical Machine Translation (SMT) was then proposed as an alternative approach: the computer builds a language model and a translation model by "learning" typical phrase structures and translations from the customer's existing corpus of previously translated documents. This approach is promising because it builds on each customer's typical terminology and style, and because it is entirely automated. However, it suffers from some structural weaknesses, in particular limited syntactic accuracy.
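As a rough illustration of how the two models work together, the classic noisy-channel formulation of SMT can be written as follows. The symbols are our own notation for this sketch: f is the source sentence and e is a candidate translation.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} \, P(f \mid e)\, P(e)
```

Here P(f | e) is the translation model, estimated from the aligned bilingual corpus, and P(e) is the language model, estimated from target-language text. Phrase-based systems generally extend this into a weighted combination of several such scores, so the formula should be read as a simplified picture rather than a full description of any particular system.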

Future solutions therefore probably lie in a hybrid model combining statistical machine translation with syntactic analysis (as well as other approaches such as crowd-sourcing). Some such solutions have already been built, but they still need improvement. Nevertheless, machine translation is now back on the scene and will probably become a lasting answer to the high costs and tight deadlines of translation.

myMT: Simple Shift's machine translation solution


Simple Shift's machine translation solution is currently built on the Moses open-source software, to which we have added our own corpus processing tools:

  • A converter which supports the following Office formats: doc, docx, xls, xlsx, ppt and pptx, as well as PDF (provided the file contains text rather than a scanned image), HTML, plain text (TXT) and XLIFF (an OASIS standard used in many Computer-Assisted Translation tools);
  • A segmenter which cuts a document into sentences in order to set up the correct format for the SMT system's training corpus;
  • An aligner which automatically pairs up the corresponding source and target sentences so as to build the SMT training corpus into a TMX file (TMX is a LISA standard for translation memories);
  • An extractor which separates the TMX content into training, tuning and evaluation files;
  • A post-processing tool which turns Moses's output back into XLIFF format.
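
The extraction step in the list above can be sketched in a few lines of Python. This is only a minimal illustration and not our production tool: the function names `read_tmx_pairs` and `split_corpus`, the sample data, the split sizes and the random seed are all assumptions made for the example. It reads sentence pairs from a TMX document and divides them into training, tuning and evaluation sets.

```python
import random
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hypothetical TMX sample; the last unit has no French segment.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello.</seg></tuv>
        <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Thank you.</seg></tuv>
        <tuv xml:lang="fr"><seg>Merci.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Good night.</seg></tuv>
        <tuv xml:lang="fr"><seg>Bonne nuit.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>See you soon.</seg></tuv>
        <tuv xml:lang="fr"><seg>A bientot.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Untranslated.</seg></tuv></tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs found in a TMX string.

    Language codes are compared exactly (case-insensitively), so "en"
    does not match "en-US" in this simplified sketch.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have both languages with non-empty text.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def split_corpus(pairs, tune_size, eval_size, seed=0):
    """Shuffle reproducibly, then split into (train, tune, evaluation)."""
    if tune_size + eval_size >= len(pairs):
        raise ValueError("not enough sentence pairs left for training")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    tune = shuffled[:tune_size]
    evaluation = shuffled[tune_size:tune_size + eval_size]
    train = shuffled[tune_size + eval_size:]
    return train, tune, evaluation


pairs = read_tmx_pairs(SAMPLE_TMX, "en", "fr")
train, tune, evaluation = split_corpus(pairs, tune_size=1, eval_size=1)
print(len(train), len(tune), len(evaluation))
```

Keeping the three sets disjoint matters because tuning and evaluation are only meaningful on sentences the system did not see during training. In practice each set would then be written out as a pair of plain-text files, one sentence per line, with matching line numbers on the source and target sides.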

The corpus preparation phase is critical when setting up an SMT system, but other factors must also be examined, in particular the hardware architecture, which must guarantee reasonable system performance, and some internal parameter settings in the Moses application.

Simple Shift has set up several SMT systems for international organizations and has acquired extensive experience with such systems, backed up by specialized training with the main author of Moses' source code.

We are currently researching ways to improve the quality of SMT output by building several types of hybrid systems and by improving our corpus processing tools. As more international organizations and private companies become interested in this field, Simple Shift intends to expand its MT activities and expertise.