January 7, 2014

Feature Decay Algorithms for Fast Deployment of Accurate Statistical Machine Translation Systems

Ergun Biçici. Feature Decay Algorithms for Fast Deployment of Accurate Statistical Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. Keyword(s): Machine Translation, Machine Learning, Language Modeling.

We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation (SMT) systems, taking only about half a day for each translation direction. We develop parallel FDA to solve the computational scalability problems caused by the abundance of training data for SMT models and language models (LMs), while still achieving SMT performance on par with, or better than, using all of the training data. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections later. Parallel FDA can also be used to select the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain very accurate translation outputs, close to those of the top-performing SMT systems. The selected LM corpus is highly relevant: it yields up to an 86% reduction in the number of out-of-vocabulary (OOV) tokens and up to a 74% reduction in perplexity. We perform SMT experiments on all language pairs in the WMT13 translation task and obtain SMT performance close to the top systems while using significantly fewer resources for training and development.
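To make the "separate FDA models on randomized subsets, then combine the selections" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the scoring function, so the n-gram features (up to bigrams), the exponential decay `decay ** count`, the length normalization, and the simple union used to merge selections are all assumptions, and the names `ngrams`, `fda_select`, and `parallel_fda` are hypothetical.

```python
import random
from collections import Counter


def ngrams(tokens, max_n=2):
    """All n-grams of the token list up to length max_n (assumed feature set)."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(candidates, test_features, k, decay=0.5):
    """Greedily pick up to k sentences from one subset.

    candidates: list of (index, tokens) pairs.
    A feature's value decays each time a selected sentence contains it
    (assumed form: decay ** times_selected), which favours sentences
    that cover test-set features not yet covered.
    """
    counts = Counter()
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        scores = {}
        for idx, tokens in remaining.items():
            feats = set(ngrams(tokens)) & test_features
            total = sum(decay ** counts[f] for f in feats)
            scores[idx] = total / max(len(tokens), 1)  # length-normalized
        # Highest score wins; ties go to the lowest index.
        best = max(scores, key=lambda i: (scores[i], -i))
        if scores[best] == 0:
            break  # nothing left that overlaps the test set
        selected.append(best)
        for f in set(ngrams(remaining.pop(best))) & test_features:
            counts[f] += 1
    return selected


def parallel_fda(train_sentences, test_sentences, k, num_splits=2, seed=0):
    """Run FDA independently on random splits and merge the selections."""
    test_features = {f for s in test_sentences for f in ngrams(s.split())}
    indexed = [(i, s.split()) for i, s in enumerate(train_sentences)]
    random.Random(seed).shuffle(indexed)  # randomized subsets
    splits = [indexed[j::num_splits] for j in range(num_splits)]
    per_split = -(-k // num_splits)  # ceiling division
    merged = []
    for split in splits:  # each call is independent, so it could run in its own process
        merged.extend(fda_select(split, test_features, per_split))
    return sorted(merged)  # assumed merge strategy: plain union


corpus = ["the cat sat", "a dog barked", "the cat ran", "stocks fell sharply"]
test_set = ["the cat sat on the mat"]
chosen = parallel_fda(corpus, test_set, k=2, num_splits=2)
print([corpus[i] for i in chosen])
```

Because each split is processed without reference to the others, the per-split calls could be dispatched to separate workers. The trade-off is that decay counts are not shared across splits, so the merged selection may contain more redundancy than a single FDA run over the whole corpus.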
