June 23, 2014

Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems

Ergun Biçici, Qun Liu, and Andy Way. Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, USA, June 2014. Association for Computational Linguistics. [PDF] Keyword(s): Machine Translation, Machine Learning, Language Modeling. [Abstract] [bibtex-entry]

We use parallel FDA5, an efficiently parameterized and optimized parallel implementation of feature decay algorithms, for fast deployment of accurate statistical machine translation systems, taking only about half a day for each translation direction. We build parallel FDA5 Moses SMT systems for all language pairs in the WMT14 translation task and obtain SMT performance close to the top Moses systems, with an average difference of 3.49 BLEU points, while using significantly fewer resources for training and development.

Monolingual and Bilingual Text Quality Judgments with Translation Performance Prediction


We have won funding from Science Foundation Ireland (SFI) for "Monolingual and Bilingual Text Quality Judgments with Translation Performance Prediction", where we target solutions in text analytics, quality, and similarity using translation performance prediction technology. You are welcome to check out the project's website and read the related CNGL news article.

January 7, 2014

Feature Decay Algorithms for Fast Deployment of Accurate Statistical Machine Translation Systems

Ergun Biçici. Feature Decay Algorithms for Fast Deployment of Accurate Statistical Machine Translation Systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. [PDF] Keyword(s): Machine Translation, Machine Learning, Language Modeling. [Abstract] [bibtex-entry]

We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation systems, taking only about half a day for each translation direction. We develop parallel FDA to solve the computational scalability problems caused by the abundance of training data for SMT models and language models, while still achieving SMT performance on par with, or better than, using all of the training data. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections afterwards. Parallel FDA can also be used to select the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain translation outputs close to those of the top performing SMT systems. The relevancy of the selected LM corpus can reach up to an 86% reduction in the number of OOV tokens and up to a 74% reduction in perplexity. We perform SMT experiments on all language pairs in the WMT13 translation task and obtain SMT performance close to the top systems using significantly fewer resources for training and development.
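The selection procedure described above can be sketched in miniature. The following is a simplified, hypothetical illustration, not the FDA5 implementation: it scores candidate sentences by their test-set n-gram overlap, decays a feature's value as 1/(1+count) each time it is reused (the actual decay parameterization in FDA5 differs), and merges selections from randomized shards in the spirit of parallel FDA. All function names and the toy decay are invented for this sketch.

```python
import random
from collections import Counter

def ngrams(sentence, n_max=2):
    """All 1..n_max-grams of a whitespace-tokenized sentence."""
    toks = sentence.split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return feats

def fda_select(pool, test_features, k):
    """Greedily pick k sentences; a feature's value decays each time
    an already-selected sentence has used it."""
    counts = Counter()          # how often each test feature was selected
    selected, remaining = [], list(pool)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda s: sum(1.0 / (1 + counts[f])
                                     for f in ngrams(s) if f in test_features))
        selected.append(best)
        remaining.remove(best)
        for f in ngrams(best):
            if f in test_features:
                counts[f] += 1
    return selected

def parallel_fda(pool, test_set, k, shards=4):
    """Parallel FDA sketch: run FDA on randomized shards, merge selections."""
    test_features = set(f for s in test_set for f in ngrams(s))
    shuffled = list(pool)
    random.shuffle(shuffled)
    parts = [shuffled[i::shards] for i in range(shards)]
    merged = []
    for part in parts:          # each shard is independent work
        merged += fda_select(part, test_features, k // shards)
    return merged
```

Because each shard's `fda_select` call is independent, in practice the shards would run in separate processes, which is where the scalability gain comes from.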

Referential Translation Machines for Quality Estimation

Ergun Biçici. Referential Translation Machines for Quality Estimation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. [PDF] Keyword(s): Machine Translation, Machine Learning, Quality Estimation, Natural Language Processing. [Abstract] [bibtex-entry]

We introduce referential translation machines (RTM) for quality estimation of translation outputs. RTMs are a computational model for identifying the translation acts between any two data sets with respect to a reference corpus selected in the same domain, which can be used for estimating the quality of translation outputs, judging the semantic similarity between texts, and evaluating the quality of student answers. RTMs achieve top performance in automatic, accurate, and language-independent prediction of sentence-level and word-level statistical machine translation (SMT) quality. RTMs remove the need to access any SMT-system-specific information or prior knowledge of the training data or models used when generating the translations. We develop novel techniques for solving all subtasks of the WMT13 quality estimation (QE) task (QET 2013) based on individual RTM models. Our results improve over last year's QE task results (QET 2012) as well as our previous results, provide new features and techniques for QE, and rank 1st or 2nd in all of the subtasks.

CNGL: Grading Student Answers by Acts of Translation

Ergun Biçici and Josef van Genabith. CNGL: Grading Student Answers by Acts of Translation. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics and Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, 14-15 June 2013. Association for Computational Linguistics. [WWW] [PDF] Keyword(s): Machine Translation, Machine Learning, Quality Estimation, Natural Language Processing. [Abstract] [bibtex-entry]

We invent referential translation machines (RTMs), a computational model for identifying the translation acts between any two data sets with respect to a reference corpus selected in the same domain, which can be used for automatically grading student answers. RTMs make quality and semantic similarity judgments possible by using retrieved relevant training data as interpretants for reaching shared semantics. An MTPP (machine translation performance predictor) model derives features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and the presence of acts of translation involved. We view question answering as translation from the question to the answer, from the question to the reference answer, from the answer to the reference answer, or from the question and the answer to the reference answer. Each view is modeled by an RTM model, giving us a new perspective on the ternary relationship between the question, the answer, and the reference answer. We show that all RTM models contribute, and a prediction model based on all four perspectives performs best. Our prediction model is the 2nd best system on some tasks according to the official results of the Student Response Analysis (SRA 2013) challenge.
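The four grading perspectives above can be made concrete with a toy sketch. Here the RTM scorer is replaced by a plain token-overlap (Jaccard) similarity, a stand-in invented for this example; the actual RTM features are far richer. Only the structure of the four views is illustrated:

```python
def overlap(a, b):
    """Jaccard similarity over whitespace tokens (toy stand-in for an RTM score)."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 0.0

def view_features(question, answer, reference):
    """One feature per 'act of translation' view of the grading problem."""
    return {
        "q->a":    overlap(question, answer),
        "q->ref":  overlap(question, reference),
        "a->ref":  overlap(answer, reference),
        "qa->ref": overlap(question + " " + answer, reference),
    }
```

A prediction model would then combine the four view scores, mirroring the finding that all four perspectives contribute.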

CNGL-CORE: Referential Translation Machines for Measuring Semantic Similarity

Ergun Biçici and Josef van Genabith. CNGL-CORE: Referential Translation Machines for Measuring Semantic Similarity. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA, 13-14 June 2013. Association for Computational Linguistics. [WWW] [PDF] Keyword(s): Machine Translation, Machine Learning, Quality Estimation, Natural Language Processing, Artificial Intelligence. [Abstract] [bibtex-entry]

We invent referential translation machines (RTMs), a computational model for identifying the translation acts between any two data sets with respect to a reference corpus selected in the same domain, which can be used for judging the semantic similarity between texts. RTMs make quality and semantic similarity judgments possible by using retrieved relevant training data as interpretants for reaching shared semantics. An MTPP (machine translation performance predictor) model derives features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and the presence of acts of translation involved. We view semantic similarity as paraphrasing between any two given texts. Each view is modeled by an RTM model, giving us a new perspective on the binary relationship between the two. Our prediction model ranks 15th on some tasks and 30th overall out of 89 submissions according to the official results of the Semantic Textual Similarity (STS 2013) challenge.

Predicting Sentence Translation Quality Using Extrinsic and Language Independent Features

Ergun Biçici, Declan Groves, and Josef van Genabith. Predicting Sentence Translation Quality Using Extrinsic and Language Independent Features. Machine Translation, 2013. Keyword(s): Machine Translation, Machine Learning, Quality Estimation. [Abstract] [bibtex-entry]

We develop a top performing model for automatic, accurate, and language-independent prediction of sentence-level statistical machine translation (SMT) quality, with or without looking at the translation outputs. We derive various feature functions measuring the closeness of a given test sentence to the training data and the difficulty of translating the sentence. We describe mono feature functions that are based on statistics of only one side of the parallel training corpora and duo feature functions that incorporate statistics involving both source and target sides of the training data. Overall, we describe novel, language-independent, and SMT-system-extrinsic features for predicting SMT performance, which also rank high during feature ranking evaluations. We experiment with different learning settings, with or without looking at the translations, which help differentiate the contribution of different feature sets. We apply partial least squares and feature subset selection, both of which improve the results, and we present a ranking of the top features selected for each learning setting, providing an exhaustive analysis of the extrinsic features used. We show that by just looking at the test source sentences, without using the translation outputs at all, we can achieve better performance than a baseline system using SMT-model-dependent features that generated the translations. Furthermore, when also looking at the translation outputs, our prediction system achieves the 2nd best performance overall according to the official results of the Quality Estimation Task (QET) challenge. Our representation and features achieve the top performance in QET among the models using the SVR learning model.
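As a rough illustration of the mono/duo distinction, a mono feature can be computed from only one side of the training corpus, while a duo feature consults both sides. The two features below, a source-side n-gram coverage and an aligned-target fan-out, are simplified stand-ins invented for this sketch, not the paper's feature set:

```python
from collections import defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mono_coverage(test_sent, source_corpus, n=1):
    """mono-style feature: fraction of the test sentence's n-grams that
    appear on the source side of the training corpus."""
    train = set(g for s in source_corpus for g in ngrams(s.split(), n))
    test = ngrams(test_sent.split(), n)
    return sum(g in train for g in test) / len(test) if test else 0.0

def duo_fanout(test_sent, parallel_corpus, n=1):
    """duo-style feature: average number of distinct target sentences aligned
    with source sentences containing each test n-gram (a rough ambiguity cue)."""
    targets = defaultdict(set)
    for src, tgt in parallel_corpus:
        for g in ngrams(src.split(), n):
            targets[g].add(tgt)
    test = ngrams(test_sent.split(), n)
    return sum(len(targets[g]) for g in test) / len(test) if test else 0.0
```

In the paper's setting, many such feature values per sentence would feed a learner such as SVR to predict sentence-level translation quality.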

Predicting Machine Translation Performance

Presentation @ Dublin City University Faculty Research Day, 2012.
https://twitter.com/SophieMatabaro/status/245848492940083200