Can we improve semantic and discourse-level properties of SMT output?

Bonnie Webber

Statistical Machine Translation (SMT) is currently limited by two forms of locality. One is the locality of a single sentence, which limits how much is translated at one time: a standard SMT system processes sentences independently of one another, and even the order in which they are processed does not matter. The second is the N-gram locality of the Language Model used in SMT, which limits how much of an output translation can be simultaneously assessed as a good sub-string in the target language.
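The N-gram locality can be made concrete with a minimal sketch (not from the talk; the corpus, smoothing constant, and vocabulary size are illustrative assumptions). A trigram language model scores each word conditioned only on the previous two, so no window wider than three words is ever assessed jointly:

```python
# Illustrative sketch of N-gram locality: a trigram LM with add-alpha
# smoothing. Each probability term sees only a 3-word window, so the
# model cannot judge any longer-range property of the output.
import math
from collections import Counter

def train_trigram(corpus):
    """Count trigrams and their bigram contexts over tokenised sentences."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def logprob(sent, tri, bi, alpha=0.1, vocab=1000):
    """Smoothed log-probability; each term conditions on 2 words at most."""
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    lp = 0.0
    for i in range(2, len(toks)):
        num = tri[tuple(toks[i - 2:i + 1])] + alpha
        den = bi[tuple(toks[i - 2:i])] + alpha * vocab
        lp += math.log(num / den)
    return lp

# Toy corpus (assumed for illustration only).
corpus = [["the", "cat", "sat"], ["the", "cat", "slept"]]
tri, bi = train_trigram(corpus)
print(logprob(["the", "cat", "sat"], tri, bi))    # fluent local windows
print(logprob(["sat", "cat", "the"], tri, bi))    # same words, bad windows
```

A reordered sentence scores lower only because its 3-word windows are unfamiliar; any disfluency spanning more than three words would be invisible to this model, which is the locality the abstract points to.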

Neither of these localities provides enough of a view of an output translation to ensure that it is syntactically correct, semantically adequate for expressing the source message in the target, or discourse-appropriate for its position in a multi-sentence text. If an output translation ends up satisfying these criteria, it is more a matter of frequency in the training data and luck than of making linguistically informed choices.

In this talk, I will briefly describe some efforts at Edinburgh and elsewhere to improve one aspect of semantics in translation (consistent expression of negation) and two aspects of discourse (appropriate signalling of coreference and appropriate signalling of discourse relations).