Machine Translation/Evaluation

Why we need evaluation

In many NLP tasks, researchers need to know if their changes to algorithms improve or degrade the overall performance. In MT, we evaluate performance of MT systems.

Evaluation of MT is harder than many other NLP tasks, since there is no single perfect translation of a sentence: there are many semantically equivalent or similar sentences.

To do:
put an example here

What is evaluated?

Fluency

Is the translation in natural word order? Is the text fluent? Does it contain grammar errors?

Adequacy

Does the translation preserve the meaning of the original? Is part of the meaning lost, added or skewed?

Intelligibility

Is the translation comprehensible?

Manual evaluation

In manual evaluation, annotators usually assess the qualities above on a five-point scale^{[citation needed]}.

score	adequacy	fluency
5	all meaning	flawless English
4	most meaning	good
3	much meaning	non-native
2	little meaning	disfluent
1	no meaning	incomprehensible

To do:
add example of an annotation tool here

The disadvantages of manual evaluation are clear. It is:

slow,
expensive,
subjective.

Inter-annotator agreement (IAA) studies show that people agree more when assessing fluency than when assessing adequacy.^{[citation needed]}

The evaluation can also be formulated as a comparison of two candidate translations, which may be much easier for annotators to assess. This can increase IAA.^{[citation needed]}

Post-editing time

Cost saved

Automatic evaluation

Since manual evaluation is very slow and costly, automatic methods are used.

The paradox is that we let computers assess automatic translations, which is as if we asked students to proofread their own essays. Another problem is that automatic methods usually output a score for a given pair of reference and candidate sentences, and this score is not straightforward to interpret.

The main prerequisite is to have manual reference translations (a gold standard), which are automatically compared to candidate translations from an MT system. Each candidate translation is compared to one or more reference translations, and automatic metrics then quantify this comparison.

Recall and precision

These two metrics come from information retrieval (IR) and are also used in the evaluation of many NLP tasks. Their harmonic mean is called the F-score; it combines the two metrics into one score, which is easier to work with. To apply them to MT quality evaluation, we need to represent candidate and reference sentences as bags of words (BOW).

${\text{F-score}}=2\times {{{\text{precision}}\times {\text{recall}}} \over {{\text{precision}}+{\text{recall}}}}$

Precision is defined as the number of correct words in the candidate sentence (words that also occur in the reference sentence) divided by the number of words in the candidate sentence. Recall has the same numerator, but its denominator is the number of words in the reference sentence.

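Let us consider the following pair of sentences. MT system output: "I did not something wrong"; reference translation: "I have not done anything wrong". The two sentences share three words (I, not, wrong), the candidate has five words and the reference has six.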

${\text{precision}}={3 \over 5}=60\%$ ${\text{recall}}={3 \over 6}=50\%$ ${\text{F-score}}=2\times {{0.6\times 0.5} \over {0.6+0.5}}\approx 54.5\%$

The formula clearly does not capture word order: if the candidate translation contains all the words of the reference, in any scrambled order, the F-score will be 100%.
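The bag-of-words computation can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, lowercasing, a single reference); the function name `bow_scores` is made up for this example.

```python
from collections import Counter

def bow_scores(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) computed over bags of words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of every shared word.
    correct = sum((cand & ref).values())
    precision = correct / sum(cand.values()) if cand else 0.0
    recall = correct / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = bow_scores("I did not something wrong", "I have not done anything wrong")
print(f"precision={p:.2f} recall={r:.2f} F={f:.3f}")
# prints: precision=0.60 recall=0.50 F=0.545

# A scrambled copy of the reference still gets a perfect score,
# because a bag of words ignores word order.
print(bow_scores("wrong anything done not have I", "I have not done anything wrong"))
```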

N-gram methods

This class of evaluation metrics uses n-gram precision between candidate and reference sentences. N-grams help to capture word order.

BLEU

Probably the most popular evaluation metric is BLEU^{[citation needed]}. It was developed at IBM by Papineni and coauthors. It uses the precision of n-grams up to $n=4$ and also penalizes candidate sentences that are too short. A good translation is expected to have the same length as the reference translations.

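The candidate sentence c is scored against the reference sentence r with the following simplified formula:

${\text{BLEU}}=\min \left(1,{{\text{length}}(c) \over {\text{length}}(r)}\right)\left(\prod _{i=1}^{4}{\text{precision}}_{i}\right)^{1 \over 4}$

The first factor is the brevity penalty. The original BLEU definition uses an exponential brevity penalty and clipped n-gram counts collected over a whole test corpus; the formula above is a sentence-level simplification.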

Let us consider the previous example (system A) plus another candidate translation from a system B: "He has not done anything wrong".

To do:
Add a visualization

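metric	system A	system B
$precision_{1}$	3/5	4/6
$precision_{2}$	0/4	3/5
$precision_{3}$	0/3	2/4
$precision_{4}$	0/2	1/3
brevity penalty	5/6	6/6
BLEU	0.00	0.51

Each denominator is the number of n-grams in the candidate: a six-word sentence contains five bigrams, four trigrams and three 4-grams.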
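The simplified sentence-level formula above can be sketched as follows. This is not a reference BLEU implementation: it assumes whitespace tokenization, a single non-empty reference, and no smoothing, and the function names are made up for this example. The sketch divides each n-gram count by the number of n-grams in the candidate (for instance, five bigrams in a six-word sentence), as in the original BLEU definition, which is why its result for system B (about 0.51) can differ from figures computed with the word count as the denominator.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(cand, ref, n):
    """Share of candidate n-grams that also occur in the reference (clipped)."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(cand_counts.values())
    if total == 0:  # candidate shorter than n words
        return 0.0
    matched = sum((cand_counts & ref_counts).values())
    return matched / total

def simple_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    brevity = min(1.0, len(cand) / len(ref))  # assumes a non-empty reference
    product = 1.0
    for n in range(1, max_n + 1):
        product *= ngram_precision(cand, ref, n)
    return brevity * product ** (1 / max_n)

reference = "I have not done anything wrong"
print(round(simple_bleu("I did not something wrong", reference), 2))       # system A
print(round(simple_bleu("He has not done anything wrong", reference), 2))  # system B
```

System A scores zero because a single n-gram precision of zero cancels the whole product; practical implementations usually apply smoothing to avoid this for short sentences.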

NIST

NIST stands for National Institute of Standards and Technology, which defined its own metric derived from the BLEU score.^{[citation needed]} It weights n-gram precision according to information value: rarer n-grams are considered more informative and contribute more to the score.

To do:
add an example

NEVA

NEVA stands for Ngram EVAluation. Since BLEU uses precision for n-grams up to 4-grams, short sentences are disadvantaged by the formula. NEVA takes this into account and also assesses stylistic richness using synonyms.^{[citation needed]}

Edit distance methods

WAFT

WAFT stands for Word Accuracy For Translation. It uses edit distance to compare candidate and reference translations.

${\text{WAFT}}=1-{{d+s+i} \over {\max(l_{r},l_{c})}}$

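where d, s and i are the numbers of deletions, substitutions and insertions needed to turn the candidate into the reference, and $l_{r}$ and $l_{c}$ are the lengths of the reference and candidate sentences. The score is thus normalized by the length of the longer of the two compared sentences.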
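A minimal sketch of this score, assuming whitespace tokenization and the standard word-level Levenshtein distance as the minimal number of edits; the function names are made up for this example.

```python
def edit_distance(cand, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))  # distance from an empty candidate
    for i, c_word in enumerate(cand, start=1):
        curr = [i]
        for j, r_word in enumerate(ref, start=1):
            cost = 0 if c_word == r_word else 1
            curr.append(min(prev[j] + 1,          # delete c_word
                            curr[j - 1] + 1,      # insert r_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def waft(candidate, reference):
    """1 - (d + s + i) / max(l_r, l_c), with the minimal number of edits."""
    cand, ref = candidate.split(), reference.split()
    longest = max(len(cand), len(ref))
    if longest == 0:  # two empty sentences are identical
        return 1.0
    return 1 - edit_distance(cand, ref) / longest

print(waft("I did not something wrong", "I have not done anything wrong"))
# prints: 0.5  (two substitutions and one insertion, longer sentence has 6 words)
```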

WER

TER, HTER

TER stands for Translation Edit Rate. In addition to insertion, deletion and substitution, swapping (shifting) of words is allowed as an edit operation.

${\text{TER}}={{\text{the least number of edits}} \over {\text{average length of reference sentences}}}$

TER can be used for multiple reference translations.

To do:
Example

The evaluation can also be done against a manually prepared translation, typically a human post-edit of the candidate; this variant is called HTER (human-targeted TER).^{[citation needed]}

Other techniques

Meteor^[1]

Many evaluation metrics do not consider synonyms and morphology. Suppose you translate into English and use "boy" where the reference translation, "He was such a kind lad", uses "lad". The candidate translation isn't wrong, but if n-grams are used for scoring, the score is substantially lowered.

To overcome this disadvantage, synonyms can be considered in the scoring. Another disadvantage of comparing candidate and reference translations word by word is that the translation error is sometimes at the sub-word level, e.g. a wrong suffix is chosen (singular vs. plural). Again, strictly word-based evaluation methods will assign a score that is too low.

The METEOR metric tries to alleviate this by considering stems (words without suffixes) and synonyms (taken from the semantic network WordNet). It uses several scoring formulae, such as NIST adequacy and WMT ranking, and currently supports English, Czech, German, French, Spanish and Arabic.
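The idea of relaxed word matching can be illustrated with a toy sketch. This is not the actual METEOR algorithm: the synonym table and the crude plural-stripping stemmer below are hypothetical stand-ins for WordNet and a real stemmer, and only unigram precision is computed.

```python
# Hypothetical toy resources; METEOR itself uses a real stemmer and WordNet.
SYNONYMS = [{"boy", "lad"}]

def stem(word):
    """Very crude stemmer: strips a trailing plural "s" from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def words_match(a, b):
    """True if the words are equal, share a stem, or are listed as synonyms."""
    a, b = a.lower(), b.lower()
    if a == b or stem(a) == stem(b):
        return True
    return any(a in group and b in group for group in SYNONYMS)

def relaxed_precision(candidate, reference):
    """Unigram precision where a match may be exact, by stem, or by synonym."""
    cand = candidate.split()
    remaining = reference.split()
    matched = 0
    for word in cand:
        for i, ref_word in enumerate(remaining):
            if words_match(word, ref_word):
                matched += 1
                del remaining[i]  # each reference word can be matched only once
                break
    return matched / len(cand) if cand else 0.0

print(relaxed_precision("He was such a kind boy", "He was such a kind lad"))
# prints: 1.0  (exact matching alone would give 5/6)
```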

Bulk evaluation of MT systems

It is interesting to compare average scores for various language pairs.

To do:
add matrix and explain "dark" and "light" columns and rows

Round-trip translation

When you have systems translating in both directions, A=>B and B=>A, you can translate a sentence into the other language and then back into the source language; this is the so-called round-trip translation. In the ideal case, you would obtain the same sentence, but the double-translated sentence usually contains errors, and in a way this can be considered a form of evaluation.

You can try it online with Translate and Back or using Google Translate.

Paraphrasing for MT evaluation

Paraphrasing can be used to generate multiple reference translations, which allows more precise evaluation with standard metrics.

Evaluation of evaluation metrics

Since there are several methods for automatic evaluation, we would like to know which one is the best. To measure the quality of an evaluation metric, its scores are usually compared (correlated) with human evaluation. The more a metric's output correlates with human evaluation of the same set of sentences, the more accurate the metric is considered.
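As a rough sketch of such a comparison, the Pearson correlation coefficient can be computed between metric scores and human scores for the same sentences. Pearson correlation is only one possible choice (rank correlations are also common), and all the scores below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five sentences (made up for this example).
human = [5, 4, 3, 2, 1]                 # human adequacy judgments
metric_a = [0.9, 0.7, 0.6, 0.3, 0.1]    # follows the human ranking closely
metric_b = [0.5, 0.6, 0.4, 0.6, 0.5]    # unrelated to the human ranking

print(f"metric A: {pearson(human, metric_a):.2f}")
print(f"metric B: {pearson(human, metric_b):.2f}")
```

Under this criterion, the hypothetical metric A would be preferred, because its scores follow the human judgments much more closely than those of metric B.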

There have been a few events dedicated to evaluating evaluation metrics, namely MetricsMATR and the WMT16 Metrics task.

[1] Michael Denkowski and Alon Lavie, "Meteor Universal: Language Specific Translation Evaluation for Any Target Language", Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014