Machine Translation/Evaluation

Why we need evaluation


In many NLP tasks, researchers need to know whether changes to their algorithms improve or degrade overall performance. In MT, this means evaluating the performance of MT systems.

Evaluation of MT is harder than that of many other NLP tasks, since there is no single perfect translation of a sentence; there are many semantically equivalent or similar translations.


 

To do:
put an example here


What is evaluated?


Fluency


Is the translation in natural word order? Is the text fluent? Does it contain grammatical errors?

Adequacy


Does the translation preserve the meaning of the original? Is part of the meaning lost, added or skewed?

Intelligibility


Is the translation comprehensible?

Manual evaluation


In manual evaluation, annotators usually assess the above qualities on a five-point scale.[citation needed]

adequacy                fluency
5  all meaning          5  flawless English
4  most meaning         4  good English
3  much meaning         3  non-native English
2  little meaning       2  disfluent English
1  no meaning           1  incomprehensible


 

To do:
add example of an annotation tool here


The disadvantages of manual evaluation are clear; it is:

  • slow,
  • expensive,
  • subjective.

Inter-annotator agreement (IAA) studies show that people agree more when assessing fluency than adequacy.[citation needed]

The evaluation can also be formulated as a comparison of two candidate translations, which may be much easier for annotators to assess and can increase IAA.[citation needed]

Post-editing time


Cost saved


Automatic evaluation


Since manual evaluation is very slow and costly, automatic methods are used.

The paradox is that we let computers assess automatic translations, which is like asking students to grade their own essays. Another problem is that automatic methods usually output a score for a given pair of reference and candidate sentences, and this score is not straightforward to interpret.

The main prerequisite is to have reference manual translations (a gold standard) which are automatically compared with candidate translations from an MT system. Each candidate translation is compared with one or more reference translations, and automatic metrics then quantify this comparison.

Recall and precision


These two metrics come from information retrieval (IR) and are also used in the evaluation of many other NLP tasks. Their harmonic mean, called the F-score, combines them into a single number, which is easier to work with. To apply them to MT quality evaluation, we represent candidate and reference sentences as bags of words (BOW).

$F = \frac{2 \cdot P \cdot R}{P + R}$

The precision is defined as the number of correct words in the candidate sentence divided by the number of words in the candidate sentence. The recall has the same numerator; its denominator is the number of words in the reference sentence.

Let us consider the following pair of sentences. MT system output: "I did not something wrong", reference translation: "I have not done anything wrong".

$P = \frac{3}{5} = 0.6, \qquad R = \frac{3}{6} = 0.5, \qquad F = \frac{2 \cdot 0.6 \cdot 0.5}{0.6 + 0.5} \approx 0.55$ (the words I, not and wrong are counted as correct)

It is obvious that the formula does not capture word order, so if the candidate translation contains all the reference words, even in a completely scrambled order, the F-score will still be 100%.
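
To make the computation concrete, here is a minimal Python sketch (function and variable names are my own) that computes bag-of-words precision, recall and F-score for the example above.

```python
from collections import Counter

def bow_precision_recall_f(candidate, reference):
    """Bag-of-words precision, recall and F-score for one sentence pair."""
    cand = candidate.split()
    ref = reference.split()
    # A candidate word counts as correct at most as many times
    # as it appears in the reference.
    correct = sum((Counter(cand) & Counter(ref)).values())
    precision = correct / len(cand)
    recall = correct / len(ref)
    f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_score

# Example from the text:
p, r, f = bow_precision_recall_f("I did not something wrong",
                                 "I have not done anything wrong")
print(p, r, f)  # 0.6 0.5 0.545...
```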

N-gram methods


This class of evaluation metrics uses n-gram precision between candidate and reference sentences. N-grams help to capture word order.

BLEU


Probably the most popular evaluation metric is BLEU.[citation needed] It was developed at IBM by Papineni and coauthors. It uses the precision of n-grams up to length 4 and also penalizes candidate sentences that are too short: a good translation is expected to have the same length as the reference translation.

The candidate sentence c is scored with the following formula:

$\text{BLEU}(c) = \min\left(1, \frac{|c|}{|r|}\right) \cdot \left(\prod_{n=1}^{4} p_n\right)^{1/4}$

where $r$ is the reference sentence, $p_n$ is the $n$-gram precision of the candidate $c$ against $r$, and the first factor is the brevity penalty.

Let us consider the previous example plus another candidate translation from a system B: "He has not done anything wrong".


 

To do:
Add a visualization


metric             system A   system B
1-gram precision   3/5        4/6
2-gram precision   0/5        3/6
3-gram precision   0/5        2/6
4-gram precision   0/5        1/6
brevity penalty    5/6        6/6
BLEU               0.00       0.37
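
The following Python sketch reproduces the table above. It implements the simplified sentence-level BLEU used in this chapter (n-gram matches divided by the candidate length, brevity penalty $\min(1, |c|/|r|)$); note that the standard BLEU definition divides by the number of candidate n-grams, clips counts per n-gram type and is usually computed over a whole corpus. The function names are my own.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    product = 1.0
    for n in range(1, max_n + 1):
        matches = Counter(ngrams(cand, n)) & Counter(ngrams(ref, n))
        # Simplified convention from the table: divide by the candidate length.
        product *= sum(matches.values()) / len(cand)
    brevity_penalty = min(1.0, len(cand) / len(ref))
    return brevity_penalty * product ** (1.0 / max_n)

reference = "I have not done anything wrong"
print(round(simple_bleu("I did not something wrong", reference), 2))       # 0.0  (system A)
print(round(simple_bleu("He has not done anything wrong", reference), 2))  # 0.37 (system B)
```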

NIST


NIST stands for the National Institute of Standards and Technology, which defined its own metric derived from the BLEU score.[citation needed] It weights n-gram precision according to information value: rarer n-grams carry more weight when they are matched.
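
The full NIST metric combines these weights over matched n-grams of several lengths and adds its own brevity penalty; the sketch below only illustrates the information weight itself, which is estimated from reference-corpus counts as the base-2 log of the (n-1)-gram prefix count divided by the n-gram count. The counts and helper names here are made up for illustration.

```python
import math
from collections import Counter

def info_weight(ngram, ngram_counts, prefix_counts):
    """NIST-style information value of a matched n-gram:
    log2(count of its (n-1)-gram prefix / count of the n-gram)."""
    return math.log2(prefix_counts[ngram[:-1]] / ngram_counts[ngram])

# Toy reference-corpus counts (hypothetical numbers):
unigram_counts = Counter({("not",): 10})
bigram_counts = Counter({("not", "done"): 2, ("not", "good"): 8})

# The rarer continuation "not done" carries more information than "not good".
print(info_weight(("not", "done"), bigram_counts, unigram_counts))  # log2(10/2) ~ 2.32
print(info_weight(("not", "good"), bigram_counts, unigram_counts))  # log2(10/8) ~ 0.32
```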

 

To do:
add an example


NEVA


NEVA stands for N-gram EVAluation. Since BLEU uses n-gram precision up to 4-grams, short sentences are disadvantaged by the formula. NEVA takes this into account and also assesses stylistic richness using synonyms.[citation needed]

Edit distance methods


WAFT


WAFT stands for Word Accuracy For Translation and uses edit distance to compare candidate and reference translations.

$\text{WAFT}(c, r) = 1 - \frac{d(c, r)}{\max(|c|, |r|)}$

where $d(c, r)$ is the word-level edit distance between the candidate and the reference, with deletion, substitution and insertion as edit operations. The score is normalized by the length of the longer of the two compared sentences.
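
A minimal Python sketch of the word-level edit distance and the resulting WAFT score under the normalization described above (function names are my own):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (deletion, substitution, insertion)."""
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(a)][len(b)]

def waft(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return 1.0 - edit_distance(cand, ref) / max(len(cand), len(ref))

print(waft("I did not something wrong",
           "I have not done anything wrong"))  # 1 - 3/6 = 0.5
```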

TER, HTER


TER stands for Translation Edit Rate. Besides insertion, deletion and substitution, swapping (shifting) of words is allowed as an edit operation.

$\text{TER} = \frac{\text{number of edits}}{\text{average number of reference words}}$

TER can be used with multiple reference translations.
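
Computing TER exactly is harder because of the shift operation; as a rough illustration, the sketch below reuses the word-level edit_distance function from the WAFT example above, ignores shifts (so it can only overestimate the number of edits), and shows how multiple references are handled by normalizing with the average reference length.

```python
def approx_ter(candidate, references):
    """Rough TER without the shift operation: minimum word-level edit distance
    to any reference, divided by the average reference length.
    Requires edit_distance() from the WAFT sketch above."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    edits = min(edit_distance(cand, r) for r in refs)
    avg_ref_len = sum(len(r) for r in refs) / len(refs)
    return edits / avg_ref_len

print(approx_ter("I did not something wrong",
                 ["I have not done anything wrong",
                  "I did not do anything wrong"]))  # 2 edits / 6 words ~ 0.33
```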


 

To do:
Example


The evaluation can also be done against a manually post-edited translation of the MT output; this variant is called HTER (human-targeted TER).[citation needed]

Other techniques


Meteor[1]


Many evaluation metrics do not consider synonyms and morphology. When you translate into English and use boy instead of lad from the reference translation "He was such a kind lad", the candidate translation isn't wrong. But if n-grams are used for scoring the translation, the score is substantially lowered.

To overcome this disadvantage, synonyms can be considered in the scoring. Another disadvantage of comparing candidate and reference translations word by word is that the translation error is sometimes on the sub-word level, e.g. a wrong suffix is chosen (singular vs. plural). Again, strictly word-based evaluation methods will assign a score that is too low.

The METEOR metric tries to alleviate this by considering stems (words without suffixes) and synonyms (taken from the semantic network WordNet). It offers several scoring schemes, such as NIST adequacy and WMT ranking, and currently supports English, Czech, German, French, Spanish and Arabic.
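
The matching idea can be sketched in Python with NLTK: a candidate word is matched if it is found in the reference exactly, by stem, or via a shared WordNet synset. This illustrates only the matching stage, not METEOR's full scoring (which also builds a one-to-one alignment and combines precision, recall and a fragmentation penalty); the function names are my own, and the WordNet data must be downloaded first.

```python
# pip install nltk ; then: python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def synonyms(word):
    """All WordNet lemma names that share a synset with the given word."""
    return {lemma.lower() for syn in wn.synsets(word) for lemma in syn.lemma_names()}

def meteor_style_matches(candidate, reference):
    """Count candidate words matched exactly, by stem, or by synonymy."""
    ref = reference.lower().split()
    ref_stems = {stemmer.stem(w) for w in ref}
    matched = 0
    for word in candidate.lower().split():
        if word in ref:                          # exact match
            matched += 1
        elif stemmer.stem(word) in ref_stems:    # stem match
            matched += 1
        elif synonyms(word) & set(ref):          # synonym match
            matched += 1
    return matched

# "automobile" is matched with "car" through a shared WordNet synset.
print(meteor_style_matches("He bought a new automobile",
                           "He bought a new car"))  # 5
```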


Bulk evaluation of MT systems


It is interesting to compare average scores for various language pairs.


 

To do:
add matrix and explain "dark" and "light" columns and rows


Round-trip translation


When you have a system translating between languages A=>B and B=>A, you may try to translate a sentence back to the source language, so-called round-trip translation. In the ideal case, you would obtain the original sentence; in practice, the double-translated sentence usually contains errors, and in a way this can be considered a form of evaluation.

You can try it online with Translate and Back or using Google Translate.
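
A sketch of the idea, assuming a hypothetical translate(text, source, target) function (any MT system or API could stand in for it); the double-translated sentence can then be compared with the original, for example using the bag-of-words F-score from earlier in this chapter.

```python
def round_trip(sentence, translate, source="en", pivot="de"):
    """Translate source -> pivot -> source and return the result.

    `translate` is a hypothetical callable: translate(text, source, target) -> str.
    """
    forward = translate(sentence, source, pivot)
    return translate(forward, pivot, source)

def dummy_translate(text, source, target):
    # Stand-in so the sketch runs without a real MT system;
    # a real translator would return an actual translation.
    return text

sentence = "I have not done anything wrong"
print(round_trip(sentence, dummy_translate) == sentence)  # True for the dummy; rarely true for real systems
```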

Paraphrasing for MT evaluation


Paraphrasing can be used to generate multiple reference translations, which allows more precise evaluation with standard metrics.

Evaluation of evaluation metrics


Since there are several methods for automatic evaluation, we would like to know which one is the best. To measure the quality of an evaluation metric, correlation with human evaluation is usually used: the more the output of a metric correlates with human evaluation of the same set of sentences, the more accurate the metric is considered to be.
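
For example, one can compute the Pearson correlation between a metric's sentence-level scores and human judgments of the same sentences; the numbers below are made up purely for illustration, and SciPy is assumed to be available. (Rank correlations such as Spearman's are also commonly used.)

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same five sentences:
metric_scores = [0.37, 0.00, 0.55, 0.72, 0.41]
human_scores = [4, 1, 3, 5, 3]  # e.g. adequacy on the 1-5 scale above

correlation, p_value = pearsonr(metric_scores, human_scores)
print(correlation)  # closer to 1.0 means the metric tracks human judgment more closely
```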

There have been a few events dedicated to evaluating evaluation metrics, namely MetricsMATR and WMT16 Metrics task.

  1. Michael Denkowski and Alon Lavie, "Meteor Universal: Language Specific Translation Evaluation for Any Target Language", Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014