
TER and BLEU – what these metrics tell us about translation quality

    Translating texts with machine translation engines and artificial intelligence does not always produce perfect results. Is there a way to objectively assess the quality of machine translation (MT)? Two metrics, TER and BLEU, can help. Learn what they reveal about the quality of machine-generated translations.

    What is TER?

    TER (Translation Edit Rate) is a metric used to assess the quality of machine translations. It measures how many edits are needed to transform a machine-translated text into a reference version – the ideal translation.

    TER is usually expressed as a percentage. It indicates how extensively the machine-translated segment must be changed to match the reference translation. These changes include:

    • Inserting missing words
    • Removing unnecessary words
    • Replacing incorrect words with correct ones
    • Reordering sequences of words

    A TER of 25% means that the number of edits needed equals one quarter of the number of words in the reference translation – in other words, roughly a quarter of the text has to be corrected to reach the reference version.
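    To make this concrete, below is a minimal Python sketch of a TER-style score (the function name ter_like is illustrative, not part of any official toolkit). It counts word-level insertions, deletions, and substitutions with a classic edit-distance table and divides by the reference length; real TER additionally counts a block shift – moving a whole phrase – as a single edit, which this simplified version omits.

```python
def ter_like(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance / reference length.

    Real TER also counts a block shift (moving a phrase) as one edit;
    this sketch only counts insertions, deletions and substitutions.
    """
    hyp, ref = hypothesis.split(), reference.split()
    # Classic dynamic-programming (Levenshtein) table over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # remove an unnecessary word
                          d[i][j - 1] + 1,         # insert a missing word
                          d[i - 1][j - 1] + cost)  # replace (or keep) a word
    return d[len(hyp)][len(ref)] / max(len(ref), 1)

# One insertion is needed in a six-word reference: TER ≈ 0.17, i.e. 17%.
print(ter_like("the cat sat on mat", "the cat sat on the mat"))
```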

    What does the TER metric indicate?

    TER provides valuable information for translators and post-editors working with MT texts. This metric allows you to:

    • estimate how much effort a machine-translated text requires – the higher the TER, the more time needed to correct it,
    • evaluate the effectiveness of a particular MT engine – analyse how well a tool actually supports translation,
    • compare different MT engines – TER enables objective comparisons, helping choose the best tool for specific types of texts.

    Limitations of the TER metric

    However, it is worth remembering that TER has its limitations, which may distort the analysis of machine translation. It does not take the semantics of the text into account – it focuses solely on the mechanical changes needed in the translation, so a perfectly acceptable paraphrase (for example, "fast reply" where the reference says "quick response") is still counted as edits. For this reason, it only makes sense to use TER in conjunction with other metrics, such as BLEU. This allows for a better and more accurate assessment of machine translation against the reference text.

    What is BLEU?

    BLEU (Bilingual Evaluation Understudy) is another metric for evaluating machine translation quality. It analyses the similarity between a machine-translated text and one or more reference translations.

    BLEU works by analysing n-grams, which are sequences of adjacent words, and comparing their occurrence in the machine-translated and reference texts. In its standard form it combines the precision of n-grams of length 1 to 4 and applies a brevity penalty so that overly short translations do not score artificially well. BLEU scores range from 0 to 1 (often reported on a 0-100 scale), where 1 indicates a perfect match with the reference translation.
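    As an illustration, here is a simplified sentence-level BLEU in Python – a sketch of the core idea, not a reference implementation. Production tools such as sacreBLEU add smoothing and support multiple references, so their scores will differ, especially on short sentences.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Count the n-grams (tuples of n adjacent words) in a word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU against a single reference, on a 0-1 scale."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matches == 0:
            return 0.0  # one zero precision collapses the geometric mean
        log_precisions.append(math.log(matches / sum(hyp_counts.values())))
    # Brevity penalty: hypotheses shorter than the reference lose points.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```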

    What does the BLEU metric indicate?

    BLEU provides insights especially useful for post-editing machine translations. It:

    • measures lexical similarity – shows the extent to which the MT uses the same words and phrases as the reference text. A high BLEU score suggests correct terminology,
    • evaluates fluency – by analysing longer n-grams, BLEU indirectly assesses whether the MT preserves natural target-language phrasing,
    • enables comparison of MT systems – similar to TER, BLEU allows objective evaluation of different tools. For example, if Engine A scores 0.42 on a medical text and Engine B scores 0.37, Engine A’s translation is closer to the reference and likely of higher quality.

    Limitations of the BLEU metric

    Like TER, BLEU is not perfect. It does not account for synonyms – a synonym in the MT output may be marked as an error. Therefore, BLEU is usually combined with other metrics such as TER or METEOR. METEOR, unlike BLEU, recognises synonyms and provides a more flexible assessment.
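    This weakness is easy to demonstrate. The snippet below is a hedged example using the sacreBLEU library, which reports BLEU on a 0-100 scale; the sentences are invented for illustration. A hypothesis that differs from the reference only by a synonym already loses points:

```python
import sacrebleu  # pip install sacrebleu

reference = ["The patient gave a quick reply to the doctor."]
exact = "The patient gave a quick reply to the doctor."
synonym = "The patient gave a fast reply to the doctor."

print(sacrebleu.sentence_bleu(exact, reference).score)    # 100.0
print(sacrebleu.sentence_bleu(synonym, reference).score)  # noticeably lower,
                                                          # despite identical meaning
```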

    Why TER and BLEU are important in translation evaluation

    TER and BLEU together give a more complete picture of MT quality. TER shows how much post-editing work a translator will need to do, while BLEU measures similarity to the reference. Using both metrics allows an objective, multidimensional evaluation.
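    In practice, the two scores are often computed side by side. Below is a hedged sketch using sacreBLEU's corpus-level helpers (the example segments are invented); note that sacreBLEU reports both metrics on a 0-100 scale, and that for BLEU higher is better while for TER lower is better:

```python
import sacrebleu  # pip install sacrebleu

# One hypothesis per segment from the MT engine, plus one reference stream.
hypotheses = ["The contract enters into force immediately."]
references = [["The agreement takes effect immediately."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # higher is better
ter = sacrebleu.corpus_ter(hypotheses, references)    # lower is better

print(f"BLEU: {bleu.score:.1f}   TER: {ter.score:.1f}")
```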

    For translators and translation agencies, these metrics are valuable because they:

    • help select the most suitable MT tools for specific types of texts,
    • allow monitoring of MT output quality through regular tracking of TER and BLEU values.

    However, it is worth remembering that no metric is perfect. TER and BLEU have their limitations, so a human perspective – cultural and linguistic sensitivity and an understanding of context – is always needed to make translation quality truly satisfactory.

    Is using BLEU and TER necessary?

    Including TER and BLEU in translation workflows has become essential when working with modern translation tools. These metrics help you understand the strengths and weaknesses of MT and use it more efficiently in daily work.

    If you are already using LivoLINK tools – such as LivoCAT, TM, glossaries, CRM, TMS, and automation – our next post will show how the system calculates these metrics and how this can improve your workflow.