Because machine translation is an automated process, its output must be evaluated regularly to ensure quality. How can you do that? Companies have two options. One is to have professionals review the translations. Of course, this is almost contradictory to the whole point of MT, which is a translation process that does not involve humans. To solve this problem, the BLEU score was created. The BLEU score is a well-known concept for those familiar with MT. BLEU is short for Bilingual Evaluation Understudy, an algorithm that evaluates texts machine-translated from one language to another.
How does the BLEU score evaluate quality?
To evaluate a text’s quality, the BLEU score compares it with reference translations. In other words, the BLEU score measures the similarity between an automated translation and a professional one. The closer the automated version is to the human-translated content, the better it is considered to be.
The BLEU score determines how “close” a text is to the reference content through its algorithm. The algorithm compares consecutive word sequences (n-grams) from the MT text with the consecutive word sequences found in the reference translation, and evaluates how similar they are to one another, without taking grammar or intelligibility into account. Based on this information, BLEU produces a score between 0 and 1, where 1 is the ideal result. However, it is important to point out that few texts will ever achieve this score, as it would imply that the MT text is identical to at least one text in the reference corpus.
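To make the idea concrete, here is a minimal sketch of this kind of n-gram comparison in Python. It is a simplified, single-reference version for illustration only (real implementations such as NLTK or sacreBLEU handle multiple references, smoothing, and tokenization more carefully); the function names and the `max_n=4` default are our own choices, not part of any standard API.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All consecutive word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: compare one MT sentence against one reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a candidate n-gram only counts as often
        # as it actually appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

# An identical translation scores 1.0; a partial match scores in between.
print(bleu("the cat sat on the mat", "the cat sat on the mat"))
print(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))
```

Note that short sentences with no 4-gram overlap score 0 under this unsmoothed sketch, which is why production toolkits apply smoothing and score whole corpora rather than single sentences.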
As you may have noticed, the BLEU score system needs reference content against which to evaluate the machine-translated texts. In fact, it is generally recommended to gather at least 1,000 sentences of reference translations to ensure reliable scoring.
There are a few downsides to this technology. For example, the need for a significant number of reference sentences can be a problem depending on the nature of the translated content. However, this method remains highly popular among MT users and is one of the most cost-efficient ways to evaluate machine-translated texts.