What are the best Machine Translation APIs?

Evaluating the Quality and Response Time of Commercial Machine Translation APIs

This Survey is for…

Every company that is planning to implement any kind of translation should read this paper, because we outline the advantages and disadvantages of each machine translation tool in terms of quality and response time.

This in-depth content is geared towards professionals who are actively involved in improving their translation-related products and services, such as:

Project Managers

Localization Managers

Translators

Translation Agencies

Summary

In this paper we evaluate the quality and translation time of four popular machine translation engines: Amazon Translate, DeepL, Google Translate and Microsoft Translator.

To assess their translation quality, we calculate the BLEU score of their translations in comparison to human translations, analyzing different aspects such as the target language and the size of the sentence in the source language, which in this study is English. In addition, we measure the response time of those translation APIs, since this is an important feature for applications that require real-time translations, such as travel apps and translation agencies.

The results show:

DeepL and Amazon Translate were the top performers: DeepL achieved the best results for most European languages and Amazon Translate for the Asian ones;

in general, the longer the sentence, the better the translation; and

the engines’ APIs provided low translation times, with the exception of DeepL, whose median time to translate a single sentence was close to 1 second.

Introduction

Translation services are essential in a wide range of industries and applications. For instance, multinational companies provide content in multiple languages for their customers, and translation apps, such as TripLingo and iTranslate, produce real-time translations for their users.

To meet this demand, in recent years a growing number of machine translation (MT) engines have become available via public APIs, provided both by big tech companies, e.g., Google, Amazon and Microsoft, and by specialized translation companies, such as DeepL and Systran.

The great challenge for a solution that intends to use those MT engines is choosing which ones are most suitable for its needs. Aspects to take into consideration in this decision include translation quality, cost and turnaround time.

Previous approaches have assessed public machine translation APIs with respect to gender bias (Stanovsky et al., 2019), software quality (He et al., 2020; Gupta et al., 2020) and model vulnerability (Wallace et al., 2020). Regarding the translation quality of MT engines in particular, a recent study showed there is no single winner for all languages, and that commercial engines perform better than open-source ones.

In the same direction, in this paper we assess the quality of commercial MT engines and, in addition, measure the translation time of their APIs. More specifically, we collect more than 200K segments from translation memories on different topics (e.g., health and law) created by professional translators, and use them as ground truth to evaluate the quality of the translations of four commercial MT engines (Amazon Translate, DeepL, Google Translate and Microsoft Translator) across seven language pairs having English as the source language.


Our main findings are:

• The translation quality of the MT engines is similar across target languages, but DeepL and Amazon produced the best translations: DeepL for European languages and Amazon for Asian languages.

• In general, the longer a sentence, the better the translation quality, and DeepL and Amazon generated the highest-quality translations for long sentences;

• The engines’ APIs provided low translation times, which makes them suitable for real-time translation applications, with the exception of DeepL, whose median time to translate a single sentence was close to 1 second.

• The translation time for all engines grows linearly with respect to the number of segments to be translated, but DeepL has a much higher linear coefficient than the other engines in the single call scenario, and Amazon in the bulk scenario.


Experimental Setup

In this section, we present the setup we used in our experimental evaluation. More specifically, we describe the ground-truth dataset, the machine translation engines, and the metrics used to evaluate the engines.

Data

The dataset used in this evaluation, originating from 13 translation memories from different companies and generated by professional translators, has English as the source language and seven target languages: German (de), Spanish (sp), French (fr), Italian (it), Japanese (ja), Brazilian Portuguese (pt) and Chinese (zh). Every sentence in English has at least one corresponding pair in one of the mentioned target languages. In total, the dataset contains 224,223 segments in English and 315,073 pairs.

Figure 3 presents the distribution of the number of segments for each target language. Brazilian Portuguese has the highest number of segments (near 60k), whereas Japanese and Spanish have the lowest, around 20k segments each. An important feature of this dataset for this evaluation is that it covers a great diversity of topics.

Figure 1 shows a word cloud of the English segments. As one can see, there is content related to health, law, information technology, etc.

The dataset is structured as a text segment in the source language together with a reference list of translations in the target languages. Each reference list has at least one translation associated with the original text, but it may contain more than one, since a segment can have more than one valid translation.
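To make this structure concrete, the sketch below shows one hypothetical entry in a simple JSON-like layout. The study does not specify its actual storage format; the field names and example segments here are invented for illustration.

# Hypothetical illustration of the dataset structure described above:
# one English source segment mapped to one or more reference
# translations per target language (field names and segments are
# invented; the study's real format is unspecified).
sample_entry = {
    "source": "The patient must fast for eight hours before the exam.",
    "references": {
        "pt": [
            "O paciente deve jejuar por oito horas antes do exame.",
            "O paciente precisa ficar em jejum por oito horas antes do exame.",
        ],
        "de": ["Der Patient muss acht Stunden vor der Untersuchung fasten."],
    },
}

# One English segment can appear in several language pairs, which is
# why there are more pairs (315,073) than English segments (224,223).
for lang, refs in sample_entry["references"].items():
    print(f"en -> {lang}: {len(refs)} reference translation(s)")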




The selected MT engines are all able to translate a single segment through their respective APIs and, except for Amazon Translate, they can also respond to a bulk call, in which a list of segments is submitted and returned at once.

To deal with this bulk limitation of Amazon Translate, we made a minor coding optimization to the single call, eliminating the need to establish a connection to the API for every translation. This is not equivalent to a true bulk translation, but it helped reduce the gap between Amazon and the other engines with bulk translation support.
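As an illustration, here is a minimal sketch of that optimization, assuming the boto3 client for Amazon Translate: the client (and its underlying connection) is created once and reused across single calls instead of being re-created per segment. This is a sketch of the idea, not the code used in the study; the region is an assumption.

import boto3

# Create the client once; boto3 reuses the underlying HTTP connection
# across calls, so we avoid a new connection per translated segment.
translate = boto3.client("translate", region_name="us-east-1")  # region assumed

def translate_segments(segments, source="en", target="pt"):
    # Still one API request per segment (no true bulk call),
    # but all requests share the same client/connection.
    return [
        translate.translate_text(
            Text=segment,
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )["TranslatedText"]
        for segment in segments
    ]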

Although all of the mentioned MT engines support tuning their models with parallel data or a glossary of domain-specific terms, we set these options aside for this evaluation.

We also tried to evaluate other MT engines (e.g., Baidu Translate, Tencent, Systran PNMT, Apertium, Alibaba), but we could not use them for one of the following reasons: API unavailability, lack of documentation, or no support for all target languages.

Metrics

We evaluate the translation quality of the engines using the BLEU score (Papineni et al., 2002). We used Friedman’s test (Friedman, 1940) to compare the scores of the different engines, and the post hoc Nemenyi test (Nemenyi, 1963) to verify statistically significant differences between individual MT engines.
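The paper does not name its tooling, but this pipeline can be sketched with common Python packages: sacrebleu for BLEU, scipy for Friedman’s test, and scikit-posthocs for the Nemenyi test. The scores below are toy numbers for illustration only, not results from the study.

import numpy as np
import sacrebleu
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# BLEU for one engine: hypotheses vs. one or more reference streams.
hypotheses = ["the patient must fast for eight hours"]
references = [["the patient must fast for eight hours"]]  # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)

# Toy per-language BLEU scores for the four engines (illustration only).
amazon    = [41.2, 38.5, 47.1, 44.0, 30.3, 45.2, 36.8]
deepl     = [43.0, 40.1, 48.5, 45.2, 28.9, 46.1, 35.5]
google    = [40.5, 37.9, 46.0, 43.1, 29.5, 44.0, 35.0]
microsoft = [40.0, 38.0, 45.8, 42.7, 29.0, 43.5, 34.8]

stat, p = friedmanchisquare(amazon, deepl, google, microsoft)
if p < 0.05:  # significance level used in the paper
    # Pairwise post hoc comparison; rows are languages (blocks),
    # columns are engines (groups).
    scores = np.array([amazon, deepl, google, microsoft]).T
    print(sp.posthoc_nemenyi_friedman(scores))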

To calculate the APIs’ response time, we selected a sample of 100 segments from our dataset, respecting the distribution of segment-size intervals (Figure 2), and translated them with each engine from English to Portuguese.

We queried the engines with the selected sentences once a day for one week to assess both API methods: single and bulk. We did not use the whole dataset, and translated into only one target language, because it would be financially costly to hit the engines for one week with 200k segments in seven languages.
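A minimal sketch of how such timings can be taken is shown below, assuming a generic translate(segments) function wrapping one engine’s client (hypothetical; each engine’s actual client differs).

import time

def time_single(translate, segments):
    # One API call per segment; returns mean seconds per segment.
    start = time.perf_counter()
    for segment in segments:
        translate([segment])
    return (time.perf_counter() - start) / len(segments)

def time_bulk(translate, segments):
    # All segments submitted in a single API call.
    start = time.perf_counter()
    translate(segments)
    return (time.perf_counter() - start) / len(segments)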

Experimental Results

In this section, we present the results of our investigation of the performance of the machine translation engines described in Section 2.

Table 1 presents the mean BLEU score of the four engines on each target language. For all languages, the p-values of Friedman’s test were smaller than the significance level (0.05), meaning that there are statistically significant differences in the scores of the engines. In addition, the engines with the best scores for each language performed in a statistically different way from the other ones, according to the post hoc Nemenyi test.

Translation Time Evaluation

Figure 5a presents the distribution of the translation time per segment for each MT engine when sending one segment at a time (single), and Figure 5b when sending 100 segments at once (bulk).

In the single scenario, Microsoft provided the fastest translation (median of 0.09 second per segment). Amazon and Google were around two times slower (medians close to 0.2 second), and DeepL was the slowest (median of 0.96 second per segment), almost ten times slower than Microsoft.

The first thing to notice when using the bulk call of the APIs (Figure 5b), in comparison to the single one (Figure 5a), is the great reduction in translation time per segment. For DeepL, for instance, the median translation time per segment decreased from 0.95 second in the single execution to 0.02 second in the bulk one. These results clearly show that the bulk operation is much more efficient than sending segments individually for translation. Regarding the individual performance of the engines, Microsoft and Google obtained the lowest translation times (medians of 0.003 and 0.002 second per segment, respectively), whereas the highest translation time was from Amazon (median of 0.09 second). We believe the reason for Amazon’s poor performance here is that it does not provide a real bulk call, which we had to approximate in our experiments, as mentioned above.

The evaluated MT engines therefore presented low translation times per segment, which makes them suitable for real-time translation applications. The only exception was DeepL in the single scenario, in which the median translation time for a single sentence was close to 1 second.

Translation Time Per Segment

To analyze the scalability of the engines, we present in Figures 6a and 6b the response time of the MT engines as we vary the number of segments. In all curves, the time grows linearly with the number of segments.

However, the linear coefficients of some engines are much smaller than those of the others. For instance, DeepL has the highest coefficient in the single scenario and Amazon the highest in the bulk one, meaning that they do not scale as well as their competitors in their respective scenarios.
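As a sketch of how such a coefficient can be estimated, assuming measured (number of segments, response time) points, a degree-1 fit with numpy.polyfit recovers the slope. The numbers below are illustrative, not measurements from the study.

import numpy as np

n_segments = np.array([1, 10, 25, 50, 100])
seconds    = np.array([0.9, 9.2, 23.1, 46.5, 95.0])  # illustrative values

# Fit time = slope * n + intercept; the slope is the linear coefficient.
slope, intercept = np.polyfit(n_segments, seconds, 1)
print(f"~{slope:.2f} s per additional segment")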

Conclusion

In this paper, we presented an evaluation of four machine translation engines with respect to their quality and response time. Our evaluation showed that the quality of the engines is similar, with Amazon and DeepL as the top performers. Regarding response time, the engines presented good performance overall, with the exception of DeepL when sending one segment at a time, and Amazon in the bulk call.
