What are the best Machine Translation APIs?

A full benchmark study of the best machine translation APIs—Google, Amazon, DeepL & Microsoft. Discover the top performer.

Translation APIs are everywhere. But not all of them deliver the same level of performance. 

A recent study showed that there is no single winner across all languages, and that commercial engines outperform open-source ones.

This benchmark study tested the top players — Google, Amazon, Microsoft, and DeepL — using over 200,000 human-translated segments across seven languages, including Portuguese, Chinese, and Japanese. 

DeepL and Amazon came out on top, with DeepL excelling in European languages and Amazon leading in Asian ones.

While most engines delivered fast responses, DeepL lagged behind in real-time translation scenarios — with a median delay of nearly 1 second per sentence. That’s a major gap for apps that rely on instant results.

We calculated the BLEU score of each engine's translations against the human references, analyzing factors such as the target language and the length of the source sentence.

In addition, we measured the response time of these translation APIs, since this is an important feature for applications that require real-time translation, such as travel apps and translation agency workflows.

So, when it comes to choosing the best translation API, it’s not just about who supports the most languages. It’s about striking the right balance between quality, speed, and context.

Here is a Summary of Our Key Findings

  • DeepL and Amazon Translate delivered the highest translation quality overall, with DeepL leading in European languages and Amazon outperforming in Asian languages like Japanese and Chinese.
  • There is no one-size-fits-all engine: performance varies by language pair, sentence length, and translation context.
  • Longer sentences tend to produce better BLEU scores across all engines — a consistent pattern observed in every language tested.
  • Microsoft Translator had the fastest response time in single-segment translations (median: 0.09 seconds), while DeepL was the slowest (close to 1 second per segment).
  • In bulk translation mode, Google and Microsoft offered sub-second speeds per segment, while Amazon underperformed due to its lack of true batch support.
  • BLEU scores showed statistically significant differences between the engines, confirmed by Friedman and Nemenyi tests — validating the results beyond anecdotal evidence.
  • Scalability is not equal: DeepL’s response time increases more sharply as segment volume grows, which may be a limiting factor in high-volume use cases.
  • All engines performed well enough for real-time applications, with the exception of DeepL in single-call mode and Amazon in bulk scenarios.
  • Brazilian Portuguese had the highest number of evaluated segments, making it one of the most robust language pairs in the study.
  • Data diversity matters: the dataset used covered domains like health, law, and IT, simulating real-world translation demands with high reliability.

What Are Machine Translation APIs?

Machine Translation APIs are cloud-based services that allow developers and platforms to automatically translate text between languages using machine learning models.

Instead of building their own translation engines from scratch, companies can integrate these APIs into websites, apps, or internal systems to provide fast, scalable, and multilingual content.

Some of the most popular Machine Translation APIs include:

  • Google Translate API – Covers over 100 languages and integrates easily with Google Cloud.
  • Amazon Translate – Designed for large-scale, fast translation, with strong performance in Asian languages.
  • Microsoft Translator – A budget-friendly option supporting 90+ languages, ideal for real-time applications.
  • DeepL API – Known for its high-quality translations in European languages, especially when it comes to fluency and nuance.

These APIs are widely used in industries like e-commerce, travel, legal, healthcare, customer support, and localization, where accurate, real-time translation can drastically improve user experience and operational efficiency.

But not all APIs are created equal — and choosing the right one depends on your specific needs: language pairs, speed, cost, and of course, translation quality.
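For a sense of how lightweight this kind of integration can be, here is a minimal sketch of a single translation call using Google's official Python client (this assumes the google-cloud-translate package is installed and Google Cloud credentials are configured; the other providers offer similar clients):

```python
# Minimal sketch: translating one segment through a Machine Translation API.
# Assumes `pip install google-cloud-translate` and configured credentials.
from google.cloud import translate_v2 as translate

client = translate.Client()

result = client.translate(
    "The patient should take one tablet daily.",  # source segment
    source_language="en",
    target_language="de",
)
print(result["translatedText"])
```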

Machine Translation Engines

For this evaluation, we selected four commercial machine translation engines that support all language pairs in our dataset. We describe them below, with their costs as of January 2022.

  • Amazon Translate: Developed by Amazon, it provides support for machine translation in more than 70 languages. Its Python API is fully integrated with AWS services, at a cost of USD 15 per million characters.
  • DeepL: A company focused on machine translation. Its API supports 26 languages, at a cost of USD 25 per million characters. We used its Python API, which enables translations to and from English.
  • Google Translate: Provides machine translation support for over 100 languages, the widest coverage among the engines evaluated. It also offers a Python API integrated with all Google Cloud services. Translation pricing is USD 20 per million characters.
  • Microsoft Translator: The machine translation service provided by Microsoft, at a cost of USD 10 per million characters, the lowest price among the evaluated MT engines. It supports nearly 90 languages.

The selected MT engines are all able to translate a single segment through their respective APIs, and except for Amazon Translate, they can also respond to a bulk call, in which a list of segments is submitted and translated at once.

To work around this limitation of Amazon Translate, we made a minor optimization to the single call, reusing the API connection instead of establishing a new one for every translation. This is not equivalent to a true bulk call, but it helped narrow the gap between Amazon and the engines with bulk support.
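A minimal sketch of that kind of optimization, assuming the boto3 Python client (this is our illustration, not the exact code used in the study):

```python
# Sketch of the single-call optimization for Amazon Translate: create the
# boto3 client once and reuse it for every segment, instead of opening a
# new connection per translation. Assumes configured AWS credentials.
import boto3

client = boto3.client("translate")  # created once, reused below

def translate_segments(segments, source="en", target="pt"):
    """Translate a list of segments by looping over single calls."""
    results = []
    for segment in segments:
        response = client.translate_text(
            Text=segment,
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )
        results.append(response["TranslatedText"])
    return results
```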

Although all of the above MT engines support tuning their models with parallel data or a glossary of domain-specific terms, we set these options aside for this evaluation.

We also tried to evaluate other MT engines (e.g., Baidu Translate, Tencent, Systran PNMT, Apertium, Alibaba), but we could not use them for one of the following reasons:

  • API unavailability
  • Lack of documentation
  • No support for all target languages

Metrics

We evaluated the translation quality of the engines using the BLEU score (Papineni et al., 2002). We used Friedman’s test (Friedman, 1940) to compare the scores of the different engines, and the post hoc Nemenyi test (Nemenyi, 1963) to verify statistically significant differences between individual MT engines.
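As an illustration of this pipeline (not the authors' exact code), the BLEU scores and both tests can be computed with standard Python packages such as sacrebleu, scipy, and scikit-posthocs; the per-segment scores below are placeholders:

```python
# Hypothetical sketch of the quality evaluation: per-segment BLEU scores,
# a Friedman test across engines, and a post hoc Nemenyi test.
import pandas as pd
import sacrebleu
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

def segment_bleu(hypothesis, references):
    """BLEU for one segment against one or more references, on a 0-1 scale."""
    return sacrebleu.sentence_bleu(hypothesis, references).score / 100.0

# scores[engine] = per-segment BLEU scores, aligned by segment (placeholders).
scores = {
    "amazon":    [0.71, 0.58, 0.64, 0.69],
    "deepl":     [0.73, 0.60, 0.66, 0.70],
    "google":    [0.69, 0.57, 0.63, 0.68],
    "microsoft": [0.66, 0.55, 0.61, 0.65],
}

# Friedman test: do the engines' score distributions differ overall?
stat, p_value = friedmanchisquare(*scores.values())
print(f"Friedman p-value: {p_value:.4f}")

# Nemenyi test: which pairs of engines differ significantly?
df = pd.DataFrame(scores)  # rows = segments (blocks), columns = engines
print(sp.posthoc_nemenyi_friedman(df))
```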

To calculate the APIs’ response time, we selected a sample of 100 segments from our dataset, respecting the distribution of segment-size intervals (see the Data section), and translated them with each engine from English to Portuguese.

We queried the engines with the selected sentences once a day for one week, exercising both API methods: single and bulk. We did not use the whole dataset, and translated into only one target language, because hitting the engines for one week with 200k segments in seven languages would have been financially costly.
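A simplified sketch of how such timings can be collected (translate_single and translate_bulk are hypothetical stand-ins for each engine's actual API calls):

```python
# Hypothetical sketch of the response-time measurement: time the single
# and bulk modes and normalize both to seconds per segment.
import time

def time_single(translate_single, segments):
    """Per-segment latencies when sending one segment per API call."""
    latencies = []
    for segment in segments:
        start = time.perf_counter()
        translate_single(segment)  # one API call per segment
        latencies.append(time.perf_counter() - start)
    return latencies

def time_bulk(translate_bulk, segments):
    """Average per-segment latency when sending all segments in one call."""
    start = time.perf_counter()
    translate_bulk(segments)  # a single API call for the whole list
    return (time.perf_counter() - start) / len(segments)
```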

Experimental Results

In this section, we present the results of our investigation into the performance of the machine translation engines described above.

Quality Evaluation

The table below presents the mean BLEU score of the four engines for each target language. For all languages, the p-values of Friedman’s test were smaller than the significance level (0.05), meaning there are statistically significant differences among the engines’ scores. In addition, according to the post hoc Nemenyi test (p-values below 0.05), the engines with the best scores for each language performed statistically differently from the others. Amazon and DeepL achieved the best overall results, with the highest scores in 4 target languages. Google tied with DeepL in Spanish and with Amazon in Chinese, whereas Microsoft did not outperform any other MT engine in any language.

The following figure presents the BLEU score distribution for different segment sizes in each target language. A common trend in these plots is that the longer a sentence, the better the BLEU score.

For instance, with German as the target language, the median scores of all MT engines were around 0.6 for segments of 1 to 10 words and close to 0.7 for segments longer than 40 words.

Japanese is the only exception: segment size did not affect the translation quality of Amazon and DeepL, but it did affect Microsoft (median BLEU score of 0.61 for the 1-10 interval and 0.58 for the 40+ interval) and Google (median BLEU score of 0.62 for the 1-10 interval and 0.6 for the 40+ interval).

Translation Time Evaluation

The distribution of translation time per segment for each MT engine—when sending one segment at a time (single) and 100 segments at once (bulk)—can be analyzed below.

In the single scenario, Microsoft provided the fastest translation (median of 0.09 seconds per segment). Amazon and Google were around two times slower (medians close to 0.2 seconds), and DeepL was the slowest (median of 0.96 seconds per segment), about ten times Microsoft’s median.

The first thing to notice about the APIs’ bulk call, compared to the single one, is the great reduction in translation time per segment. For DeepL, for instance, the median translation time per segment decreased from 0.95 seconds in the single execution to 0.02 seconds in the bulk one.

These results clearly show that the bulk operation is much more efficient than sending segments individually for translation. Regarding individual engine performance, Microsoft and Google obtained the lowest translation times (medians of 0.003 and 0.002 seconds per segment, respectively), whereas the highest translation time came from Amazon (median of 0.09 seconds).

We believe the reason for Amazon’s poor performance here is that it does not provide a real bulk call, which, as described above, we had to approximate in our experiments.

The evaluated MT engines therefore presented low translation times per segment, which makes them suitable for real-time translation applications. The only exception was DeepL in the single scenario, in which the median translation time of a single sentence was close to 1 second.

To analyze the scalability of the engines, we present below the response time of the MT engines when we vary the number of segments. In all curves, the time grows linearly with the number of segments.

However, the linear coefficient (slope) varies considerably across engines. For instance, DeepL has the highest coefficient in the single scenario and Amazon the highest in the bulk one, meaning that they do not scale as well as their competitors in the respective scenarios.
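One simple way to quantify that difference, sketched below with made-up measurements, is to fit a line to each engine's response-time curve and compare the slopes (seconds of added latency per additional segment):

```python
# Hypothetical sketch: estimate each engine's scalability as the slope of a
# linear fit of total response time against the number of segments sent.
import numpy as np

n_segments = np.array([10, 25, 50, 75, 100])  # segments per request

# Total response times in seconds (illustrative values, not measurements).
response_times = {
    "google": np.array([0.05, 0.09, 0.15, 0.22, 0.28]),
    "deepl":  np.array([0.30, 0.70, 1.40, 2.10, 2.80]),
}

for engine, times in response_times.items():
    slope, _intercept = np.polyfit(n_segments, times, deg=1)
    print(f"{engine}: {slope:.4f} s per additional segment")
```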

Conclusion

In this paper, we presented an evaluation of four machine translation engines with respect to their quality and response time. Our evaluation showed that the quality of the engines is similar, with Amazon and DeepL as the top performers. Regarding response time, the engines generally performed well, with the exceptions of DeepL when sending one segment at a time and Amazon in the batch call.

Experimental Setup

In this section, we present the setup we used in our experimental evaluation. More specifically, we describe the ground-truth dataset, the machine translation engines, and the metrics used to evaluate the engines.

Data

The dataset used in this evaluation was built from 13 translation memories, produced by professional translators at different companies. It has English as the source language and seven target languages:

  • German (de)
  • Spanish (sp)
  • French (fr)
  • Italian (it)
  • Japanese (ja)
  • Brazilian Portuguese (pt)
  • Chinese (zh)

Every sentence in English has at least one correspondent pair with one of the mentioned target languages. There are a total of 224,223 segments in English in the dataset and 315,073 pairs.

The figure below presents the distribution of the number of segments for each target language. Brazilian Portuguese has the highest number of segments (nearly 60k), whereas Japanese and Spanish have the lowest, around 20k segments each. An important feature of this dataset for this evaluation is that it covers a great diversity of topics.

The following figure shows a word cloud of the English segments. As one can see, there is content related to health, law, information technology, etc.

The dataset is structured as a text segment in the source language paired with a reference list of translations in the target languages. Each reference list has at least one translation of the original text, and may have more, since a segment can have more than one valid translation.
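In Python terms, one entry can be pictured roughly like this (the field names are ours for illustration, not the dataset's actual schema):

```python
# Rough sketch of one dataset entry: an English source segment plus a list
# of reference translations per target language (field names hypothetical).
entry = {
    "source": "Store the medication at room temperature.",
    "references": {
        "de": ["Lagern Sie das Medikament bei Raumtemperatur."],
        "pt": [
            "Armazene o medicamento em temperatura ambiente.",
            "Guarde o medicamento à temperatura ambiente.",  # alternative reference
        ],
    },
}
```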

To simplify our analysis, we grouped the segments into ranges of 10 words, as shown in the figure below, in order to evaluate the impact of segment size on the quality of the engines’ translations.
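A minimal sketch of that grouping, assuming simple whitespace tokenization (the study's top interval is open-ended, 40+ words, whereas this sketch uses uniform ranges for simplicity):

```python
# Sketch: bucket segments into word-count ranges of width 10 (1-10, 11-20, ...),
# assuming whitespace tokenization.
def size_bucket(segment, width=10):
    n_words = len(segment.split())
    lower = ((n_words - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

print(size_bucket("Store the medication at room temperature."))  # -> "1-10"
```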

This paper is for…

Every company that is planning to implement any kind of translation needs to read this paper, because we outline the advantages and disadvantages of each Machine Translation tool in terms of quality and response time. This in-depth content is geared towards professionals who are actively involved in improving their translation-related products and services, such as:

  • Product Managers,
  • Project Managers,
  • Localization Managers,
  • Engineering Leaders,
  • Translators,
  • Translation Agencies.

This paper was written by Bureau Works engineers.

Bureau Works delivers comprehensive in-house translation services on our localization platform, which allows for in-depth reporting, evolving translation memories, and automated localization.

Most importantly, we combine the business and technical elements of localization under one roof.

Gabriel Melo, Luciano Barbosa, Fillipe de Menezes, Vanilson Buregio, Henrique Cabral.

Bureau Works, Universidade Federal de Pernambuco, Universidade Federal Rural de Pernambuco

3685 Mt Diablo Blvd, Lafayette, CA, United States; Av. Prof. Moraes Rego, 1235, Recife, PE, Brazil; Rua Dom Manuel de Medeiros, s/n, Recife, PE, Brazil


{gabriel.melo, filipe, henrique}@bureauworks.com, luciano@cin.ufpe.br, vanilson.buregio@ufrpe.br
