Recently I came upon this little tidbit and initially thought how wonderful, (NMT is surely rising!) and decided to take a closer look and read through the research paper. This exercise left me a little uncertain as I now felt doubt, and began to suspect that this is just another example of that never-ending refrain of the MT world, the empty promise. Without a doubt Google had made some real progress, but “Nearly Indistinguishable From Human Translation” and “GNMT reduces translation errors by more than 55%-85% on several major language pairs “. Hmmm, not really, not really at all, was what my pesky brain kept telling me, especially as I saw this announcement coming up again and again through many news channels, probably pushed heavily by the Google marketing infrastructure.

Surely the great Google of the original “Don’t Be Evil” ethos would not bullshit us thus. (In their 2004 founders’ letter prior to their initial public offering, Larry Page and Sergey Brin explained that their “Don’t be evil” culture prohibited conflicts of interest, and required objectivity and an absence of bias.) Apparently, Gizmodo already knew about the broken promise in 2012. My friend Roy told me that: Following Google’s corporate restructuring under the conglomerate Alphabet Inc. in October 2015, the slogan was replaced in the Alphabet corporate code of conduct by the phrase “Do the right thing”. However, to this day, the Google code of conduct still contains the phrase “Don’t be evil”. This ability to conveniently bend the rules (but not break the law) and make slippery judgment calls which are convenient to corporate interests, is well described by Margaret Hodge in this little snippet. Clearly, Google knows how to push self-congratulating, mildly false content through the global news gathering and distribution system by using terms like research and breakthrough with somewhat shaky research data that includes math, sexy flowcharts, and many tables showing “important research data”. They are after all the kings of SEO. However, I digress.

The basic deception I speak of, and yes I do understand that those might be strong words, is the overstatement of the actual results, using questionable methodology, in what I would consider an arithmetical manipulation of the basic data, to support corporate marketing messaging spin (essentially to bullshit the casual, trusting, but naïve reader who is not aware of the shaky foundations at play here and of statistics in general). Not really a huge crime, but surely just a little bit evil and sleazy. Not quite the Wells Fargo, Monsanto, Goldman Sachs and Valiant & Turing Pharmaceuticals (medical drug price gouging) level of evil and sleazy but give them time and I am sure they could rise to this level, and they quite possibly will step up their sleaze game if enough $$$s and business advantage issues are at stake. AI and machine learning can be used for all kinds of purposes both sleazy or not as long as you have the right power and backing.

So basically I see three problems with this announcement:

  1. Arithmetic manipulation to create the illusion of huge progress. (Like my use of font size and bold to make the word huge seem more important than it is.)
  2. Questionable human evaluation methodology which produces rating scores that are then further arithmetically manipulated and used to support the claims of “breakthrough” progress. Humans are unlikely to rate 3 different translations of the same thing on a scale of 0 to 6 (crap to perfect) accurately and objectively. Ask them to do it 500 times and they are quite likely to give you pretty strange and illogical results. Take a look at just four pages of this side-by-side comparison and see for yourself. The only case I am aware of where this kind of a translation quality rating was done with any reliability and care was by Appen Butler Hill for Microsoft. But translation raters there were not made to make a torturous comparison of several versions of the same translation, and then provide an independent rating for each. Humans do best (if reliable and meaningful deductions are sought from the exercise) when asked to compare two translations and asked a simple and clear question like: “Which one is better?” Interestingly even the Google team noted that: “We observed that human raters, even though fluent in both languages, do not necessarily fully understand each randomly sampled sentence sufficiently”. Yes, seriously dude, are you actually surprised? Methinks that possibly there is a kind of stupor that sets in when one spends too much time in the big data machine learning super-special room, that numbs the part of the brain where common sense resides.
  3. Then taking this somewhat shaky research and making the claim that the GNMT is “Nearly Indistinguishable From Human Translation” and figuring out a way to present this as “55% to 85%” improvements.

In the grand scheme of things, the transgressions and claims made in this Google announcement are probably a minor deception but I still think they should be challenged and exposed if possible, and if it is actually fair criticism. We live in a world where corporate malfeasance has become the norm of the day. Here we have a small example which could build into something worse. Monsanto, Well Fargo, Goldman Sachs do not have evil people (maybe just at the top) but they have a culture that rewards certain kinds of ethically challenged behavior if it benefits the company or helps you “make your numbers”. To me, this is an example in-kind and tells you something about the culture at Google.

We are still quite a long way from “Nearly Indistinguishable From Human Translation”. We need to be careful about overstating the definite and clear progress that actually has been made in this case. For some reason, this (overstatement of progress) is something that happens over and over again in MT. Keep in mind that drawing such sweeping conclusions on a sample of 500 is risky with big data applications (probably 250 Million+ sentence pairs) even when the sample is really well chosen and the experiment has an impeccable protocol which in this case is SIMPLY NOT TRUE for the human evaluation. The rating process is flawed to such an extent that we have to question some or many of the conclusions drawn here.The most trustworthy data presented here are the BLEU scores assuming it is truly a blind test set and no Google glasses were used to peek into the test.

This sentence was just pointed out to me after this post went live, thus, I am adding an update as an in place postscript. Nature (Journal of Science) provides a little more detail on the human testing process.

“For some other language pairs, the accuracy of the NMTS approached that of human translators, although the authors caution that the significance of the test was limited by its sample of well-crafted, simple sentences.

Would this not constitute academic fraud? So now we have them saying both that the “raters” did not understand the sentences and that the sentences were essentially made simple i.e. rigged. To get a desirable and publishable result? For most in the academic community this would be enough to give pause and reason to be very careful about making any claims, but of course not for Google who looks suspiciously like they did engineer “the results” at a research level.

These results are consistent with what SYSTRAN has reported (described here ), and actually are slightly less compelling at a BLEU score level than the results SYSTRAN has had as I explained above. (Yes Mike I know it is not the same training and test set.)

Now that I have gotten the rant off my chest, here are my thoughts on what this “breakthrough” might mean:

  • NMT is definitely proving to be a way to drive MT quality upward and forward but for now, is limited to those with deep expertise and access to huge processing and data resources.
  • NMT problems (training and inference speed, small vocabulary problem, missing words etc..) will be solved sooner rather than later.
  • Experiment results like these should be interpreted with care, especially if they are based on such an ambiguous human rating score system. Don’t just read the headline and believe them, especially when they come from people with vested interests e.g. Google.
  • Really good MT always looks like human translation, but what matters is how many segments in a set of 100,000 look like a human translation. We should save our “nearly indistinguishable” comments for when we get closer to 90% or at least 70% of all these segments being almost human.
  • The success at Google, overstated though it is, has just raised the bar for both the Expert and especially the Moses DIY practitioners, which makes even less sense now since you could almost always do better with generic Google or Microsoft who also has NMT initiatives underway and in production.
  • We now have several end-to-end NMT initiatives underway and close to release from Facebook, Microsoft, Google, Baidu, and Systran. For the short term, I still think that Adaptive MT is more meaningful and impactful to users in the professional translation industry, but as SYSTRAN has suggested, NMT “adapts” very quickly with very little effort with small volumes of human corrective input. This is a very important requirement for MT use in the professional world. If NMT is as responsive to corrective feedback as SYSTRAN is telling us, I think we are going to see a much faster transition to NMT.

As I said in a previous post on the Systran NMT: They have reason to be excited, but all of us in MT have been down this path before and as many of us know, the history of MT is filled with empty promises.

The marketing guy at Google who pushed this announcement in its current form should be asked to watch this video (NSFW link, do NOT click on it if you are easily offended) at least 108 times. The other guys should also watch it a few times too. Seriously, let’s not get carried away just yet. Let’s wait to hear from actual users and let’s wait to see how it works in production use scenarios before we celebrate.

As for the Google corporate motto, I think it has been already true for some time now that Google is at least slightly evil, and I recommend you watch the three-minute summary by Margaret Hodge to see what I mean. Sliding down a slippery slope is a lot easier than standing on a narrow steep ledge in high winds. In today’s world, power and financial priorities rule over ethics, integrity and principal and Google is just following the herd that includes their friends at Well Fargo, Goldman Sachs, VW, and Monsanto. I said some time ago that a more appropriate motto for Google today might actually be: “You Give, We Take.” Sundar should walk around campus in a T-shirt (preferably one made by underpaid child labor in Bangladesh) with this new motto boldly emblazoned on it in some kind of a cool Google font. At least then Google marketing would not have to go through the pretense of having any kind of ethical, objective or non-biased core which the current (original) motto forces them to contend with repeatedly. The Vedic long view of the human condition across eons, says we are currently living in the tail end of the Kali Yuga, an age of darkness and falsehood and wrong values. An age when charlatans (Goldman Sachs, VW, Monsanto, Wells Fargo and Google) are revered and even considered prophets. Hopefully, this era ends soon.

Florian Faes has made a valiant effort to provide a fair and balanced view on these claims from a variety of MT expert voices. I particularly enjoyed the comments by Rico Sennrich of the University of Edinburgh who cuts through the Google bullshit most effectively. For those who think that my rant is unwarranted, I suggest that you read the Slator discussion, as you will get a much more diverse opinion. Florian even has rebuttal comments from Mike Schuster at Google whose responses sound more than a little bit like the spokespersons at Well Fargo, VW and Goldman Sachs to me. Also, for the record, I don’t buy the Google claim “Our system’s translation quality approaches or surpasses all currently published results,” unless you consider only their own results. I am willing to bet $5 that both Facebook and Microsoft (and possibly Systran and Baidu) have equal or better technology. Slator is the best thing that has happened to the “translation industry” in terms of relevant current news and investigative journalism and I hope that they will thrive and succeed.

I remain willing to stand corrected if my criticism is unfounded or unfair, especially if somebody from Google sets me straight. But I won’t hold my breath, not until the end of the Kali Yuga anyway.

PEACE.

You can see this post with graphics at the original link: http://kv-emptypages.blogspot.com/2016/09/the-google-neural-machine-translation.html

--

--

K Vashee

Translation Technology, Collaboration, Dreamer, Better Digital Sound Technology, Driving Change -- Meaningful Work & Connection