Some older Tradiling readers may remember the Babel fish. Among other incarnations, it was the name of the first automatic web translation service, launched in 1997 by AltaVista, a popular web search engine (1 BG: 1 year before Google). From 2003 to 2012 the service was available on the Yahoo! website.
The name of the service derived from the “Babel fish” fictional species in Douglas Adams’s book “The Hitchhiker’s Guide to the Galaxy”, which could be inserted into the human ear to instantly translate spoken languages.
This invention is a commonplace of science fiction, in which members of different cultures or even different species communicate with apparent ease. A tongue-in-cheek cameo of this kind of technology featured in the Doctor Who episode The Christmas Invasion (2005).
An excerpt of The Christmas Invasion on the official Doctor Who YouTube channel
Back down here on terra firma, we now seem to be on the cusp of viable software and devices that are capable of making the Babel fish dream a reality, as a result of extraordinary advances in language technology over recent years. The Babel fish scenario of 2022 is not, however, based on exotic fish or time-travelling police boxes, but rather on mobile apps of various kinds.
Take a look at Sightcall, for example, a multilingual speech-to-text video support app.
This is not science fiction. This kind of service is available now. Sightcall does not perform a complete L1 speech to L2 speech process and is not quite in real time. It is more like liaison interpreting with a pause between each turn in the dialogue. But it is very impressive and seems to meet the needs of commercial helpdesks. (A similar service is offered by Microsoft Translator.)
Moving back to the complete Babel fish speech-to-speech scenario, let’s take a look at the various processes involved.
Speech1 (audio) > Text1 (printed text) > Text2 (printed text) > Speech2 (audio)
- Speech1 (audio) / image
- Step 1: Automatic speech recognition (transcription) / Optical Character Recognition (OCR)
- Text1 (text)
- Step 2: Machine translation (MT)
- Text2 (text)
- Step 3: Text-to-speech synthesis (TTS)
- Speech2 (audio)
This is the traditional pipeline for automatic speech translation systems. Recently, the most promising research has focused on combining these steps into a single so-called “end-to-end” approach, in which audio content is translated in a single step, without separate transcription and machine translation stages. Nonetheless, in the following discussion, our purpose is not to identify the best speech-to-speech approaches but to survey the various inputs, processes and outputs that are of interest to users, including scenarios other than speech-to-speech translation.
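To make the wiring of this pipeline concrete, here is a minimal sketch in Python. Everything in it is a placeholder of our own devising, not any particular product's code; each stage function could be backed by any of the ASR, MT and TTS services discussed below.

```python
# A minimal sketch of the traditional three-step pipeline.
# All functions are hypothetical placeholders; plug in whichever
# ASR, MT and TTS services you have access to.

def speech_to_text(audio_path: str, source_lang: str) -> str:
    """Step 1: automatic speech recognition (Speech1 -> Text1)."""
    raise NotImplementedError("plug in an ASR service here")

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Step 2: machine translation (Text1 -> Text2)."""
    raise NotImplementedError("plug in an MT service here")

def text_to_speech(text: str, target_lang: str, audio_out: str) -> None:
    """Step 3: text-to-speech synthesis (Text2 -> Speech2)."""
    raise NotImplementedError("plug in a TTS service here")

def babel_fish(audio_in: str, audio_out: str, src: str, tgt: str) -> None:
    """Chain the three stages: Speech1 > Text1 > Text2 > Speech2."""
    text1 = speech_to_text(audio_in, src)
    text2 = translate(text1, src, tgt)
    text_to_speech(text2, tgt, audio_out)

# Example call, once the placeholders are filled in:
# babel_fish("question_en.wav", "answer_ca.mp3", src="en", tgt="ca")
```

An end-to-end system, by contrast, would replace the three calls inside babel_fish with a single model that maps source audio directly to target-language text or audio.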
Speech1 (audio) / image > Text1
Following the traditional pipeline, there may be some need for prior processing if Text1 is not available as digital text. For example, if Text1 is printed, it will need to be captured as a digital image and then processed via Optical Character Recognition (OCR). This kind of processing is offered for free by the Google Translate and Microsoft Translator apps. You can point your camera at a restaurant menu, for example, and read the descriptions of dishes in your own language.
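For readers who want to try this step themselves, the same image-to-text processing can be reproduced with open-source tools. The sketch below uses the Tesseract OCR engine via the pytesseract package, which is our own illustrative choice (it is not, as far as we know, the technology behind the Google or Microsoft apps), and the file name is invented.

```python
# Sketch: a photograph of a French menu -> machine-readable French text.
# Assumes Tesseract (with French language data) plus the pytesseract
# and Pillow packages are installed; "menu.jpg" is an invented example.
from PIL import Image
import pytesseract

image = Image.open("menu.jpg")
french_text = pytesseract.image_to_string(image, lang="fra")  # OCR step
print(french_text)  # this is Text1, ready for the translation engine
```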
When the source content is available as audio rather than text, we have called it Speech1 above. In this case, prior processing involves automatic transcription in the source language. Nowadays, automatic live transcription is available in a variety of popular apps. For example, the Zoom video conferencing platform offers live transcription of meetings in English. Other specialised multilingual transcription services include Trint and Sonix.
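It is also possible to run this transcription step locally with open-source models. The sketch below uses OpenAI's Whisper model as one illustrative option (not one of the services named above); the package name and the audio file are assumptions made for the example.

```python
# Sketch: transcribe an audio recording in its source language.
# Assumes the openai-whisper package (and ffmpeg) are installed;
# "interview_en.mp3" is an invented example file.
import whisper

model = whisper.load_model("base")             # a small multilingual model
result = model.transcribe("interview_en.mp3")  # language is auto-detected
print(result["text"])                          # Text1 for the next stage
```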
Text1 > Text2
This is the kernel of what we often refer to as automatic translation. It is widely available for free on mobile devices and on the web. Among the best-known services are Google Translate, Bing Microsoft Translator and DeepL. These platforms offer instant, usable (and for some language pairs, high-quality) translations of text between numerous languages.
Once Text1 is available in text format, it can be fed into an automatic translation engine. This kind of text-to-text process is the heart of the translation act and has been available for many years between diverse languages. It is only recently, however, that this process has become available in real time within communication platforms. For example, YouTube offers multilingual synchronised translation of any video that already has one subtitle track.
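By way of example, DeepL (mentioned above) exposes this text-to-text step through a web API with an official Python client. The sketch below reflects that client as we understand it and is not guaranteed to match the current release; it needs a (free-tier) API key, and the key string and sample sentence are placeholders.

```python
# Sketch: Text1 -> Text2 via the DeepL API.
# Assumes the official "deepl" Python package and a valid API key;
# the key and the example sentence are placeholders.
import deepl

translator = deepl.Translator("YOUR-DEEPL-API-KEY")
result = translator.translate_text(
    "The Babel fish is small, yellow and leech-like.",
    source_lang="EN",
    target_lang="ES",
)
print(result.text)  # Text2, in Spanish
```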
We can join these two processes together to create a Speech1 > Text2 process. This is what Google tantalisingly demonstrated last month with a prototype of its translation glasses. Wearing the device, the listener sees the translated text at the edge of the spectacles and can thus read a translation of conversational input in real time (“subtitles for the world”, as they say).
Text2 > Speech2
Once the text is available in the target language, all that remains is for it to be oralised, that is, converted to audio. Text-to-speech apps have made great strides over recent years and compellingly realistic synthetic voices are now available on a number of platforms, such as Speechelo and Google Cloud Text-to-Speech.
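As an illustration of this final step, Google Cloud Text-to-Speech (mentioned above) can be called from a short Python script. The sketch below follows the client library as we understand it; it requires a Google Cloud account and credentials, and the Spanish sentence and voice settings are simply one plausible choice.

```python
# Sketch: Text2 -> Speech2 with Google Cloud Text-to-Speech.
# Assumes the google-cloud-texttospeech package and valid Google Cloud
# credentials; the sentence and output file name are placeholders.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="El pez de Babel traduce al instante."),
    voice=texttospeech.VoiceSelectionParams(language_code="es-ES"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
with open("speech2.mp3", "wb") as f:
    f.write(response.audio_content)  # Speech2, ready to play
```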
And the Babel fish?
If all these processes can be put together in series, it is feasible to create a speech-to-speech translation machine. For this to be practical, however, there are pressing time constraints: the translated speech needs to be made available as soon as possible, particularly if the aim is for such a machine to facilitate short-turn dialogues.
In fact, one of the main constraints on the use of such a machine is logistical. If two people are conversing in the same physical space, there is a problem of merged audio channels. The output of the translation machine will mix with the original voice input, creating a multilingual cacophony.
In the science fiction scenario, the spoken translation is not heard in the physical space, but rather inside the listener’s head (without even a vestige of the source sound). The illusion is created that there is no translation taking place at all. Speakers are “simply” heard in the language of each listener.
In our present real-world scenario, the problem of merged audio channels is ever-present, and there is no solution in sight. For this reason, dialogue translation apps for mobile phones tend to operate rather like consecutive interpreting, with no overlap between the different audio channels. This undermines any illusion that translation is not involved and stretches out the translation event in time. There are many conversation translation apps that use this approach.
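To see why this feels like consecutive rather than simultaneous interpreting, here is a rough sketch of the turn-taking loop such apps follow. It reuses the placeholder functions from the pipeline sketch above and adds two more hypothetical helpers; it is our own illustration, not any particular app's code.

```python
# Sketch: a turn-by-turn dialogue translator (consecutive, not simultaneous).
# Reuses the placeholder functions speech_to_text, translate and
# text_to_speech from the pipeline sketch; record_turn and play are
# likewise hypothetical helpers.

def record_turn(lang: str) -> str:
    """Record one complete spoken turn; return the audio file path (placeholder)."""
    raise NotImplementedError

def play(audio_path: str) -> None:
    """Play an audio file aloud to the other participant (placeholder)."""
    raise NotImplementedError

def translated_dialogue(lang_a: str, lang_b: str, turns: int) -> None:
    """Alternate speakers: each turn is transcribed, translated and spoken
    in full before the other participant replies."""
    src, tgt = lang_a, lang_b
    for _ in range(turns):
        audio_in = record_turn(src)               # wait for a complete turn
        text1 = speech_to_text(audio_in, src)     # Speech1 -> Text1
        text2 = translate(text1, src, tgt)        # Text1  -> Text2
        text_to_speech(text2, tgt, "turn.mp3")    # Text2  -> Speech2
        play("turn.mp3")                          # only now does the reply begin
        src, tgt = tgt, src                       # swap speaker roles
```

The cost, as noted above, is that the dialogue is stretched out in time and the presence of translation is impossible to hide.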
Curiously, the problem of merged audio is absent in online dialogues, because the participants are acoustically isolated, each participating from a separate location. In this arrangement online automatic simultaneous translation is already possible, at least in principle. Has anyone seen such a system in operation?
Postscript
The language services sector is extremely dynamic. The information in this article may contain unintentional errors. If so, please correct me in the Comments section. Whatever the case, the situation is changing rapidly and we should expect the unexpected sooner than we imagine! Here at Tradiling we will attempt to follow this exciting story.