news.bridge is dead, long live news.bridge!


news.bridge team members Hina Imran and Peggy van der Kreeft overseeing on-the-fly subtitle creation at GMF 2019 in Bonn

Hello and welcome to the project wrap-up post – which actually isn’t a project wrap-up post. Funding from the Google DNI officially ran out on June 30th and the consortium is about to disband, but work on news.bridge will continue (hooray!). There’s a new team (more on that later), and a decision to polish up a platform that is already quite shiny.

Lots of happy users

Deutsche Welle editors have been testing news.bridge in various use cases for several months now. As of June 2019, content in English, German, Hindi, Portuguese, Russian, and Spanish has been produced and evaluated. Test runs in Arabic, Farsi, Swahili, and Turkish are underway. External beta testers include a wide range of media companies from Euronews to 1&1/GMX.DE, but also international freelance journalists.

While news.bridge still has a couple of glitches and not all features have been implemented yet, the platform has been met with a lot of applause. More conservative testers called it “very good and very useful”, while feedback from enthusiasts sounded something like this: “Fantastic! Magic! When can I have this?”

news.bridge saves users a lot of time when re-versioning and adapting content for another language (or format).

HLT out and about

The fact that our consortium has been invited to dozens of language technology and innovation events also demonstrates the heightened interest in news.bridge. Recent event highlights include sessions, talks and presentations at the MESA Content Workflow Management Forum (London), ECIR (Cologne), MDN Workshop (Geneva), and, of course, our major live testing operation at GMF (Bonn).

2018 saw us speak at Subtech1 (Munich), the Workshop on Corpus Analysis of Time-Based Arts and Media (Berlin), a Google DNI Summit in Paris – and many other great events. In late November, we also partnered with the SUMMA project to host the Language Technology Hands-On Days in Bonn – another fruitful get-together that drew almost 100 participants.

New features, new services, new projects


Screenshot of the latest news.bridge user interface.

As news.bridge is growing internally and externally into what could become a proper SaaS platform, we’re working on redesigning the GUI, adding new functionalities (e.g. with regard to file import/export), and implementing new APIs: The EU’s MT service eTranslation is already available (albeit currently only to users at public institutions), and Microsoft has signaled their interest in adding some of the Azure Cognitive Services to our HLT mash-up solution.
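Under the hood, a mash-up platform like this typically hides heterogeneous vendor APIs behind one common interface, so a new MT or ASR provider can be plugged in without touching the rest of the system. Here’s a minimal, purely illustrative sketch of that pattern in Python; all names are invented for illustration and none of this is news.bridge’s actual code:

```python
class ServiceRegistry:
    """Toy provider registry: services from different vendors sit
    behind one common (task, provider) lookup."""

    def __init__(self):
        self._providers = {}  # (task, provider_name) -> callable

    def register(self, task, name, fn):
        self._providers[(task, name)] = fn

    def run(self, task, name, payload):
        return self._providers[(task, name)](payload)


registry = ServiceRegistry()
# A dummy "MT engine" standing in for a real vendor API:
registry.register("mt", "dummy-upper", lambda text: text.upper())
print(registry.run("mt", "dummy-upper", "hello bonn"))  # prints HELLO BONN
```

Adding another provider is then a single `register` call, which is the commercial advantage of the mash-up approach: no redesign, just a new adapter.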

In addition, news.bridge will be used as one of the output platforms for GoURMET, a major EU-funded project focusing on (better) machine translation for low-resource languages and domains. Hopefully, we’ll soon be able to offer you state-of-the-art MT for some very exotic language pairs.

Become a part of the news.bridge family

In case you were wondering if you can still join us as a beta tester: Yes, you can! Simply write to – and get a free, fully functional trial account. Furthermore, we’re always open to tech companies offering NLProc APIs (ASR, NMT, TTS). Our aim is to get as many high-quality services as possible under one umbrella.

Last, but not least, we’d like to thank everybody who has contributed to making news.bridge such a successful project. The past 18 months were a blast – and the future is wide open. HLT FTW!

DW uses news.bridge prototype for automated subtitling in day-to-day news operations



Right from the start, project news.bridge has been about creating a language technology platform fit for daily production workflows at modern media organizations. Our goal is to provide a collection of transcription, translation, subtitling and voice-over services that reporters and editors are able to use hassle-free. In the last couple of months, we’ve taken another big step towards this goal: DW newsrooms have started to work with the news.bridge prototype.

Kudos for pioneering spirit go to both the Hindi and Portuguese language teams, who were the first to use our (still unfinished) platform for the automated production of subtitles for program items on an actual publishing schedule. Needless to say, there were a couple of hiccups, but all in all the subtitling process went really well, and the results speak for themselves.

DW Hindi used news.bridge to turn English audio into Hindi subtitles in a short web video on activists fighting against female genital mutilation in Guinea:


DW Portuguese for Africa worked with news.bridge to create Portuguese subtitles from German audio for an episode of Euromaxx’s “Baking Bread”. The video focuses on pão de milho, Portugal’s famous cornbread:


By now, a team of editors from different newsrooms has started adapting long-form videos from German and English. A series of documentaries on the celebrated Bauhaus art school will soon be published with Russian and Brazilian Portuguese subtitles, for instance. The Turkish, Indonesian and Swahili desks have also experimented with news.bridge.

Just a couple of days ago, the DW Brasil department announced they’re preparing the production of subtitles for no less than 12 (German-language) instalments of DW’s Reporter series. We think news.bridge will be just right for the heavy lifting in the translation process.

5 Questions with… Afonso Mendes


For our fourth and final post on the people behind news.bridge, we’ve had a chat with Afonso Mendes, head of R&D at Portuguese language tech company Priberam. His team has been building the platform’s sophisticated summarization features.

1) Afonso, when and how did you first get in touch with human language technology?

That was back in 1991, two years after Priberam was created. We joined forces with ILTEC (a linguistic institute here in Portugal) to build a spell checker and a grammar correction tool. At the time, I led the team that developed the first European Portuguese proofing tools for Microsoft Office – which were, by the way, later on licensed to Microsoft.

2) What is the most fascinating aspect of news.bridge?

I think it’s the benefit of having a selection of state-of-the-art tools that can be applied to a number of related, but different tasks – and the benefit of being able to integrate new tools without redesigning the system or incurring high development costs. This is also a key commercial advantage.

3) What is the project’s biggest challenge?

There are two major challenges: The first is to make news.bridge a tool used by a large number of professionals, to make it some sort of standard. The second challenge, more on the technology side, is to create a system that is eventually able to receive direct feedback from its users and thus help to constantly improve the algorithms of its tool portfolio.

4) Who’s in your team and what are they currently working on?

Three members of our R&D team are involved in the project: Sebastião Miranda is a Senior Software Developer and a Research Scientist specialized in information retrieval and deep learning for natural language understanding. He has overall responsibility for the summarization module used in news.bridge. David Nogueira is also a Senior Software Developer and Research Scientist, and he is currently working on named entity recognition, sentiment analysis and question answering. I am the head of R&D at Priberam, and as such I’m in charge of project management and coordination.

5) Where do you see news.bridge in five years?

I expect that an extended set of natural language understanding services will be plugged into the platform by then, increasing productivity even further. I also hope that news.bridge will have become a de facto standard for the creation of multilingual subtitles, voice-overs, and summaries.

news.bridge Goes Public Beta – Join Us for the Language Technology Hands-On Days in Bonn


Very good news: 10 months into our language tech innovation project, we’re ready to share what we’ve built – and we’d like your opinion on it!

So if you’re interested in testing our platform and finding out more about state-of-the-art transcription, translation, summarization and voice-over software, don’t forget to save the date:

November 21st & November 22nd

Our Language Technology Hands-on Days (with workshops focusing on automated subtitle creation) will take place at the DW headquarters in Bonn, Germany.

To join us, simply send an email to We’ll get back to you as soon as possible. Please note that space is limited, and early registration is advised.

The Language Technology Hands-on Days are part of a larger meeting held by SUMMA, our EU-funded sister project applying language technologies to media monitoring. In case you’d like to take a closer look, you can also register for the SUMMA user day (which takes place on November 20th).

More details and workshop schedules coming up soon.

We’re looking forward to seeing you in Bonn this November!


5 Questions with… Renārs Liepiņš


We’ve talked to Peggy van der Kreeft (who manages news.bridge for Deutsche Welle), and we’ve talked to Yannick Estève (who represents the LIUM computer scientists involved in the project). Now it’s time to have a chat with Renārs Liepiņš – without whom news.bridge would be little more than a white paper and a collection of wire frames. Renārs is Senior Research Scientist at IMCS UL and LETA, founder and CEO of MindFlux, and project lead for LETA.

1) Renārs, when and how did you first get in touch with human language technology?

I first learned about HLT when I was working on tools for the semantic web during my PhD years (2010-2015). At the time HLT reached a new level; the output started to become useful. After my PhD, in early 2016, I began to work for the SUMMA project, which is about combining multiple HLT modules and creating a unified pipeline for automated media monitoring. The success of SUMMA made me think about other interesting combinations of HLT tools, and thus the idea for news.bridge was born.

2) What is the most fascinating aspect of news.bridge?

Well, first of all, it’s great to build a platform that saves people a lot of cumbersome routine work. Videos, audio tracks, and scripts processed with news.bridge aren’t perfect – but they require only minor tweaks. It’s also fascinating to explore the options of human-computer cooperation. news.bridge depends on smart algorithms and smart editors. We get the best results when they work in tandem.

3) What is the project’s biggest challenge?

It’s actually a combination of challenges: We need to scale the system so it can be used in a production environment of a big broadcaster and extend the UI to handle more workflows – all while keeping the platform as simple to use as possible.

4) Who’s in your team and what are they currently working on?

The LETA team consists of Roberts Dargis, Didzis Gosko, Mikus Grasmanis, and myself. Roberts and Didzis take care of the backend development and the integration of new HLT modules from internal and external partners. Mikus is responsible for the UI and does most of the front-end coding. I’m the project lead, which means I handle coordination with other partners as well as overall system architecture and integration.

5)  Where do you see news.bridge in five years?

I hope that news.bridge will have become a mature platform that helps media companies expand their markets and provide truly multilingual news all around the world.

Meet us in London and Bonn

Our colleagues at the BBC are hosting two very interesting hackathons focused on human language technology in (news) media production, and we’re happy to spread the word:

textAV will take place in London on September 18th and 19th. It caters to “technologists, application developers, and practitioners working in the area of online audio and video, with a particular focus on the use of captions and transcripts to facilitate and speed up the production process”. For more information and a schedule, check out the textAV eventbrite page. Please note that web registration has already ended. A few tickets may still be available via

Summa #newsHACK (Pt. II), co-hosted by news.bridge partner Deutsche Welle, will take place in Bonn on October 9th and 10th. The central question of this event will be: “How could cutting-edge language processing technology transform your newsroom?” Developers, designers, and innovation managers will have access to the powerful Summa platform, which offers services for automated translation, entity extraction, topic detection, summarisation, and story clustering. Check out the Summa #newsHACK eventbrite page to learn more.

Several members of the news.bridge consortium will join the HLT design sprints in London and Bonn, so make sure to say hi if you’re also attending. Get in touch any time via or @newsbridge_hlt. We’re looking forward to seeing you at textAV and Summa #newsHACK!

From ASR to xml:tm – Human Language Technology Abbreviations Spelled Out!


Let’s face it: Media tech people live in a language filter bubble. Every community uses their own lingo – which already makes communication somewhat difficult. To make matters worse, communities go on to create a whole bunch of abbreviations, as their favorite terms are long and unwieldy. HLT pros – that’s human language technology professionals – are no exception here. The news.bridge consortium would like to try and fix the problem (or at least a part of it). That’s why we’ve created the following list of terms that spells out the cryptic language we use in our daily business.

Please note that all included abbreviations have a strict focus on speech, language, and translation. So even though deep neural networks (DNN) and hidden Markov models (HMM) are very important, we didn’t put them on the list, because they’re not exclusive to HLT. We also decided to go without full definitions (a lot of terms are actually self-explanatory once they’re spelled out), but rather include a number of links as well as information on whether an abbreviation represents a general term, a paradigm, or a file format. Ok, end of introduction. Here’s our list, we hope it’s useful:


  • ASR = Automatic speech recognition
  • AVR = Automatic voice recognition (rarely used alternative to ASR; stresses the recognition of the speaker’s voice rather than the speech itself)
  • CAT = Computer-assisted translation
  • DNT = Do not translate (used to flag proper names, trademarks, etc.)
  • EBMT = Example-based machine translation (an MT paradigm)
  • G11n = Globalization (“G”, followed by 11 letters, followed by “n”)
  • GILT = Globalization, internationalization, localization, translation
  • HLT = Human language technology (our umbrella term)
  • HPMT = Hierarchical phrase-based machine translation (a statistical MT approach)
  • ISG = the Industry Specification Group for localisation industry standards (= the successor of LISA) at the European Telecommunications Standards Institute (ETSI)
  • I18N = Internationalization (“i”, followed by 18 letters, followed by “n”)
  • LE = Language engineering (predecessor term for HLT/LT)
  • L10n = Localization (“l”, followed by 10 letters, followed by “n”)
  • LIS = Localization Industry Standard(s)
  • LISA = Localization Industry Standards Association
  • LSP = Language services provider
  • LT = Language technology (umbrella term; used by some instead of HLT)
  • MT = Machine translation
  • NMT = Neural machine translation (an MT paradigm)
  • NER = Named-entity recognition (= the extraction of the names of persons, organizations, locations, etc. from a text)
  • NLG = Natural language generation
  • NLP = Natural language processing
  • NLProc = Natural language processing (used by some to underline they’re not talking about neuro-linguistic programming, which is also referred to as NLP)
  • OLIF = Open Lexicon Interchange Format (= an open standard for the exchange of terminological and lexical data)
  • PBMT = Phrase-based machine translation (an MT paradigm)
  • POS tagging = Part-of-speech tagging (= the identification of words as nouns, verbs, adjectives, adverbs, etc.)
  • SBMT = Syntax-based machine translation (a statistical MT approach)
  • SLU = Spoken language understanding
  • SMT = Statistical machine translation (an MT paradigm)
  • SRX = Segmentation Rules eXchange (an enhancement of the TMX standard)
  • STT = Speech-to-text
  • T9n = Translation (“T”, followed by 9 letters, followed by “n”)
  • TBX = TermBase eXchange (a standard for exchanging terminological data)
  • TEP = Translate, edit, proofread
  • TM = Translation memory (= a database that stores translated sentences, paragraphs etc.)
  • TMM = Translation memory manager (= software tapping into a TM)
  • TMS = Translation memory system
  • TMX = Translation memory eXchange (= a standard that enables the interchange of translation memories between translation suppliers)
  • TQA = Translation quality assurance
  • TransWS = Translation web services (= a framework for the automation of localization processes via the use of web services)
  • TTS = Text-to-speech
  • TU = Translation unit (= a segment of text treated as a single unit of meaning)
  • UTX = Universal terminology eXchange (= a standard specifically designed for user dictionaries of MT)
  • WBMT = Word-based machine translation (a statistical MT approach)
  • WER = Word error rate
  • XLIFF = XML Localization Interchange File Format (= an XML-based industry standard for exchanging localization data between tools)
  • xml:tm = XML-based text memory (= an approach to translation memory based on the concept of “text memory”, which is a combination of author and translation memory)

This list is a work in progress. If you feel something important is missing, please drop us a line.

5 Questions with… Yannick Estève


Four partners, four areas of expertise, four teams with a distinctive set of skills. For the second part of this series of posts on the people behind news.bridge, we’ve talked to Yannick Estève, Professor of Computer Science at the University of Le Mans, and project lead for LIUM.

Yannick, when and how did you first get in touch with human language technology?

Well, first of all, I’ve always been a fan of science fiction. So most certainly, books and movies like “2001” had a big influence on me. When I became a student in the 1990s, I was fascinated by computer science, but also by the humanities. Working on human language technology seemed like an excellent way to satisfy both interests.

What is the most fascinating aspect of news.bridge?

In my opinion, the most fascinating aspect is that you can handle complex and powerful technologies like speech recognition, machine translation, speech generation, and summarization through a very simple user interface. The platform offers easy access to global information in a vast number of languages, and that’s really fantastic!

What is the project’s biggest challenge?

The biggest challenge is probably related to integration. We need to manage heterogeneous technologies and services from several companies – and come up with one smart, unified application.

Who’s in your team and what are they currently working on?

We have four core members in this project: Sahar Ghannay is a post-doc researcher and an expert on deep learning for speech and natural language processing. Antoine Laurent is an assistant professor; his focus is on speech recognition. Natalia Tomashenko is a research engineer; she’s all about deep learning and acoustic model adaptation for speech recognition. Well, and I’m the professor and project lead; my expertise lies in speech and language technologies and deep learning.

We’re all members of LIUM, which can safely be called an HLT stronghold. For the last five years, our main research interest has been deep learning applied to media technology. Currently, we mainly work on neural end-to-end approaches for different tasks related to speech and language. End-to-end neural means that a single neural model processes the input (for example an audio signal containing speech) to generate the output (text), whereas in the “classical” pipeline, we apply different sequential systems and models between the input and the final output.

Where do you see news.bridge in five years?

In five years, news.bridge will have even better integrated services, cover even more languages, and offer new functionalities, like the smooth extraction of semantic information. Progress in HLT is very fast, and we still haven’t realized the full potential of the deep learning paradigm. Increasing computation power and training data is just a first step here.

LIUM publishes new papers on named entity extraction and speech recognition


news.bridge partner LIUM has released two new academic papers on computation and language: End-to-end named entity extraction from speech and TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. The papers were collaboratively written by Antoine Caubrière, Yannick Estève, Sahar Ghannay, Antoine Laurent, Natalia Tomashenko (University of Le Mans), François Hernandez and Vincent Nguyen (Ubiqus), and Emmanuel Morin (University of Nantes).

End-to-end named entity extraction from speech shows that it’s possible to recognize named entities in speech with a deep neural network that directly analyzes the audio signal. This new approach is an alternative to the classical pipeline, which applies an automatic speech recognition (ASR) system first and subsequently analyzes the transcriptions. LIUM’s new method not only deals with speech recognition and entity recognition simultaneously, it also makes it possible to obtain named entities only — and ignore the other words. The approach is interesting for at least two reasons:

  1. The system is easier to deploy (because you only need to set up a neural net).
  2. Performance will most likely be better (because the neural net is optimized for named entity extraction, whereas in the classical pipeline the different tools are not jointly optimized for the same task).
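To make the idea more concrete, here’s a toy sketch in Python: if an end-to-end model is trained to emit entity markers directly in its character output, extracting the entities boils down to a simple parse of that output, with no separate NER stage. The tag format and the example strings below are invented for illustration; they are not LIUM’s actual output format.

```python
import re


def extract_entities(tagged_output):
    """Pull (type, text) pairs from a hypothetical tagged transcript.

    Assumes the network emits markers like "<per ...>" for persons,
    "<loc ...>" for locations and "<org ...>" for organizations
    directly in its character stream.
    """
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(r"<(per|loc|org) ([^>]+)>", tagged_output)]


# Invented example; real output would come from decoding audio directly.
hyp = "yesterday <per angela merkel> met <per emmanuel macron> in <loc paris>"
print(extract_entities(hyp))
# -> [('per', 'angela merkel'), ('per', 'emmanuel macron'), ('loc', 'paris')]
```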

End-to-end named entity extraction from speech is available on arXiv under this link.

TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation is about describing (and providing) a new LIUM TED talk corpus as well as documenting experiments with it.

TED-LIUM basically pursues two aims: Train and improve acoustic models — and fix flawed TED talk subtitles while at it. Via its own ASR system, tailor-made for processing TED talks, LIUM creates new transcriptions of the original audio. These transcriptions are then compared to the old, often inferior subtitles. LIUM keeps reliable segments, discards unreliable material, applies some heuristics, and finally provides the subtitles in file formats used by the international speech recognition community.
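The “keep reliable segments” step can be illustrated with a toy sketch using Python’s difflib: word runs where a fresh ASR transcript and the old subtitles agree are treated as reliable, everything else as suspect. The actual LIUM heuristics are considerably more elaborate, and the word sequences below are invented.

```python
from difflib import SequenceMatcher


def reliable_segments(asr_words, subtitle_words, min_len=3):
    """Keep word runs where the ASR transcript and the old subtitles
    agree; runs shorter than min_len are ignored as likely coincidental."""
    matcher = SequenceMatcher(a=asr_words, b=subtitle_words)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:
            segments.append(asr_words[block.a:block.a + block.size])
    return segments


asr = "so today i want to talk about machine translation".split()
subs = "today i want to talk about translation".split()
print(reliable_segments(asr, subs))
# -> [['today', 'i', 'want', 'to', 'talk', 'about']]
```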

One of the (rather surprising) insights of the paper is that when transcribing oral presentations like TED talks, augmenting the training data from 207 hours to 452 hours (an increase of roughly 118%) doesn’t significantly affect a state-of-the-art ASR system (i.e. hidden Markov models coupled with deep neural networks, using a pipeline of different processes: speaker adaptation, acoustic decoding, language model rescoring). The word error rate (WER) dropped by a mere 0.2 percentage points, from (an already low) 6.8% to 6.6%. The system seems to have reached a plateau.

However, training data augmentation clearly benefits emergent ASR systems (fully neural end-to-end architecture, only one process, no speaker adaptation, no heavy language model rescoring). In this case, the same augmentation of training data led to a 4.8 percentage-point drop in the WER. At 13.7%, it’s still rather high, but significantly lower than the 18.5% achieved by a Markov-based system at a comparable development stage in 2012.
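For reference, WER is simply the word-level edit (Levenshtein) distance between hypothesis and reference, divided by the length of the reference. A minimal, self-contained implementation looks like this (the example sentences are our own, not from the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```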

Conclusion: Emergent fully neural ASR systems aren’t bad at all, are very sensitive to training data augmentation, and can probably be pushed further. The big question in this context: How much data does it take to reach competitive results?

TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation is available on arXiv under this link and was also submitted to and accepted by SPECOM 2018.

Meet us in Bonn

We’re thrilled to announce that we’re a part of this year’s Global Media Forum (GMF), which will take place in Bonn from June 11th to June 13th.

GMF, hosted by news.bridge partner Deutsche Welle (DW), is an annual get-together of representatives from the fields of journalism, digital media, politics, culture, business, development, academia and civil society. Their main concern: analyze global media development, tackle problems, brainstorm solutions.

In 2018, GMF will be all about “global inequalities” — which can also be reduced by breaking down language barriers and fostering polyglot public service broadcasting, for example with state-of-the-art HLT tools.

Make sure to catch our session:

news.bridge: Automated translation – are we there yet?
Wednesday, June 13th, 11:30h to 12:00h
World Conference Center (Rondel)

Our team will be in Bonn for the entire conference, so feel free to drop us a line (via email or Twitter) and have a chat with us. We’re looking forward to seeing you at the GMF!

More info: