Vorbis: End-to-end multimodal speech recognition

Automatic speech recognition (ASR) has been studied intensely in recent decades, and this research has led to a wide range of applications (e.g., smart home assistants or automatic video transcription). However, ASR systems are still prone to errors when the recording conditions are difficult (noise, reverberation, accents) or when training data is scarce (low-resource languages). One way of improving an ASR system is to use extra information. In this project we aim to leverage the visual context, provided by images recorded at the same time the speech is uttered. Cases in which the two modalities (audio and vision) coexist include voice-based robot navigation and video-based content (documentaries, news, instructional videos). The objectives of the project center around building such a multimodal speech recognition system and showing empirically that it outperforms an audio-only ASR system. To obtain such a system we plan to use one of the key ingredients of deep learning, namely end-to-end learning. Developing an end-to-end multimodal system will allow us to jointly optimize all components for the final objective function and will facilitate various model combinations. A successful project will enable other scientific directions, such as automatic video summarization or semi-supervised multimodal learning.

This project is supported by the Romanian National Authority for Scientific Research and Innovation, UEFISCDI:
Project number: PN-III-P1-1.1-PD-2019-0918
Contract number: PD 97 / 2020
Period: 2020-09-01 – 2022-08-31
Research fields: machine learning, statistical data processing and applications using signal processing (e.g., speech, image, video)


The goal of the project is to improve the generated transcriptions of an automatic speech recognition system by incorporating visual information. Towards this goal, we have set three objectives:
  • O1. Automatic speech recognition using end-to-end learning
  • O2. Image understanding using end-to-end learning
  • O3. Multimodal end-to-end speech recognition by fusing the two types of end-to-end architectures (O1 and O2)
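
As a rough illustration of what fusing the two modalities in O3 could look like, the sketch below appends a single image embedding to every audio frame before the result would be passed to a recognizer. This is only one common fusion strategy (concatenation at the input), shown here as an assumption; the project description does not state which fusion method is used. The function name fuse_by_concatenation, the feature sizes, and the toy values are all hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_by_concatenation(audio_frames: List[Vector], image_embedding: Vector) -> List[Vector]:
    """Append the same image embedding to every audio frame.

    Hypothetical helper: concatenation is an assumed fusion strategy,
    not necessarily the one used in the project.
    """
    return [frame + image_embedding for frame in audio_frames]


# Toy inputs (made up): 3 audio frames with 2 features each,
# and one 2-dimensional image embedding describing the visual context.
audio_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
image_embedding = [1.0, 0.0]

# Each fused frame now carries both acoustic and visual features.
fused = fuse_by_concatenation(audio_frames, image_embedding)
```

In an end-to-end setting, the audio encoder, the image encoder, and the layers that consume the fused frames would all be trained jointly against the transcription objective, rather than being fixed in advance as in this toy example.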


  • D1. State of the art survey
  • D2. End-to-end automatic speech recognition system
  • D3. End-to-end image understanding system
  • D4. End-to-end multimodal system for automatic speech recognition
  • D5. Two conference articles and a journal article