A multistream multimodal foundation model for real-time voice-based application
Online workshop: "A multistream multimodal foundation model for real-time voice-based application", given by Patrick Pérez.
When: May 12th, 10:00 am
Where: online at the link: https://tinyurl.com/3wwvj7ks
Speaker: Patrick Pérez, CEO at Kyutai, Paris, France
Moderators: Raffaello Camoriano and Gabriele Rosi of DAUIN
The event is part of the "Ellis Turin Talk" online seminar series, organized in collaboration with the Artificial Intelligence Hub of the Politecnico di Torino and the Vandal Research Group of the Department of Control and Computer Engineering (DAUIN).
Abstract: As a unique way for humans to seamlessly exchange information and emotion, speech should be a key means for us to communicate with and through machines. This is not yet the case. In an effort to progress toward this goal, we introduce a versatile speech-text decoder-only model that can serve a number of voice-based applications. In particular, it has allowed us to build Moshi, the first-ever full-duplex spoken-dialogue system (with no latency and no imposed speaker turns), as well as Hibiki, the first simultaneous voice-to-voice translation model with voice preservation that can run on a mobile phone. This multistream multimodal model can also be turned into a visual-speech model (VSM) via cross-attention with visual information, which allows Moshi to freely discuss an image while maintaining its natural conversation style and low latency. This talk will provide an illustrated tour of this research.
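For readers unfamiliar with the mechanism mentioned in the abstract, below is a minimal sketch of how a decoder-only token stream can be conditioned on visual features via cross-attention. This is not Kyutai's implementation: the module names, dimensions, and the choice of PyTorch are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class VisuallyConditionedDecoderLayer(nn.Module):
    """Illustrative sketch only: a decoder layer whose token stream
    cross-attends to visual features. All names and sizes are assumptions,
    not the architecture presented in the talk."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tokens, visual_feats, causal_mask=None):
        # Causal self-attention over the speech/text token stream.
        h = self.norm1(tokens)
        tokens = tokens + self.self_attn(h, h, h, attn_mask=causal_mask,
                                         need_weights=False)[0]
        # Cross-attention: queries come from the token stream,
        # keys/values from the image embeddings.
        h = self.norm2(tokens)
        tokens = tokens + self.cross_attn(h, visual_feats, visual_feats,
                                          need_weights=False)[0]
        # Position-wise feed-forward block.
        return tokens + self.ffn(self.norm3(tokens))

# Tiny usage example with random tensors (hypothetical shapes).
layer = VisuallyConditionedDecoderLayer()
tokens = torch.randn(1, 16, 512)   # 16 decoder positions
image = torch.randn(1, 49, 512)    # e.g. a 7x7 grid of visual embeddings
out = layer(tokens, image)
print(out.shape)                   # torch.Size([1, 16, 512])
```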