A multistream multimodal foundation model for real-time voice-based application
Online seminar "A multistream multimodal foundation model for real-time voice-based application" given by Patrick Pérez.
When: May 12, 10:00 AM
Where: online at the link: https://tinyurl.com/3wwvj7ks
Speaker: Patrick Pérez, CEO at Kyutai, Paris, France
Moderators: Raffaello Camoriano and Gabriele Rosi (DAUIN)
The event is part of the "Ellis Turin Talk" online seminar series, in collaboration with the Artificial Intelligence Hub of Politecnico di Torino and the Vandal research group of the Department of Control and Computer Engineering (DAUIN).
Abstract: A unique way for humans to seamlessly exchange information and emotion, speech should be a key means for us to communicate with and through machines. This is not yet the case. In an effort to progress toward this goal, we introduce a versatile speech-text decoder-only model that can serve a number of voice-based applications. In particular, it has allowed us to build Moshi, the first-ever full-duplex spoken-dialogue system (with no latency and no imposed speaker turns), as well as Hibiki, the first simultaneous voice-to-voice translation model with voice preservation that can run on a mobile phone. This multistream multimodal model can also be turned into a visual-speech model (VSM) via cross-attention with visual information, which allows Moshi to freely discuss an image while maintaining its natural conversation style and low latency. This talk will provide an illustrated tour of this research.
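To make the "cross-attention with visual information" idea mentioned in the abstract concrete, the sketch below shows a minimal decoder block that interleaves causal self-attention over a token stream with cross-attention into a sequence of image features. This is only an illustrative toy in PyTorch, not Kyutai's implementation; all module names, dimensions, and the single-block structure are assumptions.

```python
# Minimal sketch (illustrative only): conditioning a decoder-only transformer
# on visual features via cross-attention. Sizes and structure are assumed.
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """One decoder block: causal self-attention over the token stream,
    then cross-attention into a sequence of image features."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, image_feats):
        # Causal mask so each position only attends to the past (decoder-only).
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal)[0]
        h = self.ln2(x)
        x = x + self.cross_attn(h, image_feats, image_feats)[0]  # inject visual context
        return x + self.ff(self.ln3(x))

# Toy usage: 16 speech/text tokens attending to 49 image patch features.
block = CrossAttnBlock()
tokens = torch.randn(1, 16, 512)
image_feats = torch.randn(1, 49, 512)
print(block(tokens, image_feats).shape)  # torch.Size([1, 16, 512])
```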