Moshi: a speech-text foundation model for real-time dialogue

Kyutai Technical Report, September 18, 2024

Alexandre Défossez (alex@kyutai.org), Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour (neil@kyutai.org)
Kyutai

Equal contribution.

Abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning, such as emotion or non-speech sounds, is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user as parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner Monologue” method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech.
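To make the multi-stream layout described above concrete, the following is a minimal sketch of how the tokens predicted at one model time step could be arranged: a time-aligned text token (the “Inner Monologue” prefix) followed by the codec tokens for Moshi's own stream and for the user's stream. The Frame and flatten names, the token values, and the number of residual-quantizer levels are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """All tokens associated with a single model time step (hypothetical layout)."""
    text: int                 # time-aligned text token, the Inner Monologue prefix
    moshi_audio: List[int]    # codec tokens for Moshi's own speech (one per RQ level)
    user_audio: List[int]     # codec tokens for the user's speech, a parallel stream

def flatten(frame: Frame) -> List[int]:
    # Within a step, the text token precedes the audio tokens, so the
    # textual "inner monologue" conditions the audio generated after it.
    return [frame.text, *frame.moshi_audio, *frame.user_audio]

# One hypothetical step with 8 residual-quantizer levels per audio stream.
step = Frame(text=421, moshi_audio=list(range(8)), user_audio=list(range(8, 16)))
print(flatten(step))  # [421, 0, 1, ..., 7, 8, ..., 15]

Under this arrangement, dropping explicit speaker turns follows naturally: both audio streams are always present at every step, so overlapping speech, interruptions and interjections are just particular patterns of the two streams rather than violations of a turn-taking protocol.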