CosyVoice is a family of fundamental speech generation models developed as part of the FunAudioLLM framework.
Multi-lingual voice generation: CosyVoice supports five languages - Chinese, English, Japanese, Cantonese, and Korean.
Zero-shot adaptation: It can adapt to new speakers without additional training, using as little as 3 seconds of prompt speech.
Cross-lingual voice cloning: CosyVoice can replicate voices across different languages.
Emotional expression: The model is capable of creating emotionally resonant voices.
Instructional control: It offers fine-grained control over speaker identity, speaking style, and paralinguistic features through instructional text.
Open-source models: Three variants have been released - CosyVoice-base-300M, geared toward zero-shot and cross-lingual voice cloning; CosyVoice-instruct-300M, geared toward instruction-controlled, expressive generation; and CosyVoice-sft-300M, which is fine-tuned on a set of preset speakers.
Architecture: CosyVoice uses an autoregressive transformer to generate discrete speech tokens, an ODE-based diffusion model to reconstruct the Mel spectrogram from those tokens, and a HiFTNet-based vocoder for waveform synthesis.
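To make the three-stage data flow concrete, here is a minimal toy sketch in NumPy. It is not CosyVoice itself: the transformer, the ODE-based diffusion model, and the vocoder are replaced by tiny stand-in functions (random logits, a few Euler integration steps over a toy velocity field, and a sinusoidal frame renderer), so only the shapes and the token → Mel → waveform pipeline reflect the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: autoregressive transformer (stub) emits discrete speech tokens.
def generate_speech_tokens(n_tokens: int, vocab_size: int = 4096) -> np.ndarray:
    tokens = []
    for _ in range(n_tokens):
        logits = rng.normal(size=vocab_size)   # stand-in for transformer logits
        tokens.append(int(np.argmax(logits)))  # greedy decoding
    return np.array(tokens)

# Stage 2: ODE-based diffusion (stub) maps tokens to an 80-bin Mel spectrogram
# by integrating a toy velocity field from noise toward the conditioning signal.
def tokens_to_mel(tokens: np.ndarray, n_mels: int = 80, steps: int = 10) -> np.ndarray:
    x = rng.normal(size=(len(tokens), n_mels))  # start from Gaussian noise
    cond = (tokens[:, None] % 17) / 17.0        # toy token-derived conditioning
    for _ in range(steps):
        velocity = cond - x                     # toy field pulling noise toward cond
        x = x + velocity / steps                # Euler integration step
    return x

# Stage 3: vocoder (stub) renders each Mel frame as a short waveform chunk.
def mel_to_waveform(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    t = np.arange(hop) / hop
    chunks = [np.sin(2 * np.pi * (100 + 50 * frame.mean()) * t) for frame in mel]
    return np.concatenate(chunks)

tokens = generate_speech_tokens(25)
mel = tokens_to_mel(tokens)
wav = mel_to_waveform(mel)
print(tokens.shape, mel.shape, wav.shape)  # (25,) (25, 80) (6400,)
```

The key design point the sketch preserves is the decoupling: the transformer works in a discrete token space, the diffusion model turns those tokens into a continuous acoustic representation, and the vocoder is the only stage that touches raw audio.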
CosyVoice is designed to produce natural-sounding voices, making it an essential component of the FunAudioLLM framework for enhancing voice interactions between humans and large language models. Combined with SenseVoice and LLMs, it enables applications such as speech-to-speech translation, emotional voice chat, and expressive audiobook narration.