Seed-TTS is a family of large-scale autoregressive text-to-speech (TTS) models developed by ByteDance that can generate highly natural and expressive speech from text.
Key Innovations of Seed-TTS
- A novel text encoding approach that allows the models to better capture the nuances of human speech
- The ability to control various speech attributes like emotion, speaking style, and audio quality
- State-of-the-art speaker similarity and naturalness that match human speech, as demonstrated by both objective and subjective evaluations
- Even higher subjective scores on these metrics when the models are fine-tuned
- A self-distillation method for speech factorization and reinforcement learning to enhance model robustness, speaker similarity and controllability
- A non-autoregressive variant called Seed-TTS DiT that utilizes a fully diffusion-based architecture, performs end-to-end speech generation without pre-estimated phoneme durations, and achieves comparable performance to the autoregressive variant
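The autoregressive and diffusion-based variants differ mainly in their generation loops: the former emits speech tokens one at a time, each conditioned on the tokens produced so far, while the DiT variant starts from noise over the whole sequence and refines every frame jointly at each denoising step. A toy sketch of the two loop structures (all function names and internals are illustrative stand-ins, not the Seed-TTS implementation):

```python
import random

def ar_generate(n_tokens, vocab_size=8, seed=0):
    """Autoregressive sketch: emit one speech token at a time,
    each step conditioned on the tokens generated so far."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_tokens):
        # stand-in for model(text, tokens) -> next-token sample
        tokens.append(rng.randrange(vocab_size))
    return tokens

def diffusion_generate(n_frames, n_steps=4, seed=0):
    """Non-autoregressive (DiT-style) sketch: start from noise over
    the full sequence and refine all frames jointly at each step,
    with no pre-estimated per-phoneme durations."""
    rng = random.Random(seed)
    frames = [rng.random() for _ in range(n_frames)]  # pure noise
    for _ in range(n_steps):
        # stand-in for one text-conditioned denoising step
        frames = [f * 0.5 for f in frames]  # every frame updated together
    return frames
```

The key structural point: the autoregressive loop's cost grows with sequence length one token at a time, while the diffusion loop runs a fixed number of refinement passes over the entire sequence.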
The Seed-TTS architecture consists of a text encoder, an audio decoder, and conditioning modules. It serves as a foundation model for speech generation and excels at in-context learning, producing speech in a new speaker's voice from only a short reference clip. The models are trained on large-scale speech data to produce diverse, expressive speech that is virtually indistinguishable from human speech.
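The encoder/decoder/conditioning split described above can be sketched as a minimal pipeline. This is a structural illustration only, assuming hypothetical component names; it is not the Seed-TTS API:

```python
from dataclasses import dataclass

@dataclass
class Conditioning:
    """Hypothetical conditioning inputs for the decoder."""
    speaker_embedding: list  # e.g. derived from a short reference clip
    style: str               # e.g. "neutral", "happy"

def text_encoder(text):
    # Stand-in text encoder: map characters to integer features.
    return [ord(c) % 97 for c in text]

def audio_decoder(text_feats, cond):
    # Stand-in decoder: combine text features with conditioning
    # to produce one "frame" per input feature.
    bias = sum(cond.speaker_embedding)
    return [t + bias for t in text_feats]

def synthesize(text, cond):
    # Full pipeline: text encoder -> conditioned audio decoder.
    return audio_decoder(text_encoder(text), cond)
```

Swapping the `Conditioning` fields (speaker embedding, style) changes the output without retraining the pipeline, which mirrors how conditioning modules let a single foundation model control speaker identity and speaking style.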