MARS5 TTS is a novel open-source text-to-speech model released by Camb AI.
It offers fine-grained prosodic control and can clone a voice from less than 5 seconds of reference audio.
The model uses a two-stage architecture: a 750M-parameter autoregressive (AR) model followed by a 450M-parameter non-autoregressive (NAR) model.
MARS5 uses a byte-pair encoding (BPE) tokenizer, which enables precise control over punctuation, pauses, and stops.
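
To illustrate what a BPE tokenizer does, the following is a minimal sketch of byte-pair merging over characters. It is a toy trainer, not MARS5's actual tokenizer or vocabulary; the function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) and the sample corpus are assumptions made for illustration. Punctuation and spaces are treated as ordinary symbols, so they survive tokenization and remain available to the model.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy trainer)."""
    # Start from single characters; punctuation is a symbol like any other.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        if best_count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append(best_pair)
        sequences = [merge_pair(seq, best_pair) for seq in sequences]
    return merges


def apply_merges(text, merges):
    """Tokenize `text` by replaying the learned merges in the order they were learned."""
    seq = list(text)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


if __name__ == "__main__":
    merges = learn_bpe_merges(["wait, wait, go.", "wait. go, go."], num_merges=5)
    print(apply_merges("wait, go.", merges))
```

Joining the returned tokens always reproduces the input string, including its commas and periods, which is the property that lets punctuation act as a control signal.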
The model can replicate complex prosody, such as sports commentary, anime voices, and movie performances. The open-source release targets English, while Camb AI advertises support for 140+ languages through its hosted platform.
It supports two inference modes: a fast "shallow clone" that does not require a transcript of the reference audio, and a slower but higher-quality "deep clone" that uses that transcript.
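
The choice between the two modes can be sketched as a small decision helper. This is a hypothetical wrapper written for illustration; `CloneRequest` and `choose_clone_mode` are assumed names and do not reproduce the configuration interface of the MARS5 library itself. It only encodes the trade-off described above: deep clone is possible only when a reference transcript exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneRequest:
    """Hypothetical container for one synthesis request (not MARS5's API)."""
    text: str
    reference_audio_path: str
    reference_transcript: Optional[str] = None


def choose_clone_mode(request: CloneRequest, prefer_quality: bool = True) -> str:
    """Return "deep" when a transcript is available and quality is preferred, else "shallow"."""
    transcript = request.reference_transcript
    has_transcript = bool(transcript and transcript.strip())
    if prefer_quality and has_transcript:
        return "deep"
    # Without a transcript, or when speed matters more, fall back to the fast mode.
    return "shallow"


if __name__ == "__main__":
    req = CloneRequest("Hello there.", "ref.wav", "This is the reference clip.")
    print(choose_clone_mode(req))                        # deep
    print(choose_clone_mode(req, prefer_quality=False))  # shallow
```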
In the AR-NAR pipeline, the autoregressive transformer generates coarse speech features, which the NAR stage then refines using a Denoising Diffusion Probabilistic Model (DDPM).
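
The data flow of such a pipeline can be sketched with two stand-in functions: a sequential stage that emits one coarse code per frame, and a refinement stage that starts from noise and updates all frames together over several steps. This is a toy illustration of the control flow only. Neither function is a learned model, the refinement loop is a simple interpolation rather than a real DDPM, and all names, sizes, and formulas here are assumptions.

```python
import random


def ar_stage(text, num_frames, seed=0):
    """Stand-in for the AR model: emit one coarse code per frame, left to right."""
    rng = random.Random(seed)
    coarse = []
    for _ in range(num_frames):
        prev = coarse[-1] if coarse else len(text)
        # Each code depends on the previous one, which is what makes the stage autoregressive.
        coarse.append((prev * 31 + rng.randrange(7)) % 1024)
    return coarse


def nar_stage(coarse, num_steps=10, seed=0):
    """Stand-in for the NAR refiner: start from noise and update every frame at each step."""
    rng = random.Random(seed)
    fine = [rng.uniform(-1.0, 1.0) for _ in coarse]  # initial noise
    target = [c / 1024.0 for c in coarse]  # toy "clean" value implied by each coarse code
    for step in range(num_steps):
        blend = 1.0 / (num_steps - step)  # grows to 1.0 on the final step
        # All frames move toward their targets in the same step, not one after another.
        fine = [f + blend * (t - f) for f, t in zip(fine, target)]
    return fine


def synthesize(text, num_frames=8):
    """Run the coarse stage, then the refinement stage conditioned on its output."""
    coarse = ar_stage(text, num_frames)
    fine = nar_stage(coarse)
    return coarse, fine


if __name__ == "__main__":
    coarse_codes, fine_values = synthesize("Hello, world.")
    print(coarse_codes)
    print(fine_values)
```

The split mirrors the motivation for two stages: the sequential stage fixes the overall structure one frame at a time, while the second stage can process every frame in parallel because it is conditioned on a complete coarse sequence.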
Prosody can be steered through punctuation and capitalization in the input text: a comma, for instance, adds a pause, and a capitalized word is emphasized.
Its voice cloning and prosodic control make it suitable for applications in entertainment, education, and accessibility.
MARS5 TTS gives developers and researchers an open-source tool for generating high-quality, prosodically rich speech from minimal input.