Fish Speech is an advanced open-source text-to-speech (TTS) model developed by Fish Audio.
Multilingual capability: Fish Speech V1 has been trained on 150,000 hours of audio data, including 50,000 hours each of English, Chinese, and Japanese speech.
Model size and versions: The developers plan to release both Medium (400M parameters) and Large (1B parameters) versions of the pretrained and fine-tuned models.
Speed: Fish Speech is fast, reportedly running at approximately 20 tokens per second. On an RTX 4090 GPU, it is cited as generating around 20 seconds of audio per second of compute.
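To make the throughput figure concrete, here is a small arithmetic sketch that takes the quoted rate of about 20 seconds of audio per second of compute at face value. The function name and the 60-second clip length are illustrative assumptions, not measurements from Fish Speech.

```python
def estimated_generation_time(audio_seconds: float, speedup: float) -> float:
    """Estimate wall-clock seconds needed to synthesize a clip.

    speedup is seconds of audio produced per second of compute
    (about 20 on a 4090, per the figure quoted above).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical example: a one-minute clip at the quoted 20x rate.
one_minute_clip = estimated_generation_time(60.0, 20.0)
print(one_minute_clip)  # 3.0
```

Under that assumption, a one-minute clip would take roughly three seconds to generate. Real timings will depend on hardware, batch size, and model version.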
Open-source and customizable: Because the model is open-source, users can fine-tune it on their own data.
Technical innovations: Fish Speech incorporates several technical advancements, including:
Improving compression rates using techniques like FSQ (Finite Scalar Quantization)
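For readers unfamiliar with FSQ, the general idea can be sketched in a few lines: each dimension of a latent vector is squashed into a bounded range and rounded to one of a small, fixed number of integer levels, so the codebook is implicit rather than learned. The sketch below is a minimal pure-Python illustration of that general technique, not Fish Speech's actual implementation. The level counts `[5, 5, 3]` and the input values are hypothetical, only odd level counts are handled, and the straight-through gradient trick used during training is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Round each scalar in z to one of levels[i] integer values.

    Assumption: every level count is odd, so the codes are
    symmetric around zero (e.g. 5 levels -> -2, -1, 0, 1, 2).
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = n_levels // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Pack per-dimension codes into one integer index (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + n_levels // 2)
    return index


levels = [5, 5, 3]  # hypothetical; implicit codebook size is 5 * 5 * 3 = 75
codes = fsq_quantize([0.1, -2.0, 0.9], levels)
print(codes)                     # [0, -2, 1]
print(fsq_index(codes, levels))  # 32
```

Because the codebook is just the product of the per-dimension levels, there are no codebook vectors to learn or to collapse, which is one reason the technique is attractive for compact audio tokenizers.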
License: The model is released under the CC-BY-NC-SA-4.0 license, with the source code under the BSD-3-Clause license.
Performance: Users have reported that Fish Speech performs well, with some noting that it's "much better than other TTS" options. However, they also report minor pronunciation issues in certain languages and occasional hallucinations on single words.
Cloning capability: Experiments have shown that Fish Speech can effectively clone a person's speaking style in English, Chinese, and Japanese with just 30 minutes of data.
Fish Speech represents a significant advancement in open-source TTS technology, offering high-quality, multilingual speech synthesis with the flexibility for further customization and improvement.