SenseVoice is a speech foundation model developed as part of the FunAudioLLM framework, designed to enhance natural voice interactions between humans and large language models (LLMs).
It offers multiple speech understanding capabilities, including:
- Automatic Speech Recognition (ASR)
- Language Identification (LID)
- Speech Emotion Recognition (SER)
- Audio Event Detection (AED)
SenseVoice comes in two main variants:
- SenseVoice-Small: An encoder-only speech foundation model for fast speech understanding.
- SenseVoice-Large: An encoder-decoder speech foundation model for more accurate speech understanding with support for more languages.
Key features of SenseVoice include:
- Support for over 50 languages, with the widest coverage in SenseVoice-Large
- Low inference latency, particularly with the encoder-only SenseVoice-Small
- Ability to detect audio events such as music, applause, and laughter
- Emotion recognition, including categories like happy, angry, and sad
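To make the combination of recognition, language, emotion, and event outputs more concrete, here is a minimal sketch of how an application might split a tagged transcript into separate fields. It assumes the model output is a single string in which labels appear as `<|...|>` tags in front of the text; that layout, the label sets below, and the helper name `parse_tagged_transcript` are illustrative assumptions, not a documented SenseVoice interface.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed label inventories, for illustration only; the real model may use others.
LANGUAGES = {"zh", "en", "yue", "ja", "ko"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENTS = {"BGM", "Applause", "Laughter"}

# Matches one tag of the assumed form <|label|> and captures the label.
TAG = re.compile(r"<\|([^|]*)\|>")


@dataclass
class Utterance:
    language: Optional[str] = None
    emotion: Optional[str] = None
    events: List[str] = field(default_factory=list)
    text: str = ""


def parse_tagged_transcript(raw: str) -> Utterance:
    """Split a tagged transcript into language, emotion, events, and plain text."""
    result = Utterance()
    for label in TAG.findall(raw):
        if label in LANGUAGES and result.language is None:
            result.language = label
        elif label in EMOTIONS:
            result.emotion = label
        elif label in EVENTS:
            result.events.append(label)
        # Any other tag is dropped from the structured result.
    # The transcript itself is whatever remains once all tags are removed.
    result.text = TAG.sub("", raw).strip()
    return result


example = parse_tagged_transcript("<|en|><|HAPPY|><|Laughter|>That was great!")
print(example)
```

Keeping the label sets as plain constants makes it easy to adjust them once the actual tag inventory of the chosen model is known.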
The SenseVoice model and its related resources are open-sourced and available on various platforms:
- GitHub: The FunAudioLLM organization hosts repositories related to SenseVoice, including training, inference, and fine-tuning code.
- ModelScope: Offers pre-trained SenseVoice models for download and use.
- Hugging Face: Provides access to SenseVoice models, such as the SenseVoiceSmall variant.
Developers can integrate SenseVoice into their projects using the provided APIs and tools, enabling applications like speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.
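As one possible integration path, the sketch below loads a SenseVoice checkpoint through the FunASR toolkit's `AutoModel` class and transcribes a single file. The text above does not name a specific toolkit, so the use of FunASR, the model identifier `iic/SenseVoiceSmall`, and the chosen arguments are all assumptions to verify against the current documentation; `load_model` and `transcribe` are hypothetical helper names.

```python
def load_model(model_id: str = "iic/SenseVoiceSmall"):
    """Load a SenseVoice checkpoint via FunASR (assumed toolkit and model id)."""
    # Imported lazily so the rest of this module works without FunASR installed.
    from funasr import AutoModel

    return AutoModel(model=model_id, trust_remote_code=True)


def transcribe(model, audio_path: str, language: str = "auto") -> str:
    """Run recognition on one audio file and return the raw text output."""
    # generate() is assumed to return a list of dicts, each with a "text" entry.
    results = model.generate(input=audio_path, language=language, use_itn=True)
    return results[0]["text"] if results else ""
```

A caller would typically load the model once and reuse it, for example `model = load_model()` followed by `transcribe(model, "sample.wav")`, where `sample.wav` is a placeholder path.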