Meta, the parent company of Facebook, unveiled two new tools for music and speech: MusicGen and Voicebox. The two generative AI models create authentic-sounding music and speech from text prompts.
Voicebox is a generative AI model for text-to-speech that can help with audio editing, sampling and styling. MusicGen is an open-source text-to-music AI that rivals Google’s MusicLM.
Meta claims that Voicebox can produce high-quality audio clips from scratch or edit pre-recorded samples. It can remove unwanted sounds like car horns or barking dogs while preserving the content and style of the audio, the company said in a blog post on Friday.
Voicebox: The ChatGPT for audio
Voicebox operates in much the same way as OpenAI’s ChatGPT or DALL-E. Instead of generating poetry or images, it creates audio clips from text prompts. The AI is trained on a large dataset of audio recordings and transcripts totaling more than 50,000 hours.
The training data includes public domain audiobooks in six languages: English, French, Spanish, German, Polish, and Portuguese. According to researchers from Meta, the multilingual capability gives Voicebox a wide range of exposure to different speakers and accents, as well as a deep understanding of the nuances of each language.
“Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech,” the researchers said.
Voicebox can match the style of an audio sample as short as two seconds and use it for text-to-speech generation, Meta says. The AI can also erase background noise and misspoken words from a recording, allowing users to fix it without having to re-record everything.
For instance, if a dog barks in the middle of a speech, Voicebox can be instructed to cut out the barking and regenerate the lost audio – much like an audio eraser. In the future, Meta said its AI could create custom voices for virtual assistants and metaverse characters.
“This type of technology could be used in the future to help creators easily edit audio tracks, allow visually impaired people to hear written messages from friends in their voices, and enable people to speak any foreign language in their own voice,” Meta said.
Introducing Voicebox, a new breakthrough generative speech system based on Flow Matching, a new method proposed by Meta AI. It can synthesize speech across six languages, perform noise removal, edit content, transfer audio style & more.
More details on this work & examples ⬇️
— Meta AI (@MetaAI) June 16, 2023
MusicGen Rivals Google’s MusicLM
A week earlier, Meta also launched MusicGen, a text-to-music model that can generate original music, similar to Google’s MusicLM. The model is open-source, meaning that anyone can use it freely to create anything from rock to pop music.
MusicGen is a transformer-based music generation model that can create short new pieces of music, around 12 seconds long, based on text prompts. Users can specify the genre of music they want and the mood they want to create, and MusicGen then generates a new piece based on that input.
According to Meta’s Audiocraft research team, the AI works by predicting the next section in a piece of music, just as a language model predicts the next characters in a sentence.
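The idea of predicting the next section from what came before can be illustrated with a toy example. The sketch below is not MusicGen's code: it swaps the transformer for a simple table of note-to-note counts, and the names `fit_bigram`, `generate`, and the sample `melody` are hypothetical, chosen only to show the generate-one-step-then-repeat loop.

```python
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Count, for each token, which tokens follow it in the training sequence."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length):
    """Greedy next-token decoding: keep appending the most frequent follower."""
    out = [start]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:
            break  # token never seen with a follower, so stop early
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical "training data": a short melody written as note names.
melody = ["C", "E", "G", "E", "C", "E", "G", "E"]
model = fit_bigram(melody)
print(generate(model, "C", 6))  # ['C', 'E', 'G', 'E', 'G', 'E']
```

A real system would predict over a much larger vocabulary of audio tokens and would typically sample from the predicted distribution rather than always taking the most likely continuation, which is why this toy version quickly falls into a repeating pattern.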
— Gabriel Synnaeve (@syhw) June 9, 2023
In a study, the researchers compared MusicGen to other music generation software, including Google’s MusicLM, Riffusion, Mousai, and Noise2Music. They found that Noise2Music was able to generate more “plausible” results, as measured by both objective and subjective metrics.
However, MusicGen scored highest for accurate musical concepts, audio-to-text alignment, and human-scored overall audio quality and accuracy. You can try out MusicGen online at Facebook’s Hugging Face page.