A new synthetic media project called Riffusion combines text-to-image and audio generation. Users can describe music in words, hear an AI-generated clip matching the description, and see a visual representation of the sound.
Riffusion uses the Stable Diffusion synthetic image generator, fine-tuned for music, to create a sonogram (also called a spectrogram): a visual representation of sound as a graph with time on the horizontal axis and sound frequency on the vertical axis, as seen in the image above. Hobbyists Seth Forsgren and Hayk Martiros fine-tuned a custom Stable Diffusion model on real sonograms paired with metadata describing the sound and musical genre each one represents. The AI works as a specialized version of Stable Diffusion’s text-to-image generator, mimicking sounds and even producing odd mixes of them in the form of a visual. The sonogram can then be translated back into sound. Riffusion uses Torchaudio to read the frequency and time information from the image and turn it into playable audio.
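To make the sonogram idea concrete, here is a minimal sketch of how a grid with frequency on one axis and time on the other can be computed from raw audio samples. It is not Riffusion's code: it uses only the Python standard library and a naive Fourier transform, and the function name `spectrogram`, the frame size, the hop length and the 8 kHz sample rate are all assumptions chosen for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return magnitudes indexed as [frequency_bin][time_frame].

    Hypothetical helper for illustration only; real tools use a fast
    Fourier transform rather than this naive O(n^2) version.
    """
    columns = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper each frame with a Hann window to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform, keeping non-negative frequencies.
        column = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            column.append(abs(acc))
        columns.append(column)
    # Transpose so that rows are frequency bins and columns are time frames.
    return [list(row) for row in zip(*columns)]


SAMPLE_RATE = 8000  # assumed sample rate in Hz
# A pure 1,000 Hz test tone, 512 samples long.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]

spec = spectrogram(tone)
# Find the strongest frequency bin in the first time frame.
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

For the test tone, the brightest row of the grid sits at the bin corresponding to 1,000 Hz. Going the other way, from image to sound, is harder because a sonogram stores only magnitudes and discards phase, so the phase has to be estimated. One common approach is the Griffin-Lim algorithm, which Torchaudio provides as a transform.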
“Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts,” Martiros explained in a LinkedIn post. “There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.”
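The "latent space interpolation" Martiros mentions can be illustrated with a short sketch. The quote does not say which interpolation scheme the pipeline uses; spherical linear interpolation (slerp) is assumed here because it is a common choice for blending diffusion-model latents. The function name `slerp` and the two-element vectors standing in for real latents are hypothetical.

```python
import math


def slerp(t, a, b):
    """Spherically interpolate between vectors a and b for t in [0, 1].

    Assumed technique for illustration; not taken from Riffusion's code.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel vectors: plain linear interpolation is stable.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_theta = math.sin(theta)
    weight_a = math.sin((1 - t) * theta) / sin_theta
    weight_b = math.sin(t * theta) / sin_theta
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy stand-ins for the latents of two different prompts.
latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]

start = slerp(0.0, latent_a, latent_b)
midpoint = slerp(0.5, latent_a, latent_b)
end = slerp(1.0, latent_a, latent_b)
```

Stepping `t` from 0 to 1 produces a sequence of intermediate latents, each of which could be decoded into its own sonogram, so one clip morphs gradually into the next instead of cutting abruptly.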
The quality of the music is variable, but Riffusion is a notable combination of synthetic media tools and generative AI. Because each random seed yields a different image, the range of possible images, and thus sounds, is effectively unlimited. It’s also an interesting extension of the idea behind text-to-image AI services, since an image can be transformed in other ways once it has been generated. For instance, OpenAI’s Point-E tool creates three-dimensional objects from text prompts by first creating two-dimensional images and then translating them into three dimensions, and Meta’s Make-A-Video tool produces videos from text.
Eric Hal Schwartz is Head Writer and Podcast Producer for Voicebot.AI. Eric has been a professional writer and editor for more than a dozen years, specializing in the stories of how science and technology intersect with business and society. Eric is based in New York City.