Tuesday, 10 January 2023

How Diffusion Drives Generative AI

NAB

Text-to-image AI exploded last year as technical advances greatly enhanced the fidelity of art that AI systems could create. At the heart of these systems is a technology called diffusion, which is already being used to auto-generate music and video.

So what is diffusion, exactly, and why is it such a massive leap over the previous state of the art? Kyle Wiggers has done the research at TechCrunch.

We learn that earlier image-generating AI relied on generative adversarial networks, or GANs. These proved quite good at powering the first deepfake apps. For example, StyleGAN, an NVIDIA-developed system, can generate high-resolution head shots of fictional people by learning attributes like facial pose, freckles and hair.

In practice, though, GANs suffered from a number of shortcomings owing to their architecture, says Wiggers. The models were inherently unstable and also needed lots of data and compute power to run and train, which made them tough to scale.

Diffusion rode to the rescue. The tech has actually been around for a decade, but it wasn’t until OpenAI developed CLIP (Contrastive Language-Image Pre-Training) that diffusion became practical for everyday applications.

CLIP classifies data — for example, images — to “score” each step of the diffusion process based on how likely it is to be classified under a given text prompt (e.g. “a sketch of a dog in a flowery lawn”).
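To make that scoring step concrete, here is a minimal sketch of how an image can be scored against text prompts using the openly released CLIP weights via the Hugging Face transformers library. The model name, image path and prompts are illustrative, and this is not the exact code inside any particular image generator.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the publicly released CLIP checkpoint (illustrative choice of model).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate.png")  # e.g. one intermediate step of a diffusion run
prompts = ["a sketch of a dog in a flowery lawn", "pure random noise"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each prompt;
# softmax turns them into relative scores.
scores = outputs.logits_per_image.softmax(dim=-1)[0]
for prompt, score in zip(prompts, scores.tolist()):
    print(f"{score:.3f}  {prompt}")
```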

Wiggers explains that, at the start, the data has a very low CLIP-given score, because it’s mostly noise. But as the diffusion system reconstructs data from the noise, it slowly comes closer to matching the prompt.
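The loop that ties the two together is conceptually simple: denoise a little, score the result, nudge the sample toward a higher score, repeat. The toy sketch below uses stand-in functions in place of a real denoiser and a real CLIP scorer, so it runs on its own but only illustrates the structure of guided diffusion, not a production implementation.

```python
import torch

# Toy stand-ins: in a real system these would be a trained diffusion
# model (the denoiser) and CLIP (the scorer). They exist here only to
# make the shape of the guidance loop visible.
def denoise_step(x, t):
    # Pretend denoiser: nudges the noisy sample toward a fixed target.
    target = torch.zeros_like(x)
    return x + 0.1 * (target - x)

def prompt_score(x, prompt):
    # Pretend CLIP score: higher when the sample is closer to the target.
    return -x.pow(2).mean()

prompt = "a sketch of a dog in a flowery lawn"
x = torch.randn(3, 64, 64)  # start from pure noise

for t in reversed(range(50)):
    x = denoise_step(x, t)               # diffusion's reverse (denoising) step
    x = x.detach().requires_grad_(True)
    score = prompt_score(x, prompt)      # how well does x match the prompt?
    grad, = torch.autograd.grad(score, x)
    x = x + 0.05 * grad                  # nudge the sample toward a higher score

print(f"final score: {prompt_score(x, prompt).item():.4f}")
```

In a real system, the denoising step would come from a trained neural network that predicts and removes noise, and the guidance gradient would come from CLIP's image and text encoders rather than a toy function.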

“A useful analogy is uncarved marble — like a master sculptor telling a novice where to carve, CLIP guides the diffusion system toward an image that gives a higher score.”

OpenAI introduced CLIP alongside its image-generating system DALL-E. Since then, CLIP has made its way into DALL-E’s successor, DALL-E 2, as well as open source alternatives like Stable Diffusion.

So what can CLIP-guided diffusion models do? They’re quite good at generating art — from photorealistic imagery to sketches, drawings and paintings in the style of practically any artist.
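As a concrete example, the open source diffusers library wraps a Stable Diffusion checkpoint behind a few lines of Python. The model ID below is one publicly hosted checkpoint and a CUDA-capable GPU is assumed; this is just a quick way to try prompt-to-image generation, not a statement about how any particular product is built.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download a Stable Diffusion checkpoint and move it to the GPU
# (illustrative model ID; assumes a CUDA device is available).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate one image from a text prompt and save it.
image = pipe("a sketch of a dog in a flowery lawn", num_inference_steps=30).images[0]
image.save("dog_sketch.png")
```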

Researchers have also experimented with using guided diffusion models to compose new music. Harmonai, an organization with financial backing from Stability AI, the London-based startup behind Stable Diffusion, released a diffusion-based model, trained on hundreds of hours of existing songs, that can output clips of music. More recently, developers Seth Forsgren and Hayk Martiros created a hobby project dubbed Riffusion that uses a diffusion model cleverly trained on spectrograms — visual representations — of audio to generate tunes.
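To give a rough sense of that spectrogram trick, the snippet below turns an audio clip into a mel spectrogram image with librosa. The file names are placeholders, and this is not Riffusion’s actual preprocessing pipeline, just one common way to render audio as a picture a diffusion model could train on.

```python
import librosa
import numpy as np
from PIL import Image

# Load a short audio clip (placeholder path) and compute its mel spectrogram.
y, sr = librosa.load("clip.wav", sr=44100, mono=True)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=512)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Scale the decibel values to 0-255 and save as a grayscale image.
img = 255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
Image.fromarray(img.astype(np.uint8)).save("spectrogram.png")
```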

Researchers have also applied diffusion to generating videos, compressing images and synthesizing speech. Diffusion may eventually be superseded by a more efficient machine learning technique, but the exploration has only just begun.

 

