Text-to-image AI exploded last year as technical advances greatly enhanced the fidelity of art that AI systems could create. At the heart of these systems is a technology called diffusion, which is already being used to auto-generate music and video.
So what is diffusion, exactly, and why is it such a massive leap over the previous state of the art? Kyle Wiggers has done the research at TechCrunch.
We learn that earlier forms of AI technology relied on generative adversarial networks, or GANs. These proved pretty capable, powering the first deepfake apps. For example, StyleGAN, an NVIDIA-developed system, can generate high-resolution head shots of fictional people by learning attributes like facial pose, freckles and hair.
In practice, though, GANs suffered from a number of shortcomings owing to their architecture, says Wiggers. The models were inherently unstable and also needed lots of data and compute power to run and train, which made them tough to scale.
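For readers who want to see what “adversarial” means in practice, here is a minimal, purely illustrative PyTorch sketch (not StyleGAN, and not code from the article): a generator learns to mimic a simple data distribution while a discriminator learns to tell its output from the real thing, and the two are trained against each other.

    # Toy GAN training loop (illustrative sketch only): a generator mimics a
    # 2-D Gaussian while a discriminator tries to tell real from fake samples.
    import torch
    import torch.nn as nn

    gen = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> fake sample
    disc = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        real = torch.randn(64, 2) * 0.5 + torch.tensor([2.0, -1.0])  # "real" data
        fake = gen(torch.randn(64, 8))

        # Discriminator update: push real samples toward 1, fakes toward 0.
        d_loss = bce(disc(real), torch.ones(64, 1)) + bce(disc(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator update: try to fool the discriminator into scoring fakes as real.
        g_loss = bce(disc(fake), torch.ones(64, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

The instability Wiggers mentions shows up in exactly this loop: if either network gets too far ahead of the other, training stalls or collapses.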
Diffusion rode to the rescue. The tech has actually been around for a decade, but it wasn’t until OpenAI developed CLIP (Contrastive Language-Image Pre-Training) that diffusion became practical in everyday applications.
CLIP classifies data — for example, images — to “score” each step of the diffusion process based on how likely it is to be classified under a given text prompt (e.g. “a sketch of a dog in a flowery lawn”).
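To make that scoring step concrete, here is a rough sketch using the openly released CLIP weights through Hugging Face’s transformers library; the model name, file name and prompts are examples chosen for illustration, not anything from the article.

    # Score how well an image matches a text prompt with CLIP (illustrative sketch).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("candidate.png")  # e.g. the current, partially denoised image
    prompts = ["a sketch of a dog in a flowery lawn", "random noise"]

    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Higher values mean the image looks more like that prompt to CLIP.
    scores = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(prompts, scores[0].tolist())))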
Wiggers explains that, at the start, the data has a very low CLIP-given score, because it’s mostly noise. But as the diffusion system reconstructs data from the noise, it slowly comes closer to matching the prompt.
“A useful analogy is uncarved marble — like a master sculptor telling a novice where to carve, CLIP guides the diffusion system toward an image that gives a higher score.”
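In code, that guidance amounts to nudging each denoising step in the direction that raises CLIP’s score. The sketch below is conceptual pseudocode only: denoise_step and clip_score are hypothetical stand-ins for a real diffusion model and a real CLIP scorer, and production systems add plenty of detail this omits.

    # Conceptual sketch of CLIP-guided diffusion (hypothetical helpers, not a real library).
    import torch

    def clip_guided_sampling(denoise_step, clip_score, prompt, steps=50, guidance_scale=100.0):
        """denoise_step(x, t) -> a less noisy x; clip_score(x, prompt) -> scalar tensor."""
        x = torch.randn(1, 3, 256, 256)                 # start from pure noise
        for t in reversed(range(steps)):
            x = x.detach().requires_grad_(True)
            score = clip_score(x, prompt)               # how well does x match the prompt so far?
            grad = torch.autograd.grad(score, x)[0]     # direction that raises the CLIP score
            x = denoise_step(x, t).detach()             # one ordinary denoising step...
            x = x + guidance_scale * grad               # ...nudged toward the prompt
        return x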
OpenAI introduced CLIP alongside the image-generating system DALL-E. Since then, it’s made its way into DALL-E’s successor, DALL-E 2, as well as open source alternatives like Stable Diffusion.
So what can CLIP-guided diffusion models do? They’re quite good at generating art — from photorealistic imagery to sketches, drawings and paintings in the style of practically any artist.
Researchers have also experimented with using guided diffusion models to compose new music. Harmonai, an organization with financial backing from Stability AI, the London-based startup behind Stable Diffusion, released a diffusion-based model that can output clips of music after training on hundreds of hours of existing songs. More recently, developers Seth Forsgren and Hayk Martiros created a hobby project dubbed Riffusion that uses a diffusion model cleverly trained on spectrograms — visual representations of audio — to generate tunes.
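The trick behind Riffusion, treating sound as an image, is easy to sketch with standard audio tooling. The example below uses librosa to turn a clip into a mel spectrogram and approximately back again; the file names are placeholders, and the diffusion model that would actually generate new spectrograms is left out.

    # Turn audio into a mel spectrogram "image" and approximately invert it back
    # (illustrative sketch of the representation Riffusion-style models work on).
    import librosa
    import numpy as np
    import soundfile as sf

    y, sr = librosa.load("clip.wav", sr=22050)                    # placeholder input file
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # 2-D array: frequency x time
    mel_db = librosa.power_to_db(mel, ref=np.max)                 # log scale, image-like values

    # A diffusion model would be trained on (and generate) arrays shaped like mel_db.
    # To listen to such an array, convert it back to a waveform approximately:
    mel_power = librosa.db_to_power(mel_db, ref=np.max(mel))
    audio = librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)
    sf.write("reconstructed.wav", audio, sr)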
Researchers have also applied diffusion to generating videos, compressing images and synthesizing speech. Diffusion may yet be replaced by a more efficient machine learning technique, but the exploration has only just begun.