The inevitable has happened,
albeit a little sooner than expected. After all the hoopla surrounding
text-to-image AI generators in recent months, Meta is first out of the gate
with a text-to-video version.
Perhaps Meta wanted to establish some headline leadership in this space, because the results certainly aren't ready for primetime.
But as developments in text-to-image generation have shown, by the time you read this the technology will already have advanced.
Meta is giving the public only a glimpse of the tech it calls Make-A-Video. It is still being researched, with no hint of a commercial release.
“Generative AI
research is pushing creative expression forward by giving people tools to
quickly and easily create new content,” Meta stated in a blog post announcing
the new AI tool. “With just a few words or lines of text, Make-A-Video can
bring imagination to life and create one-of-a-kind videos full of vivid colors
and landscapes.”
In a Facebook
post, Meta CEO Mark Zuckerberg described the work as “amazing
progress,” adding, “It’s much harder to generate video than photos because
beyond correctly generating each pixel, the system also has to predict how
they’ll change over time.”
Examples on
Make-A-Video’s announcement page include “a young couple walking in heavy rain”
and “a teddy bear painting a portrait.” It also showcases Make-A-Video’s
ability to take a static source image and animate it. For example, a still
photo of a sea turtle, once processed through the AI model, can appear to be
swimming.
The key to Make-A-Video — and the reason it has arrived sooner than some experts anticipated — is that it builds on existing work in text-to-image synthesis, the same approach used by image generators like OpenAI's DALL-E. Meta announced its own text-to-image AI model in July.
According to Benj Edwards at Ars Technica, instead of training the Make-A-Video model on labeled video data (for example, captioned descriptions of the actions depicted), Meta took image synthesis data (still images paired with captions) and applied unlabeled video training data, so the model learns a sense of where a text or image prompt might exist in time and space. It can then predict what comes after the image and display the scene in motion for a short period.
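To make that training recipe a little more concrete, here is a toy Python (PyTorch) sketch of how a captioned-image objective and an unlabeled next-frame objective could be combined in one loop. Every model, dimension, and tensor below is an invented placeholder for illustration; it is not Meta's architecture or code.

```python
# Toy sketch (not Meta's actual code): the model sees two kinds of data,
# captioned still images for appearance and unlabeled clips for motion.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED = 32        # toy embedding size
FRAME = 64 * 64   # toy flattened 64x64 grayscale frame

text_encoder = nn.Linear(16, EMBED)      # stand-in for a real text encoder
image_decoder = nn.Linear(EMBED, FRAME)  # prompt embedding -> single frame
motion_model = nn.Linear(FRAME, FRAME)   # frame t -> predicted frame t+1

params = (list(text_encoder.parameters())
          + list(image_decoder.parameters())
          + list(motion_model.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(100):
    # 1) Labeled image data: caption embedding -> still frame (what things look like).
    caption = torch.randn(8, 16)       # placeholder caption features
    image = torch.randn(8, FRAME)      # placeholder target images
    loss_image = F.mse_loss(image_decoder(text_encoder(caption)), image)

    # 2) Unlabeled video data: predict frame t+1 from frame t (how things move),
    #    with no captions involved at all.
    clip = torch.randn(8, 16, FRAME)   # placeholder 16-frame clips
    loss_video = F.mse_loss(motion_model(clip[:, :-1]), clip[:, 1:])

    opt.zero_grad()
    (loss_image + loss_video).backward()
    opt.step()
```

At generation time the same division of labor applies: a prompt is decoded into a starting frame, and the motion component rolls that frame forward to produce a short clip.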
In Meta’s white paper, “Make-A-Video: Text-To-Video Generation Without Text-Video Data,” the researchers note that Make-A-Video is trained on pairs of images and captions as well as on unlabeled video footage. Training content was sourced from two datasets that, together, contain millions of videos spanning hundreds of thousands of hours of footage, including stock video footage created by sites like Shutterstock and scraped from the web.
The Verge’s James
Vincent shares other examples, but notes that they were all provided by
Meta. “That means the clips could have been cherry-picked to show the system in
its best light,” he says. “The videos are clearly artificial, with blurred
subjects and distorted animation, but still represent a significant development
in the field of AI content generation.”
The clips are no longer than five seconds (16 frames of video), generated at a resolution of 64 by 64 pixels and then upscaled to 768 by 768 by a separate AI model. They contain no audio but span a huge range of prompts.
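As a rough sketch of that two-stage layout, the Python (PyTorch) snippet below produces a stand-in 16-frame clip at 64 by 64 and hands it to a separate upscaling stage that outputs 768 by 768 frames. Both classes are placeholders invented for the example; the upscaler here is plain bilinear interpolation rather than a learned super-resolution model.

```python
# Toy sketch of the two-stage layout: a 16-frame clip at 64x64 is generated
# first, then a separate stage enlarges every frame to 768x768.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLowResGenerator(nn.Module):
    """Stand-in for the base model: prompt embedding -> 16 frames at 64x64."""
    def __init__(self, embed_dim=32):
        super().__init__()
        self.to_frames = nn.Linear(embed_dim, 16 * 64 * 64)

    def forward(self, text_embedding):
        frames = self.to_frames(text_embedding)
        return frames.view(-1, 16, 1, 64, 64)  # (batch, frames, channels, H, W)

class ToyUpscaler(nn.Module):
    """Stand-in for the separate upscaling model: 64x64 -> 768x768 per frame."""
    def forward(self, frames):
        b, t, c, h, w = frames.shape
        flat = frames.view(b * t, c, h, w)
        up = F.interpolate(flat, size=(768, 768), mode="bilinear",
                           align_corners=False)
        return up.view(b, t, c, 768, 768)

text_embedding = torch.randn(1, 32)             # placeholder prompt embedding
low_res = ToyLowResGenerator()(text_embedding)  # shape (1, 16, 1, 64, 64)
video = ToyUpscaler()(low_res)                  # shape (1, 16, 1, 768, 768)
print(video.shape)
```

Splitting generation and upscaling this way keeps the expensive video model small while a second stage handles the jump in pixel count.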
The researchers note that the model has many technical limitations beyond blurry footage and disjointed animation. Their training methods, for example, cannot learn information that a human could only infer by watching a video, such as whether a waving hand is moving left to right or right to left. Other open problems include generating videos longer than five seconds, videos with multiple scenes and events, and output at higher resolutions.
The researchers are also aware that they are walking into a minefield of controversy. Make-A-Video has “learnt and likely exaggerated social biases, including harmful ones,” they acknowledge, and every video generated by the model carries a watermark to “help ensure viewers know the video was generated with AI and is not a captured video.”
Cracking the code
to create photorealistic video on demand — and then drive it with a narrative —
is exercising other minds too.
Chinese researchers
are behind another text-to-video model named CogVideo, OpenAI is also
thought to be working on one, and no doubt there are numerous other initiatives
in the works.