Only a few months ago the art world was agog at breakthroughs in text-to-image synthesis, but already new models have arrived capable of text-to-video. Advances in the field have been so swift that Meta’s Make-A-Video – announced just three weeks ago – already looks basic.
Another model, called Phenaki, can generate video from a still image and a prompt rather than a text prompt alone. It can also make far longer clips: users can create videos multiple minutes long based on several different prompts that together form a script for the video. The example given by MIT’s Technology Review is ‘A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes underwater. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming underwater.’
“A technology like this could revolutionize filmmaking and
animation,” writes Melissa Heikkilä. “It’s frankly amazing how quickly this
happened. DALL-E was launched just last year. It’s both extremely exciting and
slightly horrifying to think where we’ll be this time next year.”
In its white paper, the Phenaki team explains that generating videos from text is particularly challenging because of the computational cost, the limited quantity of high-quality text-video data and the variable length of videos. To address these issues, Phenaki compresses the video into a small representation of discrete tokens. “This tokenizer uses causal attention in time, which allows it to work with variable-length videos.”
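In plainer terms, “causal attention in time” means each frame’s tokens can attend only to the current and earlier frames, so the tokenizer never needs to know a clip’s total length in advance. Below is a minimal sketch of that masking idea in PyTorch; the single attention head, absent learned projections and tensor sizes are simplifications of ours, not Phenaki’s actual architecture.

import torch
import torch.nn.functional as F

def causal_temporal_attention(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, time, dim) -- one embedding per video frame
    b, t, d = frames.shape
    q = k = v = frames                            # no learned projections, for brevity
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (b, t, t) frame-to-frame similarity
    mask = torch.tril(torch.ones(t, t, dtype=torch.bool))  # frame i sees frames <= i only
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v          # causally mixed frame features

video = torch.randn(1, 7, 64)                    # a 7-frame clip; any length works
print(causal_temporal_attention(video).shape)    # torch.Size([1, 7, 64])

Because the mask is built from the clip’s own length at call time, the same code handles a 7-frame clip or a 700-frame one, which is the property the paper is after.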
It goes on to explain how it achieves this compressed
representation of video. “Previous work on text to video either use per-frame
image encoders or fixed length video encoders. The former allows for generating
videos of arbitrary length, however in practice, the videos have to be short because
the encoder does not compress the videos in time and the tokens are highly
redundant in consecutive frames. The latter is more efficient in the number of
tokens but it does not allow to generate variable length videos. In Phenaki,
our goal is to generate videos of variable length while keeping the number of
video tokens to a minimum so they can be modeled … within current computational
limitations.”
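The trade-off the paper describes is easy to see with some back-of-envelope arithmetic; the numbers below are illustrative assumptions of ours, not figures from the paper.

TOKENS_PER_FRAME = 256      # assumed spatial tokens per frame
TEMPORAL_COMPRESSION = 4    # assumed frames folded together by the tokenizer

for frames in (16, 64, 256):
    per_frame = frames * TOKENS_PER_FRAME        # per-frame image encoder
    compressed = (frames // TEMPORAL_COMPRESSION) * TOKENS_PER_FRAME
    print(f"{frames:>4} frames: {per_frame:>6} tokens uncompressed, "
          f"{compressed:>6} compressed in time")

At 256 frames the per-frame approach is already juggling 65,536 largely redundant tokens, which is why in practice those models stay short.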
Google also has a text-to-3D AI model, called DreamFusion. This generates 3D models which can be viewed from any angle, the lighting can be changed, and the model can be placed into any 3D environment – handy for metaverse building, you would imagine.
In their paper, the DreamFusion researchers explain that recent breakthroughs in generative AI have been “driven by diffusion models trained on billions of
image-text pairs,” but adapting this approach to 3D synthesis “would require
large-scale datasets of labeled 3D assets and efficient architectures for
denoising 3D data, neither of which currently exist.”
Instead, Google circumvents these limitations by using a
pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis.
After a bit more jiggery-pokery, “the resulting 3D model of
the given text can be viewed from any angle, relit by arbitrary illumination,
or composited into any 3D environment. Our approach requires no 3D training
data and no modifications to the image diffusion model, demonstrating the
effectiveness of pretrained image diffusion models as priors.”
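The mechanism behind that claim, which the paper names Score Distillation Sampling, is easier to grasp in code: render the 3D scene from a viewpoint, ask the frozen 2D diffusion model which way the rendered image should move to better match the prompt, and push that gradient back through the renderer into the 3D parameters. The sketch below is a toy illustration of ours – the renderer and “prior” are stand-ins, not Google’s models.

import torch

scene_params = torch.randn(64, requires_grad=True)   # stands in for a NeRF's weights
opt = torch.optim.Adam([scene_params], lr=1e-2)

def render(params):
    # toy differentiable "renderer": any differentiable map from params to an 8x8 image
    return torch.tanh(params).reshape(8, 8)

def prior_gradient(image):
    # toy stand-in for the frozen 2D diffusion prior's denoising direction;
    # here we pretend the text prompt prefers an all-zero image
    return image - torch.zeros_like(image)

for step in range(200):
    image = render(scene_params)
    noisy = image.detach() + torch.randn_like(image) * 0.1
    grad = prior_gradient(noisy)      # no gradient flows through the prior itself...
    image.backward(gradient=grad)     # ...only through the renderer, into the 3D params
    opt.step()
    opt.zero_grad()

No 3D training data appears anywhere in the loop: the only supervision is the 2D prior’s opinion of each rendered view, which is the point the researchers are making.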
Wonderful, you think – but such advances raise ethical questions, not least given the inherent bias of the data sets on which previous AI text-to-image engines have been built.
“As the technology develops, there are fears it could be
harnessed as a powerful tool to create and disseminate misinformation,” warns
MIT Technology Review. “It’s only going to become harder and harder to know what’s real online,
and video AI opens up a slew of unique dangers that audio and images don’t,
such as the prospect of turbo-charged deepfakes.”
AI-generated video could be a powerful tool for
misinformation, because people have a greater tendency to believe and share
fake videos than fake audio and text versions of the same content, according to
researchers at Penn State University.
The creators of Phenaki write in their paper that while the videos their model produces are not yet indistinguishable in quality from real ones, getting there “is within the realm of possibility, even today.” The model’s creators say that before releasing it, they want to get a better understanding of the data, prompts and output filtering, and to measure biases, in order to mitigate harms.
The European Union is trying to do something about it. The AI Liability Directive is a new bill, part of a push from Europe to force AI developers not to release dangerous systems.
According to MIT Technology Review, the bill will add teeth to the EU’s AI Act, which is set to become law around the same time.
The AI Act would require extra checks for “high risk” uses of AI that have the
most potential to harm people. This could include AI systems used for policing,
recruitment, or health care.
“It would give people and companies the right to sue for
damages when they have been harmed by an AI system—for example, if they can
prove that discriminatory AI has been used to disadvantage them as part of a
hiring process.
“But there’s a catch: Consumers will have to prove that
the company’s AI harmed them, which could be a huge undertaking.”