NAB Amplify
AI engines are getting pretty good at accurate image recognition but fail spectacularly at understanding what they are looking at. An approach used for natural language processing could address that.
https://amplify.nabshow.com/articles/computers-can-see-but-they-dont-have-vision/
In a shoot-out between humans and the AI smarts of Amazon
AWS Rekognition, Google Vision, IBM Watson, and Microsoft Azure Computer
Vision, the machines came out on top.
On a pure accuracy basis, Amazon, Google and Microsoft scored higher than human tagging for tags with greater than 90% confidence in a test completed by Perficient Digital, as reported at ZDNet.
However, in a machines versus humans rematch, the
engine-generated descriptions matched up poorly with the way that we would
describe the image. In other words, the study concluded, there is a clear
difference between a tag being accurate and what a human would use to describe
an image.
A couple of years on, Steve Teig, CEO of edge-AI accelerator chip company Perceive, says advances in natural language processing (NLP) techniques can be applied to computer vision to give machines a better understanding of what they are seeing.
So-called attention-based neural network techniques, which
are designed to mimic cognitive processes by giving an artificial neural
network an idea of history or context, could be applied to image processing.
In NLP, the Attention mechanism looks at an input sequence,
such as a sentence, and decides after each piece of data in the sequence
(syllable or word) which other parts of the sequence are relevant. This is
similar to how you are reading this article: Your brain is holding certain
words in your memory even as it focuses on each new word you’re reading,
because the words you’ve already read combined with the word you’re reading
right now lend valuable context that helps you understand the text.
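To make that mechanism concrete, here is a minimal self-attention sketch in NumPy: for every position in a toy "sentence," it scores the relevance of every other position and returns a context-weighted summary. The array sizes, the random token embeddings, and the single-head formulation are illustrative assumptions, not details from the article.

```python
# Minimal sketch of scaled dot-product self-attention over a token sequence.
# Shapes and values are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """For each position, weight every other position by relevance
    (query-key similarity) and return the weighted sum of values."""
    d_k = queries.shape[-1]
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq)
    weights = softmax(scores, axis=-1)  # how much each word attends to each other word
    return weights @ values, weights

# Toy example: a "sentence" of 5 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 5, 8))
context, weights = attention(tokens, tokens, tokens)  # self-attention
print(weights[0].round(2))  # row i: how much token i draws on every other token
```

Each row of the printed matrix is the "which other parts are relevant" decision the paragraph above describes, expressed as weights that sum to one.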
Applying the same concept to a still image (rather than a temporal sequence such as a video) is less obvious, but Teig says Attention can be used in a spatial context: syllables or words become analogous to patches of the image.
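A hedged sketch of that analogy: carve a still image into fixed-size patches and flatten each into a "token" vector, so the same self-attention routine used for sentences can score which patches are relevant to which. The 64x64 random image and 8-pixel patch size are made-up stand-ins, not parameters from Teig.

```python
# Turning an image into a sequence of patch "tokens" (Vision-Transformer-style).
import numpy as np

def image_to_patch_tokens(image, patch=8):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles and
    flatten each tile into one token vector."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tiles = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)                 # (rows, cols, patch, patch, C)
    return tiles.reshape(rows * cols, patch * patch * C)   # (num_patches, token_dim)

image = np.random.default_rng(1).random((64, 64, 3))  # stand-in for a photo of a dog
tokens = image_to_patch_tokens(image)                  # 64 patch-tokens of length 192
print(tokens.shape)  # (64, 192): "which patch is relevant to which" is now just attention
```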
Teig illustrates the point with computer vision applied to an image of a dog: “There’s a brown pixel next to a grey pixel, next to…” is “a terrible description of what’s going on in the picture,” as opposed to “There is a dog in the picture.”
He says new techniques help an AI “describe the pieces of
the image in semantic terms. It can then aggregate those into more useful
concepts for downstream reasoning.”
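One common way that aggregation step is realized, sketched here as an assumption rather than Teig's actual design, is attention pooling: per-patch descriptors are weighted by a learned relevance query and summed into a single image-level vector that downstream reasoning (a classifier or captioner) can consume.

```python
# Attention pooling: collapse per-patch descriptors into one image-level concept
# vector. Sizes and the "learned" query are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
patch_features = rng.normal(size=(64, 192))  # e.g. per-patch semantic descriptors
query = rng.normal(size=(192,))              # a learned "what matters here?" vector

scores = patch_features @ query / np.sqrt(192)  # relevance of each patch
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax over patches
image_vector = weights @ patch_features         # one concept-level summary
print(image_vector.shape)                       # (192,): feeds a classifier or captioner
```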
Interviewed by EE Times, Teig said, “I think there’s a
lot of room to advance here, both from a theory and software point of view and
from a hardware point of view, when one doesn’t have to bludgeon the data with
gigantic matrices, which I very much doubt your brain is doing. There’s so much
that can be filtered out in context without having to compare it to everything
else.”
This matters because current NLP processing from the likes of Google is computationally intensive. Deep learning language models like Generative Pre-trained Transformer 3 (GPT-3) run to 175 billion parameters, or around 1 trillion bits of information.
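For a sense of scale, the back-of-the-envelope arithmetic below (the bit-widths are my own illustrative assumptions, not figures from Teig or OpenAI) shows why a model of that size is awkward to host anywhere near an edge device.

```python
# Rough arithmetic for the parameter count quoted above.
PARAMS_GPT3 = 175e9  # parameters

for bits_per_param in (32, 16, 8, 4):
    total_bits = PARAMS_GPT3 * bits_per_param
    gigabytes = total_bits / 8 / 1e9
    print(f"{bits_per_param:>2}-bit weights: {total_bits:.2e} bits ≈ {gigabytes:,.0f} GB")

# Even the aggressively quantized 4-bit row is tens of gigabytes, orders of
# magnitude beyond the memory budget of a typical edge inference chip.
```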
If you want to do this at the network Edge, to fuel next-gen
applications over 5G, then think again.
“It’s like… I’m going to ask you a trillion questions in
order to understand what you’ve just said,” Teig says. “Maybe it can’t be done
in 20,000 or two million, but a trillion — get out of here! The flaw isn’t that
we have a small [processor at the Edge]; the flaw there is that having 175
billion parameters means you did something really wrong.”
That said, this is all evolving very fast. He thinks that
reducing Attention-based networks’ parameter count, and representing them
efficiently, could bring attention-based embedded vision to Edge devices soon.