A new AI system by Microsoft is taking things to a new level by introducing 'imagination' and the ability to draw pictures of objects from scratch.
The ability to automate the production of images and sell them as art, satirised by Warhol and idealised by Hollywood, just got another nudge closer. Microsoft has developed an Artificial Intelligence bot that can draw near pixel-perfect renditions of objects purely from a text description. The drawing bot closes a research circle around the intersection of computer vision and natural language processing, with profound implications for the creation of still and moving images. Microsoft itself envisions the day when its AI will digitally animate a feature film from nothing other than a script.
Its research team has been developing the technology for some time, starting with a programme that automatically writes photo captions (the CaptionBot) and then writing software that answers questions humans ask about images, such as the location or attributes of objects, which can be especially helpful for the blind (SeeingAI). The focus then turned to using text to generate an image.
The AI has to use its imagination
This proved a more challenging task than image captioning because the process requires the drawing bot to imagine details that are not contained in the caption. The machine learning algorithms running the AI are required to fill in the blanks, to imagine, if you will, some missing parts of the images. To achieve this, Microsoft pits two computer models against each other to sift authentic from fake information. One computer model generates the image, based on learned linkages between descriptive terms and pictures. A parallel 'discriminator' checks how genuine the image looks. The back-and-forth between the two models fine-tunes the look of the image.
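The generator-versus-discriminator loop described above can be illustrated with a deliberately tiny sketch. This is a toy one-dimensional example in NumPy, not Microsoft's actual drawing bot (which conditions its generator on text): the generator tries to produce numbers resembling "authentic" data, while the discriminator learns to tell the two apart, and each model's updates sharpen the other.

```python
# Toy adversarial loop: a generator proposes samples, a discriminator
# scores how "real" they look, and the two are trained in alternation.
# Purely illustrative; real text-to-image systems work on images.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Generator: maps noise z to a sample x = w*z + b.
w, b = 0.1, 0.0
# Discriminator: logistic score D(x) = sigmoid(a*x + c), near 1 = "real".
a, c = 0.1, 0.0

lr = 0.05
for step in range(2000):
    x_real = rng.normal(4.0, 1.0)   # "authentic" data drawn from N(4, 1)
    z = rng.normal()
    x_fake = w * z + b              # generated (fake) sample

    # Discriminator step: push D(real) up and D(fake) down.
    s_r = sigmoid(a * x_real + c)
    s_f = sigmoid(a * x_fake + c)
    a -= lr * (-(1 - s_r) * x_real + s_f * x_fake)
    c -= lr * (-(1 - s_r) + s_f)

    # Generator step: push D(fake) up (non-saturating generator loss).
    s_f = sigmoid(a * x_fake + c)
    grad = -(1 - s_f) * a
    w -= lr * grad * z
    b -= lr * grad

# Sample from the trained generator.
fake_mean = float(np.mean([w * rng.normal() + b for _ in range(1000)]))
```

The back-and-forth is the key design point: neither model is trained against a fixed target, so each improvement in the discriminator forces the generator to produce more convincing samples.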
The approach worked well when generating images from simple text descriptions, but quality fell off with more complex inputs. To improve it, the researchers applied an algorithm that breaks up the input text into individual words and then matches those words to specific regions of the image. In effect, they say, this mathematically represents the human concept of attention.
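The word-to-region matching can be sketched in a few lines: score each word against each image region, then normalise the scores so every word distributes its "attention" across the regions. The vectors below are random stand-ins, not features from a real model, and the dimensions are arbitrary.

```python
# A minimal sketch of word-to-region attention: dot-product similarity
# scores between word embeddings and region features, softmax-normalised
# per word. Random vectors stand in for learned features.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_regions, dim = 4, 6, 8

words = rng.normal(size=(n_words, dim))      # one embedding per word
regions = rng.normal(size=(n_regions, dim))  # one feature per image region

scores = words @ regions.T                   # similarity of word i to region j
scores -= scores.max(axis=1, keepdims=True)  # stabilise the softmax
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)      # each word's weights sum to 1

# attended[i] summarises the image regions word i attends to most.
attended = attn @ regions
```

Each row of `attn` is a probability distribution over regions, which is the sense in which the mechanism "mathematically represents" attention: a word like "beak" can concentrate its weight on the image region being drawn for it.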
Even more startling, the model learns ‘common sense’ from the training data and it pulls on this learned notion to fill in details of images that are left to the imagination.
As an example, Microsoft explains that, if given a task of drawing a bird, the bot will usually draw a bird sitting on a tree branch even if that's not explicit in the text because the images it was trained on often showed something similar.
As it stands, the technology is imperfect. Close examination of images “almost always” reveals flaws, such as birds with blue beaks instead of black and fruit stands with mutant bananas.
“These flaws are a clear indication that a computer, not a human, created the images,” Microsoft admits.
Nevertheless, according to results on an industry-standard test reported in a research paper posted on arXiv.org, the quality of images produced by the AI is nearly a three-fold improvement over previous techniques for text-to-image generation, leading Microsoft to claim it as a milestone "on the road toward a generic, human-like intelligence that augments human capabilities."
If AI and humans are to live in the same world, they have to have a way to interact with each other, and language and vision are the two most important shared means of doing so.
Facebook & Google
Such research is not the preserve of Redmond, of course. Facebook is teaching its neural networks to automatically create images of things like aeroplanes, cars and animals, and alarmingly says that about 40 percent of the time, these images can fool us into believing we're looking at a real thing.
Google is doing something similar, by teaching machines to look for familiar patterns in a photo, enhancing those patterns, and then repeating the process with the same image. The result, reckons Wired, is a kind of machine-generated abstract art.
But what about the implications for photography or artistic creation? Text-to-image generation technology could find practical applications acting as a sort of sketch assistant to painters and interior designers, or as a tool for voice-activated photo refinement.
Can you imagine that? Requesting your digital assistant to ‘erase people in the background’ or ‘shade the car blue’ or even ‘add a blue car in front of the building’.
From the debate about whether Robert Capa's 1930s photograph of a falling soldier was taken at the moment of death or was staged, to the record multi-million-pound digitally doctored photograph Rhine II by Andreas Gursky, we are already in a world where the fake and the real not only blur, but where concern about the distinction is fading.
AI from Nvidia can compose music, IBM's Watson can assemble video clips into a narrative, and other algorithms are being trained to generate scripts for sale to Hollywood. There's nothing to stop a fully automated CGI feature from being produced, or cinematographers from conjuring a realistic canvas simply by talking into a microphone. Far from being a cause for concern about artificial intrusion into human artistic expression, it sounds like a lot of fun to be welcomed.