IBC
The breakthrough in the making of Here was not the authenticity of a de-aged Tom Hanks, but that the face-swapping technique could be achieved live on set.
In Here, we follow the relationship of Richard and Margaret Young, played by Tom Hanks and Robin Wright, as it evolves from their teens to their twilight years. The de-ageing and ageing of the actors is achieved using the latest developments in generative AI technology.
It’s the largest-scale use of an emerging technology from London-based company Metaphysic that is able to digitally swap faces live.
For director Robert Zemeckis, the technology made it possible to tell a story that might have been too expensive using conventional VFX.
“With the overhead of a more traditional pipeline the cost to make the movie would be extremely challenging,” says Jo Plaete, Chief Innovation Officer, Metaphysic. “Also, the quality would be lower for the more expensive pipeline. You might end up in the Uncanny Valley, which would be a problem for this film.”
Screen tests
In summer 2022, Zemeckis arranged screen tests with Hanks as a shoot-out for whether the technology would work. Generative AI-powered tools were only starting to emerge and Metaphysic itself was a startup of 15 people. It had developed an AI facial replacement system which was used live during the final of America’s Got Talent to generate a deepfake of Elvis Presley and deepfakes of judges Heidi Klum and Sofia Vergara.
“The brief for the screen test was can you make Tom Hanks [aged 68] look like Tom Hanks as he looked in Big [a boyish 32]?” explains Plaete. “A couple of other AI companies and large VFX vendors were asked to do the same test but our result came out best and ultimately landed the movie.”
The test proved to Zemeckis and VFX Supervisor Kevin Baillie (The Walk) that Hanks could be de-aged convincingly without the use of a body double.
“There was some exploration of that but ultimately the test was so successful that it gave Bob and Kevin the confidence that it can be done to the level they needed,” adds Plaete. “It was a really high bar because the story relies on the audience having a deep emotional connection to the performances.”
Zemeckis chose to follow the format of the graphic novel on which the story is based by staging scenes from a single point of view in a living room.
“The camera is observing,” says Plaete. “All of life happens in that room and there's not really anywhere [for the VFX] to hide. They wanted to see whether we could match the iconic likeness of Tom as people remember him. You need to hit it 100%. You can’t end up with an approximation.”
Additional tests were made in pre-production for how the actors’ make-up and hair at different ages would work with the AI model. Zemeckis preferred the term ‘digital makeup’ to describe the technology’s use.
Rights to double
The next step was to build models of Hanks and Wright from frames pulled from their movies plus interviews at various points in their careers. They weren’t able to obtain the rights to use every show they wanted but Plaete says this is nothing unusual.
“It’s a conversation we have early on in a project with all clients about what we can license. It is a time-consuming process because different parties own different [IP].”
Although Metaphysic has partnered with agency CAA to develop generative AI tools and services for talent, in this case the studio – Sony – worked to gain permissions for the actors’ image rights.
“We require those protections to be in place on every project that we tackle,” Plaete says. “Our clients have the consent over who we are synthesising and ultimately, as to what data we get. We also work with the studios to make sure that the correct licensing is in place for all data that we then feed into our neural models. At the end of the project that neural model doesn't go any further. It stays within the boundaries of that project.”
Data captured included facial movements, skin textures, and appearances under varied lighting conditions and camera angles. From that, they built several bespoke neural models for the actors’ appearance in their 20s, 40s, 50s and so on. Each model was the baseline for the next stage which was to hone the look and age range using what Plaete calls visual data science.
Visual data science
“You are exposing the data against the neural network for it to learn how to synthesise a face. That's a very iterative process in which our visual data scientists constantly assess what comes up in the dailies to ensure the actor’s performance comes across authentically.”
This VFX process is nothing new in post-production, but instead of working with 3D models and textures the AI artists are now working with a neural network.
As Plaete explains: “They will present different permutations of the network to produce the most faithful version of the actor’s performance. Our machine learning engineers are also very close to that process because they're looking at the architecture of the neural network to help tweak the technical layer and shape the right outcome.
“So that's kind of the loop. We build the model which is what you want to get right first. In post-production we train the neural networks against the photographic plates so that they learn the lighting, the scene context and the actor’s expressions that we have to swap in each moment.”
On top of the neural network, Metaphysic built a ‘neural performance tool set’ which allows them to edit or fix issues such as matching up eye lines or amplifying a certain expression.
“In the approvals process with Kevin and Bob, we showed them versions of our neural network generating the characters’ young faces and they would send notes asking, for example, for the eye line to flirt with the camera a little more or to adjust the make-up to hit the right look. That's where the neural performance process comes in. Our artists are able to nudge the neural network into the right place shot by shot.”
If this is not necessarily different to conventional pipelines, what is groundbreaking is that a real-time version of the system called Metaphysic Live allowed Zemeckis and the actors to see an AI-applied face-swapping of scenes while shooting.
Metaphysic Live
“The real-time output is actually extremely good and I would argue that it's better than what you’d get with a traditional rendering pipeline,” says Plaete. “The photorealism and the non-Uncanny Valley effect is present in the real-time system as well. I think that's why it was so successful. We just had to optimise these models to be more efficient on set.”
During photography, a lower resolution 1080p feed was sent from the camera to Metaphysic’s crew who were in a cabin just off set. The proxy was ingested to a server powered by a couple of GPUs.
“The thing sounds like a jet engine so it couldn't be anywhere near the set,” Plaete says. “We wanted to be on-prem so we didn't have any latency going to the cloud.
“The first thing the system has to do before you can even think about swapping and compositing is to run facial detection. You want to find Tom and Robin or other actors [Paul Bettany and Kelly Reilly] and there could potentially be more people in the scene, so you want to be sure that you're swapping on the right person. That's like a small computer vision problem on its own. Once identified, that asset was passed on to the real-time face swap mechanism to generate the face. The next step is real-time compositing. All these steps have to take milliseconds.”
On set, Zemeckis was able to view a raw feed with no AI on a monitor side by side with the facial swap version lagging just four frames behind.
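The chain Plaete describes (detect faces, confirm the right person, swap, composite) can be sketched in a few lines of Python. This is a minimal illustration only, not Metaphysic's system: the function names, the toy "frame" and the stand-in stages are all hypothetical, and the 24 fps frame rate used for the timing sum is an assumption, since the article gives only the four-frame lag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Toy frame: person name -> how their face currently looks (hypothetical).
Frame = Dict[str, str]


@dataclass(frozen=True)
class Face:
    identity: str  # who the (hypothetical) detector thinks this is


def frame_budget_ms(lag_frames: int, fps: float) -> float:
    """Total time available if the output may lag by `lag_frames` frames."""
    return lag_frames * 1000.0 / fps


def process_frame(
    frame: Frame,
    detect: Callable[[Frame], List[Face]],
    swap: Callable[[Frame, Face], str],
    composite: Callable[[Frame, Face, str], Frame],
    targets: Set[str],
) -> Frame:
    """Detect faces, keep only the intended actors, swap, then composite."""
    out = dict(frame)  # never modify the raw feed
    for face in detect(frame):
        if face.identity not in targets:
            continue  # other people in shot are left untouched
        out = composite(out, face, swap(frame, face))
    return out


# Stand-in stages so the sketch runs; real stages would be neural models.
def fake_detect(frame: Frame) -> List[Face]:
    return [Face(name) for name in frame]


def fake_swap(frame: Frame, face: Face) -> str:
    return "de-aged"


def fake_composite(frame: Frame, face: Face, new_face: str) -> Frame:
    merged = dict(frame)
    merged[face.identity] = new_face
    return merged


if __name__ == "__main__":
    raw = {"Tom": "current", "Robin": "current", "Extra": "current"}
    print(process_frame(raw, fake_detect, fake_swap, fake_composite,
                        {"Tom", "Robin"}))
    # Assumed 24 fps: four frames of lag is roughly 166.7 ms in total.
    print(round(frame_budget_ms(4, 24.0), 1))
```

Under that assumed frame rate, a four-frame lag leaves about a sixth of a second for all three stages combined, which is why each step has to complete in milliseconds.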
“I’ve been on lots of virtual production sets where directors are offered views of how the scene would look from a 3D engine, and while they do look at it a bit, they then spend most of their time with the talent, so I was curious about which monitor Bob would look at. He was always tuned into the real-time feed. For instance, he could see how the de-aged faces played with the actor’s current-day body. The actors too could see how their performance triggered the younger equivalent.”
Plaete, who worked on the ABBA: Voyage FX concert experience, Ready Player One and Star Wars: The Last Jedi in his previous role at ILM, says he didn’t anticipate quite how impactful the solution would be. “I come from a world where it takes six months to see a first grey scale face appear to now being able to iterate the image in the moment.”
It’s easy to see how the technology could replace performance capture systems which use markers and body suits to target animations.
“It replaced performance capture for this project, right? We use computer vision not markers for facial capture and the synthesis of a new face. The neural network replaces a lot of steps that used to be separate in the 3D pipeline.”
Facial replacement on Here wasn’t entirely achieved by AI. Some hair, makeup and prosthetics were applied to Hanks in scenes of him ageing (past his current age of 68). Wright too (aged 58) wore some prosthetics over which Metaphysic layered its AI tech.
“When you have a young target you know exactly where you’re going [in terms of accurate representation] but when you use prosthetics to age-up you can only add [make-up to the face] when what you actually want to do is take away,” says Plaete. “Our skin and faces naturally erode as we age.”
To age up Wright, the team ran data shoots with two older women who were cast for their age and for their compatibility with Wright’s facial structure.
“By merging the oldest layer of Robin’s data, which then we synthetically aged up with different ML techniques, and then adding in real people at that age and training that network as a combination of these, we were able to achieve a very believable representation of older Robin. It looks more realistic than if it had been done with prosthetics alone.”
Wrinkles in the process
Metaphysic claims its real-time technology is able to perform techniques like facial structure adjustments that could previously only be achieved in post.
“If you’re filming the same actors in a de-aging process then, although they look older today, their facial anatomy is the same. The way they trigger their expressions remains the same. But when you have an acting double you have another set of challenges. It means you need to fix in place the structural elements of the face first. You also need to get your casting right. That’s why Bob and ourselves chose not to go down that route on Here.”
Using the technology on such a scale for an A-list feature sounds expensive but the production budget was $50m. Metaphysic says the studio would not have greenlit the project had the technique not been cost effective.
The tech was previously used in post-production on Furiosa: A Mad Max Saga and in Alien: Romulus on Ian Holm's android character. Metaphysic has also used it in a live performance by Eminem at the VMAs in which the rapper performed on stage with his younger alter ego Slim Shady.
“That was interesting because there were 30 to 40 people in the frame and our facial detection system almost ground to a halt with the facial detection. Every project has its own challenge,” says Plaete.
There’s a sense in which Metaphysic has cracked the hardest part of animating digital humans which is the human face itself. It will now apply the technique to full bodies and to more parts of the frame.
“We will take other visual effects processes and supercharge them with AI whilst keeping the control layers,” says Plaete. “That's the key here. Filmmakers need control and to craft what comes out of these networks. If you can achieve that, and Here is a great example where everyone is involved to lift the quality collectively, then the technology has a lot of interesting applications to be explored.”