Sunday, 9 May 2021

Time to check on the machines

TV Technology 

https://issuu.com/futurepublishing/docs/tvt461.digital_may_2021/14

AI/ML tools are improving and seeing wider use, but they are not yet, and perhaps never will be, a magic bullet for every media use case.

 

M&E organizations are seeking automation to drive efficiencies, with Artificial Intelligence (AI) and Machine Learning (ML) as the key technologies. In a post-pandemic world where remote and distributed work models are the new norm, AI engines could come into their own. 

“We’re at the beginning of a golden age of AI and ML,” says Hiren Hindocha, co-founder, president, & CEO at Digital Nirvana. “The use in media is tremendous. It makes content searchable and translatable into multiple languages, allowing content to be consumed by users anywhere.” 

Others strike a note of caution. “AI/ML is working and has not led to the mass layoffs that some feared,” says Tom Sahara, former VP of Operations and Technology for Turner Sports. “But nor has it reduced budgets by a huge percentage.” 

Julian Fernandez-Campon, CTO, Tedial agrees: “AI/ML has become a 'must-have' feature across all technologies but we have to be quite cautious about its practical application. It has shown good results in some scenarios but in reality, broadcasters are not getting big benefits right now. Being able to test and select the proper AI/ML tool quickly and cost-effectively will definitely help adoption.” 

Roy Burns, VP of Media Solutions at LA-based systems integrator Integrated Media Technologies, says customers are confused about what AI/ML is and what they want to do with it: “We have to explain, it’s not a magic bullet.” 

Speech to text 

The key benefit, and one ready for general operation today, is captioning. Speech-to-text quality is rapidly improving, opening the possibility of fully automatic, low-latency creation of subtitles. Dalet reports time savings of up to 80% in delivering subtitled content for news and digital publishing. Additional ML capabilities allow systems to properly segment and lay out captions to increase readability and compliance with subtitling guidelines. 
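To make that segmentation step concrete, here is a minimal Python sketch of how word-level timings from a speech-to-text engine might be grouped into readable caption blocks. The limits (roughly 42 characters per line, two lines per caption) are illustrative defaults, not any vendor's actual rules:

```python
# Minimal sketch: split a timed transcript into readable caption blocks.
# Limits are illustrative defaults, not any vendor's actual guidelines.

MAX_CHARS_PER_LINE = 42
MAX_LINES_PER_CAPTION = 2

def words_to_captions(words):
    """words: list of (text, start_sec, end_sec) from a speech-to-text engine."""
    captions, lines, current = [], [], []
    start = None
    for text, w_start, w_end in words:
        if start is None:
            start = w_start
        candidate = " ".join(current + [text])
        if len(candidate) <= MAX_CHARS_PER_LINE:
            current.append(text)
        else:
            lines.append(" ".join(current))
            current = [text]
            if len(lines) == MAX_LINES_PER_CAPTION:
                captions.append({"start": start, "end": w_start, "lines": lines})
                lines, start = [], w_start
        end = w_end
    if current:
        lines.append(" ".join(current))
    if lines:
        captions.append({"start": start, "end": end, "lines": lines})
    return captions
```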

“Speech-to-text translation is probably most mature where there’s about a 90 percent confidence rate,” says Burns. “For some people that’s good enough but for others making an embarrassing mistake is still too risky. 

“It’s important to understand that the output of object or facial recognition tools is not human-readable,” he adds. “They are designed to give a ton of metadata about the asset, but to correlate it against your media you need a MAM. That’s what I try to explain. If you can ingest AI outputs into a MAM and correlate them against a central database of record, that is what is going to make it searchable.” 
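As a rough illustration of what ingesting AI outputs into a central database of record can mean in practice, the sketch below flattens a recognition service's label output into rows keyed by asset ID so a MAM could index and search them. The field names and schema are hypothetical, not those of any particular MAM:

```python
# Illustrative sketch only: flatten machine-generated labels into searchable
# metadata rows keyed by asset ID. Schema and field names are hypothetical.
import sqlite3

def ingest_labels(db_path, asset_id, detections):
    """detections: e.g. [{"label": "goal celebration", "start": 12.0,
    "end": 15.5, "confidence": 0.91}, ...] from a recognition service."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS asset_metadata (
        asset_id TEXT, label TEXT, start REAL, end REAL, confidence REAL)""")
    conn.executemany(
        "INSERT INTO asset_metadata VALUES (?, ?, ?, ?, ?)",
        [(asset_id, d["label"], d["start"], d["end"], d["confidence"])
         for d in detections])
    conn.commit()
    conn.close()

def search(db_path, keyword):
    """Return (asset_id, start, end) rows whose label matches the keyword."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT asset_id, start, end FROM asset_metadata WHERE label LIKE ?",
        (f"%{keyword}%",)).fetchall()
    conn.close()
    return rows
```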

Hearst Television has adapted its MAM with Prime Focus for automation of commercials file management. Joe Addalia, Hearst’s Director of Technology Projects, explains, “We have hundreds of commercials coming into our systems every day. If we have to mark in and out each time and do a quality check we are not being efficient. Instead, the MAM automates this, harvests necessary metadata and supplies it downstream to playout. There’s no reason for an operator to go into the item.” 

Addalia emphasizes the importance of metadata. “You can have the most glorious 4K image you want but if you can’t find it you may as well not have it. AI/ML is about being able to find what you need as you need it.” 

Hearst’s internal description for this is ‘enabled media’: metadata-enriched video, audio or stills content that Addalia says will advance the possibilities for new workflows and products. 

“On top of speech-to-text, automatic machine translation is nearing maturity to enable multi-lingual captioning scenarios,” says Michael Elhadad, Director R&D, Dalet. “The main obstacle is that standard machine translation models are not fully ready to translate captions out-of-the-box. Automatic machine translation is trained on fluent text, and when translating each segment into individual subtitles, the text is not fluent enough; standard models fail to produce adequate text.” 

The alternative method, which consists of merging caption segments into longer chunks, leads to another challenge: how do you segment the translated chunk back into aligned and well-timed segments? 

“A specialized ML method must be developed to address this challenge and produce high-quality translated captions,” says Elhadad. “This remains an open challenge for the industry and something that we’re working on at our research lab.” 
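One naive way to see why this is hard is to redistribute a translated chunk across the original caption timings in proportion to the source segments' character counts. This is emphatically not Dalet's method, which the article describes as an open problem; it only illustrates the baseline a specialized ML approach would need to beat, since it ignores word boundaries and grammar:

```python
# Naive sketch of the re-segmentation problem: split a translated chunk back
# across the original caption timings in proportion to character length.
# NOT Dalet's method; it shows why a smarter, language-aware model is needed.

def redistribute(translated_text, original_captions):
    """original_captions: list of (start_sec, end_sec, source_text)."""
    total_src = sum(len(t) for _, _, t in original_captions) or 1
    words = translated_text.split()
    out, idx = [], 0
    for i, (start, end, src) in enumerate(original_captions):
        share = len(src) / total_src
        if i < len(original_captions) - 1:
            take = round(share * len(words))
        else:
            take = len(words) - idx          # dump the remainder in the last cue
        out.append((start, end, " ".join(words[idx:idx + take])))
        idx += take
    return out
```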

 

Current uses of AI/ML in media 

Additionally, object and face recognition logged as metadata can assist scripting and video editing, specifically by automatically creating a rough cut from given metadata fields, for example building highlights from keywords, objects or on-screen text. 

Tedial has been using AI/ML tools with some customers for the automatic logging of legacy archives, identifying celebrities and running OCR. 

A big pain point is text collision. This is when on-screen text (perhaps indicating a place or date) overlaps with caption files. Files presented with this error will immediately fail the QC of all the big streamers, but detecting issues manually in every version permutation is not cost-effective. 

OwnZones offers a deep-analysis platform to scan content and compare it against other media items like timed text. “The AI analysis tool can find the location of onscreen graphics using OCR (Optical Character Recognition) and, with information from the timed text, is able to detect a collision,” explains director of product, Peter Artmont. “A failure report is automatically sent back to whoever did the localisation work to fix captioning before sending on to OTT services.” 

Typically, it would take an hour or two to manually QC and flag issues per hour of content. OwnZones claims its AI does this in 15 minutes. 
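A hedged sketch of the underlying check, not OwnZones' implementation: flag any subtitle cue that overlaps both in time and in screen position with a burnt-in graphic found by OCR. The box format (normalised x, y, width, height) is an assumption for illustration:

```python
# Generic collision check between OCR-detected on-screen graphics and timed
# subtitle cues. Data shapes are assumed; this is not OwnZones' implementation.

def boxes_overlap(a, b):
    """Boxes are (x, y, w, h) in normalised screen coordinates."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def find_collisions(ocr_hits, subtitle_cues):
    """ocr_hits / subtitle_cues: lists of dicts with 'start', 'end', 'box'."""
    failures = []
    for cue in subtitle_cues:
        for hit in ocr_hits:
            time_overlap = cue["start"] < hit["end"] and hit["start"] < cue["end"]
            if time_overlap and boxes_overlap(cue["box"], hit["box"]):
                failures.append({"cue": cue, "graphic": hit})
    return failures
```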

Another common use case is scene detection for censorship. Artmont describes a scenario in which an episodic drama containing occasional swear words requires closed captions to be blurred out for transmission during daytime hours. 

“Typically, you have to store all the versions you create, eating up storage on-prem,” he says. “In our example, you are storing 300-400GB per show yet the only difference between each version is 3-4MB of frames. We apply our AI analysis to generate an (IMF-compliant) Composition Playlist from the content. By storing only the differences (scenes with bad words) from the original we can trim content by 46%, making storage far more efficient.” 
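A real IMF Composition Playlist is an XML document, but the storage-saving principle Artmont describes can be sketched in a few lines: the censored version references the original master for most of the timeline and points to small replacement segments only where the content differs. The structure below is purely conceptual, not the IMF schema:

```python
# Conceptual sketch of "store only the differences": build a playlist that
# reuses the original master except where censored replacement segments exist.
# This models the principle only; real IMF CPLs are XML documents.

def build_playlist(duration_frames, replacements):
    """replacements: list of (start_frame, end_frame, alt_asset_id) covering
    only the scenes that differ from the original master."""
    playlist, cursor = [], 0
    for start, end, alt in sorted(replacements):
        if cursor < start:
            playlist.append({"asset": "original_master", "in": cursor, "out": start})
        playlist.append({"asset": alt, "in": start, "out": end})
        cursor = end
    if cursor < duration_frames:
        playlist.append({"asset": "original_master", "in": cursor, "out": duration_frames})
    return playlist
```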

Digital Nirvana was set a business problem by an unnamed distributor of entertainment news. The client requirement was to boil 20 hours of content down to a 20-minute show with a turnaround time of two hours. 

“Using speech-to-text we created accurate transcripts for editors to find content of interest, then we generated closed captions in both English and Spanish,” explains Hindocha. “Using computer vision we’re able to recognize faces, objects and logos in a video stream. The use cases are huge.” 

For example, you could locate the number of times a sponsor’s logo is shown during a live sports broadcast, which is a contractual requirement of many broadcast deals. A similar technique could rapidly identify billboards around the perimeter of sports stadia. 
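For illustration only (the detection format, frame rate and gap threshold are assumptions, not Digital Nirvana's pipeline), frame-level logo detections could be turned into a simple exposure report like this:

```python
# Hedged sketch: turn per-frame logo detections into exposure counts and total
# on-screen seconds for a sponsorship report. Input format is assumed.

def logo_exposure(frame_detections, fps=25.0, max_gap_frames=12):
    """frame_detections: sorted list of frame numbers where the logo was seen.
    Detections closer together than max_gap_frames count as one exposure."""
    exposures, total_frames = 0, 0
    prev = None
    for frame in frame_detections:
        if prev is None or frame - prev > max_gap_frames:
            exposures += 1          # a new, distinct appearance of the logo
        total_frames += 1
        prev = frame
    return {"exposures": exposures, "seconds_on_screen": total_frames / fps}
```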

“Natural language processing has also advanced to a high degree of accuracy,” Hindocha says. “Netflix has shown us people want content from all over the world; the ability for that content to be viewed by anyone is tremendous.” 

 

Current use of AI/ML in sports 

Sports is a greenfield where AI can leverage content production to generate more tailored content for a specific audience. “With the reduction in live events, the ability to monetize valuable content from years of archive is key,” says Elhadad. 

There are two principal applications: Indexing (tagging, transcription, classification and object/face recognition) of vast archives; and automatic highlight creation, event-driven automatic overlays and titling.  

Aviv Arnon, Co-Founder at WSC Sports, says its technology ingests ‘dirty’ production streams of a sport into its cloud-based platform, where an AI/ML system breaks the game into hundreds of individual plays. The system applies a variety of algorithms to identify each clip, make it searchable and cut each with optimal start and end times. It pulls the results into video packages for publishing. 

“Our ML modules understand the particular patterns for how basketball is produced including replays, camera movement, scene changes. We have all those indicators mapped to automatically produce an entire game.” 

He says sports leagues need to scale their content operations by packaging different clips to social media and other websites and that AI is the only way to do so rapidly. 

“I can’t say it’s 100% accurate but 99% is too low. A better question might be, if I had a manual editor would they have clipped it a second shorter or longer? It’s not about the veracity of the content. There’s no doubt it has helped speed up the process by allowing an operator to handle a lot of content in a short amount of time.” 

Elhadad explains that indexing data is collected frame by frame (e.g., a frame contains the face or the jersey of a known player), but search results should be presented as clips (coherent segments where clues collected from subsequent frames are aggregated into meaningful classification). 

“While descriptive standards exist to capture the nature of such aggregation (MPEG-7), the industry has not yet produced methods to predict such aggregation in an effective manner.” 
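A simple, generic way to perform that aggregation (not a method endorsed by Elhadad, nor an industry standard) is to merge frame-level detections of the same label into clip-level segments while tolerating short gaps in detection:

```python
# Illustrative frame-to-clip aggregation: merge frame-level detections of the
# same person/label into clip-level segments, tolerating short detection gaps.
# Frame rate and gap tolerance are assumptions.

def frames_to_clips(detections, fps=25.0, max_gap_sec=2.0):
    """detections: list of (frame_number, label), e.g. (1432, 'player_23')."""
    clips = {}
    for frame, label in sorted(detections):
        t = frame / fps
        segs = clips.setdefault(label, [])
        if segs and t - segs[-1][1] <= max_gap_sec:
            segs[-1][1] = t                     # extend the current clip
        else:
            segs.append([t, t])                 # start a new clip
    return {label: [(round(s, 2), round(e, 2)) for s, e in segs]
            for label, segs in clips.items()}
```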

SMPTE developing AI/ML best practices 

Work is under way in this area. SMPTE is working with the ETC@USC’s AI & Neuroscience in Media Project to help the media community better understand the scope of the AI technology. 

“AI is promising but it’s an amorphous set of technologies,” says project director Yves Bergquist, noting that a quarter of organizations report failure rates of over 50% in their AI initiatives. Consultancy Deloitte has also found that half of media organizations report major shortages of AI talent. 

“There are a lot of challenges around data quality, formats, privacy, ontologies, and how to deploy AI/ML models in enterprise,” Bergquist says. “AI/ML is experimental and expensive and there are duplications across the industry. We think there are strong opportunities for interoperability throughout the media industry. Not everything has to be in the form of standards. We also want to share best practices.”  

The task force is made up of about 40 members including Sony Pictures, WarnerMedia and Adobe. Research is focused around four topics: Data and Ontologies, AI Ethics, Platform Performance & Interoperability, and Organizational and Cultural Integration. 

“That last topic is the most important and least understood challenge,” he says. “It speaks to the ability of an organization to create an awareness and understanding of what AI is and is not, how to interpret its output, how to understand the data.”  

The task force is “expecting to make a substantial contribution in the area of suggesting best practices in terms of creating a culture of analytics.” 

Cross-industry collaboration  

Addalia agrees that no single tech provider can do it all. “TV is a cottage industry. We have to use our collective resources properly so we can leapfrog into next-gen where we have automated workflows. This requires cross-industry collaboration. Vendors must work together.” 

There’s also an onus on end users to provide feedback on tools and systems. “They need to define the use case and the desired results. A ‘build it and they will come’ approach does not fit.” 

Muralidhar Sridhar, VP, AI/ML Products at Prime Focus says M&E needs specific treatment and that ready-made AI engines have not transformed the industry. 

Human-like comparison of video masters, transcription for subtitling and captions, AI-based QC, re-conformance of a source from a master, and automatic retiming of pre- and post-edit masters are all important use cases, he says, “provided the AI can be made to understand the nuances. 

“The problem is that while most AI engines try to augment content analysis, the ability to accurately address nuances is not easy to match. Marking segments like cards, slates and montages takes time and adds cost. None of the off-the-shelf engines serve the industry. It needs 100% accuracy and 100% frame accuracy. You cannot let one frame slip out with the content.” 

Prime Focus’ prescription for clients like Star India, Hotstar and Hearst Television is to combine computer vision techniques with neural networks where necessary then customize the solution for the customer. 

Developments ahead  

Accuracy and speed continue to advance across all AI/ML capabilities. 

Fernandez-Campon predicts use for intelligent BPM (business process management), where AI tools improve workflows by taking automatic decisions and focusing the work of operators on creative tasks. “That's why it's key to offer flexible and cost-effective AI/ML tools that can easily integrate into workflows and be swapped one for another.” 

Dalet is working on a new approach to indexing faces and creating ‘private knowledge graphs’ that it thinks will speed up the process of cataloguing large content archives and building libraries of local personalities for regional needs. Current technologies are not able to easily recognise faces throughout a historical archive (i.e. the same person from 20 years ago), Dalet says. 
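Generic face-library matching of this kind is often done by comparing embeddings. The sketch below, which is not Dalet's approach, matches a face embedding extracted from an archive frame against reference embeddings of known personalities using cosine similarity; the embedding model itself is assumed to exist upstream:

```python
# Not Dalet's approach -- a generic sketch of matching an archive face
# embedding against a library of known local personalities. The embedding
# vectors are assumed to come from a separate face-embedding model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(embedding, library, threshold=0.6):
    """library: dict of person_name -> list of reference embeddings,
    ideally gathered from different eras of the archive."""
    best_name, best_score = None, threshold
    for name, refs in library.items():
        score = max(cosine(embedding, r) for r in refs)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```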

“Better context-aware speech transcription, more accurate tagging and smoother automatic editing will further reduce the need for manual, repetitive tasks,” says Elhadad. “AI/ML empower content producers and distributors to focus on higher value creative work. AI/ML is becoming pervasive: another tool in the box which users will find increasingly beneficial without the doom-laden ‘man vs. machine’ context.” 

Sidebar: Digital us 

We’re no strangers to AI-driven digital humans. From Grand Moff Tarkin in Star Wars to Rachael in Blade Runner 2049, Hollywood is able to capture an actor on a stage with multiple cameras and build a 3D model, using their performance to drive the animation. 

“This is a very expensive process affordable only by a few and not scalable to billions of people,” says Shunsuke Saito, PhD in Computer Science at the University of Southern California. “On the other hand, we all have access to high-resolution cameras in our pockets. What if we use those inputs to create high-quality digital humans?” 

Saito and colleagues at Berkeley and Waseda Universities devised a way to create real-time volumetric capture and display of people. Using just a single-view webcam or smartphone video, the system isolates the person, samples their texture and uses ML to produce a 3D avatar. This digital version can be viewed from any angle, even from the back, which was never seen by the camera, and it does so in virtually real time. The project, Pixel-aligned Implicit Function (PIFu), won the ‘Best in Show’ Jury Award at SIGGRAPH 2020. One application is to teleport these digital avatars into video conference sessions. 
