Transcription is
one process that stands to be uniquely impacted by recent developments in AI.
Thanks to ever-evolving language and learning models, transcribing audio to
text has never been faster or easier. But there are also limitations to new
AI-powered transcription solutions.
The global translation service market will exceed $47 billion by 2031, largely driven by media and entertainment. Yet the current cost to caption titles for distribution on streaming services ranges between $60 and $100 per program hour, and the work typically takes one to three days to complete “because of excessive manual intervention,” claims Cineverse CTO Tony Huidor.
“Captions, and
localization more broadly, are generally major pain points for content owners
seeking to monetize their assets across the many streaming services,” Huidor
added.
That’s because content companies need to generate far more revenue by broadening their audiences, and to do so at significantly reduced cost.
“Companies have
been priced out of bringing their entire content catalogs to market due to the
extremely high costs of captioning and localization,” Huidor said.
The traditional transcription process involves an individual transcriber listening to a piece of audio and manually converting every audio element they hear to text. It is clearly labor intensive, requiring trained specialists, and costly, but it does produce accurate results.
AI transcription eliminates the need for a human transcriber and relies instead on automatic speech recognition (ASR) technology. ASR uses language and learning models to interpret human speech and convert specific sounds (phonemes) into written language.
Some of the most
popular speech-to-text software is provided by Google, Azure, IBM, and Dragon
Professional.
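To make the mechanics concrete, here is a minimal ASR sketch using OpenAI’s open-source Whisper model, which is not one of the vendors named above; the audio file name is a placeholder:

```python
# Minimal automatic speech recognition (ASR) with open-source Whisper.
# Install with: pip install openai-whisper
import whisper

# Load a small pretrained model; "medium" or "large" trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file: Whisper maps the speech sounds it hears
# to written language, the core step of any ASR workflow.
result = model.transcribe("interview.mp3")
print(result["text"])
```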
The upside of using automated transcription is that it lets companies scale their output to keep pace with huge global demand while slashing the costs of the whole exercise.
The main downside, as outlined by Vitac, is inaccuracy. AI systems tend to deliver poor-quality results when the input recording is poor, when there is more than one speaker, and when the audio contains a substantial amount of overlapping speech. Diverse accents or dialects among speakers can also inhibit the AI’s ability.
“All these
variables can substantially impact AI’s ability to interpret and represent the
audio of a recording and result in a final transcript containing a substantial
number of errors,” Vitac says.
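Vitac does not say how such errors are counted, but the standard yardstick in ASR is word error rate (WER): the word-level edit distance between a human reference transcript and the machine output, divided by the number of reference words. A minimal, self-contained sketch:

```python
# Word error rate (WER) = (substitutions + deletions + insertions)
# divided by the number of words in the reference transcript,
# computed here as a word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word in four: WER = 0.25, i.e. 75% word accuracy.
print(wer("the quick brown fox", "the quick brown box"))
```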
Its prescription for achieving “exceptionally high rates of accuracy” is to match automation with human experts. Not coincidentally, this is exactly the service it offers.
Broadcasters and publishers are a little reluctant to rely on AI transcription given that tools to date have not proved foolproof. The BBC, for instance, values the trust that viewers put in the veracity of its output more than most broadcasters, yet it also faces increasing pressure to cut costs. It is exploring and evaluating AI tools, a route it advises others to follow.
Vanessa Lecomte, localization operations manager at BBC Studios, told language industry news site Slator that for all the benefits AI brings to localization, it “must match BBC’s quality standards at a minimum.”
She said, “The main
question is whether AI can improve current processes, increase speed to market,
and reduce costs.”
Lecomte advised balancing opportunities against the risks. “These technologies offer the
potential to speed up the process, which in turn enables you to localize more
content, reach new markets, but it shouldn’t be done to the detriment of
quality or of a well-respected industry. So do the right thing and commit to a
thoughtful localization strategy.”
The BBC is also addressing AI in dubbing using synthetic voices. Lecomte described the current dubbing process as “time-consuming and expensive involving many technical and creative talents.” She said her division is exploring the capabilities of AI dubbing technology to try to deliver more content, faster, while still meeting quality standards, adding that this should be done responsibly with regard to talent rights.
Anton Dvorkovich,
CEO & Founder of Dubformer, also flagged the industry responsibility
of establishing regulations around the ethical use of human voices.
He also believes AI dubbing is “poised to dramatically transform the media industry…with solutions that cut production costs by 30-50%.”
“For now, investors
and the media are struggling with the challenge of evaluating new solutions.
However, the focus is shifting to the potential costs of emerging tools and
their impact on the media industry,” he wrote in an op-ed for Streaming
Media.
Solutions range from those like Papercup and Deepdub, where humans finalize the AI-powered dubbing, to “DIY translation tools” aimed at enabling freelance content creators to translate their videos with AI. One such solution, from Heygen, relies on natural-sounding speech synthesis and text-to-speech software developed by Eleven Labs.
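Whatever the vendor, the basic pipeline shape is the same: recognize the speech, translate the text, synthesize a voice. The toy sketch below assembles that shape from off-the-shelf open-source parts (Whisper’s built-in translate-to-English task and a generic offline TTS engine); production systems such as those named above layer voice cloning, timing, and human review on top.

```python
# Toy AI dubbing pipeline: speech recognition -> translation -> synthesis.
# Install with: pip install openai-whisper pyttsx3
import whisper
import pyttsx3

# 1. Recognize and translate: Whisper's "translate" task emits English
#    text from non-English speech in one step.
model = whisper.load_model("base")
english_text = model.transcribe("source_clip.mp3", task="translate")["text"]

# 2. Synthesize: a generic offline TTS voice stands in for the
#    natural-sounding cloned voices used by commercial dubbing tools.
engine = pyttsx3.init()
engine.save_to_file(english_text, "dubbed_clip.wav")
engine.runAndWait()
```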
He predicts the introduction of an “AI Dubbing Manager,” or proof listener, tasked with fine-tuning AI dubbing systems for particular types of content. This role could include listening to the automatic voiceovers to grasp cultural nuances, refine voice modulation, and make corrections. Some actors and interpreters may transition into this profession as it evolves, he suggested.
There could also be Creative Directors for AI-enhanced productions to guide creative content developed through AI dubbing, while the market for actors to license their AI-generated voices will grow. “More tools will enter the market, enabling individuals to generate their voices with AI. Actors will be able to create new voices based on their own.”
AI-Powered
Localization and Captioning Tools
Software
developer Enco introduced AITrack and ENCO-GPT, which both use
ChatGPT to generate language responses from text-based queries for automated TV
and radio production workflows.
AITrack, for instance, integrates with Enco’s DAD radio automation system to generate and insert voice tracks between songs, leveraging synthetic voice engines to produce natural-sounding, engaging content.
ENCO-GPT could be used to condense a lengthy written news article into a few sentences, inject breaking news updates within live ad breaks, or automatically create ad copy on behalf of sponsors.
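Enco’s actual integration is not public, but the condensation step itself is a plain LLM call. A minimal sketch using the OpenAI API directly (the model name is an assumption; any chat model would do):

```python
# Condense a news article into an on-air-ready summary via the OpenAI API.
# Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def condense(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "Condense news articles into two or three "
                        "sentences suitable for reading on air."},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content
```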
Company president
Ken Frommert sees an opportunity to go bigger with both solutions. “We see
opportunities to convert a morning or afternoon drive radio show into a
short-form podcast, or summarize an 11:00 p.m. local news program for the TV
station’s website…. It offers a seamless way to publish content in diverse
forms.”
LEXI Recorded, a
VOD automated captioning solution from Australian firm AI Media, claims 98%
accuracy, “comparable to human captioning,” and even higher with the use of
custom dictionaries or topic models. Its use is priced from 20 cents per
minute.
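AI-Media does not document how its custom dictionaries work internally, but the general technique of biasing an ASR model toward expected domain terms can be shown with open-source Whisper, whose transcribe call accepts an initial_prompt that nudges the decoder toward the given spellings (the file name and glossary here are illustrative):

```python
# Bias ASR output toward known names and jargon, a rough stand-in for
# the "custom dictionary" idea, using Whisper's initial_prompt option.
import whisper

model = whisper.load_model("base")
domain_terms = "NAB Show, Cineverse, MatchCaption, LEXI, ENCO-GPT"
result = model.transcribe(
    "panel_recording.mp3",
    initial_prompt=f"Glossary of expected names: {domain_terms}.",
)
print(result["text"])
```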
“We are not just
meeting but exceeding the demands for high-volume, quick, and precise
captioning of recorded content,” said AI-Media’s Chief Product Officer, Bill
McLaughlin who will present the product at NAB Show in April.
Captions offers
an AI-based video editing app and a solution for automatically generating
subtitles. Both products are aimed at content creators and marketers.
It also offers an
in-house voice cloning tool trained on licensed audio recordings to translate
users’ audio into 28 other languages or use an AI voiceover to narrate the
content from scratch.
Gaurav Misra, CEO and cofounder, says Captions’ approach to video editing software is different because its tools are designed specifically for editing talking videos. “Most video production editing is focused more on aesthetics like filters and colors, whereas our focus became more about conveying an idea or experience,” he told Rashi Shrivastava at Forbes.
Vitac claims its own AI captioning solution, Verbit Captivate, stands apart from “generic” ASR engines in being designed, developed, and built in-house. “Whereas other AI captioning vendors either provide an engine or a service, Vitac is unique in that we own both. And because of that, we can change, update, upgrade, and customize customer offerings, tuning our solutions to individual customer needs, creating an offering that achieves accuracy and results on a personal level.”
Additionally, it
pairs the tech with “human backup” — specialists who boost performance with
prep, pre- and post-session research, and live-session monitoring.
Cineverse’s MatchCaption targets localization of bulk film, television, and video libraries “at significant scale.” It claims its generated captions are “perfectly timed and formatted according to industry standards, then auto converted into multiple caption/subtitle formats, to meet the specifications of all streaming platforms.”
It also claims its system can complete the same tasks, which currently cost content owners $60-$100 per program hour, for less than $10 per program hour, “and a full feature film can be completed, and quality checked in less than one hour — an 85% reduction in cost and 90% reduction in time.”
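The format-conversion step in that claim is the most mechanical part. At its simplest, converting SubRip (.srt) captions to WebVTT means adding a header and swapping the timestamp separator, as in this sketch (the input file name is a placeholder):

```python
# Convert SubRip (.srt) captions to WebVTT: add the WEBVTT header and
# change the millisecond separator in timestamps from "," to ".".
import re

def srt_to_vtt(srt_text: str) -> str:
    # 00:01:02,500 --> 00:01:05,000  becomes  00:01:02.500 --> 00:01:05.000
    vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + vtt_body

with open("feature.srt", encoding="utf-8") as f:
    print(srt_to_vtt(f.read()))
```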