I auditioned AI voices for TV continuity work and it went OK.

I previously wrote about using an AI to write television continuity scripts. Long story short, it turns out ChatGPT is pretty good at the task. The next skill AI must conquer, if it’s going to cut the legwork and cost in a continuity workflow, is voiceover. This post is about what happens when you take a continuity script written by an AI and feed it into a speech synthesis engine.

I used the Eleven Labs service for this experiment. It’s super cheap and easy to operate. It comes with a big range of built-in voices, and it can also analyse uploaded audio files to make a synthetic version of any voice. That’s an obvious capability to explore for a channel that wants automation and already has its voice talent. If you’re a voice talent, this is controversial and worrying. But it depends on the deal: if you license rather than sell your voice, then maybe you can earn money without having to attend VO sessions.

Let’s move on to the experiment. I chose three AI-generated scripts from the previous work, one each in the Sport, Kids’ and Drama channel genres. I fed them unaltered into the Eleven Labs UI.

SPORT

Here is the script I used for Sport. It includes a name that may cause pronunciation difficulties for anyone not familiar with the language:

Get ready for redemption! UFC Fight Night presents Ankalaev vs. Walker, a light heavyweight rematch in Las Vegas. Last time, controversy ruled. Foul play accusations linger. Sparks will fly. Next, here on Example.

Eleven Labs tags its voices to give an idea of the tone and potential use-cases. For Sport I looked for tags that seemed to fit the bill: videogames, over-hyped, seductive. I started with Callum:

Callum’s first read was a decent stab: hardly mechanical at all, and he got the name right. It didn’t have the shouty energy I would expect from a UFC trailer but it got the message across. That said, if I were the Producer I would have binned this clip, although it’s surprisingly close.

Callum (Hoarse / videogames) – default settings

For his second read I lowered the “Stability” parameter from its default of 50% to 30%. Lowering Stability increases the level of expressiveness. It gave the voice a decent energy and made it feel like a growly channel voiceover. It may not match the creative requirements of this or that Producer, but it is a technically usable voiceover.

Callum (Hoarse / videogames) – stability 30%
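The reads in this post were made with the Stability slider in the Eleven Labs UI. For anyone who would rather script it, here is a rough sketch of what an equivalent request might look like. Treat it as an assumption rather than a record of what was done here: the endpoint path, the `xi-api-key` header and the `voice_settings` field names reflect my understanding of the public text-to-speech API and should be checked against the current documentation. `VOICE_ID`, the API key and the `similarity_boost` value are placeholders, and `build_tts_request` is a hypothetical helper name. The sketch only builds the request, it does not send it.

```python
import json
import urllib.request

# Assumed base URL for the text-to-speech endpoint; verify before use.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(voice_id: str, text: str, stability_percent: int,
                      api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request for one read of a script.

    The UI shows Stability as a percentage; the API is assumed to take
    a value between 0.0 and 1.0, so 30% becomes 0.3.
    """
    if not 0 <= stability_percent <= 100:
        raise ValueError("stability_percent must be between 0 and 100")

    payload = {
        "text": text,
        "voice_settings": {
            "stability": stability_percent / 100,
            "similarity_boost": 0.75,  # placeholder value, not from the post
        },
    }
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    script = "Get ready for redemption! Next, here on Example."
    request = build_tts_request("VOICE_ID", script, 30, "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. If the assumptions above hold, the response body is the audio for that read.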

Callum showed us straight away that AI scripts and synthetic voices can land a piece of creative work pretty close to the target of usability. Perhaps Charlotte with her seductive voice and love of videogames can get closer to the bullseye:

Charlotte was a disappointing booking. Her energy was very low, and even turning the Stability down to zero for maximum expressiveness had no effect.

Charlotte (Seductive / videogames) – default settings
Charlotte (Seductive / videogames) – stability 30%
Charlotte (Seductive / videogames) – stability 0% (unstable)

No need to dwell on Charlotte’s shortcomings, time to move on. Khemet was warming up his deep voice in the canteen and his read was really good:

Khemet (Deep / characters-animation) – default settings
Khemet (Deep / characters-animation) – stability 30%

I would totally watch that show and his 30% read was beautiful. Next up into the booth was over-hyped gamer Freya to try and top Khemet’s talent:

Sadly, with the Stability at its default value of 50% Freya was a bit mechanical and not really usable.

Freya (Over-hyped / videogames) – default settings

However, with the Stability down at 30% Freya came to life. Her final read had a nice energy and was really usable.

Freya (Over-hyped / videogames) – stability 30%

Again, I’m counting it as a success if criticism of a voice clip can be couched in human terms, because then the fix is just to find another synthetic voice or roll your own.

From the above I learned that tweaking the settings can bring certain voices up to a usable standard, and that some voices are inherently not suitable for continuity use.
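The audition pattern used above, each candidate voice read once at the default Stability and once at 30%, is simple to enumerate if you want to batch it. This is a minimal sketch that only lists the reads to be made; it does not call any service. `Read` and `audition_grid` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from itertools import product

DEFAULT_STABILITY = 50  # the default value mentioned in this post, in percent


@dataclass(frozen=True)
class Read:
    """One audition read: a voice at a given Stability setting."""
    voice: str
    stability_percent: int

    def label(self) -> str:
        if self.stability_percent == DEFAULT_STABILITY:
            setting = "default settings"
        else:
            setting = f"stability {self.stability_percent}%"
        return f"{self.voice} - {setting}"


def audition_grid(voices, stabilities=(DEFAULT_STABILITY, 30)):
    """Return every voice paired with every Stability setting, in order."""
    return [Read(voice, stability)
            for voice, stability in product(voices, stabilities)]


if __name__ == "__main__":
    for read in audition_grid(["Callum", "Charlotte", "Khemet", "Freya"]):
        print(read.label())
```

Each `Read` could then be handed to whatever produces the audio, with a human still listening to the results and deciding who gets called back.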

KIDS

For the Kids’ genre I used the script where the AI “invented” the word “eggstraordinary”:

Get ready for a cosmic clash in “Space Chickens In Space.” Chickens, school plays, and intergalactic wars—seriously. It’s eggstraordinary chaos. Next, here on Example.

Again, looking for relevant tags I decided that a mention of animation would be worth a go, as well as anything suggesting high energy. First into the VO booth was Anika:

Anika (Excited / entertainment/tv) – default settings
Anika (Excited / entertainment/tv) – stability 30%

I found Anika a little mechanical, and her energy was all about talking to adults rather than kids. As in the Sport reads, adjusting the Stability made a decent improvement but not enough to get Anika called back for more. I liked the way Melissa was described as “intense”:

Melissa (Intense / characters-animation) – default settings
Melissa (Intense / characters-animation) – stability 30%

But she was far too flat and matter-of-fact. Even with the expressiveness cranked up she sounded like a bad actor trying to read a script. Neither of her reads was usable. She joined Anika at the bus-stop. Then in walked Gigi:

Gigi (Childish / animation) – default settings
Gigi (Childish / animation) – stability 30%

With her stability adjusted to 30% Gigi nailed her audition. She was not mechanical, she had the right energy. Finally I wanted to try some male voices for this gig, so I looked at Christopher and Jasper. Only one read each here, because it had become obvious that turning up the expressiveness is always required:

Christopher (Casual / characters-animation) – stability 30%
Jasper (Intense / characters-animation) – stability 30%

Neither of these could match the excellence of Gigi although both of them did seem usable for other channels beyond the Kids’ genre.

DRAMA

I chose the most promising AI script for the Drama genre:

Join the suspense as Sherlock unravels a murder tied to a billionaire CEO. Family bonds strained, secrets exposed. “Elementary.” Stay tuned for more intrigue. Next, here on Example.

I started out with the Stability set to 30% (high expressiveness) as a default from the experience on Sport and Kids. Since Christopher was already in the green room from the Kids’ auditions, I started with him:

Christopher (Casual / characters-animation) – stability 30%

Christopher more or less nailed it, although perhaps the script was letting him down and making his delivery a little mechanical. Gentle-voiced Kellan was next to try out:

Kellan (Gentle / characters-animation) – default settings

Kellan had a nice clean delivery, although he sounded uneasy delivering lines for TV continuity. He’s maybe better at audiobooks. Next in line was Henry with his pedigree in entertainment:

Henry (Modulated / entertainment tv) – stability 30%

Henry is all about daytime, with a nice bright read, and again the script is perhaps the weak link. The pacing could do with some improvement but this is a usable clip. Amelia was next, and I set the Stability back to its default (50%). This is drama after all, no need to rush it:

Amelia (Formal / narrative-story) – default settings

Her read was usable, with perhaps a little difficulty with the pacing. That could be the script as with the other auditions. Finally I saw Hemaka:

Hemaka (Deep) – default settings

As with the other readers, Hemaka produced a clear, usable output with only the pacing very slightly off. If this was a VO session I would tweak the script to make it easier to get right. However, since the script comes from an AI, and the goal is to make this a lightly monitored operation, making script changes in a sustainable way cannot involve regular manual editing. Somehow we need the script AI from one vendor to be aware of how a voice AI from another vendor will convert the text to sounds. At the moment that seems like quite a stretch, and in truth it’s probably better solved by creating purpose-built continuity voices inside Eleven Labs.

SUMMARY

For me, this experiment has shown that it is possible to make an AI write a script, and then give that script unaltered to a synthetic voice and have it produce audio that can be used on broadcast TV.

Sport and Kids appeared to be more amenable genres than Drama, suggesting that the AI does not handle vocal subtlety well. The script – voice – script feedback loop familiar to anyone who has been in a VO session does not realistically exist in the AI world. Producers will need to check that scripts are structured to suit a synthetic voice. The way to do that is not to manually alter each one (and thereby bring back the feedback loop). Rather, it’s a tweak to the script prompt, and possibly a voice modelling session with human talent, to get an accurate simulation of the cadence and tone of real continuity.

If you run a channel, you might look at all this as an enhancement to your spoken continuity workflows. Perhaps you can redeploy creative resources to other targets? You might be wondering if you can add an automated voice to an otherwise dry FAST service. The answer to both those questions is “yes” and you can contact me direct if you want to find out about the toolset I have built which can automate this via a fleet of After Effects robots.

If you are a continuity producer or a voice talent, you may have different, less positive thoughts. I would say to you that AI is always a co-pilot not an auto-pilot. Someone has to set the parameters, and keep an ear on the output. These voices don’t come from nowhere; they are simulations of real voices. There is creative human talent behind all this.