Problem Definition: Surfacing the Hidden Data in Audio for Advertising
Podcasting and audio media have taken the stage as the newest and fastest-growing media frontier that’s ripe for advertising investment.
For over a decade, the industry’s growth was held back by a lack of audio advertising tools. The standard approach for advertisers to find audio for investment was to have employees manually review content, an approach limited by human bias and scale. It’s impossible for a group of people to comb through thousands of episodes in a catalog for accurate contextual and safety insights, let alone keep up with the 107,000+ shows that released an episode in the last 3 days.
Additionally, human review of a podcast episode for brand safety and suitability happens once, and then the annotator must move on to the next piece of content to classify. However, society’s perception of a topic as risky or acceptable varies widely over time. As both industry safety standards and mainstream perceptions of a given topic change, who will maintain the accuracy of episode safety ratings in audio catalogs?
Content risk is a living definition that evolves as people do, and having an automated solution to accurately reflect society’s best assessment of content, based on collectively agreed-upon standards, is essential to supporting any expansive media and advertising industry.
One type of solution that seeks to address the challenges of brand safety, suitability, and contextual targeting at scale is keyword targeting.
On the beneficial side, keyword-driven solutions can be helpful in addressing the scale and speed challenges facing brand safety automation in audio. Once you automate podcast transcription, you can quickly filter those transcripts for keywords to tell an advertiser, “This content isn’t safe for your brand, because there are violent words in it.” Or, “This content is contextually aligned to your business because it’s about shoes and you’re a fashion brand.”
Unfortunately, keyword solutions are vulnerable to contextual inaccuracies that are inherent to their model design and can amplify the mistakes that we observed in manual reviews.
Here are three examples to illustrate the negative impact of keyword-driven brand safety and contextual targeting solutions in audio:
- Keywords like “shoot” can be high risk in breaking news content (e.g. “mass shooting”) but completely brand safe in sports content (e.g. “shooting a ball”).
- Keywords like “hood” or “pot” are discriminatory, and in practice they block brands from fashion and cooking inventory more than from anything related to the groups they target.
- A host in a comedy show says the F-word when joking about his day. A keyword-driven solution working within GARM’s industry standards will correctly flag this as “Obscenity,” but will also incorrectly flag it as “Adult Explicit & Sexual Content.”
When errors like these are repeated across hundreds of thousands of episodes, the results are that:
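To make the failure mode concrete, here is a minimal sketch of a naive keyword filter. It is not any vendor's actual implementation; the blocklist, the function name `keyword_flags`, and the sample sentences are all made up for illustration.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKLIST = {"shoot", "shooting", "hood", "pot"}

def keyword_flags(transcript: str) -> set[str]:
    """Return every blocklisted keyword that appears in the transcript."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return words & BLOCKLIST

news = "Police are responding to a mass shooting downtown."
sports = "She is shooting the ball from the three-point line."
cooking = "Simmer the beans in a large pot for an hour."

# The filter sees only the word, never the context it was used in.
for text in (news, sports, cooking):
    print(sorted(keyword_flags(text)), "<-", text)
```

The news sentence and the sports sentence receive the same flag, and the cooking sentence is flagged as well. Nothing in a keyword match can tell these cases apart, which is the contextual blind spot described in the examples above.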
- Tons of safe and no-risk content in advertising inventory gets removed from the marketplace.
- Diverse creators are unfairly penalized by advertisers using keyword solutions and discriminatory blocklists for content filtering, instead of contextual analysis and risk levels.
- Brands are unable to accurately find content within their ranges of comfort and suitability (e.g. “I’m fine with cursing but not with sexual references”).
So, how can one design a model that’s capable of translating the complexity of human language into contextual and safety insights that meet an advertiser’s needs?
This brings us to the four tenets of effective advertising technology for processing human speech. If a business chooses a solution that doesn’t achieve all four, creators and advertisers will both pay the price.
- Accuracy – This tenet describes the correct classification of content within standardized definitions of risk and context across diverse perspectives. How much of a person’s ability to understand human speech goes beyond keywords? Tone, intention, sentiment, and topic — these factors are incredibly important to brand suitability and contextual targeting accuracy.
- Relevancy – Once you’ve accurately classified the safety and contextual information in audio content, which are truly the most relevant segments to achieve a brand’s goals? For example, a health-driven dog food business would find a more optimal (i.e. higher ROI) placement in a podcast guest feature with a top dog trainer than a fiction podcast where a pet dog is a recurring side character.
- Speed – The solution needs to be able to process audio at top speed to keep up with the exponential rate at which new audio content is being uploaded online. It needs to quickly turn new audio content into insights, while also keeping an eye on the pulse of trends and sentiments toward various social topics.
- Scale – Connected to all of these points, the solution must sustain high performance at scale. An agency should be able to look out into the ocean of audio content and quickly filter down to the exact shows, episodes, and segments that will best serve its advertising goals.
These were the tenets and challenges in mind when we designed our solution at Sounder.
Our team of AI/ML engineers, alongside our leadership team of audio and ad tech veterans, started with and continues to hold one goal:
Create audio-first technology that supports a thriving audio ecosystem and empowers all creators, publishers, and advertisers while preserving the listener’s experience.
These four tenets served as the main product requirements when designing the AI/ML models that would lay the foundation for all of Sounder’s audio data solutions: brand safety and suitability, contextual targeting, podcast market intelligence, and more.
Last summer, we proudly launched our first solution for automated brand safety and contextual targeting in audio.
Today, we’re sharing an overview of the model design underpinning this solution, and how it addresses all the challenges that we described above from traditional solutions.
The Solution: Our Model, Methodology, and Lessons Learned
Data is one of the most crucial elements for building a strong AI model. Our approach to creating an AI-based solution for brand safety and suitability started off with placing audio data, specifically podcast data, at the center of our model’s training, testing, and evaluation processes. Through our platform and our Podnods toolset, we have the capability to sweep through millions of podcasts and through tens of millions of episodes to sample for model training.
While some solutions are specialized in surfacing insights from visuals or text, our solution is specialized in handling spoken word audio files—whether that be in podcasts, videos, or otherwise.
In the following paragraphs, I’ll walk through:
- How we train and test our models on the right data to surface advanced advertising insights
- How we evaluate model performance
- Lessons from past model iterations and experiments to support future developers
How we gather the right data for training and testing models
When designing AI solutions, the most difficult and important piece of the work is creating the right dataset. The correct dataset can support multiple needs and points of weakness in a model’s performance and enables you to train and test different models more effectively.
At Sounder, we’ve evolved to use a mix of automation and manual effort when creating labeled data. We apply industry-standard media classifications from groups like the Global Alliance for Responsible Media (GARM) and the Interactive Advertising Bureau (IAB) to guide our data collection of safe and ‘unsafe’/‘unsuitable’ samples from podcasting content. (Access our whitepaper applying GARM’s classifications to audio here.)
To meet our data collection requirements, we also built an in-house semantic search capability that indexes transcripts across millions of episodes.
Our in-house experts on ad industry standards are involved in every step of modeling—collaborating with our annotation team and AI engineers as we review the data for training, testing, and evaluation. Having a deep knowledge of podcasting and ad technology from domain experts helps us immensely when developing datasets and planning for model improvements.
Our audio publisher and advertiser partners (e.g. iHeartMedia and SpokenLayer) also provide essential feedback as they implement and evaluate our tools across their audio catalogs and advertising sales.
How we evaluate model performance
The first step was prioritization and defining key metrics.
We started by prioritizing accuracy and relevancy above the other tenets and building a framework for measuring success against these two tenets. We experimented with a variety of evaluation datasets and techniques, while simultaneously building a large pipeline for gathering domain-specific training data.
For example, the first manually labeled dataset we received was generated in June 2021 with the help of an external party with years of experience in brand safety annotation. From there, we greatly expanded our evaluation dataset to include a balanced coverage of samples reflecting accuracy against different industry standards and their sub-classes.
We currently use several top and supporting metrics to evaluate the effectiveness of our models’ brand safety, suitability, and contextual labels.
How we experimented with different AI solutions
If we go back to the four tenets listed above, the next two tenets we targeted were scale and speed. Early on in development, we discussed speed requirements with our creators and publishers. We put a stake in the ground to provide a high-performing solution that could accurately assign brand safety labels for one hour of audio in under two minutes (including transcription time). We designed prototypes with several AI models that are cost-effective and fast. In the early stages, our models focused on solutions that mostly stemmed from multi-class and multi-label classification.
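For readers less familiar with the two framings, the sketch below contrasts them. It is a generic illustration, not Sounder's model: the category list borrows three GARM category names used in this article, and the raw scores (`logits`) are made-up numbers standing in for a classifier's output.

```python
import math

CATEGORIES = ["Obscenity", "Arms & Ammunition", "Death, Injury & Military"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Turn one raw score into an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))

def multi_class(logits):
    """Multi-class: pick exactly one category, the most probable one."""
    probs = softmax(logits)
    return CATEGORIES[probs.index(max(probs))]

def multi_label(logits, threshold=0.5):
    """Multi-label: pick every category whose own probability clears the threshold."""
    return [c for c, x in zip(CATEGORIES, logits) if sigmoid(x) >= threshold]

logits = [2.0, 1.5, -3.0]  # made-up scores for one audio segment
print(multi_class(logits))
print(multi_label(logits))
```

With these scores, the multi-class framing returns only “Obscenity,” while the multi-label framing returns both “Obscenity” and “Arms & Ammunition.” A single segment can carry several kinds of risk at once, which is why the multi-label framing is a natural fit for brand safety categories.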
As part of the evolution of our modeling, we started experimenting with new techniques that advanced performance at shorter segment levels after our clients requested more granular insights into safety and context. Similar to the concept of heat-maps, we developed the capability to trace episode and show-level evaluations back to the exact segments that culminated in those high-level results.
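One way to picture this traceability is a roll-up that keeps a pointer back to the segments responsible for each episode-level result. The sketch below is an assumption-laden illustration, not our production logic: the `SegmentLabel` structure, the four-level risk scale, and the rule that an episode takes the highest risk found in any segment are all hypothetical choices.

```python
from dataclasses import dataclass

# Hypothetical ordering of risk levels, lowest to highest.
RISK_ORDER = {"none": 0, "low": 1, "moderate": 2, "high": 3}

@dataclass(frozen=True)
class SegmentLabel:
    start_s: int   # segment start, in seconds
    end_s: int     # segment end, in seconds
    category: str
    risk: str

def roll_up(segments):
    """Map each category to (episode-level risk, segments that reached that risk)."""
    result = {}
    for seg in segments:
        best = result.get(seg.category)
        if best is None or RISK_ORDER[seg.risk] > RISK_ORDER[best[0]]:
            result[seg.category] = (seg.risk, [seg])
        elif seg.risk == best[0]:
            best[1].append(seg)
    return result

episode_segments = [
    SegmentLabel(0, 60, "Obscenity", "low"),
    SegmentLabel(300, 360, "Arms & Ammunition", "low"),
    SegmentLabel(600, 660, "Obscenity", "moderate"),
    SegmentLabel(1200, 1260, "Obscenity", "moderate"),
]

for category, (risk, sources) in roll_up(episode_segments).items():
    spans = [(s.start_s, s.end_s) for s in sources]
    print(f"{category}: {risk}, driven by segments {spans}")
```

Because each episode-level label carries the time spans that produced it, a reviewer can jump straight to the relevant audio instead of re-listening to the whole episode.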
Lastly, we also needed to refine our model to improve performance in iterations based on feedback from the annotators and evaluation tests. Today, we are using an ensemble of transformers and other classifiers for our GARM brand safety and suitability solution.
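As a rough illustration of what an ensemble does, the sketch below averages per-category probabilities from several classifiers (often called soft voting). This is one common ensembling technique chosen for the example; the article does not specify how our models are combined, and the three sets of scores are made up.

```python
def soft_vote(model_probs):
    """Average each category's probability across all models in the ensemble."""
    categories = model_probs[0].keys()
    return {
        category: sum(probs[category] for probs in model_probs) / len(model_probs)
        for category in categories
    }

# Made-up outputs from three hypothetical classifiers for one segment.
transformer_a = {"Obscenity": 0.9, "Arms & Ammunition": 0.2}
transformer_b = {"Obscenity": 0.7, "Arms & Ammunition": 0.4}
other_classifier = {"Obscenity": 0.8, "Arms & Ammunition": 0.0}

combined = soft_vote([transformer_a, transformer_b, other_classifier])
for category, prob in combined.items():
    print(f"{category}: {prob:.2f}")
```

Averaging smooths out the mistakes of any single model: here one classifier is fairly suspicious of “Arms & Ammunition,” but the combined score stays low because the other two disagree.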
By combining our data collection, our model evaluation, and an infrastructure that can dynamically scale within minutes, we’ve built an at-scale AI solution that can handle tens of thousands of episodes within hours.
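A quick back-of-the-envelope calculation shows what these targets imply. The two-minute budget per hour of audio comes from the requirement above; the catalog size, the deadline, and the assumption that work spreads perfectly across identical parallel workers are hypothetical simplifications.

```python
import math

BUDGET_MIN_PER_AUDIO_HOUR = 2  # target stated above: under two minutes per hour of audio

def realtime_factor(budget_min=BUDGET_MIN_PER_AUDIO_HOUR):
    """How many times faster than playback speed the pipeline must run."""
    return 60 / budget_min

def workers_needed(audio_hours, deadline_hours, budget_min=BUDGET_MIN_PER_AUDIO_HOUR):
    """Parallel workers required if each one processes audio right at the budget."""
    total_processing_min = audio_hours * budget_min
    return math.ceil(total_processing_min / (deadline_hours * 60))

print(realtime_factor())          # processing speed relative to real time
print(workers_needed(10_000, 2))  # hypothetical: 10,000 one-hour episodes in 2 hours
```

Under these assumptions the pipeline must run at 30 times playback speed, and clearing 10,000 one-hour episodes in two hours takes 167 workers. This is why infrastructure that scales out within minutes matters as much as the speed of any single model.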
The Results: Our Model in Action
We’ve upheld our speed and scale thresholds during the last several months that we have been live with our clients. We processed over 900,0000 episodes while staying under two minutes of processing time per hour of content. As an example, our model currently achieves an 85% F1 score on short content (that is 15 minutes or less in duration) which surpasses the typical accuracy of human moderators (~80%) when labelling audio for brand safety and suitability across categories.
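For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision (how many flagged items were truly risky) and recall (how many truly risky items were flagged). The sketch below shows the standard formula; the counts are invented to illustrate the arithmetic and are not our evaluation data.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts: 85 correct flags, 15 false alarms, 15 missed items.
score = f1_score(true_pos=85, false_pos=15, false_neg=15)
print(f"{score:.2f}")
```

Because F1 penalizes both false alarms and misses, a model cannot score well simply by flagging everything or by flagging nothing.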
Since brand safety and suitability was the first audio data solution we launched, let’s review examples of the model in action for this use case.
Examples of Model Output Applied to Brand Safety and Suitability in Podcasts
#1 Preventing Over-Indexing on Safety and Unnecessary Loss of Ad Inventory
In this example, we look at an episode of a top-ranking podcast that’s also highly popular with advertisers—The Tim Ferriss Show, specifically, “Episode – #597: Morgan Fallon — 10 Years on the Road with Anthony Bourdain, 9 Emmy Nominations, Lessons from Michael Mann, Adventures with Steven Rinella, High Standards, Wisdom from West Virginia, and More”.
Here are our model’s evaluation results within GARM’s industry standards for brand safety and suitability:
As described in the accuracy tenet, our model establishes contextual understanding by looking at topics, entities, tone, intention, sentiment, and more in audio segments. Any model for human speech will need to review words, but keywords are not the primary driver in our model and this example will demonstrate why that’s necessary.
Looking at the keywords in this episode—we see a high density of one of the most blocked brand safety keywords in the market, “shoot(ing).” There are also numerous depictions of guns. Is this episode high-risk?
A keyword-driven solution would typically grade this episode not only as high risk in GARM’s “Arms & Ammunition” category, but as entirely unsafe, blocking it from ad inventory and monetization for its heavy use of the “risky” word “shoot.” This is a problem, because a contextual analysis tells a different and more meaningful story.
One essential detail about this episode: Morgan Fallon is a filmmaker, and much of the episode discusses his work and history with film. The word “shooting” was used to describe “shooting a film,” which is unrelated to weapons and makes this completely safe inventory for media buyers.
Reviewing our model’s topical analysis of the episode, it grasps this contextual insight.
Keyword solutions are, by design, unable to handle language complexity and contextual overlap in human speech. They can’t reliably interpret what topics are being discussed, and how those topics are being discussed (i.e. speaker’s intention, tone, sentiment towards the topic, etc.). Our audio intelligence solutions can.
When applied to content at scale, both creators and advertisers lose a significant opportunity due to the large mistakes keyword-driven solutions make in brand safety, suitability, and contextual analyses.
This is precisely why accuracy and relevancy are the top two tenets of our AI/ML model design.
Before moving on, let’s quickly review our model’s suitability assessment of the episode:
- Arms & Ammunition – Low Risk: The result stems from a discussion of guns and muzzle brakes with educational intent, making it safer than other types of discussions around weapons.
- Illegal Drugs/Tobacco/Vaping/Alcohol – Low Risk: There is a very brief story that references binge drinking tequila, and then getting sober.
- Death, Injury & Military – Moderate Risk: There are some references to and descriptions of an individual’s death.
- Obscenity – Moderate Risk: Obscenities are used without the intention to shock the audience but rather as a general exclamation.
All in all, Tim Ferriss’ episode content is highly appealing to listeners and advertisers. Many advertisers may intentionally seek a bit of edge (i.e. moderate risk) to align with their brand personality and audience outreach.
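To show how risk levels like these let a brand choose its own comfort range, here is a minimal sketch of a suitability check. The episode risks are the four results listed above; the four-level risk scale, the brand tolerance profiles, and the `blocking_categories` helper are hypothetical and not part of our product's API.

```python
# Hypothetical ordering of risk levels, lowest to highest.
RISK_ORDER = {"none": 0, "low": 1, "moderate": 2, "high": 3}

def blocking_categories(episode_risks, tolerance, default_max="low"):
    """Return the categories where the episode exceeds what the brand accepts.

    Categories missing from `tolerance` fall back to `default_max`.
    """
    return sorted(
        category
        for category, risk in episode_risks.items()
        if RISK_ORDER[risk] > RISK_ORDER[tolerance.get(category, default_max)]
    )

# Risk levels from the assessment listed above.
episode = {
    "Arms & Ammunition": "low",
    "Illegal Drugs/Tobacco/Vaping/Alcohol": "low",
    "Death, Injury & Military": "moderate",
    "Obscenity": "moderate",
}

# Two made-up brand profiles: one comfortable with some edge, one not.
edgy_brand = {"Obscenity": "moderate", "Death, Injury & Military": "moderate"}
cautious_brand = {}

print(blocking_categories(episode, edgy_brand))      # no blockers
print(blocking_categories(episode, cautious_brand))  # two blockers
```

The same episode is a clean match for the first brand and a mismatch for the second, and in both cases the advertiser can see exactly which categories drove the decision. A keyword blocklist offers neither the graded levels nor the explanation.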
This episode illustrates the natural complexity of language, and a large majority of audio content will present similar challenges to ad tech that surfaces insights from the spoken word.
Any solution that removes this content from inventory, along with the hundreds of thousands of episodes with similar risk factors, is ineffective and restricts the growth of creators, brands, and the audio industry.
#2 Accurately Defining Risk Levels in Challenging Topics to Protect Diversity in Inventory
Another major challenge in brand safety and suitability is the history of discriminatory keywords and topics blocking diverse creators from ad inventory and monetization opportunities. When applied at scale in large blocklists by advertisers, this results in the widespread demonetization of minority voices.
On top of that, these keywords also remove large swathes of content that are completely unrelated to contentious social topics. Blocklists can often grow to 8,000+ discriminatory keywords like “pot” and “hood,” which, as mentioned earlier, remove safe culinary and fashion content more than anything related to the unfairly targeted groups.
It’s another reason keyword-driven solutions are a substantial threat to the greater well-being of the audio ecosystem. By having clear definitions of safety outside of keywords, we can protect diverse voices in ad inventory so that brands can invest in them, take a stand with their values, and speak to their communities.
It’s also in a brand’s best interest to connect with their key audiences around shared values.
Ara Kurnit, VP and Managing Director of Strategy for Advertising at the New York Times, states, “In the current environment brands can’t afford to play it safe. 62% of global consumers want companies to take a stand on issues they’re passionate about, and 64% see brands that actively communicate their purpose as more attractive.”
That said, let’s look at an example of our model reviewing audio content that addresses a major social issue and also a historically difficult topic for brand safety and advertisers—racial inequality.
In this example, we look at an episode by Dr. Lauren Streicher, specifically “Episode 26: The Impact of Systemic Racism in Mid-life Women’s Health.”
Here are our model’s evaluation results:
Once again, our model demonstrates intelligent sensitivity to navigating brand safety and suitability definitions as it evaluates what is discussed in an episode and how it’s discussed.
Keyword solutions, or even manual review depending on the bias of the reviewer, would flag this content as higher risk from the discussion of racial issues alone.
Here are a few examples of commonly flagged keywords that arise in this episode.
Our model successfully interprets that this is a professional content piece with the intention to educate listeners, and so it grades the content as low risk for debated social topics.
Indeed, from the model’s topical analysis we can see that it correctly recognizes that the focus of the content is educating listeners about science and medicine while elaborating upon a nested sub-topic of racial challenges within cultural institutions.
Here’s a quick breakdown of its brand safety and suitability results:
- Debated Sensitive Social Issues – Low Risk – This medical-focused, research-based episode surfaces discussions of racial disparities and institutional racism in the field of medicine.
- Hate Speech & Acts of Aggression – Low Risk – This medical-focused, research-based episode surfaces discussions of racial microaggressions and short clips.
This example is representative of many audio segments created by the 43% of podcasters from diverse backgrounds, who often speak to and educate their audiences about social challenges faced by minority groups. It’s meaningful, quality content for the right brand to invest in over shared values.
Closing Thoughts
As described earlier, the most important aspects of solution design are the training dataset and working with experts to help refine a model’s development. We see complex challenges like these all the time when processing libraries of public and client content, and we’re proud to see our model perform with the level of precision and sensitivity that it does today. Naturally, we’ll continue to develop our solutions to deliver ever-better results.
By working on ad tech and well-defined industry standards, we can create a media ecosystem that delivers the greatest benefits and experiences to consumer communities, creators, advertisers, and all industry players.
Together, and with technology to support our limitations, there is incredible abundance and opportunity to be unlocked in the creative media and advertising industries.
Thank you for your time!