Measuring text in videos, or how sharp is a chiseled jawline

By Aaron Foote ’24

This post describes the project I did last summer as part of the QAC Summer Apprenticeship. The project was done in support of the work of the Wesleyan Media Project and the DeltaLab. I developed a method for efficiently extracting text from videos and, in the process, gained some insight into how sharp human faces are when examined by computer vision algorithms.

Like many other areas of life, social media, and political advertising on social media, have gone big-data in recent years. Advertising on social media allows candidates to reach groups of potential voters that may otherwise be difficult to reach. With online ads costing much less than TV ads, it is much easier to advertise online, and there has been an explosion in both the number of ads and the volume of content. This explosion is due in part to the growth of social media itself. In 2008, Facebook reached 100 million users and Snapchat did not even exist. As of September 2021, there were 1.9 billion daily active users on Facebook (Dean) and 306 million daily active users on Snapchat (Iqbal). For any candidate hoping to have their message heard, advertising on social media is a must.

The Wesleyan Media Project (WMP) tracks political campaigns and their advertising, both on TV and online. This involves knowing how much money was spent and, ideally, on which issues/topics it was spent.

My project focused on working with screen text in video ads. Text appears in videos in two forms: there is speech, which can be transcribed using off-the-shelf Automatic Speech Recognition (ASR) tools, and there is text on screen. Sometimes the audio and the screen text differ, and sometimes an ad has no spoken words at all. There are off-the-shelf tools for screen text as well, but they cost money to use. For example, Google has a Video Intelligence API that will return screen text from a video, but it costs 15 cents per minute of video. Another option is to run optical character recognition (OCR) on every frame of a video, but that produces noisy results, because in videos with animated text the same text string appears in many consecutive frames. So my goal was to come up with a method that extracts text from videos with as little noisy data as possible.

My approach was to identify key frames and perform OCR on them. Usually, in videos, text either appears in full or is gradually added. The frames where the amount of text is at a maximum are the key frames. Once I found these frames, I could stack them into a single image and submit it to an OCR service. To do that, I needed a metric that could be calculated for each frame; the frames where the metric reached its local maxima would constitute the key frames.
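To make this concrete, here is a minimal sketch of the frame-reading and frame-stacking steps, written with OpenCV and NumPy. The function names and the sampling step are illustrative assumptions, not the exact code used in the project.

    import cv2
    import numpy as np

    def read_frames(video_path, step=1):
        """Read every `step`-th frame from a video as a BGR pixel array."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames

    def stack_key_frames(frames, key_indices):
        """Stack the chosen key frames vertically into one tall image for a single OCR call."""
        return np.vstack([frames[i] for i in key_indices])

Stacking the key frames vertically means only one image per video needs to be sent to the OCR service, rather than one request per frame.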

Two metrics were considered: the proportion of the frame covered by text and the number of corners in the frame.

The proportion-of-text approach was ultimately not effective at identifying the key frames. In theory, the screen starts with some amount of text and slowly fills up as an idea/message is developed. Once the idea is complete or the screen is full, the screen is wiped so that more text can be added. Tools like Tesseract (or Google's OCR) can return the bounding-box coordinates for blocks of text, and from these we can compute the proportion of the frame covered by text. By calculating this proportion for each frame and extracting the local-maximum frames, all of the text should be captured, because the peak frame contains the text of all of the frames leading up to it.
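As an illustration, the proportion metric can be computed with pytesseract, which exposes Tesseract's word-level bounding boxes. This is only a sketch under assumptions: it requires the Tesseract binary to be installed, it ignores overlapping boxes, and the confidence cutoff of 50 is an arbitrary choice rather than a value from the project.

    import pytesseract
    from pytesseract import Output

    def text_proportion(frame):
        """Approximate fraction of the frame's area covered by OCR-detected words."""
        height, width = frame.shape[:2]
        data = pytesseract.image_to_data(frame, output_type=Output.DICT)
        text_area = 0
        for i, word in enumerate(data["text"]):
            # Skip empty detections and low-confidence boxes.
            if not word.strip() or float(data["conf"][i]) < 50:
                continue
            text_area += data["width"][i] * data["height"][i]
        return text_area / (width * height)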

This theory was only correct in some cases, and many times the fluctuations in proportion did not correspond to changes in the text on screen. This happened in two situations: when the size of the text changed, or when the new text occupied roughly the same region as the previous text.


Figure 1. Video (animated GIF) from a Facebook ad run by Donald J. Trump page and paid for by MAKE AMERICA GREAT AGAIN COMMITTEE. Facebook ad id 241249350213043 – link

Figure 2. Time series of proportion of text in a video frame, for the video in Fig. 1.

 

Note how “TAKE THE SURVEY” grows in the video above (Fig. 1). While the text itself does not change, the proportion of the screen it covers does, potentially leading to the incorrect extraction of frames. Figure 2 plots the proportion of the frame covered by text against the frame number. From frame 100 until the end of the video, no new text is being added, yet the proportion continues to increase as the size of the text increases.

Counting the number of corners in the image was a much more effective approach. Corner detection is a useful technique in computer vision, usually applied to identify points of interest or features of an image. Since a frame is really just a grid of pixels represented by numeric color values, one can take derivatives of the image to track how these values change across the pixels.

The Harris Corner Detection algorithm does exactly this. For each pixel (and its neighborhood), an image gradient is calculated. For a neighborhood of pixels without corners or edges, the gradient changes little in any direction when that neighborhood is shifted slightly. For a neighborhood containing an edge, there is no change for shifts along the edge, but there is a change for shifts across it. For a neighborhood containing a corner, any shift results in a change in the gradient.
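OpenCV implements this detector as cv2.cornerHarris, so a per-frame corner count can be sketched as follows. The block size, Sobel aperture, Harris k, and relative threshold below are common illustrative defaults, not necessarily the values used in the project.

    import cv2
    import numpy as np

    def corner_count(frame, block_size=2, ksize=3, k=0.04, rel_thresh=0.01):
        """Count pixels whose Harris response exceeds a fraction of the strongest response."""
        gray = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        response = cv2.cornerHarris(gray, block_size, ksize, k)
        if response.max() <= 0:
            return 0  # blank frame: no corners at all
        # A pixel counts as a corner if its response is large relative to the maximum.
        return int((response > rel_thresh * response.max()).sum())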

Text fonts contain a lot of corners, and these should be detected by the algorithm. An interesting question is whether the count of corners would be biased by the presence of human faces in a video frame. To investigate this, I made a chart showing two frames from the same ad with corners highlighted in red. The image with text has many more corners than even a detailed close-up image without text.

Figure 3. Results of corner detection. Corner pixels are shown in red. The corner detection algorithm does not tag facial features as “true” corners. Images are the screenshots from a Facebook ad run by Preserve America PAC and paid for by PRESERVE AMERICA PAC. Facebook ad id 2762201964060022 – link

Although non-text features of images do have corners, the relative difference means that the corners of text “drown out” the impact of the corners from non-text elements of a frame. An added bonus of corner detection is that it does not fall prey to the pitfalls of the proportion-of-text approach. No matter how much the text grows or shrinks, it still has the same number of corners. Additionally, the region of the screen in which the text appears is irrelevant for corner detection. Below is the same ad as above, but this time the plot graphs the corner count versus the frame number.

Figure 4. Time series of corner counts. Input is the video from the ad shown in Fig. 1 above. The inset images show the screenshots from specific frame numbers.

This chart is much cleaner than the one from the proportion-of-text approach, and it also reflects changes in the image much more accurately. As each new line of text slides onto the screen, the corner count increases, flattening in the moments between lines being added. Identifying local extrema from this chart is easier as well because there is much less noise.

One could argue that the corner count continues to increase over the last portion of the video even though text is no longer being added, but what actually matters is that the chart has only one local maximum on that interval. Over the portion of the video with constant text (approximately frame 100 onward), the corner-count curve rises at a steady slope, while the proportion-of-text curve jumps around for the entire interval.

Extracting the key frames is a straightforward process. As established, the key frames are the ones at the local maxima of the corner-count values. In mathematical terms, the local maxima correspond to peaks in a time-series signal, and there are standard Python libraries for finding them. There is one more twist: the peak-finding algorithm requires some tuning parameters and, under some circumstances, it may be over-sensitive to changes in the time series and return a high number of peak points. To further reduce noise and the number of frames, the maximum number of key frames is capped at the number of seconds in the video. This cap may seem arbitrary, but remember that each key frame is supposed to be a wholly unique frame of text; it would not make sense for a video to completely change the text on the screen more than once per second, because then the viewer would not have a chance to read it.
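A sketch of this step using scipy.signal.find_peaks is shown below. The helper name, the optional tuning parameter, and the prominence-based tie-breaking are assumptions made for illustration, not the exact implementation.

    import numpy as np
    from scipy.signal import find_peaks, peak_prominences

    def key_frame_indices(corner_counts, fps, distance=None):
        """Pick local maxima of the corner-count series, capped at roughly one per second of video."""
        counts = np.asarray(corner_counts)
        peaks, _ = find_peaks(counts, distance=distance)
        max_keyframes = max(1, int(len(counts) / fps))  # about one key frame per second
        if len(peaks) > max_keyframes:
            # If the detector is over-sensitive, keep only the most prominent peaks.
            prominences = peak_prominences(counts, peaks)[0]
            keep = np.argsort(prominences)[::-1][:max_keyframes]
            peaks = np.sort(peaks[keep])
        return peaks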

 

References:

Dean, Brian. 2021. “How Many People Use Facebook in 2021?” Backlinko. https://backlinko.com/facebook-users.

Iqbal, Mansoor. 2021. “Snapchat Revenue and Usage Statistics.” Business of Apps. https://www.businessofapps.com/data/snapchat-statistics/.