Video Fingerprinting

blog-image
Credit: Photo by George Prentzas on Unsplash

Intro

In this post I want to share my experiments with generating video fingerprints.

The goal of fingerprinting is to make video media identifiable and comparable.

Fingerprinting is the key building block for comparison, similarity search, de-duplication and identification of video media.

Basics

Fingerprinting works by generating a binary hash of the image. These hashes can in turn be compared to each other. Let's take a look at how this process works with text before we dive into media hashing.

Text Bin Diff

loom   01101100 01101111 01101111 01101101
lo0m   01101100 01101111 00110000 01101101
                         ^ differs in 6 bits

This example shows that the two strings differ in 6 bits; the remaining bits are identical. The same process can be applied to video fingerprinting: by comparing a hash generated from one video with the hash of another, it is possible to determine whether both videos share the characteristics that were encoded in the fingerprint hash.

For equal-length binary strings this metric is the Hamming distance; the Levenshtein distance generalizes it to strings of different lengths.

For a large set of fingerprints it is, however, not feasible to pre-compute the distances for all fingerprint pairs.
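A single pairwise comparison is cheap on its own: counting the differing bits can be done directly on the hex-encoded fingerprints. A minimal sketch in Python (the function name is mine):

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count differing bits between two equal-length hex fingerprints."""
    assert len(hex_a) == len(hex_b)
    # XOR the integer values; every 1-bit in the result is a differing bit.
    diff = int(hex_a, 16) ^ int(hex_b, 16)
    return bin(diff).count("1")

# The 'loom' vs 'lo0m' example from above, as hex:
a = format(int("01101100011011110110111101101101", 2), "08x")  # '6c6f6f6d'
b = format(int("01101100011011110011000001101101", 2), "08x")  # '6c6f306d'
print(hamming_distance(a, b))  # → 6
```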

Another approach to comparing fingerprints is nearest neighbor search (NNS). In this case the fingerprint can be seen as a multidimensional vector. Similar vectors lie close together, so NNS can be applied to find neighbors of the selected fingerprint.

The last approach is to store the fingerprints as vectors in a binary tree. Queries can then be run against this tree to determine nearest neighbors.

Additional information can of course also be included in the vectors to search for more specific traits of the media. Outside of media fingerprinting this process may also be used for product suggestions / recommendations: user, product and behaviour information may be added to the vectors, which can then be searched.

There are various projects on GitHub that allow the creation and storage of such binary trees for nearest neighbor search.

The one I tested was Annoy from Spotify.
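To illustrate what such a neighbor query does, here is a brute-force linear-scan sketch in plain Python; Annoy replaces the scan with a forest of trees to stay fast on large sets (the names below are illustrative, not Annoy's API):

```python
def neighbors(query: str, fingerprints: list, k: int = 3) -> list:
    """Brute-force k-nearest-neighbor search by Hamming distance.

    Annoy trades this exact linear scan for an approximate but much
    faster tree-based lookup on large fingerprint sets.
    """
    def dist(a: str, b: str) -> int:
        return bin(int(a, 16) ^ int(b, 16)).count("1")

    return sorted(fingerprints, key=lambda fp: dist(query, fp))[:k]

db = ["6c6f6f6d", "6c6f306d", "00000000", "ffffffff"]
print(neighbors("6c6f6f6d", db, k=2))  # → ['6c6f6f6d', '6c6f306d']
```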

Process

For video you of course need to deal with image data. The process I came up with works as follows:

  • Seek to 15% position of the video

  • Iterate over a given amount of frames

  • Skip black frames (e.g. cut-screen frames)

  • Process the frame by reducing, greyscaling, blurring and normalizing it

  • Stack/Combine the frames additively into a single output image

  • Reduce the output image to a binary color precision

  • Convert the image into a bitstream

  • Convert to hex

I decided to try the stacking approach in order to generate fingerprints that are less prone to slight changes in framerate and start offset. To speed up the process, some of the frames after a taken frame are omitted.
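The steps above can be sketched with NumPy alone. Real frames would come from a video decoder such as FFmpeg or OpenCV, and the blur, thresholds and sizes below are illustrative stand-ins for the actual implementation:

```python
import numpy as np

def normalize(img):
    """Min-max stretch to the full 0..1 range."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def box_blur(img, k=3):
    """Simple box blur via shifted sums (a stand-in for a proper blur)."""
    out = np.zeros_like(img)
    pad = np.pad(img, k // 2, mode="edge")
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def is_black(frame, threshold=0.05):
    """Detect near-black frames (e.g. cut-screens) so they can be skipped."""
    return frame.mean() < threshold

def fingerprint_image(frames, size=16):
    """Reduce, blur and normalize greyscale frames, then stack additively."""
    acc = np.zeros((size, size))
    for frame in frames:
        if is_black(frame):
            continue
        step = frame.shape[0] // size
        small = frame[::step, ::step][:size, :size]  # crude downscale
        acc += normalize(box_blur(normalize(small)))
    return normalize(acc)

# Two synthetic 512x512 greyscale "frames" plus one black frame to skip.
rng = np.random.default_rng(0)
frames = [rng.random((512, 512)), np.zeros((512, 512)), rng.random((512, 512))]
stacked = fingerprint_image(frames)
print(stacked.shape)  # → (16, 16)
```

The stacked image is what the later stages binarize and encode as hex.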

Detailed Process

Stage 1 - Source

Resize the source frame to 512x512.

stage 1
Figure 1. Source frame

Stage 2 - Preparation

Convert to greyscale and normalize the image to reduce overly bright spots. Blur the image and normalize again. Resize to 16x16 pixels.

stage 2
Figure 2. Reduced
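The greyscale conversion in this stage can be read as the usual luma weighting; a sketch using the Rec. 601 coefficients (an assumption, the actual weights in the implementation may differ):

```python
import numpy as np

def to_greyscale(rgb):
    """Weighted RGB -> luma conversion (Rec. 601 coefficients)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

frame = np.ones((512, 512, 3))  # a pure-white RGB frame
grey = to_greyscale(frame)
print(grey.shape)  # → (512, 512)
```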

Stage 3 - Normalize

Normalize the image

stage 3
Figure 3. Normalized

Stage 4 - Stacking

Stack the frame with the previously processed frame.

stage 4
Figure 4. Stacked

Stage 5 - Normalize

Normalize the stacked image to reduce bright spots that may have been created due to stacking.

stage 5
Figure 5. Normalized

Stage 6 - Binary

Convert grey values to one bit precision.

stage 6
Figure 6. Result

The resulting image content is converted into hex, which forms the final fingerprint value.

060007000f003e001d0085000600f40076597a007c86ffdefffffefffcfffdff01
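The last two stages, one-bit reduction and hex conversion, boil down to a threshold plus bit packing. A sketch, assuming a 16x16 greyscale image scaled to the 0..1 range (threshold and names are illustrative):

```python
import numpy as np

def to_hex_fingerprint(img, threshold=0.5):
    """Threshold to one bit per pixel, then pack the bitstream as hex."""
    bits = (img.flatten() > threshold).astype(int)
    value = int("".join(map(str, bits)), 2)
    # 16x16 pixels -> 256 bits -> 64 hex characters, zero-padded.
    return format(value, f"0{bits.size // 4}x")

img = np.zeros((16, 16))
img[8:, :] = 1.0  # bright lower half
print(to_hex_fingerprint(img))  # → 32 zeros followed by 32 'f's
```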

Example Video

This video shows the stacking process in action using a sample from Big Buck Bunny.

Other sources

pHash

pHash is an open source perceptual hashing mechanism for media. As far as I recall it uses edge detection to find the more prominent areas of the video. The detected edges are transformed via the Hough transform into a different spatial representation that is robust to scaling and rotation. A more detailed description can be found in this post: Phash knows the perception of the human eye

Chromaprint

Chromaprint is mainly designed for audio but shares some concepts with video fingerprinting. The main aspect is that a Fourier transformation is applied to the audio data to prepare it for processing. Check out the author's detailed explanation of how Chromaprint works.

Final thoughts

My initial implementation of this process works as expected, but there are still some cases in which the fingerprinting process does not yield a meaningful stacked image / generated hash. I assume there may be a problem with the frame-selection process, or that the normalization does not work as expected. I have to look into this.

I hope this post was interesting to you. I may create a follow-up post in the future covering the actual storage and query processes for the generated fingerprints.