Silverster - Single Post

IMDb trailers: Who and what is in the scene?

Artificial Intelligence on movies

Jeroen / 17 June 2019 / Interest

The aim of the thesis is to provide a first methodology for searching for objects and people in digitally filmed images.

We used the freely available data collections from IMDb. In addition, we extended our data by web scraping with Python.
As an example, we will use the trailer of 'The Wolf of Wall Street' (WOW).

Object detection is done with YOLOv3, a convolutional neural network (CNN) capable of detecting 80 object classes. The network has been trained on the COCO dataset (Common Objects in Context).

Object detection at work.

Difference between face detection and face recognition

- face detection: only locating the faces in a frame.
- face recognition: locating the faces in a frame and being able to link each one to a specific person.

Face detection at work.

In the question 'Who and what is in the scene?' there are three subjects to discuss. So far we have an answer to the 'who' and the 'what'; we still have to discuss the 'scene':

How will we define a scene?
-> A new scene starts at every change in camera angle or whenever the location of the camera changes.

SSIM at work. Alternating color for each detected shot.

Analysis

An image in Python is stored as a NumPy array, which is similar to a mathematical matrix. Calculating the mean squared error (MSE) between two frames is easy: the sum of the squared differences between the two arrays, divided by the number of pixels.
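The MSE calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the thesis; the function name `mse` and the tiny 2×2 example frames are made up for the example.

```python
import numpy as np

def mse(frame_a, frame_b):
    """Mean squared error between two frames of identical shape."""
    # Cast to float first: subtracting uint8 images directly would wrap around.
    a = np.asarray(frame_a, dtype=np.float64)
    b = np.asarray(frame_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("frames must have the same shape")
    # np.mean divides the summed squared differences by the number of values.
    return float(np.mean((a - b) ** 2))

# Two hypothetical 2x2 grayscale frames that differ in two pixels.
frame_1 = np.array([[10, 20], [30, 40]], dtype=np.uint8)
frame_2 = np.array([[10, 22], [30, 44]], dtype=np.uint8)

print(mse(frame_1, frame_2))  # (0 + 4 + 0 + 16) / 4 = 5.0
```

For a colour frame, `np.mean` averages over all channel values, so the divisor is the number of pixels times the number of channels.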



The SSIM index is calculated on various windows of an image. It is measured between two windows x and y of common size N×N.
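For reference, the standard SSIM definition between two windows x and y can be written as below. The symbols are not introduced elsewhere in this post and follow the usual convention: \mu is the mean of a window, \sigma^2 its variance, \sigma_{xy} the covariance of the two windows, and c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2 are small stabilising constants, with L the dynamic range of the pixel values and, by default, k_1 = 0.01 and k_2 = 0.03.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + c_1)\,(2\sigma_{xy} + c_2)}
       {(\mu_x^2 + \mu_y^2 + c_1)\,(\sigma_x^2 + \sigma_y^2 + c_2)}
```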

However, MSE is not the solution. As shown below, MSE fails when a shot ends or starts with a fade, a technique often used in movies. Therefore, we need the concept of structural information:
structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close.
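To make the contrast between MSE and SSIM concrete, here is a small sketch that compares a brightness fade with a hard cut. It rests on several assumptions: the frames are synthetic random images, the fade is modelled as a plain 30% darkening, and `global_ssim` is a simplified, single-window version that treats the whole frame as one window. A windowed implementation, such as `structural_similarity` in scikit-image, averages over many local windows; which implementation the thesis used is not stated here.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two float frames."""
    return float(np.mean((a - b) ** 2))

def global_ssim(x, y, data_range=255.0):
    """Simplified SSIM that treats the whole frame as a single window."""
    c1 = (0.01 * data_range) ** 2  # stabilising constants with the
    c2 = (0.03 * data_range) ** 2  # conventional k1 = 0.01, k2 = 0.03
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(64, 64))  # synthetic "current" frame
faded = 0.7 * frame                         # same content, 30% darker
other = rng.uniform(0, 255, size=(64, 64))  # unrelated content (hard cut)

print("fade: MSE =", mse(frame, faded), " SSIM =", global_ssim(frame, faded))
print("cut:  MSE =", mse(frame, other), " SSIM =", global_ssim(frame, other))
```

In this sketch the faded frame gets an MSE that is far from zero although its content is unchanged, while its SSIM stays clearly higher than that of the unrelated frame, because the pixel structure is preserved.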

MSE vs SSIM

We calculate the SSIM for every pair of consecutive frames in our trailers. We then introduce a new variable: the difference between two consecutive SSIM values. We call it SSIMdrop.

SSIMdrop: predicting a new shot

If SSIMdrop is positive, the two frames look more alike than the previous pair did. If SSIMdrop is negative, the two frames are more different. The size of the drop is a measure of that difference.
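The idea can be sketched as follows, assuming SSIMdrop is the change in SSIM from one consecutive-frame pair to the next. The function names, the example values and the threshold of -0.4 are hypothetical and only serve the illustration; the post does not state which cut-off was used.

```python
def ssim_drops(ssim_values):
    """Difference between each SSIM value and the one before it."""
    return [cur - prev for prev, cur in zip(ssim_values, ssim_values[1:])]

def predict_shot_changes(ssim_values, threshold=-0.4):
    """Indices into ssim_values where the SSIM falls sharply.

    ssim_values[k] is assumed to be the SSIM between frame k and frame k + 1,
    so a flagged index k means a new shot is predicted to start at frame k + 1.
    The threshold is an illustrative value, not one taken from the thesis.
    """
    drops = ssim_drops(ssim_values)
    # drops[i] describes the step from ssim_values[i] to ssim_values[i + 1].
    return [i + 1 for i, drop in enumerate(drops) if drop < threshold]

# Made-up SSIM series: frames stay similar, except for one abrupt change.
ssim_values = [0.95, 0.96, 0.94, 0.20, 0.93, 0.95]

print(ssim_drops(ssim_values))
print(predict_shot_changes(ssim_values))  # [3]
```

Only the strongly negative drop is flagged. The large positive value right after it merely reflects that the frames inside the new shot look alike again.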

Example of 4 trailers: accuracy, recall and trueness

We are now ready to put everything together in one dashboard for our selection of movie trailers: the answer to the question 'Who and what is in the scene?'.

Final result: WOW - shot33

Final result: WOW - shot87

By the end of this project we had analysed more than 60 million lines of data.


