This chapter explains how to use object detection based on deep learning.
With object detection we want to find the different instances in an image and assign them to a class. The instances can partially overlap and still be distinguished as distinct. This is illustrated in the following schema.
Object detection involves two tasks: finding the instances and classifying them. To do so, we use a combined network consisting of three main parts.
The first part, called the backbone, consists of a pretrained classification network. Its task is to generate various feature maps, so the classifying layer is removed. These feature maps encode different kinds of information at different scales, depending on how deep they are in the network. See also the chapter Deep Learning. Feature maps with the same width and height are said to belong to the same level.
In the second part, backbone layers of different levels are combined. More precisely, backbone layers of different levels are specified as docking layers, and their feature maps are combined. As a result, we obtain feature maps containing information of lower and higher levels. These are the feature maps we will use in the third part. This second part is also called the feature pyramid, and together with the first part it constitutes the feature pyramid network.
The third part consists of additional networks, called heads, for every selected level. They get the corresponding feature maps as input and learn how to localize and classify potential objects, respectively. Additionally, this third part includes the reduction of overlapping predicted bounding boxes. An overview of the three parts is shown in the following figure.
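The reduction of overlapping predicted bounding boxes can be illustrated with a small sketch. The text does not state which method the network uses, so the example below assumes a common technique, greedy non-maximum suppression based on the intersection over union (IoU) of axis-aligned boxes. The function names, the (row1, col1, row2, col2) box layout, and the threshold name max_overlap are hypothetical and chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (row1, col1, row2, col2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(boxes, scores, max_overlap=0.5):
    """Greedy suppression (assumed technique, not confirmed by the text).

    Visits the boxes in order of decreasing score and keeps a box only if
    its IoU with every box kept so far does not exceed max_overlap.
    Returns the indices of the kept boxes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= max_overlap for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = suppress_overlaps(boxes, scores)
```

In this example the second box overlaps strongly with the higher-scoring first box and is dropped, while the third box does not overlap with any kept box and survives.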
Let us have a look at what happens in this third part.
In object detection, the location in the image of an instance is given by a
rectangular bounding box.
Hence, the first task is to find a suitable bounding box for every single
instance.
To do so, the network generates reference bounding boxes and learns how to
modify them to fit the instances as well as possible.
These reference bounding boxes are called anchors.
The better these anchors represent the shape of the different ground truth
bounding boxes, the more easily the network can learn them.
For this purpose the network generates a set of anchors
on every anchor point, that is, on every pixel of the used feature maps of the
feature pyramid.
Such a set consists of anchors of all combinations of shapes, sizes, and,
for instance type 'rectangle2' (see below), also orientations.
The shape of those boxes is affected by the parameter
'anchor_aspect_ratios', the size by the parameter
'anchor_num_subscales', and the orientation by the parameter
'anchor_angles', see the illustration below and
If the parameters generate multiple identical anchors, the
network internally ignores those duplicates.
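
The construction of such an anchor set can be sketched as follows. This is a minimal illustration under stated assumptions, not the network's actual implementation: the text does not give the formulas, so the sketch assumes that an aspect ratio means height divided by width at constant area, that the subscales are the factors 2^(i / num_subscales), and that each level has some base anchor size. The function generate_anchor_set and its argument base_size are hypothetical names.

```python
import itertools
import math


def generate_anchor_set(base_size, aspect_ratios, num_subscales, angles=(0.0,)):
    """Return the unique (height, width, angle) anchors for one anchor point.

    Assumptions (not taken from the text): ratio = height / width at
    constant area, and subscale factors 2 ** (i / num_subscales).
    """
    subscales = [2.0 ** (i / num_subscales) for i in range(num_subscales)]
    anchors = []
    seen = set()
    # Every combination of shape, size, and orientation yields one anchor.
    for ratio, scale, angle in itertools.product(aspect_ratios, subscales, angles):
        size = base_size * scale
        height = size * math.sqrt(ratio)
        width = size / math.sqrt(ratio)
        # Identical anchors produced by repeated parameter values are ignored.
        key = (round(height, 6), round(width, 6), round(angle, 6))
        if key in seen:
            continue
        seen.add(key)
        anchors.append((height, width, angle))
    return anchors


# Three shapes and three sizes give nine anchors per anchor point.
anchors = generate_anchor_set(32, [1.0, 2.0, 0.5], 3)

# Adding two orientations doubles the set.
oriented = generate_anchor_set(32, [1.0, 2.0, 0.5], 3, angles=(0.0, math.pi / 2))

# A repeated aspect ratio creates duplicates, which are dropped.
deduplicated = generate_anchor_set(32, [1.0, 1.0, 2.0], 2)
```

Since one such set is placed on every pixel of every used feature map, the total number of anchors grows with the product of the three parameter list lengths, which is worth keeping in mind when choosing them.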