This chapter explains how to use object detection based on deep learning.
With object detection, we want to find the different instances in an image and assign each of them to a class. The instances can partially overlap and still be distinguished from each other. This is illustrated in the following schema.
Object detection involves two different tasks: finding the instances and classifying them. In order to do so, we use a combined network consisting of three main parts. The first part, called the backbone, consists of a pretrained classification network whose classifying layer has been removed; its task is to generate various feature maps. These feature maps encode different kinds of information at different scales, depending on how deep they lie in the network (see also the chapter Deep Learning). Feature maps with the same width and height are said to belong to the same level. In the second part, we take feature maps of different levels and combine them. As a result, we obtain feature maps containing information of both lower and higher levels. These are the feature maps we will use in the third part. This second part is also called the feature pyramid, and together with the first part it constitutes the feature pyramid network. The third part consists of additional networks, which receive the selected feature maps as input and learn how to localize and classify potential objects. Additionally, this third part includes the reduction of overlapping predicted bounding boxes. An overview of the three parts is shown in the following figure.
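To make the second part more concrete, the following is a minimal sketch of such a top-down combination of feature maps, using NumPy only. The fusion rule shown here (nearest-neighbor upsampling plus a 1x1-projected addition, as in common feature pyramid networks) is an assumption for illustration, not HALCON's internal implementation.

    # Minimal sketch of the top-down combination step of a feature pyramid.
    import numpy as np

    def upsample2x(fmap):
        # Nearest-neighbor upsampling of a (C, H, W) feature map by factor 2.
        return fmap.repeat(2, axis=1).repeat(2, axis=2)

    def build_pyramid(backbone_maps, out_channels=64,
                      rng=np.random.default_rng(0)):
        # backbone_maps: list of (C_i, H_i, W_i) maps, ordered shallow -> deep.
        # Returns merged maps carrying both low- and high-level features.
        laterals = []
        for fmap in backbone_maps:
            # 1x1 "lateral" projection to a common channel count; the random
            # weights stand in for weights that would be learned in training.
            w = rng.standard_normal((out_channels, fmap.shape[0]))
            laterals.append(np.einsum('oc,chw->ohw', w, fmap))
        # Walk from the deepest (coarsest) level upward, adding upsampled
        # context from the level below in the loop.
        merged = [laterals[-1]]
        for lateral in reversed(laterals[:-1]):
            merged.append(lateral + upsample2x(merged[-1]))
        return merged[::-1]  # shallow -> deep again

    # Toy backbone output: three levels with halving resolution, growing depth.
    maps = [np.ones((16, 64, 64)), np.ones((32, 32, 32)), np.ones((64, 16, 16))]
    for level, m in enumerate(build_pyramid(maps)):
        print(f"pyramid level {level}: shape {m.shape}")

Every output level keeps its original resolution but now also contains context propagated down from the coarser, semantically richer levels.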
Let us have a look at what happens in this third part.
In object detection, the location of an instance in the image is given
by a rectangular, axis-parallel bounding box.
Hence, the first task is to find a suitable bounding box for every
single instance.
To do so, the network generates reference bounding boxes and learns how
to modify them to fit the instances as well as possible.
While the bounding boxes containing instances are all rectangular, they
may have different sizes and aspect ratios.
Thus, the network has to learn where such bounding boxes may be located
and which shape they may have.
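One common way such a modification is parametrized (the sketch below assumes four offsets (dx, dy, dw, dh) per reference box, as used by many detectors; HALCON's internal encoding may differ):

    import numpy as np

    def decode_box(ref_box, offsets):
        # ref_box: (x1, y1, x2, y2) reference box; offsets: (dx, dy, dw, dh)
        # predicted by the network for this box.
        x1, y1, x2, y2 = ref_box
        w, h = x2 - x1, y2 - y1
        cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
        dx, dy, dw, dh = offsets
        # Shift the center relative to the box size, rescale width and height.
        cx, cy = cx + dx * w, cy + dy * h
        w, h = w * np.exp(dw), h * np.exp(dh)
        return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

    # A reference box that is slightly too small and off-center is adjusted:
    print(decode_box((10, 10, 50, 50), (0.1, -0.05, 0.2, 0.0)))

The exponential on the size offsets keeps the predicted widths and heights positive no matter what raw values the network outputs.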
Within the approach taken in HALCON, the network proposes a set of
reference bounding boxes for every pixel of every feature map of the
feature pyramid.
The shape of those boxes is affected by the parameter 'aspect_ratios'
and their size by the parameter 'num_subscales'; see the illustration
below and get_dl_model_param.
In this way, the number of values in 'aspect_ratios' times
'num_subscales' reference bounding boxes are generated for every such
pixel.
These reference bounding boxes are the base positions of potential
objects.
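To illustrate the counting, the following sketch enumerates the reference bounding boxes of a single pixel. The subscale rule 2**(k / num_subscales) and the width/height reading of the aspect ratios are assumptions borrowed from common anchor schemes; only the resulting count, the number of values in 'aspect_ratios' times 'num_subscales', is taken from the text above.

    import numpy as np

    def reference_boxes(cx, cy, base_size, aspect_ratios, num_subscales):
        # Boxes (x1, y1, x2, y2) centered on pixel (cx, cy) of one level.
        boxes = []
        for k in range(num_subscales):
            # Intermediate sizes between two pyramid levels (assumed rule).
            size = base_size * 2.0 ** (k / num_subscales)
            for ar in aspect_ratios:
                # ar read here as width / height (an assumption); the area
                # stays size**2 while the side ratio varies.
                w, h = size * np.sqrt(ar), size / np.sqrt(ar)
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        return boxes

    boxes = reference_boxes(32, 32, base_size=16,
                            aspect_ratios=[0.5, 1.0, 2.0], num_subscales=3)
    print(len(boxes))  # 3 aspect ratios * 3 subscales = 9 boxes per pixel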