Deep Learning [HALCON Operator Reference / Version 18.11.3.0]

Deep Learning

Introduction

The term deep learning (DL) refers to a family of machine learning methods. In HALCON, the following methods are implemented:

Classification:: Classify an image into one class out of a given set of classes. For further information please see the chapter Deep Learning / Classification.

A possible example for classification: The image gets assigned to a class.
Object Detection:: Detect objects of the given classes and localize them within the image. For further information please see the chapter Deep Learning / Object Detection.

A possible example for object detection: Within the input image three instances are found and assigned to a class.
Semantic Segmentation:: Assign a class to each pixel of an image. For further information please see the chapter Deep Learning / Semantic Segmentation.

A possible example for semantic segmentation: Every pixel of the input image is assigned to a class.

Thereby the latter two are implemented within the general DL model, see the chapter Deep Learning / Model. All three different deep learning methods listed above have a network performing the assignment task. The network is trained by only considering the input and output, which is also called end-to-end learning. Basically, using images and the information, what is visible in them, the training algorithm adjusts the network in a way to distinguish the different classes and eventually also how to find the corresponding objects. For you, it has the nice outcome of no need for manual feature specification. Instead you have to select and collect appropriate data.

System Requirements

Deep learning with Convolutional Neural Networks (CNN) has different requirements for the functionalities training of the network and applying the network for inference.

Training of a deep learning network requires NVIDIA GPUs, and uses the libraries cuDNN and cuBLAS. To speed up the training process, we recommend in HALCON to use a sufficiently fast hard drive. Thus, a solid-state drive (SSD) is preferable to conventional hard disk drives (HDD).

Inference with a deep learning network can be executed on x86 and x64 CPUs as well as on NVIDIA GPUs, wherefore the same requirements apply as for training.

For the specific requirements please refer to the HALCON “Installation Guide”.

General Workflow

As the DL methods mentioned above differ in what they do and how they need the data, you need to know which method is most appropriate for your specific task. Once this is clear, you need to collect a suitable amount of data, meaning images and the information needed by the method. After that, there is a common general workflow for all these DL methods:

Prepare the Network and the Data

The network needs to be prepared for your task and your data adapted to the specific network.

Get a network: Read in a pretrained network or create one.
The network needs to know which problem it shall solve, i.e., which classes are to be distinguished and what such samples look like. This is represented by your dataset, i.e., your images with the corresponding ground truth information.
The network will impose several requirements on the images (as e.g., the image dimension, gray value range, ... ). Therefore the images have to be preprocessed so that the network can process them.
We recommend to split the dataset into three distinct datasets which are used for training, validation, and testing.

Train the Network and Evaluate the Training Progress

Once your network is set up and your data prepared it is time to train the network for your specific task.

Set the hyperparameters appropriate to your task and system.
Optionally specify your data augmentation.
Start the training and evaluate your network during training.

Apply and Evaluate the Final Network

Your network is trained for your task and ready to be applied. But before deploying it in the real world you should evaluate how well the network performs on basis of your test dataset.

Inference Phase

When your network is trained and you are satisfied with its performance, you can use it for inference on new images. Thereby the images need to be preprocessed according to the requirements of the network (thus, in the same way as for training).

Data

The term 'data' is used in the context of deep learning as the images and the information, what is in them. This last information has to be provided in a way the network can understand. Not surprisingly, the different DL methods have their own requirements concerning what information has to be provided and how. Please see the corresponding chapters for the specific requirements.

The network further poses requirements on the images regarding the image dimensions, the gray value range, and the type. The specific values depend on the network itself and can be queried with get_dl_model_paramget_dl_model_paramGetDlModelParamGetDlModelParamGetDlModelParam and get_dl_classifier_paramget_dl_classifier_paramGetDlClassifierParamGetDlClassifierParamGetDlClassifierParam, respectively. Additionally, depending on the method there are also requirements regarding the information as e.g., the bounding boxes. To fulfill all these requirements, the data may have to be preprocessed, which can be done most conveniently with the corresponding procedures, e.g., preprocess_dl_samples and preprocess_dl_classifier_images, respectively.

When you train your network, the network gets adapted to its task. But at one point you will want to evaluate what the network learned and at an even later point you will want to test the network. Therefore the dataset will be split into three subsets which should be independent and identically distributed. In simple words, the subsets should not be connected to each other in any way and each set contains for every class the same distribution of images. This splitting is conveniently done by the procedures split_dl_dataset and split_dl_classifier_data_set, respectively. The clearly largest subset will be used for the retraining. We refer to this dataset as the training dataset. At a certain point the performance of the network is evaluated to check whether it is beneficial to continue the network optimization. For this validation the second set of data is used, the validation dataset. Even if the validation dataset is disjoint from the first one, it has an influence on the network optimization. Therefore to test the possible predictions when the model is deployed in the real world, the third dataset is used, the test dataset. For a representative network validation or evaluation, the validation and test dataset should have statistically relevant data, which gives a lower bound on the amount of data needed.

Note also, that for training the network, you best use representative images, i.e., images like the ones you want to process later and not only 'perfect' images, as otherwise the network may have difficulties with non-'perfect' images.

The Network and the Training Process

In the context of deep learning, the assignments are performed by sending the input image through a network. The output of the total network consists of a number of predictions. Such predictions are e.g., for a classification task the confidence for each class, expressing how likely the image shows an instance of this class.

Such a network consists of a certain number of layers or filters, which are arranged and connected in a specific way. In general, any layer is a building block performing specific tasks. It can be seen as a container, which receives input, transforms it according to a function, and returns the output to the next layer. Thereby different functions are possible for different types of layers. Several possible examples are given in the “Solution Guide on Classification”. The output of a layer is also called feature map. Many layers or filters have weights, parameters which are also called filter weights. These are the parameters modified during the training of a network.

Schema of an extract of a possible classification network. Below we show feature maps corresponding to the layers, zoomed to a uniform size.

To train a network for a specific task, a loss function is added. There are different loss functions depending on the task, but they all work according to the following principle. A loss function compares the prediction from the network with the given information, what it should find in the image (and, if applicable, also where), and penalizes deviations. Now the filter weights are updated in such a way that the loss function is minimized. Thus, training the network for the specific tasks, one strives to minimize the loss (an error function) of the network, in the hope of doing so will also improve the performance measure. In practice, this optimization is done by calculating the gradient and updating the parameters of the different layers (filter weights) accordingly. This is repeated by iterating multiple times over the training data.

There are additional parameters that influence the training, but which are not directly learned during the regular training. These parameters have values set before starting the training. We refer to this last type of parameters as hyperparameters in order to distinguish them from the network parameters that are optimized during training. See the section “Setting the Training Parameters: The Hyperparameters”.

To train all filter weights from scratch a lot of resources are needed. Therefore one can take advantage from the following observation. The first layers detect low level features like edges and curves. The feature map of the following layers are smaller, but they represent more complex features. For a large network, the low level features are general enough so the weights of the corresponding layers will not change much among different tasks. This leads to a technique called transfer learning: One takes an already trained network and retrains it for a specific task, benefiting from already quite suitable filter weights for the lower layers. As a result, considerably less resources are needed. While in general the network should be more reliable when trained on a larger dataset, the amount of data needed for retraining also depends on the complexity of the task. A basic schema for the workflow of transfer learning is shown with the aid of classification in the figure below.


(1)	(2)	(3)	(4)

Basic schema of transfer learning with the aid of classification. (1) A pretrained network is read. (2) Training phase, the network gets trained with the training data. (3) The trained model with new capabilities. (4) Inference phase, the trained network infers on new images.

Setting the Training Parameters: The Hyperparameters

The different DL methods are designed for different tasks and will vary in the way they are built up. But they all have in common that during the training of the network, one strives to minimize the corresponding loss function, see the section “The Network and the Training Process”. For doing so, there is a set of further parameters which is set before starting the training and not optimized during the training. We refer to these parameters as hyperparameters. For a DL model, you can set a change strategy, specifying when and how you want these hyperparameters changed during the training. In this section, we explain the idea of the different hyperparameters. Note, that certain methods have additional hyperparameters, you find more information their respective chapter.

As already mentioned, the loss compares the predictions from the network with the given information about the content of the image. The loss now penalizes deviations. Training the network means updating the filter weights in such a way, that the loss has to penalize less, thus the loss result is optimized. To do so, a certain amount of data is taken from the training dataset. For this subset the gradient of the loss is calculated and the network modified in updating its filter weights accordingly. Now this is repeated with the next subset of data till the whole training data is processed. These subsets of the training data are called batches and the size of these subsets, the 'batch_size'"batch_size""batch_size""batch_size""batch_size", determines the number of data taken into a batch and as a consequence processed together.

A full iteration over the entire training data is called epoch. It is beneficial to iterate several times over the training data. The number of iterations is defined by 'epochs'. Thus, 'epochs' determines how many times the algorithm loops over the training set.

As mentioned before, after every calculation of the loss gradient the filter weights are updated (which is done batch-wise and therefore via the stochastic gradient descent algorithm SGD). For this update, there are two important hyperparameters: The 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate" , determining the weight of the gradient on the updated loss function arguments (the filter weights), and the 'momentum'"momentum""momentum""momentum""momentum" within the interval , specifying the influence of previous updates. More information can be found in the documentation of train_dl_classifier_batchtrain_dl_classifier_batchTrainDlClassifierBatchTrainDlClassifierBatchTrainDlClassifierBatch and train_dl_model_batchtrain_dl_model_batchTrainDlModelBatchTrainDlModelBatchTrainDlModelBatch, respectively. In simple words, when we update the loss function arguments, we still remember the step we took for the last update. Now, we take a step in direction of the gradient with a length depending to the learning rate; additionally we repeat the step we did last times, but this time only times as long. A visualization is given in the figure below. A too large learning rate might result in divergence of the algorithm, a very small learning rate will take unnecessarily many steps. Therefore, it is customary to start with a larger learning rate and potentially reduce it during training. With a momentum , the momentum method has no influence, so only the gradient determines the update vector.

Sketch of the 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate" and the 'momentum'"momentum""momentum""momentum""momentum" during an actualization step. The gradient step: the learning rate times the gradient g (g - dashed lines). The momentum step: the momentum times the previous update vector v (v - dotted lines). Together, they form the actual step: the update vector v (v - solid lines).

To prevent the neural networks from overfitting (see the part “Risk of Underfitting and Overfitting” below), regularization can be used. With this technique an extra term is added to the loss function. One possible type of regularization is weight decay, for details see the documentation of train_dl_model_batchtrain_dl_model_batchTrainDlModelBatchTrainDlModelBatchTrainDlModelBatch and train_dl_classifier_batchtrain_dl_classifier_batchTrainDlClassifierBatchTrainDlClassifierBatchTrainDlClassifierBatch, respectively. It works by penalizing large weights, i.e., pushing the weights towards zero. Simply put, this regularization favors simpler models that are less likely to fit to noise in the training data and generalize better. It can be set by the hyperparameter 'weight_prior'"weight_prior""weight_prior""weight_prior""weight_prior". Choosing its value is a trade-off between the model's ability to generalize, overfitting, and underfitting. If 'weight_prior'"weight_prior""weight_prior""weight_prior""weight_prior" is too small the model might overfit, if it is too large, the model might loose its ability to fit the data well because all weights are effectively zero.

With the training data and all the hyperparameters, there are many different aspects that can have an influence on the outcome of such complex algorithms. To improve the performance of a network, generally the addition of training data also helps. Please note, whether to gather more data is a good solution always depends also on how easily one can do so. Usually, a small additional fraction will not noticeably change the total performance.

Supervising the training

The different DL methods have different results. Accordingly they also use different measures to determine 'how well' a network performs. But when training a network, there are common behaviors and pitfalls, which are described here.

Validation During Training

When it comes to the validation of the network performance, it is important to note that this is not a pure optimization problem (see the parts “The Network and the Training Process” and “Setting the Training Parameters” above).

In order to observe the training progress, it is usually helpful to visualize a validation measure, e.g., for the training of a classification network, the error over the samples of a batch. As the samples differ, the difficulty of the assignment task may differ. Thus it may be that the network performs better or worse for the samples of a given batch than for the samples of another batch. So it is normal that the validation measure is not changing smoothly over the iterations. But in total it should improve. Adjusting the hyperparameters 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate" and 'momentum'"momentum""momentum""momentum""momentum" can help to improve the validation measure again. The following figures show possible scenarios.


(1)	(2)

Sketch of an validation measure during training, here using the error from classification as example. (1) General tendencies for possible outcomes with different 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate" values, dark blue: good learning rate, red: very high learning rate, light blue: high learning rate, orange: low learning rate. (2) Ideal case with a learning rate policy to reduce the 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate" value after a given number of iterations. In orange: training error, dark blue: validation error. The arrow marks the iteration, at which the learning rate is decreased.

Risk of Underfitting and Overfitting

Underfitting occurs if the model is not able to capture the complexity of the task. It is directly reflected in the validation measure on the training set which stays high.

Overfitting happens when the network starts to'memorize' training data instead of learning how to generalize. This is shown by a validation measure on the training set which stays good or even improves while the validation measure on the validation set decreases. In such a case, regularization may help. See the explanations of the hyperparameter 'weight_prior'"weight_prior""weight_prior""weight_prior""weight_prior" in the section “Setting the Training Parameters: The Hyperparameters”. Note that a similar phenomenon occurs when the model capacity is too high with respect to the data.

Sketch of a possible overfitting scenario, visible on the generalization gap (indicated with the arrow). The error from classification serves as an example for a validation measure.

Confusion Matrix

A network infers for an instance a top prediction, the class for which the network deduces the highest affinity. When we know its ground truth class, we can compare the two class affiliations: the predicted one and the correct one. Thereby, the instance differs between the different types of methods, while e.g., in classification the instances are images, in semantic segmentation the instances are single pixels.

When more than two classes are distinguished, one can also reduce the comparison into binary problems. This means, for a given class you just compare if it is the same class (positive) or any other class (negative). For such binary classification problems the comparison is reduced to the following four possible entities (whereof not all are applicable for every method):

True positives (TP: predicted positive, labeled positive),
true negatives (TN: predicted negative, labeled negative),
false positives (FP: predicted positive, labeled negative),
false negatives (FN: predicted negative, labeled positive).

A confusion matrix is a table with such comparisons. This table makes it easy to see how well the network performs for each class. For every class it lists how many instances have been predicted into which class. E.g., for a classifier distinguishing the three classes 'apple', 'peach', and 'orange', the confusion matrix shows how many images with ground truth class affiliation 'apple' have been classified as 'apple' and how many have been classified as 'peach' or 'orange'. Of course, this is listed for the other classes as well. This example is shown in the figure below. In HALCON, we represent for each class the instances with this ground truth label in a column and the instances predicted to belong to this class in a row.


(1)	(2)

An example for a confusion matrices from classification. We see that 68 images of an 'apple' have been classified as such (TP), 60 images showing not an 'apple' have been correctly classified as a 'peach' (30) or 'pear' (30) (TN), 0 images show a 'peach' or a 'pear' but have been classified as an 'apple' (FP) and 24 images of an 'apple' have wrongly been classified as 'peach' (21) or 'pear' (3) (FN). (1) A confusion matrix for all three distinguished classes. It appears as if the network 'confuses' apples and peaches more than all other combinations. (2) The confusion matrix of the binary problem to better visualize the 'apple' class.

Glossary

In the following, we describe the most important terms used in the context of deep learning:

annotation

An annotation is the ground truth information, what a given instance in the data represents, in a way recognizable for the network. This is e.g., the bounding box and the corresponding label for an instance in object detection.

backbone

A backbone is a part of a pretrained classification network. Its task is to generate various feature maps, for what reason the classifying layer has been removed.

batch size - hyperparameter 'batch_size'"batch_size""batch_size""batch_size""batch_size"

The dataset is divided into smaller subsets of data, which are called batches. The batch size determines the number of images taken into a batch and thus processed simultaneously.

bounding box

Bounding boxes are axis-parallel rectangular boxes used to define a part within an image and to specify the localization of an object within an image.

class agnostic

Class agnostic means without the knowledge of the different classes.

In HALCON, we use it for reduction of overlapping predicted bounding boxes. This means, for a class agnostic bounding box suppression the suppression of overlapping instances is done ignoring the knowledge of classes, thus strongly overlapping instances get suppressed independently of their class.

change strategy

A change strategy denotes the strategy, when and how hyperparameters are changed during the training of a DL model.

class

Classes are discrete categories (e.g., 'apple', 'peach', 'pear') that the network distinguishes. In HALCON, the class of an instance is given by its appropriate annotation.

classifier

In the context of deep learning we refer to the term classifier as follows. The classifier takes an image as input and returns the inferred confidence values, expressing how likely the image belongs to every distinguished class. E.g., the three classes 'apple', 'peach', and 'pear' are distinguished. Now we give an image of an apple to the classifier. As a result, the confidences 'apple': 0.92, 'peach': 0.07, and 'pear': 0.01 could be returned.

COCO

COCO is an abbreviation for "common objects in context", a large-scale object detection, segmentation, and captioning dataset. There is a common file format for each of the different annotation types.

confidence

Confidence is a number expressing the affinity of an instance to a class. In HALCON the confidence is the probability, given in the range of [0,1]. Alternative name: score

confusion matrix

A confusion matrix is a table which compares the classes predicted by the network (top-1) with the ground truth class affiliations. It is often used to visualize the performance of the network on a validation or test set.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are neural networks used in deep learning, characterized by the presence of at least one convolutional layer in the network. They are particularly successful for image classification.

data

We use the term data in the context of deep learning for instances to be recognized (e.g., images) and their appropriate information concerning the predictable characteristics (e.g., the labels in case of classification).

data augmentation

Data augmentation is the generation of altered copies of samples within a dataset. This is done in order to augment the richness of the dataset, e.g., through flipping or rotating.

dataset: training, validation, and test set

With dataset we refer to the complete set of data used for a training. The dataset is split into three, if possible disjoint, subsets:

The training set contains the data on which the algorithm optimizes the network directly.
The validation set contains the data to evaluate the network performance during training.
The test set is used to test possible inferences (predictions), thus to test the performance on data without any influence on the network optimization.

deep learning

The term "deep learning" was originally used to describe the training of neural networks with multiple hidden layers. Today it is rather used as a generic term for several different concepts in machine learning. In HALCON, we use the term deep learning for methods using a neural network with multiple hidden layers.

epoch

In the context of deep learning, an epoch is a single training iteration over the entire training data, i.e., over all batches. Iterations over epochs should not be confused with the iterations over single batches (e.g., within an epoch).

errors

In the context of deep learning, we refer to error when the inferred class of an instance does not match the real class (e.g., the ground truth label in case of classification). Within HALCON, we use the term error in deep learning when we refer to the top-1 error.

feature map

A feature map is the output of a given layer

feature pyramid

A feature pyramid is simply a group of feature maps, whereby every feature map origins from another level, i.e., it is smaller than its preceding levels.

hyperparameter

Like every machine learning model, CNNs contain many formulas with many parameters. During training the model learns from the data in the sense of optimizing the parameters. However, such models can have other, additional parameters, which are not directly learned during the regular training. These parameters have values set before starting the training. We refer to this last type of parameters as hyperparameters in order to distinguish them from the network parameters that are optimized during training. Or from another point of view, hyperparameters are solver-specific parameters.

Prominent examples are the initial learning rate or the batch size.

inference phase

The inference phase is the stage when a trained network is applied to predict (infer) instances (which can be the total input image or just a part of it) and eventually their localization. Unlike during the training phase, the network is not changed anymore in the inference phase.

intersection over union

The intersection over union (IoU) is a measure to quantify the overlap of two areas. We can determine the parts common in both areas, the intersection, as well as the united areas, the union. The IoU is the ratio between the two areas intersection and union.

The application of this concept may differ between the methods.

label

Labels are arbitrary strings used to define the class of an image. In HALCON these labels are given by the image name (eventually followed by a combination of underscore and digits) or by the folder name, e.g., 'apple_01.png', 'pear.png', 'peach/01.png'.

layer and hidden layer

A layer is a building block in a neural network, thus performing specific tasks (e.g., convolution, pooling, etc., for further details we refer to the “Solution Guide on Classification”). It can be seen as a container, which receives weighted input, transforms it, and returns the output to the next layer. Input and output layers are connected to the dataset, i.e., the images or the labels, respectively. All layers in between are called hidden layers.

learning rate - hyperparameter 'learning_rate'"learning_rate""learning_rate""learning_rate""learning_rate"

The learning rate is the weighting, with which the gradient (see the entry for the stochastic gradient descent SGD) is considered when updating the arguments of the loss function. In simple words, when we want to optimize a function, the gradient tells us the direction in which we shall optimize and the learning rate determines how far along this direction we step.

Alternative names: , step size

level

The term level is used to denote within a feature pyramid network the whole group of layers, whose feature maps have the same width and height. Thereby the input image represents level 0.

loss

A loss function compares the prediction from the network with the given information, what it should find in the image (and, if applicable, also where), and penalizes deviations. This loss function is the function we optimize during the training process to adapt the network to a specific task.

Alternative names: objective function, cost function, utility function

momentum - hyperparameter 'momentum'"momentum""momentum""momentum""momentum"

The momentum is used for the optimization of the loss function arguments. When the loss function arguments are updated (after having calculated the gradient), a fraction of the previous update vector (of the past iteration step) is added. This has the effect of damping oscillations. We refer to the hyperparameter as momentum. When is set to , the momentum method has no influence. In simple words, when we update the loss function arguments, we still remember the step we did for the last update. Now we go a step in direction of the gradient with a length according to the learning rate and additionally we repeat the step we did last time, but this time only times as long.

non-maximum suppression

In object detection, non-maximum suppression is used to suppress overlapping predicted bounding boxes. When different instances overlap more than a given threshold value, only the one with the highest confidence value is kept while the other instances, not having the maximum confidence value, are suppressed.

overfitting

Overfitting happens when the network starts to 'memorize' training data instead of learning how to find general rules for the classification. This becomes visible when the model continues to minimize error on the training set but the error on the validation set increases. Since most neural networks have a huge amount of weights, these networks are particularly prone to overfitting.

regularization - hyperparameter 'weight_prior'"weight_prior""weight_prior""weight_prior""weight_prior"

Regularization is a technique to prevent neural networks from overfitting by adding an extra term to the loss function. It works by penalizing large weights, i.e., pushing the weights towards zero. Simply put, regularization favors simpler models that are less likely to fit to noise in the training data and generalize better. In HALCON, regularization is controlled via the parameter 'weight_prior'"weight_prior""weight_prior""weight_prior""weight_prior".

Alternative names: regularization parameter, weight decay parameter, (note that in HALCON we use for the learning rate and within formulas the symbol for the regularization parameter).

retraining

We define retraining as updating the weights of an already pretrained network, i.e., during retraining the network learns the specific task.

Alternative names: fine-tuning.

solver

The solver optimizes the network by updating the weights in a way to optimize (i.e., minimize) the loss.

stochastic gradient descent (SGD)

SGD is an iterative optimization algorithm for differentiable functions. In deep learning we use this algorithm to calculate the gradient to optimize (i.e., minimize) the loss function. A key feature of the SGD is to calculate the gradient only based on a single batch containing stochastically sampled data and not all data.

top-k error

The classifier infers for a given image class confidences of how likely the image belongs to every distinguished class. Thus, for an image we can sort the predicted classes according to the confidence value the classifier assigned. The top-k error tells the ratio of predictions where the ground truth class is not within the k predicted classes with highest probability. In the case of top-1 error, we check if the target label matches the prediction with the highest probability. In the case of top-3 error, we check if the target label matches one of the top 3 predictions (the 3 labels getting the highest probability for this image).

Alternative names: top-k score

transfer learning

Transfer learning refers to the technique where a network is built upon the knowledge of an already existing network. In concrete terms this means taking an already (pre)trained network with its weights and adapt the output layer to the respective application to get your network. In HALCON, we also see the following retraining step as a part of transfer learning.

underfitting

Underfitting occurs when the model over-generalizes. In other words it is not able to describe the complexity of the task. This is directly reflected in the error on the training set, which does not decrease significantly.

weights

In general weights are the free parameters of the network, which are altered during the training due to the optimization of the loss. A layer with weights multiplies or adds them with its input values. In contrast to hyperparameters, weights are optimized and thus changed during the training.

List of Sections

Operators