The term deep learning (DL) refers to a family of machine learning methods. In HALCON, the following methods are implemented:
The latter two are implemented within the general DL model, see the chapter Deep Learning / Model. All three deep learning methods listed above use a network to perform the assignment task. The network is trained by considering only the input and the output, which is also called end-to-end learning. Using the images and the information about what is visible in them, the training algorithm adjusts the network so that it distinguishes the different classes and, where applicable, finds the corresponding objects. The benefit for you is that no manual feature specification is needed. Instead, you have to select and collect appropriate data.
Deep learning with Convolutional Neural Networks (CNN) has different requirements for training a network and for applying it for inference.
Training a deep learning network requires an NVIDIA GPU and uses the libraries cuDNN and cuBLAS. To speed up the training process, we recommend using a sufficiently fast hard drive; a solid-state drive (SSD) is therefore preferable to a conventional hard disk drive (HDD).
Inference with a deep learning network can be executed on x86 and x64 CPUs as well as on NVIDIA GPUs. For GPUs, the same requirements apply as for training.
For the specific requirements please refer to the HALCON
As the DL methods mentioned above differ in what they do and how they need the data, you need to know which method is most appropriate for your specific task. Once this is clear, you need to collect a suitable amount of data, meaning images and the information needed by the method. After that, there is a common general workflow for all these DL methods:
The network needs to be prepared for your task and your data adapted to the specific network.
Get a network: Read in a pretrained network or create one.
The network needs to know which problem it shall solve, i.e., which classes are to be distinguished and what such samples look like. This is represented by your dataset, i.e., your images with the corresponding ground truth information.
The network imposes several requirements on the images (e.g., the image dimensions, the gray value range, ...). Therefore, the images have to be preprocessed so that the network can process them.
We recommend splitting the dataset into three distinct datasets, which are used for training, validation, and testing.
Once your network is set up and your data prepared it is time to train the network for your specific task.
Set the hyperparameters appropriate to your task and system.
Optionally specify your data augmentation.
Start the training and evaluate your network during training.
Your network is trained for your task and ready to be applied. But before deploying it in the real world, you should evaluate how well the network performs on the basis of your test dataset.
When your network is trained and you are satisfied with its performance, you can use it for inference on new images. For this, the images need to be preprocessed according to the requirements of the network (thus, in the same way as for training).
In the context of deep learning, the term 'data' refers to the images and the information about what is in them. This information has to be provided in a way the network can understand. Not surprisingly, the different DL methods have their own requirements concerning what information has to be provided and how. Please see the corresponding chapters for the specific requirements.
The network further poses requirements on the images regarding the image
dimensions, the gray value range, and the type.
The specific values depend on the network itself and can be queried with
Additionally, depending on the method, there are also requirements
regarding the information, e.g., the bounding boxes.
To fulfill all these requirements, the data may have to be
preprocessed, which can be done most conveniently with the corresponding
When you train your network, the network gets adapted to its task.
But at one point you will want to evaluate what the network learned and
at an even later point you will want to test the network.
Therefore the dataset will be split into three subsets
which should be independent and identically distributed.
In simple words, the subsets should not be connected to each
other in any way and each set contains for every class the same
distribution of images.
This splitting is conveniently done by the procedures
By far the largest subset is used for the retraining. We refer
to this dataset as the training dataset.
At a certain point the performance of the network is evaluated to check
whether it is beneficial to continue the network optimization. For this
validation the second set of data is used, the validation dataset.
Even though the validation dataset is disjoint from the training dataset,
it influences the network optimization. Therefore, to test the predictions
that can be expected when the model is deployed in the real world, a third
dataset is used, the test dataset.
For a representative network validation or evaluation, the validation and
test dataset should have statistically relevant data, which gives a lower
bound on the amount of data needed.
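The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HALCON procedure: the function name `split_dataset`, the 70/15/15 percentages, and the sample file names are hypothetical. The split is done per class so that each subset receives roughly the same class distribution, and the three subsets are disjoint.

```python
import random
from collections import defaultdict


def split_dataset(samples, train_percent=70, validation_percent=15, seed=42):
    """Split (image_name, label) pairs into three disjoint subsets.

    The split is done separately for every class, so each subset has
    roughly the same class distribution. Whatever is left after the
    training and validation shares goes to the test set.
    """
    by_class = defaultdict(list)
    for image_name, label in samples:
        by_class[label].append((image_name, label))

    rng = random.Random(seed)  # fixed seed for a reproducible split
    subsets = {"train": [], "validation": [], "test": []}
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        n_train = len(items) * train_percent // 100
        n_validation = len(items) * validation_percent // 100
        subsets["train"] += items[:n_train]
        subsets["validation"] += items[n_train:n_train + n_validation]
        subsets["test"] += items[n_train + n_validation:]
    return subsets


# Hypothetical dataset: 20 images for each of three classes.
samples = [(f"{label}_{i:02d}.png", label)
           for label in ("apple", "peach", "pear") for i in range(20)]
subsets = split_dataset(samples)
print({name: len(items) for name, items in subsets.items()})
# prints {'train': 42, 'validation': 9, 'test': 9}
```

With only three validation and three test images per class, this toy split would be too small for a statistically meaningful evaluation; it merely shows the mechanics.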
Note also that for training the network, it is best to use representative images, i.e., images like the ones you want to process later, and not only 'perfect' images. Otherwise, the network may have difficulties with non-'perfect' images.
In the context of deep learning, the assignments are performed by sending the input image through a network. The output of the whole network consists of a number of predictions. For a classification task, for example, these predictions are the confidences for each class, expressing how likely the image shows an instance of this class.
Such a network consists of a certain number of layers or filters,
which are arranged and connected in a specific way.
In general, any layer is a building block performing specific tasks.
It can be seen as a container, which receives input, transforms it
according to a function, and returns the output to the next layer.
Thereby different functions are possible for different types of layers.
Several possible examples are given in the
“Solution Guide on Classification”.
The output of a layer is also called feature map.
Many layers or filters have weights, parameters which are also
called filter weights. These are the parameters modified during the
training of a network.
To train a network for a specific task, a loss function is added. There are different loss functions depending on the task, but they all work according to the following principle. A loss function compares the prediction from the network with the given information about what it should find in the image (and, if applicable, also where), and penalizes deviations. The filter weights are then updated in such a way that the loss function is minimized. Thus, when training the network for the specific task, one strives to minimize the loss (an error function) of the network, in the hope that doing so will also improve the performance measure. In practice, this optimization is done by calculating the gradient and updating the parameters of the different layers (filter weights) accordingly. This is repeated by iterating multiple times over the training data.
There are additional parameters that influence the training, but which are not directly learned during the regular training. These parameters have values set before starting the training. We refer to this last type of parameters as hyperparameters in order to distinguish them from the network parameters that are optimized during training. See the section “Setting the Training Parameters: The Hyperparameters”.
Training all filter weights from scratch requires a lot of resources. Therefore, one can take advantage of the following observation. The first layers detect low level features like edges and curves. The feature maps of the following layers are smaller, but they represent more complex features. For a large network, the low level features are general enough that the weights of the corresponding layers will not change much among different tasks. This leads to a technique called transfer learning: One takes an already trained network and retrains it for a specific task, benefiting from already quite suitable filter weights for the lower layers. As a result, considerably fewer resources are needed. While in general the network should be more reliable when trained on a larger dataset, the amount of data needed for retraining also depends on the complexity of the task. A basic schema for the workflow of transfer learning is shown with the aid of classification in the figure below.
The different DL methods are designed for different tasks and vary in the way they are built up. But they all have in common that during the training of the network, one strives to minimize the corresponding loss function, see the section “The Network and the Training Process”. To do so, there is a set of further parameters which is set before starting the training and not optimized during the training. We refer to these parameters as hyperparameters. For a DL model, you can set a change strategy, specifying when and how you want these hyperparameters changed during the training. In this section, we explain the idea of the different hyperparameters. Note that certain methods have additional hyperparameters; you find more information in their respective chapters.
As already mentioned, the loss compares the predictions from the network
with the given information about the content of the image.
The loss now penalizes deviations.
Training the network means updating the filter weights in such a way that
the loss has to penalize less, i.e., the loss is minimized.
To do so, a certain amount of data is taken from the training dataset.
For this subset the gradient of the loss is calculated and the network is
modified by updating its filter weights accordingly.
Now this is repeated with the next subset of data till the whole
training data is processed.
These subsets of the training data are called batches, and the
size of these subsets, the
'batch_size', determines the number of
samples taken into a batch and consequently processed together.
A full iteration over the entire training data is called an epoch.
It is beneficial to iterate several times over the training data.
The number of such iterations is defined by 'epochs'. Thus,
'epochs' determines how many times the algorithm loops over
the training set.
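The relation between batches, 'batch_size', and 'epochs' can be illustrated with a small Python sketch. It is not HALCON code: `iterate_batches` is a hypothetical helper, the integers stand in for training images, and it assumes that a last, smaller batch is simply processed as it is (an actual implementation may handle the remainder differently).

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


training_samples = list(range(10))  # stand-ins for ten training images
batch_size = 4
epochs = 3

iterations = 0
for epoch in range(epochs):
    for batch in iterate_batches(training_samples, batch_size):
        # Here the gradient of the loss would be calculated for `batch`
        # and the filter weights would be updated accordingly.
        iterations += 1

# Ceiling division: number of batches needed to cover the training data once.
batches_per_epoch = -(-len(training_samples) // batch_size)
print(batches_per_epoch, iterations)  # prints 3 9
```

Ten samples with a batch size of 4 give three batches per epoch, so three epochs amount to nine weight updates.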
Two further hyperparameters influence each update: the 'learning_rate', determining the weight of the gradient on the updated loss function arguments (the filter weights), and the 'momentum', specifying the influence of previous updates. More information can be found in the documentation of the respective operators. In simple words, when we update the loss function arguments, we still remember the step we took for the last update. Now we take a step in the direction of the gradient with a length depending on the learning rate; additionally, we repeat the step we took last time, but this time only 'momentum' times as long. A visualization is given in the figure below. A learning rate that is too large might cause the algorithm to diverge, while a very small learning rate takes unnecessarily many steps. Therefore, it is customary to start with a larger learning rate and potentially reduce it during training. With a momentum of 0, the momentum method has no influence, so only the gradient determines the update vector.
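The update described above can be sketched as follows. The sketch assumes the classical momentum formulation (new update = momentum times the previous update, minus learning rate times the gradient); the exact formula used by HALCON is given in the operator documentation and may differ in detail. The function names and the toy loss (w - 3)^2 are hypothetical. Since the loss is minimized, the step is taken against the gradient.

```python
def sgd_momentum_step(weight, gradient, previous_update, learning_rate, momentum):
    """One update: repeat a fraction of the last step, then follow the gradient."""
    update = momentum * previous_update - learning_rate * gradient
    return weight + update, update


def minimize(learning_rate, momentum, steps=200, start=0.0):
    """Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    weight, update = start, 0.0
    for _ in range(steps):
        gradient = 2.0 * (weight - 3.0)
        weight, update = sgd_momentum_step(
            weight, gradient, update, learning_rate, momentum)
    return weight


with_momentum = minimize(learning_rate=0.1, momentum=0.9)
without_momentum = minimize(learning_rate=0.1, momentum=0.0)
diverged = minimize(learning_rate=1.1, momentum=0.0)
```

In this toy setting, both runs with a learning rate of 0.1 end close to the minimum at 3, whereas the run with a learning rate of 1.1 moves further away with every step, which illustrates the divergence mentioned above.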
To prevent the neural networks from overfitting (see the part “Risk of
Underfitting and Overfitting” below), regularization can be used.
With this technique an extra term is added to the loss function.
One possible type of regularization is weight decay, for details
see the documentation of
It works by penalizing large weights, i.e., pushing the weights towards
zero. Simply put, this regularization favors simpler models that are less
likely to fit to noise in the training data and generalize better.
It can be set by the hyperparameter 'weight_prior'.
Choosing its value is a trade-off between the model's ability to
generalize, overfitting, and underfitting.
If it is too small, the model might overfit; if it is too large, the model
might lose its ability to fit the data well because all weights are
effectively zero.
With the training data and all the hyperparameters, many different aspects can influence the outcome of such complex algorithms. Adding training data generally also helps to improve the performance of a network. Whether gathering more data is a good solution always depends on how easily this can be done. Usually, a small additional fraction will not noticeably change the overall performance.
The different DL methods have different results. Accordingly they also use different measures to determine 'how well' a network performs. But when training a network, there are common behaviors and pitfalls, which are described here.
When it comes to the validation of the network performance, it is important to note that this is not a pure optimization problem (see the parts “The Network and the Training Process” and “Setting the Training Parameters” above).
In order to observe the training progress, it is usually helpful
to visualize a validation measure, e.g., for the training of a
classification network, the error over the samples of a batch.
As the samples differ, the difficulty of the assignment task may differ.
Thus it may be that the network performs better or worse for the samples
of a given batch than for the samples of another batch.
So it is normal that the validation measure is not changing smoothly
over the iterations. But in total it should improve.
Adjusting the hyperparameters 'learning_rate' and
'momentum' can help to improve the validation measure
again. The following figures show possible scenarios.
Underfitting occurs if the model is not able to capture the complexity of the task. It is directly reflected in the validation measure on the training set which stays high.
Overfitting happens when the network starts to 'memorize' training
data instead of learning how to generalize. This is shown by a
validation measure on the training set which stays good or even improves
while the validation measure on the validation set deteriorates.
In such a case, regularization may help. See the explanation of the
hyperparameter 'weight_prior' in the section
“Setting the Training Parameters: The Hyperparameters”.
Note that a similar phenomenon occurs when the model capacity is too
high with respect to the data.
For an instance, a network infers a top prediction, the class for which the network deduces the highest affinity. When we know the ground truth class of the instance, we can compare the two class affiliations: the predicted one and the correct one. What counts as an instance differs between the types of methods: e.g., in classification the instances are images, while in semantic segmentation the instances are single pixels.
When more than two classes are distinguished, one can also reduce the comparison into binary problems. This means, for a given class you just compare if it is the same class (positive) or any other class (negative). For such binary classification problems the comparison is reduced to the following four possible entities (whereof not all are applicable for every method):
True positives (TP: predicted positive, labeled positive),
true negatives (TN: predicted negative, labeled negative),
false positives (FP: predicted positive, labeled negative),
false negatives (FN: predicted negative, labeled positive).
A confusion matrix is a table with such comparisons. This table makes it easy to see how well the network performs for each class. For every class it lists how many instances have been predicted into which class. E.g., for a classifier distinguishing the three classes 'apple', 'peach', and 'orange', the confusion matrix shows how many images with ground truth class affiliation 'apple' have been classified as 'apple' and how many have been classified as 'peach' or 'orange'. Of course, this is listed for the other classes as well. This example is shown in the figure below. In HALCON, we represent for each class the instances with this ground truth label in a column and the instances predicted to belong to this class in a row.
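As an illustration of the layout described above (ground truth in columns, predictions in rows), the following Python sketch builds a confusion matrix for the three fruit classes and derives the four binary counts for one class. The function names and the eight example labels are made up for this sketch and are not part of HALCON.

```python
def confusion_matrix(ground_truth, predicted, classes):
    """Return matrix[row][column] with row = predicted class, column = ground truth."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(ground_truth, predicted):
        matrix[index[prediction]][index[truth]] += 1
    return matrix


def binary_counts(matrix, class_index):
    """Reduce the matrix to TP, FP, FN, TN for one class versus all others."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[class_index][class_index]
    fp = sum(matrix[class_index]) - tp                  # row: predicted as this class
    fn = sum(row[class_index] for row in matrix) - tp   # column: labeled as this class
    tn = total - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


classes = ["apple", "peach", "orange"]
ground_truth = ["apple", "apple", "apple", "peach", "peach",
                "orange", "orange", "orange"]
predicted = ["apple", "apple", "peach", "peach", "orange",
             "orange", "orange", "apple"]

matrix = confusion_matrix(ground_truth, predicted, classes)
apple_counts = binary_counts(matrix, classes.index("apple"))
print(matrix)        # prints [[2, 0, 1], [1, 1, 0], [0, 1, 2]]
print(apple_counts)  # prints {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```

The first column shows that of the three images labeled 'apple', two were predicted as 'apple' and one as 'peach'.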
In the following, we describe the most important terms used in the context of deep learning:
An annotation is the ground truth information about what a given instance in the data represents, given in a way recognizable for the network. This is e.g., the bounding box and the corresponding label for an instance in object detection.
A backbone is a part of a pretrained classification network. Its task is to generate various feature maps, which is why the classifying layer has been removed.
The dataset is divided into smaller subsets of data, which are called batches. The batch size determines the number of images taken into a batch and thus processed simultaneously.
Bounding boxes are axis-parallel rectangular boxes used to define a part within an image and to specify the localization of an object within an image.
Class agnostic means without the knowledge of the different classes.
In HALCON, we use it for the reduction of overlapping predicted bounding boxes. In a class agnostic bounding box suppression, overlapping instances are suppressed without regard to their classes, thus strongly overlapping instances get suppressed independently of their class.
A change strategy denotes the strategy, when and how hyperparameters are changed during the training of a DL model.
Classes are discrete categories (e.g., 'apple', 'peach', 'pear') that the network distinguishes. In HALCON, the class of an instance is given by its appropriate annotation.
In the context of deep learning we refer to the term classifier as follows. The classifier takes an image as input and returns the inferred confidence values, expressing how likely the image belongs to every distinguished class. E.g., the three classes 'apple', 'peach', and 'pear' are distinguished. Now we give an image of an apple to the classifier. As a result, the confidences 'apple': 0.92, 'peach': 0.07, and 'pear': 0.01 could be returned.
COCO is an abbreviation for "common objects in context", a large-scale object detection, segmentation, and captioning dataset. There is a common file format for each of the different annotation types.
Confidence is a number expressing the affinity of an instance to a class. In HALCON the confidence is the probability, given in the range of [0,1]. Alternative name: score
A confusion matrix is a table which compares the classes predicted by the network (top-1) with the ground truth class affiliations. It is often used to visualize the performance of the network on a validation or test set.
Convolutional Neural Networks are neural networks used in deep learning, characterized by the presence of at least one convolutional layer in the network. They are particularly successful for image classification.
We use the term data in the context of deep learning for instances to be recognized (e.g., images) and their appropriate information concerning the predictable characteristics (e.g., the labels in case of classification).
Data augmentation is the generation of altered copies of samples within a dataset. This is done in order to augment the richness of the dataset, e.g., through flipping or rotating.
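A minimal sketch of the two alterations mentioned in this entry, flipping and rotating, is given below. Images are represented as nested Python lists of gray values purely for illustration; the function names are hypothetical and this is not how augmentation is configured in HALCON.

```python
def flip_horizontal(image):
    """Mirror every image row from left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise; width and height swap."""
    return [list(row) for row in zip(*image[::-1])]


image = [[1, 2, 3],
         [4, 5, 6]]

# The original sample plus two altered copies.
augmented = [image, flip_horizontal(image), rotate_90_clockwise(image)]
print(augmented[1])  # prints [[3, 2, 1], [6, 5, 4]]
print(augmented[2])  # prints [[4, 1], [5, 2], [6, 3]]
```

If the annotation contains positional information such as bounding boxes, it has to be transformed in the same way as the image.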
With dataset we refer to the complete set of data used for a training. The dataset is split into three, if possible disjoint, subsets:
The training set contains the data on which the algorithm optimizes the network directly.
The validation set contains the data to evaluate the network performance during training.
The test set is used to test possible inferences (predictions), thus to test the performance on data without any influence on the network optimization.
The term "deep learning" was originally used to describe the training of neural networks with multiple hidden layers. Today it is rather used as a generic term for several different concepts in machine learning. In HALCON, we use the term deep learning for methods using a neural network with multiple hidden layers.
In the context of deep learning, an epoch is a single training iteration over the entire training data, i.e., over all batches. Iterations over epochs should not be confused with the iterations over single batches (e.g., within an epoch).
In the context of deep learning, we refer to error when the inferred class of an instance does not match the real class (e.g., the ground truth label in case of classification). Within HALCON, we use the term error in deep learning when we refer to the top-1 error.
A feature map is the output of a given layer.
A feature pyramid is simply a group of feature maps, whereby every feature map originates from a different level, i.e., it is smaller than the feature maps of the preceding levels.
Like every machine learning model, CNNs contain many formulas with many parameters. During training the model learns from the data in the sense of optimizing the parameters. However, such models can have other, additional parameters, which are not directly learned during the regular training. These parameters have values set before starting the training. We refer to this last type of parameters as hyperparameters in order to distinguish them from the network parameters that are optimized during training. Or from another point of view, hyperparameters are solver-specific parameters.
Prominent examples are the initial learning rate or the batch size.
The inference phase is the stage when a trained network is applied to predict (infer) instances (which can be the whole input image or just a part of it) and, where applicable, their localization. Unlike during the training phase, the network is not changed anymore in the inference phase.
The intersection over union (IoU) is a measure to quantify the overlap of two areas. We can determine the part common to both areas, the intersection, as well as the united area, the union. The IoU is the ratio of the area of the intersection to the area of the union.
The application of this concept may differ between the methods.
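For two axis-parallel bounding boxes, the IoU can be computed as in the following sketch. The coordinate order (row1, col1, row2, col2) and the function name are assumptions made for this example.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-parallel boxes given as (row1, col1, row2, col2)."""
    overlap_height = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    overlap_width = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    # A negative extent means the boxes do not overlap in that direction.
    intersection = max(0.0, overlap_height) * max(0.0, overlap_width)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(intersection_over_union((0, 0, 2, 2), (0, 0, 2, 2)))  # prints 1.0
print(intersection_over_union((0, 0, 2, 2), (5, 5, 7, 7)))  # prints 0.0
partial = intersection_over_union((0, 0, 2, 2), (1, 1, 3, 3))  # 1 / 7
```

Identical boxes give an IoU of 1, disjoint boxes give 0, and the two partially overlapping boxes share an area of 1 out of a union of 7.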
Labels are arbitrary strings used to define the class of an image. In HALCON these labels are given by the image name (optionally followed by a combination of underscore and digits) or by the folder name, e.g., 'apple_01.png', 'pear.png', 'peach/01.png'.
A layer is a building block in a neural network, thus performing
specific tasks (e.g., convolution, pooling, etc., for further details we
refer to the
“Solution Guide on Classification”).
It can be seen as a container, which receives weighted input,
transforms it, and returns the output to the next layer.
Input and output layers are connected to the dataset, i.e., the images
or the labels, respectively.
All layers in between are called hidden layers.
The learning rate is the weighting, with which the gradient (see the entry for the stochastic gradient descent SGD) is considered when updating the arguments of the loss function. In simple words, when we want to optimize a function, the gradient tells us the direction in which we shall optimize and the learning rate determines how far along this direction we step.
Alternative name: step size
The term level is used to denote within a feature pyramid network the whole group of layers, whose feature maps have the same width and height. Thereby the input image represents level 0.
A loss function compares the prediction from the network with the given information about what it should find in the image (and, if applicable, also where), and penalizes deviations. This loss function is the function we optimize during the training process to adapt the network to a specific task.
Alternative names: objective function, cost function, utility function
The momentum is used for the optimization of the loss function arguments. When the loss function arguments are updated (after having calculated the gradient), a fraction of the previous update vector (of the past iteration step) is added. This has the effect of damping oscillations. We refer to the hyperparameter determining this fraction as momentum. When it is set to 0, the momentum method has no influence. In simple words, when we update the loss function arguments, we still remember the step we took for the last update. Now we take a step in the direction of the gradient with a length according to the learning rate, and additionally we repeat the step we took last time, but this time only momentum times as long.
In object detection, non-maximum suppression is used to suppress overlapping predicted bounding boxes. When different instances overlap more than a given threshold value, only the one with the highest confidence value is kept while the other instances, not having the maximum confidence value, are suppressed.
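The suppression described in this entry can be sketched as a greedy procedure: go through the predictions in order of decreasing confidence and keep one only if it does not overlap too strongly with a prediction that was already kept. This is a common formulation and assumed here; it is not necessarily the exact procedure used in HALCON. The dictionary keys, the function names, and the `class_agnostic` switch are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-parallel boxes given as (row1, col1, row2, col2)."""
    height = min(a[2], b[2]) - max(a[0], b[0])
    width = min(a[3], b[3]) - max(a[1], b[1])
    intersection = max(0.0, height) * max(0.0, width)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold, class_agnostic=False):
    """Keep the most confident of strongly overlapping predictions."""
    kept = []
    for candidate in sorted(detections, key=lambda d: d["confidence"], reverse=True):
        suppressed = any(
            (class_agnostic or candidate["class"] == other["class"])
            and iou(candidate["box"], other["box"]) > iou_threshold
            for other in kept
        )
        if not suppressed:
            kept.append(candidate)
    return kept


detections = [
    {"box": (0, 0, 10, 10), "confidence": 0.9, "class": "apple"},
    {"box": (1, 1, 11, 11), "confidence": 0.8, "class": "apple"},
    {"box": (0, 0, 10, 10), "confidence": 0.7, "class": "pear"},
    {"box": (20, 20, 30, 30), "confidence": 0.6, "class": "apple"},
]
per_class = non_maximum_suppression(detections, iou_threshold=0.5)
agnostic = non_maximum_suppression(detections, iou_threshold=0.5, class_agnostic=True)
print([d["confidence"] for d in per_class])  # prints [0.9, 0.7, 0.6]
print([d["confidence"] for d in agnostic])   # prints [0.9, 0.6]
```

In the per-class variant the 'pear' prediction survives although it coincides with the best 'apple' box; in the class agnostic variant it is suppressed as well.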
Overfitting happens when the network starts to 'memorize' training data instead of learning how to find general rules for the classification. This becomes visible when the model continues to minimize the error on the training set but the error on the validation set increases. Since most neural networks have a huge number of weights, these networks are particularly prone to overfitting.
Regularization is a technique to prevent neural networks from
overfitting by adding an extra term to the loss function.
It works by penalizing large weights, i.e., pushing the weights towards
zero. Simply put, regularization favors simpler models that are less
likely to fit to noise in the training data and generalize better.
In HALCON, regularization is controlled via the parameter
Alternative names: regularization parameter, weight decay parameter, (note that in HALCON we use for the learning rate and within formulas the symbol for the regularization parameter).
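The extra term described in this entry can be sketched as follows. The sketch assumes a common form of weight decay, in which half the regularization parameter times the sum of the squared weights is added to the loss; the exact term used in HALCON is given in the operator documentation. The function names and the parameter name `weight_decay` are hypothetical.

```python
def regularized_loss(data_loss, weights, weight_decay):
    """Add a penalty that grows with the squared size of the weights."""
    penalty = 0.5 * weight_decay * sum(w * w for w in weights)
    return data_loss + penalty


def regularized_gradient(gradients, weights, weight_decay):
    """The penalty adds weight_decay * w to each gradient component,
    so every update also pushes the weight towards zero."""
    return [g + weight_decay * w for g, w in zip(gradients, weights)]


weights = [3.0, -4.0]
plain = regularized_loss(1.0, weights, weight_decay=0.0)      # no penalty
penalized = regularized_loss(1.0, weights, weight_decay=0.1)  # 1.0 + 0.05 * 25
pull = regularized_gradient([0.0, 0.0], weights, weight_decay=0.1)
```

Even where the gradient of the unregularized loss is zero, the regularized gradient is proportional to the weights, so a gradient descent step shrinks each weight towards zero.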
We define retraining as updating the weights of an already pretrained network, i.e., during retraining the network learns the specific task.
Alternative names: fine-tuning.
The solver optimizes the network by updating the weights in a way to optimize (i.e., minimize) the loss.
SGD is an iterative optimization algorithm for differentiable functions. In deep learning we use this algorithm to calculate the gradient to optimize (i.e., minimize) the loss function. A key feature of the SGD is to calculate the gradient only based on a single batch containing stochastically sampled data and not all data.
For a given image, the classifier infers class confidences expressing how likely the image belongs to each distinguished class. Thus, for an image we can sort the predicted classes according to the confidence value the classifier assigned. The top-k error gives the ratio of predictions where the ground truth class is not among the k predicted classes with the highest probability. In the case of top-1 error, we check if the target label matches the prediction with the highest probability. In the case of top-3 error, we check if the target label matches one of the top 3 predictions (the 3 labels getting the highest probability for this image).
Alternative names: top-k score
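The top-k error can be computed as in the following sketch. The function name, the dictionary representation of the confidences, and the four example predictions are made up for this illustration.

```python
def top_k_error(confidences, ground_truth, k):
    """Ratio of images whose ground truth class is not among the k
    classes with the highest confidence."""
    misses = 0
    for scores, truth in zip(confidences, ground_truth):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth not in top_k:
            misses += 1
    return misses / len(ground_truth)


confidences = [
    {"apple": 0.92, "peach": 0.07, "pear": 0.01},
    {"apple": 0.30, "peach": 0.25, "pear": 0.45},
    {"apple": 0.50, "peach": 0.40, "pear": 0.10},
    {"apple": 0.10, "peach": 0.20, "pear": 0.70},
]
ground_truth = ["apple", "peach", "peach", "pear"]

print(top_k_error(confidences, ground_truth, k=1))  # prints 0.5
print(top_k_error(confidences, ground_truth, k=2))  # prints 0.25
print(top_k_error(confidences, ground_truth, k=3))  # prints 0.0
```

The second and third image are wrong at top-1; at top-2 only the second image remains a miss, because 'peach' is its least likely class.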
Transfer learning refers to the technique where a network is built upon the knowledge of an already existing network. In concrete terms this means taking an already (pre)trained network with its weights and adapting the output layer to the respective application to get your network. In HALCON, we also see the following retraining step as a part of transfer learning.
Underfitting occurs when the model over-generalizes. In other words it is not able to describe the complexity of the task. This is directly reflected in the error on the training set, which does not decrease significantly.
In general, weights are the free parameters of the network, which are altered during the training due to the optimization of the loss. A layer with weights multiplies its input values by them or adds them to its input values. In contrast to hyperparameters, weights are optimized and thus changed during the training.