The term deep learning (DL) refers to a family of machine learning methods. In HALCON, the following methods are implemented:
All of the deep learning methods listed above use a network for the assignment task. In HALCON they are implemented within the general DL model, see Deep Learning / Model. The model is trained by considering only the input and the output, which is also called end-to-end learning. Basically, given images and the information about what is visible in them, the training algorithm adjusts the model so that it learns to distinguish the different classes and, where applicable, to find the corresponding objects. For you, this has the advantage that no manual feature specification is needed. Instead, you have to select and collect appropriate data.
For deep learning, additional prerequisites apply.
Please see the requirements listed in the HALCON
paragraph “Requirements for Deep Learning and Deep-Learning-Based Methods”.
To speed up the training process, we recommend using a sufficiently fast hard drive. A solid-state drive (SSD) is therefore preferable to a conventional hard disk drive (HDD).
As the DL methods mentioned above differ in what they do and how they need the data, you need to know which method is most appropriate for your specific task. Once this is clear, you need to collect a suitable amount of data, meaning images and the information needed by the method. After that, there is a common general workflow for all these DL methods:
The network needs to be prepared for your task and your data adapted to the specific network.
Get a network: Read in a pretrained network or create one.
The network needs to know which problem it shall solve, i.e., which classes are to be distinguished and what such samples look like. This is represented by your dataset, i.e., your images with the corresponding ground truth information.
The network imposes several requirements on the images (e.g., the image dimensions, the gray value range, ...). Therefore, the images have to be preprocessed so that the network can process them.
We recommend splitting the dataset into three distinct datasets, which are used for training, validation, and testing.
Once your network is set up and your data is prepared, it is time to train the network for your specific task.
Set the hyperparameters appropriate to your task and system.
Optionally specify your data augmentation.
Start the training and evaluate your network.
Your network is trained for your task and ready to be applied. But before deploying it in the real world, you should evaluate how well the network performs on the basis of your test dataset.
When your network is trained and you are satisfied with its performance, you can use it for inference on new images. For this, the images need to be preprocessed according to the requirements of the network (thus, in the same way as for training).
In the context of deep learning, the term 'data' refers to the images and the information about what is in them. This information has to be provided in a way the network can understand. Not surprisingly, the different DL methods have their own requirements concerning what information has to be provided and how. Please see the corresponding chapters for the specific requirements.
The network further poses requirements on the images regarding the image
dimensions, the gray value range, and the type.
The specific values depend on the network itself and can be queried with
Additionally, depending on the method, there are also requirements
regarding the information, e.g., the bounding boxes.
To fulfill all these requirements, the data may have to be
preprocessed, which can be done most conveniently with the corresponding
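As a rough illustration of what such preprocessing involves, the following Python sketch resizes an image to the required dimensions and maps its gray values linearly to the required range. It is not HALCON code: the function names `resize_nearest` and `rescale_gray_values` as well as the required size and range are assumptions made for this example, and the real values have to be queried from the network.

```python
def rescale_gray_values(pixels, src_min, src_max, dst_min, dst_max):
    """Linearly map gray values from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must not be empty")
    scale = (dst_max - dst_min) / (src_max - src_min)
    return [[dst_min + (p - src_min) * scale for p in row] for row in pixels]


def resize_nearest(pixels, new_width, new_height):
    """Resize a 2D image (list of rows) with nearest-neighbor sampling."""
    old_height, old_width = len(pixels), len(pixels[0])
    return [
        [pixels[r * old_height // new_height][c * old_width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


# Hypothetical requirements; the real values depend on the network.
required_width, required_height = 2, 2
required_min, required_max = -1.0, 1.0

image = [[0, 64, 128, 255],
         [0, 64, 128, 255],
         [255, 128, 64, 0],
         [255, 128, 64, 0]]

resized = resize_nearest(image, required_width, required_height)
preprocessed = rescale_gray_values(resized, 0, 255, required_min, required_max)
```

The same two steps would have to be applied with identical parameters to the training images and, later, to the images used for inference.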
When you train your network, the network gets adapted to its task.
But at some point you will want to evaluate what the network has learned, and
at an even later point you will want to test the network.
Therefore the dataset will be split into three subsets
which should be independent and identically distributed.
In simple terms, the subsets should not be connected to each
other in any way, and each subset should contain the same
distribution of images for every class.
This splitting is conveniently done by the procedure
By far the largest subset is used for the retraining. We refer
to this dataset as the training dataset.
At a certain point the performance of the network is evaluated to check
whether it is beneficial to continue the network optimization. For this
validation the second set of data is used, the validation dataset.
Even if the validation dataset is disjoint from the first one, it has an
influence on the network optimization. Therefore, to test the predictions
that can be expected when the model is deployed in the real world, a third
dataset is used, the test dataset.
For a representative network validation or evaluation, the validation and
test dataset should have statistically relevant data, which gives a lower
bound on the amount of data needed.
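The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HALCON procedure: the function `split_dataset`, the percentages of 70/15/15, and the sample names are hypothetical. The samples are shuffled and divided class by class, so that the three subsets are disjoint and each class is distributed in the same proportion across them.

```python
import random
from collections import defaultdict


def split_dataset(samples, train_percent=70, validation_percent=15, seed=42):
    """Split (image_id, label) pairs class by class into three disjoint subsets."""
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    split = {"train": [], "validation": [], "test": []}
    for label, image_ids in sorted(by_class.items()):
        rng.shuffle(image_ids)
        n_train = len(image_ids) * train_percent // 100
        n_val = len(image_ids) * validation_percent // 100
        split["train"] += [(i, label) for i in image_ids[:n_train]]
        split["validation"] += [(i, label) for i in image_ids[n_train:n_train + n_val]]
        # Everything that is left over goes into the test subset.
        split["test"] += [(i, label) for i in image_ids[n_train + n_val:]]
    return split


# Hypothetical dataset: 40 images, alternating between two classes.
samples = [(f"img_{k:03d}", "good" if k % 2 == 0 else "defect") for k in range(40)]
subsets = split_dataset(samples)
```

With 20 images per class, this yields 28 training, 6 validation, and 6 test samples. Six samples are far too few for a statistically relevant evaluation, which illustrates the lower bound on the amount of data mentioned above.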
Note also that for training the network you should use representative images, i.e., images like the ones you want to process later, and not only 'perfect' images. Otherwise, the network may have difficulties with non-'perfect' images.
In the context of deep learning, the assignments are performed by sending the input image through a network. The output of the entire network consists of a number of predictions. For a classification task, for example, these predictions are the confidences for each class, expressing how likely it is that the image shows an instance of this class.
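To make the notion of per-class confidences more concrete, the following Python sketch converts raw network scores into confidences with the softmax function. The use of softmax is an assumption made for this illustration (it is a common choice for classification networks); the class names and score values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into confidences that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["apple", "lemon", "orange"]  # hypothetical classes
raw_scores = [2.0, 0.5, -1.0]               # hypothetical output of the last layer
confidences = softmax(raw_scores)

# The predicted class is the one with the highest confidence.
predicted_class = class_names[confidences.index(max(confidences))]
```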
The specific network will vary, especially from one method to another.
Some methods, e.g., object detection, use a subnetwork to generate
feature maps (see the explanations given below and in
Deep Learning / Object Detection).
Here, we will explain a basic Convolutional Neural Network (CNN).
Such a network consists of a certain number of layers or filters,
which are arranged and connected in a specific way.
In general, any layer is a building block performing specific tasks.
It can be seen as a container, which receives input, transforms it
according to a function, and returns the output to the next layer.
Different functions are possible for different types of layers.
Several possible examples are given in the
“Solution Guide on Classification”.
Many layers or filters have weights, parameters which are also
called filter weights. These are the parameters modified during the
training of a network.
The output of most layers consists of feature maps.
The number of feature maps (the depth of the layer output) and
their size (width and height) depend on the specific layer.
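How filter weights produce feature maps, and why the size of the output depends on the layer, can be illustrated with a minimal Python sketch of a single convolution layer. The function `convolve_valid`, the choice of no padding and stride 1, and the two hand-written filters are assumptions for this example. In a real network the filter weights are learned during training.

```python
def convolve_valid(image, kernel):
    """Slide one filter over a 2D image (no padding, stride 1)."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(k_h) for j in range(k_w))
         for c in range(out_w)]
        for r in range(out_h)
    ]


# A 4x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]

# Two hypothetical 2x2 filters; each filter yields one feature map.
filters = {
    "vertical_edge": [[-1, 1], [-1, 1]],
    "horizontal_edge": [[-1, -1], [1, 1]],
}

feature_maps = {name: convolve_valid(image, k) for name, k in filters.items()}
depth = len(feature_maps)                       # number of feature maps
height = len(feature_maps["vertical_edge"])     # 4 - 2 + 1
width = len(feature_maps["vertical_edge"][0])   # 5 - 2 + 1
```

The layer output has depth 2 and a size of 4 x 3 (width x height). The first feature map responds only at the position of the vertical edge, while the second one stays zero because the image contains no horizontal edge.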
To train a network for a specific task, a loss function is added. There are different loss functions depending on the task, but they all work according to the following principle. A loss function compares the prediction from the network with the given information about what it should find in the image (and, if applicable, also where), and penalizes deviations. The filter weights are then updated in such a way that the loss function is minimized. Thus, when training the network for a specific task, one strives to minimize the loss (an error function) of the network, in the hope that doing so will also improve the performance measure. In practice, this optimization is done by calculating the gradient and updating the parameters of the different layers (filter weights) accordingly. This is repeated by iterating multiple times over the training data.
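The principle of minimizing a loss by following its gradient can be sketched with a toy model that has a single weight. Everything in this Python example is an assumption chosen for brevity: the model (prediction = weight * x), the mean squared error as loss, the training data, and the step size. A real network has many filter weights and a task-specific loss, but the update loop follows the same idea.

```python
def mean_squared_loss(weight, data):
    """Penalize deviations between the predictions weight * x and the targets y."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def loss_gradient(weight, data):
    """Derivative of the mean squared loss with respect to the weight."""
    return sum(2 * (weight * x - y) * x for x, y in data) / len(data)


training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2 * x
weight = 0.0       # initial value of the single trainable parameter
step_size = 0.05   # hypothetical value

losses = []
for _ in range(100):  # iterate multiple times over the training data
    losses.append(mean_squared_loss(weight, training_data))
    # Move the weight against the gradient, i.e., in the direction of lower loss.
    weight -= step_size * loss_gradient(weight, training_data)
```

After the loop, `weight` is close to 2 and the recorded loss values have decreased from the first to the last iteration.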
There are additional parameters that influence the training, but which are not directly learned during the regular training. These parameters have values set before starting the training. We refer to this last type of parameters as hyperparameters in order to distinguish them from the network parameters that are optimized during training. See the section “Setting the Training Parameters: The Hyperparameters”.
Training all filter weights from scratch requires a lot of resources. Therefore, one can take advantage of the following observation. The first layers detect low-level features like edges and curves. The feature maps of the following layers are smaller, but they represent more complex features. For a large network, the low-level features are general enough that the weights of the corresponding layers will not change much among different tasks. This leads to a technique called transfer learning: One takes an already trained network and retrains it for a specific task, benefiting from already quite suitable filter weights for the lower layers. As a result, considerably fewer resources are needed.
While in general the network should be more reliable when trained on a larger dataset, the amount of data needed for retraining also depends on the complexity of the task. A basic schema for the workflow of transfer learning is shown with the aid of classification in the figure below.
The different DL methods are designed for different tasks and vary in the way they are built up. They all have in common that during the training of the model one faces a minimization problem. When training the network or subnetwork, one strives to minimize an appropriate loss function, see the section “The Network and the Training Process”. To do so, there is a set of further parameters which are set before starting the training and are not optimized during the training. We refer to these parameters as hyperparameters. For a DL model, you can set a change strategy, specifying when and how you want these hyperparameters changed during the training. In this section, we explain the idea of the different hyperparameters. Note that certain methods have additional hyperparameters. You can find more information in their respective chapters.
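The idea of a change strategy can be illustrated with a simple step-wise reduction of the learning rate. The following Python sketch is not the HALCON mechanism: the function `learning_rate_for_epoch`, the initial value, the epochs at which the value changes, and the reduction factor are all hypothetical.

```python
def learning_rate_for_epoch(epoch, initial_learning_rate, change_epochs, factor):
    """Multiply the learning rate by `factor` at every epoch listed in change_epochs."""
    reductions = sum(1 for change_epoch in change_epochs if epoch >= change_epoch)
    return initial_learning_rate * factor ** reductions


# Hypothetical strategy: start with 0.001 and divide by 10 at epochs 30 and 60.
schedule = [learning_rate_for_epoch(epoch, 0.001, [30, 60], 0.1)
            for epoch in range(90)]
```

Here the learning rate stays at its initial value for the first 30 epochs and is then reduced twice, each time by a factor of 10.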
As already mentioned, the loss compares the predictions from the network
with the given information about the content of the image.
The loss now penalizes deviations.
Training the network means updating the filter weights in such a way that
the loss has less to penalize, i.e., the loss is minimized.
To do so, a certain amount of data is taken from the training dataset.
For this subset the gradient of the loss is calculated and the network
is modified by updating its filter weights accordingly.
This is then repeated with the next subset of data until the whole
training data is processed.
These subsets of the training data are called batches and the
size of these subsets, the
'batch_size', determines the number of
samples taken into a batch and, as a consequence, processed together.
A full iteration over the entire training data is called an epoch.
It is beneficial to iterate several times over the training data.
The number of iterations is defined by 'epochs'.
Thus, 'epochs' determines how many times the algorithm loops over
the training set.
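The interplay of 'batch_size' and 'epochs' can be sketched as two nested loops. This Python example only illustrates the bookkeeping; the helper `iterate_batches`, the sample count, and the parameter values are assumptions, and the actual weight update is indicated by a comment. How an incomplete last batch is handled may differ in a real implementation.

```python
import math


def iterate_batches(sample_ids, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(sample_ids), batch_size):
        yield sample_ids[start:start + batch_size]


training_ids = list(range(10))  # stand-in for ten training samples
batch_size = 4
epochs = 3

iterations_per_epoch = math.ceil(len(training_ids) / batch_size)

processed_batches = []
for epoch in range(epochs):                                  # loop over the training set
    for batch in iterate_batches(training_ids, batch_size):  # one batch per iteration
        # Here the gradient of the loss for this batch would be calculated
        # and the filter weights updated accordingly.
        processed_batches.append((epoch, batch))
```

With ten samples and a batch size of 4, one epoch consists of three iterations (batches of 4, 4, and 2 samples), so three epochs amount to nine weight updates in total.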
Each update of the filter weights is governed by two further hyperparameters: the 'learning_rate', determining the weight of the gradient on the updated loss function arguments (the filter weights), and the 'momentum', specifying the influence of previous updates. More information can be found in the documentation of the corresponding operator. In simple words, when we update the loss function arguments, we still remember the step we took for the last update. Now, we take a step in the direction of the gradient with a length depending on the learning rate; additionally, we repeat the step we took last time, but this time only 'momentum' times as long. A visualization is given in the figure below. A learning rate that is too large might result in divergence of the algorithm, while a very small learning rate requires unnecessarily many steps. Therefore, it is customary to start with a larger learning rate and potentially reduce it during training. With a momentum of 0, the momentum method has no influence, so only the gradient determines the update vector.
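The update rule described in words above can be written down as a short Python sketch. It follows the common formulation of gradient descent with momentum, which is an assumption here and not taken from the operator reference: the new step is the previous step scaled by the momentum, plus a step against the gradient (the direction that decreases the loss) scaled by the learning rate. The function names, the toy loss, and all parameter values are hypothetical. The momentum is assumed to lie between 0 (inclusive) and 1 (exclusive), as is customary for this method.

```python
def momentum_step(weight, previous_step, gradient, learning_rate, momentum):
    """One update: a step against the gradient plus a scaled repeat of the last step."""
    step = momentum * previous_step - learning_rate * gradient
    return weight + step, step


def minimize(gradient_fn, start, learning_rate, momentum, iterations):
    """Repeatedly apply momentum_step, starting without any previous step."""
    weight, step = start, 0.0
    for _ in range(iterations):
        weight, step = momentum_step(weight, step, gradient_fn(weight),
                                     learning_rate, momentum)
    return weight


def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


with_momentum = minimize(toy_gradient, 0.0, learning_rate=0.1, momentum=0.5, iterations=200)
without_momentum = minimize(toy_gradient, 0.0, learning_rate=0.1, momentum=0.0, iterations=200)
too_large_rate = minimize(toy_gradient, 0.0, learning_rate=1.5, momentum=0.0, iterations=20)
```

Both runs with a learning rate of 0.1 end up close to the minimum at 3. With a momentum of 0, the update reduces to plain gradient descent. The run with a learning rate of 1.5 moves further away from the minimum with every step, which is the divergence mentioned above.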