Semantic Segmentation


This chapter explains how to use semantic segmentation based on deep learning, both for the training and inference phases.

With semantic segmentation we assign each pixel of the input image to a class using a deep learning (DL) network.

A possible example for semantic segmentation with the classes 'apple', 'lemon', 'orange', and 'background': Every pixel of the input image is assigned to a class, but neither the three different instances of the class 'apple' nor the two different instances of the class 'orange' are distinguished as individual objects.

The result of semantic segmentation is an output image, in which the pixel value signifies the assigned class of the corresponding pixel in the input image. Thus, in HALCON the output image is of the same size as the input image. For general DL networks the deeper feature maps, representing more complex features, are usually smaller than the input image (see the section “The Network and the Training Process” in Deep Learning). To obtain an output of the same size as the input, HALCON uses segmentation networks with two components: an encoder and a decoder. The encoder determines features of the input image as done, e.g., for deep-learning-based classification. As this information is 'encoded' in a compressed format, the decoder is needed to reconstruct the information to the desired outcome, which, in this case, is the assignment of each pixel to a class. Note that, as pixels are classified, overlapping instances of the same class are not distinguished as distinct.

Semantic segmentation with deep learning is implemented within the more general deep learning model of HALCON. For more information on the latter, see the chapter Deep Learning / Model.

The following sections introduce the general workflow for semantic segmentation, the data and parameters involved, and the evaluation measures.

General Workflow

In this section, we describe the general workflow for a semantic segmentation task based on deep learning. We assume that your dataset is already labeled, see also the section “Data” below. Have a look at the HDevelop example series segment_pill_defects_deep_learning for an application. Note that this example is split into the four parts 'Preprocess', 'Training', 'Evaluation', and 'Inference', which give guidance on possible implementations.

Preprocess the data

This part is about how to preprocess your data. The single steps are also shown in the HDevelop example segment_pill_defects_deep_learning_1_preprocess.hdev.

  1. The information about what is to be found in which image of your training dataset needs to be transferred. This is done by the procedure

    • read_dl_dataset_segmentation.

    Thereby a dictionary DLDataset is created, which serves as a database and stores all necessary information about your data. For more information about the data and the way it is transferred, see the section “Data” below and the chapter Deep Learning / Model.

  2. Split the dataset represented by the dictionary DLDataset. This can be done using the procedure

    • split_dl_dataset.

    The resulting split will be saved over the key split in each sample entry of DLDataset.

  3. Now you can preprocess your dataset. For this, you can use the procedure

    • preprocess_dl_dataset.

    In case of custom preprocessing, this procedure offers guidance on the implementation.

    To use this procedure, specify the preprocessing parameters, e.g., the image size. For the latter, you should select the smallest possible image size at which the regions to segment are still well recognizable. Store all the parameters with their values in a dictionary DLPreprocessParam, for which you can use the procedure

    • create_dl_preprocess_param.

    We recommend saving this dictionary DLPreprocessParam in order to have access to the preprocessing parameter values later during the inference phase.

    During the preprocessing of your dataset, preprocess_dl_dataset also generates the images weight_image for the training dataset. They assign each class the weight ('class weights') its pixels get during training (see the section “Model Parameters and Hyperparameters” below).
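
Put together, the preprocessing steps above could look as follows in HDevelop. This is only a minimal sketch: the paths, class names, class IDs, split percentages, and image size are placeholder assumptions, and the exact parameter lists of the procedures may differ between HALCON versions, so please compare with segment_pill_defects_deep_learning_1_preprocess.hdev and the procedure documentation.

  * Read the labeled dataset into the dictionary DLDataset
  * (image/label directories, class names, and class IDs are placeholders).
  read_dl_dataset_segmentation ('images', 'labels', ['background','defect'], [0,1], [], [], [], DLDataset)
  * Split into training, validation, and test subsets (percentages are examples).
  split_dl_dataset (DLDataset, 70, 15, [])
  * Collect the preprocessing parameters, e.g., the target image size.
  create_dl_preprocess_param ('segmentation', 400, 400, 3, -127, 128, 'none', 'full_domain', [], [], [], [], DLPreprocessParam)
  * Preprocess the whole dataset and write the samples to disk.
  preprocess_dl_dataset (DLDataset, 'segment_data', DLPreprocessParam, [], DLDatasetFileName)
  * Save DLPreprocessParam to reuse it during the inference phase.
  write_dict (DLPreprocessParam, 'preprocess_param.hdict', [], [])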

Training of the model

This part is about how to train a DL semantic segmentation model. The single steps are also shown in the HDevelop example segment_pill_defects_deep_learning_2_train.hdev.

  1. A network has to be read using the operator

    • read_dl_model.

  2. The model parameters need to be set via the operator

    • set_dl_model_param.

    Such parameters are e.g., image_dimensions and class_ids, see the documentation of get_dl_model_param.

    You can always retrieve the current parameter values using the operator

    • get_dl_model_param.

  3. Set the training parameters and store them in the dictionary 'TrainParam'. These parameters include:

    • the hyperparameters, for an overview see the section “Model Parameters and Hyperparameters” below and the chapter Deep Learning.

    • parameters for possible data augmentation (optional).

    • parameters for the evaluation during training.

    • parameters for the visualization of training results.

    • parameters for serialization.

    This can be done using the procedure

    • create_dl_train_param.

  4. Train the model. This can be done using the procedure

    • train_dl_model.

    The procedure expects:

    • the model handle DLSegmentationHandle

    • the dictionary with the data information DLDataset

    • the dictionary with the training parameters 'TrainParam'

    • the number of epochs over which the training shall run.

    In case the procedure train_dl_model is used, the total loss as well as optional evaluation measures are visualized.
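
A compact HDevelop sketch of the training steps above. The pretrained model name, the parameter values (image dimensions, batch size, learning rate, momentum, number of epochs), and the argument order of create_dl_train_param and train_dl_model are assumptions based on the example series; check segment_pill_defects_deep_learning_2_train.hdev for the exact calls.

  * Read a pretrained segmentation network (file name is an assumption).
  read_dl_model ('pretrained_dl_segmentation_compact.hdl', DLModelHandle)
  * Set the model parameters, e.g., the class IDs and the image dimensions.
  get_dict_tuple (DLDataset, 'class_ids', ClassIDs)
  set_dl_model_param (DLModelHandle, 'class_ids', ClassIDs)
  set_dl_model_param (DLModelHandle, 'image_dimensions', [400,400,3])
  set_dl_model_param (DLModelHandle, 'batch_size', 2)
  set_dl_model_param (DLModelHandle, 'learning_rate', 0.0001)
  set_dl_model_param (DLModelHandle, 'momentum', 0.99)
  * Collect the training parameters in the dictionary TrainParam ...
  create_dl_train_param (DLModelHandle, 50, 1, 'true', 42, [], [], TrainParam)
  * ... and train the model over the given number of epochs.
  train_dl_model (DLDataset, DLModelHandle, TrainParam, 0)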

Evaluation of the trained model

In this part we evaluate the semantic segmentation model. The single steps are also shown in the HDevelop example segment_pill_defects_deep_learning_3_evaluate.hdev.

  1. Set the model parameters which may influence the evaluation, as e.g., 'batch_size', using the operator

    • set_dl_model_param.

  2. The evaluation can conveniently be done using the procedure

    • evaluate_dl_model.

  3. The dictionary EvaluationResults holds the requested evaluation measures. You can visualize your evaluation results using the procedure

    • dev_display_segmentation_evaluation.
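
The evaluation steps could be sketched as follows in HDevelop. The generic parameter key 'measures' and the argument lists of evaluate_dl_model and dev_display_segmentation_evaluation are assumptions; see segment_pill_defects_deep_learning_3_evaluate.hdev for the exact usage.

  * Set model parameters that influence the evaluation.
  set_dl_model_param (DLModelHandle, 'batch_size', 1)
  * Evaluate the model on the 'test' split of the dataset.
  create_dict (GenParamEval)
  set_dict_tuple (GenParamEval, 'measures', ['mean_iou','pixel_accuracy'])
  evaluate_dl_model (DLDataset, DLModelHandle, 'split', 'test', GenParamEval, EvaluationResults, EvalParams)
  * Visualize the evaluation results.
  create_dict (WindowDict)
  dev_display_segmentation_evaluation (EvaluationResults, EvalParams, [], WindowDict)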

Inference on new images

This part covers the application of a DL semantic segmentation model. The single steps are also shown in the HDevelop example segment_pill_defects_deep_learning_4_infer.hdev.

  1. Set the parameters as e.g., 'batch_size' using the operator

    • set_dl_model_param.

  2. Generate a data dictionary DLSample for each image. This can be done using the procedure

    • gen_dl_samples_from_images.

  3. Preprocess the images as done for the training. We recommend doing this using the procedure

    • preprocess_dl_samples.

    If you saved the dictionary DLPreprocessParam during the preprocessing step, you can directly use it as input to specify all parameter values.

  4. Apply the model using the operator

    • apply_dl_model.

  5. Retrieve the results from the dictionary DLResultBatch. The regions of the particular classes can be selected using e.g., the operator threshold on the segmentation image.
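
A sketch of the inference steps above in HDevelop; the file names and the class ID selected with threshold are placeholders, and the preprocessing parameter file is assumed to be the one saved in the preprocessing sketch above.

  * Restore the preprocessing parameters saved after the preprocessing step.
  read_dict ('preprocess_param.hdict', [], [], DLPreprocessParam)
  set_dl_model_param (DLModelHandle, 'batch_size', 1)
  * Read a new image and wrap it into a DLSample dictionary.
  read_image (Image, 'new_image.png')
  gen_dl_samples_from_images (Image, DLSampleBatch)
  * Preprocess the sample exactly as during training.
  preprocess_dl_samples (DLSampleBatch, DLPreprocessParam)
  * Apply the model.
  apply_dl_model (DLModelHandle, DLSampleBatch, [], DLResultBatch)
  * Select the region of, e.g., class ID 1 from the resulting segmentation image.
  get_dict_object (SegmentationImage, DLResultBatch[0], 'segmentation_image')
  threshold (SegmentationImage, ClassRegion, 1, 1)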

Data

We distinguish between data used for training and evaluation, and data used for inference. The latter consists of bare images. The former consists of images together with their information and ground truth annotations. You provide this information by defining for each pixel to which class it belongs (over the segmentation_image, see below for further explanations).

As a basic concept, the model handles data over dictionaries, meaning it receives the input data over a dictionary DLSample and returns a dictionary DLResult or DLTrainResult, respectively. More information on the data handling can be found in the chapter Deep Learning / Model.
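
For instance, the content of such a dictionary can be inspected directly in HDevelop (DLSample here stands for any sample dictionary):

  * List all keys of a sample dictionary to inspect which entries it contains.
  get_dict_param (DLSample, 'keys', [], SampleKeys)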

Data for training and evaluation

The training data is used to train a network for your specific task. The dataset consists of images and corresponding information. They have to be provided in a way the model can process them. Concerning the image requirements, find more information in the section “Images” below. The information about the images and their ground truth annotations is provided over the dictionary DLDataset and, for every sample, the respective segmentation_image, which defines the class for every pixel.

Classes

The different classes are the sets or categories differentiated by the network. They are set in the dictionary DLDataset and are passed to the model via the operator set_dl_model_param.

In semantic segmentation, we call your attention to two special cases, the class 'background' and classes declared as 'ignore':

  • 'background' class: The network treats the background class like any other class. It is also not necessary to have a background class. But if your dataset contains several classes you are not interested in, although the network still has to learn them, you can set them all to 'background'. As a result, the background class will be more diverse. See the procedure preprocess_dl_samples for more information.

  • 'ignore' classes: There is the possibility to declare one or multiple classes as 'ignore'. Pixels assigned to an 'ignore' class are ignored by the loss as well as by all measures and evaluations. Please see the section “The Network and the Training Process” in the chapter Deep Learning for more information about the loss. The network does not classify any pixel into a class declared as 'ignore'. However, the pixels labeled as belonging to such a class are still classified by the network, like every other pixel, into a non-'ignore' class. In the example given in the image below, this means the network will also classify the pixels of the class 'border', but it will not classify any pixel into the class 'border'. You can declare a class as 'ignore' using the parameter 'ignore_class_ids' of set_dl_model_param (see the sketch below).
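
A minimal sketch of declaring an 'ignore' class, assuming (as in the figure referenced above) that the class 'border' has the class ID 4:

  * Declare the class with ID 4 (here: 'border') as 'ignore'. Its pixels then
  * contribute neither to the loss nor to the evaluation measures.
  set_dl_model_param (DLModelHandle, 'ignore_class_ids', [4])
  * Query the setting to verify it.
  get_dl_model_param (DLModelHandle, 'ignore_class_ids', IgnoreClassIDs)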

DLDataset

This dictionary serves as a database, that is, it stores all information about your data necessary for the network, e.g., the names and paths of the images, the classes, etc. Please see the documentation of Deep Learning / Model for the general concept and key entries. Keys applicable only to semantic segmentation concern the segmentation_image (see the entry below). Over the keys segmentation_dir and segmentation_file_name you provide the information on how these images are named and where they are stored.

segmentation_image

So that the network can learn what the members of the different classes look like, you specify for each pixel of every image in the training dataset to which class it belongs. This is done by storing for every pixel of the input image the class, encoded as a pixel value, in the corresponding segmentation_image. These annotations are the ground truth annotations.

Schema of a segmentation_image. For visibility, gray values are used to represent numbers. (1) Input image. (2) The corresponding segmentation_image providing the class annotations, 0: background (white), 1: orange, 2: lemon, 3: apple, and 4: border (black) as a separate class so we can declare it as 'ignore'.

You need enough training data to split it into three subsets, one used for training, one for validation, and one for testing the network. These subsets are preferably independent and identically distributed (see the section “Data” in the chapter Deep Learning). For the splitting you can use the procedure split_dl_dataset.

Images

Regardless of the application, the network poses requirements on the images regarding the image dimensions, the gray value range, and the type. The specific values depend on the network itself, see the documentation of read_dl_model for the specific values of different networks. For a loaded network they can be queried with get_dl_model_param. In order to fulfill these requirements, you may have to preprocess your images. Standard preprocessing of an entire sample and therewith also the image is implemented in preprocess_dl_samples. In case of custom preprocessing this procedure offers guidance on the implementation.
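
For a loaded model, these requirements could be queried as sketched below; 'image_dimensions' is mentioned above, whereas the range parameter names are assumptions, so please verify them in the documentation of get_dl_model_param.

  * Query the image requirements of the loaded model.
  get_dl_model_param (DLModelHandle, 'image_dimensions', ImageDimensions)
  get_dl_model_param (DLModelHandle, 'image_range_min', ImageRangeMin)
  get_dl_model_param (DLModelHandle, 'image_range_max', ImageRangeMax)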

Network output

As training output, the operator will return a dictionary DLTrainResult with the current value of the total loss as well as values for all other losses included in your model.

As inference and evaluation output, the network will return a dictionary DLResult for every sample. For semantic segmentation, this dictionary will include for each input image the handles of the following two images:

  • segmentation_image: an image in which each pixel value signifies the class predicted for the corresponding pixel of the input image.

  • segmentation_confidence: an image in which each pixel value is the confidence of the classification of the corresponding pixel.

A schema over different data images. For visibility, gray values are used to represent numbers. (1) segmentation_image: also the pixels of classes declared as 'ignore' (see the figure above) get classified. (2) segmentation_confidence.

Model Parameters and Hyperparameters

In addition to the general DL hyperparameters explained in Deep Learning, there is a further hyperparameter relevant for semantic segmentation: the 'class weights' (see below).

For a semantic segmentation model, the model parameters as well as the hyperparameters (with the exception of 'class weights') are set using set_dl_model_param. The model parameters are explained in more detail in get_dl_model_param.

Note that due to the large memory usage, typically only small batch sizes are possible for training. As a consequence, training is rather slow and we advise using a higher momentum than, e.g., for classification. The HDevelop example segment_pill_defects_deep_learning_2_train.hdev provides good initial parameter values for the training of a segmentation network in HALCON.

'class weights'

With the hyperparameter 'class weights' you can assign each class the weight its pixels get during training. By giving the unique classes different weights, it is possible to force the network to learn the classes with different importance. This is useful in cases where a class dominates the images, as e.g., in defect detection, where the defects take up only a small fraction of an image. In such a case, a network classifying every pixel as background (thus, 'not defect') would generally achieve good loss results. Assigning different weights to the distinct classes helps to re-balance the distribution. In short, you can focus the loss to train especially on those pixels you determine to be important.

The network obtains these weights over weight_image, an image which is created for every training sample. In weight_image, every pixel value corresponds to the weight the corresponding pixel of the input image gets during training. You can create these images with the help of dedicated procedures.

This step has to be done before the training. Usually it is done during the preprocessing and it is part of the procedure preprocess_dl_dataset. Note that this hyperparameter is referred to as class_weights or ClassWeights within procedures. An illustration of how such an image with different weights looks is shown in the figure below.

Note that by giving a specific part of the image the weight 0.0, these pixels do not contribute to the loss (see the section “The Network and the Training Process” in Deep Learning for more information about the loss).

Schema for weight_image. For visibility, gray values are used to represent numbers. (1) The segmentation_image defining the classes for every pixel within the image, 0: background (white), 1: orange, 2: lemon, 3: apple, and 4: border (black), declared as 'ignore'. (2) The corresponding weight_image providing class weights, background: 1.0, orange: 30.0, lemon: 75.0. Pixels of classes declared as 'ignore', here the class border, will be ignored and get the weight 0.0.
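
Normally the weight images are generated automatically during the preprocessing. The following sketch only illustrates the idea of the figure above: a weight image is built manually from a segmentation image, with class IDs and weights taken from the figure (the variable names are assumptions, and the class 'apple' is omitted because the figure does not state its weight).

  * Build a weight image from a segmentation image.
  get_image_size (SegmentationImage, Width, Height)
  gen_image_const (WeightImage, 'real', Width, Height)
  * Class IDs and weights as in the figure: background, orange, lemon, border ('ignore').
  ClassIDs := [0, 1, 2, 4]
  ClassWeights := [1.0, 30.0, 75.0, 0.0]
  for Index := 0 to |ClassIDs| - 1 by 1
      * Select all pixels of the current class ...
      threshold (SegmentationImage, ClassRegion, ClassIDs[Index], ClassIDs[Index])
      * ... and paint the corresponding class weight into the weight image.
      paint_region (ClassRegion, WeightImage, WeightImage, ClassWeights[Index], 'fill')
  endfor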

Evaluation measures for the Data from Semantic Segmentation

For semantic segmentation, the following evaluation measures are supported in HALCON. Note that for computing such a measure for an image, the related ground truth information is needed. All the measure values explained below for a single image (e.g., mean_iou) can also be calculated for an arbitrary number of images. For this, imagine a single, large image formed by the ensemble of the output images, for which the measure is computed. Note that all pixels of a class declared as 'ignore' are ignored for the computation of the measures.

pixel_accuracy
The pixel accuracy is simply the ratio of all pixels that have been predicted with the correct class-label to the total number of pixels.
pixel accuracy = (number of correctly classified pixels) / (total number of considered pixels)
Visual example of the pixel_accuracy: (1) The segmentation_image defining the ground truth class for each pixel (see the section “Data” above). Pixels of a class declared as 'ignore' are drawn in black. (2) The output image, in which also the pixels of classes declared as 'ignore' get classified. (3) The pixel accuracy is the ratio between the correctly classified area and the total considered area. Note that pixels labeled as part of a class declared as 'ignore' are ignored.
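
For example (hypothetical numbers): if an image contains 1000 pixels that are not labeled as 'ignore' and 900 of them are predicted with the correct class, the pixel accuracy is 900 / 1000 = 0.9.
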
class_pixel_accuracy

The per-class pixel accuracy considers only pixels of a single class. It is defined as the ratio between the correctly predicted pixels and the total number of pixels labeled with this class.

In case a class does not occur, it gets a class_pixel_accuracy value of -1 and does not contribute to the average value, mean_accuracy.

mean_accuracy

The mean accuracy is defined as the averaged per-class pixel accuracy, class_pixel_accuracy, of all occurring classes.

class_iou

The per-class intersection over union (IoU) gives for a specific class the ratio of correctly predicted pixels to the union of annotated and predicted pixels. Visually this is the ratio between the intersection and the union of the areas, see the image below.

In case a class does not occur, it gets a class_iou value of -1 and does not contribute to the mean_iou.

IoU = (area of the intersection) / (area of the union)
Visual example of the per-class IoU, class_iou, here for the class 'apple' only. (1) The segmentation_image defining the ground truth class for each pixel (see the section “Data”). Pixels of a class declared as 'ignore' are drawn in black. (2) The output image, in which also the pixels of classes declared as 'ignore' get classified. (3) The intersection over union is the ratio between the intersection and the union of the areas of pixels denoted as apple. Note that pixels labeled as part of a class declared as 'ignore' are ignored.
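
For example (hypothetical numbers): if 100 pixels are labeled as 'apple', 90 pixels are predicted as 'apple', and 80 pixels are in both sets, the per-class IoU for 'apple' is 80 / (100 + 90 - 80) = 80 / 110 ≈ 0.73.
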
mean_iou

The mean IoU is defined as the averaged per-class intersection over union, class_iou, of all occurring classes. Note that every occurring class has the same impact on this measure, independent of the number of pixels it contains.

frequency_weighted_iou

As for the mean IoU, the per-class IoU is calculated first. But the contribution of each occurring class to this measure is weighted by the ratio of pixels that belong to that class. Note that classes with many pixels can dominate this measure.

pixel_confusion_matrix

The concept of a confusion matrix is explained in the section “Supervising the training” within the chapter Deep Learning. It also applies to semantic segmentation, where the instances are single pixels.

