Deep OCR


This chapter explains how to use deep-learning-based optical character recognition (Deep OCR).

With Deep OCR we want to detect and/or recognize text in an image. Deep OCR detects and recognizes connected characters, which will be referred to as 'words' (in contrast to OCR methods which are used to read single characters).

Figure: A possible example for deep-learning-based optical character recognition: Words in an image are detected and recognized.

A Deep OCR model can contain two components, which are dedicated to two distinct tasks, the detection, thus the localization of words, and the recognition of words. By default, a model is created with both components, but the model can also be limited to either task.

HALCON already provides pretrained components, which are suited for a multitude of applications without additional training, as the model is trained on a varied dataset and can therefore cope with many different fonts. Information on the available character set and model parameters can be retrieved using get_deep_ocr_param. To further adjust the reading to a specific task, it is possible to retrain the recognition or detection component separately on a given application domain using deep learning operators. Note that only one component can be retrained at a time.

Figure: The detection component can be fine-tuned for an application by retraining the Deep OCR model with custom data.

Figure: The recognition component (e.g., reading the word 'lemon') can be fine-tuned for an application by retraining the Deep OCR model with custom data.

The general workflow as well as the retraining are described in the following paragraphs.

General Workflow for Deep OCR Inference

This paragraph describes how to localize and read words using a Deep OCR model. An application scenario can be seen in the HDevelop example deep_ocr_workflow.hdev.

Creation of the Deep OCR model

Create a Deep OCR model containing either one or both of the two model components

  • detection_model and

  • recognition_model

using the operator create_deep_ocr.

To use a retrained model component instead of the provided one, adjust the created model by setting the retrained model component as 'recognition_model' or 'detection_model' using set_deep_ocr_param.
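
A minimal HDevelop sketch of this step (the file name 'my_retrained_recognition.hdl' is a placeholder for your own retrained component; available parameter names may differ between HALCON versions, see get_deep_ocr_param):

  * Create a Deep OCR model with both components (default).
  create_deep_ocr ([], DeepOcrHandle)
  * Query, e.g., the character set of the recognition component.
  get_deep_ocr_param (DeepOcrHandle, 'recognition_alphabet', Alphabet)
  * Optionally replace the pretrained recognition component by a
  * retrained one ('my_retrained_recognition.hdl' is a placeholder).
  set_deep_ocr_param (DeepOcrHandle, 'recognition_model', 'my_retrained_recognition.hdl')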

Inference

Model parameters regarding, e.g., the used devices, image dimensions, or minimum scores can be set using set_deep_ocr_param.

The Deep OCR model is applied to your acquired images using apply_deep_ocr. The inference results depend on the used model components. See the operator reference of apply_deep_ocr for details regarding which dictionary entries are computed for each combination of model components.

The inference results can be retrieved from the dictionary DeepOCRResult. Some procedures are provided in order to visualize results and score maps (a short sketch follows the list below):

  • Show location and/or recognized word using dev_display_deep_ocr_results.

  • Show location (and, if inferred, recognized word) on preprocessed image using dev_display_deep_ocr_results_preprocessed (if the model contains detection_model).

  • Show score maps using dev_display_deep_ocr_score_maps (if the model contains detection_model).
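
A minimal sketch of the inference steps ('my_image.png' is a placeholder; see the procedure documentation of dev_display_deep_ocr_results for the exact visualization arguments):

  * Apply the Deep OCR model to an image.
  read_image (Image, 'my_image.png')
  apply_deep_ocr (Image, DeepOcrHandle, 'auto', DeepOCRResult)
  * Display the location and the recognized words.
  dev_open_window_fit_image (Image, 0, 0, -1, -1, WindowHandle)
  dev_display_deep_ocr_results (Image, WindowHandle, DeepOCRResult, [], [])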

Training and Evaluation of the Model Components

This paragraph describes the retraining and evaluation of the recognition or detection component of a Deep OCR model using custom data. See also the HDevelop examples deep_ocr_recognition_training_workflow.hdev and deep_ocr_detection_training_workflow.hdev for application scenarios.

Preprocess the data

This part is about how to preprocess your data. See the section “Data” below for information on what data is to be provided at what stage of the Deep OCR workflow.

  1. The information that is to be obtained from the images of your training dataset needs to be transferred into a format the model can work with. This is done by the procedure

    • read_dl_dataset_ocr_recognition for the recognition component of a Deep OCR model.

    • read_dl_dataset_ocr_detection for the detection component of a Deep OCR model.

    It creates a dictionary DLDataset which serves as a database and stores all necessary information about your data. For more information about datasets, see the chapter Deep Learning / Model.

  2. Split the dataset represented by the dictionary DLDataset. This can be done using the procedure

    • split_dl_dataset.

  3. The network imposes several requirements on the images. These requirements (for example, the image size and gray value range) can be retrieved with

    • get_dl_model_param.

    For this you need to read the model first by using

    • read_dl_model.

  4. Now you can preprocess your dataset. For this, you can use the procedure

    • preprocess_dl_dataset.

    To use this procedure, specify the preprocessing parameters as, e.g., the image size. Store all the parameters with their values in a dictionary DLPreprocessParam, for which you can use the procedure

    • create_dl_preprocess_param_from_model.

    We recommend saving this dictionary DLPreprocessParam in order to have access to the preprocessing parameter values later during the inference phase. A condensed sketch of these preprocessing steps is shown below.
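
Put together, the preprocessing steps might look like the following sketch. It assumes that DLDataset has already been created by one of the read procedures above; the paths and the model file name 'my_pretrained_component.hdl' are placeholders:

  * Split the dataset: 70% training, 15% validation, 15% test.
  split_dl_dataset (DLDataset, 70, 15, [])
  * Read the trainable component to derive its preprocessing
  * parameters ('my_pretrained_component.hdl' is a placeholder).
  read_dl_model ('my_pretrained_component.hdl', DLModelHandle)
  create_dl_preprocess_param_from_model (DLModelHandle, 'none', 'full_domain', [], [], [], DLPreprocessParam)
  * Preprocess the dataset and write it to 'my_preprocessed_data'.
  preprocess_dl_dataset (DLDataset, 'my_preprocessed_data', DLPreprocessParam, [], DLDatasetFileName)
  * Save the preprocessing parameters for the inference phase.
  write_dict (DLPreprocessParam, 'my_preprocess_param.hdict', [], [])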

Training of the model

This part explains how to train the recognition or detection component of a Deep OCR model.

  1. Set the training parameters and store them in the dictionary TrainParam. This can be done using the procedure

    • create_dl_train_param.

  2. Train the model. This can be done using the procedure

    • train_dl_model.

    The procedure expects:

    • the model handle DLModelHandle

    • the dictionary DLDataset containing the data information

    • the dictionary TrainParam containing the training parameters
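
Put together, these two steps might look as follows (the concrete values are example settings, not recommendations):

  * Gather the training parameters: 100 epochs, evaluation on the
  * validation split after every epoch, display of a preview, and a
  * fixed random seed (all values are example settings).
  create_dl_train_param (DLModelHandle, 100, 1, 'true', 42, [], [], TrainParam)
  * Train the model component.
  train_dl_model (DLDataset, DLModelHandle, TrainParam, 0, TrainResults, TrainInfos, EvaluationInfos)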

Evaluation of the retrained model

In this part, we evaluate the Deep OCR model.

  1. Set the model parameters which may influence the evaluation.

  2. The evaluation can be done conveniently using the procedure

    • evaluate_dl_model.

    This procedure expects a dictionary GenParamEval with the evaluation parameters.

  3. The dictionary EvaluationResult holds the evaluation measures. To get an idea of how the retrained model performs compared to the pretrained one, you can compare their evaluation values. To understand the different evaluation measures, see the section “Evaluation Measures for Deep OCR Results”. A minimal sketch of the evaluation call follows this list.
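
A minimal sketch of the evaluation step, assuming the test split is selected and the default evaluation parameters suffice:

  * Evaluate the retrained component on the test split.
  create_dict (GenParamEval)
  * Add evaluation parameters to GenParamEval as needed (see the
  * procedure documentation); an empty dictionary uses the defaults.
  evaluate_dl_model (DLDataset, DLModelHandle, 'split', 'test', GenParamEval, EvaluationResult, EvalParams)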

After a successful evaluation the retrained model can be used for inference (see section “General Workflow for Deep OCR Inference” above).

Data

This section gives information on the data that needs to be provided in different stages of the Deep OCR workflow.

We distinguish between data used for training and evaluation, consisting of images with information about the instances, and data for inference, which consists of bare images. How the data needs to be provided is explained in the corresponding sections below.

As a basic concept, the model handles data over dictionaries, meaning it receives the input data over a dictionary DLSample and returns a dictionary DLResult or DLTrainResult, respectively. More information on the data handling can be found in the chapter Deep Learning / Model.

Data for training and evaluation

The dataset consists of images and corresponding information. They have to be provided in a way the model can process them. For the image requirements, see the section “Images” below.

The training data is used to train and evaluate a network for your specific application. With the aid of this data the network can learn to detect or recognize text samples that resemble the text occurring during inference. The necessary information is given by providing the depicted words for each image and, for the detection component, their locations.

How the data has to be formatted in HALCON for a DL model is explained in the chapter Deep Learning / Model. In short, a dictionary DLDataset serves as a database for the information needed by the training and evaluation procedures.

The data for DLDataset can be read using read_dl_dataset_ocr_recognition or read_dl_dataset_ocr_detection, depending on which model component is trained.

Dataset based on images with word labels

In this case, images with words that are labeled with rotated bounding boxes need to be provided. You can label your data using the MVTec Deep Learning Tool, available from the MVTec website. The dataset must be built as follows (a minimal hand-built sketch is shown after the list):

  • 'class_ids': class IDs

  • 'class_names': class names (Needs to contain the class 'word'. All other classes are ignored.)

  • 'image_dir': path to the image directory

  • 'samples': tuple of dictionaries, one for each sample

    • 'image_file_name': name of the image file

    • 'image_id': image ID

    • 'bbox_col': bounding box column coordinate

    • 'bbox_row': bounding box row coordinate

    • 'bbox_phi': bounding box angle

    • 'bbox_length1': first half edge length of the bounding box

    • 'bbox_length2': second half edge length of the bounding box

    • 'label_custom_data': list of dictionaries containing custom label data for each bounding box

      • 'text': word to be read
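
For illustration, a minimal sketch of such a dictionary, built by hand for a single image with one word (file and directory names as well as the coordinate values are placeholders; in practice, the MVTec Deep Learning Tool and the read procedures create this structure for you):

  create_dict (DLDataset)
  set_dict_tuple (DLDataset, 'class_ids', [0])
  set_dict_tuple (DLDataset, 'class_names', ['word'])
  set_dict_tuple (DLDataset, 'image_dir', 'my_images')
  * One sample with a single rotated box labeled 'lemon'.
  create_dict (Sample)
  set_dict_tuple (Sample, 'image_file_name', 'image_01.png')
  set_dict_tuple (Sample, 'image_id', 1)
  set_dict_tuple (Sample, 'bbox_row', 120.0)
  set_dict_tuple (Sample, 'bbox_col', 200.0)
  set_dict_tuple (Sample, 'bbox_phi', 0.0)
  set_dict_tuple (Sample, 'bbox_length1', 60.0)
  set_dict_tuple (Sample, 'bbox_length2', 15.0)
  create_dict (CustomData)
  set_dict_tuple (CustomData, 'text', 'lemon')
  set_dict_tuple (Sample, 'label_custom_data', CustomData)
  set_dict_tuple (DLDataset, 'samples', Sample)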

Dataset based on word crop images (only recognition)

In this case, the dataset contains only images that are each cropped to a single word. The dataset must be built as follows (see the sketch after the list):

  • 'image_dir': path to the image directory

  • 'samples': tuple of dictionaries, one for each sample

    • 'image_file_name': name of the image file

    • 'image_id': image ID

    • 'word': word to be read in the image
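
Analogously, a minimal hand-built sketch for this variant (names are placeholders):

  create_dict (DLDataset)
  set_dict_tuple (DLDataset, 'image_dir', 'my_word_crops')
  * One sample: a crop image showing exactly the word 'lemon'.
  create_dict (Sample)
  set_dict_tuple (Sample, 'image_file_name', 'crop_01.png')
  set_dict_tuple (Sample, 'image_id', 1)
  set_dict_tuple (Sample, 'word', 'lemon')
  set_dict_tuple (DLDataset, 'samples', Sample)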

The example program deep_ocr_prelabel_dataset.hdev can provide assistance by prelabeling your data.

Your training data should cover the full range of characters that might occur during inference. If a character is not or only very rarely contained in the training dataset, the model might not properly learn to recognize it. To keep track of the character distribution within the dataset, the procedure gen_dl_dataset_ocr_recognition_statistics is provided, which generates statistics on how often each character occurs in your dataset.

You also need enough training data to split it into three subsets, used for training, validation, and testing the network. These subsets are preferably independent and identically distributed; see the section “Data” in the chapter Deep Learning.

Images

The model poses requirements on the images, such as the dimensions, the gray value range, and the type. See the documentation of read_dl_model for the specific values of the trainable Deep OCR model. For a read model they can be queried with get_dl_model_param. In order to fulfill these requirements, you may have to preprocess your images. Standard preprocessing of an entire sample, including the image, is implemented in preprocess_dl_samples.

Requirements for images used for inference are described in apply_deep_ocr.

Model output

The network output depends on the task:

training

As output, the operator will return a dictionary DLTrainResult with the current value of the total loss as well as values for all other losses included in your model.

inference and evaluation

As output, the network will return a dictionary DLResult for every sample. This dictionary will include the recognized word as well as the candidates and their confidences for every character of the word.

Evaluation Measures for Deep OCR Results

Deep OCR Detection

The following evaluation measures are supported in HALCON. To compute these metrics for testing or validation, ground truth annotation is needed.

  • Precision, Recall and F-score

    The performance of Deep OCR Detection is evaluated using precision and recall on word boxes. The evaluation uses the intersection over union (IoU) in order to compare ground truth and predicted word boxes. The default IoU threshold for a match is 0.5; it can be increased or decreased if needed.

    Figure: Visual example of the IoU. (1) The input image with the ground truth bounding box (orange) and the predicted bounding box (light blue). (2) The IoU is the ratio between the area of the intersection and the area of the union: IoU = area of intersection / area of union.

    The precision is the proportion of true positives to all positives (true and false ones). Thus, it is a measure of how trustworthy the detector is.

    The recall is the proportion of the number of correctly detected words to all labeled words.

    To represent this with a single number, we compute the F-score, the harmonic mean of precision and recall.
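
    In formulas, with TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives:

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

    For example, if 8 of 10 predicted boxes match a ground truth word and 12 words are labeled, the precision is 0.8, the recall is 8/12 ≈ 0.67, and the F-score is approximately 0.73.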

  • Score of Angle Precision (SoAP)

    The SoAP value is a score for the precision of the inferred orientation angles. It is determined by the angle differences between the inferred bounding boxes (I) and the corresponding ground truth annotations (GT), accumulated over all inferred bounding boxes.

Deep OCR Recognition

The accuracy for a Deep OCR Recognition task is given as the percentage of correctly read words (CR) relative to the ground truth words (GT) of a dataset. The accuracy is then defined as:

\text{accuracy} = \frac{CR}{GT}

List of Operators

apply_deep_ocr
Apply a Deep OCR model on a set of images for inference.
create_deep_ocr
Create a Deep OCR model.
get_deep_ocr_param
Return the parameters of a Deep OCR model.
read_deep_ocr
Read a Deep OCR model from a file.
set_deep_ocr_param
Set the parameters of a Deep OCR model.
write_deep_ocr
Write a Deep OCR model to a file.