This chapter explains how to use deep-learning-based optical character recognition (Deep OCR).
With Deep OCR we want to detect and/or recognize text in an image. Deep OCR detects and recognizes connected characters, which will be referred to as 'words' (in contrast to OCR methods which are used to read single characters).
A Deep OCR model can contain two components dedicated to two distinct tasks: detection, that is, the localization of words, and recognition of words. By default, a model is created with both components, but it can also be limited to either task.
HALCON provides pretrained components that are suited for a
multitude of applications without additional training, because
the model is trained on a varied dataset and can
therefore cope with many different fonts. Information on the
available character set and model parameters can be
retrieved using get_deep_ocr_param.
To further adjust the reading to a specific task, it is possible to retrain
the recognition or detection component separately on a given application
domain using deep learning operators.
Note that only one component can be retrained at a time.
The general workflow as well as the retraining are described in the following paragraphs.
This paragraph describes how to localize and read words using
a Deep OCR model. An application scenario can be seen in the HDevelop
example deep_ocr_workflow.hdev.
Create a Deep OCR model containing either one or both of the two model components
detection_model and recognition_model using the operator create_deep_ocr.
To use a retrained model component instead of the provided one,
adjust the created model by setting the retrained model component as
'recognition_model' or 'detection_model' using set_deep_ocr_param.
Model parameters such as the devices used, the image dimensions,
or the minimum scores can be set using set_deep_ocr_param.
The Deep OCR model is applied to your acquired images using
apply_deep_ocr. The inference results depend on the model
components used. See the operator reference of apply_deep_ocr
for details regarding which dictionary entries are computed for
each combination of model components.
The inference results can be retrieved from the dictionary DeepOCRResult.
Some procedures are provided in order to visualize results and score maps:
Show location and/or recognized word using dev_display_deep_ocr_results.
Show location (and, if inferred, recognized word) on the preprocessed image using dev_display_deep_ocr_results_preprocessed (if the model contains detection_model).
Show score maps using dev_display_deep_ocr_score_maps (if the model contains detection_model).
This paragraph describes the retraining and evaluation of the recognition or
detection component of a Deep OCR model using custom data. See also the
HDevelop examples deep_ocr_recognition_training_workflow.hdev or
deep_ocr_detection_training_workflow.hdev for an application scenario.
This part is about how to preprocess your data. See the section “Data” below for information on what data is to be provided at what stage of the Deep OCR workflow.
The information contained in the images of your training dataset needs to be read in. This is done by the procedure
read_dl_dataset_ocr_recognition for the recognition component of a Deep OCR model, or
read_dl_dataset_ocr_detection for the detection component of a Deep OCR model.
The procedure creates a dictionary DLDataset, which serves as
a database and stores all necessary information about your data.
For more information about datasets, see the chapter
Deep Learning / Model.
Split the dataset represented by the dictionary DLDataset.
This can be done using the procedure split_dl_dataset.
The network imposes several requirements on the images. These requirements (for example the image size and gray value range) can be retrieved with get_dl_model_param.
For this you need to read the model first by using read_dl_model.
Now you can preprocess your dataset. For this, you can use the procedure
preprocess_dl_dataset.
To use this procedure, specify the preprocessing parameters, e.g.,
the image size.
Store all parameters with their values in a dictionary
DLPreprocessParam, for which you can use the procedure
create_dl_preprocess_param_from_model.
We recommend saving this dictionary DLPreprocessParam in
order to have access to the preprocessing parameter values later
during the inference phase.
This part explains how to train the recognition or detection component of a Deep OCR model.
Set the training parameters and store them in the dictionary TrainParam.
This can be done using the procedure create_dl_train_param.
Train the model. This can be done using the procedure train_dl_model.
The procedure expects:
the model handle DLModelHandle
the dictionary DLDataset containing the data information
the dictionary TrainParam containing the training parameters
In this part, we evaluate the Deep OCR model.
Set the model parameters which may influence the evaluation.
The evaluation can be done conveniently using the procedure
evaluate_dl_model.
This procedure expects a dictionary GenParamEval with the
evaluation parameters.
The dictionary EvaluationResult holds the evaluation
measures. To get an idea of how the retrained model performs
compared to the pretrained model, you can compare their evaluation values.
To understand the different evaluation measures, see section
“Evaluation Measures for Deep OCR Results”.
After a successful evaluation the retrained model can be used for inference (see section “General Workflow for Deep OCR Inference” above).
This section gives information on the data that needs to be provided in different stages of the Deep OCR workflow.
We distinguish between data used for training and evaluation, consisting of images together with information about their instances, and data for inference, which are bare images. How the data needs to be provided is explained in the corresponding sections below.
As a basic concept, the model handles data over dictionaries, meaning it
receives the input data over a dictionary DLSample and
returns a dictionary DLResult and DLTrainResult,
respectively. More information on the
data handling can be found in the chapter Deep Learning / Model.
The dataset consists of images and corresponding information. They have to be provided in a way the model can process them. Concerning the image requirements, find more information in the section “Images” below.
The training data is used to train and evaluate a network for your specific application. With the aid of this data the network can learn to detect or recognize text samples that resemble text that occurs during inference. The necessary information is given by providing the depicted word for each image.
How the data has to be formatted in HALCON for a DL model is explained
in the chapter Deep Learning / Model.
In short, a dictionary DLDataset serves as a database for
the information needed by the training and evaluation procedures.
The data for DLDataset can be read using
read_dl_dataset_ocr_recognition or read_dl_dataset_ocr_detection,
depending on which model type is used.
For the detection component, images with words that are labeled with rotated bounding boxes need to be provided. You can label your data using the MVTec Deep Learning Tool, available from the MVTec website. The dataset must be built as follows:
'class_ids': class IDs
'class_names': class names (Needs to contain the class 'word'. All other classes are ignored.)
'image_dir': path to the image directory
'samples': tuple of dictionaries, one for each sample
  'image_file_name': name of the image file
  'image_id': image ID
  'bbox_col': bounding box column coordinate
  'bbox_row': bounding box row coordinate
  'bbox_phi': bounding box angle
  'bbox_length1': first half edge length of the bounding box
  'bbox_length2': second half edge length of the bounding box
  'label_custom_data': list of dictionaries containing custom label data for each bounding box
    'text': word to be read
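To make the nesting of this structure explicit, the following Python sketch mirrors it as a plain dictionary. It is an illustration only and not HALCON code: the directory, file name, coordinates, and texts are made-up placeholder values, and the helper `count_labeled_words` is hypothetical. The sketch also assumes that each 'bbox_*' entry holds one value per bounding box of the image.

```python
# Illustrative only: a plain-Python mirror of the detection dataset layout.
# All values below are made-up placeholders.
detection_dataset = {
    "class_ids": [0],
    "class_names": ["word"],  # must contain the class 'word'
    "image_dir": "images",    # hypothetical directory
    "samples": [
        {
            "image_file_name": "label_0001.png",
            "image_id": 1,
            # Assumption: one entry per bounding box in the image.
            "bbox_row": [120.0, 240.5],
            "bbox_col": [310.0, 305.0],
            "bbox_phi": [0.0, 0.05],
            "bbox_length1": [80.0, 65.0],
            "bbox_length2": [18.0, 17.5],
            "label_custom_data": [{"text": "BEST"}, {"text": "BEFORE"}],
        }
    ],
}

BBOX_KEYS = ("bbox_row", "bbox_col", "bbox_phi", "bbox_length1", "bbox_length2")


def count_labeled_words(dataset):
    """Count labeled boxes and check that all per-box entries have equal length."""
    total = 0
    for sample in dataset["samples"]:
        n_boxes = len(sample["label_custom_data"])
        for key in BBOX_KEYS:
            if len(sample[key]) != n_boxes:
                raise ValueError(f"{key} does not match the number of labels")
        total += n_boxes
    return total


print(count_labeled_words(detection_dataset))
```

A consistency check of this kind can catch samples in which a box was labeled without a text entry before the data is passed on.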
For the recognition component, the dataset includes only images that are cropped to a single word each. The dataset must be built as follows:
'image_dir': path to the image directory
'samples': tuple of dictionaries, one for each sample
  'image_file_name': name of the image file
  'image_id': image ID
  'word': word to be read in the image
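As a rough illustration of this simpler layout, the Python sketch below mirrors it as a plain dictionary and checks that every sample carries the listed entries. It is not HALCON code: the directory, file names, and words are placeholders, and `find_incomplete_samples` is a hypothetical helper.

```python
# Illustrative only: a plain-Python mirror of the recognition dataset layout.
recognition_dataset = {
    "image_dir": "word_crops",  # hypothetical directory
    "samples": [
        {"image_file_name": "crop_0001.png", "image_id": 1, "word": "LOT"},
        {"image_file_name": "crop_0002.png", "image_id": 2, "word": "2024"},
    ],
}

REQUIRED_SAMPLE_KEYS = {"image_file_name", "image_id", "word"}


def find_incomplete_samples(dataset):
    """Return image IDs (or list positions) of samples missing a required key."""
    incomplete = []
    for index, sample in enumerate(dataset["samples"]):
        if not REQUIRED_SAMPLE_KEYS.issubset(sample):
            incomplete.append(sample.get("image_id", index))
    return incomplete


print(find_incomplete_samples(recognition_dataset))
```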
The example program deep_ocr_prelabel_dataset.hdev can provide
assistance by prelabeling your data.
Your training data should cover the full range of characters that
might occur during inference. If a character is absent from the
training dataset or occurs only very rarely, the model might not properly
learn to recognize it. To keep track of the character distribution
within the dataset, the procedure
gen_dl_dataset_ocr_recognition_statistics is provided, which
generates statistics on how often each character occurs in
your dataset.
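The idea behind such character statistics can be sketched in a few lines of Python. This is not the HALCON procedure and makes no claim about its output format; the function names, the example words, and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def character_counts(words):
    """Count how often each character occurs over all labeled words."""
    counts = Counter()
    for word in words:
        counts.update(word)
    return counts


def rare_characters(words, alphabet, min_count=1):
    """Return the characters of the alphabet seen fewer than min_count times."""
    counts = character_counts(words)
    return sorted(char for char in alphabet if counts[char] < min_count)


labels = ["LOT", "2024", "LOT7"]  # made-up example labels
print(character_counts(labels).most_common(3))
print(rare_characters(labels, "0123456789LOT", min_count=2))
```

Characters reported as rare are candidates for collecting or labeling additional samples before retraining.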
You also want enough training data to split it into three subsets, used for training, validation and testing the network. These subsets are preferably independent and identically distributed, see the section “Data” in the chapter Deep Learning.
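A minimal Python sketch of such a three-way split is shown below. It is a generic illustration, not the HALCON splitting procedure; the 70/15/15 proportions and the fixed seed are assumptions, and a plain random shuffle only approximates identically distributed subsets.

```python
import random


def split_samples(samples, train_fraction=0.7, validation_fraction=0.15, seed=42):
    """Shuffle the samples and split them into train, validation, and test subsets."""
    if train_fraction < 0 or validation_fraction < 0:
        raise ValueError("fractions must not be negative")
    if train_fraction + validation_fraction > 1:
        raise ValueError("fractions must not sum to more than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_validation],
        "test": shuffled[n_train + n_validation:],  # whatever remains
    }


subsets = split_samples(range(100))
print({name: len(items) for name, items in subsets.items()})
```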
The model poses requirements on the images, such as the dimensions,
the gray value range, and the type.
See the documentation of read_dl_model for the specific values
of the trainable Deep OCR model.
For a read model they can be queried with get_dl_model_param.
In order to fulfill these requirements, you may have to preprocess your
images.
Standard preprocessing of an entire sample, including the
image, is implemented in preprocess_dl_samples.
Requirements for images used for inference are described in
apply_deep_ocr.
The network output depends on the task:
Training: As output, the operator returns a dictionary DLTrainResult
with the current value of the total loss as well as values for all
other losses included in your model.
Inference and evaluation: As output, the network returns a dictionary DLResult
for every sample.
This dictionary includes the recognized word as well as the
candidates and their confidences for every character of the word.
The following evaluation measures are supported in HALCON. To compute these metrics for testing or validation, ground truth annotation is needed.
Precision, Recall and F-score
The performance of Deep OCR Detection is evaluated using precision and recall on word boxes. The evaluation uses the intersection over union (IoU) to compare ground truth and predicted word boxes. The default IoU threshold for a match is 0.5; it can be increased or decreased if needed.
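To illustrate the IoU criterion, the following Python sketch computes the IoU of two boxes and applies the threshold. It is a deliberate simplification: it uses axis-aligned boxes given as (row1, col1, row2, col2), whereas the word boxes discussed here are rotated. The function names are hypothetical, and treating an IoU exactly equal to the threshold as a match is an assumption.

```python
def iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes (row1, col1, row2, col2) with row1 < row2, col1 < col2."""
    row1_a, col1_a, row2_a, col2_a = box_a
    row1_b, col1_b, row2_b, col2_b = box_b
    # Overlap extents are clamped to zero when the boxes do not intersect.
    inter_height = max(0.0, min(row2_a, row2_b) - max(row1_a, row1_b))
    inter_width = max(0.0, min(col2_a, col2_b) - max(col1_a, col1_b))
    intersection = inter_height * inter_width
    area_a = (row2_a - row1_a) * (col2_a - col1_a)
    area_b = (row2_b - row1_b) * (col2_b - col1_b)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(predicted_box, ground_truth_box, iou_threshold=0.5):
    """Assumption: a prediction matches if its IoU reaches the threshold."""
    return iou_axis_aligned(predicted_box, ground_truth_box) >= iou_threshold


print(is_match((0, 0, 10, 10), (0, 2, 10, 12)))
```

Raising the threshold demands tighter localization before a predicted box counts as a match; lowering it is more lenient.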
The precision is the proportion of true positives to all positives (true and false ones). Thus, it is a measure of how trustworthy the detector is.
The recall is the proportion of the number of correctly detected words to all labeled words.
To represent this with a single number, we compute the F-score, the harmonic mean of precision and recall.
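The three measures can be written down compactly once the matches are counted. The Python sketch below uses the standard definitions, with true positives as matched predictions, false positives as unmatched predictions, and false negatives as labeled words that were not detected. The function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f_score) from match counts."""
    detected = true_positives + false_positives  # all predicted word boxes
    labeled = true_positives + false_negatives   # all ground truth word boxes
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / labeled if labeled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = detection_scores(8, 2, 2)
print(precision, recall, f_score)
```

Because the harmonic mean is dominated by the smaller value, a detector with high precision but low recall (or the reverse) still receives a low F-score.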
Score of Angle Precision (SoAP)
The SoAP value is a score for the precision of the inferred orientation angles. This score is determined by the angle differences between the inferred bounding boxes (I) and the corresponding ground truth annotations (GT): where the index runs over all inferred bounding boxes.
The accuracy for a Deep OCR Recognition task is given as the proportion of correctly read words (CR) among the ground truth words (GT) of a dataset. The accuracy is then defined as:
\[ \mathrm{accuracy} = \frac{|CR|}{|GT|} \]
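As a small worked illustration, word accuracy can be computed in Python as the share of ground truth words that were read correctly. The function name is hypothetical, and counting a word as correct only on an exact, case-sensitive match is an assumption, as is returning 0.0 for an empty dataset.

```python
def word_accuracy(predicted_words, ground_truth_words):
    """Share of ground truth words that were read exactly right."""
    if len(predicted_words) != len(ground_truth_words):
        raise ValueError("one prediction is expected per ground truth word")
    if not ground_truth_words:
        return 0.0
    correct = sum(
        predicted == expected
        for predicted, expected in zip(predicted_words, ground_truth_words)
    )
    return correct / len(ground_truth_words)


# Two of the three made-up words are read correctly.
print(word_accuracy(["LOT", "2024", "EXP"], ["LOT", "2024", "EXP."]))
```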
apply_deep_ocr
create_deep_ocr
get_deep_ocr_param
read_deep_ocr
set_deep_ocr_param
write_deep_ocr