do_ocr_word_knn — Classify a related group of characters with an OCR classifier.
do_ocr_word_knn works like do_ocr_multi_class_knn in that it computes the best class for each of the characters given by the regions Character and the gray values Image with the OCR classifier OCRHandle, and returns the classes in Class and the corresponding confidences in Confidence. The confidences lie between 0.0 and 1.0. The larger the value, the more reliable the classification of the individual character.
In contrast to do_ocr_multi_class_knn, do_ocr_word_knn treats the group of characters as an entity that yields a Word by concatenating the class names of the character regions. This makes it possible to restrict the allowed classification results on a textual level by specifying an Expression that describes the expected word.
The Expression may restrict the word to belong to a predefined lexicon created using create_lexicon or import_lexicon, by specifying the name of the lexicon in angle brackets, as in '<mylexicon>'. If the Expression has any other form, it is interpreted as a regular expression with the same syntax as specified for tuple_regexp_match. Note that you will usually want to use an expression of the form '^...$' when using variable quantifiers like '*', to ensure that the entire word is matched by the expression. Also note that, in contrast to tuple_regexp_match, do_ocr_word_knn does not support passing extra options in an expression tuple.
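The two forms of Expression, and the effect of the '^...$' anchors, can be illustrated with a small sketch. It is not HALCON code: the helper parse_expression is hypothetical, and Python's re module is used only as a stand-in for the tuple_regexp_match syntax, which may differ in details.

```python
import re

def parse_expression(expression):
    """Hypothetical helper: decide how an Expression string would be read.

    Returns ("lexicon", name) for the '<name>' form and
    ("regex", compiled_pattern) for anything else.
    """
    m = re.fullmatch(r"<([^<>]+)>", expression)
    if m:
        return ("lexicon", m.group(1))
    return ("regex", re.compile(expression))

# A name in angle brackets selects a lexicon.
kind, name = parse_expression("<mylexicon>")

# Without anchors, a pattern with a variable quantifier matches a mere
# part of the word (here the leading digits, or even the empty string).
word = "12A4"
unanchored_hit = re.search("[0-9]*", word) is not None
# With anchors, the whole word has to consist of digits.
anchored_hit = re.search("^[0-9]*$", word) is not None
```

In this sketch the unanchored pattern accepts '12A4' although it contains a letter, whereas the anchored pattern rejects it, which is the reason for the recommendation above.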
If the word derived from the best class for each character does not match the Expression, do_ocr_word_knn attempts to correct it by considering the NumAlternatives best classes for each character. To do so, it tests all possible corrections in which the classification result is changed for at most NumCorrections character regions. The alternatives used are identical to those returned by do_ocr_single_class_knn for a single character.
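The correction step described above can be sketched as an exhaustive search. This is only an illustration under stated assumptions, not the operator's actual implementation: the function correct_word and its return convention are hypothetical, candidates are tried in order of increasing number of changed characters, and Python's re module stands in for the regular expression syntax.

```python
import re
from itertools import combinations, product

def correct_word(alternatives, pattern, num_corrections):
    """Hypothetical sketch of the word correction search.

    alternatives[i] holds the candidate classes of character i, best
    class first (its length plays the role of NumAlternatives).
    Returns (word, number_of_changed_characters) or (None, None).
    """
    regex = re.compile(pattern)
    best = [alts[0] for alts in alternatives]
    n = len(best)
    # k = 0 tests the uncorrected word, then 1, 2, ... changed positions.
    for k in range(0, min(num_corrections, n) + 1):
        for positions in combinations(range(n), k):
            # Only the non-best alternatives are replacements.
            choices = [alternatives[p][1:] for p in positions]
            for replacement in product(*choices):
                candidate = list(best)
                for p, cls in zip(positions, replacement):
                    candidate[p] = cls
                word = "".join(candidate)
                if regex.search(word):
                    return word, k
    return None, None

# Best classes read "12O4"; the expression demands digits only.
alts = [["1", "I", "l"], ["2", "Z", "7"], ["O", "0", "Q"], ["4", "A", "H"]]
result = correct_word(alts, "^[0-9]+$", 2)
```

Here a single change (the letter 'O' to the digit '0' in the third position) is enough to satisfy the expression. With num_corrections set to 0, the same input yields no result, because the uncorrected word does not match.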
If the Expression refers to a lexicon and the above procedure did not yield a result, the most similar word in the lexicon is returned, provided that the correction requires fewer than NumCorrections edit operations (see suggest_lexicon).
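The lexicon fallback can be pictured as a nearest-word search. The sketch below assumes that "edit operations" means insertions, deletions, and substitutions of single characters (Levenshtein distance) and reads "less than NumCorrections" literally as a strict comparison. Both points are assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def closest_lexicon_word(word, lexicon, num_corrections):
    """Hypothetical sketch: return (entry, distance) for the most similar
    lexicon entry, or (None, None) if it needs too many edit operations."""
    best_word, best_dist = None, None
    for entry in lexicon:
        d = edit_distance(word, entry)
        if best_dist is None or d < best_dist:
            best_word, best_dist = entry, d
    # Strict comparison: a literal reading of "less than NumCorrections".
    if best_dist is not None and best_dist < num_corrections:
        return best_word, best_dist
    return None, None

lexicon = ["HALCON", "HDEVELOP"]
match = closest_lexicon_word("HALC0M", lexicon, 3)
```

In this example 'HALC0M' is two substitutions away from 'HALCON', so it is accepted for a limit of 3 but rejected for a limit of 2.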
The resulting word is graded by a Score between 0.0 (no correction found) and 1.0 (original word correct). The score is determined by the number of corrected characters and additionally includes a minor penalty for ignoring the second-best class or even all best classes (in the case of lexica). Note that this is a combinatorial score that does not reflect the original Confidence of the best Class.
Character: Characters to be recognized.
Image: Gray values of the characters.
OCRHandle: Handle of the OCR classifier.
Expression: Expression describing the allowed word structure.
NumAlternatives: Number of classes per character considered for the internal word correction.
Default value: 3
Suggested values: 3, 4, 5
Typical range of values: 1 ≤ NumAlternatives ≤ 5
NumCorrections: Maximum number of corrected characters.
Default value: 2
Suggested values: 1, 2, 3, 4, 5
Typical range of values: 0 ≤ NumCorrections ≤ 5
Class: Result of classifying the characters with the k-NN.
Number of elements: Class == Character
Confidence: Confidence of the class of the characters.
Number of elements: Confidence == Character
Word: Word text after classification and correction.
Score: Measure of similarity between corrected word and uncorrected classification results.
The complexity of checking all possible corrections is of magnitude O((n*a)^min(c,n)), where a is the number of alternatives, n is the number of character regions, and c is the number of allowed corrections. However, to guard against extremely long run times for large n, c is internally clipped to 5, 3, or 1 if a*n >= 30, 60, or 90, respectively.
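The clipping rule can be written down as a short function. This is one reading of the sentence above, not the operator's source: the name clip_corrections is hypothetical, and it is assumed that the thresholds only ever lower c and that the strictest applicable limit wins.

```python
def clip_corrections(num_alternatives, num_chars, num_corrections):
    """Hypothetical sketch of the internal clipping of c.

    num_alternatives is a, num_chars is n, num_corrections is c.
    """
    load = num_alternatives * num_chars  # the product a*n
    if load >= 90:
        limit = 1
    elif load >= 60:
        limit = 3
    elif load >= 30:
        limit = 5
    else:
        return num_corrections  # small problems are not clipped
    # Assumption: clipping never raises the requested value.
    return min(num_corrections, limit)

# 3 alternatives and 20 character regions give a*n = 60, so at most
# 3 corrections would be tested even if 5 were requested.
effective = clip_corrections(3, 20, 5)
```

Under this reading, a word of 30 characters with 3 alternatives each (a*n = 90) would be corrected in at most one position, regardless of NumCorrections.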
If the parameters are valid, the operator do_ocr_word_knn returns the value 2 (H_MSG_TRUE). If necessary, an exception is raised.