Key concepts, examples, and a Python implementation for measuring Optical Character Recognition (OCR) output quality
Contents
(1) Importance of Evaluation Metrics
(2) Error Rates and Levenshtein Distance
(3) Character Error Rate (CER)
(4) Word Error Rate (WER)
(5) Python Example (with TesseractOCR and fastwer)
Great job in successfully generating output from your OCR model! You have done the hard work of labeling and pre-processing the images, setting up and running your neural network, and applying post-processing on the output.
The final step now is to assess how well your model has performed. Even if it gave high confidence scores, we need to measure performance with objective metrics. Since you cannot improve what you do not measure, these metrics serve as a vital benchmark for the iterative improvement of your OCR model.
In this article, we will look at two metrics used to evaluate OCR output, namely Character Error Rate (CER) and Word Error Rate (WER).
The usual way of evaluating prediction output is with the accuracy metric, where we indicate a match (1) or no match (0). However, this does not provide enough granularity to assess OCR performance effectively.
We should instead use error rates to determine the extent to which the OCR transcribed text and ground truth text (i.e., reference text labeled manually) differ from each other.
A common intuition is to count how many characters were misspelled. While this is on the right track, the actual error rate calculation is more involved, because the OCR output can have a different length from the ground truth text.
Furthermore, there are three different types of error to consider:
- Substitution error: Misspelled characters/words
- Deletion error: Lost or missing characters/words
- Insertion error: Incorrect inclusion of characters/words
The question now is, how do you measure the extent of errors between two text sequences? This is where Levenshtein distance enters the picture.
Levenshtein distance is a distance metric measuring the difference between two string sequences. It is the minimum number of single-character (or word) edits (i.e., insertions, deletions, or substitutions) required to change one word (or sentence) into another.
For example, the Levenshtein distance between “mitten” and “fitting” is 3 since a minimum of 3 edits is needed to transform one into the other.
- mitten → fitten (substitute m with f)
- fitten → fittin (substitute e with i)
- fittin → fitting (insert g at the end)
The more different the two text sequences are, the higher the number of edits needed, and thus the larger the Levenshtein distance.
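To make this concrete, here is a minimal sketch of the classic dynamic-programming algorithm for Levenshtein distance. The helper name `levenshtein` is my own; in practice, optimized libraries such as `editdistance` do the same job.

```python
def levenshtein(ref, hyp):
    """Minimum number of single-element edits (insertions, deletions,
    substitutions) needed to turn `ref` into `hyp`. Works on strings
    (character edits) and on lists of words (word edits)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all i elements of ref
    for j in range(n + 1):
        dp[0][j] = j              # insert all j elements of hyp
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]

print(levenshtein("mitten", "fitting"))  # 3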
(i) Equation
CER calculation is based on the concept of Levenshtein distance, where we count the minimum number of character-level operations required to transform the ground truth text (aka reference text) into the OCR output.
It is represented with this formula:

CER = (S + D + I) / N

where:
- S = Number of Substitutions
- D = Number of Deletions
- I = Number of Insertions
- N = Number of characters in reference text (aka ground truth)
Bonus Tip: The denominator N can alternatively be computed with:
N = S + D + C (where C = number of correct characters)
The output of this equation represents the percentage of characters in the reference text that were incorrectly predicted in the OCR output. The lower the CER value (with 0 being a perfect score), the better the performance of the OCR model.
(ii) Illustration with Example
Let’s look at an example:
- Ground Truth Reference Text: 809475127
- OCR Transcribed Output Text: 80g475Z7
Several errors require edits to transform the OCR output into the ground truth:
- g instead of 9 (at reference text character 3)
- Missing 1 (at reference text character 7)
- Z instead of 2 (at reference text character 8)
With that, here are the values to input into the equation:
- Number of Substitutions (S) = 2
- Number of Deletions (D) = 1
- Number of Insertions (I) = 0
- Number of characters in reference text (N) = 9
Based on the above, we get (2 + 1 + 0) / 9 = 0.3333. Converted to a percentage, the CER becomes 33.33%. This implies that roughly one in every three characters in the sequence was incorrectly transcribed.
We repeat this calculation for all the pairs of transcribed output and corresponding ground truth, and take the mean of these values to obtain an overall CER percentage.
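As a rough sketch, the per-pair CER and the overall mean can be computed by reusing the `levenshtein` helper from the earlier snippet (the second pair below is a made-up, perfectly transcribed sample, purely for illustration):

```python
# CER as a fraction of the reference length.
def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / len(ref)

print(f"CER = {cer('809475127', '80g475Z7'):.2%}")  # CER = 33.33%

# Corpus-level CER: average the per-pair values.
refs = ["809475127", "my name is kenneth"]
hyps = ["80g475Z7", "my name is kenneth"]
mean_cer = sum(cer(r, h) for r, h in zip(refs, hyps)) / len(refs)
print(f"Mean CER = {mean_cer:.2%}")  # Mean CER = 16.67%
```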
(iii) CER Normalization
One thing to note is that CER values can exceed 100%, especially with many insertions. For example, the CER for ground truth ‘ABC’ and a longer OCR output ‘ABC12345’ is 166.67%.
It felt a little strange to me that an error value can go beyond 100%, so I looked around and came across an article by Rafael C. Carrasco that discusses how normalization can be applied:
"Sometimes the number of mistakes is divided by the sum of the number of edit operations (i + s + d) and the number of correct symbols, which is always larger than the numerator."
The normalization technique described above makes CER values fall within the range of 0–100% all the time. It can be represented with this formula:

CER_normalized = (S + D + I) / (S + D + I + C)

where C = Number of correct characters
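Plugging the 'ABC' vs. 'ABC12345' example into both formulas shows the effect of the normalization:

```python
# Worked numbers for the 'ABC' vs 'ABC12345' example above:
# all 3 reference characters are correct, and 5 extra characters are inserted.
S, D, I, C = 0, 0, 5, 3

cer_raw = (S + D + I) / (S + D + C)             # 5 / 3 -> can exceed 100%
cer_normalized = (S + D + I) / (S + D + I + C)  # 5 / 8 -> bounded by 100%

print(f"Raw CER: {cer_raw:.2%}")                # Raw CER: 166.67%
print(f"Normalized CER: {cer_normalized:.2%}")  # Normalized CER: 62.50%
```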
(iv) What is a good CER value?
There is no single benchmark for defining a good CER value, as it is highly dependent on the use case. Different scenarios and complexity (e.g., printed vs. handwritten text, type of content, etc.) can result in varying OCR performances. Nonetheless, there are several sources that we can take reference from.
An article published in 2009 reviewing OCR accuracy in large-scale Australian newspaper digitization programs came up with these benchmarks (for printed text):
- Good OCR accuracy: CER 1–2% (i.e., 98–99% accurate)
- Average OCR accuracy: CER 2–10%
- Poor OCR accuracy: CER >10% (i.e., below 90% accurate)
For complex cases involving handwritten text with highly heterogeneous and out-of-vocabulary content (e.g., application forms), a CER value as high as around 20% can be considered satisfactory.
If your project involves transcription of particular sequences (e.g., social security number, phone number, etc.), then the use of CER will be relevant.
On the other hand, Word Error Rate might be more applicable if it involves the transcription of paragraphs and sentences of words with meaning (e.g., pages of books, newspapers).
The formula for WER is the same as that of CER, but WER operates at the word level instead: WER = (S + D + I) / N, where S, D, and I now count word-level substitutions, deletions, and insertions, and N is the number of words in the reference text.
WER is generally well-correlated with CER (provided error rates are not excessively high), although the absolute WER value is expected to be higher than the CER value.
For example:
- Ground Truth: ‘my name is kenneth’
- OCR Output: ‘myy nime iz kenneth’
From the above, the CER is 16.67%, whereas the WER is 75%. The WER value of 75% is easy to interpret, since 3 out of 4 words in the sentence were wrongly transcribed.
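Since the dynamic program in `levenshtein` only compares elements by position, a word-level WER sketch can reuse it (together with the `cer` helper from earlier) simply by splitting the sentences into word lists:

```python
# WER: the same edit-distance idea, applied to lists of words.
def wer(ref: str, hyp: str) -> float:
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / len(ref_words)

ref = "my name is kenneth"
hyp = "myy nime iz kenneth"
print(f"CER = {cer(ref, hyp):.2%}")  # CER = 16.67%
print(f"WER = {wer(ref, hyp):.2%}")  # WER = 75.00%
```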
We have covered enough theory, so let’s look at an actual Python code implementation.
Click HERE to see the full demo Jupyter notebook
In the demo notebook, I ran the open-source TesseractOCR model to extract output from several sample images of handwritten text. I then utilized the fastwer package to calculate CER and WER from the transcribed output and ground truth text (which I labeled manually).
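For reference, here is a condensed sketch of that pipeline. The image paths and ground-truth labels below are placeholders, and it assumes the Tesseract binary as well as the `pytesseract` and `fastwer` packages are installed:

```python
import fastwer
import pytesseract
from PIL import Image

# Placeholder file names and manually labeled ground truths.
image_paths = ["sample_1.png", "sample_2.png"]
ground_truths = ["my name is kenneth", "809475127"]

# Transcribe each image; strip() drops Tesseract's trailing newline.
hypotheses = [
    pytesseract.image_to_string(Image.open(path)).strip()
    for path in image_paths
]

# fastwer returns scores as percentages; char_level=True gives CER.
print("Corpus CER:", fastwer.score(hypotheses, ground_truths, char_level=True))
print("Corpus WER:", fastwer.score(hypotheses, ground_truths))
```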
In this article, we covered the concepts behind CER and WER, worked through examples, and saw how to apply them in practice.
While CER and WER are handy, they are not bulletproof performance indicators of OCR models. This is because the quality and condition of the original documents (e.g., handwriting legibility, image DPI, etc.) play at least as important a role as the OCR model itself.
I welcome you to join me on a data science learning journey! Give this Medium page a follow to stay in the loop of more data science content, or reach out to me on LinkedIn. Have fun evaluating your OCR model!