Â© 2009 Accusoft Corporation 1 | P a g e Improve OCR Accuracy on Color Documents Use Image Detergentâ„¢ to Clean Up Color Document Images Prior to OCR for Improved Results Abstract This white paper confirms that industry-standard practices to clean color document images can be improved to produce higher OCR accuracy. Image Detergentâ„¢ from Accusoft improves OCR accuracy by 5-10% more than a standard Smoothing filter. This white paper leads the reader through the testing that proves it. Standard smoothing algorithms provide a good way to reduce background noise and improve the appearance of scanned documents. However, they are also highly destructive to text and other data commonly found on a document image. The Image Detergent filter within the ScanFixÂ® Xpress software development kit (SDK) from Accusoft works on a different principle than other smoothing filters, and is intended specifically for use on color document images. This paper explores the impact of the Image Detergent smoothing filter on color document images containing various text and background colors. The two items quantitatively measured were OCR accuracy and cleaned up file size. OCR accuracy was measurably improved using Image Detergent, and file sizes for both lossless and lossy compression methods were significantly smaller for the images after processing with Image Detergent. Image Detergent is only available within ScanFix Xpress from Accusoft. A trial version can be downloaded here: http://www.accusoft.com/scanfix.htm. Figure 1 Before (top) and after (bottom) clips from an image cleaned with Image Detergent Introduction Noise is a common problem in all areas of digital signal processing, and the realm of document imaging is no exception. Noise in images typically shows up as specks or variations in color where none is desirable. An example of this can be seen above; the pink background in the top half of the image consists of many different shades of pink. This variation contains no useful Â© 2009 Accusoft Corporation 2 | P a g e information, and could be replaced by a solid color, as in the bottom half. Noise comes from many sources, including physical sources such as variations in paper color, or dirt and lint on the scanner or paper, as well as software sources, such as lossy compression. Lossy compression, such as used by the JPEG compression algorithm, produces very high degrees of compression, but at the cost of changing the pixels in the imageâ€”hence the â€œlossyâ€ description. One of the common side effects of the JPEG compression is specks that surround sharp transitions in color. Smoothing filters are a common way to remove noise from color images, by blending adjacent pixels together to dilute the impact of small changes in the colors of a region. This technique is known in signal processing as a low pass filter, because it allows low frequency data, which changes gradually over time, to pass through, and blocks high frequency data, which changes quickly. In image processing, a smoothing filter, like most other filters, is a type of morphological filter, meaning it changes the morphology or shape of the image. The morphological filters, in addition to the desirable behavior of reducing noise, also have the side effect of blurring and softening sharp edges, as shown below. Figure 2 A noisy signal (upper left) smoothed with progressively more aggressive smoothing filters (lower left, top right, lower right) shows less and less noise, but more and more flattening of the sharp edge in the original signal. This may not be a concern in images such as photographs, which have little high frequency data, but in document images, which are full of sharp edges and thin linesâ€”both of which contain many high frequenciesâ€”a low pass filter can be very destructive. The Image Detergent filter is designed to address exactly this situation. Rather than working morphologically, by blending pixels found in the same region in the image, Image Detergent works in the colorspace to provide smoothing. A colorspace is best envisioned as a cube, with an X, Y, and Z axis. In the most common colorspace, RGB, the axes correspond to the percentage of red, green, and blue that are mixed together to form any color in the colorspace. Image Detergent works in the RGB colorspace, taking colors that are in close Â© 2009 Accusoft Corporation 3 | P a g e proximity in the colorspace, and pulling them together. Since the location of the pixels in the image is not changed, no blurring of lines or edges occurs, but the regions of colors cleaned by Image Detergent end up smooth and noise-free. In this paper, the impact of Image Detergent on various document processing operations will be explored. The baseline data set consists of a number of documents, the â€œrawâ€ images, which have been captured on a color scanner and stored in an uncompressed format. These raw images will be filtered and compressed using different image processing methods, and the resulting modified images will be compared to determine the impacts of these image processing methods on optical character recognition and file size using common lossy and lossless compression algorithms. The Hypothesis for OCR Recognition Image Detergent will improve OCR recognition, and reduce the image size of compressed images. The Approach Binarize original images, apply OCR and measure recognition accuracy. Next, apply Image Detergent to those same original images, then binarize them, apply OCR, and measure recognition accuracy. Compare against initial results. Take the same approach on image file size. The Data Set The data set consists of a number of documents printed on a color laser printer on colored paper. Each document has the same text repeated in differing font faces using 14 and 12 point text. Paper color Font face Ink color White Arial Black Ivory Times New Roman Red Blue Courier Green Pink Blue Yellow The text consisted of the following, reproduced for all desired combinations: 14 point Color scanning gives you more information. Now you can use it. The quick brown fox jumped over the lazy hen. Now is the time for all good men to come to the aid of their country. ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 1234567890 ~!@#$%^&*()_+ Â© 2009 Accusoft Corporation 4 | P a g e The documents were scanned using an older FujitsuÂ® 4750C color scanner to simulate a common production environment. The documents were scanned at 150, 200, and 300 dpi, but the 200 and 300 dpi images were re-sampled to 150 dpi for this test. The images were stored in a lossless TIFF format. The total number of scanned pages used in the analysis was 180. The scanner default settings automatically adjusted the brightness on the Ivory and White paper, resulting in a background color whose average was very near pure white. The blue, pink, and yellow papers yielded colors of approximately 0xb0f0f8, 0xf8c8d8, and 0xf8f8c8. The resolution of 150 dpi was determined to be the borderline resolution for reliable OCR of 12 point and larger text, and the analysis focused on the 12 and 14 point text. By focusing on the borderline cases, changes in OCR accuracy are readily determined. The image shown in Figure 1 of this white paper consists of green ink on pink paper, and it provided a good illustration of the difficulties encountered when recognizing characters from a color image. Cleanup method The method used to generate the data for this analysis used three software development toolkits from Accusoft. Each toolkit is focused on a different imaging field, and all the toolkits are designed to easily work together. The ImagXpress toolkit is a general purpose photo and document imaging SDK, and it is used for opening and decompressing the images. ImagXpress supports a wide range of file formats (see www.accusoft.com/imagxpressformats.htm), including popular document imaging formats such as TIFF and JBIG2, and color formats such as JPEG and JPEG 2000. SmartZoneâ„¢, part of the FormSuiteâ„¢ SDK, is a zonal OCR tool with an easy to use programmer interface (www.accusoft.com/formsuite.htm). The third toolkit used in this test is ScanFix Xpress, the scanned image cleanup toolkit. Each image was loaded into memory using ImagXpress and subjected to two different processes. The first process was a simple binarization, using the ScanFix Binarize() method, with the minimum and maximum thresholds set to 128 and 255 respectively. The image was then processed using SmartZone, and the resulting text stored for comparison. The second process used the ScanFix Image Detergent method to clean up the image background color. The Image Detergent object was set up to clean four colors, corresponding to pure white, and the average background colors for the blue, pink, and yellow page as estimated using an industry standard utility program. This method was chosen for simplicity, and because it allowed a greater control over the radius than the AutoImage Detergent() method. The radius used for each color was around half of the distance from each color to its nearest neighbor, to prevent overlapping cleanup regions. Here is a diagram of the cleanup operation, showing the effected regions of the spectrum: Â© 2009 Accusoft Corporation 5 | P a g e Figure 2 Before and after spectrum showing the cleaned regions of the colorspace. After the Image Detergent operation, the image was binarized with the same settings as the first process. Test Results: By providing a cleaner input, the binarization process produced a better quality output, which increased the recognition rate when the images were read using SmartZone. OCR impact Figure 3 Binarizing alone (left) compared to Image Detergent followed by binarization (right). The thin areas of the characters proved difficult for the binarization process, but a cleaned image resulted in fewer broken characters. The OCR impact of the Image Detergent process was calculated by comparing the output from each process, and counting the number of characters correctly matched, and subtracting one for characters that were dropped, added, or misread. Overall, recognition rates were 2% higher for 14 point and 5% higher for 12 point text with the Image Detergent operation than without it. While the 2% increase seems small, it translated into a 63% increase in the number of blocks of 14 point text read with 95% or greater accuracy. For 12 point text, it corresponded to a 21% Â© 2009 Accusoft Corporation 6 | P a g e increase in blocks read with over 80% accuracy. The reason for this change was the reduction in noise, and relative increase in contrast between the background and foreground colors. Comparison with Standard Smoothing Algorithm Figure 4 Comparison of OCR results with raw image and images processed with a 3x3 Average filter and the Image Detergent filter. Compared to other smoothing algorithms however, the results are dramatically different. Replacing the Image Detergent filter with a standard smoothing filter (a 3x3 pixel average filter) also reduced the background noise, but at great cost to the machine readability of the characters in the documents. Readability dropped from an averate of 93% with Image Detergent to 41% with the average filter. The reason for this is that while smoothing does even out the background color, it also significantly deteriorates the sharpness of the characters. This decrease in sharpness leads to dramatically poorer performance when the image is binarized, as shown in Figures 4 and 5. 0 10 20 30 40 50 60 70 80 90 100 14 point 12 point Raw 3x3 Average filter ImageDetergent Â© 2009 Accusoft Corporation 7 | P a g e Figure 5 Image Detergent (top) compared to a 3 x 3 pixel average filter (bottom). Both clean the background, but the blurred text is far more difficult to binarize, as it has greatly reduced contrast with the background. Â© 2009 Accusoft Corporation 8 | P a g e Figure 6 JPEG compression reduces OCR accuracy on marginal images, but the use of Image Detergent can regain a significant percentage of the lost accuracy. As shown in Figure 6, when the test images were compressed as 50% quality JPEG images, the OCR quality went down for both the images with and without Image Detergent, as would be expected for such marginal images. However, performance on the non-Image Detergent images declined more; recognition dropped 12% on 14 and 13% on 12 point text when JPEG compression was used. When Image Detergent was applied to the JPEG images after decompression and before OCR, it significantly reduced the negative impact of the JPEG artifacts, reducing the impact of the JPEG degradation by as much as 70%, dropping those percentages to 7% and 9% respectively. Figure 7 shows the artifacts produced by JPEGâ€™s compression algorithm, and how Image Detergent removes many of those artifacts from the image. 0 10 20 30 40 50 60 70 80 90 100 14 pt 12 pt Raw JPEG JPEG + ID Â© 2009 Accusoft Corporation 9 | P a g e Figure 7 Original (top), 50% quality JPEG (middle), and cleaned JPEG (bottom). The JPEG compression actually cleans the background some, as the lossy compression algorithm discards the high frequency data, but it also creates noise in the form of compression artifacts, which appear as specks and brightness variance around the characters. Image Detergent helps clean up those artifacts. The Hypothesis for Reducing Output File Size Noise, while not desirable, is still data, and can significantly impact the file size in compressed image formats. Removing the noise from the image before it is compressed can reduce file sizes, in some cases very dramatically. The Approach For the file size comparison, each image was cleaned with Image Detergent using the same settings that were used for OCR testing, and the results were stored in a lossless format. The original and cleaned images were then compressed using two formats, a lossy and a lossless format. The lossy format used was JPEG, using a 50% quality level, comparable to the Â© 2009 Accusoft Corporation 10 | P a g e Independent JPEG Groupâ€™s reference implementation. The lossless format used was TIFF with LZW compression. The Data Set The images used were the same as were used for the OCR testing. Each image was compressed using the lossy and lossless formats, and the file sizes of the compressed original and cleaned images were compared to calculate the difference. The differences are shown in Figure 8, as a percentage of the original file size. A value of 90 indicates the compressed Image Detergent file was 90% the size of the compressed original file. Test Results Figure 4 Percentage file size reduction with Image Detergent. In all cases, the compressed file size was reduced. The most dramatic changes were in the Pink and Blue colored pages, which had the most noise. When comparing the sizes of the original and the cleaned images using JPEG compression at a 50% quality level, the cleaned images generated file sizes 10% smaller on average. With a lossless compression, such as LZW compressed TIFF, the change is far more dramatic, with an average reduction of 62% (though the LZW files are still significantly larger than the JPEG files).