Scanned documents are sometimes misaligned, contain scanning artefacts, or have low contrast. These properties are typical root causes of Optical Character Recognition (OCR) issues. Furthermore, misaligned scans make it difficult to use position-based data extraction reliably.
Docparser comes with a built-in 'Clean & Normalize' preprocessing feature for scanned documents to improve OCR accuracy. This option can be found in the settings of your document parser under 'Settings Preprocessing'. Depending on your documents and your use case, you may want to activate any of the following options.
Please note that activating these options will increase the processing time considerably. We recommend activating preprocessing options only if necessary. Also, these preprocessing options are only applied when the document actually goes through Optical Character Recognition (OCR).
Remove Pixel Noise
The noise filter removes small clusters of pixels ("noise") and other scanning artefacts from your scan. We recommend enabling this option only when your scans show artefacts. Be aware that setting this filter to "heavy" might also remove valid characters such as dots and decimal points.
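To illustrate why a stronger noise filter can swallow decimal points, here is a minimal Python sketch of one common approach: removing connected clusters of ink pixels below a size threshold. It is not Docparser's implementation. The function name `remove_small_components` and the `min_size` parameter are hypothetical, and the "light" and "heavy" thresholds are chosen purely for this example.

```python
from collections import deque

def remove_small_components(image, min_size):
    """Return a copy of a binary image (1 = ink) without ink clusters
    that contain fewer than min_size pixels. Hypothetical helper."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect the 8-connected cluster that contains (y, x).
            cluster, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                cluster.append((cy, cx))
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase clusters that are too small to count as content.
            if len(cluster) < min_size:
                for cy, cx in cluster:
                    result[cy][cx] = 0
    return result

# A 3x3 "character", a 2x2 "decimal point" and a single-pixel speck.
scan = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
light = remove_small_components(scan, min_size=2)  # removes only the speck
heavy = remove_small_components(scan, min_size=5)  # also removes the 2x2 dot
```

With the lower threshold only the isolated speck disappears, while the higher threshold also erases the small dot, which is the kind of side effect to watch for on documents with many decimal numbers.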
Stretch Contrasts
This preprocessing filter automatically increases the contrast of your scanned images and converts them to grayscale (mostly black and white). This option is very helpful if your scans have low contrast or if your documents contain light-grey areas that you want removed.
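As a rough illustration of what contrast stretching means, the following Python sketch linearly remaps grayscale values so that the darkest pixel becomes 0 and the brightest becomes 255. This is a generic technique, not necessarily what Docparser does internally. The function name and the optional `black_point` and `white_point` parameters are assumptions made for this example; lowering `white_point` shows how light-grey areas can be pushed to pure white.

```python
def stretch_contrast(pixels, black_point=None, white_point=None):
    """Linearly stretch grayscale values (0-255) to the full range.
    Values at or below black_point map to 0, values at or above
    white_point map to 255. Hypothetical helper for illustration."""
    flat = [value for row in pixels for value in row]
    black = min(flat) if black_point is None else black_point
    white = max(flat) if white_point is None else white_point
    if white <= black:
        # Flat image or invalid points: nothing to stretch.
        return [row[:] for row in pixels]
    scale = 255 / (white - black)
    return [
        [round((min(max(value, black), white) - black) * scale) for value in row]
        for row in pixels
    ]

# A faded scan whose values only span 100..180.
faded = [[100, 120], [140, 180]]
stretched = stretch_contrast(faded)
# Treat everything at or above 140 as background and turn it white.
cleaned = stretch_contrast(faded, white_point=140)
```

In `stretched` the darkest and brightest input values end up at 0 and 255, while in `cleaned` both of the lighter pixels become 255.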
Deskew
Deskewing is the process of removing skew from images. Skew is a scanning artefact that occurs when the camera of the scanner is misaligned, when the paper was not placed completely flat, or simply when the paper was slightly rotated during scanning.
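One common way to estimate skew is a projection-profile search: try a range of candidate angles, rotate the ink pixels back by each one, and keep the angle at which the pixels line up most sharply into horizontal rows. The Python sketch below shows the idea on synthetic data. It is not Docparser's algorithm, and the names `rotate` and `estimate_skew` as well as the angle range and step size are assumptions made for this example.

```python
import math

def rotate(points, degrees):
    """Rotate (x, y) points around the origin by the given angle."""
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]

def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) whose inverse rotation
    concentrates the ink pixels into the fewest, fullest rows."""
    steps = int(max_angle / step)
    best_angle, best_score = 0.0, -1
    for i in range(-steps, steps + 1):
        angle = i * step
        # Row histogram of the points after undoing the candidate skew.
        counts = {}
        for _, y in rotate(points, -angle):
            row = round(y)
            counts[row] = counts.get(row, 0) + 1
        # Sum of squares is highest when rows are sharply separated.
        score = sum(c * c for c in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Three synthetic "text lines", skewed by 2 degrees.
text_lines = [(x, y) for y in (10, 20, 30) for x in range(100)]
skewed = rotate(text_lines, 2.0)
angle = estimate_skew(skewed)
deskewed = rotate(skewed, -angle)
```

On real scans the search would run over the dark pixels of the page image, and the corrected image would be produced by rotating the whole bitmap rather than a list of points.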
Auto Zoom Content
This option tries to identify the content block of each page and scale it to the full size of the page. This feature is helpful when you are dealing with documents that share the same layout but were scanned at varying zoom levels.
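The idea can be sketched as finding the bounding box of all ink pixels and computing the largest uniform scale factor at which that box still fits on the page. The Python example below is only an illustration under those assumptions, not Docparser's actual method; the function name `content_zoom` and the `margin` parameter are hypothetical.

```python
def content_zoom(image, margin=0):
    """Return ((left, top, right, bottom), scale) for a binary image
    (1 = ink), or None for a blank page. The scale is the largest
    uniform factor at which the content box fits inside the page,
    leaving `margin` pixels free on every side."""
    page_height, page_width = len(image), len(image[0])
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x in range(page_width) if any(row[x] for row in image)]
    if not rows:
        return None
    left, right, top, bottom = cols[0], cols[-1], rows[0], rows[-1]
    box_width = right - left + 1
    box_height = bottom - top + 1
    scale = min(
        (page_width - 2 * margin) / box_width,
        (page_height - 2 * margin) / box_height,
    )
    return (left, top, right, bottom), scale

# An 8x8 page whose content only occupies a 4x2 block.
page = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in range(2, 6):
        page[y][x] = 1

box, scale = content_zoom(page)
```

Here the content is four pixels wide and two pixels high, so the width is the limiting dimension and determines the zoom factor. Applying the same normalization to every page brings scans with different zoom levels to comparable positions, which is what position-based extraction relies on.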