Basic form design and capture in IBM Datacap
Taskmaster Capture
This post highlights some of the functionality that Taskmaster provides to help
ensure good form design. It also covers general guidelines to follow.
PaperGray font
Datacap ships with its own font called PaperGray. This font is in the
C:\Datacap\support\fonts directory.
Forms that are designed with this font allow segregation of handwritten
characters without the use of lines, which can interfere with captured text. With
this font, the image enhance despeckle function can remove the boxes easily and
effectively with minimal impact to the written characters. Although the use of
lines enforces character separation, the line removal process can, in some
situations, affect the written characters that we are attempting to capture.
Notice how the boxes have been removed with minimal impact to the text.
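To make the despeckle idea concrete, here is a minimal sketch in Python. It is not Datacap's implementation. It assumes that the boxes scan as small isolated specks while handwriting forms larger connected strokes, and the name despeckle and the max_speck_size threshold are hypothetical.

```python
from collections import deque


def despeckle(image, max_speck_size):
    """Return a copy of a bi-tonal image with small black specks removed.

    image: list of rows, each a list of 0 (white) or 1 (black).
    max_speck_size: connected black regions with this many pixels or fewer
    are erased (an illustrative threshold, not a Datacap setting).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect one 8-connected region of black pixels.
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the region only if it is small enough to count as a speck.
            if len(region) <= max_speck_size:
                for ry, rx in region:
                    result[ry][rx] = 0
    return result
```

With max_speck_size set to 1, isolated single dots are erased while a three-pixel pen stroke is kept. Setting the threshold too high would start to erase small parts of real characters, such as the dot on an "i".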
Barcodes
Use one-dimensional (1D) and two-dimensional (2D) barcodes where possible. These
barcodes aid in identifying a document. They can also carry a large amount of
data and, in some situations, all the data you need from the form.
Ideally, barcodes should be printed or attached so that they are square with the
page. Barcodes that are attached at extreme angles can be difficult to capture.
Colors
Use of color helps to create appealing forms. However, the colors must be
within a range that scanners can drop out (that is, remove from the scanned
image, as we need for the constrained text box lines).
Most color scanners have a drop-out color that you can specify, which is
typically red or green. These scanners can produce almost the same output as
the PaperGray font after despeckling.
Some scanners can produce two images simultaneously. One image is a color
image that you can use for export. The other image is a bi-tonal (black and white)
TIFF image with the color removed, which we can use for processing.
Always test the color that you want to use before you print large quantities of
forms to ensure successful drop out.
Use of colored paper for forms can also affect scanning quality.
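The following simplified sketch in Python illustrates why a drop-out color disappears. Real scanners do this in hardware or in the driver, so this is only an illustration. It assumes a red drop-out color and an arbitrary threshold of 128, and the function name drop_out_red is hypothetical.

```python
def drop_out_red(pixels, threshold=128):
    """Convert RGB pixels to bi-tonal using only the red channel.

    pixels: list of rows of (r, g, b) tuples, each value 0-255.
    Returns rows of 0 (white) or 1 (black).

    Red ink is bright in the red channel, just like white paper, so it
    falls on the white side of the threshold. Black or blue ink is dark
    in the red channel and stays black.
    """
    return [[1 if r < threshold else 0 for (r, g, b) in row] for row in pixels]


# One row: white paper, red form line, black ink, blue ink.
row = [(255, 255, 255), (200, 30, 30), (20, 20, 20), (20, 20, 200)]
print(drop_out_red([row]))  # [[0, 0, 1, 1]]
```

The red form line ends up white, while black and blue ink survive. A pale or orange-tinted red might not clear the threshold, which is one reason to test the printed color on the actual scanner.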
Fonts
Ideally, use a font size of 10–14 points for data that will be captured. Smaller
or larger fonts can start to cause issues with the Optical Character Recognition
(OCR) engines.
Resolution
The resolution at which a form is scanned can determine the quality of the OCR,
Intelligent Character Recognition (ICR), or Optical Mark Recognition (OMR)
results. A low-resolution image can make some characters illegible to the OCR,
ICR, and OMR engines and cause low confidence or incorrect reads. A higher
resolution, although it gives better quality, can pick up additional marks on the
form, increasing the number of incorrect reads. A higher resolution also
increases the size of the image that is being stored.
Determine the resolution on a case-by-case basis. However, a general rule is to
use 200–300 dpi. Always use at least 300 dpi for OCR/A.
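As a rough illustration of the trade-off, the Python sketch below estimates how pixel dimensions, uncompressed image size, and character height change with resolution. The 8.5 x 11 inch page is an assumption, the function names are hypothetical, and stored TIFF images are normally compressed, so the byte counts are useful only for comparison.

```python
def scan_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bitonal_bytes(width_in, height_in, dpi):
    """Size of a 1-bit-per-pixel image before compression, in bytes."""
    width_px, height_px = scan_dimensions(width_in, height_in, dpi)
    bytes_per_row = (width_px + 7) // 8  # each row is padded to a whole byte
    return bytes_per_row * height_px


def glyph_height_px(point_size, dpi):
    """Approximate height in pixels of a font's em square (1 point = 1/72 inch)."""
    return point_size * dpi / 72


# Assumed page size: 8.5 x 11 inches.
print(uncompressed_bitonal_bytes(8.5, 11, 200))  # 468600
print(uncompressed_bitonal_bytes(8.5, 11, 300))  # 1052700
print(glyph_height_px(12, 300))                  # 50.0
```

Going from 200 to 300 dpi multiplies the pixel count by about 2.25, because the resolution applies in both directions. The same step raises the em height of 12-point text from about 33 to 50 pixels.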
Layout
Use of constrained text boxes for handwriting recognition is important in
achieving good results. The text boxes help to define the area where text will
reside, the number of characters expected, and potentially the type of character,
that is, numeric or alphabetic. They also define the size of characters that is
required. These text boxes must be of adequate size so that the person who is
completing the form can write legibly. Use the PaperGray font to create
constrained boxes (see “PaperGray font”).
Try to get the person who is completing the form to use black ink and to write in
clear, well-formed uppercase characters where possible. Therefore, include
instructions in a noticeable area of the form that advise the person completing
the form to follow these guidelines. These guidelines can assist in improving the
accuracy of the ICR engine.
Ensure that OMR check boxes are of adequate size. Check boxes must not be
so small or so close together that the person who is completing the form selects
multiple check boxes. They also must not be so large that the person marks only
a small portion of the box.
Where possible, do not place constrained text boxes or OMR fields close to the
edge of the form. When a form is scanned, a slight misalignment can lead to
parts of the image not scanning correctly, resulting in a loss of data.
Constrained text boxes that contain hint characters can also cause issues when
scanning if the hint character is not properly removed.
Scanning
To obtain good capture results, use a suitable scanner. When scanning from
multiple scanners, the results can differ. The quality from one scanner can be
worse than the quality of other scanners. Poor scanning quality can lead to
poorly recognized documents. Therefore, make sure that scanners undergo
routine maintenance to ensure that they are all working optimally and are not
outputting poor quality images.
As indicated earlier, try not to use color on the forms. Although this practice is
preferred, certain colors that can be dropped out are permissible. Some
scanners can drop out colors at scan time, meaning that they never make it to the
original image.
Separator sheets
When using barcode separator sheets, print the separator sheets on lightly
colored paper. Lightly colored sheets are easy to pick out of the scanned
batches and reuse.
Test any colored paper that you use for separator sheets to ensure that the
background appears white. A light blue or yellow color works well with most
scanners.
If you are using both document and attachment separator sheets, use a
different color for each type so that sorting is easier after the separators are removed.
Always use the first generation of barcode separator sheets. Repeated
photocopying of the original causes eventual degradation of the barcode to the
point where it might become unreadable.
Hi Thrinath
At my organisation, we are trying to do a PoC with IBM Datacap. However, the current form that we have uses text boxes for the content. This is good practice, as it nudges people not to write in cursive and so improves the legibility of the characters, but the boundaries of each text box are being captured and read as consecutive "1"s.
For example, if I write my name as "ADITI" in a text box, it is captured in the system as "A1D1I1T1". What could be a plausible workaround for this problem? Please share your views.
Thanks
Hi,
I too face these types of problems. They are caused by low-quality images. Use better-quality scanned images (300 dpi) and apply the image enhancement settings effectively. Use icr_c for the add fingerprint rule set, for the page-id rule set, and in the Recognize rule set.
For this type of image, I change the line removal settings as follows:
True
85
3
30
20
50
Hope it helps.
Good luck.