Saturday, June 4, 2011

GENERATING STRUCTURED CONTENT FROM UNSTRUCTURED DOCUMENTS


  • In a typical Taskmaster application, documents start as a batch of unidentified image files, one image per page. A single batch may contain a mix of document types, and each document may contain a different number of pages of different types. Nothing within a page image identifies the page type or any of the data on the page. In other words, the page images contain no structured content.



  • Before Taskmaster can begin to extract data, it must identify the individual page types. There are several ways to do this, but the most common technique is called fingerprint matching (described later in this section). Taskmaster then uses the information in the document hierarchy to map pages to documents and fields to pages. After identifying the fields and their locations within each page, Taskmaster extracts the data and stores it in a structured format known as the runtime batch hierarchy.
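
The flow described above (identify page types, map pages to documents, map fields to pages, store the result as a hierarchy) can be sketched in a few lines of Python. This is only a toy illustration, not Taskmaster's actual implementation or API: the feature sets standing in for page images, the Jaccard-similarity matcher, the 0.5 threshold, and every name below (FINGERPRINTS, HIERARCHY, identify_page_type, build_batch, the "zones" dictionary) are assumptions made up for this example.

```python
# Toy stand-ins only: none of these names come from Taskmaster itself.

# Hypothetical fingerprints: page type -> set of layout features.
FINGERPRINTS = {
    "InvoiceMain": {"logo_top_left", "table_center", "total_box"},
    "InvoiceTrailer": {"signature_line", "terms_block"},
    "Receipt": {"narrow_column", "barcode_bottom"},
}

# Hypothetical document hierarchy: document type -> page type -> field names.
HIERARCHY = {
    "Invoice": {
        "InvoiceMain": ["invoice_number", "total"],
        "InvoiceTrailer": ["signed_by"],
    },
    "Receipt": {"Receipt": ["amount"]},
}


def identify_page_type(features, threshold=0.5):
    """Return the best-matching page type by Jaccard similarity, or None."""
    best_type, best_score = None, 0.0
    for page_type, fingerprint in FINGERPRINTS.items():
        score = len(features & fingerprint) / len(features | fingerprint)
        if score > best_score:
            best_type, best_score = page_type, score
    return best_type if best_score >= threshold else None


def document_type_for(page_type):
    """Look up which document type a page type belongs to, or None."""
    for doc_type, page_types in HIERARCHY.items():
        if page_type in page_types:
            return doc_type
    return None


def build_batch(pages):
    """Group identified pages into documents and collect each page's fields."""
    batch = []
    for page in pages:
        page_type = identify_page_type(page["features"])
        doc_type = document_type_for(page_type)
        if doc_type is None:
            continue  # unidentified page; skipped in this sketch
        first_page_type = next(iter(HIERARCHY[doc_type]))
        # Start a new document when the type changes or a first page appears.
        if not batch or batch[-1]["type"] != doc_type or page_type == first_page_type:
            batch.append({"type": doc_type, "pages": []})
        # "zones" stands in for text already recognized at each field location.
        fields = {name: page["zones"].get(name) for name in HIERARCHY[doc_type][page_type]}
        batch[-1]["pages"].append({"type": page_type, "fields": fields})
    return batch


# Three unlabeled pages, mimicking a mixed batch.
pages = [
    {"features": {"logo_top_left", "table_center", "total_box"},
     "zones": {"invoice_number": "A-17", "total": "120.00"}},
    {"features": {"signature_line", "terms_block", "stamp"},
     "zones": {"signed_by": "J. Doe"}},
    {"features": {"narrow_column", "barcode_bottom"},
     "zones": {"amount": "9.50"}},
]

batch = build_batch(pages)
for document in batch:
    print(document["type"], [p["type"] for p in document["pages"]])
```

In this sketch the nested list of documents, pages, and fields plays the role of the runtime batch hierarchy: the input pages carry no labels, and all structure in the output comes from matching against the fingerprints and walking the document hierarchy.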

