Abstract:
This paper presents a feature-based system that uses domain knowledge to segment and classify scanned document images. Documents usually consist of a mixture of text and images. A text block has the useful property that its x-profile or y-profile forms a periodic pattern. For an image block, a connectivity histogram is generated by summing the number of dark pixels that share the same connectivity value. First, a one-scan run-length smearing algorithm (RLSA) with block merging is proposed to segment the document. After segmentation, each segmented block is classified using rules induced from the features, or primitives, associated with each document. In this system, proper use of domain knowledge proves effective in speeding up segmentation and reducing classification error.
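
As a rough illustration of the run-length smearing idea mentioned above, the following is a minimal sketch of the classic horizontal RLSA, not the one-scan variant with block merging proposed in this paper. It assumes a binary image stored as a list of rows in which 1 is a dark pixel and 0 is a light pixel. The function names `rlsa_row` and `rlsa_horizontal`, the `threshold` parameter, and the choice to leave light runs at the row borders untouched are all assumptions made for this example.

```python
def rlsa_row(row, threshold):
    """Smear one row: fill short light gaps that lie between two dark pixels.

    Assumed convention (not from the paper): 1 = dark pixel, 0 = light pixel.
    """
    out = list(row)
    last_dark = None  # index of the most recent dark pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_dark is not None:
                gap = i - last_dark - 1  # number of light pixels in between
                # Fill the gap only if it is non-empty and short enough.
                if 0 < gap <= threshold:
                    for j in range(last_dark + 1, i):
                        out[j] = 1
            last_dark = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row-wise smearing to every row of a binary image."""
    return [rlsa_row(row, threshold) for row in image]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ]
    for smeared in rlsa_horizontal(page, threshold=2):
        print(smeared)
```

With a threshold of 2, the two-pixel gap in the first row is filled while the four-pixel gap is left open, so nearby dark pixels merge into one block and distant ones stay separate. A vertical pass would apply the same rule to columns.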