Abstract:
The Web is increasingly becoming a very
large information source. However, the
information is visually structured such that it is
easy for humans to recognize data records and
presentation patterns, but not for computers. As
web sites are getting more complicated, the
construction of web information extraction
systems becomes more troublesome and time-consuming.
Hence, tools for the mining of data
regions, data records and data items need to be
developed in order to provide value-added
services. A large number of techniques have been
proposed to address this problem, but all of them
have inherent limitations. In this paper, we
propose an approach for automatic data record
extraction from web pages, which we call
Vision based Extraction of data Record (VER).
The approach is based on the observation that
data records in a web document are visually
similar. First, we adopt the VIPS (Vision-based
Page Segmentation) algorithm to partition a web
page into semantic blocks. Then, the blocks are
clustered by the proposed block clustering method
according to their appearance similarity. Among
these clusters, we identify the data region and finally
extract the data records from it.
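
The pipeline described above (segment, cluster by appearance, pick the data region, extract records) can be illustrated with a minimal Python sketch. It is not the VER implementation. VIPS itself is not reproduced: the sketch assumes the page has already been segmented into blocks. The Block fields, the similarity test in similar(), the tolerance tol, and the "largest cluster is the data region" heuristic are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A visual block as a segmenter might emit it (fields are assumed)."""
    x: int
    y: int
    width: int
    height: int
    font_size: int
    text: str


def similar(a, b, tol=0.1):
    """Assumed appearance test: same left edge and font size,
    with width and height within a relative tolerance."""
    def close(p, q):
        return abs(p - q) <= tol * max(p, q, 1)

    return (a.x == b.x and a.font_size == b.font_size
            and close(a.width, b.width) and close(a.height, b.height))


def cluster_blocks(blocks, tol=0.1):
    """Greedy clustering: each block joins the first cluster whose
    first member it resembles, otherwise it starts a new cluster."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if similar(cluster[0], block, tol):
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


def extract_records(blocks, tol=0.1):
    """Assumed heuristic: the cluster with the most blocks (ties broken
    by total area) is the data region; return its blocks top to bottom."""
    clusters = cluster_blocks(blocks, tol)
    if not clusters:
        return []
    region = max(
        clusters,
        key=lambda c: (len(c), sum(b.width * b.height for b in c)),
    )
    return sorted(region, key=lambda b: b.y)
```

Given a list of Block objects for a page with a header, a footer, and several similarly shaped item blocks, extract_records(blocks) returns the item blocks in vertical order. A real system would likely compare richer visual features (colors, images, link styles) than the few used here.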