Abstract:
As web sites become more complicated,
constructing web information extraction
systems becomes more troublesome and time-consuming.
A common difficulty is locating the segments
of a page that contain the target information,
which we call informative blocks. Discriminating
informative blocks from noisy blocks and then
extracting them from a web page is therefore an
important task. In this paper, we propose a method that utilizes
both visual features and semantic information to
extract informative blocks. First, the VIPS (Vision-based
Page Segmentation) algorithm is used to
partition a web page into semantic blocks with a
hierarchical structure. Then spatial features (such as
position and size) and content features (the number of
images and links) are extracted to construct a feature
vector for each block. Second, based on these
features, the blocks with similar content structures
and spatial structures are clustered by means of
similarity computation. Finally, the cluster with the
largest size and the smallest distance to the centre
of the page is selected as the informative block.
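
The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the page has already been segmented (for example by VIPS) into blocks with known bounding boxes and image/link counts. The `Block` fields, the distance-based similarity, the greedy clustering, the threshold of 0.6, and the scoring rule that combines cluster area with distance to the page centre are all illustrative assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from math import hypot, sqrt


@dataclass
class Block:
    """A page segment, e.g. one leaf of a VIPS block tree (fields are assumed)."""
    x: float        # left edge, pixels
    y: float        # top edge, pixels
    width: float
    height: float
    n_images: int
    n_links: int


def feature_vector(b, page_w, page_h, count_cap=10):
    # Spatial features are normalised by the page size; content counts are
    # capped and scaled to [0, 1] so no single feature dominates (assumption).
    return (
        b.x / page_w, b.y / page_h,
        b.width / page_w, b.height / page_h,
        min(b.n_images, count_cap) / count_cap,
        min(b.n_links, count_cap) / count_cap,
    )


def similarity(u, v):
    # Maps Euclidean distance to (0, 1]; identical vectors give 1.0.
    return 1.0 / (1.0 + sqrt(sum((a - b) ** 2 for a, b in zip(u, v))))


def cluster_blocks(blocks, page_w, page_h, threshold=0.6):
    # Greedy single-pass clustering: a block joins the first cluster whose
    # first member is similar enough, otherwise it starts a new cluster.
    clusters = []
    for b in blocks:
        fb = feature_vector(b, page_w, page_h)
        for c in clusters:
            if similarity(fb, feature_vector(c[0], page_w, page_h)) >= threshold:
                c.append(b)
                break
        else:
            clusters.append([b])
    return clusters


def cluster_score(cluster, page_w, page_h):
    # Assumed scoring: covered area fraction, discounted by the mean distance
    # of block centres from the page centre (in units of the half-diagonal).
    area = sum(b.width * b.height for b in cluster) / (page_w * page_h)
    half_diag = hypot(page_w, page_h) / 2
    mean_dist = sum(
        hypot(b.x + b.width / 2 - page_w / 2, b.y + b.height / 2 - page_h / 2)
        for b in cluster
    ) / len(cluster) / half_diag
    return area / (1.0 + mean_dist)


def find_informative_cluster(blocks, page_w, page_h, threshold=0.6):
    # Picks the cluster with the highest score; expects at least one block.
    clusters = cluster_blocks(blocks, page_w, page_h, threshold)
    return max(clusters, key=lambda c: cluster_score(c, page_w, page_h))
```

On a hypothetical 1000 x 2000 pixel page with a navigation bar, a sidebar, three stacked article blocks, and a footer, calling `find_informative_cluster(blocks, 1000, 2000)` would be expected to group the three article blocks together and select them, because they cover the most area and sit closest to the page centre. The threshold and the scoring rule would need tuning on real pages.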