Discovering Informative Content Blocks for Efficient Web Data Extraction

Hlaing, Nwe Nwe; Nyunt, Thi Thi Soe

dc.contributor.author	Hlaing, Nwe Nwe
dc.contributor.author	Nyunt, Thi Thi Soe
dc.date.accessioned	2019-07-25T04:33:38Z
dc.date.available	2019-07-25T04:33:38Z
dc.date.issued	2010-12-16
dc.identifier.uri	http://onlineresource.ucsy.edu.mm/handle/123456789/1265
dc.description.abstract	As web sites grow more complicated, constructing web information extraction systems becomes more troublesome and time-consuming. A common difficulty is locating the segments of a page that contain the target information, which we call informative blocks. Discriminating informative blocks from noisy blocks and then extracting them from the web page is therefore an important task. In this paper, we propose a method that uses both visual features and semantic information to extract informative blocks. First, the VIPS (Vision-based Page Segmentation) algorithm is used to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (the number of images and links) are extracted to construct a feature vector for each block. Second, based on these features, blocks with similar content and spatial structures are clustered by means of similarity computation. After clustering, the cluster with the largest size and the nearest distance to the centre of the page is selected as the informative block.	en_US
dc.language.iso	en	en_US
dc.publisher	Fifth Local Conference on Parallel and Soft Computing	en_US
dc.subject	Vision-based Page Segmentation	en_US
dc.subject	Information Extraction	en_US
dc.subject	Block Clustering	en_US
dc.title	Discovering Informative Content Blocks for Efficient Web Data Extraction	en_US
dc.type	Article	en_US
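
The clustering and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the clustering procedure, or how "largest size" is measured, so everything below is an assumption made for illustration. The Block class, the ratio-based similarity, the 0.8 threshold, the greedy clustering, and the reading of "size" as total rendered area (with distance to the page centre as the tie-breaker) are all hypothetical. The sketch also assumes the page has already been segmented, for example by VIPS, into blocks with known geometry and counts.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Block:
    """One segmented page block (hypothetical structure, not from the paper)."""
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float
    height: float
    n_images: int
    n_links: int

    def features(self):
        # Assumed feature vector: horizontal position, size, and content counts.
        return (self.x, self.width, self.height,
                float(self.n_images), float(self.n_links))

    def area(self):
        return self.width * self.height

    def centre(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def similarity(a, b):
    """Assumed measure: mean of per-feature min/max ratios, in [0, 1]."""
    ratios = []
    for u, v in zip(a.features(), b.features()):
        hi = max(u, v)
        # Two zero values count as identical for that feature.
        ratios.append(1.0 if hi == 0 else min(u, v) / hi)
    return sum(ratios) / len(ratios)


def cluster_blocks(blocks, threshold=0.8):
    """Greedy clustering: a block joins the first cluster whose first
    member is at least `threshold` similar, otherwise it starts a new one."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if similarity(block, cluster[0]) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


def pick_informative(clusters, page_width, page_height):
    """Pick the cluster with the largest total area; ties are broken by the
    smallest mean distance from block centres to the page centre."""
    cx, cy = page_width / 2, page_height / 2

    def key(cluster):
        total_area = sum(b.area() for b in cluster)
        mean_dist = sum(
            hypot(b.centre()[0] - cx, b.centre()[1] - cy) for b in cluster
        ) / len(cluster)
        return (-total_area, mean_dist)

    return min(clusters, key=key)
```

To use the sketch, build a list of Block objects for one page, pass it to cluster_blocks, and then pass the resulting clusters together with the page dimensions to pick_informative. A weighted score that combines size and distance would be an equally plausible reading of the abstract.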