Automatic Extraction of Data Record from Web Page based on Visual Features

Hlaing, Nwe Nwe; Nyunt, Thi Thi Soe

Ninth International Conference On Computer Applications (ICCA 2011)

URI: http://onlineresource.ucsy.edu.mm/handle/123456789/149

Date: 2011-05-05

Abstract:

The Web is increasingly becoming a very large information source. However, its information is structured visually, so data records and presentation patterns are easy for humans to recognize but not for computers. As web sites grow more complicated, constructing a web information extraction system becomes more troublesome and time-consuming. Hence, tools for mining data regions, data records and data items need to be developed in order to provide value-added services. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations. In this paper, we propose an automatic method for extracting data records from web pages, which we call Vision based Extraction of data Record (VER). The approach is based on the observation that the data records in a web document are visually similar to one another. First, we adopt the VIPS (Vision-based Page Segmentation) algorithm to partition a web page into semantic blocks. The blocks are then clustered by the proposed block clustering method according to their appearance similarity. Among these clusters, we identify the data region and finally extract the data records from it.
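The steps named in the abstract (segment, cluster by appearance, pick the data region, extract records) can be illustrated with a minimal sketch. It is not the authors' VER implementation: the abstract does not specify the similarity measure, the clustering method, or the rule for choosing the data region. The `Block` fields, the `similar` test, the 10% tolerance, and the "largest cluster is the data region" rule are all assumptions made for illustration, and the blocks are entered by hand rather than produced by VIPS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A visual block such as a page segmenter might emit (hypothetical fields)."""
    x: int
    y: int
    width: int
    height: int
    font_size: int
    text: str


def similar(a: Block, b: Block, tol: float = 0.1) -> bool:
    """Assumed appearance test: same left edge and font size,
    with width and height within a relative tolerance."""
    def close(p: int, q: int) -> bool:
        return abs(p - q) <= tol * max(p, q, 1)

    return (a.x == b.x and a.font_size == b.font_size
            and close(a.width, b.width) and close(a.height, b.height))


def cluster_blocks(blocks):
    """Greedy clustering: a block joins the first cluster whose first
    member it resembles, otherwise it starts a new cluster."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if similar(cluster[0], block):
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


def extract_records(blocks):
    """Treat the largest cluster as the data region (an assumption)
    and return its blocks in top-to-bottom order."""
    clusters = cluster_blocks(blocks)
    if not clusters:
        return []
    region = max(clusters, key=len)
    return sorted(region, key=lambda b: b.y)


# Hand-written blocks standing in for segmenter output.
blocks = [
    Block(0, 0, 800, 60, 24, "Site header"),
    Block(100, 100, 600, 120, 12, "Record A"),
    Block(100, 230, 600, 118, 12, "Record B"),
    Block(100, 360, 600, 125, 12, "Record C"),
    Block(0, 500, 800, 40, 10, "Footer"),
]
records = extract_records(blocks)
print([r.text for r in records])
```

In this toy page the three record blocks share a left edge, font size and near-identical dimensions, so they fall into one cluster, while the header and footer each form a cluster of their own.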
