Informative Content Extraction for Web Page using Text Density and Vision-based Page Segmentation (VIPS) Algorithm Integration

Mon, Ei Phyu Phyu; Yuzana

dc.contributor.author	Mon, Ei Phyu Phyu
dc.contributor.author	Yuzana
dc.date.accessioned	2019-07-19T15:03:43Z
dc.date.available	2019-07-19T15:03:43Z
dc.date.issued	2017-12-27
dc.identifier.uri	http://onlineresource.ucsy.edu.mm/handle/123456789/1097
dc.description.abstract	Web pages consist not only of actual content but also of other elements such as branding banners, navigational elements, advertisements, copyright notices, etc. Irrelevant content in a Web page is treated as noisy content, which is typically unrelated to the main subject of the page. A method is therefore needed to extract the informative content and discard the noisy content from Web pages. This system uses an integration of textual and visual importance features to extract the informative content from Web pages. Initially, a Web page is converted into a Document Object Model (DOM) tree. For each node in the DOM tree, textual and visual importance are calculated. Textual importance and visual importance are combined to form a hybrid density. DensitySum is then calculated and used in the content extraction algorithm to extract the informative content from Web pages. The algorithm is tested on Web pages from various domains and in various styles. The performance of Web content extraction is measured by calculating precision and recall.	en_US
dc.language.iso	en	en_US
dc.publisher	Eighth Local Conference on Parallel and Soft Computing	en_US
dc.title	Informative Content Extraction for Web Page using Text Density and Vision-based Page Segmentation (VIPS) Algorithm Integration	en_US
dc.type	Article	en_US
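
The pipeline outlined in the abstract (DOM tree, per-node density, DensitySum, selection of the content node) can be illustrated with a minimal sketch. This is not the authors' implementation. The record does not give the formulas, so the sketch assumes that text density is characters per tag, that hybrid density is text density multiplied by a visual weight, and that DensitySum is the sum of the hybrid densities of a node's children. All names (`Node`, `uniform_weight`, `extract_main`, and so on) are hypothetical, and the visual weight is a placeholder for whatever a VIPS-style analysis would supply.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these tags is not visible content and is ignored.
SKIP_TEXT_TAGS = {"script", "style"}


class Node:
    """Minimal DOM node: tag, attributes, children, and directly owned text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TEXT_TAGS:
            self.current.text += data.strip()


def char_count(node):
    return len(node.text) + sum(char_count(c) for c in node.children)


def tag_count(node):
    return 1 + sum(tag_count(c) for c in node.children)


def text_density(node):
    # Assumed definition: characters per tag in the subtree.
    return char_count(node) / tag_count(node)


def uniform_weight(node):
    # Placeholder visual importance; a real system would derive it from layout.
    return 1.0


def hybrid_density(node, visual_weight=uniform_weight):
    # Assumed combination: textual importance scaled by visual importance.
    return text_density(node) * visual_weight(node)


def density_sum(node, visual_weight=uniform_weight):
    # Assumed definition: sum of the children's hybrid densities.
    return sum(hybrid_density(c, visual_weight) for c in node.children)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_main(html, visual_weight=uniform_weight):
    """Return the node with the largest DensitySum."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(iter_nodes(builder.root),
               key=lambda n: density_sum(n, visual_weight))
```

Calling `extract_main(page_html)` returns the node whose children carry the most text per tag, which for a typical article page is the main content container rather than a navigation bar or footer. Passing a different `visual_weight` function (for example, one that down-weights blocks at the page margins) is where visual features would enter under these assumptions.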