Extracting Informative Content from Web Pages Using Content Extraction Algorithm

Hlaing, Yu Wai

dc.contributor.author	Hlaing, Yu Wai
dc.date.accessioned	2019-07-12T04:41:42Z
dc.date.available	2019-07-12T04:41:42Z
dc.date.issued	2013-02-26
dc.identifier.uri	http://onlineresource.ucsy.edu.mm/handle/123456789/844
dc.description.abstract	Apart from the main content blocks, almost all web pages on the Internet contain blocks such as navigation, copyright information, privacy notices, and advertisements, which are not related to the topic of the web page. These blocks are called noisy blocks, and the main content blocks are called informative blocks. The information contained in the noisy blocks can seriously harm Web mining and searching. Discriminating informative blocks from noisy blocks and then extracting the information contained in the informative blocks is therefore an important task. This paper studies the problem of automatically extracting web information (unsupervised IE) without any learning examples or other similar human input. First, web pages are segmented into several raw chunks. Then the noisy blocks are removed based on product features. Content extraction is based on the relation among punctuation mark density, length of information text, and anchor text density. This approach requires no human intervention, no prior knowledge of the input HTML page, and no training set.	en_US
dc.language.iso	en	en_US
dc.publisher	Eleventh International Conference On Computer Applications (ICCA 2013)	en_US
dc.subject	Web Mining	en_US
dc.subject	Information Extraction (IE)	en_US
dc.subject	Unsupervised IE	en_US
dc.subject	Informative Blocks	en_US
dc.title	Extracting Informative Content from Web Pages Using Content Extraction Algorithm	en_US
dc.type	Article	en_US
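
The abstract names three signals for content extraction: punctuation mark density, length of information text, and anchor text density. The record does not give the paper's formulas or thresholds, so the following is only a minimal sketch of how such signals might be combined. The `Block` type, the function names, the punctuation set, and all threshold values are assumptions made for illustration, not the author's algorithm, and segmentation of the HTML page into blocks is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import List

# Assumed set of punctuation marks; the paper's actual set is not given.
PUNCTUATION = set(".,;:!?")


@dataclass
class Block:
    """A hypothetical text chunk produced by an earlier segmentation step."""
    text: str         # all visible text in the block
    anchor_text: str  # the portion of that text that sits inside links


def punctuation_density(block: Block) -> float:
    """Punctuation marks per character of block text."""
    if not block.text:
        return 0.0
    marks = sum(1 for ch in block.text if ch in PUNCTUATION)
    return marks / len(block.text)


def anchor_density(block: Block) -> float:
    """Fraction of the block's characters that belong to link text."""
    if not block.text:
        return 0.0
    return min(1.0, len(block.anchor_text) / len(block.text))


def is_informative(
    block: Block,
    min_length: int = 80,      # assumed threshold
    min_punct: float = 0.01,   # assumed threshold
    max_anchor: float = 0.3,   # assumed threshold
) -> bool:
    """Treat long, punctuated, link-poor blocks as informative."""
    return (
        len(block.text) >= min_length
        and punctuation_density(block) >= min_punct
        and anchor_density(block) <= max_anchor
    )


def extract_informative(blocks: List[Block]) -> List[Block]:
    """Keep only the blocks that pass the heuristic above."""
    return [b for b in blocks if is_informative(b)]
```

Under these assumed thresholds, a navigation bar (short, unpunctuated, entirely link text) would be discarded, while a paragraph of running prose with few links would be kept. A real implementation would need thresholds tuned against actual pages.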