UCSY's Research Repository

Constructing and Implementing a New DOM-based Content Extraction Algorithm

dc.contributor.author Moong, Nang Kham Line
dc.date.accessioned 2019-08-05T12:34:53Z
dc.date.available 2019-08-05T12:34:53Z
dc.date.issued 2009-12-30
dc.identifier.uri http://onlineresource.ucsy.edu.mm/handle/123456789/1775
dc.description.abstract The explosive growth of the Internet has made an enormous number of information sources available as HTML pages. However, many of these pages contain redundant material, known as Web page noise. For instance, almost all dot-com sites present a large amount of noise such as service channels, navigation panels, copyright and privacy announcements, and advertisements. Such noise can seriously harm Web mining, information retrieval, and information extraction. This paper proposes a new algorithm and shows how it can be used to deal with Web page noise. The proposed algorithm matches DOM trees to classify nodes as either noise or content and, after classification, clusters them into their respective groups. Finally, only the content group is extracted from the page. The resulting content is useful for both users and systems. The proposed technique improves the performance of Web content extraction. en_US
dc.language.iso en en_US
dc.publisher Fourth Local Conference on Parallel and Soft Computing en_US
dc.title Constructing and Implementing a New DOM-based Content Extraction Algorithm en_US
dc.type Article en_US
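
The abstract gives no implementation details, so the following is only a rough sketch of the general idea it describes (matching DOM structure to separate noise nodes from content nodes), not the paper's algorithm. It assumes that noise is whatever text appears at the same tag path in a second page from the same site; the names TextBlockCollector, collect_blocks, classify, and extract_content are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class TextBlockCollector(HTMLParser):
    """Collects a (tag path, text) pair for every non-empty text node."""

    # Tags that never have a closing tag, so they must not enter the path.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.blocks = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(("/".join(self.path), text))


def collect_blocks(html):
    """Flatten an HTML document into (tag path, text) pairs."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def classify(page_html, reference_html):
    """Split a page's text blocks into (content, noise).

    Assumption for this sketch: a block is noise if the same text occurs
    at the same tag path in the reference page.
    """
    reference = set(collect_blocks(reference_html))
    content, noise = [], []
    for block in collect_blocks(page_html):
        (noise if block in reference else content).append(block)
    return content, noise


def extract_content(page_html, reference_html):
    """Return only the text of the blocks classified as content."""
    content, _ = classify(page_html, reference_html)
    return [text for _, text in content]
```

For two pages that share a navigation panel and a copyright line but differ in their article paragraph, extract_content would be expected to return only the paragraph of the first page. A real system would need a more tolerant tree-matching step than exact path and text equality.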

