UCSY's Research Repository

Main Content Extraction from Dynamic Web Pages

dc.contributor.author San, Pan Ei
dc.contributor.author Aye, Nilar
dc.date.accessioned 2019-08-13T15:09:42Z
dc.date.available 2019-08-13T15:09:42Z
dc.date.issued 2015-03
dc.identifier.issn 2393-2835
dc.identifier.uri http://onlineresource.ucsy.edu.mm/handle/123456789/2130
dc.description.abstract Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure high-quality extraction, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is embedded in the HTML source code that makes up the page. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices. The proposed system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies each part of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Observation of HTML source shows that a single line may not contain a complete piece of information and that long texts are distributed across neighboring lines, so the system uses the Line-Block concept to determine the distance between any two neighboring lines that contain text. It then extracts features, namely the text-to-tag ratio (TTR), the anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD), to distinguish noise from content. After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content. en_US
dc.language.iso en en_US
dc.publisher International Journal of Advances in Electronics and Computer Science en_US
dc.relation.ispartofseries Volume-2, Issue-3;pp.1-5
dc.subject Content Extraction en_US
dc.subject Line-Block en_US
dc.subject TKD en_US
dc.subject TTR en_US
dc.subject ATTR en_US
dc.title Main Content Extraction from Dynamic Web Pages en_US
dc.type Article en_US
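
The three features named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no formulas, so the definitions below are assumptions (TTR as text characters per tag, ATTR as the share of text inside anchor elements, TKD as the fraction of block words that also occur in the page title). The function names and threshold values are hypothetical and chosen only for illustration. Tags are matched with regular expressions, in keeping with the abstract's point about avoiding DOM parsing.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_tags(html):
    """Return the visible text of an HTML fragment (regex-based, no DOM)."""
    return TAG_RE.sub("", html)


def block_features(block_html, title):
    """Compute (TTR, ATTR, TKD) for one block of HTML, under assumed definitions."""
    text = strip_tags(block_html).strip()
    tag_count = len(TAG_RE.findall(block_html))
    anchor_text = "".join(strip_tags(m) for m in ANCHOR_RE.findall(block_html)).strip()

    # TTR (assumed): text characters per tag; with no tags, use the text length.
    ttr = len(text) / tag_count if tag_count else float(len(text))
    # ATTR (assumed): share of the block's text that sits inside <a> elements.
    attr = len(anchor_text) / len(text) if text else 0.0
    # TKD (assumed): fraction of the block's words that are title keywords.
    keywords = {w.lower() for w in re.findall(r"\w+", title)}
    words = [w.lower() for w in re.findall(r"\w+", text)]
    tkd = sum(w in keywords for w in words) / len(words) if words else 0.0
    return ttr, attr, tkd


def is_content(block_html, title, min_ttr=10.0, max_attr=0.5, min_tkd=0.0):
    """Threshold rule with hypothetical cut-offs: text-dense, link-sparse blocks pass."""
    ttr, attr, tkd = block_features(block_html, title)
    return ttr >= min_ttr and attr <= max_attr and tkd >= min_tkd


if __name__ == "__main__":
    title = "Main Content Extraction"
    nav = '<li><a href="/">Home</a></li><li><a href="/news">News</a></li>'
    article = ("<p>Content extraction separates the main text of a page "
               "from navigation and advertising blocks.</p>")
    print(is_content(nav, title), is_content(article, title))
```

Under these assumed thresholds, a navigation list made up entirely of links is rejected (low TTR, high ATTR), while a paragraph of running text is accepted. In practice the cut-offs would need tuning on labeled pages.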

