UCSY's Research Repository

Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page

dc.contributor.author San, Pan Ei
dc.contributor.author Aye, Nilar
dc.date.accessioned 2019-07-02T08:25:12Z
dc.date.available 2019-07-02T08:25:12Z
dc.date.issued 2014-02-17
dc.identifier.uri http://onlineresource.ucsy.edu.mm/handle/123456789/90
dc.description.abstract Web information extraction systems are becoming more complex and time-consuming. A web page contains many informative blocks and noise blocks. Noise blocks are navigational elements, templates, and advertisements that are not part of the main content of the web page; they are also called noisy blocks or boilerplate text. Boilerplate text is typically unrelated to the main content, may degrade search precision, and therefore needs to be detected properly. This paper proposes a web page cleaning and main content block extraction approach for the purpose of improving the accuracy and efficiency of web content mining. The system uses structural features and shallow text features, such as number of words, link density, and average word length, to classify each part of the web page as main content or boilerplate text. The system then extracts the main content block using three parameters: title keywords, keyword-frequency-based block selection, and position features. Relevant content blocks are assigned a high importance level based on the similarity of their contents to other blocks. Experiments show that web page cleaning based on shallow features leads to more accurate and efficient classification of boilerplate and other content than existing approaches. en_US
dc.language.iso en en_US
dc.publisher Twelfth International Conference On Computer Applications (ICCA 2014) en_US
dc.subject Boilerplate Detection en_US
dc.subject Decision Tree en_US
dc.subject Shallow Text features en_US
dc.subject Web Content Mining en_US
dc.title Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page en_US
dc.type Article en_US
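
The shallow text features named in the abstract (number of words, link density, average word length) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` type, the function names, and the threshold values are hypothetical, and the simple threshold rule stands in for the decision tree classifier listed among the subject keywords.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical page block: its visible text and the part inside links."""
    text: str         # all visible text in the block
    anchor_text: str  # the portion of that text that sits inside <a> elements


def shallow_features(block: Block) -> dict:
    """Compute the three shallow text features mentioned in the abstract."""
    words = block.text.split()
    link_words = block.anchor_text.split()
    num_words = len(words)
    # Link density: share of the block's words that belong to link text.
    link_density = len(link_words) / num_words if num_words else 0.0
    # Average word length in characters.
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    return {
        "num_words": num_words,
        "link_density": link_density,
        "avg_word_length": avg_word_length,
    }


def looks_like_boilerplate(features: dict,
                           max_link_density: float = 0.33,
                           min_words: int = 10) -> bool:
    """Toy rule with assumed thresholds; the paper's classifier is not given here."""
    return (features["link_density"] > max_link_density
            or features["num_words"] < min_words)


if __name__ == "__main__":
    nav = Block("Home About Contact", "Home About Contact")
    print(shallow_features(nav), looks_like_boilerplate(shallow_features(nav)))
```

A short, link-heavy block such as a navigation bar is flagged by this rule, while a longer block with no link text is not. In practice the thresholds would be learned from labeled blocks instead of being fixed by hand.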

