Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page

San, Pan Ei; Aye, Nilar

dc.contributor.author	San, Pan Ei
dc.contributor.author	Aye, Nilar
dc.date.accessioned	2019-07-02T08:25:12Z
dc.date.available	2019-07-02T08:25:12Z
dc.date.issued	2014-02-17
dc.identifier.uri	http://onlineresource.ucsy.edu.mm/handle/123456789/90
dc.description.abstract	Web information extraction systems are becoming more complex and time-consuming. A web page contains many informative blocks and noise blocks. Noise blocks are navigational elements, templates, and advertisements that are not part of the main content of the web page; they are also called noisy blocks or boilerplate text. This boilerplate text is typically unrelated to the main content, may degrade search precision, and thus needs to be detected properly. This paper proposes a web page cleaning and main content block extraction approach with the aim of improving the accuracy and efficiency of web content mining. The system uses structural features and shallow text features, such as number of words, link density, and average word length, to classify the blocks of a web page as main content or boilerplate text. The system then extracts the main content block using three parameters: title keywords, keyword-frequency-based block selection, and position features. The relevant content blocks are assigned a high importance level based on the similarity of their contents to other blocks. Experiments show that web page cleaning based on shallow features leads to more accurate and efficient classification of boilerplate and other content than existing approaches.	en_US
dc.language.iso	en	en_US
dc.publisher	Twelfth International Conference On Computer Applications (ICCA 2014)	en_US
dc.subject	Boilerplate Detection	en_US
dc.subject	Decision Tree	en_US
dc.subject	Shallow Text features	en_US
dc.subject	Web Content Mining	en_US
dc.title	Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page	en_US
dc.type	Article	en_US
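
The shallow text features named in the abstract (number of words, link density, average word length) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `shallow_features` and `looks_like_boilerplate` are hypothetical, link density is assumed to mean the share of a block's words that sit inside anchor tags, and the fixed thresholds are illustrative stand-ins for the decision tree classifier listed in the record's subjects.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class _BlockParser(HTMLParser):
    """Collects all words in a block and, separately, the words inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.link_words = []
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        tokens = data.split()
        self.words.extend(tokens)
        if self._anchor_depth > 0:
            self.link_words.extend(tokens)


@dataclass
class ShallowFeatures:
    num_words: int
    link_density: float      # assumed: anchor words / all words
    avg_word_length: float   # mean characters per word


def shallow_features(block_html: str) -> ShallowFeatures:
    """Compute the three shallow text features for one HTML block."""
    parser = _BlockParser()
    parser.feed(block_html)
    parser.close()
    n = len(parser.words)
    if n == 0:
        return ShallowFeatures(0, 0.0, 0.0)
    return ShallowFeatures(
        num_words=n,
        link_density=len(parser.link_words) / n,
        avg_word_length=sum(len(w) for w in parser.words) / n,
    )


def looks_like_boilerplate(features: ShallowFeatures,
                           max_link_density: float = 0.5,
                           min_words: int = 10) -> bool:
    """Hypothetical threshold rule; a trained classifier would replace this."""
    return (features.link_density > max_link_density
            or features.num_words < min_words)


nav_block = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
text_block = "<p>" + "word " * 20 + '<a href="#">link</a></p>'

nav_features = shallow_features(nav_block)
text_features = shallow_features(text_block)
print(nav_features, looks_like_boilerplate(nav_features))
print(text_features, looks_like_boilerplate(text_features))
```

In this sketch a navigation block made up entirely of links is flagged as boilerplate, while a longer block with a single link is kept as candidate main content. Real pages would need thresholds or a classifier fitted to labeled data.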