Abstract:
Web Information Extraction systems
are becoming more complex and time-consuming. A Web
page contains many informative blocks and noise
blocks. Noise blocks are navigational elements,
templates and advertisements that are not the main
content blocks of the web page; they can be defined as
noisy blocks or boilerplate text. This boilerplate text
is typically not related to the main content, may
deteriorate search precision and thus needs to be
detected properly. This paper proposes a Web Page
cleaning and main content block extraction approach
with the purpose of improving the accuracy and
efficiency of web content mining. The system uses
structural features and shallow text features,
such as number of words, link density, and average
word length, to classify each block of the web page as main
content or boilerplate text. Then the
system extracts the main content block using three
parameters: Title Keyword, Keyword
Frequency based Block Selection, and position
features. The relevant content blocks are identified as
highly important based on the similarity of block
contents to other blocks. Experiments show that Web
Page cleaning based on shallow features leads to more
accurate and efficient classification of
boilerplate and other content than existing approaches.
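
As a rough illustration of the shallow text features named above (number of words, link density, and average word length), the following Python sketch computes them for a single block. It is not the system's implementation: the function names, the definition of link density as the share of words inside links, and the threshold values are all assumptions made for this example.

```python
import re


def shallow_features(text, link_text=""):
    """Compute three shallow text features for one block.

    text      -- all visible text of the block
    link_text -- the part of that text that sits inside links
                 (assumed to be extracted by the caller)
    """
    words = re.findall(r"\w+", text)
    link_words = re.findall(r"\w+", link_text)
    num_words = len(words)
    if num_words == 0:
        return {"num_words": 0, "link_density": 0.0, "avg_word_length": 0.0}
    return {
        "num_words": num_words,
        # Assumed definition: fraction of the block's words that are link words.
        "link_density": len(link_words) / num_words,
        "avg_word_length": sum(len(w) for w in words) / num_words,
    }


def looks_like_boilerplate(features, min_words=10, max_link_density=0.5):
    """Toy rule with hypothetical thresholds, not values from the paper."""
    return (
        features["num_words"] < min_words
        or features["link_density"] > max_link_density
    )


# A navigation bar: few words, all of them inside links.
nav = shallow_features("Home About Contact", "Home About Contact")

# A sentence of running text with no links.
article = shallow_features(
    "The quick brown fox jumps over the lazy dog near the quiet river bank"
)
```

In this sketch the navigation bar is flagged as boilerplate because it is short and fully linked, while the link-free sentence is not. A real classifier would learn such decision boundaries from labeled blocks rather than use fixed thresholds.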