Abstract:
A web page typically contains many information
blocks. Besides the main content, these include
navigation panels, copyright and privacy notices,
and advertisements. Such blocks are useful for
business purposes.
However, they are called noisy blocks, and they
can harm web data mining, so eliminating this
noise is of great importance. Noisy blocks usually
share common contents and presentation styles,
whereas the main contents of web pages differ in
content and presentation style.
Based on this observation, the system builds a
site style tree (SST) that captures the common
presentation styles and the actual contents of the
pages. An information-based algorithm is then used
to determine which parts of the SST represent
noise and which parts represent the main contents
of the site.
Experimental results show that eliminating noisy
information from web pages is effective for web
data mining. The system shows how much noisy
information can be removed from web pages,
measured by file size. Users can choose a web
page, and the system eliminates the unnecessary
noise using noise detection and web page cleaning
algorithms.
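
The idea of separating noise from main content can be illustrated with a minimal sketch. It is not the system's actual algorithm: it flattens the tree into block paths, assumes normalized Shannon entropy as the information measure, and uses a hypothetical threshold of 0.3. The names diversity, find_noisy_paths, and clean, as well as the sample pages, are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def diversity(values):
    """Normalized Shannon entropy of the values observed at one block position.

    0.0 means every page shows the same value; close to 1.0 means all differ.
    """
    m = len(values)
    if m < 2:
        return 1.0  # assumption: a single observation is treated as content
    counts = Counter(values)
    # Log base m keeps the result in the range [0, 1].
    return sum(-(c / m) * math.log(c / m, m) for c in counts.values())


def find_noisy_paths(pages, threshold=0.3):
    """Return block paths whose (style, text) pairs barely vary across pages.

    pages: list of dicts mapping a block path to a (style, text) pair.
    threshold: hypothetical cut-off below which a block counts as noise.
    """
    seen = defaultdict(list)
    for page in pages:
        for path, block in page.items():
            seen[path].append(block)
    return {path for path, blocks in seen.items() if diversity(blocks) < threshold}


def clean(page, noisy_paths):
    """Drop the blocks of one page that were detected as noise."""
    return {path: block for path, block in page.items() if path not in noisy_paths}


# Invented sample site: the navigation block repeats, the second block varies.
pages = [
    {"body/div[0]": ("nav", "Home | Products"),
     "body/div[1]": ("article", "Review of camera A")},
    {"body/div[0]": ("nav", "Home | Products"),
     "body/div[1]": ("article", "Review of phone B")},
    {"body/div[0]": ("nav", "Home | Products"),
     "body/div[1]": ("table", "Specs of laptop C")},
]
noisy = find_noisy_paths(pages)
cleaned = [clean(page, noisy) for page in pages]
```

In this sketch the navigation block has zero diversity across the three pages and is removed, while the second block differs on every page and is kept as main content.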