Abstract:
Unlike conventional data or text, a typical Web
page contains a good deal of clutter. It usually
includes noise data such as navigation panels,
copyright and privacy notices, and advertisements.
Such noise can seriously harm Web miners, which
extract the whole document rather than only the
informative content and therefore retrieve
non-relevant results. Eliminating these noise patterns
is thus of great importance. In this paper, we propose an effective
technique to detect and remove various noise
patterns from Web documents in order to enhance Web
mining. Our system first builds a DOM tree
for an incoming Web page and then splits it into subtrees
to detect noise data. We also apply a backpropagation
neural network algorithm to classify
the various noise patterns, data patterns, and mixture
patterns in the current Web page. The classification
result of the neural network is used to eliminate
the noise patterns. The proposed technique is
evaluated on several commercial Web sites and
news Web sites to demonstrate the performance of
our approach and the improvement it provides.
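
The first two steps described above (building a DOM tree and splitting it into subtrees) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Node` and `DomBuilder` classes, the rule of splitting at the direct children of `<body>`, and the use of link density as a per-subtree feature. The backpropagation classifier itself is not shown; a value such as link density is only one plausible input it might receive.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Hypothetical minimal DOM node: tag name, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the newly opened element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def find(node, tag):
    """Depth-first search for the first element with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def split_into_subtrees(root):
    """Assumed splitting rule: one subtree per direct child of <body>."""
    body = find(root, "body") or root
    return body.children


def text_lengths(node, inside_link=False):
    """Return (total_chars, chars_inside_links) for the subtree at node."""
    inside_link = inside_link or node.tag == "a"
    total = len(node.text)
    linked = total if inside_link else 0
    for child in node.children:
        t, l = text_lengths(child, inside_link)
        total += t
        linked += l
    return total, linked


def link_density(node):
    """Fraction of a subtree's text that sits inside <a> elements."""
    total, linked = text_lengths(node)
    return linked / total if total else 0.0


PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="story"><p>The council approved the new budget on Monday.</p></div>
<div id="footer">Copyright <a href="/privacy">Privacy</a></div>
</body></html>"""

builder = DomBuilder()
builder.feed(PAGE)
builder.close()
subtrees = split_into_subtrees(builder.root)
densities = [round(link_density(s), 2) for s in subtrees]
for subtree, density in zip(subtrees, densities):
    print(subtree.tag, density)
```

In this toy page the navigation block consists entirely of link text, the story block contains none, and the footer mixes both, which loosely mirrors the noise, data, and mixture patterns mentioned above. A real system would likely need more features and a trained classifier to separate them reliably.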