Abstract:
The rapid expansion of the Internet has made
the Web a popular place for disseminating and
collecting information. The noisy
items in Web pages are one of the major
obstacles to extracting the main content. It is also
important to detect noise and to distinguish
valuable information from noisy data within a
single Web page. In this paper, we propose a
noise detection technique based on the
Document Object Model (DOM) tree. In the DOM
tree, the weight of each node, calculated by the
tf-idf scheme, is incorporated into an entropy measure to
obtain a value for that node, which is compared with a
threshold value. Nodes whose values fall below the threshold
are regarded as noise. Experimental results on a
range of datasets, evaluated with precision and
recall, show that our framework can improve
noise detection accuracy.
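
As an illustration only, the scoring and thresholding step described above might be sketched as follows. The abstract does not give the exact formula, so this sketch makes several assumptions: each DOM node is reduced to its text, each node is treated as one document for the idf computation, tokens are split on whitespace, and the node value is a tf-idf-weighted entropy, -sum(w_t * p_t * log2(p_t)). The function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenizer (an assumption; no tokenizer is specified).
    return text.lower().split()


def weighted_entropy_scores(node_texts):
    """Map each node id to an assumed tf-idf-weighted entropy value.

    node_texts: dict of {node_id: text of that DOM node}.
    Each node is treated as one document when computing idf.
    """
    tokens = {nid: tokenize(text) for nid, text in node_texts.items()}
    n_nodes = len(tokens)

    # Document frequency: number of nodes in which each term occurs.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    scores = {}
    for nid, toks in tokens.items():
        if not toks:
            scores[nid] = 0.0  # empty nodes carry no information
            continue
        counts = Counter(toks)
        total = len(toks)
        score = 0.0
        for term, count in counts.items():
            p = count / total                     # term probability in the node (also its tf)
            idf = math.log(n_nodes / df[term])    # 0 for terms present in every node
            weight = p * idf                      # tf-idf weight of the term
            score += -weight * p * math.log2(p)   # weighted entropy contribution
        scores[nid] = score
    return scores


def detect_noise(node_texts, threshold):
    """Return the ids of nodes whose value is below the threshold."""
    scores = weighted_entropy_scores(node_texts)
    return {nid for nid, value in scores.items() if value < threshold}
```

Under these assumptions, text repeated across many nodes (navigation bars, footers) receives a low idf and therefore a low value, while a node with many distinct, node-specific terms receives a higher one. A suitable threshold would have to be tuned on labeled pages.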