dc.description.abstract |
Nowadays, the useful information on a large
number of web pages is often accompanied by a
large amount of noise such as
banner advertisements, navigation bars,
copyright notices, etc. This noise can
seriously harm web mining, because miners
extract the whole document rather than the
informative content and consequently retrieve
non-relevant results. It
is also important to distinguish valuable
information from noisy data within a single web
page. Web pages contain not only main
content, such as product information in a
shopping domain or job information in a job
domain, but also advertisement bars and static
content such as navigation panels and
copyright sections. When web documents are
processed, the main content is surrounded by
noise in the retrieved data. To tackle these
issues, a noise elimination process based on
HTML tags is described, and the main content is
retrieved by using a Gomory-Hu tree. |
en_US |