Abstract:
The web holds a vast amount of data, which makes it
difficult to find information in a user's field of
interest (here, the IT academic field). Web pages
therefore need to be categorized so that users can
reach their fields of interest easily. Web page
categorization also helps improve the quality of web
search. In this paper, we propose a framework for web
data extraction that uses categorized web pages to
improve the accuracy of the extracted data. First, a
set of test web pages is defined as input. We apply
the VIPS (Vision-based Page Segmentation) algorithm to
segment these pages and obtain their content structure,
which is used to clean each page and to identify its
informative (main content) blocks. These main content
blocks are then categorized using a Support Vector
Machine (SVM), which gives accurate and efficient
results. The categorized web pages are stored in a
database (the IT library) so that accurate data can be
returned in response to user queries.
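
The pipeline described above (main content selection, SVM
categorization, storage, query) can be illustrated with a
minimal sketch. It is not the implementation used in this
paper and rests on several assumptions: main_block is a
hypothetical stand-in that picks the block with the most
non-link words and is not VIPS, the classifier is a small
linear SVM trained by subgradient descent on the hinge loss
over bag-of-words counts, and the two categories, the
training sentences, the URLs, and the it_library table are
all invented for illustration.

```python
import sqlite3
from collections import Counter

def tokenize(text):
    return text.lower().split()

def main_block(blocks):
    # Hypothetical stand-in for VIPS (assumption): keep the block
    # with the most words that are not link text.
    return max(blocks, key=lambda b: len(tokenize(b["text"])) - b["link_words"])["text"]

def train_linear_svm(docs, labels, epochs=50, lam=0.01, lr=0.1):
    # Subgradient descent on the L2-regularized hinge loss.
    w, b = Counter(), 0.0  # sparse weights keyed by token, plus bias
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = Counter(tokenize(text))
            margin = y * (sum(w[t] * c for t, c in x.items()) + b)
            for t in list(w):  # shrink weights (regularization step)
                w[t] *= 1 - lr * lam
            if margin < 1:  # example violates the margin: move toward it
                for t, c in x.items():
                    w[t] += lr * y * c
                b += lr * y
    return w, b

def predict(w, b, text):
    x = Counter(tokenize(text))
    return 1 if sum(w[t] * c for t, c in x.items()) + b >= 0 else -1

# Illustrative categories only (assumption, not from the paper).
LABEL_NAMES = {1: "programming", -1: "networking"}

def build_library(conn, pages, w, b):
    # Store each page's main content together with its predicted category.
    conn.execute("CREATE TABLE IF NOT EXISTS it_library (url TEXT, category TEXT, content TEXT)")
    for url, blocks in pages.items():
        content = main_block(blocks)
        category = LABEL_NAMES[predict(w, b, content)]
        conn.execute("INSERT INTO it_library VALUES (?, ?, ?)", (url, category, content))
    conn.commit()

def query(conn, category):
    rows = conn.execute(
        "SELECT url FROM it_library WHERE category = ? ORDER BY url", (category,))
    return [r[0] for r in rows]

train_docs = [
    "python java compiler code function",
    "code debugging function class python",
    "router network packet protocol tcp",
    "tcp ip network switch protocol",
]
train_labels = [1, 1, -1, -1]
w, b = train_linear_svm(train_docs, train_labels)

# Each page is a list of already segmented blocks (toy data).
pages = {
    "http://example.org/a": [
        {"text": "home about contact login", "link_words": 4},
        {"text": "java function code compiler tutorial", "link_words": 0},
    ],
    "http://example.org/b": [
        {"text": "network packet router protocol guide", "link_words": 0},
        {"text": "next previous top", "link_words": 3},
    ],
}
conn = sqlite3.connect(":memory:")
build_library(conn, pages, w, b)
print(query(conn, "programming"))
```

In this sketch a user query is reduced to a lookup by
category. Categorizing only the main content block, rather
than the whole page, keeps navigation and link text out of
the classifier input, which is the motivation for the
cleaning step.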