Abstract:
The web content classification system
separates noise from content in HTML web pages.
The system proposes a Content Extraction
algorithm that uses content features to remove
boilerplate and extract the main content from the
web page. Observation of the HTML tags shows that a
single line may not contain a complete piece of
information and
long texts are distributed over adjacent lines. The
system therefore uses the Text-Block Concept to
determine the distance between any two neighboring
lines of text, and classifies each block as noise or
content using features such as Text Density (TD),
Anchor Link Density (ALD), and a new feature, Title
Keywords Density (TKD). After
extracting the features, the system uses the C4.8
decision tree method to classify each block as content
or non-content using the above features. After extracting
the main content, the system classifies it with a new
classification algorithm based on Ant Colony
Optimization (ACO), which is suited to discrete
problems and therefore to the discrete features of
text documents. Texts are classified by class
populations of ants that carry class information and
crawl to find an optimal matching path over the
iterations of the algorithm.
Finally, the classifier improves its performance with experience.
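
As an illustration of the three block features named above, the following is a minimal sketch and not the system's actual implementation. The abstract does not give the formulas, so the definitions used here are assumptions: TD is taken as text characters over all characters in the block's HTML, ALD as anchor-text characters over text characters, and TKD as the fraction of block words that also occur in the page title. The names `block_features` and `_TextCounter` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Collects all text in a block and the part that sits inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.text = []
        self.anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        self.text.append(data)
        if self.anchor_depth > 0:
            self.anchor_text.append(data)


def block_features(block_html, title):
    """Return assumed TD, ALD and TKD values for one text block."""
    parser = _TextCounter()
    parser.feed(block_html)
    parser.close()

    text = "".join(parser.text).strip()
    anchor_text = "".join(parser.anchor_text).strip()
    words = re.findall(r"\w+", text.lower())
    title_words = set(re.findall(r"\w+", title.lower()))

    # Assumed definitions; the abstract does not specify the exact formulas.
    td = len(text) / len(block_html) if block_html else 0.0
    ald = len(anchor_text) / len(text) if text else 0.0
    tkd = sum(w in title_words for w in words) / len(words) if words else 0.0
    return {"TD": td, "ALD": ald, "TKD": tkd}


if __name__ == "__main__":
    title = "Ant Colony Text Classification"
    nav = '<a href="/">Home</a>'
    body = "<p>Ant colony text classification works well</p>"
    print(block_features(nav, title))
    print(block_features(body, title))
```

Under these assumed definitions, a navigation block made up only of links gets a high ALD and a low TD, while a paragraph that repeats title words gets a high TKD. Feature vectors of this kind are what a decision tree would take as input when labeling a block as content or non-content.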