Web Content Classification using Content Features and Ant Colony Optimization Algorithm

Aye, Nilar; San, Pan Ei

Fourteenth International Conference On Computer Applications (ICCA 2016)

URI: http://onlineresource.ucsy.edu.mm/handle/123456789/343

Date: 2016-02-25

Abstract:

The web content classification system separates noise from main content in HTML web pages. The system proposes a Content Extraction algorithm that uses content features to remove boilerplate and extract the main content from a web page. Observation of HTML pages shows that a single line may not contain a complete piece of information and that long texts are spread across adjacent lines. The system therefore uses a Text-Block Concept, which measures the distance between any two neighboring lines that contain text, and extracts features for each block: Text Density (TD), Anchor Link Density (ALD), and a new feature, Title Keywords Density (TKD). Using these features, the C4.8 decision tree method classifies each block as content or non-content. After the main content is extracted, the system classifies the text with the Ant Colony Optimization (ACO) algorithm, which is suited to discrete problems and to the discrete nature of text document features. Texts are classified by a population of class ants, each carrying class information, which crawl to find an optimal matching path over the iterations of the algorithm. Finally, the classifier improves its performance with experience, which adds to the interest of the system.
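
The abstract does not give formulas for the block features, so the following Python sketch is only an illustration under stated assumptions: text lines are grouped into blocks by line distance, TD is taken as text characters per line of the block, ALD as the share of text characters that sit inside anchor tags, and TKD as the share of block words that also occur in the page title. The function names, the `max_gap` parameter, and these three definitions are hypothetical and may differ from the authors' method; the decision tree and ACO stages are not shown.

```python
import re

# Simple regex-based tag handling; an assumption made to keep the sketch short.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z0-9]+")


def strip_tags(html):
    """Return the markup with every tag removed."""
    return TAG_RE.sub("", html)


def group_text_blocks(lines, max_gap=1):
    """Group indices of lines that contain text into blocks.

    A new block starts when the distance between two neighboring
    text lines exceeds max_gap (hypothetical threshold).
    """
    blocks, current, last = [], [], None
    for i, line in enumerate(lines):
        if not strip_tags(line).strip():
            continue  # line has no visible text
        if last is not None and i - last > max_gap:
            blocks.append(current)
            current = []
        current.append(i)
        last = i
    if current:
        blocks.append(current)
    return blocks


def block_features(lines, block, title):
    """Compute assumed TD, ALD and TKD values for one block."""
    html = " ".join(lines[i] for i in block)
    text = strip_tags(html).strip()
    text_len = len(text)
    anchor_len = sum(len(strip_tags(a).strip()) for a in ANCHOR_RE.findall(html))
    words = WORD_RE.findall(text.lower())
    title_words = set(WORD_RE.findall(title.lower()))
    return {
        "TD": text_len / len(block),                        # text chars per line
        "ALD": anchor_len / text_len if text_len else 0.0,  # anchor text share
        "TKD": (sum(w in title_words for w in words) / len(words)) if words else 0.0,
    }


page = [
    '<a href="/">Home</a>',
    '<a href="/n">News</a>',
    "",
    "",
    "<p>Ant colony text classification works on web pages.</p>",
    "<p>Ants follow pheromone trails.</p>",
]
blocks = group_text_blocks(page, max_gap=1)
features = [block_features(page, b, "Ant Colony Web Classification") for b in blocks]
```

In this toy page the navigation block has a high ALD and a TKD of zero, while the article block has an ALD of zero and a higher TKD, which is the kind of contrast a decision tree could split on.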
