Abstract:
Clustering is an essential data mining task with numerous applications. It is the process of grouping data into classes, or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. This system uses an efficient graph clustering algorithm to group online scientific literature. Just as the pages and hyperlinks of the World Wide Web may be viewed as nodes and edges in a directed graph, so may papers and the citations between them. Our approach uses the citation patterns of the CiteSeer database to form preliminary clusters (soft clusters). The soft clusters, in turn, are compared to one another in terms of the papers they have in common, and similar soft clusters are merged using Ward’s agglomerative hierarchical clustering method. The result is a set of document collections whose members are all related to one another by their citation patterns. With this approach, clusters can be computed rapidly for datasets with tens of thousands of documents.
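
The two-stage idea (soft clusters from citation patterns, then Ward merging) can be illustrated with a minimal sketch. It is not the system's implementation, and several choices are assumptions made only for illustration: a soft cluster is taken to be a cited paper together with every paper that cites it, each soft cluster is represented as a binary membership vector over papers, and merging stops at a hypothetical `n_groups` parameter. The function names and the toy citation data are likewise invented.

```python
from itertools import combinations


def soft_clusters(citations):
    """Build one soft cluster per cited paper.

    Assumption (not from the abstract): a soft cluster is a cited paper
    plus all papers that cite it. `citations` maps paper -> cited papers.
    """
    clusters = {}
    for paper, cited in citations.items():
        for target in cited:
            clusters.setdefault(target, {target}).add(paper)
    return clusters


def ward_merge(clusters, n_groups):
    """Merge soft clusters with Ward's criterion until n_groups remain.

    Each soft cluster is a binary vector over papers; at every step the
    pair whose merge least increases the within-group sum of squares,
    |A||B| / (|A| + |B|) * ||c_A - c_B||^2, is merged.
    """
    papers = sorted(set().union(*clusters.values()))
    index = {p: i for i, p in enumerate(papers)}

    groups = []
    for _, members in sorted(clusters.items()):
        vec = [0.0] * len(papers)
        for p in members:
            vec[index[p]] = 1.0
        groups.append({"size": 1, "centroid": vec, "papers": set(members)})

    while len(groups) > n_groups:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            a, b = groups[i], groups[j]
            dist2 = sum((x - y) ** 2
                        for x, y in zip(a["centroid"], b["centroid"]))
            cost = a["size"] * b["size"] / (a["size"] + b["size"]) * dist2
            if best is None or cost < best[0]:
                best = (cost, i, j)

        _, i, j = best
        a, b = groups[i], groups[j]
        size = a["size"] + b["size"]
        merged = {
            "size": size,
            # Size-weighted mean of the two centroids.
            "centroid": [(a["size"] * x + b["size"] * y) / size
                         for x, y in zip(a["centroid"], b["centroid"])],
            "papers": a["papers"] | b["papers"],
        }
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)

    return [g["papers"] for g in groups]


# Toy citation data (invented for illustration).
toy_citations = {
    "A": ["X", "Y"],
    "B": ["X", "Y"],
    "C": ["X"],
    "D": ["P", "Q"],
    "E": ["P", "Q"],
}
final_groups = ward_merge(soft_clusters(toy_citations), n_groups=2)
print(sorted(sorted(g) for g in final_groups))
```

On this toy input the soft clusters built around X and Y share papers A and B, and those around P and Q share D and E, so Ward's criterion merges each pair first. The pairwise search here is quadratic per merge step and is meant only to make the criterion explicit; it says nothing about how the system achieves its reported speed.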