Abstract:
We propose a methodology for clustering XML
documents on the basis of their structural
similarities. The approach combines common XPath
mining with K-means clustering to improve the
efficiency for XML collections that contain many
different structures. Common XPaths are used to
find similarities among the paths of a large number
of XML documents, and the K-means clustering
algorithm is then used to form accurate clusters. We
describe the method for clustering the documents’
paths step by step. The first step includes frequent
structure mining, which uses the FP-growth method to
find similarities among the structures of a large
number of XML documents. The second step builds
a multidimensional feature vector matrix from the
extracted paths. Based on the set of common path vectors
collected, we compute the structure similarity
between the XML documents. The last step uses the
K-means clustering algorithm to create accurate
clusters. It follows the idea of path-based
clustering, which groups the documents according to
their common XPaths, i.e., their frequent structures.
The quality of the clustering can be measured by the
dissimilarity of document structures. Experimental
evaluation performed
on both synthetic and real data shows the
effectiveness of our approach.
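
The three steps above can be illustrated with a minimal Python sketch. It is
not the authors' implementation: frequent paths are found by simple support
counting as a stand-in for FP-growth, the feature vectors are assumed to be
binary (path present or absent), and the initial centroids are passed in by
hand so that the run is deterministic. All function names and the
min_support parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def common_paths(doc_paths, min_support):
    """Step 1 (simplified): keep paths found in at least min_support documents.

    Plain counting is used here in place of FP-growth (an assumption).
    """
    counts = Counter(p for paths in doc_paths for p in paths)
    return sorted(p for p, c in counts.items() if c >= min_support)


def feature_matrix(doc_paths, vocabulary):
    """Step 2: one binary row per document, one column per common path."""
    return [[1.0 if p in paths else 0.0 for p in vocabulary]
            for paths in doc_paths]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, max_iter=100):
    """Step 3: plain K-means; returns one cluster label per vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def cluster_documents(xml_docs, min_support, seed_indices):
    """Run the three steps; seed_indices pick the initial centroid rows."""
    doc_paths = [root_to_leaf_paths(d) for d in xml_docs]
    vocabulary = common_paths(doc_paths, min_support)
    matrix = feature_matrix(doc_paths, vocabulary)
    return kmeans(matrix, [matrix[i] for i in seed_indices])
```

For example, calling cluster_documents on two small book documents and two
small article documents, with min_support=2 and one seed taken from each
kind, places the book documents in one cluster and the article documents in
the other. In practice the seeds would be chosen randomly or by a standard
initialisation scheme rather than by hand.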