Abstract:
The task of detecting duplicate records that
represent the same real-world object in multiple
data sources is commonly known as duplicate
detection, and it is relevant in data cleaning and
data integration applications. Numerous approaches
exist for duplicate detection in both relational and
XML data. As XML becomes increasingly popular
for data representation, algorithms to detect
duplicates in XML documents are required.
Previous domain-independent solutions to this
problem relied on standard textual similarity
functions (e.g., edit distance, cosine metric) between
objects. However, such approaches produce large
numbers of false positives when the goal is to identify
domain-specific abbreviations and conventions.
In this paper, we present a generalized
framework for duplicate detection, specialized to
XML. The aim of this research is to develop an
efficient algorithm for detecting duplicates in
complex XML documents and to reduce the number of
false positives by using a hash-function-based algorithm.
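
As a rough illustration of how hashing can be used to flag duplicate XML
elements, the following Python sketch groups elements by a fingerprint of
their content. It is not the algorithm proposed in this paper: the function
names (normalize, fingerprint, find_duplicates), the normalization rule, the
choice of SHA-256, and the sample data are all assumptions made for the
example, and it only detects duplicates that become identical after
normalization.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, simplistic rule)."""
    return " ".join((text or "").lower().split())


def fingerprint(element):
    """Hash an element's tag, normalized text, and child fingerprints.

    Child fingerprints are sorted, so sibling order does not affect the result.
    """
    child_prints = sorted(fingerprint(child) for child in element)
    payload = "|".join([element.tag, normalize(element.text)] + child_prints)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_duplicates(xml_text, record_tag):
    """Group elements named record_tag by fingerprint; keep groups of size > 1."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for record in root.iter(record_tag):
        groups[fingerprint(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data: the first two records differ only in
# letter case, spacing, and child order.
SAMPLE = """<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><year>1999</year><title>the  matrix</title></movie>
  <movie><title>Matrix Reloaded</title><year>2003</year></movie>
</movies>"""

duplicate_groups = find_duplicates(SAMPLE, "movie")
```

Because equal fingerprints require equal normalized content, a scheme like
this trades recall for few false positives; records that differ by a typo or
an abbreviation would need an additional similarity step.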