Adaptive Duplicate Detection in XML Document Based on Hash Function

Lwin, Thandar; Nyunt, Thi Thi Soe

Conference: Fourth Local Conference on Parallel and Soft Computing

URI: http://onlineresource.ucsy.edu.mm/handle/123456789/1914

Date: 2009-12-30

Abstract:

The task of detecting records that represent the same real-world object in multiple data sources is commonly known as duplicate detection, and it is relevant to data cleaning and data integration applications. Numerous approaches exist for duplicate detection in both relational and XML data. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we present a generalized framework for duplicate detection, specialized to XML. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using a hash function.
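
The abstract does not say how the hash function is applied, so the following is only an illustrative sketch of one common way to use hashing for XML duplicate detection, not the authors' algorithm. It assumes that each record is an element with a known tag, that text is normalized by lowercasing and collapsing whitespace, and that sibling order and attributes do not matter. The names `normalize`, `fingerprint`, `find_duplicates`, and `SAMPLE` are hypothetical and chosen for this example.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join((text or "").lower().split())


def fingerprint(element):
    """Return a SHA-1 hex digest of a canonical form of the element subtree.

    Child digests are sorted so that sibling order does not affect the result.
    Attributes are ignored in this sketch.
    """
    child_digests = sorted(fingerprint(child) for child in element)
    canonical = "|".join([element.tag, normalize(element.text)] + child_digests)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def find_duplicates(xml_string, record_tag):
    """Group record positions (in document order) that share a fingerprint."""
    root = ET.fromstring(xml_string)
    buckets = defaultdict(list)
    for index, record in enumerate(root.iter(record_tag)):
        buckets[fingerprint(record)].append(index)
    # Only buckets holding more than one record are candidate duplicates.
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical sample data: the first two records differ only in
# child order, letter case, and spacing.
SAMPLE = """
<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><year>1999</year><title>the  matrix</title></movie>
  <movie><title>Matrix Reloaded</title><year>2003</year></movie>
</movies>
"""

if __name__ == "__main__":
    print(find_duplicates(SAMPLE, "movie"))  # prints [[0, 1]]
```

Exact hashing of this kind only groups records that become identical after normalization. Handling domain-specific abbreviations, as the abstract describes, would need a richer normalization step or a similarity measure applied within each candidate group.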