Abstract:
Duplicate record detection is the process of finding
multiple records that represent the same physical
entity in a dataset. It is also known as record
linkage or entity matching [1]. Databases often
contain very large datasets, and these datasets
contain duplicate records that do not share a common
key or that contain errors such as incomplete
information, transcription errors, and missing or
differing standard formats (for example,
non-standardized abbreviations) across the detailed
schemas of records from multiple databases.
Duplicate detection therefore needs to complete its
process in a short time, and it requires an algorithm
for determining whether records are duplicates or
not.
This paper calculates a similarity metric that is
commonly used to find similar field items and uses the
Duplicate Count Strategy Multi-Record Increase
(DCS++) algorithm for approximate duplicate record
detection over a publication XML dataset.
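
To make the idea of a field similarity metric combined with an adaptive comparison window concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes that records are plain title strings, that the similarity metric is the ratio from Python's difflib (the abstract does not name the metric), and that the window size w, the duplicate-rate threshold phi, and the similarity threshold are illustrative values. The function names are hypothetical. The window logic is a simplified reading of the DCS++ idea: after each detected duplicate, the window is extended by the next w-1 records, and records already marked as duplicates do not open their own window.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed field similarity: difflib ratio on lowercased strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def adaptive_window_duplicates(titles, w=3, phi=0.5, threshold=0.85):
    """Simplified adaptive-window duplicate search over a sorted list.

    Returns pairs (i, j) of indices into `titles` judged to be duplicates.
    All parameter values are illustrative assumptions.
    """
    # Sort record indices by a sorting key (here: the lowercased title).
    order = sorted(range(len(titles)), key=lambda i: titles[i].lower())
    n = len(order)
    skip = set()   # records already found as duplicates of an earlier record
    pairs = []
    for pos in range(n):
        i = order[pos]
        if i in skip:
            continue  # no own window for records already classified
        end = min(pos + w, n)  # exclusive end of the current window
        dups = comps = 0
        k = pos + 1
        while k < end:
            j = order[k]
            comps += 1
            if similarity(titles[i], titles[j]) >= threshold:
                pairs.append((i, j))
                skip.add(j)
                dups += 1
                # Extend the window by the w-1 records following the duplicate.
                end = min(max(end, k + w), n)
            # At the window's last record, grow by one if duplicates are dense.
            if k == end - 1 and end < n and dups / comps >= phi:
                end += 1
            k += 1
    return pairs


if __name__ == "__main__":
    sample = [
        "Adaptive Windows for Duplicate Detection",
        "Adaptive Windws for Duplicate Detection",
        "Record Linkage Survey",
        "Adaptive windows for duplicate detection.",
        "XML Data Cleaning",
    ]
    print(adaptive_window_duplicates(sample))
```

In this sketch the pairs share the first record of each cluster, so a transitive closure step over the returned pairs would be needed to obtain the complete duplicate clusters.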