Abstract:
Duplicate record detection, also known as record linkage or entity matching, is the process of finding multiple records in a dataset that represent the same real-world entity. Databases can hold very large datasets, and duplicate records in them often do not share a common key or contain errors such as incomplete information, transcription errors, and missing or differing standard formats (non-standardized abbreviations) in the schemas of records drawn from multiple databases.
Duplicate detection therefore needs to complete its work in a very short time, and it requires an algorithm for deciding whether two records are duplicates of each other.
In this system, the researcher computes a similarity metric commonly used to find matching field values and applies the Duplicate Count Strategy-Multi Record Increase (DCS++) algorithm to detect approximate duplicate records in a publication XML dataset.
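To illustrate the kind of field-level similarity metric the abstract refers to, the sketch below computes a normalized Levenshtein (edit-distance) similarity between two field values. The choice of Levenshtein and the function names are illustrative assumptions for this sketch, not necessarily the exact metric used in the system.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb)  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the fields are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Two records can then be flagged as candidate duplicates when the similarity of their key fields (e.g., title and author in a publication dataset) exceeds a chosen threshold.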