Abstract:
Duplicate record detection, also known as record linkage or entity matching, is the process of finding multiple records in a dataset that represent the same real-world entity. Databases can hold very large datasets, and duplicate records in them often do not share a common key or contain errors such as incomplete information, transcription errors, and missing or differing standard formats (non-standardized abbreviations) in the schemas of records drawn from multiple databases.
Duplicate detection therefore needs to complete its work in a very short time, and it requires an algorithm for deciding whether two records are duplicates of each other.
In this system, the researcher computes a similarity metric commonly used to find matching field values and applies the Duplicate Count Strategy-Multi Record Increase (DCS++) algorithm to detect approximate duplicate records in a publication XML dataset.
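To illustrate the kind of field-level similarity metric the abstract refers to, the sketch below computes a normalized Levenshtein (edit-distance) similarity between two field values. The choice of Levenshtein and the function names are illustrative assumptions for this sketch, not necessarily the exact metric used in the system.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb)  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the fields are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Two records can then be flagged as candidate duplicates when the similarity of their key fields (e.g., title and author in a publication dataset) exceeds a chosen threshold.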