Duplicate Record Detection in Data Cleaning Using DCS++ Algorithm

Phyo, Yin Yin; Win, Thidar

National Journal of Parallel and Soft Computing (2020)

URI: https://onlineresource.ucsy.edu.mm/handle/123456789/2584

Date: 2021-01

Abstract:

Duplicate record detection is the process of finding multiple records that represent the same physical entity in a dataset. It is also known as record linkage or entity matching [1]. Databases contain very large datasets, and these datasets often contain duplicate records that do not share a common key or that contain errors such as incomplete information, transcription errors, and missing or differing standard formats (for example, non-standardized abbreviations) in the detailed schemas of records from multiple databases. Duplicate detection therefore needs to complete in a very short time, and it requires an algorithm for determining whether two records are duplicates. This paper calculates a similarity metric that is commonly used to find similar field items and uses the Duplicate Count Strategy Multi-Record Increase (DCS++) algorithm to detect approximate duplicate records in a publication XML dataset.
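
The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the similarity metric, so the sketch assumes the ratio from Python's difflib.SequenceMatcher, averaged over the fields. The field names (title, author), the sort key, the window size w, the threshold, and the toy records are all hypothetical. The window logic is a simplified reading of the multi-record increase idea: records are sorted, each record is compared with the next w - 1 records, and every detected duplicate extends the window by the w - 1 records that follow it. Windows that would start at an already detected duplicate are skipped, and a union-find structure supplies the transitive closure. The duplicates-per-comparison ratio test of the basic Duplicate Count Strategy is omitted.

```python
from difflib import SequenceMatcher

FIELDS = ("title", "author")  # hypothetical field names, not from the paper


def field_similarity(a, b):
    """Similarity of two field values in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2):
    """Average field similarity over the assumed fields."""
    return sum(field_similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)


def dcs_plus_plus_sketch(records, w=3, threshold=0.85):
    """Simplified DCS++-style detection; returns clusters of record indices."""
    # Sort record indices by an assumed sorting key (lowercased title).
    order = sorted(range(len(records)), key=lambda i: records[i]["title"].lower())
    n = len(order)

    # Union-find gives the transitive closure over detected duplicate pairs.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    skipped = set()  # detected duplicates do not start their own window
    for pos in range(n):
        i = order[pos]
        if i in skipped:
            continue
        end = min(pos + w, n)  # initial window: sorted positions pos .. end - 1
        k = pos + 1
        while k < end:
            j = order[k]
            if record_similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)
                skipped.add(j)
                # Multi-record increase: also cover the w - 1 records after j.
                end = min(max(end, k + w), n)
            k += 1

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) > 1]


# Toy records, invented for illustration only.
sample = [
    {"title": "Data Cleaning Methods", "author": "A. Smith"},
    {"title": "Data Cleaning Method", "author": "A Smith"},
    {"title": "Graph Query Processing", "author": "B. Lee"},
    {"title": "data cleaning methods", "author": "A. Smith"},
    {"title": "Parallel Sorting Networks", "author": "C. Tan"},
]
print(dcs_plus_plus_sketch(sample))  # [[0, 1, 3]]
```

In this sketch, a larger w or a lower threshold finds more candidate duplicates at the cost of more comparisons. Records parsed from an XML file would first need to be converted to dictionaries of this shape.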
