Abstract:
The process of detecting and removing defects and duplicates from a database is referred to as data cleaning. The fundamental issue in duplicate detection is that inexact duplicates in a database may refer to the same real-world object because of errors and missing data. Duplicate elimination is hard because duplicates arise from different types of errors, such as typographical errors, missing values, abbreviations, and different representations of the same logical value. If the database contains duplicate records, it is difficult to analyze the database and to extract the required data. To obtain quality data, data cleaning must be performed. This paper proposes a system that resolves dirty data in the database and ensures that clean data is obtained. The paper concentrates on the duplicate-data problem and addresses it using a token-based data cleaning technique.
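To illustrate the general idea behind token-based duplicate detection, the minimal sketch below tokenizes each record's field values and flags record pairs whose token sets are highly similar. The field names, the Jaccard similarity measure, and the threshold are assumptions for this example only, not the exact algorithm proposed in the paper.

```python
import re
from itertools import combinations

def tokens(record):
    """Split a record's field values into lowercase alphanumeric tokens."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_duplicates(records, threshold=0.8):
    """Return index pairs whose token sets are similar enough to be flagged."""
    token_sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]

# Hypothetical example data: two inexact duplicates and one distinct record.
rows = [
    {"name": "John A. Smith", "city": "New York"},
    {"name": "Smith, John", "city": "new york"},
    {"name": "Jane Doe", "city": "Boston"},
]
print(find_duplicates(rows))  # [(0, 1)]
```

Because tokenization normalizes ordering, case, and punctuation, the two "John Smith" records are matched even though their string representations differ.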