Abstract:
Data cleaning is a process for determining whether two or more records that are defined differently in a database actually represent the same real-world object. During data cleaning, multiple records representing the same real-world object are identified and assigned a single unique database identifier, and only one copy of exact duplicate records is retained. The token formation algorithm handles noisy data efficiently by expanding abbreviations, removing unimportant characters, and eliminating duplicates. An attribute selection algorithm chooses the attributes to be used before tokens are formed. Together, the attribute selection and token formation algorithms reduce the complexity of the data cleaning process and make cleaning more flexible and less error-prone. This paper uses smart tokens to increase the speed of the cleaning process and improve the quality of the data.
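
The three steps named above (expanding abbreviations, removing unimportant characters, and eliminating duplicates) can be illustrated with a minimal Python sketch. It is not the algorithm of this paper: the abbreviation table, the function names form_token and deduplicate, and the choice of name and address as the selected attributes are all hypothetical and serve only to show the idea.

```python
import re

# Hypothetical abbreviation table, assumed for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor", "mr": "mister"}


def form_token(record, selected_attributes):
    """Build one normalized token string from the selected attributes."""
    words = []
    for attr in selected_attributes:
        value = str(record.get(attr, "")).lower()
        # Remove unimportant characters: keep letters, digits and whitespace.
        value = re.sub(r"[^a-z0-9\s]", " ", value)
        # Expand known abbreviations word by word.
        words.extend(ABBREVIATIONS.get(w, w) for w in value.split())
    return " ".join(words)


def deduplicate(records, selected_attributes):
    """Keep the first record seen for each distinct token."""
    seen = {}
    for record in records:
        token = form_token(record, selected_attributes)
        seen.setdefault(token, record)
    return list(seen.values())


records = [
    {"name": "Dr. John Smith", "address": "12 Main St."},
    {"name": "doctor john smith", "address": "12 main street"},
    {"name": "Jane Doe", "address": "5 Oak Rd"},
]
cleaned = deduplicate(records, ["name", "address"])
```

In this example the first two records produce the same token, so only the first of them is retained and cleaned holds two records. Comparing short tokens built from a few selected attributes, rather than whole records, is what keeps the comparison cheap in this sketch.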