Abstract:
In the real world, entities often have two or more representations in databases. A major problem that arises when integrating different databases is the existence of duplicates: duplicate records do not share a common key and/or contain errors, which makes matching them difficult. Data cleaning is the process of identifying two or more records within a database that represent the same real-world object (duplicates), so that a unique representation of each object can be adopted. This system addresses the data cleaning problem of detecting records that are approximate, rather than exact, duplicates. It uses a priority queue algorithm together with the Smith-Waterman algorithm to compute minimum edit-distance similarity values, recognize pairs of approximate duplicates, and then eliminate the detected duplicate records. Performance is evaluated in terms of the false positive percentage (FP%) and the false negative percentage (FN%), with the lowest values taken as the best result.
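To make the similarity step concrete, the following is a minimal sketch (not the authors' implementation) of Smith-Waterman local alignment used as a string-similarity measure between two record fields; the scoring parameters (match = 2, mismatch = -1, gap = -1) and the normalization by the shorter string's length are assumptions for illustration only.

```python
def smith_waterman_similarity(a: str, b: str,
                              match: int = 2,
                              mismatch: int = -1,
                              gap: int = -1) -> float:
    """Return a similarity in [0, 1] based on the best local alignment score.

    Illustrative sketch only; scoring parameters are assumed, not taken
    from the paper.
    """
    if not a or not b:
        return 0.0
    rows, cols = len(a) + 1, len(b) + 1
    # Dynamic-programming matrix; scores below zero are reset to zero
    # (the defining property of local alignment).
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            h[i][j] = score
            best = max(best, score)
    # Normalize by the maximum attainable score: a perfect match of the shorter string.
    return best / (match * min(len(a), len(b)))


if __name__ == "__main__":
    # Two near-duplicate records differing by a typo and an abbreviation.
    print(smith_waterman_similarity("John Smith, 42 Park Ave",
                                    "Jon Smith, 42 Park Avenue"))
```

In a priority-queue-based duplicate detection scheme, a similarity score of this kind would typically be compared against a threshold to decide whether an incoming record matches one of the recently seen record clusters kept in the queue; the exact thresholds and queue management details are specific to the system described in the paper.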