dc.description.abstract |
Record matching is the task of identifying records that refer to the same real-world entity. Detecting data records that are approximate duplicates is an important task. Datasets may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases. This paper describes a record matching algorithm for publication datasets that is based on the multi-pass sorted neighborhood method. The algorithm detects duplicate records in a publication XML database and, by making multiple sorting passes with different keys, produces a higher percentage of correct duplicates and a lower percentage of false positives. Each pass sorts the records on a different key, and the passes together use a combination of keys. Since no single key is sufficient to catch all matching records, combining the results of the individual passes produces more accurate results at lower cost. According to the experimental results, the multi-pass approach has the lowest false positive error (FPE) and the lowest false negative error (FNE). |
en_US |
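
The multi-pass sorted neighborhood idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. The field names (`title`, `author`, `year`), the sample records, the two sorting keys, the window size, the 0.85 similarity threshold, and the final merging of matched pairs into clusters are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Hypothetical matcher: compare lowercased titles by similarity ratio."""
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def sorted_neighborhood_pass(records, key, window, match):
    """One pass: sort by `key`, then compare each record only with the
    previous `window - 1` records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass(records, keys, window=3, match=similar):
    """Run one pass per key, union the matched pairs, and merge them into
    clusters (the merging step is an assumption of this sketch)."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pass(records, key, window, match)

    parent = list(range(len(records)))  # union-find over record indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(c for c in clusters.values() if len(c) > 1)


# Hypothetical publication records; 0 and 2 describe the same paper.
records = [
    {"title": "Data Cleaning Methods", "author": "Smith", "year": "2001"},
    {"title": "Graph Theory Basics", "author": "Jones", "year": "1999"},
    {"title": "The Data Cleaning Methods", "author": "Smith", "year": "2001"},
    {"title": "Query Optimization", "author": "Lee", "year": "2005"},
]


def by_title(r):
    return r["title"].lower()


def by_author_year(r):
    return (r["author"].lower(), r["year"])


print(multi_pass(records, [by_title, by_author_year], window=2))  # [[0, 2]]
```

In this example the title key alone misses the duplicate, because the leading "The" moves one copy to the far end of the sorted order and outside the window. The author and year key places the two copies next to each other, which is the sense in which no single key is sufficient.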