Abstract:
Long-term storage of massive amounts of unstructured data is a major challenge for today’s storage systems. Availability and reliability are basic properties of most storage systems. Replication, the simplest redundancy scheme, can help a storage system achieve continuous access. However, once the number of replicas reaches a certain point, additional redundancy no longer improves data availability. In this paper, an efficient data deduplication method for large-scale distributed storage systems is presented. Because good data indexing is very helpful for duplicate detection, the deduplication scheme uses a Bloom filter array for space and look-up efficiency in the distributed storage system.
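
To illustrate how a Bloom filter array can support duplicate detection, the following is a minimal sketch and not the implementation presented in the paper. It assumes one Bloom filter per storage node, SHA-1 chunk fingerprints, and an in-memory set standing in for each node's exact index. All class names, method names, and parameters are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over byte-string keys (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class BloomFilterArray:
    """One Bloom filter per storage node (an assumption of this sketch)."""

    def __init__(self, num_nodes, **filter_args):
        self.filters = [BloomFilter(**filter_args) for _ in range(num_nodes)]
        # Exact fingerprint sets stand in for each node's real index.
        self.indexes = [set() for _ in range(num_nodes)]

    def find_duplicate(self, fingerprint):
        """Return the node already holding this fingerprint, or None."""
        for node, bloom in enumerate(self.filters):
            # The cheap filter check runs first; the exact index is consulted
            # only on a possible hit, which rules out false positives.
            if bloom.might_contain(fingerprint) and fingerprint in self.indexes[node]:
                return node
        return None

    def store(self, chunk, node):
        """Store a chunk on `node` unless a copy exists; return (node, stored)."""
        fingerprint = hashlib.sha1(chunk).digest()
        existing = self.find_duplicate(fingerprint)
        if existing is not None:
            return existing, False
        self.filters[node].add(fingerprint)
        self.indexes[node].add(fingerprint)
        return node, True
```

In this sketch, calling `store(b"hello", node=0)` on a fresh `BloomFilterArray(num_nodes=3)` returns `(0, True)`, and a later `store(b"hello", node=2)` returns `(0, False)` because the chunk is already recorded on node 0. The filters keep most look-ups away from the exact index, which is where the space and look-up savings would come from.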