Abstract:
Storage systems hold a vast amount of duplicated or redundant data. Existing
data deduplication techniques reduce storage space at the file level and the
sub-file level, operating on the data byte by byte. There is also a need for
content-level data deduplication, especially for Myanmar language content. This study aims
to deduplicate sentences written in Burmese. The system accepts
Myanmar sentences as input and uses a Text Splitter to segment the input file into
chunks at whitespace. The separated chunks are passed to the ChunkID
generator, which produces a ChunkID for each chunk by applying the Secure Hash Algorithm (SHA-1).
The system then searches for duplicate phrases and removes the redundant
copies. The system is implemented in Python using the Visual Studio Code IDE.
According to the test results, the system can deduplicate data written in the
Myanmar language in .txt and .docx files; it works especially
well on .txt files for both the deduplication and reconstruction processes.
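
The pipeline described above (whitespace splitting, SHA-1 ChunkID generation, duplicate removal, and reconstruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `chunk_id`, `deduplicate`, and `reconstruct`, the in-memory dictionary used as the chunk store, and the choice to rejoin chunks with a single space are all assumptions made for illustration.

```python
import hashlib


def chunk_id(chunk: str) -> str:
    """Return the SHA-1 hex digest of a chunk (hypothetical ChunkID format)."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def deduplicate(text: str) -> tuple[dict[str, str], list[str]]:
    """Split text at whitespace and keep one copy of each distinct chunk.

    Returns (store, recipe): store maps ChunkID -> chunk text, and recipe is
    the ordered list of ChunkIDs needed to rebuild the input.
    """
    store: dict[str, str] = {}
    recipe: list[str] = []
    for chunk in text.split():           # split on any run of whitespace
        cid = chunk_id(chunk)
        store.setdefault(cid, chunk)     # store only the first occurrence
        recipe.append(cid)
    return store, recipe


def reconstruct(store: dict[str, str], recipe: list[str]) -> str:
    """Rebuild the text from the recipe.

    Assumption: chunks are rejoined with single spaces, so the original
    whitespace layout (tabs, newlines, repeated spaces) is not preserved.
    """
    return " ".join(store[cid] for cid in recipe)


if __name__ == "__main__":
    sample = "မြန်မာ စာ မြန်မာ"          # the first chunk repeats
    store, recipe = deduplicate(sample)
    print(len(recipe), "chunks,", len(store), "unique")
    print(reconstruct(store, recipe) == sample)
```

Keeping the ordered list of ChunkIDs separate from the chunk store is what makes reconstruction possible: each repeated chunk costs one extra identifier rather than a second copy of the text.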