Abstract:
The Hadoop Distributed File System (HDFS) was
originally designed for large files. HDFS stores each
small file as a separate block, even though each such
file is smaller than the block size.
Therefore, storing a massive number of small files
creates a correspondingly large number of blocks.
Because the NameNode keeps the metadata for every
file and block in memory, it often becomes the
bottleneck when a large number of small files are
accessed. The problem of storing and accessing a
large number of small files is known as the small file
problem. To solve this issue in HDFS, an
approach of merging small files on HDFS is
proposed. In this paper, small files are merged into a
larger file based on an agglomerative hierarchical
clustering mechanism to reduce NameNode memory
consumption. This approach is intended to support
efficient storage of small files in cloud storage.
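
To make the memory argument concrete, the following is a
rough sketch of how NameNode metadata grows with the
number of files. It assumes about 150 bytes per namespace
object (one per file and one per block), a commonly quoted
rule of thumb rather than a figure from this paper, and a
128 MB block size. The name namenode_bytes is hypothetical,
and the sketch ignores the index a merged file would need
to locate each small file inside it.

```python
# Rough estimate of NameNode heap used by file and block metadata.
# ASSUMPTION: ~150 bytes per namespace object (file or block); this is a
# commonly quoted rule of thumb, not a figure from this paper.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB


def namenode_bytes(file_sizes):
    """Estimate metadata bytes for files of the given sizes (in bytes)."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // BLOCK_SIZE)  # ceiling division
        total += (1 + blocks) * BYTES_PER_OBJECT  # one file object plus its blocks
    return total


small_files = [1024 * 1024] * 10_000  # 10,000 files of 1 MB each
merged_file = [sum(small_files)]      # the same data stored as one merged file

print(namenode_bytes(small_files))  # unmerged: one file and one block per small file
print(namenode_bytes(merged_file))  # merged: one file and 79 blocks
```

Under these assumptions the unmerged files need about
3 MB of metadata and the merged file about 12 KB, which
is the kind of reduction that merging aims for.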