Abstract:
Apache Hadoop is an open-source software
framework for distributed storage and distributed
processing of very large data sets on computer
clusters built from commodity hardware. The Hadoop
Distributed File System (HDFS) is the underlying file
system of a Hadoop cluster. The default HDFS data
placement strategy works well in homogeneous
clusters, but it performs poorly in heterogeneous
clusters because the nodes differ in capability:
some computing nodes may become overloaded,
reducing Hadoop performance. HDFS therefore has to
rely on a load-balancing utility to balance the
data distribution, so that data is placed evenly
across the Hadoop cluster. However, because each
node in a heterogeneous Hadoop cluster has a
different computing capacity, this may incur the
overhead of transferring unprocessed data from slow
nodes to fast nodes. To solve these problems, a
data/replica placement policy based on the storage
utilization and computing capacity of each data node
in a heterogeneous Hadoop cluster is proposed. The
proposed policy aims to reduce the overload of some
computing nodes as well as the overhead of data
transmission between different computing nodes.
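
The sketch below illustrates the general idea of capacity-aware replica placement described above: rank candidate data nodes by storage utilization normalized by relative computing capacity, so faster nodes absorb proportionally more data. This is a minimal, hypothetical illustration; the class and field names (NodeInfo, computeRatio, chooseTarget) are assumptions for exposition and are not the paper's actual algorithm or part of the HDFS API.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a data node's state; field names are illustrative.
class NodeInfo {
    String host;
    double usedBytes;      // bytes currently stored on the node
    double capacityBytes;  // total storage capacity of the node
    double computeRatio;   // relative computing capacity (1.0 = reference node)

    NodeInfo(String host, double usedBytes, double capacityBytes, double computeRatio) {
        this.host = host;
        this.usedBytes = usedBytes;
        this.capacityBytes = capacityBytes;
        this.computeRatio = computeRatio;
    }

    // Storage utilization divided by computing capacity: a fast node is
    // allowed to hold proportionally more data before it looks "full".
    double weightedUtilization() {
        return (usedBytes / capacityBytes) / computeRatio;
    }
}

class ReplicaPlacer {
    // Pick the node with the lowest capacity-weighted utilization
    // as the target for the next block replica.
    static NodeInfo chooseTarget(List<NodeInfo> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(NodeInfo::weightedUtilization))
                .orElseThrow(() -> new IllegalArgumentException("no candidate nodes"));
    }

    public static void main(String[] args) {
        List<NodeInfo> nodes = List.of(
                new NodeInfo("dn1", 400e9, 1000e9, 2.0),  // fast node, 40% full
                new NodeInfo("dn2", 300e9, 1000e9, 1.0),  // slow node, 30% full
                new NodeInfo("dn3", 50e9, 1000e9, 0.5));  // very slow node, 5% full
        // dn3 wins: 0.05 / 0.5 = 0.10, versus 0.20 for dn1 and 0.30 for dn2.
        System.out.println("place replica on: " + chooseTarget(nodes).host);
    }
}
```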