Abstract:
Apache Hadoop is a distributed platform for
storing, processing and analyzing big data on
commodity machines. Hadoop exposes tunable parameters
that significantly affect the performance of MapReduce
applications, so tuning the Hadoop configuration
parameters is an effective way to improve performance. Performance
optimization is usually based on memory utilization,
disk I/O rate, CPU utilization and network traffic. In
this paper, we measure and analyze how MapReduce
performance changes as the number of concurrent
containers (cc) per machine is varied on a YARN-based
pseudo-distributed setup. We also measure the impact
on performance of using different Hadoop Distributed
File System (HDFS) block sizes. From our experiments,
we found that tuning
cc per node improves performance compared to the default
parameter settings. We also observed a further
performance improvement when cc is tuned together
with the HDFS block size.
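
The interaction between the two tuned quantities can be illustrated with a
rough back-of-the-envelope sketch. It is not the procedure used in this
paper. It assumes that cc is bounded by the memory and virtual cores that the
NodeManager offers (yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores), which is a simplification because
whether vcores are enforced depends on the scheduler configuration, and that
one map task is launched per HDFS block, as with the default input split
behavior. The function names and the numbers are hypothetical and chosen
only for illustration.

```python
def concurrent_containers(node_memory_mb, container_memory_mb,
                          node_vcores, container_vcores=1):
    """Approximate cc per node: the tighter of the memory and vcore limits."""
    by_memory = node_memory_mb // container_memory_mb
    by_cpu = node_vcores // container_vcores
    return max(0, min(by_memory, by_cpu))


def map_task_count(input_bytes, block_size_bytes):
    """Approximate number of map tasks, assuming one input split per block."""
    return -(-input_bytes // block_size_bytes)  # integer ceiling division


def map_waves(tasks, cc):
    """Number of rounds needed to run all map tasks with cc slots at a time."""
    return -(-tasks // cc)


if __name__ == "__main__":
    MIB = 1024 ** 2
    GIB = 1024 ** 3
    # Illustrative values only, not measurements from the experiments.
    cc = concurrent_containers(node_memory_mb=8192, container_memory_mb=2048,
                               node_vcores=8)
    for block_mib in (64, 128, 256):
        tasks = map_task_count(1 * GIB, block_mib * MIB)
        print(block_mib, "MiB blocks:", tasks, "map tasks,",
              map_waves(tasks, cc), "waves with cc =", cc)
```

Under these assumptions, a larger block size yields fewer map tasks, while a
larger cc lets more of them run at once, so the two settings together
determine how many rounds of map tasks a job needs.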