Abstract:
Apache Hadoop exposes more than 180 configuration parameters covering all types of applications and clusters, and roughly 10-20% of them have a significant impact on execution performance and efficiency. The optimal configuration settings for one application may be unsuitable for another, leading to poor utilization of system resources and long application completion times. Moreover, tuning many parameters is a time-consuming and challenging job: the number of configuration parameters and the size of the search space are huge, and users need good knowledge of the Hadoop framework. At a minimum, users must adjust the most important parameters, e.g. the number of map tasks that can run in parallel for a given application. This paper introduces a parameter optimization algorithm for this key application-level parameter, based on the input data size and the dynamically available resource capacity at any given time, to improve execution time and resource utilization with nearly zero optimization overhead.