Abstract:
Many real-world domains generate big data from
diverse sources: large volumes of high-velocity,
complex, and variable data. Big data
becomes a challenge when traditional analysis tools
cannot readily process it or extract knowledge
from it. Scalable machine learning algorithms are
therefore needed to process such data. Recently,
the Hadoop MapReduce framework
has been adopted for parallel computing. However,
MapReduce may not suit most real-world data
applications. For large-scale machine learning on
distributed systems, Spark has become a much more
viable option that goes beyond MapReduce. Although both of
these frameworks are Apache-hosted data analytics
frameworks, their performance varies significantly
with the use case being implemented.
This paper analyzes the performance of a scalable
Naïve Bayes classifier (SNB) implemented on
MapReduce and on Beyond MapReduce over different
real-world datasets. The comparison
results show that SNB on Beyond MapReduce
requires less processing time than SNB on
MapReduce, making big data classification more efficient.