Abstract:
Large volumes of digital data are now generated everywhere, every second of the
day. One challenge is that much of this data is high-dimensional. Most traditional
machine learning algorithms perform poorly on such data, in both training time and
classification accuracy, and struggle to uncover the insights hidden in it.
The Backpropagation Neural Network, one of the most popular Artificial Neural
Networks, is widely used in classification applications. To reduce the dimensionality
of the data, feature selection must be considered. MapReduce is a software framework
for writing applications that run on Hadoop, and it supports the rapid computation and
processing of Big Data.
In the proposed system, the data is first preprocessed by substituting missing
values. The dimensionality of the data is then reduced using the Chi-square feature
selection method. After that, a Backpropagation Neural Network built on the MapReduce
paradigm is used for classification. This MapReduce-based Neural Network classifier is
constructed with one and with two hidden layers. The outputs of the proposed system are
performance measures: training time, accuracy and the number of selected features.
Experiments were conducted both with and without feature selection, and the results are
compared with those obtained from the WEKA tool and from a conventional Backpropagation
Neural Network. Six datasets (Thyroid Disease Diagnosis, Diabetics Diagnosis, Insurance
Classification, Intrusion Detection, Customer Churn Prediction and Human Activity
Recognition) are used as case studies.
The experimental results show that the MapReduce-based Neural Network algorithm
trains faster than the WEKA tool on large datasets. Feature selection is also found to
retain suitable accuracy while representing the original features with a minimal
feature subset selected from the problem domain. The proposed system is implemented in
the Java programming language on the Linux platform.
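
As an illustration of the Chi-square feature selection step mentioned above, the
following is a minimal Java sketch of how a single categorical feature could be scored
against the class label. The class name ChiSquareScore, the method name score and the
contingency-table layout (rows are feature values, columns are class labels) are
assumptions made for this example; it is not the implementation of the proposed system.

```java
// Hypothetical helper, for illustration only: scores one categorical feature.
final class ChiSquareScore {

    private ChiSquareScore() { }

    /**
     * Computes the Chi-square statistic of a contingency table.
     * observed[i][j] = number of records with feature value i and class label j.
     * A larger score suggests a stronger dependence between feature and class.
     */
    static double score(long[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;

        // Accumulate the marginal totals of the table.
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        }

        // Sum (observed - expected)^2 / expected over all cells.
        double chi2 = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                if (expected > 0.0) { // skip cells with no expected count
                    double diff = observed[i][j] - expected;
                    chi2 += diff * diff / expected;
                }
            }
        }
        return chi2;
    }
}
```

Under these assumptions, features would be ranked by their scores and only the
highest-ranked ones passed on to the classifier; the number kept is a tuning choice.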