Abstract:
Nowadays, data is growing extremely fast into "big data": voluminous amounts of
structured, semi-structured, and unstructured data with high potential to be mined
for valuable information in the decision-making process. Analyzing big data with
traditional data analysis methods has become a key challenge in data analytics
research. In addition, high-dimensional data analytics has attracted great attention
in the big data era because the dimensionality of datasets is continuously growing.
This creates a critical issue: how to efficiently reduce the full set of diverse,
raw data dimensions to a subset that still yields valuable information for the
decision-making process. With increasing volumes of data, classical dimensionality
reduction algorithms, which are designed to work well with small-scale data, usually
face a scalability bottleneck. Although Principal Component Analysis (PCA) can be
applied as a dimensionality reduction algorithm on high-dimensional data, it must be
transformed into scalable PCA (sPCA) for high-dimensional big data.
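As a minimal, single-machine illustration of the classical algorithm that sPCA reformulates (this is not the thesis implementation, and the function name and toy data are assumptions for the example), PCA via SVD can be sketched as:

```python
import numpy as np

def pca_reduce(X, k):
    """Classical in-memory PCA via SVD: project X onto its top-k principal
    components. At big-data scale the full matrix X cannot be held on a
    single node, which is the bottleneck sPCA is designed to remove."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores on the k leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # 200 observations, 50 dimensions
Z = pca_reduce(X, k=5)
print(Z.shape)  # (200, 5)
```

Because the components are ordered by singular value, the projected dimensions carry decreasing variance, which is why truncating to k components loses little information.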
When constructing an efficient prediction model, Multiple Linear Regression
(MLR), redundant and irrelevant features or data dimensions are highly likely to
introduce noise and bias that can hinder the prediction process of the model. In
this research, a two-stage dimension reduction approach is proposed for the MLR
model. Firstly, scalable Principal Component Analysis (sPCA) is proposed to solve
the storage and computational problems of PCA by reducing the number of redundant
dimensions without much loss of information. Secondly, the Pearson Correlation
Coefficient (PCC) is applied to examine whether the reduced feature subset produced
by the sPCA stage is correlated with the output variable of the MLR model, thereby
reducing the number of irrelevant dimensions. Although the high dimensionality of
the voluminous input data matrix has been reduced, how to split or decompose this
voluminous matrix, which still contains a large number of observations or data
records, remains a significant issue. Therefore, QR decomposition is proposed to
factor the large-scale matrix X into the product of an orthogonal matrix Q and an
upper triangular matrix R for the MLR model.
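A small-scale sketch of these two steps, assuming in-memory NumPy rather than the distributed implementation described in the thesis (the function names and the correlation cutoff are illustrative assumptions):

```python
import numpy as np

def pcc_filter(Z, y, threshold=0.3):
    """Stage 2: keep only the components whose absolute Pearson correlation
    with the target y exceeds a threshold (the cutoff here is illustrative)."""
    r = np.array([np.corrcoef(Z[:, j], y)[0, 1] for j in range(Z.shape[1])])
    return Z[:, np.abs(r) > threshold]

def mlr_fit_qr(X, y):
    """Fit MLR coefficients via QR: factor X = QR, then back-substitute
    R beta = Q^T y, avoiding the explicit normal equations (X^T X)^{-1} X^T y."""
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)
```

`np.linalg.qr` performs the factorization on a single node; the thesis applies the same algebra to a matrix distributed across the cluster. Solving through Q and R is also better conditioned than forming X^T X directly.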
In this research, the high-dimensional data reduction supporting predictive big
data analytics is implemented on a distributed big data analytics platform, "Cloudera
Distribution Hadoop (CDH)", using a multi-node Cloudera cluster with three
computing nodes (VMs), all interconnected through Cloudera Manager. Three diverse
high-dimensional big data sources are applied, not only to evaluate the proposed
approaches but also to obtain predictive analysis results from the system. Firstly,
geospatial big data, OpenStreetMap in XML format (OSM XML), is used to obtain
"One-way Roads" predictions. Then, high-resolution (high-dimensional) images from
MS-Celeb-A, a large-scale face attributes dataset, are used to predict the "Number
of Faces" in these images. Finally, raw, unstructured text data from the
"DeliciousMIL" dataset from UCI is applied as input text documents to obtain
"Number of Documents (Education, Science & Technology, Culture & History)"
prediction results.
According to the evaluation analysis, the proposed sPCA can efficiently perform
dimension reduction as the size or number of data dimensions increases across
diverse data types. It also shows good scalability in settings where the
traditional PCA fails with "Out of Memory" errors. Applying the proposed two-stage
approach (sPCA and PCC) achieves 99 percent accuracy for "One-way Roads"
prediction. Furthermore, the QR decomposition approach supporting the MLR model
offers faster execution time for the system. Therefore, the proposed system
provides better scalability, higher prediction accuracy, and faster execution time
in predictive analytics on high-dimensional big data.