Abstract:
Nowadays, data is growing extremely fast into "big data": voluminous amounts of
structured, semi-structured, and unstructured data with high potential to be mined
for valuable information in the decision-making process. Analyzing big data with
traditional data analysis methods has become a key challenge in data analytics
research. In addition, high-dimensional data analytics has attracted great attention
in the big data era because the dimensionality of datasets is continuously growing.
This creates a critical issue: how to efficiently reduce the full set of diverse,
raw data dimensions to a subset that still yields valuable information for the
decision-making process. With increasing volumes of data, classical dimensionality
reduction algorithms, which are designed to work well with small-scale data, usually
face a scalability bottleneck. Although Principal Component Analysis (PCA) can be
applied as a dimensionality reduction algorithm on high-dimensional data, it must be
transformed into scalable PCA (sPCA) for high-dimensional big data.
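As a minimal, single-machine illustration of the classical algorithm that sPCA reformulates (this is not the thesis implementation, and the function name and toy data are assumptions for the example), PCA via SVD can be sketched as:

```python
import numpy as np

def pca_reduce(X, k):
    """Classical in-memory PCA via SVD: project X onto its top-k principal
    components. At big-data scale the full matrix X cannot be held on a
    single node, which is the bottleneck sPCA is designed to remove."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores on the k leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # 200 observations, 50 dimensions
Z = pca_reduce(X, k=5)
print(Z.shape)  # (200, 5)
```

Because the components are ordered by singular value, the projected dimensions carry decreasing variance, which is why truncating to k components loses little information.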
When constructing an efficient prediction model, Multiple Linear Regression
(MLR), redundant and irrelevant features or data dimensions are highly likely to
introduce noise and bias that can hinder the prediction process of the model. In
this research, a two-stage dimension reduction approach is proposed for the MLR
model. Firstly, scalable Principal Component Analysis (sPCA) is proposed to solve
the storage and computational problems of PCA by reducing the number of redundant
dimensions without much loss of information. Secondly, the Pearson Correlation
Coefficient (PCC) is applied to examine whether the reduced feature subset produced
by the sPCA stage is correlated with the output variable of the MLR model, thereby
reducing the number of irrelevant dimensions. Although the high dimensionality of
the voluminous input data matrix has been reduced, how to split or decompose this
voluminous matrix, which still contains a large number of observations or data
records, remains a significant issue. Therefore, QR decomposition is proposed to
factor the large-scale matrix X into the product of an orthogonal matrix Q and an
upper triangular matrix R for the MLR model.
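A small-scale sketch of these two steps, assuming in-memory NumPy rather than the distributed implementation described in the thesis (the function names and the correlation cutoff are illustrative assumptions):

```python
import numpy as np

def pcc_filter(Z, y, threshold=0.3):
    """Stage 2: keep only the components whose absolute Pearson correlation
    with the target y exceeds a threshold (the cutoff here is illustrative)."""
    r = np.array([np.corrcoef(Z[:, j], y)[0, 1] for j in range(Z.shape[1])])
    return Z[:, np.abs(r) > threshold]

def mlr_fit_qr(X, y):
    """Fit MLR coefficients via QR: factor X = QR, then back-substitute
    R beta = Q^T y, avoiding the explicit normal equations (X^T X)^{-1} X^T y."""
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)
```

`np.linalg.qr` performs the factorization on a single node; the thesis applies the same algebra to a matrix distributed across the cluster. Solving through Q and R is also better conditioned than forming X^T X directly.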
In this research, the high-dimensional data reduction supporting predictive big
data analytics is implemented on a distributed big data analytics platform, "Cloudera
Distribution Hadoop (CDH)", using a multi-node Cloudera cluster with three
computing nodes (VMs), all interconnected through Cloudera Manager. Three diverse
high-dimensional big data sources are applied, not only to evaluate the proposed
approaches but also to obtain predictive analysis results from the system. Firstly,
geospatial big data, OpenStreetMap in XML format (OSM XML), is used to obtain
"One-way Roads" predictions. Then, high-resolution (high-dimensional) images from
MS-Celeb-A, a large-scale face attributes dataset, are used to predict the "Number
of Faces" in these images. Finally, raw, unstructured text data from the
"DeliciousMIL" dataset from UCI is applied as input text documents to obtain
"Number of Documents (Education, Science & Technology, Culture & History)"
prediction results.
According to the evaluation analysis, the proposed sPCA can efficiently perform
dimension reduction as the size or number of data dimensions increases across
diverse data types. It also shows good scalability in settings where the
traditional PCA fails with "Out of Memory" errors. Applying the proposed two-stage
approach (sPCA and PCC) achieves 99 percent accuracy for "One-way Roads"
prediction. Furthermore, the QR decomposition approach supporting the MLR model
offers faster execution time for the system. Therefore, the proposed system
provides better scalability, higher prediction accuracy, and faster execution time
in predictive analytics on high-dimensional big data.