dc.description.abstract |
In the current era of data explosion, the volume of data increases dramatically every year.
Gaining critical business insights by querying and analyzing this vast amount of data
is becoming a challenge for conventional data mining techniques, which are not suited to
processing data beyond the capabilities of traditional systems. The massive numbers of
samples and features in big data bring the curse of dimensionality, which creates issues
such as heavy computational cost and algorithmic instability. Predictive analytics is an
enabler of big data: it uses machine learning algorithms to extract useful knowledge
from large amounts of data. The effectiveness and reliability of a predictive analytics
system's results depend on the quality of the predictive model.
This research aims to develop an efficient Predictive Big data Analytics (PBA) system
that provides valuable information and supports better business decisions in an
efficient and timely manner. To achieve this goal, PBA systems with different
architectures are implemented on big data analytics platforms and examined on large
volumes of data. First, a scalability test is carried out by analyzing the performance of
machine learning on traditional and big data analytics platforms, with the aims of
reducing generalization error and processing massive data. The processing performance
of two analytics engines (MapReduce and Apache Spark) is then compared using a scalable
machine learning algorithm, and Spark is selected as the processing engine because it is
computationally efficient and makes the PBA system relatively easy to implement. For
developing a scalable and high-performance PBA system, model selection is
performed by evaluating the performance of four machine learning
algorithms (Random Forest, Gradient Boosting, Decision Trees, and Linear
Regression). The efficient PBA system is then built on a powerful machine
learning technique, Scalable Random Forests (SRF). To obtain a prediction model with
high accuracy, hyperparameter optimization is performed in SRF. In addition to
mitigating data quality challenges, reducing the dimensionality of the data improves
operational efficiency by minimizing computational and storage costs. A real-time PBA
system is developed to achieve high predictive power in real time. Stock data for
eight U.S. companies are captured in real time, and the system predicts
whether each stock price will rise or fall relative to its price n days earlier. In the RPBA3
system, the input feature variables are technical indicators calculated from the stock
datasets, which help investors decide whether to buy or sell
stocks. Experimental results indicate that the prediction accuracy of the proposed
PBA system is higher than that of the RF algorithm in Spark's scalable machine learning
library. The key finding of this research is that combining SRF
hyperparameter optimization with dimensionality reduction can
considerably improve the efficiency and effectiveness of the system in terms of
accuracy and computational time. |
en_US |