Abstract:
Big data is now widely used in healthcare for disease prediction. Breast
cancer is the most commonly occurring cancer among women worldwide. If the
disease is detected at an early stage, the chance of a cure is better. In this work,
a scalable and fault-tolerant pipeline model is proposed for analyzing large-scale cancer data and
predicting cancerous cells. Large amounts of digital data are now generated
everywhere, every second of the day, and one challenge is the volume of this data combined
with its high dimensionality. Most traditional machine learning algorithms perform poorly in
training time and classification results when finding hidden insights in such high-dimensional
data. The model is developed on the Apache Spark framework using the Random Forest
algorithm, and the data source is the Wisconsin Diagnostic Breast Cancer dataset from the
University of California at Irvine (UCI) Machine Learning Repository. The
Spark-based Random Forest classifier is compared
with Naïve Bayes in terms of accuracy, precision, recall, and F-measure. The analysis of
the evaluation results shows that the proposed system achieves an accuracy of
98.2% in a big data analytics environment. The proposed system is implemented in
the Scala programming language on the Linux platform.
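
For reference, the four evaluation measures named above can be computed from the counts of a binary confusion matrix. The following is a minimal Scala sketch, not the system's actual implementation: the Confusion class, its field names, and the zero-denominator convention are assumptions introduced here for illustration only.

```scala
// Illustrative sketch: evaluation measures from binary confusion-matrix counts.
// The class and field names are hypothetical and do not come from the paper.
final case class Confusion(tp: Long, fp: Long, tn: Long, fn: Long) {

  // Safe division: returns 0.0 when the denominator is zero (an assumed convention).
  private def ratio(num: Long, den: Long): Double =
    if (den == 0L) 0.0 else num.toDouble / den.toDouble

  // Fraction of all predictions that are correct.
  def accuracy: Double = ratio(tp + tn, tp + fp + tn + fn)

  // Fraction of predicted positives that are truly positive.
  def precision: Double = ratio(tp, tp + fp)

  // Fraction of actual positives that are predicted positive.
  def recall: Double = ratio(tp, tp + fn)

  // Harmonic mean of precision and recall (the F1 form of the F-measure).
  def fMeasure: Double = {
    val p = precision
    val r = recall
    if (p + r == 0.0) 0.0 else 2.0 * p * r / (p + r)
  }
}
```

As a usage example with made-up counts (not results from this work), Confusion(tp = 40, fp = 10, tn = 45, fn = 5) gives an accuracy of 0.85 and a precision of 0.8. In a Spark pipeline, these counts would typically be obtained by comparing the predicted and true label columns of the test set.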