dc.description.abstract |
Microarray technology can monitor the expression of thousands of genes under a variety
of biological conditions. Microarray data are high-dimensional: the number of features is
very large relative to the number of samples. Because gene expression data are
high-dimensional, multiclass and complex, many functional relations remain unknown and
undiscovered in the physical delivery system used for collecting the data itself. When
analyzing a high-dimensional microarray dataset, it is critical to identify the specific
genes of interest that are responsible for causing cancer.
Generally, the selected attributes are not balanced in terms of their representation per
class, which can affect classification. To compensate for this weakness of chi-square
selection, which can leave some classes with no selected attributes, an upgraded
chi-square algorithm (UpgCHI) is proposed to balance the number of gene attributes
selected per class.
The proposed model is applied to five microarray datasets, including Leukemia, four-class
Tumor and DLBCL cancer. The proposed method calculates the chi-square value of each gene
attribute for each class. Attributes belonging to the same class are sorted by chi-square
value, and the features with the highest values in each class are selected. Given the gene
attribute selection threshold, the number of top attributes taken from each class follows
the ratio of that class's gene records. After
selecting the necessary features, the next step of this research is to implement several
scalable classifiers efficiently. The classifiers Logistic Regression (LR), Random Forest
and Naïve Bayes are evaluated on the scalable framework Spark, and the outcomes are
analyzed. The results show that the scalable framework executes much more efficiently
than traditional systems when processing large datasets. To evaluate the
performance of the proposed system, UpgCHI is compared with three other univariate
feature selection methods: the original chi-square, Linear Regression and ANOVA. The
results show that the upgraded chi-square algorithm provides better performance on
scalable frameworks in terms of classification accuracy, precision, recall and F1-score. |
en_US |