UCSY's Research Repository

AN UPGCHI FEATURE SELECTION METHOD FOR MULTICLASS MICROARRAY DATA WITH APACHE SPARK FRAMEWORK


dc.contributor.author THANK, LWIN MAY
dc.date.accessioned 2024-07-11T05:13:52Z
dc.date.available 2024-07-11T05:13:52Z
dc.date.issued 2024-06
dc.identifier.uri https://onlineresource.ucsy.edu.mm/handle/123456789/2803
dc.description.abstract Microarray technology can monitor the expression of thousands of genes under a variety of biological conditions. Microarray data is high-dimensional: the number of features is very large relative to the number of samples. Because gene expression data is high-dimensional, multiclass and complex, many functional relations in the physical delivery system used to collect the data remain unknown. When analyzing a high-dimensional microarray dataset, it is critical to identify the specific genes of interest that are responsible for cancer. In general, the selected attributes are not balanced in terms of how well they represent each class, which can affect classification. To compensate for this weakness of chi-square selection, which can leave some classes with no selected attributes, an upgraded chi-square algorithm (UpgCHI) is proposed to balance the number of gene attributes selected per class. The proposed model is implemented on five microarray datasets, including Leukemia, a four-class Tumor dataset and DLBCL. The proposed method calculates the chi-square value of each gene attribute for each class. Attributes belonging to the same class are sorted by chi-square value, and the features with the highest values in each class are selected. Given a gene attribute selection threshold, the number of top attributes taken from each class is determined by the ratio of that class's gene records. After feature selection, the next step of this research is an efficient implementation of several scalable classifiers. The classifiers Logistic Regression (LR), Random Forest and Naïve Bayes are evaluated on the scalable framework Spark. The proposed scalable models are tested on Spark and the outcomes are analyzed. 
The collected results show that execution on the scalable framework is much more efficient than on traditional systems for processing large datasets. To evaluate the performance of the proposed system, UpgCHI is compared with three other univariate feature selection methods: the original Chi-square, Linear Regression and ANOVA. The results show that the upgraded chi-square algorithm provides better performance on scalable frameworks in terms of classification accuracy, precision, recall and F1-score. en_US
dc.language.iso en en_US
dc.publisher University of Computer Studies, Yangon en_US
dc.subject UpgCHI Feature Selection Method en_US
dc.title AN UPGCHI FEATURE SELECTION METHOD FOR MULTICLASS MICROARRAY DATA WITH APACHE SPARK FRAMEWORK en_US
dc.type Thesis en_US
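
The per-class selection step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, and it runs in plain Python rather than on Spark. It assumes a one-vs-rest chi-square score on non-negative feature values and a per-class quota proportional to each class's share of the samples, which is only one possible reading of "the ratio values of their gene's records". The function names `per_class_chi2` and `select_balanced` and the toy data are hypothetical.

```python
from collections import Counter


def per_class_chi2(X, y):
    """Return {class: [chi-square score per feature]} using a one-vs-rest split.

    X is a list of rows with non-negative feature values; y holds class labels.
    Assumption: observed and expected values are sums of feature values.
    """
    n_samples = len(X)
    n_features = len(X[0])
    counts = Counter(y)
    totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = {}
    for c, n_c in counts.items():
        p_c = n_c / n_samples  # share of samples in class c
        class_scores = []
        for j in range(n_features):
            observed_in = sum(row[j] for row, label in zip(X, y) if label == c)
            observed_out = totals[j] - observed_in
            expected_in = p_c * totals[j]
            expected_out = (1 - p_c) * totals[j]
            score = 0.0
            for o, e in ((observed_in, expected_in), (observed_out, expected_out)):
                if e > 0:
                    score += (o - e) ** 2 / e
            class_scores.append(score)
        scores[c] = class_scores
    return scores


def select_balanced(X, y, k):
    """Select about k features so that every class contributes at least one.

    Assumption: each class gets a quota proportional to its sample share.
    """
    scores = per_class_chi2(X, y)
    counts = Counter(y)
    n_samples = len(y)
    selected = []
    for c in sorted(counts):
        quota = max(1, round(k * counts[c] / n_samples))
        ranked = sorted(range(len(scores[c])),
                        key=lambda j: scores[c][j], reverse=True)
        picked = 0
        for j in ranked:
            if picked == quota:
                break
            if j not in selected:  # skip features already taken by another class
                selected.append(j)
                picked += 1
    return sorted(selected)


# Toy data: feature 0 marks class "a", feature 1 class "b", feature 2 class "c";
# feature 3 is constant and carries no class information.
X_toy = [
    [9, 1, 1, 5], [8, 1, 1, 5],
    [1, 9, 1, 5], [1, 8, 1, 5],
    [1, 1, 9, 5], [1, 1, 8, 5],
]
y_toy = ["a", "a", "b", "b", "c", "c"]
chosen = select_balanced(X_toy, y_toy, k=3)
```

Ranking features separately for each class, instead of by one global score, is what guarantees that no class ends up without selected attributes, which is the imbalance the abstract attributes to the original chi-square method.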

