Abstract:
Every day, malware authors create new variants, new innovations, new infection
methods, and more heavily obfuscated malware by using packing and encryption
techniques. Malware classification and detection play an important role in cyber
security research and remain a major challenge. Because the false alarm rate is
increasing, accurate classification and detection of malware is a pressing problem
that still needs to be solved.
This research provides a classification system that differentiates malware from
benign software and classifies malware types. It contributes the Malicious Sample
Names Extraction (MSNE) procedure and the Naming Malicious Samples using
Regular Expression (NMS_RE) technique, which are used to label the malicious
samples. It also contributes the Malware Feature Extraction Algorithm (MFEA),
which identifies the dominant features in the generated report files. The features
are the API, DLL, and PROCESS calls made by malicious and benign executables,
collected through automated analysis. During the experiments, the extracted raw
data were cleansed, the n-gram technique was applied, and the malicious dataset
was represented and prepared for use by the malware classification system.
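
As an illustration of the n-gram step, the following minimal sketch counts
contiguous n-grams over an API call sequence. The trace contents and the helper
name ngrams are assumptions made for this example; they are not taken from the
MFEA procedure or from the datasets used in this research.

```python
from collections import Counter


def ngrams(calls, n):
    """Return every contiguous n-gram (as a tuple) in a call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


# Hypothetical API call trace from one analysis report (illustrative only).
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]

# Count how often each 2-gram occurs; the counts can serve as feature values.
bigram_counts = Counter(ngrams(trace, 2))

for gram, count in bigram_counts.most_common():
    print(" -> ".join(gram), count)
```

Each distinct n-gram then becomes one column of the feature matrix, with its
count (or simple presence) as the value for a given sample.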
This research uses two datasets for malware classification. The Benign Malware
Classification (BMC) dataset is used for binary classification, which identifies
whether a sample is malicious, and the Benign Malware Family Classification
(BMFC) dataset is used for multi-class classification, which identifies the
malware family. The Chi-Square and Principal Component Analysis (PCA) feature
selection methods are applied to select the best features. Classification
algorithms such as k-Nearest Neighbor (kNN), Random Forest (RF), and Support
Vector Classification (SVC) are used for both multi-class and binary
classification. The proposed approach classifies malicious and benign executable
files effectively.
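
To make the Chi-Square selection step concrete, the sketch below scores binary
presence features against a binary label with the classical 2x2 contingency-table
statistic and keeps the top k. The feature names, the toy data, and the helpers
chi_square and select_top_k are hypothetical; the experiments in this research
may rely on a library implementation that computes the statistic differently.

```python
def chi_square(feature, labels):
    """Chi-square statistic of the 2x2 table (binary feature vs. binary label)."""
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A constant feature or a single-class label gives an empty margin: score 0.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (assumed): 1 = malicious, 0 = benign; columns mark feature presence.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "WriteProcessMemory": [1, 1, 1, 0, 0, 0],  # present only in malicious samples
    "CreateFileW":        [1, 0, 1, 1, 0, 1],  # independent of the label
    "kernel32.dll":       [1, 1, 1, 1, 1, 1],  # constant, carries no information
}

print(select_top_k(columns, labels, 1))
```

Features that occur equally often in both classes score zero and are dropped
first, which is the intuition behind using Chi-Square as a filter before kNN,
RF, or SVC training.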
This research performs malware classification using Machine Learning (ML)
classifiers. The experimental findings show that the extracted API_DLL features
yield the best results in terms of accuracy, confusion matrix (CM), True Positive
Rate (TPR), False Positive Rate (FPR), and area under the Receiver Operating
Characteristic (ROC) curve.
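
For reference, the evaluation metrics named above follow directly from the
confusion matrix. The sketch below derives accuracy, TPR, and FPR for the binary
case; the labels and predictions are made-up values for illustration, not
experimental results, and the helper names are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (accuracy, TPR, FPR); a rate with an empty denominator is 0.0."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # malicious samples detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign samples falsely flagged
    return accuracy, tpr, fpr


# Made-up labels and predictions (1 = malicious, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy, tpr, fpr = rates(*counts)
print(counts, accuracy, tpr, fpr)
```

The ROC curve plots TPR against FPR as the decision threshold varies, so the
area under it summarizes the same two rates across all thresholds.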