Abstract:
Malicious websites have become a hub of cybercrime on the internet. Attackers deliver malicious URLs to targeted users via links, emails, or advertisements. Much previous research has examined phishing URL detection, using several approaches to reduce the risk. In this work, we investigate the lexical structure of the URL as input to classification models. The system employs Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Artificial Neural Network (ANN) classifiers to detect malicious URLs. The datasets used to build the proposed system are collected from the PhishTank website. The approach uses static lexical features, which are safer and faster to extract, on an imbalanced dataset. The classifiers achieved accuracies of 88%, 87%, and 88%, respectively. XGBoost achieved a high detection rate, with a false positive rate of 0.13% and a false negative rate of 0.07%. The results show that the imbalanced nature of phishing URL data affects the performance of the detection system.
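To illustrate what static lexical features can look like, the following is a minimal Python sketch that derives a few values from the URL string alone, without fetching the page. The function name `lexical_features` and the particular features chosen are assumptions made for illustration; they are not the feature set used in this work.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    Hypothetical feature set: the features used in the paper may differ.
    """
    # Add a scheme when it is missing so that urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        # Rough count: labels in front of a two-label registered domain.
        "num_subdomains": max(host.count(".") - 1, 0),
    }


# Example URL made up for illustration only.
features = lexical_features("http://secure-login.example.com/verify/account?id=123")
for name, value in features.items():
    print(f"{name}: {value}")
```

Because every value is computed from the string itself, no request is sent to the suspicious host, which is what makes this kind of extraction safe and fast. The resulting dictionary can be turned into a numeric vector and passed to any of the classifiers named above.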