Integration of natural language processing methods and machine learning model for malicious webpage detection based on web contents

Shaheetha Liaquathali, Vadivazhagan Kadirvelu

Abstract


Malicious actors continually exploit vulnerabilities in web systems to distribute malware, launch phishing attacks, steal sensitive information, and commit other forms of cybercrime. Traditional signature-based methods for detecting malicious webpages struggle to keep pace with the rapid evolution of malware and cyber threats, so there is growing demand for proactive approaches that identify malicious web content from its characteristics and behavior. Content-based detection is crucial because malicious webpages can be designed to mimic legitimate ones, which makes them difficult to identify by traditional means. Analyzing webpage content can uncover patterns, anomalies, and malicious intent that are not evident from surface-level inspection. The proposed approach integrates a pretrained Word2Vec model with seven distinct machine learning classifiers to enhance malicious webpage detection. First, each web content document is encoded with the Word2Vec model, and its word embeddings are averaged to produce one feature vector per document. Each classifier is then trained on these average Word2Vec embedding features. The results show that the Word2Vec model significantly improves detection accuracy: the random forest classifier achieves an accuracy of 94.8% and an F1-score of 94.9%, and the extreme gradient boosting classifier achieves an accuracy of 94.6% and an F1-score of 94.7%.
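
The document-encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_VECTORS` table is a hypothetical three-dimensional stand-in for a pretrained Word2Vec vocabulary, and the `tokenize` and `average_embedding` helpers are names assumed here for illustration. Skipping out-of-vocabulary tokens and returning a zero vector for documents with no known tokens are also assumptions, since the abstract does not say how such cases are handled.

```python
from typing import Dict, List

# Hypothetical toy vectors standing in for a pretrained Word2Vec vocabulary.
# A real pretrained model would map each word to a few hundred dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "login": [0.9, 0.1, 0.0],
    "password": [0.8, 0.2, 0.1],
    "verify": [0.7, 0.3, 0.0],
    "news": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.8, 0.3],
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep whitespace-separated alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def average_embedding(
    tokens: List[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of all in-vocabulary tokens.

    Out-of-vocabulary tokens are skipped (an assumption). If no token is
    known, a zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # token not in the embedding vocabulary
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


# One fixed-length feature vector per document, whatever its length.
doc = "Verify your login password"
features = average_embedding(tokenize(doc), TOY_VECTORS, dim=3)
print(features)
```

Each document is reduced to a single vector of the same length, so the resulting feature matrix could be passed to any standard classifier, such as the random forest or extreme gradient boosting models named in the abstract.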

Keywords


Cybercrime; Ensemble; Machine learning; Malicious webpage detection; Phishing; Stacking; Word2Vec embedding


DOI: http://doi.org/10.11591/ijra.v14i1.pp47-57


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

IAES International Journal of Robotics and Automation (IJRA)
ISSN 2089-4856, e-ISSN 2722-2586
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
