Abstract:
Email is one of the most widely used means of exchanging data and information between digital devices, relied on by millions of users worldwide. Unwanted email, or spam, has become a serious problem for large corporations and organizations as its volume grows rapidly each year. Spam is both a nuisance and potentially harmful: it consumes computer, server, and network resources, creating bottlenecks and slowing down devices, and users spend a significant amount of time deleting unsolicited messages. Spam is also a major cybersecurity concern, with research indicating that 91% of cyberattacks begin with a malicious email, while 97% of users fail to detect such emails accurately. To reduce these risks, this work proposes a machine learning-based approach to spam email categorization. The dataset is preprocessed with Natural Language Processing (NLP) techniques, and features are then extracted using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. To find the best machine learning model for spam detection, five models are trained and evaluated: Logistic Regression, Random Forest Classifier, Decision Tree Classifier, K-Nearest Neighbours Classifier, and Multi-Layer Perceptron (MLP) Classifier. The models are compared using accuracy, precision, recall, and F1-score. The findings identify the most accurate of the five models, which offers a dependable basis for automatic spam filtering. By providing an effective spam categorization model that can help shield users from phishing and other online threats, this study contributes to improved email security.
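The workflow summarised above (text preprocessing, TF-IDF features, five classifiers, four metrics) could be sketched roughly as follows with scikit-learn. This is an illustrative sketch only, not the study's implementation: the toy messages, the choice of lowercasing and English stop-word removal as the preprocessing step, the hyperparameters, and the helper names `build_models` and `evaluate` are all assumptions introduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus (1 = spam, 0 = legitimate); not the paper's dataset.
TRAIN_TEXTS = [
    "Win a free prize now, click here",
    "Cheap loans approved instantly, claim your cash",
    "Urgent: verify your account to claim free money",
    "Exclusive offer, free vacation, click to claim",
    "Meeting moved to Monday morning, agenda attached",
    "Please review the project report before Friday",
    "Lunch tomorrow with the project team?",
    "Notes from the budget review meeting attached",
]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
TEST_TEXTS = [
    "Claim your free cash prize now",
    "Agenda for the Friday project meeting",
]
TEST_LABELS = [1, 0]


def build_models():
    """Return the five classifier families named in the abstract (assumed settings)."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
        "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    }


def evaluate(models, train_texts, train_labels, test_texts, test_labels):
    """Fit a TF-IDF + classifier pipeline per model and collect the four metrics."""
    results = {}
    for name, clf in models.items():
        # Assumed preprocessing: lowercasing and English stop-word removal.
        pipeline = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words="english"), clf
        )
        pipeline.fit(train_texts, train_labels)
        predicted = pipeline.predict(test_texts)
        results[name] = {
            "accuracy": accuracy_score(test_labels, predicted),
            "precision": precision_score(test_labels, predicted, zero_division=0),
            "recall": recall_score(test_labels, predicted, zero_division=0),
            "f1": f1_score(test_labels, predicted, zero_division=0),
        }
    return results


if __name__ == "__main__":
    scores = evaluate(build_models(), TRAIN_TEXTS, TRAIN_LABELS, TEST_TEXTS, TEST_LABELS)
    for model_name, metrics in scores.items():
        print(model_name, metrics)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training messages only, so the held-out messages do not leak into feature extraction. Scores on a toy corpus this small say nothing about real performance; a labelled email dataset with a proper train/test split would be needed for a meaningful comparison.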