Email spam, also called junk email, consists of unsolicited messages sent in bulk by email (spamming).
In this data science project, I will show you how to detect email spam using Python, natural language processing (NLP), and a machine learning classifier.
The program will predict whether an email is spam (1) or not (0).
Import the libraries:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
Load the data and print the first 5 rows:
You can download this data set from here.
df = pd.read_csv("emails.csv")
df.head()
#Output
                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger fanny i...      1
2  Subject: unbelievable new homes made easy im ...      1
3  Subject: 4 color printing special request add...      1
4  Subject: do not have money , get software cds ...     1
Now let’s explore the data and get the number of rows and columns:
df.shape
#Output- (5728, 2)
To get the column names in the data set:
df.columns
#Output- Index(['text', 'spam'], dtype='object')
To check for duplicates and remove them:
df.drop_duplicates(inplace=True)
df.shape
#Output- (5695, 2)
To see the number of missing values in each column:
df.isnull().sum()
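These checks are easier to follow on a tiny frame where the answer can be verified by eye. The sketch below is only an illustration: the `toy` DataFrame and its rows are made up and are not taken from emails.csv.

```python
import pandas as pd

# Hypothetical toy data: row 2 duplicates row 0, and one label is missing.
toy = pd.DataFrame({
    "text": ["Subject: hello", "Subject: win cash", "Subject: hello", "Subject: lunch?"],
    "spam": [0, 1, 0, None],
})

print(toy.shape)                  # rows and columns before cleaning

deduped = toy.drop_duplicates()   # drops rows that repeat an earlier row exactly
print(deduped.shape)

missing = deduped.isnull().sum()  # count of missing values in each column
print(missing)
```

Here one duplicate row is removed and the `spam` column reports one missing value.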
Now download the stop words:
In natural language processing, stop words are very common words (such as "the", "a", or "is") that carry little meaning for the task and are usually filtered out.
# download the stopwords package
nltk.download("stopwords")
Now create a function that cleans the text and returns the tokens. The cleaning is done in two steps: first remove the punctuation, then remove the stop words.
def process(text):
    # Remove punctuation characters
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Load the stop word list once per call instead of once per word
    stop_words = set(stopwords.words('english'))
    # Remove stop words (case-insensitive comparison)
    clean = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean

# to show the tokenization
df['text'].head().apply(process)
#Output
0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
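To see the two cleaning steps on a single string without downloading anything, here is a minimal sketch. It assumes a small hand-picked stop word set: `DEMO_STOP_WORDS` and `process_demo` are hypothetical names used only for this illustration, while the real function relies on NLTK's full English list.

```python
import string

# Assumption: a tiny stand-in for stopwords.words('english'), for illustration only.
DEMO_STOP_WORDS = {"do", "not", "have", "the", "a", "your", "is", "to"}

def process_demo(text):
    # Step 1: drop every punctuation character.
    nopunc = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and drop stop words (case-insensitive).
    return [word for word in nopunc.split() if word.lower() not in DEMO_STOP_WORDS]

tokens = process_demo("Subject: do not have money, get software cds!")
print(tokens)
```

The colon, comma, and exclamation mark disappear in the first step, and "do", "not", and "have" are filtered out in the second.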
Now convert the text into a matrix of token counts:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
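A matrix of token counts has one row per email and one column per distinct token, and each cell holds how often that token occurs in that email. The sketch below builds such a matrix by hand with `collections.Counter` to show the idea. The `docs` list is made up, and this is not how CountVectorizer is implemented (it returns a sparse matrix, for example).

```python
from collections import Counter

# Hypothetical, already-tokenized emails (the kind of lists process() returns).
docs = [
    ["free", "money", "free"],
    ["meeting", "monday"],
    ["free", "meeting"],
]

# One column per distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per email: how many times each vocabulary token occurs in it.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in matrix:
    print(row)
```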
Now we need to split the data into training and testing sets. The model is trained on the training set only, and the testing set is held back so that we can later make predictions on it and check whether they match the actual values.
# split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)
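As a quick illustration of what the split returns, the sketch below applies `train_test_split` with the same `test_size` and `random_state` to a made-up array of 10 samples. The arrays `X` and `y` are hypothetical and exist only to make the shapes easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus 10 labels.
# Row i of X is [2*i, 2*i + 1], so each row can be traced back to its label.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)  # 8 training rows and 2 testing rows
```

Features and labels are shuffled together, so every row keeps its own label after the split.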
Now we need to create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features such as token counts.
# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
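To make "discrete features" concrete, here is a minimal pure-Python sketch of the idea behind a multinomial Naive Bayes classifier on token counts, using add-one smoothing. It is an illustration under stated assumptions, not scikit-learn's implementation: the training emails and the names `log_score` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized training emails with labels (1 = spam, 0 = not spam).
train = [
    (["win", "free", "money", "now"], 1),
    (["free", "prize", "claim", "now"], 1),
    (["meeting", "agenda", "monday"], 0),
    (["project", "report", "attached"], 0),
]

vocab = {tok for tokens, _ in train for tok in tokens}
class_docs = Counter(label for _, label in train)           # emails per class
token_counts = {label: Counter() for label in class_docs}   # token totals per class
for tokens, label in train:
    token_counts[label].update(tokens)

def log_score(tokens, label, alpha=1.0):
    """Log prior plus summed log likelihoods, with add-alpha smoothing."""
    total = sum(token_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for tok in tokens:
        if tok in vocab:  # tokens never seen in training are ignored
            score += math.log(
                (token_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            )
    return score

def predict(tokens):
    # Pick the class with the highest score.
    return max(class_docs, key=lambda label: log_score(tokens, label))

print(predict(["claim", "free", "money"]))        # spam-like tokens
print(predict(["project", "meeting", "monday"]))  # work-like tokens
```

Each class is scored by how often its training emails contained the observed tokens, which is why count-based features fit this model well.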
To see the classifier's predictions and the actual values on the training data set:
print(classifier.predict(xtrain))
print(ytrain.values)
[0 0 0 … 0 0 0]
[0 0 0 … 0 0 0]
Now let’s see how well our model performed by evaluating the Naive Bayes classifier with a classification report, a confusion matrix, and an accuracy score.
# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))
#Output
              precision    recall  f1-score   support
           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099
    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556
Confusion Matrix:
 [[3445   12]
 [   1 1098]]
Accuracy:
 0.9971466198419666
The model is 99.71% accurate on the training data, which it has already seen. Let’s test the model on the test data set (xtest, with labels ytest) by printing the predicted values and the actual values to see if the model can accurately classify email text it was not trained on.
# print the predictions
print(classifier.predict(xtest))
# print the actual values
print(ytest.values)
[1 0 0 … 0 0 0]
[1 0 0 … 0 0 0]
Now let’s evaluate the model on the test data set:
# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))
#Output
              precision    recall  f1-score   support
           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269
    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139
Confusion Matrix:
 [[862   8]
 [  1 268]]
Accuracy:
 0.9920983318700615
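The reported numbers can be recomputed by hand from the two confusion matrices. The sketch below assumes the layout scikit-learn prints (rows are actual labels, columns are predicted labels, both in the order 0, 1). The helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Return accuracy, plus precision and recall for the spam class (label 1).

    Assumes rows are actual labels and columns are predicted labels,
    both in the order [0, 1].
    """
    (tn, fp), (fn, tp) = matrix
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the emails flagged as spam, how many were spam
    recall = tp / (tp + fn)     # of the actual spam emails, how many were flagged
    return accuracy, precision, recall

train_metrics = metrics_from_confusion([[3445, 12], [1, 1098]])
test_metrics = metrics_from_confusion([[862, 8], [1, 268]])
print("train:", train_metrics)
print("test: ", test_metrics)
```

Read this way, the test matrix says that 8 legitimate emails were flagged as spam and 1 spam email was missed.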
The classifier identified the email messages as spam or not spam with 99.2% accuracy on the test data.
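To classify a brand-new email, the fitted vectorizer has to be kept so that new text is mapped onto the same columns the model was trained on. The sketch below shows that pattern on made-up data. As assumptions for brevity, it uses CountVectorizer's default analyzer instead of the process function, and the training texts and the names `vectorizer`, `model`, and `new_emails` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = not spam).
texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "project report attached",
]
labels = [1, 1, 0, 0]

# Keep the fitted vectorizer so new text maps to the same columns.
vectorizer = CountVectorizer()  # default analyzer, not process()
features = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(features, labels)

# transform() reuses the learned vocabulary; unseen words are ignored.
new_emails = ["claim your free money", "project meeting on monday"]
predictions = model.predict(vectorizer.transform(new_emails))
print(predictions)
```

On real data the same pattern would apply with the vectorizer fitted on the email text column, but results on such a tiny toy set say nothing about real-world accuracy.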