Email Spam Detection with Machine Learning

Email spam, also called junk email, refers to unsolicited messages sent in bulk by email (spamming).

In this data science project, I will show you how to detect email spam using machine learning and Natural Language Processing in Python.

So this program will detect whether an email is spam (1) or not (0).

Import the libraries:

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Load the data and print the first 5 rows:

You can download this data set from here.

df = pd.read_csv("emails.csv")
df.head()
#Output
                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger fanny i...      1
2  Subject: unbelievable new homes made easy im ...      1
3  Subject: 4 color printing special request add...      1
4  Subject: do not have money , get software cds ...     1

Now let's explore the data and get the number of rows and columns:

df.shape

#Output- (5728, 2)
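It is also worth checking how many emails of each class the data set contains. This is a small addition of my own, using pandas' value_counts (the exact counts will depend on your copy of the data):

# number of spam (1) and not-spam (0) emails
print(df['spam'].value_counts())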

To get the column names in the data set:

df.columns

#Output- Index(['text', 'spam'], dtype='object')

To check for duplicates and remove them:

df.drop_duplicates(inplace=True)
print(df.shape)

#Output- (5695, 2)

To see the number of missing values in each column:

print(df.isnull().sum())

#Output-
text 0
spam 0
dtype: int64

Now download the stop words:

Stop words in natural language processing are common words (such as "the", "a", and "is") that carry little useful information, so we remove them from the text.

# download the stopwords package
nltk.download("stopwords")
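As a quick sanity check (a small sketch of my own, not part of the original walkthrough), you can look at the English stop words that NLTK provides:

# the list of English stop words shipped with NLTK
english_stops = stopwords.words('english')
print(len(english_stops))   # how many stop words the list contains
print(english_stops[:10])   # a small sample, words such as "i", "me", "my"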

Now create a function to clean the text and return the tokens. We clean the text by first removing punctuation and then removing the useless words, also known as stop words.

def process(text):
    # remove all punctuation characters
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # remove stop words and return the remaining tokens
    clean = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    return clean
# to show the tokenization
df['text'].head().apply(process)
#Output
0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
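To make the effect of the function concrete, here is a toy example of my own (the sentence is made up); the punctuation and stop words such as "a" and "it" are stripped out, leaving only the informative tokens:

# run the cleaning function on a single made-up message
print(process("Congratulations!!! You have won a FREE ticket, claim it now."))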

Now convert the text into a matrix of token counts:

from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
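To get an intuition for what this produces, here is a tiny self-contained example of my own (a toy corpus, not the email data) showing how CountVectorizer turns text into token counts:

from sklearn.feature_extraction.text import CountVectorizer

# a toy corpus just for illustration
toy = ["free money now", "meeting at noon", "free free offer"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)

print(cv.vocabulary_)     # maps each token to its column index
print(counts.toarray())   # one row per document, one column per token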

Now we need to split the data into training and testing sets. Later we will use the test set to check whether the model's predictions match the actual values.

#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

Now we need to create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features such as word counts.

# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
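As an aside (a sketch of my own, not part of the original walkthrough), once the classifier is trained you can score a brand-new email. The code above only keeps the transformed matrix, so here I fit a separate CountVectorizer on the same text to get a reusable transform; the example email is made up:

# fit a reusable vectorizer on the same text the model was built from
vectorizer = CountVectorizer(analyzer=process).fit(df['text'])

# a made-up email, just for illustration
new_email = ["Subject: congratulations , you have won a free prize , claim it now"]
new_counts = vectorizer.transform(new_email)

print(classifier.predict(new_counts))   # 1 means spam, 0 means not spam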

To see the classifier's predictions and the actual values on the training data set:

print(classifier.predict(xtrain))
print(ytrain.values)

#Output-
[0 0 0 … 0 0 0]
[0 0 0 … 0 0 0]

Now let's see how well our model performed on the training data by printing the classification report, confusion matrix, and accuracy score.

# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))
#Output
                precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666

It looks like the model is 99.71% accurate on the training data: out of 4,556 training emails, only 12 non-spam messages were flagged as spam and 1 spam message slipped through. Now let's test the model on the test data set (xtest & ytest) by printing the predicted values and the actual values to see if it can accurately classify unseen email text.

#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)

#Output-
[1 0 0 … 0 0 0]
[1 0 0 … 0 0 0]

Now let's evaluate the model on the test data set:

# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))
#Output
                precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615

The classifier identified the email messages as spam or not spam with 99.2% accuracy on the test data, with only 8 false positives and 1 false negative out of 1,139 test messages.
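As a closing note, the same steps can also be packaged in a scikit-learn Pipeline so that the vectorizer is fitted only on the training text, which keeps any information from the test set out of the vocabulary. This is a variant of my own, not part of the original walkthrough:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# split the raw text first, then fit everything on the training portion only
text_train, text_test, y_train, y_test = train_test_split(
    df['text'], df['spam'], test_size=0.20, random_state=0)

pipeline = Pipeline([
    ('counts', CountVectorizer(analyzer=process)),   # text -> token counts
    ('nb', MultinomialNB()),                         # Naive Bayes on the counts
])
pipeline.fit(text_train, y_train)
print("Test accuracy:", accuracy_score(y_test, pipeline.predict(text_test)))

Packaging the steps this way also makes it easy to reuse the trained model on new emails with pipeline.predict([...]).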
