
This article is based on SMS spam detection, a text classification task with Machine Learning. I will be using the multinomial Naive Bayes implementation.
This particular classifier is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes in integer word counts as its input.
On the other hand, Gaussian Naive Bayes is better suited for continuous data, as it assumes that the input data has a Gaussian (normal) distribution.
SMS Spam Detection
Let's start by importing the libraries:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
Download and read the data set
df_sms = pd.read_csv('spam.csv', encoding='latin-1')
df_sms.head()

Dropping the unwanted columns Unnamed: 2, Unnamed: 3 and Unnamed: 4:
df_sms = df_sms.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df_sms = df_sms.rename(columns={"v1": "label", "v2": "sms"})
df_sms.head()

Checking the total number of messages in the dataset:
print(len(df_sms))
Counting the number of observations for each label, ham and spam:
df_sms.label.value_counts()
ham     4825
spam     747
Name: label, dtype: int64
df_sms.describe()
       label                     sms
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
df_sms['length'] = df_sms['sms'].apply(len)
df_sms.head()
  label                                                sms  length
0   ham  Go until jurong point, crazy.. Available only ...     111
1   ham                      Ok lar... Joking wif u oni...      29
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155
3   ham  U dun say so early hor... U c already then say...      49
4   ham  Nah I don't think he goes to usf, he lives aro...      61
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_sms['length'].plot(bins=50, kind='hist')

df_sms.hist(column='length', by='label', bins=50,figsize=(10,4))

df_sms.loc[:, 'label'] = df_sms.label.map({'ham': 0, 'spam': 1})
print(df_sms.shape)
df_sms.head()
(5572, 3)
   label                                                sms  length
0      0  Go until jurong point, crazy.. Available only ...     111
1      0                      Ok lar... Joking wif u oni...      29
2      1  Free entry in 2 a wkly comp to win FA Cup fina...     155
3      0  U dun say so early hor... U c already then say...      49
4      0  Nah I don't think he goes to usf, he lives aro...      61
Bag of Words Approach
What we have here in our dataset is a large collection of text data (5,572 rows). Most Machine Learning algorithms need numerical data as input, and email/SMS messages are text heavy.
We need a way to represent text data numerically for a machine learning algorithm, and the bag-of-words model helps us achieve that. It is a way of extracting features from text for use in machine learning algorithms.
In this approach, we use the tokenized words for each observation and find out the frequency of each token.
Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and the corresponding (row, column) value being the frequency of occurrence of each word or token in that document.
For example:
Let's say we have 4 documents as follows:
['Hello, how are you!', 'Win money, win from home.', 'Call me now', 'Hello, Call you tomorrow?']
Our objective here is to convert this set of text to a frequency distribution matrix, as follows:
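For these four documents, the matrix would look like this (reconstructed here from the word counts of the documents above, since the original figure is not reproduced):

   are  call  from  hello  home  how  me  money  now  tomorrow  win  you
0    1     0     0      1     0    1    0      0    0         0    0    1
1    0     0     1      0     1    0    0      1    0         0    2    0
2    0     1     0      0     0    0    1      0    1         0    0    0
3    0     1     0      1     0    0    0      0    0         1    0    1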
Here as we can see, the documents are numbered in the rows, and each word is a column name, with the corresponding value being the frequency of that word in the document.
Let's break this down and see how we can do this conversion using a small set of documents.
To handle this, we will be using scikit-learn's CountVectorizer method, which does the following:
- It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
- It counts the occurrence of each of those tokens.
Implementation of Bag of Words Approach
Step 1: Convert all strings to their lower case form.
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

lower_case_documents = [d.lower() for d in documents]
print(lower_case_documents)
['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']
Step 2: Remove all punctuation
import string

sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans("", "", string.punctuation)))
sans_punctuation_documents
['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']
Step 3: Tokenization
preprocessed_documents = [[w for w in d.split()] for d in sans_punctuation_documents]
preprocessed_documents
[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
Step 4: Count frequencies
import pprint
from collections import Counter

frequency_list = [Counter(d) for d in preprocessed_documents]
pprint.pprint(frequency_list)
[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
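These per-document counters are the bag-of-words representation, but they are not yet a single matrix. As an optional extra step (not part of the original walkthrough), a minimal sketch of how the counters could be lined up into a document-term matrix with pandas, filling in a count of 0 for any word a document does not contain:

import pandas as pd

# Each Counter becomes one row; words missing from a document become NaN, which we replace with 0
frequency_matrix = pd.DataFrame(frequency_list).fillna(0).astype(int)
frequency_matrix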
Implementing Bag of Words in scikit-learn
Here we will look to create a frequency matrix on a smaller document set to make sure we understand how the document-term matrix generation happens. We have created a sample document set, 'documents'.
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
Data preprocessing with CountVectorizer()
In the steps above, we implemented a version of the CountVectorizer() method from scratch, which entailed cleaning our data first.
This cleaning involved converting all of our data to lower case and removing all punctuation marks.
CountVectorizer() has certain parameters which take care of these steps for us. They are:
lowercase = True
The lowercase parameter has a default value of True which converts all of our text to its lower case form.
token_pattern = (?u)\b\w\w+\b
The token_pattern parameter has a default regular expression value of (?u)\b\w\w+\b which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2, as individual tokens or words.
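To see what that pattern does in practice, here is a small illustrative check (not part of the original walkthrough) using Python's re module directly on a phrase borrowed from the dataset; note how punctuation is dropped and single-character tokens are not matched:

import re

# The default token pattern used by CountVectorizer:
# tokens are runs of two or more word characters, so punctuation acts as a delimiter
pattern = r"(?u)\b\w\w+\b"
print(re.findall(pattern, "Free entry in 2 a wkly comp!".lower()))
# ['free', 'entry', 'in', 'wkly', 'comp'] -- '2' and 'a' are single characters, so they are dropped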
stop_words
The stop_words parameter, if set to 'english', will remove all words from our document set that match a list of English stop words defined in scikit-learn.
Considering the size of our dataset and the fact that we are dealing with SMS messages and not larger text sources like e-mail, we will not be setting this parameter value.
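For reference only, a sketch of what it would look like on the toy document set if we did turn it on (we do not use this in the rest of the article):

# Illustration only -- stop word removal is not used in this article
count_vector_sw = CountVectorizer(stop_words='english')
count_vector_sw.fit(documents)
# Common English words such as 'are', 'how', 'me', 'now' and 'you'
# would no longer appear in the learned vocabulary.
print(count_vector_sw.get_feature_names())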
count_vector.fit(documents)
count_vector.get_feature_names()
['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money', 'now', 'tomorrow', 'win', 'you']
doc_array = count_vector.transform(documents).toarray()
doc_array
array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]])
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names())
frequency_matrix

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_sms['sms'],
                                                    df_sms['label'],
                                                    test_size=0.20,
                                                    random_state=1)
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix
testing_data = count_vector.transform(X_test)
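As a quick sanity check (not shown in the original walkthrough), both matrices should end up with the same number of columns, because the vocabulary is learned from the training split only:

# The column count (vocabulary size) is identical for both matrices
print(training_data.shape)
print(testing_data.shape)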
Implementation of Naive Bayes Machine Learning Algorithm
I will use the sklearn.naive_bayes module from scikit-learn to make predictions on our dataset for SMS Spam Detection.
Specifically, we will be using the multinomial Naive Bayes implementation which, as noted earlier, is suitable for classification with discrete features and takes integer word counts as its input, whereas Gaussian Naive Bayes is better suited for continuous data with a Gaussian (normal) distribution.
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = naive_bayes.predict(testing_data)
Evaluating our SMS Spam Detection Model
Now that we have made predictions on our test set, our next goal is to evaluate how well our model is doing. There are various mechanisms for doing so, but first let's do a quick recap of them.
Accuracy measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points).
Precision tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam which are actually spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, it is the ratio of
[True Positives/(True Positives + False Positives)]
Recall (sensitivity) tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam which are actually spam) to all the messages that were actually spam. In other words, it is the ratio of
[True Positives/(True Positives + False Negatives)]
For classification problems whose class distributions are skewed, as in our case, accuracy by itself is not a very good metric. For example, suppose we had 100 text messages and only 2 were spam while the other 98 were not.
We could classify 90 messages as not spam (including the 2 that were spam but which we classified as not spam, hence they would be false negatives) and 10 as spam (all 10 being false positives) and still get a reasonably good accuracy score: 88 of the 100 predictions are correct, so accuracy is 88%, even though precision and recall are both 0.
For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.
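Expressed in the same notation as the formulas above:
[F1 = 2 × (Precision × Recall)/(Precision + Recall)]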
We will be using all 4 metrics to make sure our model does well. For all 4 metrics whose values can range from 0 to 1, having a score as close to 1 as possible is a good indicator of how well our model is doing.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))
Accuracy score: 0.9847533632286996
Precision score: 0.9420289855072463
Recall score: 0.935251798561151
F1 score: 0.9386281588447652
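As an optional extra check (not part of the original walkthrough), a confusion matrix makes it easy to see how the remaining errors split between false positives and false negatives:

from sklearn.metrics import confusion_matrix

# Rows are the true labels (0 = ham, 1 = spam), columns are the predicted labels
print(confusion_matrix(y_test, predictions))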