Spam Comments Detection with Machine Learning

Spam comments detection means classifying comments as spam or not spam. YouTube is one of the platforms that use Machine Learning to filter spam comments automatically, so that its creators do not have to deal with them. If you want to learn how to detect spam comments with Machine Learning, this article is for you. In it, I will take you through the task of spam comments detection with Machine Learning using Python.

Spam Comments Detection

Detecting spam comments is a text classification task in Machine Learning. Spam comments on social media platforms are comments posted to redirect the user to another social media account, website, or piece of content.

To detect spam comments with Machine Learning, we need labelled data of spam comments. Luckily, I found a dataset on Kaggle about YouTube spam comments which will be helpful for the task of spam comments detection. You can download the dataset from here.

In the section below, you will learn how to detect spam comments with machine learning using the Python programming language.

Spam Comments Detection using Python

Let’s start this task by importing the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

data = pd.read_csv("Youtube01-Psy.csv")
print(data.sample(5))
                                COMMENT_ID          AUTHOR  \
287    z13vhnh5ewvdyzh3o23bjz55lxbwjznor04    diego acosta   
43     z12jvnua2tifirkvk23cfjtpxwmgxfch004   Didier Drogba   
265    z13ucxdzemugi1v5n04ccjloko25drfb4js  Haley Harmicar   
322    z13uffbajziyw5cfp23bwbw5auytzdl5b04   Juris Dumagan   
89   z12pzpvbfl2igbwhe04cihtpuwymvr5gvsg0k     NstyIC Gold   

                    DATE                                            CONTENT  \
287  2014-11-08T10:05:27  If I get 100 subscribers, I will summon Freddy...   
43   2014-01-20T06:57:25  http://www.twitch.tv/jaroadc come follow and w...   
265  2014-11-08T05:35:42  9 year olds be like, 'How does this have 2 bil...   
322  2014-11-12T11:03:25            I think he was drunk during this :) x)   
89   2014-11-03T20:41:23  Ching Ching ling long ding ring yaaaaaa Ganga ...   

     CLASS  
287      1  
43       1  
265      0  
322      0  
89       0  

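We only need the content and class columns from the dataset for the rest of the task. So let’s select both columns and move further:

# .copy() gives an independent DataFrame, so later column edits do not touch a slice of the original
data = data[["CONTENT", "CLASS"]].copy()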
print(data.sample(5))
                                               CONTENT  CLASS
160  CHECK MY CHANNEL FOR MY NEW SONG 'STATIC'!! YO...      1
157              Follow me on Twitter @mscalifornia95      1
336  To everyone joking about how he hacked to get ...      0
329  FOLLOW MY COMPANY ON TWITTER  thanks.  https:/...      1
79   Hi there~I'm group leader of Angel, a rookie K...      1

The class column contains the values 0 and 1, where 0 indicates not spam and 1 indicates spam. To make the output easier to read, I will replace 0 and 1 with the labels Not Spam and Spam Comment:

data["CLASS"] = data["CLASS"].map({0: "Not Spam",
                                   1: "Spam Comment"})
print(data.sample(5))
                                               CONTENT         CLASS
161           Incmedia.org where the truth meets you.  Spam Comment
335  Hey guys can you check my YouTube channel I kn...  Spam Comment
134                              ❤️ ❤️ ❤️ ❤️ ❤️❤️❤️❤️      Not Spam
209  How can this music video get 2 billion views w...      Not Spam
45   ....subscribe......  ......to my........  .......  Spam Comment
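
One detail worth checking after a relabelling step like this: the pandas map method returns NaN for any value that is missing from the dictionary, so an unexpected class value would silently become a missing label. The sketch below uses a small made-up DataFrame (toy_data is hypothetical, not the Kaggle file) to show the mapping together with a quick check of the label counts:

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the article's dataset
toy_data = pd.DataFrame({
    "CONTENT": ["subscribe to my channel", "great song", "follow me on twitter",
                "check out my new video", "love this"],
    "CLASS": [1, 0, 1, 1, 0],
})

# Same mapping as in the article
toy_data["CLASS"] = toy_data["CLASS"].map({0: "Not Spam", 1: "Spam Comment"})

# Any class value missing from the mapping would appear here as NaN
unmapped = int(toy_data["CLASS"].isna().sum())

# Number of comments per label
label_counts = toy_data["CLASS"].value_counts()

print(unmapped)
print(label_counts)
```

If the unmapped count is above zero, the dataset contains a class value the dictionary does not cover.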

Training a Classification Model

Now let’s train a Machine Learning model to classify comments as spam or not spam. As this is a binary classification problem, I will use the Bernoulli Naive Bayes algorithm, which treats each word as simply present or absent in a comment:

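# The comment text is the input; the spam / not spam label is the target
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

# Convert each comment into a vector of word counts
cv = CountVectorizer()
x = cv.fit_transform(x)

# Hold out 20% of the comments for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y,
                                                test_size=0.2,
                                                random_state=42)

# Train the classifier and print its accuracy on the test set
model = BernoulliNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))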
0.9857142857142858
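
To make the two steps above more concrete, here is a minimal sketch on a handful of made-up comments (toy_comments and toy_labels are hypothetical, not taken from the dataset). CountVectorizer turns each comment into word counts, while BernoulliNB, with its default binarize=0.0 setting, only looks at whether each count is above zero. Repeating a word should therefore not change the predicted probabilities:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy comments and labels, for illustration only
toy_comments = [
    "subscribe to my channel",
    "check out my channel please subscribe",
    "great song",
    "love this song so much",
]
toy_labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

# Learn the vocabulary and build the word-count matrix
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_comments)

# Vocabulary words ordered by their column index in the matrix
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

toy_model = BernoulliNB()
toy_model.fit(toy_counts, toy_labels)

# The same word once and three times: the counts differ...
once = toy_cv.transform(["subscribe"])
thrice = toy_cv.transform(["subscribe subscribe subscribe"])
print(once.sum(), thrice.sum())

# ...but BernoulliNB binarizes them, so the probabilities match
same_probabilities = np.allclose(toy_model.predict_proba(once),
                                 toy_model.predict_proba(thrice))
print(same_probabilities)
```

Both CountVectorizer and BernoulliNB are used with their default parameters here, as in the code above.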

Now let’s test the model by giving spam and not spam comments as input:

# Use a separate variable name so the dataset in `data` is not overwritten
sample = "Check this out: https://thecleverprogrammer.com/"
features = cv.transform([sample]).toarray()
print(model.predict(features))
['Spam Comment']
sample = "Lack of information!"
features = cv.transform([sample]).toarray()
print(model.predict(features))
['Not Spam']
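
As a possible variation on the steps above, the vectorizer and the classifier can be wrapped in a scikit-learn pipeline. This is a sketch on made-up comments, not the exact method used in this article: toy_comments, toy_labels, and the classify helper are hypothetical names. The pipeline accepts raw strings directly, and its vocabulary is learned from the training split only, so the test comments stay unseen until evaluation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments and labels, for illustration only
toy_comments = [
    "subscribe to my channel",
    "check out my new video",
    "follow me on twitter",
    "free gift cards click here",
    "great song",
    "love this video so much",
    "this never gets old",
    "best music video ever",
]
toy_labels = ["Spam Comment"] * 4 + ["Not Spam"] * 4

# Split the raw text first, keeping both classes in each split
train_text, test_text, train_labels, test_labels = train_test_split(
    toy_comments, toy_labels, test_size=0.25, random_state=42,
    stratify=toy_labels)

# The vectorizer is fitted inside the pipeline, on the training text only
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(train_text, train_labels)

# Accuracy on the held-out comments
test_accuracy = spam_pipeline.score(test_text, test_labels)


def classify(comment):
    """Return the predicted label for a single raw comment string."""
    return spam_pipeline.predict([comment])[0]


print(test_accuracy)
print(classify("please subscribe to my channel"))
```

With only eight training examples the predictions here mean little; the point is the structure, which would carry over to the real dataset by passing the CONTENT and CLASS columns in place of the toy lists.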

So this is how you can train a Machine Learning model for the task of spam detection using Python.

Summary

Spam comments detection means classifying comments as spam or not spam. Spam comments on social media platforms are comments posted to redirect the user to another social media account, website, or piece of content. I hope you liked this article on detecting spam comments with Machine Learning. Feel free to ask questions in the comments section below.

Aman Kharwal
