Fake News Detection Model

Fake news is one of the major concerns in our society right now. It is such a widespread issue that even leading media outlets sometimes fall into its trap. If it is difficult for media channels to detect fake news, it is nearly impossible for an ordinary citizen.

As part of a Machine Learning project, in this article I will show you Fake News Detection with Machine Learning. I will use misinformation about the coronavirus that circulated over the past few months, so by the end of this article you will be able to build a fake news detection model for coronavirus news.

The data I will use in this article was collected from more than 1000 news articles and social media posts about the coronavirus. You can download the dataset from here.

First I will import all the libraries that we need for fake news detection, then load the dataset with pandas and prepare the data:

import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from collections import Counter
import textstat
from lexicalrichness import LexicalRichness
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

stop_words = set(stopwords.words('english'))
pd.set_option('display.max_columns', 500)

df = pd.read_csv('data/corona_fake.csv')

# Normalize inconsistent labels and source names
df.loc[df['label'] == 'Fake', ['label']] = 'FAKE'
df.loc[df['label'] == 'fake', ['label']] = 'FAKE'
df.loc[df['source'] == 'facebook', ['source']] = 'Facebook'

# Fill missing article bodies with the title
df.text.fillna(df.title, inplace=True)

# Correct a few mislabeled rows (df.loc[row, col] avoids chained assignment)
df.loc[5, 'label'] = 'FAKE'
df.loc[15, 'label'] = 'TRUE'
df.loc[43, 'label'] = 'FAKE'
df.loc[131, 'label'] = 'TRUE'
df.loc[242, 'label'] = 'FAKE'

# Shuffle the rows and fill the remaining missing values
df = df.sample(frac=1).reset_index(drop=True)
df.title.fillna('missing', inplace=True)
df.source.fillna('missing', inplace=True)

I will create a number of new features based on the titles and body of news articles. Now let’s go through all the features one by one.

Capital Letters in the Title

Now I will count the number of capital letters in the title of each article, and also compute the percentage of capital letters in the body of each article. Counting capital letters tells us how the title is written, so we can use it as a feature.

df['title_num_uppercase'] = df['title'].str.count(r'[A-Z]')
df['text_num_uppercase'] = df['text'].str.count(r'[A-Z]')
df['text_len'] = df['text'].str.len()
df['text_pct_uppercase'] = df.text_num_uppercase.div(df.text_len)

x1 = df.loc[df['label'] == 'TRUE']['title_num_uppercase']
x2 = df.loc[df['label'] == 'FAKE']['title_num_uppercase']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Distribution of Uppercase in title', template="plotly_white")
fig.show()
fig = go.Figure()
fig.add_trace(go.Box(y=x1, name='TRUE', marker_color='rgb(0, 0, 100)'))
fig.add_trace(go.Box(y=x2, name='FAKE', marker_color='rgb(0, 200, 200)'))
fig.update_layout(title_text='Box plot of Capital Letters in title', template="plotly_white")
fig.show()

On average, fake news titles contain far more capital letters than real news titles. This suggests that the people who write fake news try to grab the audience's attention with their titles.
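To make this feature concrete, here is a minimal sketch (with invented example titles) of how the uppercase count separates a shouty headline from a plain one:

```python
import pandas as pd

# Two invented titles, for illustration only
toy = pd.DataFrame({'title': ['BREAKING: MIRACLE Cure FOUND',
                              'Study examines new treatment']})
toy['title_num_uppercase'] = toy['title'].str.count(r'[A-Z]')
print(toy['title_num_uppercase'].tolist())  # → [21, 1]
```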

Stop Words

Stop words are the words that search engines are programmed to ignore, for example: a, the, at, on, which, etc. Now I will count the number of stop words present in the title of each article:

df['title_num_stop_words'] = df['title'].str.split().apply(lambda x: len(set(x) & stop_words))
df['text_num_stop_words'] = df['text'].str.split().apply(lambda x: len(set(x) & stop_words))
df['text_word_count'] = df['text'].apply(lambda x: len(str(x).split()))
df['text_pct_stop_words'] = df['text_num_stop_words'] / df['text_word_count']

x1 = df.loc[df['label'] == 'TRUE']['title_num_stop_words']
x2 = df.loc[df['label'] == 'FAKE']['title_num_stop_words']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Distribution of Stop Words in title', template="plotly_white")
fig.show()
fig = go.Figure()
fig.add_trace(go.Box(y=x1, name='TRUE', marker_color='rgb(0, 0, 100)'))
fig.add_trace(go.Box(y=x2, name='FAKE', marker_color='rgb(0, 200, 200)'))
fig.update_layout(title_text='Box plot of Stop Words in title', template="plotly_white")
fig.show()

The titles of fake news articles contain fewer stop words than the titles of real news.
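One detail worth noting: because the counting lambda intersects a `set` of title tokens with the stop-word set, it counts *distinct* stop words rather than total occurrences. A small pure-Python illustration (example sentence and tiny stand-in stop-word set invented here):

```python
# Tiny stand-in for NLTK's English stop-word set
stop_words = {'the', 'of', 'in', 'a', 'is'}

title = 'the spread of the virus in the city'
tokens = title.split()
distinct = len(set(tokens) & stop_words)           # each stop word counted once
total = sum(1 for t in tokens if t in stop_words)  # every occurrence counted
print(distinct, total)  # → 3 5
```

For this dataset the distinct count is close enough to serve as a feature, but the distinction matters if you reuse the code elsewhere.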

Proper Nouns

Now I will count the number of proper nouns in the title of each article:

df.drop(['text_num_uppercase', 'text_len', 'text_num_stop_words', 'text_word_count'],
        axis=1, inplace=True)

df['token'] = df.apply(lambda row: nltk.word_tokenize(row['title']), axis=1)
df['pos_tags'] = df.apply(lambda row: nltk.pos_tag(row['token']), axis=1)
tag_count_df = pd.DataFrame(df['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
df = pd.concat([df, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)

# Keep only the engineered features; NNP is the proper-noun tag count
df = df[['title', 'text', 'source', 'label', 'title_num_uppercase', 'text_pct_uppercase',
         'title_num_stop_words', 'text_pct_stop_words', 'NNP']].rename(columns={'NNP': 'NNP_title'})

x1 = df.loc[df['label'] == 'TRUE']['NNP_title']
x2 = df.loc[df['label'] == 'FAKE']['NNP_title']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Number of Proper nouns in title', template="plotly_white")
fig.show()
fig = go.Figure()
fig.add_trace(go.Box(y=x1, name='TRUE', marker_color='rgb(0, 0, 100)'))
fig.add_trace(go.Box(y=x2, name='FAKE', marker_color='rgb(0, 200, 200)'))
fig.update_layout(title_text='Box plot of Proper nouns in title', template="plotly_white")
fig.show()

The titles of fake news articles use more proper nouns than the titles of real news.

Analysis of Titles of Fake News Articles

The above analysis shows that the authors of fake news use fewer stop words and more proper nouns, aiming for catchy titles.

Classifying Features

To distinguish fake news from real news, we need to compute many content-based features from the body of the articles. Let's go through them one by one.

I will use a part-of-speech tagger and count how many times each tag appears in the articles:

df['token'] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
df['pos_tags'] = df.apply(lambda row: nltk.pos_tag(row['token']), axis=1)
tag_count_df = pd.DataFrame(df['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
df = pd.concat([df, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)
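To see what the `Counter` aggregation produces, here is the same step applied to a hand-written tag list (the words and tags below are invented for illustration; in the real pipeline they come from `nltk.pos_tag`):

```python
from collections import Counter

# A made-up (word, tag) list standing in for nltk.pos_tag output
pos_tags = [('Harvard', 'NNP'), ('publishes', 'VBZ'),
            ('health', 'NN'), ('research', 'NN')]
tag_counts = Counter(tag for _, tag in pos_tags)
print(dict(tag_counts))  # → {'NNP': 1, 'VBZ': 1, 'NN': 2}
```

Each dictionary like this becomes one row of `tag_count_df`, with missing tags filled in as 0.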

Next, the number of negations and interrogatives in each article:

# \b anchors keep e.g. 'no' from also matching inside words like 'know'
negations = (r"\b(?:no|not|never|none|nothing|nobody|neither|nowhere|hardly|scarcely|barely|"
             r"doesn't|isn't|wasn't|shouldn't|wouldn't|couldn't|won't|can't|don't)\b")
interrogatives = r"\b(?:what|who|when|where|which|why|how)\b"

df['num_negation'] = df['text'].str.lower().str.count(negations)
df['num_interrogatives_title'] = df['title'].str.lower().str.count(interrogatives)
df['num_interrogatives_text'] = df['text'].str.lower().str.count(interrogatives)
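Whether these patterns match whole words or substrings matters here: without `\b` anchors, `no` also counts occurrences inside words like "know" or "nothing". A quick check with Python's `re` module (example sentence invented):

```python
import re

text = 'we know nothing is certain, no doubt'
# Plain alternation also matches substrings inside longer words
plain = len(re.findall('no|not', text))
# \b anchors restrict matches to whole words
bounded = len(re.findall(r'\b(?:no|not)\b', text))
print(plain, bounded)  # → 3 1
```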

Now I will use the textstat library in Python to compute readability scores for each article:

# Compute each readability metric over the article bodies
readability_metrics = {
    'flesch_reading_ease': textstat.flesch_reading_ease,
    'smog_index': textstat.smog_index,
    'flesch_kincaid_grade': textstat.flesch_kincaid_grade,
    'coleman_liau_index': textstat.coleman_liau_index,
    'automated_readability_index': textstat.automated_readability_index,
    'dale_chall_readability_score': textstat.dale_chall_readability_score,
    'difficult_words': textstat.difficult_words,
    'linsear_write_formula': textstat.linsear_write_formula,
    'gunning_fog': textstat.gunning_fog,
    'text_standard': textstat.text_standard,
}
for name, metric in readability_metrics.items():
    df[name] = [metric(doc) for doc in df['text']]

Now I will calculate the type-token ratio, which is the number of unique words divided by the total number of words in an article:

ttr = []
for doc in df['text']:
    lex = LexicalRichness(doc)
    ttr.append(lex.ttr)
df['ttr'] = ttr
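To make the type-token ratio concrete, here is the same quantity computed by hand in plain Python on an invented sentence; `LexicalRichness` does essentially this after its own tokenization:

```python
text = 'the cat sat on the mat'
words = text.lower().split()
ttr = len(set(words)) / len(words)  # unique words / total words
print(round(ttr, 3))  # 'the' repeats, so 5 unique words out of 6 → 0.833
```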

Now I will count several categories of words in each article: power words, casual words, tentative words, and emotional words:

# \b anchors keep e.g. 'you' from also matching inside 'your'
power_words = (r"\b(?:improve|trust|immediately|discover|profit|learn|know|understand|powerful|best|win|"
               r"more|bonus|exclusive|extra|you|free|health|guarantee|new|proven|safety|money|now|today|"
               r"results|protect|help|easy|amazing|latest|extraordinary|how to|worst|ultimate|hot|first|"
               r"big|anniversary|premiere|basic|complete|save|plus|create)\b")
casual_words = r"\b(?:make|because|how|why|change|use|since|reason|therefore|result)\b"
tentative_words = (r"\b(?:may|might|can|could|possibly|probably|it is likely|it is unlikely|"
                   r"it is possible|it is probable|tends to|appears to|suggests that|seems to)\b")
emotion_words = (r"\b(?:ordeal|outrageous|provoke|repulsive|scandal|severe|shameful|shocking|terrible|"
                 r"tragic|unreliable|unstable|wicked|aggravate|agony|appalled|atrocious|corruption|"
                 r"damage|disastrous|disgusted|dreadful|eliminate|harmful|harsh|inconsiderate|enraged|"
                 r"offensive|aggressive|frustrated|controlling|resentful|anger|sad|fear|malicious|"
                 r"infuriated|critical|violent|vindictive|furious|contrary|condemning|sarcastic|"
                 r"poisonous|jealous|retaliating|desperate|alienated|unjustified|violated)\b")

df['num_powerWords_text'] = df['text'].str.lower().str.count(power_words)
df['num_casualWords_text'] = df['text'].str.lower().str.count(casual_words)
df['num_tentativeWords_text'] = df['text'].str.lower().str.count(tentative_words)
df['num_emotionWords_text'] = df['text'].str.lower().str.count(emotion_words)

Exploratory Data Analysis for Fake News

Capital Letters in the body of each article:

x1 = df.loc[df['label'] == 'TRUE']['text_pct_uppercase']
x2 = df.loc[df['label'] == 'FAKE']['text_pct_uppercase']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Percentage of Capital Letters in Article body', template="plotly_white")
fig.show()

Stop Words in the body of each article:

x1 = df.loc[df['label'] == 'TRUE']['text_pct_stop_words']
x2 = df.loc[df['label'] == 'FAKE']['text_pct_stop_words']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Percentage of Stop Words in Text', template="plotly_white")
fig.show()

The number of proper nouns used in the body of each article:

x1 = df.loc[df['label'] == 'TRUE']['NNP']
x2 = df.loc[df['label'] == 'FAKE']['NNP']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Number of Proper nouns in Article Body', template="plotly_white")
fig.show()

The number of negations in the body of each article:

x1 = df.loc[df['label'] == 'TRUE']['num_negation']
x2 = df.loc[df['label'] == 'FAKE']['num_negation']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Number of Negations in Text', template="plotly_white")
fig.show()

Type-Token Ratio of each article:

x1 = df.loc[df['label'] == 'TRUE']['ttr']
x2 = df.loc[df['label'] == 'FAKE']['ttr']
group_labels = ['TRUE', 'FAKE']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Type-token ratio in Article Bodies', template="plotly_white")
fig.show()

Harvard Health Publishing Vs. Natural News

Keep in mind that Natural News is a known conspiracy-theory site. Since the features counted above are not very discriminative on their own, I will compare articles from Harvard Health Publishing with articles from Natural News to explore more features:

x1 = df.loc[df['source'] == 'https://www.health.harvard.edu/']['text_pct_stop_words']
x2 = df.loc[df['source'] == 'https://www.naturalnews.com/']['text_pct_stop_words']
group_labels = ['Health Harvard', 'Natural News']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Percentage of Stop Words in Article Bodies', template="plotly_white")
fig.show()

As expected, Natural News articles use far fewer stop words than Harvard Health Publishing articles.

x1 = df.loc[df['source'] == 'https://www.health.harvard.edu/']['ttr']
x2 = df.loc[df['source'] == 'https://www.naturalnews.com/']['ttr']
group_labels = ['Harvard', 'Natural News']
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig = ff.create_distplot([x1, x2], group_labels, colors=colors)
fig.update_layout(title_text='Type-token ratio in Article Bodies', template="plotly_white")
fig.show()

Fake News Detection Model

Now I will fit a Support Vector Machine classifier on all the features we have built throughout this article to create a fake news detection model.

X, y = df.drop(['title', 'text', 'source', 'label', 'text_standard'], axis=1), df['label']

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

svc = LinearSVC(dual=False)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
[0.77777778 0.88888889 0.79487179 0.86324786 0.80172414 0.87931034
 0.85344828 0.87068966 0.86206897 0.82758621]
print(scores.mean())
0.841961391099322
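One caveat about this setup: fitting `StandardScaler` on the full matrix before cross-validation lets the held-out folds influence the scaling statistics. Wrapping the scaler and classifier in a scikit-learn `Pipeline` keeps scaling inside each training fold. A sketch on synthetic data (the real `X` and `y` come from the dataframe above; the arrays here are invented stand-ins):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))                      # synthetic features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # synthetic labels

# Scaler is re-fit on each training fold, so no information leaks from test folds
pipe = make_pipeline(StandardScaler(), LinearSVC(dual=False))
scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring='accuracy')
print(round(scores.mean(), 2))
```

With only a handful of features the leakage effect is usually small, but the pipeline form is the safer habit.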

With 10-fold cross-validation we get ten different scores, one per fold, and then compute their mean. Next, let's sweep the values of the C parameter and check the accuracy at each:

C_range = list(range(1, 26))
acc_score = []
for c in C_range:
    svc = LinearSVC(dual=False, C=c)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())

fig = go.Figure(data=go.Scatter(x=C_range, y=acc_score))
fig.update_layout(xaxis_title='Value of C for SVC', yaxis_title='Cross Validated Accuracy',
                  template='plotly_white', xaxis=dict(dtick=1))
fig.show()

The figure above shows that the model reaches an accuracy of about 84.2 percent at C=1; the accuracy then drops to about 83.8 percent and stays roughly constant for larger values of C.
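Instead of reading the best C off the plot, it can be selected programmatically from the list of mean accuracies. A minimal sketch with invented scores mirroring the shape of the curve:

```python
# Invented mean accuracies for C = 1..5, for illustration only
C_values = [1, 2, 3, 4, 5]
acc_score = [0.842, 0.838, 0.838, 0.838, 0.838]

best_c = C_values[acc_score.index(max(acc_score))]
print(best_c)  # → 1
```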


So this was a Fake News Detection model that I trained using the Support Vector Machine algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.
