Sentiment analysis is the process of quantifying text content to capture the ideas, beliefs, and opinions of whole segments of an audience. Its value for improving the productivity of a business is hard to overestimate. Sentiment analysis is one of those common NLP tasks that every data scientist needs to perform.
For example, suppose you are a student in an online course and you have a problem, so you post it on the class forum. Sentiment analysis could identify not only the topic you are struggling with but also how frustrated or discouraged you are, so that the replies can be tailored to that sentiment. This is already happening because the technology is already there.
Sentiment Analysis with Machine Learning
Now that we have covered what sentiment analysis means, I’m going to introduce a very easy way to analyze sentiment with machine learning. The data I’ll be using includes 27,481 tagged tweets in the training set and 3,534 tweets in the test set. You can easily download the data from here. Let’s start by looking at the data using pandas:
import pandas as pd
training = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print("Training data: \n",training.head())
print("Test Data: \n",test.head())
Training data: 
        textID  ... sentiment
0  cb774db0d1  ...   neutral
1  549e992a42  ...  negative
2  088c60f138  ...  negative
3  9642c003ef  ...  negative
4  358bd9e861  ...  negative

[5 rows x 4 columns]
Test Data: 
        textID                                               text sentiment
0  f87dea47db  Last session of the day  http://twitpic.com/67ezh   neutral
1  96d74cb729  Shanghai is also really exciting (precisely -...  positive
2  eee518ae67  Recession hit Veronique Branquinho, she has to...  negative
3  01082688c6                                        happy bday!  positive
4  33987a8ee5             http://twitpic.com/4w75p - I like it!!  positive
Data processing
For the sake of simplicity, we don’t want to go overboard on the data cleaning side, but there are a few simple things we can do to help our machine learning model identify the sentiments. The data cleaning process is as follows:
- Remove all hyperlinks from tweets
- Replace common contractions
- Remove punctuation (the code below leaves this step to the Keras Tokenizer, whose default filters strip punctuation)
As part of data preparation, we can create a function that cleans the tweet text, maps the sentiment labels to integers, and returns the tweets and labels as arrays:
import re

# Contractions to expand; the keys use a backtick (`) as the apostrophe
contractions_dict = {
    "can`t": "can not",
    "won`t": "will not",
    "don`t": "do not",
    "aren`t": "are not",
    "i`d": "i would",
    "couldn`t": "could not",
    "shouldn`t": "should not",
    "wouldn`t": "would not",
    "isn`t": "is not",
    "it`s": "it is",
    "didn`t": "did not",
    "weren`t": "were not",
    "mustn`t": "must not",
}

def replace_words(string: str, dictionary: dict) -> str:
    # Replace every dictionary key found in the string with its value
    for k, v in dictionary.items():
        string = string.replace(k, v)
    return string

def prepare_data(df: pd.DataFrame):
    # Keep the text before the first hyperlink (http or https),
    # lowercase it, then expand contractions
    df["text"] = (
        df["text"]
        .apply(lambda x: re.split(r"https?://.*", str(x))[0])
        .str.lower()
        .apply(lambda x: replace_words(x, contractions_dict))
    )
    # Map the sentiment labels to integers
    df["label"] = df["sentiment"].map({"neutral": 1, "negative": 0, "positive": 2})
    return df.text.values, df.label.values

# training and test are the DataFrames loaded with pandas above
train_tweets, train_labels = prepare_data(training)
test_tweets, test_labels = prepare_data(test)
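To see the three cleaning steps on a single tweet, here is a minimal standalone sketch. It is not the function used in this article: `clean_tweet` and the shortened `CONTRACTIONS` table are hypothetical names for illustration, it removes only the URL itself instead of everything after it, and it strips punctuation explicitly instead of leaving that to the tokenizer.

```python
import re
import string

# Hypothetical, shortened contraction table for illustration; the keys use
# a backtick (`) as the apostrophe, like the dictionary in the article.
CONTRACTIONS = {"can`t": "can not", "don`t": "do not", "it`s": "it is"}


def clean_tweet(text: str) -> str:
    """Apply the three cleaning steps to one tweet."""
    # 1. Drop hyperlinks (only the URL itself, not the rest of the tweet).
    text = re.sub(r"https?://\S+", "", text)
    # 2. Lowercase, then expand contractions.
    text = text.lower()
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    # 3. Remove punctuation, then collapse the leftover whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


print(clean_tweet("I can`t wait!! http://twitpic.com/67ezh It`s great."))
# prints: i can not wait it is great
```

Contractions have to be expanded before punctuation is removed, otherwise the backtick is gone and the dictionary keys no longer match.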
Tokenization
Now we need to turn each tweet into a single fixed-length vector, specifically a TF-IDF representation. To do this we can use the Tokenizer() class built into Keras, fitted on the training data only:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_tweets)
train_tokenized = tokenizer.texts_to_matrix(train_tweets,mode='tfidf')
test_tokenized = tokenizer.texts_to_matrix(test_tweets,mode='tfidf')
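To make it clearer what a TF-IDF matrix contains, here is a small pure-Python sketch on a made-up corpus. It uses the textbook weight count × log(N / df); Keras applies its own smoothed variant of the formula, so the numbers it produces will differ, but the shape of the result is the same idea: one row per document and one column per vocabulary word. The names `docs`, `tfidf_vector`, and `matrix` are assumptions for this example only.

```python
import math
from collections import Counter

# Toy corpus standing in for the cleaned tweets (hypothetical examples).
docs = ["happy bday", "i like it", "i do not like it"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for words in tokenized for word in set(words))
n_docs = len(docs)


def tfidf_vector(words):
    """Return one weight per vocabulary word: count * log(N / df)."""
    counts = Counter(words)
    return [counts[w] * math.log(n_docs / doc_freq[w]) for w in vocab]


# One fixed-length row per document, regardless of the document's length.
matrix = [tfidf_vector(words) for words in tokenized]

print(vocab)
for row in matrix:
    print([round(x, 2) for x in row])
```

Words that appear in few documents (such as "bday" here) get a higher weight than words shared by many documents (such as "i" or "like"), which is what lets the classifier focus on distinctive terms.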
Machine Learning Model for Sentiment Analysis
Now, I will train our model for sentiment analysis using the Random Forest Classification algorithm provided by Scikit-Learn:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=2,oob_score=True,n_jobs=-1,)
forest.fit(train_tokenized,train_labels)
print(f"Train score: {forest.score(train_tokenized,train_labels)}")
print(f"OOB score: {forest.oob_score_}")
Train score: 0.7672573778246788
OOB score: 0.6842545758887959
Evaluating the Model on Test Set
Scikit-Learn makes it easy to evaluate the classifier on the test data, both as an overall accuracy score and as a confusion matrix. The test accuracy is computed as follows:
print("Test score: ",forest.score(test_tokenized,test_labels))
Test score: 0.687889077532541
The accuracy, at about 69%, is not that great. Most of the mistakes come from confusing positive with neutral tweets and negative with neutral tweets, which in the grand scheme of errors is not the worst kind to have. Fortunately, the model rarely confuses positive with negative sentiment or vice versa.
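The error pattern described above is what a confusion matrix makes visible. The sketch below builds one in plain Python from made-up labels, so the numbers are illustrative only and are not the results of this model. With the objects from this article, `y_true` would be `test_labels` and `y_pred` would be `forest.predict(test_tokenized)`; Scikit-Learn also provides `sklearn.metrics.confusion_matrix` for the same purpose.

```python
from collections import Counter

# Label encoding used in prepare_data: 0 = negative, 1 = neutral, 2 = positive.
NAMES = ["negative", "neutral", "positive"]

# Made-up labels and predictions for illustration only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 1, 2]

# Count each (true, predicted) pair.
pairs = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
matrix = [[pairs[(t, p)] for p in range(len(NAMES))] for t in range(len(NAMES))]

for name, row in zip(NAMES, matrix):
    print(f"{name:>8}: {row}")

# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(NAMES))) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
```

In this toy example every error involves the neutral class, and the corner cells (negative predicted as positive, and positive predicted as negative) are zero, which is the kind of pattern the paragraph above describes.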
I hope you liked this article on Sentiment Analysis. Feel free to ask your questions in the comments section below.
Hi! I am doing sentiment analysis on news headlines to evaluate government performance. I need to know how you annotated the dataset.
Maybe this could help you:
https://thecleverprogrammer.com/2020/05/09/data-science-project-on-text-and-annotations/
Reply if this doesn’t help, and I will create a tutorial on it soon.