Sentiment Analysis

Sentiment analysis is the process of quantifying text content so that it represents the ideas, beliefs, and opinions of entire sectors of an audience. The value of sentiment analysis for business productivity is hard to overestimate. It is also one of those common NLP tasks that every data scientist needs to perform.

For example, suppose you are a student in an online course and you have a problem, so you post it on the class forum. A sentiment analysis system could identify not only the topic you are struggling with but also how frustrated or discouraged you are, so that the replies can be tailored to that sentiment. This is already happening because the technology is already there.

Sentiment Analysis with Machine Learning

Now that you know what sentiment analysis means, I’m going to introduce you to a very easy way to analyze sentiments with machine learning. The data I’ll be using includes 27,481 labelled tweets in the training set and 3,534 tweets in the test set. You can easily download the data from here. Let’s start this task by looking at the data using pandas:

import pandas as pd

training = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print("Training data: \n", training.head())
print("Test Data: \n", test.head())
Training data: 
        textID  ... sentiment
0  cb774db0d1  ...   neutral
1  549e992a42  ...  negative
2  088c60f138  ...  negative
3  9642c003ef  ...  negative
4  358bd9e861  ...  negative

[5 rows x 4 columns]
Test Data: 
        textID                                               text sentiment
0  f87dea47db  Last session of the day   neutral
1  96d74cb729   Shanghai is also really exciting (precisely -...  positive
2  eee518ae67  Recession hit Veronique Branquinho, she has to...  negative
3  01082688c6                                        happy bday!  positive
4  33987a8ee5    - I like it!!  positive

Data processing

For the sake of simplicity, we don’t want to go overboard on the data cleaning side, but there are a few simple things we can do to help our machine learning model identify the sentiments. The data cleaning process is as follows:

  1. Remove all hyperlinks from tweets
  2. Replace common contractions
  3. Remove punctuation

As part of data preparation, the function below cleans the tweet text, maps the sentiment labels to integers, and returns the cleaned tweets together with their labels:

import re

# Apostrophes appear as backticks in this dataset
contractions_dict = {
    "can`t": "can not", "won`t": "will not", "don`t": "do not",
    "aren`t": "are not", "i`d": "i would", "couldn`t": "could not",
    "shouldn`t": "should not", "wouldn`t": "would not", "isn`t": "is not",
    "it`s": "it is", "didn`t": "did not", "weren`t": "were not",
    "mustn`t": "must not",
}

def replace_words(string: str, dictionary: dict) -> str:
    # Replace every key found in the string with its expanded form
    for k, v in dictionary.items():
        string = string.replace(k, v)
    return string

def prepare_data(df: pd.DataFrame) -> tuple:
    # Keep only the text before the first http:// link, lower-case it,
    # then expand contractions
    df["text"] = (
        df["text"]
        .apply(lambda x: re.split(r"http://.*", str(x))[0])
        .str.lower()
        .apply(lambda x: replace_words(x, contractions_dict))
    )
    # Map sentiment names to integer labels
    df["label"] = df["sentiment"].map({"neutral": 1, "negative": 0, "positive": 2})
    return df.text.values, df.label.values

train_tweets, train_labels = prepare_data(training)
test_tweets, test_labels = prepare_data(test)
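The cleaning list above includes removing punctuation, which prepare_data does not do explicitly. Here is a minimal sketch of all three steps applied to a single string. The names clean_tweet and CONTRACTIONS are hypothetical, the dictionary is only a subset of the one above, and the link pattern is an assumption: unlike the code above, it also matches https and removes only the link itself rather than everything after it.

```python
import re
import string

# Hypothetical subset of the contractions above (backticks stand in for
# apostrophes, as in the dataset).
CONTRACTIONS = {"can`t": "can not", "don`t": "do not", "it`s": "it is"}

def clean_tweet(text: str) -> str:
    """Apply the three cleaning steps to one tweet."""
    # 1. Remove hyperlinks (http or https) up to the next whitespace.
    text = re.sub(r"https?://\S+", " ", text)
    # 2. Lower-case, then expand common contractions.
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 3. Remove punctuation, then collapse repeated whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(clean_tweet("I can`t wait!!! See http://example.com/abc for details..."))
# prints: i can not wait see for details
```

Contractions are expanded before punctuation is stripped, because the backtick itself counts as punctuation and would otherwise be removed first.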


Now we need to turn each tweet into a single fixed-length vector, specifically a TF-IDF encoding. To do this we can use the Tokenizer() class built into Keras, fitted on the training data only:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_tweets)

train_tokenized = tokenizer.texts_to_matrix(train_tweets, mode='tfidf')
test_tokenized = tokenizer.texts_to_matrix(test_tweets, mode='tfidf')
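To see why every tweet ends up as a vector of the same length, here is a small pure-Python sketch of the idea. It is an illustration only: it uses the textbook weighting count × log(N / document frequency), which is an assumption and not necessarily the exact smoothed formula Keras applies, and the function names are hypothetical.

```python
import math
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def fit_idf(docs, vocab):
    """Textbook inverse document frequency: log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d.split()))
    return {w: math.log(n_docs / doc_freq[w]) for w in vocab}

def to_vector(doc, vocab, idf):
    """Encode one document as a vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    vec = [0.0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words unseen during fitting are ignored
            vec[vocab[word]] = count * idf[word]
    return vec

train_docs = ["good day", "bad day", "good good movie"]
vocab = fit_vocabulary(train_docs)
idf = fit_idf(train_docs, vocab)

# The last document plays the role of a test tweet with an unseen word.
vectors = [to_vector(d, vocab, idf) for d in train_docs + ["great day"]]
print(vocab)
print([len(v) for v in vectors])
```

The vocabulary and the document frequencies come from the training documents alone, which mirrors fitting the tokenizer on train_tweets and then reusing it for test_tweets.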

Machine Learning Model for Sentiment Analysis

Now, I will train our model for sentiment analysis using the Random Forest Classification algorithm provided by Scikit-Learn:

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the TF-IDF matrix built above
forest = RandomForestClassifier(
    n_estimators=500,
    min_samples_leaf=2,
    oob_score=True,
    n_jobs=-1,
).fit(train_tokenized, train_labels)

print(f"Train score: {forest.score(train_tokenized, train_labels)}")
print(f"OOB score: {forest.oob_score_}")

Train score: 0.7672573778246788
OOB score: 0.6842545758887959
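The OOB (out-of-bag) score is lower than the train score because it is computed differently. Each tree in the forest is trained on a bootstrap sample of the training rows, and the OOB score judges every row using only the trees that never saw it, so it behaves like a built-in validation score. The simulation below is a rough sketch of how many rows one tree leaves out. The function name is hypothetical and the seed is arbitrary.

```python
import math
import random

def out_of_bag_fraction(n_samples: int, rng: random.Random) -> float:
    """Draw one bootstrap sample and return the share of rows it never picked."""
    picked = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(picked) / n_samples

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 27_481               # size of the training set used in this article
fraction = out_of_bag_fraction(n, rng)

print(f"simulated out-of-bag share: {fraction:.3f}")
print(f"theoretical limit 1/e:      {math.exp(-1):.3f}")
```

Roughly a third of the rows are unseen by any given tree, which is enough to give an honest estimate without a separate validation split.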

Evaluating the Model on Test Set

Scikit-Learn makes it easy to use the classifier and the test data to measure performance on the test set, for example with an accuracy score or a confusion matrix. The accuracy is computed as follows:

print("Test score: ",forest.score(test_tokenized,test_labels))

Test score: 0.687889077532541

The accuracy is not that great. Most of our mistakes happen when distinguishing positive from neutral and negative from neutral tweets, which in the grand scheme of errors is not the worst thing to have. Fortunately, we rarely confuse a positive feeling with a negative one or vice versa.
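The error pattern described above is the kind of thing a confusion matrix makes visible. Below is a minimal pure-Python sketch of how such a matrix is tallied. The label lists are made up for illustration and are not the results of the model in this article, and build_confusion_matrix is a hypothetical helper name.

```python
from collections import Counter

LABELS = {0: "negative", 1: "neutral", 2: "positive"}  # same mapping as prepare_data

def build_confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in range(n_classes)] for t in range(n_classes)]

# Made-up labels for illustration only; these are NOT the article's results.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 1, 2]

matrix = build_confusion_matrix(y_true, y_pred)
for t, row in enumerate(matrix):
    print(f"{LABELS[t]:>8}: {row}")

# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
```

With the objects from this article, sklearn.metrics.confusion_matrix(test_labels, forest.predict(test_tokenized)) should give a matrix in the same layout. In the toy data every error involves the neutral class, so the negative/positive corners of the matrix stay at zero.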

I hope you liked this article on Sentiment Analysis. Feel free to ask your questions in the comments section below. You can also follow me on Medium to learn more about Machine Learning.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.


  1. Hi! i am doing sentiment analysis on news headlines to evaluate govt performance. I need to know how did you annotate dataset.
