Sentiment Analysis

Sentiment analysis is the process of quantifying text to represent the ideas, beliefs, and opinions of entire segments of an audience. Its implications for business productivity are hard to overestimate. Sentiment analysis is one of those common NLP tasks that every data scientist needs to perform.

For example, suppose you are a student in an online course and you have a problem, so you post it on the class forum. Sentiment analysis would be able to identify not only the topic you are struggling with, but also how frustrated or discouraged you are, so that responses can be tailored to that sentiment. This is already happening because the technology is already there.

Sentiment Analysis with Machine Learning

I hope you now understand what sentiment analysis means. Now I'm going to introduce you to a very easy way to analyze sentiments with machine learning. The data I'll be using includes 27,481 tagged tweets in the training set and 3,534 tweets in the test set. You can easily download the data from here. Now let's start with this task by looking at the data using pandas:

import pandas as pd
training = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print("Training data: \n",training.head())
print("Test Data: \n",test.head())
Training data: 
        textID  ... sentiment
0  cb774db0d1  ...   neutral
1  549e992a42  ...  negative
2  088c60f138  ...  negative
3  9642c003ef  ...  negative
4  358bd9e861  ...  negative

[5 rows x 4 columns]
Test Data: 
        textID                                               text sentiment
0  f87dea47db  Last session of the day  http://twitpic.com/67ezh   neutral
1  96d74cb729   Shanghai is also really exciting (precisely -...  positive
2  eee518ae67  Recession hit Veronique Branquinho, she has to...  negative
3  01082688c6                                        happy bday!  positive
4  33987a8ee5             http://twitpic.com/4w75p - I like it!!  positive

Data processing

For the sake of simplicity, we don’t want to go overboard on the data cleaning side, but there are a few simple things we can do to help our machine learning model identify the sentiments. The data cleaning process is as follows:

  1. Remove all hyperlinks from tweets
  2. Replace common contractions
  3. Remove punctuation
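
A minimal sketch of these three steps on a single hypothetical tweet (the tweet itself is made up for illustration; the dataset's contractions use backticks, as in the dictionary below):

```python
import re
import string

# Hypothetical sample tweet to walk through the three cleaning steps
tweet = "I can`t wait! http://twitpic.com/67ezh"

# 1. Remove hyperlinks: keep only the text before "http://"
tweet = re.split(r"http://.*", tweet)[0]

# 2. Lowercase and expand a common contraction
tweet = tweet.lower().replace("can`t", "can not")

# 3. Remove punctuation
tweet = tweet.translate(str.maketrans("", "", string.punctuation))

print(tweet.strip())  # -> i can not wait
```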

As part of data preparation, we can also write a function that cleans the text, maps the sentiment labels to integers, and returns both:

import re
import string

contractions_dict = {"can`t": "can not",
                     "won`t": "will not",
                     "don`t": "do not",
                     "aren`t": "are not",
                     "i`d": "i would",
                     "couldn`t": "could not",
                     "shouldn`t": "should not",
                     "wouldn`t": "would not",
                     "isn`t": "is not",
                     "it`s": "it is",
                     "didn`t": "did not",
                     "weren`t": "were not",
                     "mustn`t": "must not",
                     }


def replace_words(text: str, dictionary: dict) -> str:
    # Replace every contraction found in the text with its expansion
    for k, v in dictionary.items():
        text = text.replace(k, v)
    return text


def prepare_data(df: pd.DataFrame) -> tuple:
    # 1. Remove hyperlinks, 2. lowercase and expand contractions,
    # 3. strip punctuation (done last so the backtick contractions survive)
    df["text"] = df["text"] \
        .apply(lambda x: re.split(r"http://.*", str(x))[0]) \
        .str.lower() \
        .apply(lambda x: replace_words(x, contractions_dict)) \
        .apply(lambda x: x.translate(str.maketrans("", "", string.punctuation)))

    # Map the sentiment labels to integers
    df["label"] = df["sentiment"].map(
        {"negative": 0, "neutral": 1, "positive": 2}
    )
    return df["text"].values, df["label"].values


train_tweets, train_labels = prepare_data(training)
test_tweets, test_labels = prepare_data(test)
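
As a quick sanity check of the label mapping, here it is applied to a toy frame (this mini-frame is hypothetical, not the actual tweet data):

```python
import pandas as pd

# Hypothetical mini-frame with the same columns as the tweet data
df = pd.DataFrame({"text": ["happy bday", "recession hit", "last session"],
                   "sentiment": ["positive", "negative", "neutral"]})

# Same mapping as in prepare_data
df["label"] = df["sentiment"].map({"negative": 0, "neutral": 1, "positive": 2})
print(df["label"].tolist())  # -> [2, 0, 1]
```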

Tokenization

Now we need to turn each tweet into a single fixed-length vector – specifically, a TF-IDF encoding. To do this we can use the Tokenizer() built into Keras, fitted on the training data:

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_tweets)
train_tokenized = tokenizer.texts_to_matrix(train_tweets,mode='tfidf')
test_tokenized = tokenizer.texts_to_matrix(test_tweets,mode='tfidf')
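
To see what `texts_to_matrix(mode='tfidf')` produces conceptually, here is a small illustration using scikit-learn's `TfidfVectorizer` on a toy corpus (both the corpus and the vectorizer here are stand-ins for the tweets and the Keras tokenizer): one row per text, one column per vocabulary word, with higher weights for words that are frequent in a text but rare across texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the tweets
corpus = ["last session of the day",
          "recession hit hard",
          "happy bday happy happy"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# One row per text, one column per vocabulary word
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```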

Machine Learning Model for Sentiment Analysis

Now, I will train our model for sentiment analysis using the Random Forest Classification algorithm provided by Scikit-Learn:

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=2,
                                oob_score=True, n_jobs=-1)
forest.fit(train_tokenized, train_labels)
print(f"Train score: {forest.score(train_tokenized, train_labels)}")
print(f"OOB score: {forest.oob_score_}")

Train score: 0.7672573778246788
OOB score: 0.6842545758887959

Evaluating the Model on Test Set

Scikit-Learn makes it easy to evaluate the trained classifier on the test set:

print("Test score: ",forest.score(test_tokenized,test_labels))

Test score: 0.687889077532541
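
To see which sentiments get confused with which, scikit-learn's `confusion_matrix` can be applied to the predictions. Here is a minimal sketch with toy labels (0 = negative, 1 = neutral, 2 = positive) standing in for `test_labels` and `forest.predict(test_tokenized)`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for test_labels and forest.predict(test_tokenized)
y_true = [0, 0, 1, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 2, 1, 1, 0]

# Rows are true classes, columns are predicted classes;
# off-diagonal entries count misclassifications
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

On the real test set, large off-diagonal counts in the neutral column would confirm that most errors involve the neutral class.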


The accuracy is not that great because most of our mistakes happen when distinguishing neutral from positive and neutral from negative sentiments, which in the grand scheme of errors is not the worst thing to have. Fortunately, we rarely confuse positive with negative sentiment and vice versa.

I hope you liked this article on sentiment analysis. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of machine learning.


Aman Kharwal