Data scientists spend a lot of time preparing datasets, because raw data often contains errors and is sometimes unlabeled. A dataset has to be labeled before you can use it to solve a problem. Sentiment analysis is a typical example: the data arrives as reviews or comments from users, and each one needs a sentiment label before you can use it. So, if you want to learn how to label unlabeled data, this article is for you. In this article, I will present a tutorial on how to add labels to a dataset for sentiment analysis using Python.
Add Labels to a Dataset for Sentiment Analysis
To add labels to unlabeled data for sentiment analysis, we can use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment model that needs no labeled training data. We can access it through the NLTK library in Python. Let's import the necessary Python libraries and the unlabeled dataset that we need for this task:
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the lexicon that the VADER analyzer relies on
nltk.download("vader_lexicon")

data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/reviews%20data.csv")
data = data.dropna()  # drop rows with missing reviews
print(data.head())
                                              Review
0  nice hotel expensive parking got good deal sta...
1  ok nothing special charge diamond member hilto...
2  nice rooms not 4* experience hotel monaco seat...
3  unique, great stay, wonderful time hotel monac...
4  great stay great stay, went seahawk game aweso...
This dataset contains only one column, so I will now move on to the task of adding labels to it. I will start by adding four new columns to this dataset, Positive, Negative, Neutral, and Compound, by calculating the sentiment scores of the column containing the text:
sentiments = SentimentIntensityAnalyzer()

# One column per VADER score, computed for every review
data["Positive"] = [sentiments.polarity_scores(i)["pos"] for i in data["Review"]]
data["Negative"] = [sentiments.polarity_scores(i)["neg"] for i in data["Review"]]
data["Neutral"] = [sentiments.polarity_scores(i)["neu"] for i in data["Review"]]
data["Compound"] = [sentiments.polarity_scores(i)["compound"] for i in data["Review"]]
data.head()
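The list comprehensions above call polarity_scores four times for every review. A minimal sketch of an alternative that scores each review once is shown below. The helper name add_score_columns and its parameters are hypothetical, and it assumes the scorer returns a dictionary with the keys "pos", "neg", "neu" and "compound", as polarity_scores does.

```python
import pandas as pd

# Maps VADER's dictionary keys to the column names used in this article.
SCORE_COLUMNS = {"pos": "Positive", "neg": "Negative", "neu": "Neutral", "compound": "Compound"}


def add_score_columns(df, scorer, text_col="Review"):
    """Score each text once and append the four score columns.

    `scorer` is assumed to be a callable such as
    SentimentIntensityAnalyzer().polarity_scores, i.e. one that returns
    a dict with the keys "pos", "neg", "neu" and "compound".
    """
    # Reuse df.index so the scores stay aligned even if dropna() left gaps.
    scores = pd.DataFrame([scorer(text) for text in df[text_col]], index=df.index)
    scores = scores.rename(columns=SCORE_COLUMNS)[list(SCORE_COLUMNS.values())]
    return df.join(scores)
```

With the analyzer from this tutorial, the call would look like data = add_score_columns(data, sentiments.polarity_scores). Passing the scorer in as an argument is only a design choice for this sketch; it keeps the helper independent of NLTK.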
The dataset now has four new columns containing the sentiment scores of the Review column. The next task is to turn these scores into labels. Following the thresholds commonly used with VADER, a review is labeled Positive if its compound score is 0.05 or higher, Negative if it is -0.05 or lower, and Neutral otherwise. With this rule, I will add a new column to the dataset that holds the sentiment label of each review:
score = data["Compound"].values
sentiment = []  # one label per review
for i in score:
    if i >= 0.05:
        sentiment.append("Positive")
    elif i <= -0.05:
        sentiment.append("Negative")
    else:
        sentiment.append("Neutral")
data["Sentiment"] = sentiment
data.head()
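The same thresholding rule can also be written as a small function, which makes the cut-offs easy to test and change. This is only a sketch: the name label_from_compound and the threshold parameter are hypothetical and not part of NLTK or pandas.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a sentiment label using symmetric cut-offs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"
```

Assuming the Compound column from the previous step, it could be applied with data["Sentiment"] = data["Compound"].apply(label_from_compound).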
Now let’s have a look at the frequencies of all the labels:
Positive    18831
Negative     1569
Neutral        91
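The call that produced these counts is not shown. One way to get such a table, assuming the labels are stored in a pandas column, is value_counts. The sketch below uses a handful of made-up labels rather than the review data, so its numbers are not the article's results.

```python
import pandas as pd

# Toy labels for illustration only, not the review dataset.
labels = pd.Series(
    ["Positive", "Positive", "Negative", "Neutral", "Positive"],
    name="Sentiment",
)

counts = labels.value_counts()                          # absolute frequency per label
shares = labels.value_counts(normalize=True).round(2)   # fraction of rows per label

print(counts)
print(shares)
```

On the tutorial's dataset the equivalent call would be data["Sentiment"].value_counts().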
We now have six columns in the labeled dataset. Review was the only column in the original data; we added four columns of sentiment scores and, finally, a column of labels derived from those scores. If you only want the text and label columns, you can drop the others before saving. You can then save the newly labeled data to a file, for example with the pandas to_csv method.
So this is how you can add labels to an unlabeled dataset for sentiment analysis using Python. I hope you liked this article on how to add labels to a dataset for sentiment analysis. Feel free to ask your questions in the comments section below.