NLP For WhatsApp Chats

Natural Language Processing (NLP) is a field of Artificial Intelligence focused on enabling systems to understand and process human language. In this article, I will use NLP to analyze my WhatsApp chats. For privacy reasons, I will refer to the participants as Person 1, Person 2 and so on.

Get the WhatsApp Data for NLP

If you have never exported your WhatsApp chats before, don't worry: it's very easy. To analyze WhatsApp chats with NLP, you first need to extract them from your smartphone. Just open any chat in WhatsApp and select the export chat option. The text file you get in return will look like this:

["[02/07/2017, 5:47:33 pm] Person_1: Hey there! This is the first message",
 "[02/07/2017, 5:48:24 pm] Person_1: This is the second message",
 "[02/07/2017, 5:48:44 pm] Person_1: Third…",
 "[02/07/2017, 8:10:52 pm] Person_2: Hey Person_1! This is the fourth message",
 "[02/07/2017, 8:14:11 pm] Person_2: Fifth …etc"]
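Before any parsing, the exported text file needs to be read into a list of lines. A minimal sketch (the filename `_chat.txt` and the helper name `load_chat` are my own assumptions; use whatever name your export produced) that yields the `content` list used in the next step:

```python
def load_chat(path):
    """Read an exported WhatsApp chat file into a list of message lines."""
    with open(path, encoding="utf-8") as f:
        # drop blank lines and trailing newlines
        return [line.rstrip("\n") for line in f if line.strip()]

# content = load_chat("_chat.txt")   # "_chat.txt" is an assumed filename
```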

I will use two different approaches for the NLP of WhatsApp chats: first by focusing on the fundamentals of NLP, and second by using the datetime stamp at the start of every message.

Formatting WhatsApp Chats for NLP

To analyze our WhatsApp conversations, the chat first needs to be formatted as structured data. This takes only a few basic steps: build a dictionary with one key per person, where each value is a list of that person's sentence-tokenized messages.

import nltk
from collections import defaultdict

ppl = defaultdict(list)

for line in content:
    try:
        person = line.split(':')[2][7:]   # '33 pm] Person_1' -> 'Person_1'
        text = nltk.sent_tokenize(':'.join(line.split(':')[3:]))  # sentence-tokenize the message body
        ppl[person].extend(text)   # if the key (person) exists, extend its list with text;
                                   # otherwise defaultdict creates a new key first
    except IndexError:
        print(line)   # if a line fails to parse (e.g. a system message), examine why
ppl = {'Person_1': ['This is message 1', 'Another message',
                    'Hi Person_2', ..., 'My last tokenised message in the chat'],
       'Person_2': ['Hello Person_1!', "How's it going?", 'Another message',
                    ...]}

Classification of Dialogues

The tokenized conversations will be classified by training a Naive Bayes classifier on a training set of pre-categorized chat-style posts (NLTK's NPS Chat corpus):

import nltk

posts = nltk.corpus.nps_chat.xml_posts()

def extract_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

fposts = [(extract_features(p.text), p.get('class')) for p in posts]
test_size = int(len(fposts) * 0.1)
train_set, test_set = fposts[test_size:], fposts[:test_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

The trained model can be tested on the held-out test set or even on user input. It can classify any tokenized sentence into categories such as Greetings, Statements, Emotions, Questions, etc.

classifier.classify(extract_features('Hi there!'))

‘Greet’

classifier.classify(extract_features('Do you want to watch a film later?'))

‘ynQuestion’

Now let's run the model on the WhatsApp data and count the occurrences of each category across the tokenized messages:
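The construction of the `df` that gets plotted isn't shown, so here is one possible sketch: run the classifier over every tokenized message per person and count the predicted categories. `category_counts` and its `classify` argument are my own names; with the model above you would pass `lambda s: classifier.classify(extract_features(s))`.

```python
from collections import Counter

def category_counts(ppl, classify):
    """Count predicted dialogue-act categories for each person.

    `classify` is any callable mapping a sentence to a category label.
    """
    return {person: dict(Counter(classify(s) for s in sentences))
            for person, sentences in ppl.items()}
```

A dataframe for plotting could then be built with `df = pd.DataFrame(category_counts(ppl, clf)).fillna(0).T`, whose transpose puts the message categories on the x-axis.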

import matplotlib.pyplot as plt

ax = df.T.plot(kind='bar', figsize=(10, 7), legend=True,
               fontsize=16, color=['y', 'g'])

ax.set_title("Frequency of Message Categories", fontsize=18)
ax.set_xlabel("Message Category", fontsize=14)
ax.set_ylabel("Frequency", fontsize=14)

# plt.savefig('plots/cat_message')   # uncomment to save
plt.show()
[Figure: bar chart of message category frequencies per person]

NLP for WhatsApp Chats Emotions

We all use emojis, not only on WhatsApp but on every other chat platform. Now let's see which emojis appear most often in the conversations.

import emoji
from collections import Counter

def extract_emojis(text):   # renamed from `str` to avoid shadowing the built-in
    # note: emoji.UNICODE_EMOJI was removed in emoji>=2.0; use emoji.EMOJI_DATA there
    return ''.join(c for c in text if c in emoji.UNICODE_EMOJI)

for key in ppl:
    emojis = extract_emojis(str(ppl[key]))
    count = Counter(emojis).most_common()[:10]

    print("{}'s emojis:\n {} \n".format(key, emojis))
    print("Most common: {}\n\n".format(count))
Person_1's emojis:
 😏🕺🏼🍻😮🤤😭😏💁🏼😏👏🙏🐳🐋😏😱🙄😳☺😭🚀💫⭐✨💥🍕🍕😏😊😘🙄💭😭😭😭😭😏✅😱😏😭🙄😘😘😘😘😭😭😭😭😭😭🍸😘😘😅😘😭👏💪😭🙅♂🙆♂🙋♂💁♂😘🎉🎉🎉🎉🎉🎉🎉🎉🎉😊😘🙄😴😉🕺🏼😭😎😭🙄😘😘😘👏😩😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭😭📞🎉😘😀😚😱👏🏏😏🚂🤓👏🙄🙌😘😘😏😭😭🙌😏😔😭😘🤰🏼😘🙄🙄😰🙋🏼♀😭🙄😍🤓👏😭😭😭😭😭😘🍕💩☹🙋🏼♀😘😴🚲😘😘😘😭☹😗😙😚😚🤔🤝🍻🎂✈😘👌😰😘🔺🔥😩😘💨😚😱😢😭😭😭😭😭😭😭😭😭😭😗🤔🤔🤔🤔🤔🤔🤔👀👏😇😗😚😘🙄☹😘😩😚😇⚡💥🔥☹😭😩😭😰😱😅😅😍😞👏👏👏👏👏👏😘😘😊😘😘😍😘🙄😏😘😘🙄😘👀😘😘👀😘😘😘🥕😘😘😘😘😘😘😘😘😭😘😘🖕🏻😘🌇😘😘😘🙄😪🤧😘🥚😘😘😘😘😘😘😘😘😘😘😱😘😭😭😘🆘❌‼⭕♨🚫⛔🚷🖍📌📍✂📕📮🔻☎⏰🚨🚒🚗🥊🏓🍷🌶🍅🍎☄🌹🎒👠⛑😎😘😘😘😘😙👀🙄😭😭😭😭😭😭😘😘😘🥚😘🙄🙄😘 

Most common: [('😘', 77), ('😭', 68), ('🙄', 16), ('👏', 13), ('😏', 11), ('🎉', 10), ('🤔', 8), ('🏼', 6), ('😱', 6), ('😚', 6)]


Person_2's emojis:
 😁🙂🤓😅😀👍😂😬👻😁😂✌😴😬😬🙄🎉✌😂😪😒😬😐😬😁😬😁😏🤢😁😒😁😏😘😒😅😂💪👊😬😏💁♂😴😬😅😏😆🐬🙁😬🐬😁😁✌😁😁👊👮😕✌😁😁😐✌😱😩😬✌✌😂😘💇♂😁😁😁😅🙂😬🙁😁😁😕😴😁😏😁😘😅😴🙂🎉🎉🎉😁🚀🚀🚀😁😱✌🍕🍕😏👍😂😁😑😘🙄😁😘😬😂😁🎉🎉🎉✌☺😑😁😬🙂😱😂✌☺😁👊😁👊👍😏💁🏼😅😁😁😁😕✌🤓😂😘😁😁✌✌😘🙁😘😁🎉✌😘😘😘😘😅😁😁😁😁😂🙁😏😔✌😘😁😐😁✌🙂👍😘😬😁✌😂🙋🏼😎😁🤓💩😂😘😐😏✌🙂✌😘✌😁🤔✌🏋🏼♀😬🙂😁👊😁✌😁😁😏🤜🤛☹⚡😬🎯💪😁☹😞👋🙂😘😴😁😁🎉😁✌🙂😘😬✌👍😁💃👍👍👍👍😢☹🙁🙁👋😏😬😁✌😘🙁👍🙌🤓😏🎉💁♂😁😑😁😁😁🎉😁☹😕😢😬✌😞😬✌😬👍😁😏😁👍👍👊😁😧😘😪😁🎉🎉🎉😕👍😁👉😁👊😏😁😁😂😂😂🤳👌😁👌🙋🏼♀👋😐😐😁🙁😕👊😁🤔🤗🤙👍😬🤔🎉🎅🏻👍😁😁😁🤚😘🤚👍👊🙁🙁🙁🙄😘🙋🏼♀🤣😘🎉😬🙁😖💁♂😂😒🎉😗👏🤔🤐🙄👊😘😉😘🙂☹💰😏🎉😑😬👍👍👎🙋♂💁♂😁😁🙂☹🤔🦄🦄😬😆😴😁😁😁😍🏄♀👀😁🏄♀👍😬👊😬🤔😁🙄👌👍😫☹🤗😩👀😁💰🤔👍😁😰😳😣😟😘👀🤗🙂😅👍🤔🙂😁😁😣🕺😮🙂☹☹😑🤘☹😬🍳😘😬😘🤘🙋♂🙁🍓😢😁😂😂😂😁😘🐑😚😚😚🤞😁🙄😁🙋♂😴😘👍😁👊😑😒👍😑😬👍👍👍😕☹😟💇♀👏🎉😏😁😚🤔👍👍😁😏👍😁😚😁🎉😬🙂😬😁🔥🤝☹🙌😏💁♂😁😁😁😁😁🙁😭🙂😬😘🙂😁😬👍☺🙁😂👀👌🙌😁💁🏼♀😁😬👍😕🙂😗😁😕🙁👀😁👏🎉😩😕🙁😊😴🤞😚😩😩😩😁😬👍👍😬😚😁😱👻👽😑😁😴🤒😁🙁👊🤓☹😁🤙😁👽👊😊🤙😁☹🙄😇🙂😁😩☹😚😏👍🙁👋😟😁☹😚🤔😧🙁☹🙃🙂👋🙂👍👍😁🤙👍💰🙂😢🤙💰😚👍🤔🤣🤣🤣🎉😢😏😬🤓👊💁♂😁😁😁👍🔥🤙😁👉😗😁⚡💆♀⚡👏😚😘🤔☹🤝😢😳😳😉👍☺👊☹⚡⚡⚡☹☹☹👍☹😚🔥🔥😢💰😁😬👊🤔👻🙌💁🏼♀😒😫👍👊😇🙂🤔🤙☹😪😉👍😁💪😭😁💩🤤😚☹☹👊🤙😚😘🙏🤥😁👍👍😚🤗😁🙄🙄😁👍😁😯😚👍🙄🙌🤔😁😘👍👊😱😏👍😘😁🎉😭😁😚😘😴👍😏🤔🤔😏🤢😘😭😭😚😬👍😘👊👌😘😁😁😚👋😁✋☝😭🤔👍😘🤙💁🏼♀😘😘👍👀👋😘😘😘😘😘😘😘😘🙁🙁👍😘😁😚👊👍😬👍👍🎉👍😋😘😘😘😘😘😘😘😘☹😘😁👍😁🤙👏👍😚😘😘😘😘😘😘👍💁🏼♀👍😘😏🤔👍👍👍😘👍😁😘👊👍👍👍👍☹👍👍👍👍😘👍😴🤙😘😘😘😘😘😘😘😘😕👊👍👍😁😘😚👆💁♀😴😘👊😥👊👍😅🙂👊🤙😘😘😘😘😘😘😘😘😲😘👍🤔😫🤣🍳😎😚😢😯💃👍🙄👍👍💇♂👊😚😚😘👍🙄😘😚😘😢🛎😚🙏😂😘😘😘👌👍🤷♂😂👍😕👍😘😘😘😘👏👊😅😉💤👍😁😚👍🤙🤓🤗😘😁💃😏😘😘😬💁♂😂☹😁👍😘 

Most common: [('😁', 138), ('😘', 103), ('👍', 91), ('😬', 42), ('👊', 29), ('☹', 29), ('😚', 28), ('✌', 27), ('😏', 25), ('🙂', 24)]

It's very interesting to see how much more heavily one person uses emojis than the other. Emojis are one of the few ways we express emotion in a WhatsApp conversation.

Sentiment Against Time

Plotting sentiment against time is not as easy as it looks. Since there are many messages, and thus many sentiment scores, on the same day, the first step is to group the messages by date and calculate the mean sentiment for each day. Let's see how we can do this:

import pandas as pd

df = pd.DataFrame(final).T   # convert the dictionary to a dataframe; makes plotting straightforward
df.columns = ['pol', 'name', 'date', 'token']
df['pol'] = df['pol'].apply(lambda x: float(x))   # convert polarity to a float
df3 = df.groupby(['date'], as_index=False).agg('mean')
df3['name'] = 'Combined'
final = pd.concat([df2, df3])   # df2: per-person daily means, computed analogously (not shown)
final['date'] = pd.to_datetime(final.date, format='%d/%m/%Y')   # change 'date' to a datetime object
final = final.sort_values('date')
final['x'] = final['date'].rank(method='dense', ascending=True).astype(int)
final[:6]
       date        name      pol       x
17     2017-07-02  Combined  0.321162  1
34     2017-07-02  Person_1  0.298490  1
35     2017-07-02  Person_2  0.341773  1
59     2017-07-03  Person_2  0.249489  2
29     2017-07-03  Combined  0.271458  2
58     2017-07-03  Person_1  0.337367  2
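The per-day averaging that `groupby('date').agg('mean')` performs can also be sketched in plain Python, which makes the step explicit (`daily_mean` is a toy helper of my own, not part of the original code):

```python
from collections import defaultdict

def daily_mean(records):
    """records: iterable of (date, polarity) pairs -> {date: mean polarity}."""
    sums = defaultdict(lambda: [0.0, 0])
    for date, pol in records:
        sums[date][0] += pol   # running total of polarity for this date
        sums[date][1] += 1     # number of messages on this date
    return {date: total / n for date, (total, n) in sums.items()}
```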

Even the average sentiment for each day is very noisy when plotted directly. So let's take a 10-day rolling average instead, and then plot that smoothed sentiment score:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(12, 8))
colours = ['b', 'y', 'g']
i = 0

for label, df in final.groupby('name'):
    new = df.reset_index()
    new['rol'] = new['pol'].rolling(10).mean()   # rolling mean over a 10-day window

    g = new.plot(x='date', y='rol', ax=ax, label=label, color=colours[i])      # rolling mean line
    plt.scatter(df['date'].tolist(), df['pol'], color=colours[i], alpha=0.2)   # underlying scatter
    i += 1

ax.set_ybound(lower=-0.1, upper=0.4)
ax.set_xlabel('Date', fontsize=15)
ax.set_ylabel('Compound Sentiment', fontsize=15)

g.set_title('10 Day Rolling Mean Sentiment', fontsize=18)
[Figure: 10-day rolling mean sentiment over time, per person and combined]

Frequency of Chats

Now let's have a look at the frequency of WhatsApp chats. This is not strictly NLP but time series analysis: we can use the message timestamps to see when chats happen. First, we need to create a colour palette ordered by the total number of messages for each day of the week.

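The `day` and `float_time` columns used below aren't constructed in the article; one way (assuming the timestamp format shown in the export sample earlier, with `parse_meta` as a hypothetical helper name) is to parse them straight from each line's datetime stamp:

```python
from datetime import datetime

def parse_meta(line):
    """Return (weekday_name, float_time) from a line like
    '[02/07/2017, 5:47:33 pm] Person_1: Hey there!'."""
    stamp = line[1:line.index(']')]            # '02/07/2017, 5:47:33 pm'
    dt = datetime.strptime(stamp, '%d/%m/%Y, %I:%M:%S %p')
    return dt.strftime('%A'), dt.hour + dt.minute / 60
```

Applying this over `content` gives the tuples that fill the `df.day` and `df.float_time` columns.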

pal = sns.cubehelix_palette(7, rot=-.25, light=.7)

Next, build an ordered list of days according to total message count:

days_freq = list(df.day.value_counts().index)
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

This is essentially the current order of colours:

lst = list(zip(days, pal[::-1]))
lst
[('Monday', [0.12071162840208301, 0.14526386650440642, 0.2463679091477368]),
 ('Tuesday', [0.18152581198633005, 0.24364059111738742, 0.37281834227732574]),
 ('Wednesday', [0.2426591079772084, 0.3511228226876375, 0.4852103253459974]),
 ('Thursday', [0.30463866738797124, 0.45571986933681846, 0.5751187147066701]),
 ('Friday', [0.37810168111401876, 0.5633546614344814, 0.6530658354036274]),
 ('Saturday', [0.46091631066717925, 0.662287611911293, 0.7165315069314769]),
 ('Sunday', [0.5632111255041908, 0.758620966612444, 0.7764634182455044])]

Reorder colours according to their index position in the ‘days_freq’ list:

# each weekday gets the colour matching its rank in the 'days_freq' list
pal_reorder = [lst[days_freq.index(day)][1] for day in days]
pal_reorder   # colours ordered according to total message count for the day
[[0.30463866738797124, 0.45571986933681846, 0.5751187147066701],
 [0.18152581198633005, 0.24364059111738742, 0.37281834227732574],
 [0.12071162840208301, 0.14526386650440642, 0.2463679091477368],
 [0.2426591079772084, 0.3511228226876375, 0.4852103253459974],
 [0.37810168111401876, 0.5633546614344814, 0.6530658354036274],
 [0.5632111255041908, 0.758620966612444, 0.7764634182455044],
 [0.46091631066717925, 0.662287611911293, 0.7165315069314769]]
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
pal = sns.cubehelix_palette(7, rot=-.25, light=.7)
g = sns.FacetGrid(df[(df.float_time > 8)], row="day", hue="day",   # change "day" to year_month if required
                  aspect=10, size=1.5, palette=pal_reorder, xlim=(7, 24))
# note: newer seaborn versions use height= instead of size=, and bw_adjust= instead of bw=

# Draw the densities in a few steps
g.map(sns.kdeplot, "float_time", clip_on=False, shade=True, alpha=1, lw=1.5, bw=.2)
g.map(sns.kdeplot, "float_time", clip_on=False, color="w", lw=3, bw=.2)
g.map(plt.axhline, y=0, lw=1, clip_on=False)

# Define and use a simple function to label each facet in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, 0.1, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes, size=18)

g.map(label, "float_time")
g.set_xlabels('Time of Day', fontsize=30)
g.set_xticklabels(fontsize=20)
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-0.5)
g.fig.suptitle('Message Density by Time and Day of the Week, Shaded by Total Message Count', fontsize=22)
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
[Figure: ridgeline plot of message density by time of day for each weekday]


I hope you liked this article on NLP for WhatsApp chats. Feel free to ask your valuable questions in the comments section below, and don't forget to subscribe to my daily newsletter to get email notifications whenever I publish new work.

Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.