By Aman Kharwal

NLP For WhatsApp Chats

Natural Language Processing, or NLP, is a field of Artificial Intelligence that focuses on enabling systems to understand and process human language. In this article, I will use NLP to analyze my WhatsApp chats. For privacy reasons, I will refer to the participants as Person 1, Person 2, and so on.

Get the WhatsApp Data for NLP

If you have never exported your WhatsApp chats before, don't worry: it's very easy. For NLP of WhatsApp chats, you first need to extract the chats from your smartphone. Just open any chat in WhatsApp and select the Export Chat option. The text file you get back will look like this:

["[02/07/2017, 5:47:33 pm] Person_1: Hey there! This is the first message",
 "[02/07/2017, 5:48:24 pm] Person_1: This is the second message",
 "[02/07/2017, 5:48:44 pm] Person_1: Thirdโ€ฆ",
 "[02/07/2017, 8:10:52 pm] Person_2: Hey Person_1! This is the fourth message",
 "[02/07/2017, 8:14:11 pm] Person_2: Fifth โ€ฆetc"]

I will use two different approaches for the NLP of WhatsApp chats: the first focuses on the fundamentals of NLP, and the second uses the datetime stamp at the start of every message.

Formatting WhatsApp Chats for NLP

To analyze our WhatsApp conversations, we first need to turn the raw export into structured data. The basic step is to build a dictionary with one key per person (here, two keys), where each key's value is a list of that person's sentence-tokenized messages.
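The code below iterates over content, the raw lines of the exported chat. A minimal way to load it, assuming the export was saved as _chat.txt (a hypothetical filename):

# read the exported chat into a list of lines ('_chat.txt' is an assumed filename)
with open('_chat.txt', encoding='utf-8') as f:
    content = f.readlines()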

from collections import defaultdict
import nltk   # sentence tokenization requires nltk.download('punkt')

ppl = defaultdict(list)

for line in content:
    try:
        person = line.split(':')[2][7:]   # '[02/07/2017, 5:47:33 pm] Person_1: ...' -> 'Person_1'
        text = nltk.sent_tokenize(':'.join(line.split(':')[3:]))
        ppl[person].extend(text)   # if the key (person) exists, extend its list with text;
                                   # if not, create a new key with text as its value
    except IndexError:
        print(line)   # if parsing a line fails, print it so we can examine why
The resulting dictionary looks like this:

ppl = {'Person_1': ['This is message 1', 'Another message',
                    'Hi Person_2', ... , 'My last tokenised message in the chat'],
       'Person_2': ['Hello Person_1!', "How's it going?", 'Another message',
                    ...]}

Classification of Dialogues

The classification of tokenized conversations will be achieved by training a Naive Bayes classifier on a training set of pre-categorized chat-style conversations:

# the NPS Chat corpus ships with NLTK; run nltk.download('nps_chat') once to fetch it
posts = nltk.corpus.nps_chat.xml_posts()

def extract_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

fposts = [(extract_features(p.text), p.get('class')) for p in posts]
test_size = int(len(fposts) * 0.1)
train_set, test_set = fposts[test_size:], fposts[:test_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

Our trained model can be tested on the held-out test set or even on direct user input. It can classify any tokenized sentence into dialogue categories such as Greet, Statement, Emotion, ynQuestion, and so on.
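For the test-set route, NLTK's standard accuracy helper gives a quick score; the exact figure depends on the split, and the NLTK book reports roughly 0.67 for this setup:

print(nltk.classify.accuracy(classifier, test_set))

And on direct user input: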

classifier.classify(extract_features('Hi there!'))

‘Greet’

classifier.classify(extract_features('Do you want to watch a film later?'))

‘ynQuestion’

Now let's run the model on the WhatsApp data and count how often each category occurs across the tokenized conversations.
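The construction of df isn't shown above, so here is a minimal sketch, assuming the ppl dictionary and classifier from earlier:

from collections import Counter
import pandas as pd

# classify every tokenized sentence per person and count the predicted categories
counts = {person: Counter(classifier.classify(extract_features(sent))
                          for sent in sents)
          for person, sents in ppl.items()}

df = pd.DataFrame(counts).T.fillna(0)   # one row per person, one column per category

With df in hand, we can plot the category frequencies: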

import matplotlib.pyplot as plt

ax = df.T.plot(kind='bar', figsize=(10, 7), legend=True,
               fontsize=16, color=['y', 'g'])

ax.set_title("Frequency of Message Categories", fontsize=18)
ax.set_xlabel("Message Category", fontsize=14)
ax.set_ylabel("Frequency", fontsize=14)

#plt.savefig('plots/cat_message')   # uncomment to save
plt.show()
[Plot: Frequency of Message Categories]

NLP for WhatsApp Chats: Emotions

We all use emojis, not only on WhatsApp but on every other chat platform too. Now let's see which emojis are used most in the conversations.

import emoji   # pip install emoji; versions >= 2.0 replace UNICODE_EMOJI with EMOJI_DATA
from collections import Counter

def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI)

for key, val in ppl.items():
    emojis = extract_emojis(str(val))
    count = Counter(emojis).most_common()[:10]

    print("{}'s emojis:\n {} \n".format(key, emojis))
    print("Most common: {}\n\n".format(count))
Person_1's emojis:
 [long emoji string omitted: mis-encoded in the source]

Most common: [('😘', 77), ('😭', 68), ('🙄', 16), ('�', 13), ('�', 11), ('🎉', 10), ('🤔', 8), ('🏼', 6), ('😱', 6), ('😚', 6)]


Person_2's emojis:
 [long emoji string omitted: mis-encoded in the source]

Most common: [('�', 138), ('😘', 103), ('�', 91), ('😬', 42), ('👊', 29), ('☹', 29), ('😚', 28), ('✌', 27), ('�', 25), ('🙂', 24)]

It's very interesting to see how differently the two people use emojis, and how many more one person sends than the other. Emojis are, after all, the main way we express emotion in a WhatsApp conversation.

Sentiment Against Time

Plotting sentiment against time is not as easy as it looks. Since there are many messages, and therefore many different sentiment scores, on the same day, the first step is to score every message, then group by date and calculate the mean sentiment for each day. So let's see how we can do this:
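The per-message scores are not computed above; here is a minimal sketch of that step, assuming NLTK's VADER analyzer (the 'Compound Sentiment' axis label below suggests a compound score of this kind) and the same line parsing as before:

from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
final = {}

for i, line in enumerate(content):
    try:
        date = line.split(',')[0].lstrip('[')              # e.g. '02/07/2017'
        person = line.split(':')[2][7:]
        text = ':'.join(line.split(':')[3:]).strip()
        final[i] = [sia.polarity_scores(text)['compound'], person, date, text]
    except IndexError:
        pass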

df = pd.DataFrame(final).T   # convert dictionary to a dataframe, which makes plotting straightforward
df.columns = ['pol', 'name', 'date', 'token']
df['pol'] = df['pol'].apply(lambda x: float(x))   # convert polarity to a float

# df2: per-person daily mean sentiment (an assumed step; df2 is used but not defined above)
df2 = df.groupby(['date', 'name'], as_index=False).agg('mean')

df3 = df.groupby(['date'], as_index=False).agg('mean')   # daily mean across both people
df3['name'] = 'Combined'

final = pd.concat([df2, df3])
final['date'] = pd.to_datetime(final.date, format='%d/%m/%Y')   # need to change 'date' to a datetime object
final = final.sort_values('date')
final['x'] = final['date'].rank(method='dense', ascending=True).astype(int)
final[:6]
      date        name      pol       x
17    2017-07-02  Combined  0.321162  1
34    2017-07-02  Person_1  0.298490  1
35    2017-07-02  Person_2  0.341773  1
59    2017-07-03  Person_2  0.249489  2
29    2017-07-03  Combined  0.271458  2
58    2017-07-03  Person_1  0.337367  2

Even plotting the per-day average sentiment turns out to be very messy, so let's take a 10-day rolling average and plot that instead:

sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(12,8))
colours=['b','y','g']
i=0

for label, df in final.groupby('name'):
    
    new=df.reset_index()
    new['rol'] = new['pol'].rolling(10).mean() # rolling mean calculation on a 10 day basis
    
    g = new.plot(x='date', y='rol', ax=ax, label=label, color=colours[i]) # rolling mean plot
    plt.scatter(df['date'].tolist(), df['pol'], color=colours[i], alpha=0.2) # underlying scatter plot
    
    i+=1

ax.set_ybound(lower=-0.1, upper=0.4)
ax.set_xlabel('Date', fontsize=15)
ax.set_ylabel('Compound Sentiment', fontsize=15)

g.set_title('10 Day Rolling Mean Sentiment', fontsize=18)
[Plot: 10 Day Rolling Mean Sentiment]

Frequency of Chats

Now let's have a look at the frequency of WhatsApp chats. Strictly speaking this is time series analysis rather than NLP, but it is a natural companion to it. First, we need to create a colour palette ordered by the total number of messages for each day of the week; the sketch just below derives the day and float_time columns the plots rely on.
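A minimal sketch of deriving those columns, assuming a (hypothetical) timestamp column holding the raw 'dd/mm/yyyy, h:mm:ss am/pm' stamp of each message:

# derive the weekday name and fractional hour-of-day from an assumed 'timestamp' column
df['datetime'] = pd.to_datetime(df['timestamp'], format='%d/%m/%Y, %I:%M:%S %p')
df['day'] = df['datetime'].dt.day_name()
df['float_time'] = df['datetime'].dt.hour + df['datetime'].dt.minute / 60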


pal = sns.cubehelix_palette(7, rot=-.25, light=.7)

Next, build an ordered list of days according to total message count:

days_freq = list(df.day.value_counts().index)
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

This is essentially the current order of colours:

lst = list(zip(days, pal[::-1]))
lst
[('Monday', [0.12071162840208301, 0.14526386650440642, 0.2463679091477368]),
 ('Tuesday', [0.18152581198633005, 0.24364059111738742, 0.37281834227732574]),
 ('Wednesday', [0.2426591079772084, 0.3511228226876375, 0.4852103253459974]),
 ('Thursday', [0.30463866738797124, 0.45571986933681846, 0.5751187147066701]),
 ('Friday', [0.37810168111401876, 0.5633546614344814, 0.6530658354036274]),
 ('Saturday', [0.46091631066717925, 0.662287611911293, 0.7165315069314769]),
 ('Sunday', [0.5632111255041908, 0.758620966612444, 0.7764634182455044])]

Reorder colours according to their index position in the ‘days_freq’ list:

pal_reorder = []

for day_name in days:
    rank = days_freq.index(day_name)   # this day's rank by total message count
    pal_reorder.append(lst[rank][1])   # pick the colour assigned to that rank

pal_reorder   # colours ordered according to total message count for the day
[[0.30463866738797124, 0.45571986933681846, 0.5751187147066701],
 [0.18152581198633005, 0.24364059111738742, 0.37281834227732574],
 [0.12071162840208301, 0.14526386650440642, 0.2463679091477368],
 [0.2426591079772084, 0.3511228226876375, 0.4852103253459974],
 [0.37810168111401876, 0.5633546614344814, 0.6530658354036274],
 [0.5632111255041908, 0.758620966612444, 0.7764634182455044],
 [0.46091631066717925, 0.662287611911293, 0.7165315069314769]]
sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
pal = sns.cubehelix_palette(7, rot=-.25, light=.7)
g = sns.FacetGrid(df[(df.float_time > 8)], row="day", hue="day",   # change "day" to year_month if required
                  aspect=10, size=1.5, palette=pal_reorder, xlim=(7,24))

# Draw the densities in a few steps
g.map(sns.kdeplot, "float_time", clip_on=False, shade=True, alpha=1, lw=1.5, bw=.2)
g.map(sns.kdeplot, "float_time", clip_on=False, color="w", lw=3, bw=.2)
g.map(plt.axhline, y=0, lw=1, clip_on=False)

# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, 0.1, label, fontweight="bold", color=color, 
            ha="left", va="center", transform=ax.transAxes, size=18)

g.map(label, "float_time")
g.set_xlabels('Time of Day', fontsize=30)
g.set_xticklabels(fontsize=20)
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-0.5)
g.fig.suptitle('Message Density by Time and Day of the Week, Shaded by Total Message Count', fontsize=22)   
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
[Plot: Message Density by Time and Day of the Week, Shaded by Total Message Count]


I hope you liked this article on NLP for WhatsApp chats. Feel free to ask your valuable questions in the comments section below. Don't forget to subscribe to my daily newsletters to get email notifications if you like my work.

