WhatsApp Group Chat Analysis

I am part of a WhatsApp group named “Data Science Community”, and I recently decided to explore its chat history and analyze it. In this article, I will take you through a WhatsApp group chat analysis with Data Science.

If you don’t know how to extract the messages from a chat, open the chat, tap the three dots at the top, select More, then select Export chat, and share the exported file by any means, preferably to your own email.

The exported chat does not need a separate cleaning and preparation step; it can be used directly for this task. Let’s start the WhatsApp group chat analysis by importing the required packages:

import re
import regex
import pandas as pd
import numpy as np
import emoji
import plotly.express as px
from collections import Counter
import matplotlib.pyplot as plt
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
%matplotlib inline

WhatsApp Group Chat Analysis

Although the data is ready to use, we still need to parse the date and time of each message. For this, I will define a function that detects whether a line starts with a date and time, which marks the beginning of a new message:

def startsWithDateAndTime(s):
    # Raw string so the backslashes reach the regex engine unchanged
    pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'
    result = re.match(pattern, s)
    if result:
        return True
    return False
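
To see what this pattern accepts, here is a small self-contained check. The sample lines are made up for illustration and only assume the export format described above; the exact format can differ between phones and locales, so treat this as a sketch rather than a guarantee.

```python
import re

# Same pattern as in the article, written as a raw string.
DATE_TIME_PATTERN = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'

def starts_with_date_and_time(line):
    """Return True when the line opens a new message."""
    return re.match(DATE_TIME_PATTERN, line) is not None

# Hypothetical sample lines, not taken from the real chat.
samples = [
    "12/08/20, 9:15 PM - Alice: Hello everyone",   # new message, 12-hour clock
    "12/08/20, 21:15 - Alice: Hello everyone",     # new message, 24-hour clock
    "this is the second line of a long message",   # continuation line
]
results = [starts_with_date_and_time(s) for s in samples]
for sample, result in zip(samples, results):
    print(result, "|", sample)
```

Lines that return False are treated as continuations of the previous message.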

Now I will create a function that checks whether a message line contains an author name:

def FindAuthor(s):
  # A line of the form "Author: message" splits into exactly two parts
  s=s.split(":")
  return len(s)==2

Now, I will create a function that separates the date, time, author, and message of each line so that we can easily load them into a pandas dataframe:

def getDataPoint(line):   
    splitLine = line.split(' - ') 
    dateTime = splitLine[0]
    date, time = dateTime.split(', ') 
    message = ' '.join(splitLine[1:])
    if FindAuthor(message): 
        splitMessage = message.split(': ') 
        author = splitMessage[0] 
        message = ' '.join(splitMessage[1:])
    else:
        author = None
    return date, time, author, message
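
One caveat: FindAuthor only accepts lines with exactly one colon, so a message that itself contains a colon (a URL or a time, for example) ends up with no author. A possible alternative, sketched below, splits only on the first separator. The name parse_line and the sample lines are hypothetical and not part of the original code.

```python
def parse_line(line):
    """Split one export line into (date, time, author, message)."""
    # Split on the first ' - ' only, so dashes inside the message survive.
    date_time, _, rest = line.partition(' - ')
    date, _, time = date_time.partition(', ')
    # Split on the first ': ' only, so colons inside the message survive.
    author, separator, message = rest.partition(': ')
    if not separator:
        # No "Name: " prefix, e.g. a system notice such as "X joined".
        return date, time, None, rest
    return date, time, author, message

# Hypothetical sample lines for illustration.
with_url = parse_line("12/08/20, 9:15 PM - Alice: see https://example.com - it is good")
notice = parse_line("12/08/20, 9:16 PM - Bob joined using an invite link")
print(with_url)
print(notice)
```

This is still a heuristic: a system notice that happens to contain ': ' would be misread as an authored message.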

The code below loads the data. I am using Google Colab here; if you are working in an IDE or a local Jupyter notebook, skip the upload step and write the complete path of your exported chat file instead:

from google.colab import files
uploaded = files.upload() # Colab only; skip this step when running locally
parsedData = [] # List to keep track of data so it can be used by a Pandas dataframe
conversation = 'WhatsApp Chat (1).txt'
with open(conversation, encoding="utf-8") as fp:
    fp.readline() # Skip the first line of the file, which is the end-to-end encryption notice
    messageBuffer = []
    date, time, author = None, None, None
    while True:
        line = fp.readline()
        if not line:
            break
        line = line.strip()
        if startsWithDateAndTime(line):
            # A new message begins: store the previous one before starting over
            if len(messageBuffer) > 0:
                parsedData.append([date, time, author, ' '.join(messageBuffer)])
            messageBuffer.clear()
            date, time, author, message = getDataPoint(line)
            messageBuffer.append(message)
        else:
            # Continuation line of a multi-line message
            messageBuffer.append(line)
    # Store the last message, which no later line would otherwise flush
    if len(messageBuffer) > 0:
        parsedData.append([date, time, author, ' '.join(messageBuffer)])
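
The buffering idea is easier to follow on a few in-memory lines. The sketch below is a simplified stand-in, not the loader itself: group_messages is a hypothetical helper, the regex is a shortened form of the date pattern, and the sample lines are invented.

```python
import re

# Shortened form of the date/time prefix that opens every new message.
NEW_MESSAGE = re.compile(r'^\d+/\d+/\d+, \d+:\d+ ?(?:AM|PM|am|pm)? -')

def group_messages(lines):
    """Merge continuation lines into the message they belong to."""
    messages, buffer = [], []
    for raw in lines:
        line = raw.strip()
        if NEW_MESSAGE.match(line):
            if buffer:
                messages.append(' '.join(buffer))  # previous message is complete
            buffer = [line]
        elif buffer:
            buffer.append(line)  # continuation of the current message
    if buffer:
        messages.append(' '.join(buffer))  # the final message needs its own flush
    return messages

sample = [
    "12/08/20, 9:15 PM - Alice: first line",
    "second line of the same message",
    "12/08/20, 9:16 PM - Bob: ok",
]
grouped = group_messages(sample)
print(grouped)
```

The flush after the loop is the step that is easiest to miss, because no later line arrives to trigger it.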

Now, let’s put the data into a dataframe and have a look at the data:

df = pd.DataFrame(parsedData, columns=['Date', 'Time', 'Author', 'Message']) # Initialising a pandas Dataframe.
df["Date"] = pd.to_datetime(df["Date"])
df.tail(20)
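
A word of caution on the date conversion: a string such as 12/08/20 is ambiguous, and the right reading depends on the locale of the phone that exported the chat. The sketch below uses only the standard library and a made-up date value to show the two readings.

```python
from datetime import datetime

raw_date = "12/08/20"  # hypothetical value from the Date column

# The same string read as day/month/year and as month/day/year.
day_first = datetime.strptime(raw_date, "%d/%m/%y")
month_first = datetime.strptime(raw_date, "%m/%d/%y")

print("day first:  ", day_first.date())
print("month first:", month_first.date())
```

In pandas, passing an explicit format (or dayfirst=True) to pd.to_datetime removes the guesswork once you know which convention your export uses.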

The above dataframe looks good. Now let’s start with our WhatsApp group chat analysis.


To Get All The Authors:

The authors are the participants of the WhatsApp group. Let’s see how we can extract the names of all the authors:

df.Author.unique()

array([None, ‘Aman Kharwal’, ‘Sahil Pansare’, ‘+91 97386 30266’, ‘+91 97217 95958’, ‘+91 83696 21916’, ‘+91 88064 51751’, ‘+91 96627 78558’, ‘+91 90252 51204’, ‘+91 70665 40498’, ‘+91 84471 85093’, ‘+91 79065 56743’, ‘+60 11-5689 2040’, ‘+91 99150 15281’, ‘+91 93983 18393’, ‘+91 95612 77706’, ‘+91 98224 35433’, ‘+91 98673 74287’, ‘+91 74474 80190’, ‘+91 87288 48041’, ‘+91 86106 90461’, ‘+91 76200 14058’, ‘+91 98507 34912’, ‘+91 77868 68987’, ‘+91 77387 12804’, ‘+91 98119 14741’, ‘+91 99724 91453’, ‘+91 70382 50701’, ‘+91 83448 26314’, ‘+91 95000 28536’, ‘+91 93703 49063’, ‘+91 93808 22645’, ‘+91 99165 66683’, ‘+91 70424 73460’, ‘Sumehar’, ‘+91 86002 94761’], dtype=object)

So, only 3 of the participants are saved in my contacts with names; the rest appear as phone numbers. I will use the data as it is and explore the 3 authors with names.

WhatsApp Group Chat Analysis: Group Wise Stats

Now let’s analyze the chat by looking at some statistics. I will first count the media messages, then create a function that extracts the emojis from each message, and finally count the links:

total_messages = df.shape[0]
media_messages = df[df['Message'] == '<Media omitted>'].shape[0]
print(media_messages)
def split_count(text):

    emoji_list = []
    # \X matches one extended grapheme cluster, so multi-codepoint emojis stay whole
    data = regex.findall(r'\X', text)
    for word in data:
        # emoji >= 2.0 exposes EMOJI_DATA; older releases used emoji.UNICODE_EMOJI
        if any(char in emoji.EMOJI_DATA for char in word):
            emoji_list.append(word)

    return emoji_list

df["emoji"] = df["Message"].apply(split_count)
emojis = sum(df['emoji'].str.len())
print(emojis)
URLPATTERN = r'(https?://\S+)'
df['urlcount'] = df.Message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
links = np.sum(df.urlcount)
print("Data Science Community")
print("Messages:",total_messages)
print("Media:",media_messages)
print("Emojis:",emojis)
print("Links:",links)

Data Science Community
Messages: 2201
Media: 470
Emojis: 613
Links: 437
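
The media and link counts rest on two simple rules: a media message is exactly the placeholder text, and a link is anything matching the URL pattern. The standalone sketch below applies the same rules to a few invented messages, without pandas, so the logic is easy to check by hand.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')

# Hypothetical messages, not taken from the real chat.
messages = [
    "Check https://example.com and http://example.org",
    "<Media omitted>",
    "no links here",
    "<Media omitted>",
]

# One count per message, mirroring the urlcount column.
link_counts = [len(URL_PATTERN.findall(m)) for m in messages]
# A media message is exactly the export placeholder.
media_count = sum(m == "<Media omitted>" for m in messages)

print("Links per message:", link_counts)
print("Total links:", sum(link_counts))
print("Media messages:", media_count)
```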

Now we will look at the author-wise stats from the WhatsApp group chat:

media_messages_df = df[df['Message'] == '<Media omitted>']
messages_df = df.drop(media_messages_df.index)
messages_df.info()
messages_df['Letter_Count'] = messages_df['Message'].apply(lambda s : len(s))
messages_df['Word_Count'] = messages_df['Message'].apply(lambda s : len(s.split(' ')))
messages_df["MessageCount"]=1

l = ["Aman Kharwal", "Sahil Pansare", "Sumehar"]
for author in l:
  # Keep only the messages of this particular user
  req_df = messages_df[messages_df["Author"] == author]
  print(f'Stats of {author} -')
  # The number of rows equals the number of messages sent
  print('Messages Sent', req_df.shape[0])
  # Word_Count holds the words in each message; total words / total messages gives words per message
  words_per_message = (np.sum(req_df['Word_Count']))/req_df.shape[0]
  print('Words per message', words_per_message)
  # media is the number of media messages sent by this user
  media = media_messages_df[media_messages_df['Author'] == author].shape[0]
  print('Media Messages Sent', media)
  # emojis is the total number of emojis sent
  emojis = sum(req_df['emoji'].str.len())
  print('Emojis Sent', emojis)
  # links is the total number of links sent
  links = sum(req_df["urlcount"])
  print('Links Sent', links)
  print()

Stats of Aman Kharwal –
Messages Sent 431
Words per message 5.907192575406032
Media Messages Sent 17
Emojis Sent 83
Links Sent 245

Stats of Sahil Pansare –
Messages Sent 306
Words per message 20.81045751633987
Media Messages Sent 12
Emojis Sent 195
Links Sent 52

Stats of Sumehar –
Messages Sent 52
Words per message 4.826923076923077
Media Messages Sent 0
Emojis Sent 8
Links Sent 0
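
The same per-author numbers can also be computed in one pass with a pandas groupby, which avoids listing the authors by hand. This is a sketch on a toy dataframe with invented authors and messages; the column names follow the ones used above.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Author": ["Alice", "Bob", "Alice", "Bob", "Alice"],
    "Message": [
        "hi all",
        "see https://example.com",
        "ok",
        "thanks a lot",
        "good morning everyone",
    ],
})
toy["Word_Count"] = toy["Message"].str.split().str.len()
toy["urlcount"] = toy["Message"].str.count(r"https?://\S+")

# One row per author: message count, mean words per message, total links.
stats = toy.groupby("Author").agg(
    messages=("Message", "size"),
    words_per_message=("Word_Count", "mean"),
    links=("urlcount", "sum"),
)
print(stats)
```

Because the grouping covers every author, this also reports participants who appear only as phone numbers.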

Now let’s have a look at the most used emojis in the group:

total_emojis_list = list([a for b in messages_df.emoji for a in b])
emoji_dict = dict(Counter(total_emojis_list))
emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)
for i in emoji_dict:
  print(i)

(‘👍🏻’, 118) (‘😊’, 81) (‘💯’, 60) (‘🤝’, 39) (‘👌🏿’, 28) (‘👍🏽’, 28) (‘👌’, 24) (‘👏🏾’, 20) (‘😂’, 16) (‘‼️’, 16) (‘👉’, 16) (‘⭐’, 12) (‘👍🏼’, 12) (‘👍’, 12) (‘✌🏻’, 8) (‘✌️’, 8) (‘😅’, 8) (‘🌹’, 8) (‘\U0001f973’, 8) (‘😇’, 8) (‘😉’, 6) (‘🙌🏻’, 6) (‘\U0001f929’, 5) (‘😄’, 4) (‘😶’, 4) (‘☕’, 4) (‘\U0001f9e1’, 4) (‘\U0001f91f🏻’, 4) (‘🙂’, 4) (‘👏’, 4) (‘🔔’, 4) (‘✨’, 4) (‘😬’, 4) (‘👌🏾’, 4) (‘😍’, 4) (‘❤’, 4) (‘🙏🏻’, 3) (‘🤣’, 3) (‘🔥’, 2) (‘👻’, 2) (‘😁’, 1) (‘👏🏻’, 1) (‘🙌’, 1) (‘🤔’, 1)
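
If you only need the top few emojis, Counter can return them directly with most_common, which replaces the manual sort. A minimal sketch on invented emoji lists, shaped like the values in the emoji column:

```python
from collections import Counter

# Hypothetical per-message emoji lists.
emoji_lists = [["👍", "😊"], ["😊"], ["👍"], ["💯", "👍"]]

# Flatten the lists and count every emoji.
counts = Counter(e for row in emoji_lists for e in row)
top_two = counts.most_common(2)
print(top_two)
```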

WhatsApp Group Chat Analysis: Word Cloud

Now, I will create a word cloud for our WhatsApp group chat analysis to see what the group talks about. A word cloud is a visualization of text in which the most frequently used words are drawn larger than the rest:

text = " ".join(review for review in messages_df.Message)
# len(text) would count characters, so split on whitespace to count words
print ("There are {} words in all the messages.".format(len(text.split())))
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image:
# the matplotlib way:
plt.figure( figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
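
The size of each word in the cloud comes from how often it occurs once common stopwords are dropped. The sketch below reproduces that counting idea with the standard library. The tokenizer, the tiny stopword set, and the sample messages are simplified assumptions and do not mirror how the wordcloud package works internally.

```python
import re
from collections import Counter

# A deliberately tiny stopword set, for illustration only.
STOP = {"the", "is", "a", "to", "and", "of", "in", "for"}

def word_frequencies(messages):
    """Count lowercase words across all messages, skipping stopwords."""
    words = re.findall(r"[a-z']+", " ".join(messages).lower())
    return Counter(w for w in words if w not in STOP)

# Hypothetical messages.
sample_messages = [
    "Python is great for data science",
    "Learning Python and data analysis",
    "The data is in the file",
]
freqs = word_frequencies(sample_messages)
print(freqs.most_common(2))
```

Inspecting these counts is a quick way to sanity-check which words should dominate the image.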

The above word cloud is based on the chats of the whole group. Now I will look at the author-wise word clouds:

l = ["Aman Kharwal", "Sahil Pansare", "Sumehar"]
for i in range(len(l)):
  dummy_df = messages_df[messages_df['Author'] == l[i]]
  text = " ".join(review for review in dummy_df.Message)
  stopwords = set(STOPWORDS)
  #Generate a word cloud image
  print('Author name',l[i])
  wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
  #Display the generated image   
  plt.figure( figsize=(10,5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

Author name Aman Kharwal


Author name Sahil Pansare


Author name Sumehar


The above WhatsApp group chat analysis suggests that the group is not a friendship group; its conversations revolve around Machine Learning and programming for beginners.

I hope you liked this article on WhatsApp Group Chat Analysis. Feel free to ask your valuable questions in the comments section below. You can follow me on Medium to learn more about Machine Learning.


Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.


10 Comments

  1. def startsWithDateAndTime(s):
    pattern = ‘^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -‘
    result = re.match(pattern, s)
    if result:
    return True
    return False

    found this error with the above codes:

    in startsWithDateAndTime(s)
    1 def startsWithDateAndTime(s):
    2 pattern = ‘^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -‘
    —-> 3 result = re.match(pattern, s)
    4 if result:
    5 return True

    NameError: name ‘re’ is not defined
