WhatsApp Group Chat Analysis

I am a member of a WhatsApp group named “Data Science Community”, and I recently decided to explore the chat of this group and analyse it. In this article, I will take you through a WhatsApp group chat analysis with Data Science.

If you don’t know how to extract the messages from a chat: open the chat, tap the three dots at the top, select More, then Export chat, and share the exported file by any means, preferably your email.

The exported chat needs very little cleaning and preparation before it can be used for the task. Now let’s start with this WhatsApp group chat analysis. I will simply import the required packages and get started:

import re
import regex
import pandas as pd
import numpy as np
import emoji
import plotly.express as px
from collections import Counter
import matplotlib.pyplot as plt
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
%matplotlib inline


Although the data is almost ready to use, we still need to change the format of the date and time of the messages, which can be done easily. For this, I will define a function that detects whether a line starts with a date, which indicates the start of a new message:

def startsWithDateAndTime(s):
    pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'
    result = re.match(pattern, s)
    if result:
        return True
    return False
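To sanity-check the pattern, here is a quick test on two lines shaped like a WhatsApp export (the sample lines are made up):

```python
import re

pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'

# A line that opens a new message matches; a continuation line does not
print(bool(re.match(pattern, '25/08/2020, 10:30 AM - Aman: Hello')))     # True
print(bool(re.match(pattern, 'a continuation line of a long message')))  # False
```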

Now I will create a function to extract the usernames in the chat as authors:

def FindAuthor(s):
  s = s.split(':')
  if len(s) == 2:
    return True
  return False
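The check is meant to distinguish real messages, which look like “Author: text”, from system notifications (“You added X”, “X joined using this group’s invite link”), which have no colon-separated author. A quick illustration with made-up strings:

```python
def find_author(s):
    # A message line has exactly one ':' separating author from text
    return len(s.split(':')) == 2

print(find_author('Aman Kharwal: Hello everyone'))  # True
print(find_author('You added Sahil Pansare'))       # False
```

Note that this simple check misclassifies messages whose text itself contains a colon; for this dataset it is good enough.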

Now, I will create a function to separate all the information from each other so that we can easily load it into a pandas dataframe:

def getDataPoint(line):
    splitLine = line.split(' - ')
    dateTime = splitLine[0]
    date, time = dateTime.split(', ')
    message = ' '.join(splitLine[1:])
    if FindAuthor(message):
        splitMessage = message.split(': ')
        author = splitMessage[0]
        message = ' '.join(splitMessage[1:])
    else:
        author = None
    return date, time, author, message
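On a single made-up export line, the helper pulls the four fields apart like this (a self-contained sketch that inlines the author check):

```python
def get_data_point(line):
    split_line = line.split(' - ')
    date, time = split_line[0].split(', ')
    message = ' '.join(split_line[1:])
    if len(message.split(':')) == 2:
        # Real message: split off the author before the first ': '
        author, message = message.split(': ', 1)
    else:
        # System notification: no author
        author = None
    return date, time, author, message

print(get_data_point('25/08/2020, 10:31 AM - Aman Kharwal: Hello everyone'))
# ('25/08/2020', '10:31 AM', 'Aman Kharwal', 'Hello everyone')
```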

The code below will read and parse the data. Whether you are using an IDE, a Jupyter notebook, or Google Colab, you can use it; just make sure you write the complete path to your dataset if you are not using Colab (the `files.upload()` lines are Colab-only):

from google.colab import files
uploaded = files.upload()

parsedData = [] # List to keep track of data so it can be used by a pandas dataframe
conversation = 'WhatsApp Chat (1).txt'
with open(conversation, encoding="utf-8") as fp:
    fp.readline() # Skip the first line of the file because it contains information about end-to-end encryption
    messageBuffer = []
    date, time, author = None, None, None
    while True:
        line = fp.readline()
        if not line: # Stop at the end of the file
            break
        line = line.strip()
        if startsWithDateAndTime(line): # A new message starts here
            if len(messageBuffer) > 0:
                parsedData.append([date, time, author, ' '.join(messageBuffer)])
            messageBuffer.clear()
            date, time, author, message = getDataPoint(line)
            messageBuffer.append(message)
        else: # Continuation of a multi-line message
            messageBuffer.append(line)
    if len(messageBuffer) > 0: # Don't lose the last message
        parsedData.append([date, time, author, ' '.join(messageBuffer)])

Now, let’s put the data into a dataframe and have a look at the data:

df = pd.DataFrame(parsedData, columns=['Date', 'Time', 'Author', 'Message']) # Initialising a pandas Dataframe.
df["Date"] = pd.to_datetime(df["Date"])
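One thing to watch: `pd.to_datetime` has to guess whether the export uses day-first or month-first dates. If your export is dd/mm/yy (as mine appears to be), passing `dayfirst=True` removes the ambiguity. A small sketch on made-up dates:

```python
import pandas as pd

s = pd.Series(['25/08/20', '05/08/20'])
parsed = pd.to_datetime(s, dayfirst=True)  # treat the first field as the day
print(parsed.dt.day.tolist())    # [25, 5]
print(parsed.dt.month.tolist())  # [8, 8]
```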

The above dataframe looks good. Now let’s start with our WhatsApp group chat analysis.

Also, Read – Named Entity Recognition (NER)

To Get All The Authors:

The authors represent all the participants of the WhatsApp group. Now let’s extract the names of all the authors:
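The array below was presumably produced with `df.Author.unique()`; on dummy data (the names and messages here are placeholders), the call looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Author': [None, 'Aman Kharwal', 'Sahil Pansare', 'Aman Kharwal'],
    'Message': ['group created', 'Hello', 'Hi', 'Welcome'],
})
# unique() keeps first-appearance order; system messages show up as None
print(df.Author.unique())  # [None 'Aman Kharwal' 'Sahil Pansare']
```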


array([None, ‘Aman Kharwal’, ‘Sahil Pansare’, ‘+91 97386 30266’, ‘+91 97217 95958’, ‘+91 83696 21916’, ‘+91 88064 51751’, ‘+91 96627 78558’, ‘+91 90252 51204’, ‘+91 70665 40498’, ‘+91 84471 85093’, ‘+91 79065 56743’, ‘+60 11-5689 2040’, ‘+91 99150 15281’, ‘+91 93983 18393’, ‘+91 95612 77706’, ‘+91 98224 35433’, ‘+91 98673 74287’, ‘+91 74474 80190’, ‘+91 87288 48041’, ‘+91 86106 90461’, ‘+91 76200 14058’, ‘+91 98507 34912’, ‘+91 77868 68987’, ‘+91 77387 12804’, ‘+91 98119 14741’, ‘+91 99724 91453’, ‘+91 70382 50701’, ‘+91 83448 26314’, ‘+91 95000 28536’, ‘+91 93703 49063’, ‘+91 93808 22645’, ‘+91 99165 66683’, ‘+91 70424 73460’, ‘Sumehar’, ‘+91 86002 94761’], dtype=object)

So, only 3 of the participants are saved in my contacts with names; the rest appear as phone numbers. I will use the data as it is and focus the author-wise exploration on these 3 named participants.

WhatsApp Group Chat Analysis: Group-Wise Stats

Now let’s do some analysis by looking at the statistics. I will first create a function that separates the emojis from the text of each message, and count media messages and links separately:

media_messages = df[df['Message'] == '<Media omitted>'].shape[0]

def split_count(text):
    emoji_list = []
    data = regex.findall(r'\X', text) # Split the text into grapheme clusters
    for word in data:
        if any(char in emoji.EMOJI_DATA for char in word): # emoji.UNICODE_EMOJI in emoji versions < 1.0
            emoji_list.append(word)
    return emoji_list

df["emoji"] = df["Message"].apply(split_count)
emojis = sum(df['emoji'].str.len())
URLPATTERN = r'(https?://\S+)'
df['urlcount'] = df.Message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
links = np.sum(df.urlcount)
print("Data Science Community")
print("Messages:", df.shape[0])
print("Media:", media_messages)
print("Emojis:", emojis)
print("Links:", links)

Data Science Community
Messages: 2201
Media: 470
Emojis: 613
Links: 437
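The link count above relies on a simple regex; to see what `URLPATTERN` actually matches (the message text here is made up):

```python
import re

URLPATTERN = r'(https?://\S+)'
msg = 'Check https://thecleverprogrammer.com and also http://example.com/page today'
# Grabs every http/https run up to the next whitespace
print(re.findall(URLPATTERN, msg))
# ['https://thecleverprogrammer.com', 'http://example.com/page']
```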

Now let’s look at the author-wise stats from the WhatsApp group chat:

media_messages_df = df[df['Message'] == '<Media omitted>']
messages_df = df.drop(media_messages_df.index)
messages_df['Letter_Count'] = messages_df['Message'].apply(lambda s : len(s))
messages_df['Word_Count'] = messages_df['Message'].apply(lambda s : len(s.split(' ')))

l = ["Aman Kharwal", "Sahil Pansare", "Sumehar"]
for i in range(len(l)):
  # Filter messages of a particular user
  req_df = messages_df[messages_df["Author"] == l[i]]
  # req_df contains messages of only one particular user
  print(f'Stats of {l[i]} -')
  # shape gives the number of rows, i.e. the number of messages
  print('Messages Sent', req_df.shape[0])
  # Word_Count is the number of words in one message; total words / total messages gives words per message
  words_per_message = (np.sum(req_df['Word_Count']))/req_df.shape[0]
  print('Words per message', words_per_message)
  # media counts the media messages
  media = media_messages_df[media_messages_df['Author'] == l[i]].shape[0]
  print('Media Messages Sent', media)
  # emojis counts the total emojis
  emojis = sum(req_df['emoji'].str.len())
  print('Emojis Sent', emojis)
  # links counts the total links
  links = sum(req_df["urlcount"])
  print('Links Sent', links)

Stats of Aman Kharwal –
Messages Sent 431
Words per message 5.907192575406032
Media Messages Sent 17
Emojis Sent 83
Links Sent 245

Stats of Sahil Pansare –
Messages Sent 306
Words per message 20.81045751633987
Media Messages Sent 12
Emojis Sent 195
Links Sent 52

Stats of Sumehar –
Messages Sent 52
Words per message 4.826923076923077
Media Messages Sent 0
Emojis Sent 8
Links Sent 0
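The words-per-message figure is just the total word count divided by the number of messages. A dummy-data sketch of the same computation:

```python
import numpy as np
import pandas as pd

# Placeholder word counts for three messages from one author
req_df = pd.DataFrame({'Word_Count': [4, 8, 6]})
words_per_message = np.sum(req_df['Word_Count']) / req_df.shape[0]
print(words_per_message)  # 6.0
```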

Now let’s have a look at the most emojis used in the group:

total_emojis_list = list([a for b in messages_df.emoji for a in b])
emoji_dict = dict(Counter(total_emojis_list))
emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)
for i in emoji_dict:
    print(i)

('👍🏻', 118) ('😊', 81) ('💯', 60) ('🤝', 39) ('👌🏿', 28) ('👍🏽', 28) ('👌', 24) ('👏🏾', 20) ('😂', 16) ('‼️', 16) ('👉', 16) ('⭐', 12) ('👍🏼', 12) ('👍', 12) ('✌🏻', 8) ('✌️', 8) ('😅', 8) ('🌹', 8) ('🥳', 8) ('😇', 8) ('😉', 6) ('🙌🏻', 6) ('🤩', 5) ('😄', 4) ('😶', 4) ('☕', 4) ('🧡', 4) ('🤟🏻', 4) ('🙂', 4) ('👏', 4) ('🔔', 4) ('✨', 4) ('😬', 4) ('👌🏾', 4) ('😍', 4) ('❤', 4) ('🙏🏻', 3) ('🤣', 3) ('🔥', 2) ('👻', 2) ('😁', 1) ('👏🏻', 1) ('🙌', 1) ('🤔', 1)
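The ranking above comes from counting with `Counter` and sorting by frequency; the same idiom on a handful of dummy emojis:

```python
from collections import Counter

total_emojis_list = ['👍', '😊', '👍', '💯', '👍', '😊']
# Count occurrences, then sort (emoji, count) pairs by count, descending
emoji_dict = sorted(Counter(total_emojis_list).items(), key=lambda x: x[1], reverse=True)
print(emoji_dict)  # [('👍', 3), ('😊', 2), ('💯', 1)]
```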

WhatsApp Group Chat Analysis: Word Cloud

Now, I will create a word cloud for our WhatsApp group chat analysis to see what the group talks about. A word cloud is a graph of words in which the most used words are drawn bigger than the rest:

text = " ".join(review for review in messages_df.Message)
print("There are {} characters in all the messages.".format(len(text)))
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image the matplotlib way
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
[Word cloud of the whole group’s chat]

The above Word Cloud is based on the chats of the whole group. Now I will look at the Author wise WordCloud:

l = ["Aman Kharwal", "Sahil Pansare", "Sumehar"]
for i in range(len(l)):
  dummy_df = messages_df[messages_df['Author'] == l[i]]
  text = " ".join(review for review in dummy_df.Message)
  stopwords = set(STOPWORDS)
  # Generate a word cloud image
  print('Author name', l[i])
  wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
  # Display the generated image
  plt.figure(figsize=(10, 5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

Author name Aman Kharwal

[Word cloud of Aman Kharwal’s messages]

Author name Sahil Pansare

[Word cloud of Sahil Pansare’s messages]

Author name Sumehar

[Word cloud of Sumehar’s messages]

The above WhatsApp group chat analysis clearly shows that this is not a casual friendship group; it is focused on beginners in Machine Learning and programming.

I hope you liked this article on WhatsApp Group Chat Analysis. Feel free to ask your valuable questions in the comments section below. You can follow me on Medium to learn every topic of Machine Learning.

Also, Read – Machine Learning Algorithms That Are Mostly Used.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.

