Data Manipulation Interview Questions

Data Manipulation means altering, transforming, or restructuring data to prepare it for analysis, reporting, or other data-related tasks. It is a fundamental step in data management and analysis, as raw data often needs to be cleaned, organized, and refined before it yields meaningful insights. In a technical interview for Data Science, many of the questions are based on Data Manipulation. So, if you want to know what kind of Data Manipulation interview questions you can get, this article is for you. I’ll take you through a list of Data Manipulation interview questions solved and explained using Python.

Getting Started with Data Manipulation Interview Questions

In a Data Science technical interview, you will often be given a dataset, and all the Data Manipulation interview questions will be based on that dataset. I’ll follow the same approach here: every question below is asked and solved against a single dataset.

The dataset I’ll be using to create Data Manipulation interview questions is based on user behaviour on an app. You can download the dataset from here. Below is the column information about the dataset that I’ll be using here:

  1. userid: The identity number of the user;
  2. Average Screen Time: The average screen time of the user on the application;
  3. Average Spent on App (INR): The average amount spent by the user on the application;
  4. Left Review: Whether the user left a review about their experience on the application (1 if true, otherwise 0);
  5. Ratings: Ratings given by the user to the application;
  6. New Password Request: The number of times the user requested a new password;
  7. Last Visited Minutes: Minutes passed since the user was last active;
  8. Status: Installed if the application is installed and Uninstalled if the user has deleted the application.

Data Manipulation Interview Questions

Now, let’s go through the Data Manipulation interview questions based on this dataset one by one.

How would you load and explore this dataset to gain an initial understanding of its structure and contents?

To load and explore the dataset using Python and pandas, follow the steps below.

Step 1: Import the pandas library and load the dataset:

import pandas as pd
# Load the dataset into a DataFrame
df = pd.read_csv('userbehaviour.csv')

Step 2: Explore the dataset to gain an initial understanding of its structure and contents:

# Display the first 5 rows of the dataset
print(df.head())
   userid  Average Screen Time  Average Spent on App (INR)  Left Review  \
0    1001                 17.0                       634.0            1   
1    1002                  0.0                        54.0            0   
2    1003                 37.0                       207.0            0   
3    1004                 32.0                       445.0            1   
4    1005                 45.0                       427.0            1   

   Ratings  New Password Request  Last Visited Minutes       Status  
0        9                     7                  2990    Installed  
1        4                     8                 24008  Uninstalled  
2        8                     5                   971    Installed  
3        6                     2                   799    Installed  
4        5                     6                  3668    Installed  
# Get the dimensions of the dataset
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
Number of rows: 999
Number of columns: 8
# Check the data types of each column
print(df.dtypes)
userid                          int64
Average Screen Time           float64
Average Spent on App (INR)    float64
Left Review                     int64
Ratings                         int64
New Password Request            int64
Last Visited Minutes            int64
Status                         object
dtype: object
# Check for missing values
print(df.isnull().sum())
userid                        0
Average Screen Time           0
Average Spent on App (INR)    0
Left Review                   0
Ratings                       0
New Password Request          0
Last Visited Minutes          0
Status                        0
dtype: int64
# Summary statistics for numerical columns
print(df.describe())
            userid  Average Screen Time  Average Spent on App (INR)  \
count   999.000000           999.000000                  999.000000   
mean   1500.000000            24.390390                  424.415415   
std     288.530761            14.235415                  312.365695   
min    1001.000000             0.000000                    0.000000   
25%    1250.500000            12.000000                   96.000000   
50%    1500.000000            24.000000                  394.000000   
75%    1749.500000            36.000000                  717.500000   
max    1999.000000            50.000000                  998.000000   

       Left Review     Ratings  New Password Request  Last Visited Minutes  
count   999.000000  999.000000            999.000000            999.000000  
mean      0.497497    6.513514              4.941942           5110.898899  
std       0.500244    2.701511              2.784626           8592.036516  
min       0.000000    0.000000              1.000000            201.000000  
25%       0.000000    5.000000              3.000000           1495.500000  
50%       0.000000    7.000000              5.000000           2865.000000  
75%       1.000000    9.000000              7.000000           4198.000000  
max       1.000000   10.000000             15.000000          49715.000000  
# Unique values in the 'Status' column
print(df['Status'].unique())
['Installed' 'Uninstalled']
# Number of unique users
unique_users = df['userid'].nunique()
print(f"Number of unique users: {unique_users}")
Number of unique users: 999
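
A few more pandas calls are often handy at this exploration stage. Below is a minimal sketch, assuming a small stand-in DataFrame (called sample here, my own name) built from the first five rows shown above instead of the full CSV file. It uses info() for a compact overview, value_counts() for the Status categories, and duplicated() to check for repeated user ids:

```python
import pandas as pd

# Stand-in sample: the first five rows shown in the output above
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "userid": [1001, 1002, 1003, 1004, 1005],
    "Average Screen Time": [17.0, 0.0, 37.0, 32.0, 45.0],
    "Average Spent on App (INR)": [634.0, 54.0, 207.0, 445.0, 427.0],
    "Left Review": [1, 0, 0, 1, 1],
    "Ratings": [9, 4, 8, 6, 5],
    "New Password Request": [7, 8, 5, 2, 6],
    "Last Visited Minutes": [2990, 24008, 971, 799, 3668],
    "Status": ["Installed", "Uninstalled", "Installed", "Installed", "Installed"],
})

# info() prints column names, non-null counts, and dtypes in one summary
sample.info()

# value_counts() shows how many rows fall into each Status category
status_counts = sample["Status"].value_counts()
print(status_counts)

# duplicated() flags user ids that appear more than once
duplicate_ids = int(sample["userid"].duplicated().sum())
print(f"Duplicate user ids: {duplicate_ids}")
```

On the full dataset, the same calls work on df directly.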

Calculate the mean, median, and standard deviation of the ‘Average Screen Time’ and ‘Average Spent on App (INR)’ columns.

Here’s how to calculate the mean, median, and standard deviation of the ‘Average Screen Time’ and ‘Average Spent on App (INR)’ columns:

# Calculate the mean of 'Average Screen Time' column
average_screen_time_mean = df['Average Screen Time'].mean()

# Calculate the median of 'Average Screen Time' column
average_screen_time_median = df['Average Screen Time'].median()

# Calculate the standard deviation of 'Average Screen Time' column
average_screen_time_std = df['Average Screen Time'].std()

# Calculate the mean of 'Average Spent on App (INR)' column
average_spent_on_app_mean = df['Average Spent on App (INR)'].mean()

# Calculate the median of 'Average Spent on App (INR)' column
average_spent_on_app_median = df['Average Spent on App (INR)'].median()

# Calculate the standard deviation of 'Average Spent on App (INR)' column
average_spent_on_app_std = df['Average Spent on App (INR)'].std()

# Print the results
print(f"Mean of Average Screen Time: {average_screen_time_mean:.2f}")
print(f"Median of Average Screen Time: {average_screen_time_median:.2f}")
print(f"Standard Deviation of Average Screen Time: {average_screen_time_std:.2f}")
print(f"Mean of Average Spent on App (INR): {average_spent_on_app_mean:.2f}")
print(f"Median of Average Spent on App (INR): {average_spent_on_app_median:.2f}")
print(f"Standard Deviation of Average Spent on App (INR): {average_spent_on_app_std:.2f}")
Mean of Average Screen Time: 24.39
Median of Average Screen Time: 24.00
Standard Deviation of Average Screen Time: 14.24
Mean of Average Spent on App (INR): 424.42
Median of Average Spent on App (INR): 394.00
Standard Deviation of Average Spent on App (INR): 312.37
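
The same three statistics can also be computed in a single call with agg(). Here is a minimal sketch, assuming a small stand-in DataFrame (sample, my own name) built from the first five rows of the dataset, so the numbers it prints will differ from the full-dataset results above:

```python
import pandas as pd

# Stand-in sample: first five rows of the two columns of interest
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "Average Screen Time": [17.0, 0.0, 37.0, 32.0, 45.0],
    "Average Spent on App (INR)": [634.0, 54.0, 207.0, 445.0, 427.0],
})

columns = ["Average Screen Time", "Average Spent on App (INR)"]

# agg() applies each named statistic to every selected column;
# rows of the result are the statistics, columns are the features
stats = sample[columns].agg(["mean", "median", "std"])
print(stats.round(2))
```

This keeps the code short when the interviewer asks for several statistics over several columns at once.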

Create a new feature that represents the ratio of “Average Spent on App (INR)” to “Average Screen Time” for each user. How might this feature be useful?

Here’s how to create a new feature that represents the ratio of Average Spent on App (INR) to Average Screen Time for each user:

# Create a new feature 'Spending-to-Screen-Time Ratio'
df['Spending-to-Screen-Time Ratio'] = df['Average Spent on App (INR)'] / df['Average Screen Time']

# Print the updated DataFrame to see the new feature
print(df.head())
   userid  Average Screen Time  Average Spent on App (INR)  Left Review  \
0    1001                 17.0                       634.0            1   
1    1002                  0.0                        54.0            0   
2    1003                 37.0                       207.0            0   
3    1004                 32.0                       445.0            1   
4    1005                 45.0                       427.0            1   

   Ratings  New Password Request  Last Visited Minutes       Status  \
0        9                     7                  2990    Installed   
1        4                     8                 24008  Uninstalled   
2        8                     5                   971    Installed   
3        6                     2                   799    Installed   
4        5                     6                  3668    Installed   

   Spending-to-Screen-Time Ratio  
0                      37.294118  
1                            inf  
2                       5.594595  
3                      13.906250  
4                       9.488889  

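We can use this ratio to categorize users into different segments, such as high spenders with low screen time or low spenders with high screen time. It can help us understand which user groups spend the most relative to the time they spend in the app. Note that a user with zero screen time, such as user 1002 above, gets a ratio of inf, so such rows need to be handled before the feature is used for segmentation.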
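
The inf value for user 1002 in the output above comes from dividing by a screen time of 0. Below is a minimal sketch of one way to handle it, assuming you want such users to have a missing ratio rather than an infinite one. The stand-in DataFrame (sample) and the name safe_screen_time are my own, built from the first three rows of the dataset:

```python
import numpy as np
import pandas as pd

# Stand-in sample: first three rows of the relevant columns
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "userid": [1001, 1002, 1003],
    "Average Screen Time": [17.0, 0.0, 37.0],
    "Average Spent on App (INR)": [634.0, 54.0, 207.0],
})

# Replace zero screen time with NaN so the division yields NaN instead of inf
safe_screen_time = sample["Average Screen Time"].replace(0, np.nan)

sample["Spending-to-Screen-Time Ratio"] = (
    sample["Average Spent on App (INR)"] / safe_screen_time
)
print(sample)
```

Whether a missing value is the right choice depends on the analysis. Dropping those users or capping the ratio are other options you could mention in an interview.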

Filter the dataset to include only users who have left a review (Left Review = 1) and have given a rating of 4 or higher. How many such users are there?

Here’s how to filter the dataset to include only users who have left a review (Left Review = 1) and have given a rating of 4 or higher:

# Filter the dataset based on the conditions
filtered_df = df[(df['Left Review'] == 1) & (df['Ratings'] >= 4)]

# Count the number of users meeting the criteria
num_users = filtered_df.shape[0]

# Print the number of users
print(f"Number of users who left a review with a rating of 4 or higher: {num_users}")
Number of users who left a review with a rating of 4 or higher: 421
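
If only the count is needed, you can sum the boolean mask directly, or express the same filter with query(). Here is a minimal sketch, assuming a small stand-in DataFrame (sample, my own name) built from the first five rows of the dataset, so the count differs from the 421 found on the full data:

```python
import pandas as pd

# Stand-in sample: first five rows of the relevant columns
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "userid": [1001, 1002, 1003, 1004, 1005],
    "Left Review": [1, 0, 0, 1, 1],
    "Ratings": [9, 4, 8, 6, 5],
})

# True counts as 1, so summing the mask gives the number of matching rows
mask = (sample["Left Review"] == 1) & (sample["Ratings"] >= 4)
count_from_mask = int(mask.sum())

# query() needs backticks around column names that contain spaces
count_from_query = len(sample.query("`Left Review` == 1 and Ratings >= 4"))

print(count_from_mask, count_from_query)
```

Both approaches apply the same two conditions as the filter above. Which one to use is mostly a matter of readability.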

Calculate the correlation between “Ratings” and “Average Spent on App (INR)” for the dataset. What does this correlation tell us about user behaviour?

Here’s how to calculate the correlation between Ratings and Average Spent on App:

# Calculate the correlation between 'Ratings' and 'Average Spent on App (INR)'
correlation = df['Ratings'].corr(df['Average Spent on App (INR)'])

# Print the correlation coefficient
print(f"Correlation between 'Ratings' and 'Average Spent on App (INR)': {correlation:.2f}")
Correlation between 'Ratings' and 'Average Spent on App (INR)': 0.48

A correlation coefficient of 0.48 between Ratings and Average Spent on App (INR) suggests a moderate positive linear relationship between these two variables. It indicates that as user ratings of the application increase, the average amount spent by users on the app tends to increase as well. In other words, users who rate the app more positively tend to spend more money on it. Keep in mind that correlation alone does not tell us whether one of these causes the other.
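
The corr() method computes the Pearson correlation by default, which measures linear association. Since Ratings is an ordinal score, an interviewer may also ask about a rank-based alternative. Below is a minimal sketch, assuming a small stand-in DataFrame (sample, my own name) built from the first five rows of the dataset. It computes Spearman's rank correlation as the Pearson correlation of the ranks:

```python
import pandas as pd

# Stand-in sample: first five rows of the relevant columns
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "Ratings": [9, 4, 8, 6, 5],
    "Average Spent on App (INR)": [634.0, 54.0, 207.0, 445.0, 427.0],
})

ratings = sample["Ratings"]
spent = sample["Average Spent on App (INR)"]

# Pearson correlation (the default of Series.corr)
pearson = ratings.corr(spent)

# Spearman correlation: Pearson correlation computed on the ranks
spearman = ratings.rank().corr(spent.rank())

print(f"Pearson: {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

With only five rows these numbers are purely illustrative. On the full dataset, comparing the two coefficients shows whether the relationship is closer to monotonic than to strictly linear.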

Split the dataset into two subsets: one for users who have “Status” as Installed and another for users who have “Status” as Uninstalled. What are the average screen time and average spent on the app for each group?

Here’s how to split the dataset into two subsets based on the Status column (one for users with Status as Installed and another for users with Status as Uninstalled) and calculate the average screen time and average spent on the app for each group:

# Split the dataset into two subsets based on 'Status'
installed_users = df[df['Status'] == 'Installed']
uninstalled_users = df[df['Status'] == 'Uninstalled']

# Calculate the average screen time and average spent on the app for each group
average_screen_time_installed = installed_users['Average Screen Time'].mean()
average_spent_on_app_installed = installed_users['Average Spent on App (INR)'].mean()

average_screen_time_uninstalled = uninstalled_users['Average Screen Time'].mean()
average_spent_on_app_uninstalled = uninstalled_users['Average Spent on App (INR)'].mean()

# Print the results
print("For 'Installed' Users:")
print(f"Average Screen Time: {average_screen_time_installed:.2f}")
print(f"Average Spent on App (INR): {average_spent_on_app_installed:.2f}")

print("\nFor 'Uninstalled' Users:")
print(f"Average Screen Time: {average_screen_time_uninstalled:.2f}")
print(f"Average Spent on App (INR): {average_spent_on_app_uninstalled:.2f}")
For 'Installed' Users:
Average Screen Time: 26.39
Average Spent on App (INR): 458.16

For 'Uninstalled' Users:
Average Screen Time: 2.28
Average Spent on App (INR): 52.02
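
The same per-group averages can be computed without splitting the DataFrame by hand, using groupby(). Here is a minimal sketch, assuming a small stand-in DataFrame (sample, my own name) built from the first five rows of the dataset, so the averages differ from the full-dataset results above:

```python
import pandas as pd

# Stand-in sample: first five rows of the relevant columns
# (an assumption for illustration; use the full df in practice).
sample = pd.DataFrame({
    "Average Screen Time": [17.0, 0.0, 37.0, 32.0, 45.0],
    "Average Spent on App (INR)": [634.0, 54.0, 207.0, 445.0, 427.0],
    "Status": ["Installed", "Uninstalled", "Installed", "Installed", "Installed"],
})

columns = ["Average Screen Time", "Average Spent on App (INR)"]

# groupby() forms one group per Status value; mean() averages each column per group
group_means = sample.groupby("Status")[columns].mean()
print(group_means.round(2))
```

This version scales to any number of Status values without writing one filter per group.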

These are examples of Data Manipulation interview questions and how to solve them using Python. You can learn more about Data Manipulation from the resources mentioned below:

  1. Data Manipulation Guide
  2. A project based on Data Manipulation

Summary

In a technical interview for Data Science, many of the questions are based on Data Manipulation, which means altering, transforming, or restructuring data to prepare it for analysis, reporting, or other data-related tasks. It is a fundamental step because raw data often needs to be cleaned, organized, and refined before it yields meaningful insights. I hope you liked this article on Data Manipulation Interview Questions. Feel free to ask questions in the comments section below.

Aman Kharwal