Data Manipulation means altering, transforming, or restructuring data to prepare it for analysis, reporting, or other data-related tasks. It is a fundamental step in data management and analysis, as raw data often needs to be cleaned, organized, and refined to derive meaningful insights. In Data Science technical interviews, many of the questions are based on Data Manipulation. So, if you want to know what kind of Data Manipulation interview questions you can expect, this article is for you. In this article, I’ll take you through a list of Data Manipulation interview questions solved and explained using Python.
Getting Started with Data Manipulation Interview Questions
In a Data Science technical interview, you will typically be given a dataset, and the Data Manipulation questions will be based on that dataset. I’ll follow the same approach here to introduce Data Manipulation interview questions and show how to solve them.
The dataset I’ll be using to create the Data Manipulation interview questions is based on user behaviour on an app. You can download the dataset from here. Below is the column information for the dataset:
- userid: The identity number of the user;
- Average Screen Time: The average screen time of the user on the application;
- Average Spent on App (INR): The average amount spent by the user on the application;
- Left Review: Did the user leave any reviews about the experience on the application? (1 if true, otherwise 0)
- Ratings: Ratings given by the user to the application;
- New Password Request: The number of times the user requested a new password;
- Last Visited Minutes: The number of minutes passed since the user was last active;
- Status: 'Installed' if the application is still installed, 'Uninstalled' if the user has deleted the application;
Data Manipulation Interview Questions
Now, let’s go through the Data Manipulation interview questions based on the dataset, one by one.
How would you load and explore this dataset to gain an initial understanding of its structure and contents?
To load and explore a dataset using Python and Pandas, you can follow the steps below.
Step 1: Import the pandas library and the dataset:
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('userbehaviour.csv')
Step 2: Explore the dataset to gain an initial understanding of its structure and contents:
# Display the first 5 rows of the dataset
print(df.head())
   userid  Average Screen Time  Average Spent on App (INR)  Left Review  \
0    1001                 17.0                       634.0            1
1    1002                  0.0                        54.0            0
2    1003                 37.0                       207.0            0
3    1004                 32.0                       445.0            1
4    1005                 45.0                       427.0            1

   Ratings  New Password Request  Last Visited Minutes       Status
0        9                     7                  2990    Installed
1        4                     8                 24008  Uninstalled
2        8                     5                   971    Installed
3        6                     2                   799    Installed
4        5                     6                  3668    Installed
# Get the dimensions of the dataset
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
Number of rows: 999
Number of columns: 8
# Check the data types of each column
print(df.dtypes)
userid                          int64
Average Screen Time           float64
Average Spent on App (INR)    float64
Left Review                     int64
Ratings                         int64
New Password Request            int64
Last Visited Minutes            int64
Status                         object
dtype: object
# Check for missing values
print(df.isnull().sum())
userid                        0
Average Screen Time           0
Average Spent on App (INR)    0
Left Review                   0
Ratings                       0
New Password Request          0
Last Visited Minutes          0
Status                        0
dtype: int64
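This dataset has no missing values, so no imputation is needed here. If missing values were present, a common interview follow-up is how to handle them; below is a minimal sketch, assuming hypothetically that the 'Average Screen Time' column had gaps:

# Hypothetical handling of missing values (this dataset has none)
# Option 1: drop rows with any missing value
df_dropped = df.dropna()

# Option 2: fill numeric gaps with the column median, e.g. for 'Average Screen Time'
df_filled = df.copy()
df_filled['Average Screen Time'] = df_filled['Average Screen Time'].fillna(
    df_filled['Average Screen Time'].median()
)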
# Summary statistics for numerical columns
print(df.describe())
            userid  Average Screen Time  Average Spent on App (INR)  \
count   999.000000           999.000000                  999.000000
mean   1500.000000            24.390390                  424.415415
std     288.530761            14.235415                  312.365695
min    1001.000000             0.000000                    0.000000
25%    1250.500000            12.000000                   96.000000
50%    1500.000000            24.000000                  394.000000
75%    1749.500000            36.000000                  717.500000
max    1999.000000            50.000000                  998.000000

       Left Review     Ratings  New Password Request  Last Visited Minutes
count   999.000000  999.000000            999.000000            999.000000
mean      0.497497    6.513514              4.941942           5110.898899
std       0.500244    2.701511              2.784626           8592.036516
min       0.000000    0.000000              1.000000            201.000000
25%       0.000000    5.000000              3.000000           1495.500000
50%       0.000000    7.000000              5.000000           2865.000000
75%       1.000000    9.000000              7.000000           4198.000000
max       1.000000   10.000000             15.000000          49715.000000
# Unique values in the 'Status' column
print(df['Status'].unique())
['Installed' 'Uninstalled']
# Number of unique users
unique_users = df['userid'].nunique()
print(f"Number of unique users: {unique_users}")
Number of unique users: 999
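Since the number of unique user IDs equals the number of rows, each row appears to represent a distinct user. As an extra sanity check (not part of the original walkthrough), you can confirm there are no duplicate IDs:

# Confirm there are no duplicate user IDs
print(f"Duplicate userids: {df['userid'].duplicated().sum()}")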
Calculate the mean, median, and standard deviation of the ‘Average Screen Time’ and ‘Average Spent on App (INR)’ columns.
Here’s how to calculate the mean, median, and standard deviation of the ‘Average Screen Time’ and ‘Average Spent on App (INR)’ columns:
# Mean, median, and standard deviation of 'Average Screen Time'
average_screen_time_mean = df['Average Screen Time'].mean()
average_screen_time_median = df['Average Screen Time'].median()
average_screen_time_std = df['Average Screen Time'].std()

# Mean, median, and standard deviation of 'Average Spent on App (INR)'
average_spent_on_app_mean = df['Average Spent on App (INR)'].mean()
average_spent_on_app_median = df['Average Spent on App (INR)'].median()
average_spent_on_app_std = df['Average Spent on App (INR)'].std()

# Print the results
print(f"Mean of Average Screen Time: {average_screen_time_mean:.2f}")
print(f"Median of Average Screen Time: {average_screen_time_median:.2f}")
print(f"Standard Deviation of Average Screen Time: {average_screen_time_std:.2f}")
print(f"Mean of Average Spent on App (INR): {average_spent_on_app_mean:.2f}")
print(f"Median of Average Spent on App (INR): {average_spent_on_app_median:.2f}")
print(f"Standard Deviation of Average Spent on App (INR): {average_spent_on_app_std:.2f}")
Mean of Average Screen Time: 24.39
Median of Average Screen Time: 24.00
Standard Deviation of Average Screen Time: 14.24
Mean of Average Spent on App (INR): 424.42
Median of Average Spent on App (INR): 394.00
Standard Deviation of Average Spent on App (INR): 312.37
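A more concise alternative, which should produce the same numbers on the same DataFrame, is to compute all three statistics in one call with agg:

# Compute mean, median, and standard deviation for both columns at once
stats = df[['Average Screen Time', 'Average Spent on App (INR)']].agg(['mean', 'median', 'std'])
print(stats.round(2))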
Create a new feature that represents the ratio of “Average Spent on App (INR)” to “Average Screen Time” for each user. How might this feature be useful?
Here’s how to create a new feature that represents the ratio of Average Spent on App (INR) to Average Screen Time for each user:
# Create a new feature 'Spending-to-Screen-Time Ratio'
df['Spending-to-Screen-Time Ratio'] = df['Average Spent on App (INR)'] / df['Average Screen Time']

# Print the updated DataFrame to see the new feature
print(df.head())
   userid  Average Screen Time  Average Spent on App (INR)  Left Review  \
0    1001                 17.0                       634.0            1
1    1002                  0.0                        54.0            0
2    1003                 37.0                       207.0            0
3    1004                 32.0                       445.0            1
4    1005                 45.0                       427.0            1

   Ratings  New Password Request  Last Visited Minutes       Status  \
0        9                     7                  2990    Installed
1        4                     8                 24008  Uninstalled
2        8                     5                   971    Installed
3        6                     2                   799    Installed
4        5                     6                  3668    Installed

   Spending-to-Screen-Time Ratio
0                      37.294118
1                            inf
2                       5.594595
3                      13.906250
4                       9.488889
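Note that user 1002 has an average screen time of 0, so the division produces inf. As an optional refinement (not part of the original solution), you can replace infinite ratios with NaN so they don't distort later statistics:

import numpy as np

# Replace infinite ratios (caused by zero screen time) with NaN
df['Spending-to-Screen-Time Ratio'] = df['Spending-to-Screen-Time Ratio'].replace(
    [np.inf, -np.inf], np.nan
)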
This ratio measures how much a user spends per unit of screen time. We can use it to categorize users into segments, such as high spenders with low screen time or low spenders with high screen time, which helps identify which user groups generate the most value relative to their engagement with the app.
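For example, a simple segmentation based on this ratio could look like the sketch below; the threshold of 10 INR per unit of screen time is an arbitrary illustrative choice, not something given in the dataset:

# Flag users who spend more than an illustrative threshold of 10 INR
# per unit of average screen time
df['High Spender per Minute'] = df['Spending-to-Screen-Time Ratio'] > 10
print(df['High Spender per Minute'].value_counts())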
Filter the dataset to include only users who have left a review (Left Review = 1) and have given a rating of 4 or higher. How many such users are there?
Here’s how to filter the dataset to include only users who have left a review (Left Review = 1) and have given a rating of 4 or higher:
# Filter the dataset based on the conditions
filtered_df = df[(df['Left Review'] == 1) & (df['Ratings'] >= 4)]

# Count the number of users meeting the criteria
num_users = filtered_df.shape[0]

# Print the number of users
print(f"Number of users who left a review with a rating of 4 or higher: {num_users}")
Number of users who left a review with a rating of 4 or higher: 421
Calculate the correlation between “Ratings” and “Average Spent on App (INR)” for the dataset. What does this correlation tell us about user behaviour?
Here’s how to calculate the correlation between Ratings and Average Spent on App:
# Calculate the correlation between 'Ratings' and 'Average Spent on App (INR)'
correlation = df['Ratings'].corr(df['Average Spent on App (INR)'])

# Print the correlation coefficient
print(f"Correlation between 'Ratings' and 'Average Spent on App (INR)': {correlation:.2f}")
Correlation between 'Ratings' and 'Average Spent on App (INR)': 0.48
A correlation coefficient of 0.48 between Ratings and Average Spent on App (INR) suggests a moderate positive linear relationship between the two variables: users who rate the app higher also tend, on average, to spend more on it. Keep in mind that correlation does not imply causation; higher spending could lead to better ratings, a better experience could drive both, or some other factor could be at play.
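One quick way to see this relationship in the data (an extra check, not part of the original question) is to look at the average spend for each rating value:

# Average spend per rating value
print(df.groupby('Ratings')['Average Spent on App (INR)'].mean().round(2))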
Split the dataset into two subsets: one for users who have “Status” as Installed and another for users who have “Status” as Uninstalled. What are the average screen time and average spent on the app for each group?
Here’s how to split the dataset into two subsets based on the Status column (one for users with Status as Installed and another for users with Status as Uninstalled) and calculate the average screen time and average spent on the app for each group:
# Split the dataset into two subsets based on 'Status'
installed_users = df[df['Status'] == 'Installed']
uninstalled_users = df[df['Status'] == 'Uninstalled']

# Calculate the average screen time and average spent on the app for each group
average_screen_time_installed = installed_users['Average Screen Time'].mean()
average_spent_on_app_installed = installed_users['Average Spent on App (INR)'].mean()
average_screen_time_uninstalled = uninstalled_users['Average Screen Time'].mean()
average_spent_on_app_uninstalled = uninstalled_users['Average Spent on App (INR)'].mean()

# Print the results
print("For 'Installed' Users:")
print(f"Average Screen Time: {average_screen_time_installed:.2f}")
print(f"Average Spent on App (INR): {average_spent_on_app_installed:.2f}")
print("\nFor 'Uninstalled' Users:")
print(f"Average Screen Time: {average_screen_time_uninstalled:.2f}")
print(f"Average Spent on App (INR): {average_spent_on_app_uninstalled:.2f}")
For 'Installed' Users:
Average Screen Time: 26.39
Average Spent on App (INR): 458.16

For 'Uninstalled' Users:
Average Screen Time: 2.28
Average Spent on App (INR): 52.02
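The same result can be obtained more concisely with a single groupby, shown below as a stylistic alternative on the same DataFrame:

# Average screen time and spend per 'Status' group in one call
print(
    df.groupby('Status')[['Average Screen Time', 'Average Spent on App (INR)']]
    .mean()
    .round(2)
)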
So, these are some examples of Data Manipulation interview questions and how to solve them using Python.
Summary
In Data Science technical interviews, many of the questions are based on Data Manipulation. Data Manipulation means altering, transforming, or restructuring data to prepare it for analysis, reporting, or other data-related tasks. It is a fundamental step in data management and analysis, as raw data often needs to be cleaned, organized, and refined to derive meaningful insights. I hope you liked this article on Data Manipulation interview questions. Feel free to ask your questions in the comments section below.