Data manipulation refers to the process of adjusting, organizing, or transforming data to make it cleaner, better structured, and easier to analyze. Data Scientists and Analysts often spend the majority of their time on data manipulation while solving a problem, which is why many of the questions in Data Science interviews are based on it. So, if you are looking for problems to practice your data manipulation skills, this article is for you. In this article, I'll take you through some data manipulation operations asked in Data Science interviews, with solutions using Python.
Data Manipulation Operations Asked in Interviews
Let’s go through some data manipulation interview problems based on the data manipulation operations commonly asked in Data Science interviews and how to solve them using Python.
Problem 1: Cleaning Product Reviews
A company wants to analyze text reviews for sentiment analysis but needs to clean the data first. The task is to remove emojis, special characters, and numbers, and convert all text to lowercase for standardized processing.
Solving this problem involves text preprocessing, which can be efficiently handled using pandas along with Python’s re module for regular expressions. Let’s create the data based on the given problem first:
import pandas as pd
import re

# sample data
reviews_data = pd.DataFrame({
    'review': [
        'Great product! 😊 Highly recommend!!!',
        "Bad quality 😔 Don't buy! 2021",
        'Value for money 💰 5 stars!!!'
    ]
})
reviews_data
Now, here’s how to solve this problem using Python:
# function to clean text data
def clean_text(text):
    # remove emojis (characters outside the Basic Multilingual Plane)
    text = re.sub(r'[\U00010000-\U0010ffff]', '', text)
    # remove special characters and numbers
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # convert to lowercase
    text = text.lower()
    return text

# apply the cleaning function to the review column
reviews_data['cleaned_review'] = reviews_data['review'].apply(clean_text)
print(reviews_data[['review', 'cleaned_review']])
review cleaned_review
0 Great product! 😊 Highly recommend!!! great product highly recommend
1 Bad quality 😔 Don’t buy! 2021 bad quality dont buy
2 Value for money 💰 5 stars!!! value for money stars
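As a side note, the character class `[^A-Za-z\s]` already strips emojis and digits on its own, so the same cleaning can be done in a single pass with vectorized pandas string methods instead of `apply`. A minimal sketch, using two of the reviews above (the variable names here are illustrative, not from the original solution):

```python
import pandas as pd

reviews = pd.Series([
    "Great product! 😊 Highly recommend!!!",
    "Bad quality 😔 Don't buy! 2021",
])

# chain vectorized string operations: keep only ASCII letters and
# whitespace, lowercase, then collapse the leftover double spaces
cleaned = (
    reviews.str.replace(r"[^A-Za-z\s]", "", regex=True)
           .str.lower()
           .str.replace(r"\s+", " ", regex=True)
           .str.strip()
)
print(cleaned.tolist())
```

On larger datasets, vectorized `.str` methods are typically faster than calling a Python function row by row with `apply`.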
Problem 2: Analyzing Review Length for Product Insights
A company wants to understand if the length of product reviews has any correlation with customer satisfaction. The task is to categorize the reviews into ‘short’, ‘medium’, and ‘long’, and calculate the average rating for each category.
Solving this problem requires text length analysis, categorization, and aggregation. Let’s create the data based on the given problem first:
# a dataframe 'reviews_data' with 'rating' and 'review' columns
reviews_data = pd.DataFrame({
    'rating': [5, 2, 4, 3, 5],
    'review': [
        'Good quality product, works well',
        'Not worth the price',
        'Decent purchase, happy with the product',
        'Could be better',
        'Excellent! Highly recommend'
    ]
})
reviews_data
Now, here’s how to solve this problem using Python:
# define the thresholds for review length categorization
def categorize_review_length(text):
    length = len(text.split())
    if length <= 5:
        return 'short'
    elif length <= 10:
        return 'medium'
    else:
        return 'long'

# categorize reviews based on their length
reviews_data['length_category'] = reviews_data['review'].apply(categorize_review_length)

# calculate the average rating for each category
average_rating_by_length = reviews_data.groupby('length_category')['rating'].mean()
print(average_rating_by_length)
length_category
medium 4.00
short 3.75
Name: rating, dtype: float64
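The same thresholds can also be expressed declaratively with `pd.cut`, which replaces the hand-written if/elif chain with labelled bins. A sketch using the word counts of the five reviews above (the `word_counts` series is written out by hand here for illustration):

```python
import pandas as pd

# words per review, as counted from the five sample reviews above
word_counts = pd.Series([5, 4, 6, 3, 3])

# pd.cut assigns each count to a labelled bin in one vectorized call;
# bins are right-inclusive, so 5 words falls into (0, 5] = 'short'
categories = pd.cut(
    word_counts,
    bins=[0, 5, 10, float("inf")],
    labels=["short", "medium", "long"],
)
print(categories.tolist())
```

Changing the thresholds later then only requires editing the `bins` list, not the branching logic.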
Problem 3: Social Media Trend Analysis
A social media company wants to analyze trends in user posts to understand popular topics over time. The task is to identify the top trending topics on a weekly basis.
This problem involves time-series data manipulation and text analysis. Let’s prepare the data based on this problem:
# sample data
posts_data = pd.DataFrame({
    'timestamp': ['2024-01-01 10:00:00', '2024-01-08 11:00:00', '2024-01-15 12:00:00'],
    'post': [
        'Launching new feature #innovation',
        'Bug fixes and improvements #update',
        'Join our live event #webinar'
    ]
})
posts_data
Now, here’s how to solve this problem using Python:
- Parse timestamps and resample data to a weekly frequency.
- Extract keywords from the posts to identify topics.
- Count occurrences of each topic per week to identify trends.
# convert timestamp to datetime and set as index
posts_data['timestamp'] = pd.to_datetime(posts_data['timestamp'])
posts_data.set_index('timestamp', inplace=True)

# resample data weekly and aggregate posts
weekly_posts = posts_data['post'].resample('W').agg(' '.join)

# count each word (including hashtags) per week to identify trends;
# pd.value_counts is deprecated, so build a Series and count on that
weekly_trends = weekly_posts.apply(lambda x: pd.Series(x.split()).value_counts().to_dict())
print(weekly_trends)
timestamp
2024-01-07 {'Launching': 1, 'new': 1, 'feature': 1, '#inn...
2024-01-14 {'Bug': 1, 'fixes': 1, 'and': 1, 'improvements...
2024-01-21 {'Join': 1, 'our': 1, 'live': 1, 'event': 1, '...
Freq: W-SUN, Name: post, dtype: object
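If only hashtags matter for trend detection (rather than every word), a regex extract plus `explode` keeps the counting fully vectorized. A minimal sketch with two of the posts above (the `posts` and `weekly_counts` names are illustrative):

```python
import pandas as pd

posts = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:00:00", "2024-01-08 11:00:00"]),
    "post": [
        "Launching new feature #innovation",
        "Bug fixes and improvements #update",
    ],
}).set_index("timestamp")

# pull out only the hashtags, one row per tag, keeping the timestamp index
hashtags = posts["post"].str.findall(r"#\w+").explode()

# count each tag inside weekly buckets
weekly_counts = (
    hashtags.to_frame("tag")
            .groupby([pd.Grouper(freq="W"), "tag"])
            .size()
)
print(weekly_counts)
```

Sorting each weekly group of `weekly_counts` in descending order would then give the top trending hashtags per week directly, without the stop words ("and", "our", ...) that word-level counting picks up.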
Problem 4: Traffic Anomaly Detection
A tech company needs to monitor its network traffic to quickly identify anomalies that could indicate a potential issue. The task is to detect significant spikes or drops.
This problem requires time-series analysis and anomaly detection. Let’s prepare the data to solve this problem:
# sample data
traffic_data = pd.DataFrame({
    'timestamp': pd.date_range(start='2024-01-01', periods=24, freq='h'),
    'requests': [100, 102, 104, 500, 105, 103, 101, 100, 99, 98, 97, 500,
                 95, 94, 93, 92, 91, 90, 500, 88, 87, 86, 85, 500]
})
traffic_data.head()
Now, here’s how to solve this problem using Python:
- Compute a rolling mean and standard deviation to capture the underlying trend and variability in the data.
- Use a Z-score or a similar metric that measures the number of standard deviations an observation is from the mean to identify anomalies.
# set timestamp as index
traffic_data.set_index('timestamp', inplace=True)

# calculate rolling mean and standard deviation with a larger window
window_size = 6  # larger window for smoothing
rolling_mean = traffic_data['requests'].rolling(window=window_size).mean()
rolling_std = traffic_data['requests'].rolling(window=window_size).std()

# define anomalies as points more than 2 standard deviations from the mean
anomaly_threshold = 2
traffic_data['is_anomaly'] = abs(traffic_data['requests'] - rolling_mean) > (anomaly_threshold * rolling_std)

# display anomalies
anomalies = traffic_data[traffic_data['is_anomaly']]
print(anomalies)
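One caveat: the rolling statistics are NaN until the window has filled, so nothing in the first `window_size - 1` rows can be flagged, and a spike inside its own window also inflates the rolling standard deviation. The threshold test above is equivalent to an explicit rolling z-score, shown here as a compact sketch on a small hypothetical series with the spike placed after the warm-up rows:

```python
import pandas as pd

# hypothetical hourly request counts with one spike at the end
requests = pd.Series(
    [100, 102, 104, 101, 105, 103, 101, 500],
    index=pd.date_range("2024-01-01", periods=8, freq="h"),
)

window = 6
rolling_mean = requests.rolling(window).mean()
rolling_std = requests.rolling(window).std()

# z-score: how many rolling standard deviations each point sits from the mean
z_scores = (requests - rolling_mean) / rolling_std
anomalies = requests[z_scores.abs() > 2]
print(anomalies)
```

Keeping the z-scores as a column (rather than only a boolean flag) also makes it easy to rank anomalies by severity later.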
Problem 5: Calculating Price Elasticity of Demand
An e-commerce company needs to understand how sensitive the demand for its products is to changes in price. The task is to calculate the price elasticity of demand for each product. This analysis will help the company in pricing strategies to maximize revenue.
Price elasticity of demand (PED) measures the responsiveness of quantity demanded to a change in price. It’s calculated as the percentage change in quantity demanded divided by the percentage change in price. A higher absolute value of PED indicates greater sensitivity of demand to price changes. Let’s prepare the data to solve this problem:
sales_data = pd.DataFrame({
    'product_id': [1, 1, 1, 2, 2, 2],
    'date': pd.date_range(start='2021-01-01', periods=6, freq='M'),
    'price': [10, 12, 11, 20, 21, 19],
    'quantity_sold': [100, 80, 90, 60, 57, 62]
})
sales_data
To solve this, one would typically:
- Calculate the percentage change in price and in quantity sold for each product.
- Compute the elasticity at each step using the formula: PED = (% change in quantity) / (% change in price).
- Average the resulting elasticities per product.
sales_data['price_pct_change'] = sales_data.groupby('product_id')['price'].pct_change()
sales_data['quantity_pct_change'] = sales_data.groupby('product_id')['quantity_sold'].pct_change()

# dividing by the absolute price change keeps the sign of the quantity response
sales_data['price_elasticity'] = sales_data['quantity_pct_change'] / sales_data['price_pct_change'].abs()

# average elasticity per product
average_ped = sales_data.groupby('product_id')['price_elasticity'].mean()
print(average_ped)
product_id
1 0.250000
2 -0.039474
Name: price_elasticity, dtype: float64
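A common refinement is the midpoint (arc) formula, which uses the average of the two price/quantity points as the base so the elasticity comes out the same whether the price moves up or down. A sketch applied to the first price step of product 1 (the `arc_elasticity` helper is a hypothetical name, not part of the solution above):

```python
# midpoint (arc) elasticity: percentage changes are taken relative to
# the average of the two points rather than the starting point
def arc_elasticity(q1, q2, p1, p2):
    dq = (q2 - q1) / ((q1 + q2) / 2)
    dp = (p2 - p1) / ((p1 + p2) / 2)
    return dq / dp

# price rises from 10 to 12 while quantity falls from 100 to 80
print(arc_elasticity(100, 80, 10, 12))
```

The result is negative, as demand elasticities usually are: quantity falls when price rises. An absolute value above 1 would indicate elastic demand, below 1 inelastic demand.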
Summary
So, these were some data manipulation problems based on the operations commonly asked in Data Science interviews. Since Data Scientists and Analysts spend the majority of their time manipulating data while solving a problem, practising these operations is one of the best ways to prepare for such interviews.
I hope you liked this article on data manipulation operations asked in Data Science interviews. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.