Data Manipulation Operations Asked in Interviews

Data manipulation refers to the process of adjusting, organizing, or transforming data to make it cleaner, better organized, and easier to analyze. Data Scientists and Analysts spend a large share of their time (often cited as around 70%) on data manipulation while solving a problem, which is why many Data Science interview questions are based on it. So, if you are looking for problems to practice your data manipulation skills, this article is for you. In this article, I’ll take you through some data manipulation operations asked in Data Science interviews, with solutions using Python.

Data Manipulation Operations Asked in Interviews

Let’s go through some interview problems based on the data manipulation operations commonly asked in Data Science interviews, and how to solve them using Python.

Problem 1: Cleaning Product Reviews

A company wants to analyze text reviews for sentiment analysis but needs to clean the data first. The task is to remove emojis, special characters, and numbers, and convert all text to lowercase for standardized processing.

Solving this problem involves text preprocessing, which can be efficiently handled using pandas along with Python’s re module for regular expressions. Let’s create the data based on the given problem first:

import pandas as pd
import re

# sample data
reviews_data = pd.DataFrame({
    'review': [
        'Great product! 😊 Highly recommend!!!',
        "Bad quality 😔 Don't buy! 2021",
        'Value for money 💰 5 stars!!!'
    ]
})
reviews_data

Now, here’s how to solve this problem using Python:

# function to clean text data
def clean_text(text):
    # remove emojis
    text = re.sub('[\U00010000-\U0010ffff]', '', text)
    # remove special characters and numbers
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # convert to lowercase
    text = text.lower()
    return text

# apply the cleaning function to the review column
reviews_data['cleaned_review'] = reviews_data['review'].apply(clean_text)

print(reviews_data[['review', 'cleaned_review']])
                                   review                  cleaned_review
0  Great product! 😊 Highly recommend!!!  great product highly recommend
1          Bad quality 😔 Don't buy! 2021            bad quality dont buy
2           Value for money 💰 5 stars!!!           value for money stars
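
If you prefer to avoid apply(), the same cleaning can be expressed with pandas’ vectorized string methods. Here’s a minimal sketch that mirrors the clean_text logic above using Series.str.replace with regex=True:

# a vectorized alternative to apply(), mirroring clean_text step by step
reviews_data['cleaned_review'] = (
    reviews_data['review']
    .str.replace('[\U00010000-\U0010ffff]', '', regex=True)  # remove emojis
    .str.replace(r'[^A-Za-z\s]', '', regex=True)              # remove special characters and numbers
    .str.lower()                                               # convert to lowercase
)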

Problem 2: Analyzing Review Length for Product Insights

A company wants to understand if the length of product reviews has any correlation with customer satisfaction. The task is to categorize the reviews into ‘short’, ‘medium’, and ‘long’, and calculate the average rating for each category.

Solving this problem requires text length analysis, categorization, and aggregation. Let’s create the data based on the given problem first:

# a dataframe 'reviews_data' with 'rating' and 'review' columns
reviews_data = pd.DataFrame({
    'rating': [5, 2, 4, 3, 5],
    'review': [
        'Good quality product, works well',
        'Not worth the price',
        'Decent purchase, happy with the product',
        'Could be better',
        'Excellent! Highly recommend'
    ]
})

reviews_data

Now, here’s how to solve this problem using Python:

# define the thresholds for review length categorization
def categorize_review_length(text):
    length = len(text.split())
    if length <= 5:
        return 'short'
    elif 5 < length <= 10:
        return 'medium'
    else:
        return 'long'

# categorize reviews based on their length
reviews_data['length_category'] = reviews_data['review'].apply(categorize_review_length)

# calculate the average rating for each category
average_rating_by_length = reviews_data.groupby('length_category')['rating'].mean()

print(average_rating_by_length)
length_category
medium    4.00
short     3.75
Name: rating, dtype: float64
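
If you’d rather not write a custom function, the same bucketing can be done with pd.cut on the word counts, which keeps the thresholds in one place. A minimal sketch, assuming the same 5-word and 10-word cutoffs:

# alternative: bucket word counts with pd.cut instead of a custom function
word_counts = reviews_data['review'].str.split().str.len()
reviews_data['length_category'] = pd.cut(
    word_counts,
    bins=[0, 5, 10, float('inf')],
    labels=['short', 'medium', 'long']
)

# average rating per length category (observed=True skips empty categories)
print(reviews_data.groupby('length_category', observed=True)['rating'].mean())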

Problem 3: Social Media Trend Analysis

A social media company wants to analyze trends in user posts to understand popular topics over time. The task is to identify the top trending topics on a weekly basis.

This problem involves time-series data manipulation and text analysis. Let’s prepare the data based on this problem:

# sample data
posts_data = pd.DataFrame({
    'timestamp': ['2024-01-01 10:00:00', '2024-01-08 11:00:00', '2024-01-15 12:00:00'],
    'post': ['Launching new feature #innovation', 'Bug fixes and improvements #update', 'Join our live event #webinar']
})

posts_data

Now, here’s how to solve this problem using Python:

  1. Parse timestamps and resample data to a weekly frequency.
  2. Extract keywords from the posts to identify topics.
  3. Count occurrences of each topic per week to identify trends.
# convert timestamp to datetime and set as index
posts_data['timestamp'] = pd.to_datetime(posts_data['timestamp'])
posts_data.set_index('timestamp', inplace=True)

# resample data weekly and aggregate posts
weekly_posts = posts_data['post'].resample('W').agg(' '.join)

# extract and count hashtags (or keywords) to identify trends
weekly_trends = weekly_posts.apply(lambda x: pd.Series(x.split()).value_counts().to_dict())

print(weekly_trends)
timestamp
2024-01-07    {'Launching': 1, 'new': 1, 'feature': 1, '#inn...
2024-01-14    {'Bug': 1, 'fixes': 1, 'and': 1, 'improvements...
2024-01-21    {'Join': 1, 'our': 1, 'live': 1, 'event': 1, '...
Freq: W-SUN, Name: post, dtype: object
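
Since the goal is to spot trending topics, you may want to count only the hashtags rather than every word in a post. A minimal sketch, assuming topics are marked with # as in the sample posts and that posts_data still has the datetime index set above:

# count hashtag occurrences per week to surface trending topics
hashtag_counts = (
    posts_data['post']
    .str.findall(r'#\w+')           # extract hashtags from each post
    .explode()                      # one row per hashtag
    .groupby(pd.Grouper(freq='W'))  # group by week using the datetime index
    .value_counts()
)

print(hashtag_counts)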

Problem 4: Traffic Anomaly Detection

A tech company needs to monitor its network traffic to quickly identify anomalies that could indicate a potential issue. The task is to detect significant spikes or drops.

This problem requires time-series analysis and anomaly detection. Let’s prepare the data to solve this problem:

# sample data
traffic_data = pd.DataFrame({
    'timestamp': pd.date_range(start='2024-01-01', periods=24, freq='H'),
    'requests': [100, 102, 104, 500, 105, 103, 101, 
                 100, 99, 98, 97, 500, 95, 94, 93,
                 92, 91, 90, 500, 88, 87, 86, 85, 500]
})

traffic_data.head()

Now, here’s how to solve this problem using Python:

  1. Compute a rolling mean and standard deviation to capture the underlying trend and variability in the data.
  2. Use a Z-score or a similar metric that measures the number of standard deviations an observation is from the mean to identify anomalies (an explicit z-score version is sketched after the code below).
# set timestamp as index
traffic_data.set_index('timestamp', inplace=True)

# calculate rolling mean and standard deviation with a larger window
window_size = 6  # Larger window for smoothing
rolling_mean = traffic_data['requests'].rolling(window=window_size).mean()
rolling_std = traffic_data['requests'].rolling(window=window_size).std()

# define anomalies as points more than 2 standard deviations from the mean
anomaly_threshold = 2
traffic_data['is_anomaly'] = abs(traffic_data['requests'] - rolling_mean) > (anomaly_threshold * rolling_std)

# display anomalies
anomalies = traffic_data[traffic_data['is_anomaly']]

print(anomalies)
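
If you want the z-score from step 2 as an explicit column, which makes the threshold easier to tune and inspect, here is a minimal sketch equivalent to the check above (it reuses rolling_mean, rolling_std, and anomaly_threshold):

# explicit z-score: number of rolling standard deviations away from the rolling mean
traffic_data['z_score'] = (traffic_data['requests'] - rolling_mean) / rolling_std
traffic_data['is_anomaly'] = traffic_data['z_score'].abs() > anomaly_threshold

print(traffic_data[traffic_data['is_anomaly']][['requests', 'z_score']])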

Problem 5: Calculating Price Elasticity of Demand

An e-commerce company needs to understand how sensitive the demand for its products is to changes in price. The task is to calculate the price elasticity of demand for each product. This analysis will help the company in pricing strategies to maximize revenue.

Price elasticity of demand (PED) measures the responsiveness of quantity demanded to a change in price. It’s calculated as the percentage change in quantity demanded divided by the percentage change in price. A higher absolute value of PED indicates greater sensitivity of demand to price changes. Let’s prepare the data to solve this problem:

sales_data = pd.DataFrame({
    'product_id': [1, 1, 1, 2, 2, 2],
    'date': pd.date_range(start='2021-01-01', periods=6, freq='M'),
    'price': [10, 12, 11, 20, 21, 19],
    'quantity_sold': [100, 80, 90, 60, 57, 62]
})

sales_data

To solve this, one would typically:

  1. Calculate the percentage change in price and in quantity demanded for each product.
  2. Compute the elasticity for each period as PED = (% change in quantity) / (% change in price).
  3. Average the elasticity per product to get a single figure for comparison.
# percentage change in price and quantity for each product
sales_data['price_pct_change'] = sales_data.groupby('product_id')['price'].pct_change()
sales_data['quantity_pct_change'] = sales_data.groupby('product_id')['quantity_sold'].pct_change()

# elasticity: % change in quantity divided by the absolute % change in price,
# so the sign of the result follows the direction of the quantity change
sales_data['price_elasticity'] = sales_data['quantity_pct_change'] / sales_data['price_pct_change'].abs()

# average elasticity per product
average_ped = sales_data.groupby('product_id')['price_elasticity'].mean()

print(average_ped)
product_id
1    0.250000
2   -0.039474
Name: price_elasticity, dtype: float64
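
To make the formula concrete, here is the calculation for product 1’s first price change in the sample data: the price goes from 10 to 12 (+20%) while quantity sold drops from 100 to 80 (−20%):

# worked example for product 1's first price change (price 10 -> 12, quantity 100 -> 80)
price_pct_change = (12 - 10) / 10        # +0.20
quantity_pct_change = (80 - 100) / 100   # -0.20
ped = quantity_pct_change / abs(price_pct_change)
print(ped)  # -1.0: demand falls roughly one-for-one with the price increase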

Summary

So, these were some data manipulation interview problems based on the operations commonly asked in Data Science interviews. Data Scientists and Analysts spend a large share of their time (often cited as around 70%) on data manipulation while solving a problem, which is why many Data Science interview questions are based on it.

I hope you liked this article on Data Manipulation operations asked in Data Science interviews. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.
