GroupBy in Python

Groupby is a fairly simple concept: it splits a dataset into groups based on the values of one or more columns and applies a function to each group. Simple as it is, it is an extremely valuable technique that is widely used in data science. Its value comes from how efficiently it aggregates data, both in runtime and in the amount of code required.
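To make the idea concrete, here is a minimal sketch on a small made-up table (the names `toy`, `category`, and `value` are invented for illustration). It computes the per-group mean once by hand and once with groupby, so the saving in code is visible.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 8.0],
})

# By hand: filter the rows of each category, then aggregate them
manual = {
    cat: float(toy.loc[toy["category"] == cat, "value"].mean())
    for cat in toy["category"].unique()
}

# With groupby: split, apply, and combine in one expression
grouped = toy.groupby("category")["value"].mean()

print(manual)
print(grouped)
```

Both approaches give the same numbers; the groupby version stays one line long no matter how many categories there are.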

In real data science projects, you will have to deal with large amounts of data and try things over and over again, so efficiency becomes an important consideration. In this article, I will introduce you to the groupby method in pandas.

If you come from an SQL background and already know GROUP BY, the concept is the same, and you can skim this article for the pandas syntax.
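For readers coming from SQL, a rough mapping may help. The sketch below uses a made-up `orders` table (every name in it is hypothetical) and shows the query it corresponds to in a comment.

```python
import pandas as pd

# Hypothetical table, invented for illustration
orders = pd.DataFrame({
    "city": ["Lima", "Lima", "Quito", "Quito", "Quito"],
    "amount": [10.0, 30.0, 5.0, 15.0, 25.0],
})

# SQL: SELECT city, AVG(amount) AS avg_amount, COUNT(amount) AS n_orders
#      FROM orders GROUP BY city;
summary = (
    orders.groupby("city")
    .agg(avg_amount=("amount", "mean"), n_orders=("amount", "count"))
    .reset_index()  # turn the group key back into a regular column
)
print(summary)
```

Without `reset_index()`, the group key becomes the index of the result rather than a column, which is the main difference SQL users tend to notice.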

Groupby in Action

The dataset I am using in this task can be downloaded from here. It is based on restaurant data. We need to group the restaurants by type of parking available and then get the average rating of the restaurants in each parking category. We want to know if restaurants with parking lots have a better service rating.

Here are the steps we need to follow to start the task:

  • Merge two data frames together
  • Create groups based on the types of parking available in restaurants
  • Calculate the average scores for each parking group

# Load restaurant ratings and parking lot info
import pandas as pd
ratings = pd.read_csv("rating_final.csv")
parking = pd.read_csv("chefmozparking.csv")

# Merge the dataframes
df = pd.merge(left=ratings, right=parking, on="placeID", how="left")

# Show the merged data
df.head()
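One thing worth checking after a left merge is whether the row count changed. If the right-hand table lists a key more than once (for example, a restaurant with two parking types), each matching left row is repeated, which weights that restaurant's ratings more heavily in later averages. The sketch below uses made-up stand-ins for the two files; whether the real files contain repeated placeID values is an assumption to verify, not something established here.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files (all values invented)
ratings_toy = pd.DataFrame({
    "placeID": [1, 1, 2],
    "service_rating": [2, 1, 0],
})
parking_toy = pd.DataFrame({
    "placeID": [1, 1, 2],  # place 1 appears twice
    "parking_lot": ["public", "yes", "none"],
})

merged = pd.merge(left=ratings_toy, right=parking_toy, on="placeID", how="left")

# Two ratings for place 1 times two parking rows, plus one row for place 2
print(len(ratings_toy), "->", len(merged))

# List the keys that occur more than once in the right-hand table
dupes = parking_toy["placeID"].value_counts()
print(dupes[dupes > 1])
```

Passing `validate="many_to_one"` to `pd.merge` makes pandas raise an error instead of silently repeating rows when the right-hand keys are not unique.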

Now, let’s use the groupby function:

# Group by the parking_lot column
parking_group = df.groupby("parking_lot")

# Calculate the mean ratings
parking_group["service_rating"].mean()

none             1.098039
public           1.021978
valet parking    1.344828
yes              1.092545
Name: service_rating, dtype: float64

The ratings look low. Let’s look at the summary statistics to get an idea of why:

parking_group['service_rating'].describe()

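With all of the summary stats in front of us, we can see that the lowest rating in every parking category is 0 and the highest is 2. I guess users were asked to rate restaurants with 1 to 3 stars, stored as 0, 1, or 2 in the data, so a mean close to 1 sits in the middle of the scale rather than at the bottom. Comparing the group means, restaurants with valet parking have the highest average service rating (about 1.34), and those with public parking have the lowest (about 1.02).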
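A group mean says nothing about how many ratings are behind it, so it can help to put the count next to the mean. Here is a small sketch on invented data that reuses the column names from the code above; `describe()` reports these values too, and `agg` simply picks out the ones of interest.

```python
import pandas as pd

# Invented ratings on the same 0-2 scale
toy = pd.DataFrame({
    "parking_lot": ["none", "none", "none", "valet parking", "yes", "yes"],
    "service_rating": [0, 1, 2, 2, 1, 2],
})

stats = (
    toy.groupby("parking_lot")["service_rating"]
    .agg(["count", "mean", "min", "max"])   # one column per statistic
    .sort_values("mean", ascending=False)   # highest-rated group first
)
print(stats)
```

In this toy table the top group rests on a single rating, which is the kind of detail the count column is there to expose.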

I hope you liked this article on applying the groupby method in pandas. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning and Python.

Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.