Groupby is a fairly simple concept: it splits a dataset into groups based on the categories in one or more columns and applies a function to each group. Simple as it is, it's an extremely valuable technique that is widely used in data science. The value of groupby comes from its ability to aggregate data efficiently, both in terms of performance and the amount of code it takes.
In real data science projects, you will have to deal with large amounts of data and try things over and over again, so efficiency becomes an important consideration. In this article, I will introduce you to the Groupby function in Pandas.
If you come from an SQL background and are familiar with GROUP BY, you can skim this article for the syntax examples.
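To make the idea concrete before moving to the real dataset, here is a minimal sketch on a small made-up table. The `toy` dataframe and its values are hypothetical and only reuse the column names that appear later in this article; the SQL in the comment is a rough equivalent, not an exact translation.

```python
import pandas as pd

# Hypothetical toy data, not the restaurant dataset used below
toy = pd.DataFrame({
    "parking_lot": ["none", "public", "none", "yes", "public", "yes"],
    "service_rating": [1, 2, 0, 2, 1, 1],
})

# Split the rows by parking_lot, apply mean() to each group,
# and combine the results into one Series indexed by group.
# Rough SQL equivalent:
#   SELECT parking_lot, AVG(service_rating) FROM toy GROUP BY parking_lot;
mean_by_parking = toy.groupby("parking_lot")["service_rating"].mean()
print(mean_by_parking)
```

The result is a Series whose index holds the group labels, which is the same shape of output you will see for the restaurant data.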
Groupby in Action
The dataset I am using in this task can be downloaded from here. It is based on restaurant data. We need to group the restaurants by type of parking available and then get the average rating of the restaurants in each parking category. We want to know if restaurants with parking lots have a better service rating.
Here are the steps we need to follow to start the task:
- Merge two data frames together
- Create groups based on the types of parking available in restaurants
- Calculate the average scores for each parking group
```python
# Load restaurant ratings and parking lot info
import pandas as pd

ratings = pd.read_csv("rating_final.csv")
parking = pd.read_csv("chefmozparking.csv")

# Merge the dataframes
df = pd.merge(left=ratings, right=parking, on="placeID", how="left")

# Show the merged data
df.head()
```
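One thing worth checking after a left merge is whether the number of rows changed. The sketch below uses two small hypothetical tables (not the real files) to show that if a `placeID` appears more than once in the right-hand table, each matching rating row is repeated once per match. Whether this happens in the actual parking file is not verified here, so treat it as a sanity check to run rather than a known problem.

```python
import pandas as pd

# Hypothetical tables: placeID 1 has two parking entries
ratings = pd.DataFrame({"placeID": [1, 1, 2], "service_rating": [2, 1, 0]})
parking = pd.DataFrame({
    "placeID": [1, 1, 2],
    "parking_lot": ["yes", "public", "none"],
})

merged = pd.merge(left=ratings, right=parking, on="placeID", how="left")

# Compare row counts before and after the merge
print(len(ratings), len(merged))
```

If the two counts differ, some ratings are counted in more than one parking group, which is worth keeping in mind when reading the group averages.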
Now, let’s use the groupby function:
```python
# Group by the parking_lot column
parking_group = df.groupby("parking_lot")

# Calculate the mean ratings
parking_group["service_rating"].mean()
```

```
parking_lot
none             1.098039
public           1.021978
valet parking    1.344828
yes              1.092545
Name: service_rating, dtype: float64
```
The ratings look low. Let’s look at the summary statistics to get an idea of why:

```python
parking_group["service_rating"].describe()
```
With all of the summary stats in front of us, we can see that the lowest rating in every parking category is 0 and the highest is 2. I guess users were asked to rate restaurants with 1 to 3 stars, which are stored as 0, 1, or 2 in the data. On that scale the averages are not unusually low, and restaurants with valet parking have the highest average service rating (about 1.34).
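If you only need a few of these statistics, `agg` lets you name them explicitly. The sketch below again uses a small hypothetical table rather than the restaurant data. Including `count` next to `mean` is a design choice: an average built from only a handful of ratings deserves less weight than one built from many.

```python
import pandas as pd

# Hypothetical toy data, not the restaurant dataset
toy = pd.DataFrame({
    "parking_lot": ["none", "none", "none", "valet parking", "yes", "yes"],
    "service_rating": [0, 1, 2, 2, 1, 2],
})

# One row per parking group, one column per requested statistic
summary = (
    toy.groupby("parking_lot")["service_rating"]
    .agg(["mean", "count", "min", "max"])
)
print(summary)
```

On the real data, the same call on `parking_group["service_rating"]` would show how many ratings sit behind each group's average.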
I hope you liked this article on applying the groupby method in Python pandas. Feel free to ask your questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning and Python.