Pandas is a powerful open-source Python library that provides high-performance tools for data manipulation and analysis. It boosts the productivity and efficiency of Data Science professionals by offering a comprehensive toolkit for cleaning, transforming, analyzing, and exploring data, and its versatility, performance, and ease of use make it a popular choice for working with structured data. If you want to learn Pandas for Data Science, this article is for you. In this article, I’ll take you through a complete guide to Pandas for Data Science.
What is Pandas?
Pandas is a powerful open-source library in Python that provides high-performance data manipulation and analysis tools. It introduces two fundamental data structures: Series and DataFrame, which allow data to be organized, manipulated, and analyzed in a tabular format similar to spreadsheets or databases.
Data Science professionals rely on Pandas for several reasons. Firstly, Pandas simplifies the process of data handling and manipulation, allowing professionals to clean, preprocess, and transform data.
Secondly, Pandas offers powerful tools for data analysis and exploration. It provides functions for aggregating data, computing descriptive statistics, handling time series data, and working with categorical and textual data.
Furthermore, Pandas excels at data integration and preparation. Data Science professionals can easily load data from different sources (such as CSV, Excel, SQL databases, and more), merge or join datasets, and perform data transformations before further analysis.
To install Pandas on your Python virtual environment, you can execute the command mentioned below in your terminal or command prompt:
- pip install pandas
A Practical Guide to Pandas for Data Science
In this section, I will take you through a practical guide to Pandas for Data Science. Let’s start by creating a DataFrame:
import pandas as pd

data = {'Name': ['Aman', 'Akanksha', 'Akshit', 'Divyansha', 'Hardik'],
        'Age': [24, 26, 24, 22, 22],
        'City': ['New Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Bangalore'],
        'Salary': [80000, 70000, 65000, 60000, 55000]}

df = pd.DataFrame(data)
print(df)
        Name  Age       City  Salary
0       Aman   24  New Delhi   80000
1   Akanksha   26     Mumbai   70000
2     Akshit   24    Kolkata   65000
3  Divyansha   22    Chennai   60000
4     Hardik   22  Bangalore   55000
The DataFrame contains information about a group of individuals. Each row represents a person, and the columns represent different attributes or characteristics of these individuals.
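Each column of this DataFrame is itself a Series, the other core Pandas data structure mentioned earlier. As a quick illustration (reusing the df created above, with the variable names ages and s chosen here just for the example), you can pull a column out of a DataFrame or build a Series directly from a list:

# A single column of a DataFrame is a Series
ages = df['Age']
print(type(ages))   # <class 'pandas.core.series.Series'>

# A Series can also be created directly from a list
s = pd.Series([10, 20, 30], name='numbers')
print(s)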
Now let’s have a look at the head() function in pandas:
print(df.head(3))
       Name  Age       City  Salary
0      Aman   24  New Delhi   80000
1  Akanksha   26     Mumbai   70000
2    Akshit   24    Kolkata   65000
The head() function helps us look at a subset of the data. By default, it returns the first five rows of the DataFrame. In this case, we pass 3 as the argument to head(), indicating that we want to retrieve the first three rows.
In the same way, we can look at the last rows of the DataFrame using tail():
print(df.tail(2))
        Name  Age       City  Salary
3  Divyansha   22    Chennai   60000
4     Hardik   22  Bangalore   55000
Now let’s look at the overall structure of the DataFrame, including the number of rows and columns it contains:
print(df.shape)
(5, 4)
The result is a tuple that contains two values. The first value represents the number of rows in the DataFrame, while the second value represents the number of columns. We got (5, 4) in the above output, which means that the DataFrame has 5 rows and 4 columns.
Now let’s have a look at the column names:
print(df.columns)
Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
Now let’s have a look at the summary statistics of the data:
print(df.describe())
            Age        Salary
count   5.00000      5.000000
mean   23.60000  66000.000000
std     1.67332   9617.692031
min    22.00000  55000.000000
25%    22.00000  60000.000000
50%    24.00000  65000.000000
75%    24.00000  70000.000000
max    26.00000  80000.000000
The summary includes the following statistical measures of the numerical columns in the data:
- Count: The number of non-missing values in each column.
- Mean: The average value of the data in each column.
- Standard Deviation: A measure of the variability or dispersion of the data.
- Minimum: The smallest value observed in each column.
- 25th Percentile: The value below which 25% of the data falls.
- 50th Percentile (Median): The value below which 50% of the data falls.
- 75th Percentile: The value below which 75% of the data falls.
- Maximum: The largest value observed in each column.
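By default, describe() summarizes only the numerical columns. If you also want an overview of the non-numeric columns (such as 'Name' and 'City' here), you can pass include='all', which adds measures like the number of unique values and the most frequent value:

# Include non-numeric columns in the summary statistics
print(df.describe(include='all'))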
Now let’s see how to look at the number of missing values in each column:
print(df.isnull().sum())
Name      0
Age       0
City      0
Salary    0
dtype: int64
df.isnull().sum() provides a concise summary of the number of missing values in each column of the DataFrame, helping us understand the presence and extent of missing data in the dataset.
Now let’s see how to drop missing values:
df = df.dropna()
df = df.dropna() creates a new DataFrame that excludes rows containing missing values, ensuring that the resulting DataFrame contains only complete and valid data for further analysis or processing.
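Our example DataFrame has no missing values, so dropna() leaves it unchanged. Here is a minimal, hypothetical illustration (using a small throwaway DataFrame called df_missing) with a deliberately introduced NaN so you can see the effect:

import numpy as np

# A small DataFrame with one missing value
df_missing = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})

print(df_missing.isnull().sum())   # column 'A' has one missing value
print(df_missing.dropna())         # the row containing NaN is removed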
Similarly, this is how to check for duplicate rows in the data:
print(df.duplicated())
0    False
1    False
2    False
3    False
4    False
dtype: bool
And here’s how to drop duplicates from the data:
df = df.drop_duplicates()
Now here’s how to select columns:
df['Name']           # Select a single column
df[['Name', 'Age']]  # Select multiple columns
Selecting a single column using df['Column Name'] allows us to extract and work with the values of a specific attribute from the DataFrame. Selecting multiple columns using df[['Column1', 'Column2']] allows us to extract and work with multiple attributes simultaneously, enabling a more comprehensive analysis and exploration of the data.
Now let’s see how to select rows by index and label:
df.loc[2]   # Select a row by label
df.iloc[3]  # Select a row by index
By calling df.loc[2], we ask Pandas to extract and display the row whose index label is 2. The result is a Series, a one-dimensional labelled array containing the values in that row. And by calling df.iloc[3], we ask Pandas to extract and display the row at integer position 3.
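With the default integer index used here, labels and positions happen to coincide, so loc and iloc look interchangeable. The difference becomes clear with a non-default index; the short sketch below uses the 'Name' column as an index purely for illustration (df_by_name is just an example variable name):

# A copy of the DataFrame indexed by name instead of the default 0..4
df_by_name = df.set_index('Name')

print(df_by_name.loc['Akshit'])   # selection by label
print(df_by_name.iloc[0])         # selection by integer position (first row)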
Now let’s have a look at some examples of filtering rows:
df[df['Age'] > 25]               # Filter rows based on a condition
df.query('City == "New Delhi"')  # Filter rows using a query syntax
By calling df[df['Age'] > 25], we keep only the rows that meet the specified condition, namely that the 'Age' value is greater than 25. And by calling df.query('City == "New Delhi"'), we filter the DataFrame to the rows whose 'City' value equals "New Delhi".
An example of sorting your DataFrame according to the values of a particular column:
# Sort the DataFrame by the values in a specified column
df.sort_values(by='Salary')
By calling df.sort_values(by='Salary'), we sort the DataFrame in ascending order based on the values in the 'Salary' column.
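sort_values() sorts in ascending order by default; passing ascending=False reverses the order, and a list of column names sorts by several columns at once:

# Sort by salary from highest to lowest
df.sort_values(by='Salary', ascending=False)

# Sort by age, then by salary within each age
df.sort_values(by=['Age', 'Salary'])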
Now let’s have a look at an example of creating new columns using existing columns:
# Creating a new column using existing columns
df['Salary_Per_Year'] = df['Salary'] * 12
By calling df['Salary_Per_Year'] = df['Salary'] * 12, we multiply each value in the 'Salary' column by 12 and store the result in a new column named 'Salary_Per_Year'.
Now let’s have a look at renaming columns:
# Renaming Columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Age (years)'})
By calling df.rename(columns={'Name': 'Full Name', 'Age': 'Age (years)'}), we rename the 'Name' column to 'Full Name' and the 'Age' column to 'Age (years)'.
Now let’s see how to convert a DataFrame to a CSV file:
# Export the DataFrame to a CSV file
df.to_csv('data.csv', index=False)
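The reverse operation, loading a CSV file back into a DataFrame, is done with pd.read_csv(). The snippet below assumes the 'data.csv' file written above exists in the working directory:

# Load the CSV file back into a DataFrame
df_from_csv = pd.read_csv('data.csv')
print(df_from_csv.head())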
Now let’s have a look at how to drop columns and rows:
df_dropped_col = df.drop('Salary', axis=1)
df_dropped_rows = df.drop([1, 3], axis=0)
By calling df.drop('Salary', axis=1), we drop the 'Salary' column from the DataFrame. And by calling df.drop([1, 3], axis=0), we drop the rows with index labels 1 and 3 from the DataFrame.
Now let’s see how to merge and join DataFrames:
data1 = {'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]}
data2 = {'key': ['B', 'C', 'D', 'E'], 'value2': [5, 6, 7, 8]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merging and Joining DataFrames
merged_df = pd.merge(df1, df2, on='key')
joined_df = df1.join(df2.set_index('key'), on='key')
Merging DataFrames with pd.merge(df1, df2, on='key') combines two DataFrames into a single DataFrame based on a common key column. Joining DataFrames with df1.join(df2.set_index('key'), on='key') attaches the columns of one DataFrame to another based on a shared column.
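By default, pd.merge() performs an inner join, keeping only the keys present in both DataFrames (B, C, and D here). The how parameter controls this behaviour; for example, an outer join keeps every key from both sides and fills the gaps with NaN:

# Keep all keys from both DataFrames
outer_df = pd.merge(df1, df2, on='key', how='outer')
print(outer_df)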
Now let’s see how to convert a column containing information about dates into a datetime data type:
# Create a sample DataFrame with date and value columns
data = {'date': ['2023-07-01', '2023-07-02', '2023-07-03', '2023-07-04', '2023-07-05'],
        'value': [10, 15, 12, 8, 11]}
df = pd.DataFrame(data)

# Convert the 'date' column to datetime data type
df['date'] = pd.to_datetime(df['date'])
Converting a column to the datetime data type with df['date'] = pd.to_datetime(df['date']) ensures that the 'date' column is recognized as datetime data, enabling time-based operations and analysis on the data.
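Once the column has a datetime data type, the .dt accessor exposes its components, and the dates can serve as an index for time-based operations. A brief sketch, reusing the DataFrame created above:

# Extract components of the dates
df['weekday'] = df['date'].dt.day_name()

# Use the dates as an index and resample to 2-day sums
daily = df.set_index('date')
print(daily['value'].resample('2D').sum())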
Now let’s see how to aggregate and transform data:
# Create a sample DataFrame
data = {'category': ['A', 'B', 'A', 'B', 'A'],
        'value1': [10, 15, 8, 12, 7],
        'value2': [5, 8, 4, 6, 3]}
df = pd.DataFrame(data)

# Advanced Aggregation
aggregated_df = df.groupby('category').agg({'value1': 'sum', 'value2': 'mean'})

# Advanced Transformation
transformed_df = df.groupby('category').transform('mean')

# Display the original DataFrame, aggregated DataFrame, and transformed DataFrame
print("Original DataFrame:")
print(df)
print("\nAggregated DataFrame:")
print(aggregated_df)
print("\nTransformed DataFrame:")
print(transformed_df)
Original DataFrame:
  category  value1  value2
0        A      10       5
1        B      15       8
2        A       8       4
3        B      12       6
4        A       7       3

Aggregated DataFrame:
          value1  value2
category
A             25     4.0
B             27     7.0

Transformed DataFrame:
      value1  value2
0   8.333333     4.0
1  13.500000     7.0
2   8.333333     4.0
3  13.500000     7.0
4   8.333333     4.0
Aggregation allows us to summarize and condense the data by computing summary statistics or combining multiple values into a single representation. In this case, we calculate the sum of 'value1' and the mean of 'value2' for each category, providing a consolidated view of the data grouped by categories.

Transformation allows us to apply computations or modifications to individual values in the DataFrame based on group-specific properties. In this case, we replace the values in 'value1' and 'value2' with the mean of their respective categories, providing a standardized representation of the data within each category.
So these were some of the most important Pandas operations you should know while getting started with Pandas for Data Science. You can explore more operations in the official Pandas documentation.
Summary
Pandas is a powerful open-source Python library that provides high-performance tools for data manipulation and analysis. Its two fundamental data structures, Series and DataFrame, let you organize, manipulate, and analyze data in a tabular format similar to spreadsheets or databases. I hope you liked this article on a practical guide to Pandas for Data Science. Feel free to ask valuable questions in the comments section below.