Important Pandas Functions for Data Science

Pandas is a fast and efficient Python library, built around the DataFrame object, for working with data. It provides functions for everything from reading and writing data to manipulating and preparing it for any kind of data science task. The library is large, but a small set of functions comes up in almost every project. In this article, I’m going to introduce you to some of the most important Pandas functions for data science that you need to know.

Pandas is a popular Python library for working with data. Some of the features it provides are:

  1. Intelligent data alignment
  2. Integrated handling of missing data
  3. Flexible data reshaping
  4. Easy insertion and deletion of columns
  5. Data aggregation and transformation
  6. Merging and joining of datasets
  7. Time series functionality
  8. Academic and Commercial usage

Pandas provides many functions for each of the features mentioned above. You can learn them over time, but there are some functions that you will use in almost every data science task. These important Pandas functions for data science are explained below.

Reading a Dataset:

Pandas provides functions to read data in many formats, including CSV, Excel, and JSON. Most datasets in data science tasks come as CSV files, so below is how you can read a CSV file using Pandas:

import pandas as pd
data = pd.read_csv("GOOG.csv")
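
If you do not have the file on disk, you can try the same call on an in-memory sample. The sketch below is only an illustration: it assumes a tiny CSV with the same column names as GOOG.csv, but the values are made up. It also passes parse_dates, so that the Date column is read as dates instead of plain text.

```python
import io

import pandas as pd

# Illustrative stand-in for GOOG.csv: same column names, made-up values.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,100.0,105.0,99.0,104.0,104.0,1000
2020-01-03,104.0,106.0,101.0,102.0,102.0,1500
2020-01-06,102.0,108.0,101.5,107.0,107.0,1200
"""

# read_csv accepts a file path or any file-like object.
# parse_dates converts the Date column from text to datetime values.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.dtypes)
print(sample.shape)
```

Parsing dates at read time is optional, but it makes later filtering and sorting by date behave as expected.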

Looking at the First Five Rows:

It’s not practical to look at every row of a dataset, so the best way to get an idea of what kind of data you’re going to be working with is to look at the first few rows. The head() function returns the first five rows by default:

print(data.head())
         Date         Open         High  ...        Close    Adj Close   Volume
0  2019-08-09  1197.989990  1203.880005  ...  1188.010010  1188.010010  1065700
1  2019-08-12  1179.209961  1184.959961  ...  1174.709961  1174.709961  1003000
2  2019-08-13  1171.459961  1204.780029  ...  1197.270020  1197.270020  1294400
3  2019-08-14  1176.310059  1182.300049  ...  1164.290039  1164.290039  1578700
4  2019-08-15  1163.500000  1175.839966  ...  1167.260010  1167.260010  1218700

[5 rows x 7 columns]
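
head() also accepts the number of rows you want, and tail() does the same from the end of the dataset. Here is a minimal sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows

print(first_three)
print(last_two)
print(toy.shape)           # (number of rows, number of columns)
```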

Checking Null Values:

Missing values in a dataset affect the analysis of the data, so it is very important to remove them or fill them in. But before you fill in or delete anything, you need to know how many missing values each column has. So here’s how to count the missing values in a dataset:

print(data.isnull().sum())
Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

Fortunately, this dataset does not have any missing values. If your data has missing values and you want to delete the rows that contain them, you can use the function mentioned below. Note that dropna() returns a new DataFrame and leaves the original unchanged:

data.dropna()
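
To see what dropna() does to the row count, here is a small sketch on a toy DataFrame with made-up values. The subset argument limits the check to the columns you name:

```python
import numpy as np
import pandas as pd

# Toy data with two missing values, made up for illustration.
toy = pd.DataFrame({
    "Close": [10.0, np.nan, 12.0, 13.0],
    "Volume": [100.0, 200.0, np.nan, 400.0],
})

cleaned = toy.dropna()                     # drop rows with any missing value
close_only = toy.dropna(subset=["Close"])  # only check the Close column

# The original DataFrame still has all of its rows.
print(len(toy), len(cleaned), len(close_only))
```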

Filling Missing Values:

If you want to fill all the missing values in a dataset with a specific value such as 0, 1, 100, or any other value, then you can use the function mentioned below:

data.fillna(0)

There are other strategies for filling missing values, such as using the mean or median of the column, or carrying the previous value forward.
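
As a rough comparison, the sketch below fills the same toy column (made-up values) with a constant, with the column mean, and with the previous valid value:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values, made up for illustration.
toy = pd.DataFrame({"Close": [10.0, np.nan, 14.0, np.nan]})

zero_filled = toy["Close"].fillna(0)                    # constant value
mean_filled = toy["Close"].fillna(toy["Close"].mean())  # mean() skips NaN
forward_filled = toy["Close"].ffill()                   # previous valid value

print(zero_filled.tolist())
print(mean_filled.tolist())
print(forward_filled.tolist())
```

Which strategy fits depends on the data. For a time series such as stock prices, carrying the last known value forward is often a more natural choice than a constant.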

Query Data:

To select the rows that match a condition, you can use the query() function in pandas, which filters records using an expression written as a string, much like a WHERE clause in SQL. Here’s how you can query your data:

print(data.query("Close > 1500"))
           Date         Open         High  ...        Close    Adj Close   Volume
126  2020-02-10  1474.319946  1509.500000  ...  1508.680054  1508.680054  1419900
127  2020-02-11  1511.810059  1529.630005  ...  1508.790039  1508.790039  1344600
128  2020-02-12  1514.479980  1520.694946  ...  1518.270020  1518.270020  1167600
129  2020-02-13  1512.689941  1527.180054  ...  1514.660034  1514.660034   929500
130  2020-02-14  1515.599976  1520.739990  ...  1520.739990  1520.739990  1197800
131  2020-02-18  1515.000000  1531.630005  ...  1519.670044  1519.670044  1120700
132  2020-02-19  1525.069946  1532.105957  ...  1526.689941  1526.689941   949300
133  2020-02-20  1522.000000  1529.640015  ...  1518.150024  1518.150024  1096600
230  2020-07-09  1506.449951  1522.719971  ...  1510.989990  1510.989990  1423300
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
232  2020-07-13  1550.000000  1577.131958  ...  1511.339966  1511.339966  1846400
233  2020-07-14  1490.310059  1522.949951  ...  1520.579956  1520.579956  1585000
234  2020-07-15  1523.130005  1535.329956  ...  1513.640015  1513.640015  1610700
235  2020-07-16  1500.000000  1518.689941  ...  1518.000000  1518.000000  1519300
236  2020-07-17  1521.619995  1523.439941  ...  1515.550049  1515.550049  1456700
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000
240  2020-07-23  1566.969971  1571.869995  ...  1515.680054  1515.680054  1627600
241  2020-07-24  1498.930054  1517.635986  ...  1511.869995  1511.869995  1544000
242  2020-07-27  1515.599976  1540.969971  ...  1530.199951  1530.199951  1246000
243  2020-07-28  1525.180054  1526.479980  ...  1500.339966  1500.339966  1702200
244  2020-07-29  1506.319946  1531.251953  ...  1522.020020  1522.020020  1106500
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
250  2020-08-06  1471.750000  1502.390015  ...  1500.099976  1500.099976  1995400

[25 rows x 7 columns]

In the code above, I am requesting all rows where the values in the Close column are more than 1500.
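
query() can also combine conditions, refer to Python variables with @, and handle column names that contain spaces when they are wrapped in backticks. Here is a minimal sketch on a toy DataFrame (made-up values; the threshold variable is only for illustration):

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Close": [1490.0, 1510.0, 1525.0, 1560.0],
    "Adj Close": [1490.0, 1510.0, 1525.0, 1560.0],
    "Volume": [900000, 1500000, 800000, 2000000],
})

threshold = 1500  # hypothetical cut-off, referenced inside query() with @

# Two conditions combined with "and".
high_close = toy.query("Close > @threshold and Volume > 1000000")

# The same filter written with boolean indexing.
same_rows = toy[(toy["Close"] > threshold) & (toy["Volume"] > 1000000)]

# Backticks allow column names that contain spaces.
high_adj = toy.query("`Adj Close` > 1550")

print(high_close)
print(high_adj)
```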

Sorting Values:

You can also sort your dataset using Pandas according to a particular column. For example, below is how you can sort your data in ascending order according to the values of the Close column in the dataset:

print(data.sort_values(by="Close"))
           Date         Open         High  ...        Close    Adj Close   Volume
155  2020-03-23  1061.319946  1071.319946  ...  1056.619995  1056.619995  4044100
154  2020-03-20  1135.719971  1143.989990  ...  1072.319946  1072.319946  3601800
150  2020-03-16  1096.000000  1152.266968  ...  1084.329956  1084.329956  4252400
152  2020-03-18  1056.510010  1106.500000  ...  1096.800049  1096.800049  4233400
164  2020-04-03  1119.015015  1123.540039  ...  1097.880005  1097.880005  2313400
..          ...          ...          ...  ...          ...          ...      ...
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000

[252 rows x 7 columns]

Now below is how you can sort values in descending order:

print(data.sort_values(by="Close", ascending=False))
           Date         Open         High  ...        Close    Adj Close   Volume
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
..          ...          ...          ...  ...          ...          ...      ...
164  2020-04-03  1119.015015  1123.540039  ...  1097.880005  1097.880005  2313400
152  2020-03-18  1056.510010  1106.500000  ...  1096.800049  1096.800049  4233400
150  2020-03-16  1096.000000  1152.266968  ...  1084.329956  1084.329956  4252400
154  2020-03-20  1135.719971  1143.989990  ...  1072.319946  1072.319946  3601800
155  2020-03-23  1061.319946  1071.319946  ...  1056.619995  1056.619995  4044100

[252 rows x 7 columns]
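
sort_values() also accepts a list of columns, with a separate sort direction for each one. The sketch below uses a toy DataFrame with made-up values. It also shows nlargest(), which is a shortcut when you only need the top few rows:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Close": [12.0, 10.0, 12.0, 11.0],
    "Volume": [300, 100, 500, 200],
})

# Sort by Close (descending), then break ties by Volume (ascending).
by_two = toy.sort_values(by=["Close", "Volume"], ascending=[False, True])

# The two rows with the highest Volume.
top_two = toy.nlargest(2, "Volume")

# Sorting keeps the old index labels; reset_index renumbers the rows.
renumbered = by_two.reset_index(drop=True)

print(by_two)
print(top_two)
print(renumbered)
```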

Descriptive Statistics:

To get the descriptive statistical information about your data, Pandas provides the describe() function that returns:

  1. the count of non-missing values in each numeric column
  2. the mean value of each numeric column
  3. the standard deviation of each numeric column
  4. the minimum and maximum values of each numeric column
  5. the 1st, 2nd, and 3rd quartiles (25%, 50%, and 75%) of each numeric column

Below is how you can use this function:

print(data.describe())
              Open         High  ...    Adj Close        Volume
count   252.000000   252.000000  ...   252.000000  2.520000e+02
mean   1330.245284  1345.712141  ...  1332.321488  1.708384e+06
std     121.453125   120.306284  ...   121.333070  7.665229e+05
min    1056.510010  1071.319946  ...  1056.619995  3.475000e+05
25%    1230.180023  1243.845001  ...  1230.957550  1.218225e+06
50%    1334.229981  1350.729981  ...  1338.174988  1.515100e+06
75%    1433.782501  1443.512512  ...  1436.064972  1.905950e+06
max    1586.989990  1586.989990  ...  1568.489990  4.267700e+06

[8 rows x 6 columns]
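
The result of describe() is itself a DataFrame, so you can read individual statistics from it with .loc. Here is a small sketch on a toy DataFrame with made-up values. The percentiles argument is optional and shown here only as an example:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Close": [10.0, 20.0, 30.0, 40.0],
    "Volume": [1, 2, 3, 4],
})

stats = toy.describe()

# Rows of the result are the statistics, columns are the original columns.
close_count = stats.loc["count", "Close"]   # number of non-missing values
close_mean = stats.loc["mean", "Close"]
close_median = stats.loc["50%", "Close"]    # the 2nd quartile is the median

# Ask for other percentiles than the default quartiles.
custom = toy["Close"].describe(percentiles=[0.1, 0.9])

print(stats)
print(custom)
```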

Correlation:

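You can also look at the correlation between all the numeric columns in the data by using the corr() function. In pandas 2.0 and later, pass numeric_only=True so that text columns such as Date are skipped instead of raising an error:

print(data.corr(numeric_only=True))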
               Open      High       Low     Close  Adj Close    Volume
Open       1.000000  0.993979  0.992965  0.986880   0.986880 -0.184352
High       0.993979  1.000000  0.989503  0.992714   0.992714 -0.139278
Low        0.992965  0.989503  1.000000  0.992617   0.992617 -0.248279
Close      0.986880  0.992714  0.992617  1.000000   1.000000 -0.195943
Adj Close  0.986880  0.992714  0.992617  1.000000   1.000000 -0.195943
Volume    -0.184352 -0.139278 -0.248279 -0.195943  -0.195943  1.000000
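
Often you only care about how the other columns relate to one target column. The sketch below uses a toy DataFrame with made-up values and selects the numeric columns explicitly with select_dtypes() before calling corr(), which is one way to keep a text column out of the calculation:

```python
import pandas as pd

# Toy data, made up for illustration; Date is deliberately left as text.
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"],
    "Close": [10.0, 11.0, 12.0, 13.0],
    "Volume": [400, 300, 200, 100],
})

# Keep only the numeric columns, then compute pairwise Pearson correlation.
corr_matrix = toy.select_dtypes(include="number").corr()

# Correlation of every other column with Close.
close_corr = corr_matrix["Close"].drop("Close")

print(corr_matrix)
print(close_corr)
```

In this toy data Volume falls steadily while Close rises, so the two columns are perfectly negatively correlated.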

Summary

Pandas is a fast and efficient Python library for working with data. It provides functions for everything from reading and writing data to manipulating and preparing it for any kind of data science task. I hope you liked this article on the important Pandas functions for Data Science. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal