Pandas Tutorial for Data Science

In this tutorial we'll look in detail at the data structures that the Pandas library provides for Data Science.

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Pandas Series

Data Type Name – Series

  • There are some differences worth noting between ndarrays and Series objects. First, elements in NumPy arrays are accessed by their integer position, starting with zero for the first element.
  • A pandas Series object is more flexible because you can define your own labeled index and use it to access elements. You can use letters instead of numbers, or number the elements in descending order instead of ascending order.
  • Second, a Series aligns data by label: operations between two Series match values on their index labels, which makes tasks such as handling missing values more convenient than with ndarrays. If a label has no match during alignment, pandas returns NaN (Not a Number) so that the operation does not fail.
import numpy as np # linear algebra
import pandas as pd # data processing
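
To make the first difference concrete, here is a minimal sketch that stores the same values in an ndarray and in a Series with a custom index. The variable names and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# An ndarray is addressed by integer position only.
first_by_position = values[0]

# A Series can carry its own labels, for example letters...
lettered = pd.Series(values, index=['a', 'b', 'c'])
by_label = lettered['b']

# ...or integers in descending order. .loc looks up the label, not the position.
descending = pd.Series(values, index=[3, 2, 1])
labelled_three = descending.loc[3]  # label 3 belongs to the first element

print(first_by_position, by_label, labelled_three)
```

Using .loc for the integer-labelled Series keeps label lookups unambiguous, since a bare integer in brackets can be read as either a label or a position.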

Creating a Series using Pandas

You can convert a list, a NumPy array, or a dictionary to a Series in the following manner:

labels = ['w','x','y','z']
# Note: the names `list` and `dict` shadow the Python built-ins for the rest of the session
list = [10,20,30,40]
array = np.array([10,20,30,40])
dict = {'w':10,'x':20,'y':30,'z':40}

Using Lists

pd.Series(data=list)
0    10
1    20
2    30
3    40
dtype: int64
pd.Series(data=list,index=labels)
w    10
x    20
y    30
z    40
dtype: int64
pd.Series(list,labels)
w    10
x    20
y    30
z    40
dtype: int64

Using NumPy Arrays to create Series

pd.Series(array)
0    10
1    20
2    30
3    40
dtype: int64
pd.Series(array,labels)
w    10
x    20
y    30
z    40
dtype: int64

Using a Dictionary to create Series

pd.Series(dict)
w    10
x    20
y    30
z    40
dtype: int64

Using an Index

The following two Series show how a custom index labels each value:

sports1 = pd.Series([1,2,3,4],index = ['Cricket', 'Football','Basketball', 'Golf']) 
sports1
Cricket       1
Football      2
Basketball    3
Golf          4
dtype: int64
sports2 = pd.Series([1,2,5,4],index = ['Cricket', 'Football','Baseball', 'Golf'])
sports2
Cricket     1
Football    2
Baseball    5
Golf        4
dtype: int64
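
With these two Series, a label lookup and label alignment can be sketched as follows. The addition is an illustrative operation chosen here, not part of the original example; any arithmetic between the two Series aligns the same way.

```python
import pandas as pd

sports1 = pd.Series([1, 2, 3, 4], index=['Cricket', 'Football', 'Basketball', 'Golf'])
sports2 = pd.Series([1, 2, 5, 4], index=['Cricket', 'Football', 'Baseball', 'Golf'])

# Look up a single value by its label.
cricket = sports1['Cricket']

# Arithmetic aligns on labels. 'Basketball' and 'Baseball' each appear in
# only one Series, so their sums are NaN instead of raising an error.
total = sports1 + sports2

print(cricket)
print(total)
```

Because NaN is a floating-point value, the result is stored as float64 even though both inputs hold integers.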

DataFrames

The DataFrame concept in Python is similar to that of the R programming language. A DataFrame is a collection of Series that share the same index.

from numpy.random import randn
np.random.seed(1)
dataframe = pd.DataFrame(randn(10,5),index='A B C D E F G H I J'.split(),columns='Score1 Score2 Score3 Score4 Score5'.split())
dataframe

Selection and Indexing

There are several ways to select data from a DataFrame. Indexing with a column name returns that column as a Series:

dataframe['Score3']
A   -0.528172
B   -0.761207
C   -0.322417
D   -0.877858
E    0.901591
F   -0.935769
G   -0.687173
H    0.234416
I   -0.747158
J    2.100255
Name: Score3, dtype: float64
# Pass a list of column names in any order necessary
dataframe[['Score2','Score1']]

Each DataFrame column is a Series

type(dataframe['Score1'])
pandas.core.series.Series

Adding a new column to the DataFrame

dataframe['Score6'] = dataframe['Score1'] + dataframe['Score2']
dataframe

Removing Columns from DataFrame

dataframe = dataframe.drop('Score6',axis=1) # drop returns a new DataFrame, so assign it back; axis=0 drops rows, axis=1 drops columns
dataframe
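
Note that drop does not modify the DataFrame it is called on; it returns a new one. A minimal sketch, using a small made-up frame named scores:

```python
import pandas as pd

scores = pd.DataFrame({'Score1': [1, 2], 'Score2': [3, 4]}, index=['A', 'B'])

# axis=1 removes a column, axis=0 removes a row; both return new objects.
without_col = scores.drop('Score2', axis=1)
without_row = scores.drop('A', axis=0)

# The original frame still has both columns and both rows.
print(scores.shape, without_col.shape, without_row.shape)
```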

Selecting Rows

dataframe.loc['F']
Score1   -0.683728
Score2   -0.122890
Score3   -0.935769
Score4   -0.267888
Score5    0.530355
Name: F, dtype: float64

To select by integer position instead of by label, use the iloc indexer instead of loc:

dataframe.iloc[2]
Score1    1.462108
Score2   -2.060141
Score3   -0.322417
Score4   -0.384054
Score5    1.133769
Name: C, dtype: float64
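
Both indexers also accept a row and a column together. The sketch below rebuilds the same random frame under the name frame (an illustrative name) and reads one cell both ways:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
frame = pd.DataFrame(np.random.randn(10, 5),
                     index='A B C D E F G H I J'.split(),
                     columns='Score1 Score2 Score3 Score4 Score5'.split())

# Same cell, addressed by label and by integer position.
by_label = frame.loc['C', 'Score3']
by_position = frame.iloc[2, 2]

# Lists of labels select a sub-table.
block = frame.loc[['A', 'B'], ['Score1', 'Score2']]

print(by_label == by_position, block.shape)
```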

Conditional Selection

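Similar to NumPy, we can make conditional selections using bracket notation. A comparison returns a boolean DataFrame, and indexing with it keeps the matching values and replaces the rest with NaN: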

dataframe>0.5
dataframe[dataframe>0.5]
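
A condition on a single column filters whole rows instead of masking individual values. Here is a small sketch with made-up numbers so that the result is easy to check by eye:

```python
import pandas as pd

frame = pd.DataFrame({'Score1': [0.9, -0.2, 0.7],
                      'Score2': [0.1, 0.8, 0.6]},
                     index=['A', 'B', 'C'])

# Whole-frame mask: values that fail the test become NaN.
masked = frame[frame > 0.5]

# Column condition: keep only the rows where Score1 exceeds 0.5.
rows = frame[frame['Score1'] > 0.5]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
both = frame[(frame['Score1'] > 0.5) & (frame['Score2'] > 0.5)]

print(masked)
print(rows)
print(both)
```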

Missing Data

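Pandas provides methods for dealing with missing data: dropna removes rows that contain missing values, and fillna replaces missing values with a given value.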

dataframe = pd.DataFrame({'Cricket':[1,2,np.nan,4,6,7,2,np.nan],
                  'Baseball':[5,np.nan,np.nan,5,7,2,4,5],
                  'Tennis':[1,2,3,4,5,6,7,8]})
dataframe
dataframe.dropna()
dataframe.fillna(value=0)
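
Both methods take options that control what is dropped or filled. The following sketch reuses the same data and shows a few of them. Filling with the column mean is just one possible strategy, picked here for illustration.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                          'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                          'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})

# Default: drop every row that has at least one missing value.
complete_rows = dataframe.dropna()

# axis=1 drops columns with missing values instead of rows.
complete_cols = dataframe.dropna(axis=1)

# thresh=2 keeps rows that have at least two non-missing values.
mostly_complete = dataframe.dropna(thresh=2)

# Fill one column's gaps with that column's mean.
filled = dataframe['Cricket'].fillna(dataframe['Cricket'].mean())

print(len(complete_rows), list(complete_cols.columns), len(mostly_complete))
```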

Groupby

The groupby method groups rows by the values in a column so that aggregate functions can be applied to each group.

# Create dataframe as given below
dat = {'CustID':['1001','1001','1002','1002','1003','1003'],
       'CustName':['UIPat','DatRob','Goog','Chrysler','Ford','GM'],
       'Profitinlakhs':[2005,3245,1245,8765,5463,3547]}
dataframe = pd.DataFrame(dat)
dataframe

We can now use the .groupby() method to group rows together based on a column name.

For example let’s group based on CustID. This will create a DataFrameGroupBy object:

CustID_grouped = dataframe.groupby("CustID")

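Now we can call aggregate methods on the grouped object. CustName holds text, so numeric_only=True restricts the mean to the numeric column:

CustID_grouped.mean(numeric_only=True)

Or we can chain the groupby call and the aggregation in one expression:

dataframe.groupby('CustID').mean(numeric_only=True)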
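
Several aggregates can be computed in one pass by selecting a column and passing a list of function names to agg. This is a sketch on the same data; the choice of sum, mean and max is illustrative.

```python
import pandas as pd

dat = {'CustID': ['1001', '1001', '1002', '1002', '1003', '1003'],
       'CustName': ['UIPat', 'DatRob', 'Goog', 'Chrysler', 'Ford', 'GM'],
       'Profitinlakhs': [2005, 3245, 1245, 8765, 5463, 3547]}
dataframe = pd.DataFrame(dat)

# One row per CustID, one column per aggregate.
summary = dataframe.groupby('CustID')['Profitinlakhs'].agg(['sum', 'mean', 'max'])

print(summary)
```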

Merging, Joining, and Concatenating

There are 3 important ways of combining DataFrames together:

  • Merging
  • Joining
  • Concatenating

Example DataFrames

dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                        'Sales': [13456, 45321, 54385, 53212],
                        'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3'],
                        'Prime': ['yes', 'no', 'no', 'yes']},
                        index=[0, 1, 2, 3])
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                        'Sales': [13456, 54385, 53212, 4534],
                        'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7'],
                        'Imp': ['yes', 'no', 'no', 'no']},
                         index=[4, 5, 6, 7])
dafa3 = pd.DataFrame({'CustID': ['101', '104', '105', '106'],
                        'Sales': [13456, 53212, 4534, 3241],
                        'Pol': ['CAT8', 'CAT9', 'CAT10', 'CAT11'],
                        'Level': ['yes', 'no', 'no', 'yes']},
                        index=[8, 9, 10, 11])

Concatenation

Concatenation glues DataFrames together along rows (axis=0) or columns (axis=1). Labels on the other axis are aligned, and pandas fills any position that has no matching label with NaN.

pd.concat([dafa1,dafa2])
pd.concat([dafa1,dafa2,dafa3],axis=1)
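
To see what alignment does to the result, the sketch below concatenates the first two frames along each axis and checks the shapes. The names stacked and side_by_side are illustrative.

```python
import pandas as pd

dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                      'Sales': [13456, 45321, 54385, 53212],
                      'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3'],
                      'Prime': ['yes', 'no', 'no', 'yes']},
                     index=[0, 1, 2, 3])
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                      'Sales': [13456, 54385, 53212, 4534],
                      'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7'],
                      'Imp': ['yes', 'no', 'no', 'no']},
                     index=[4, 5, 6, 7])

# axis=0 stacks rows; columns found in only one frame are filled with NaN.
stacked = pd.concat([dafa1, dafa2])

# axis=1 places frames side by side; the row labels 0-3 and 4-7 do not
# overlap, so each row is NaN in the columns of the other frame.
side_by_side = pd.concat([dafa1, dafa2], axis=1)

print(stacked.shape, side_by_side.shape)
```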

Merging

Just like joins between SQL tables, the pandas merge function combines DataFrames on common columns.

pd.merge(dafa1,dafa2,how='outer',on='CustID')
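
The how argument decides which keys survive the merge. Here is a sketch comparing the options on the first two frames, trimmed to the columns that matter. The indicator=True flag is an addition used here only to show where each row came from.

```python
import pandas as pd

dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                      'Sales': [13456, 45321, 54385, 53212]})
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                      'Sales': [13456, 54385, 53212, 4534]})

# inner keeps only keys present in both frames.
inner = pd.merge(dafa1, dafa2, how='inner', on='CustID')

# left keeps every key from the first frame.
left = pd.merge(dafa1, dafa2, how='left', on='CustID')

# outer keeps keys from either frame; _merge records the origin of each row.
outer = pd.merge(dafa1, dafa2, how='outer', on='CustID', indicator=True)

print(len(inner), len(left), len(outer))
print(outer)
```

Because both frames have a Sales column that is not part of the key, pandas keeps both and renames them Sales_x and Sales_y.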

Joining

Join combines the columns of two DataFrames, which may have different index values, into a single DataFrame.

The main difference between merge and join is that merge combines two DataFrames on common columns by default, whereas join combines them on the row index.

daf3 = pd.DataFrame({'Q1': ['101', '102', '103'],
                     'Q2': ['201', '202', '203']},
                      index=['I0', 'I1', 'I2']) 

dafa4 = pd.DataFrame({'Q3': ['301', '302', '303'],
                    'Q4': ['401', '402', '403']},
                      index=['I0', 'I2', 'I3'])
daf3.join(dafa4)
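
By default join keeps the index of the calling frame (a left join). The how argument changes that, as this sketch on the same two frames shows:

```python
import pandas as pd

daf3 = pd.DataFrame({'Q1': ['101', '102', '103'],
                     'Q2': ['201', '202', '203']},
                    index=['I0', 'I1', 'I2'])
dafa4 = pd.DataFrame({'Q3': ['301', '302', '303'],
                      'Q4': ['401', '402', '403']},
                     index=['I0', 'I2', 'I3'])

# Default (left): rows I0, I1, I2; I1 has no match, so Q3 and Q4 are NaN.
left = daf3.join(dafa4)

# outer: union of both indexes.
outer = daf3.join(dafa4, how='outer')

# inner: only the labels found in both frames.
inner = daf3.join(dafa4, how='inner')

print(list(left.index), list(outer.index), list(inner.index))
```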

I hope you liked this tutorial on Pandas for Data Science. Feel free to comment with the topic you would like covered in the next tutorial.

Aman Kharwal