
In this tutorial we’ll build knowledge by looking in detail at the data structures provided by the Pandas library for Data Science.
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
Pandas Series
Data Type Name – Series
- There are some differences worth noting between ndarrays and Series objects. First of all, elements in NumPy arrays are accessed by their integer position, starting with zero for the first element.
- A pandas Series object is more flexible because you can define your own labeled index and use it to access elements. The labels can be letters instead of numbers, or numbers in descending instead of ascending order.
- Second, aligning data from different Series by matching labels is more convenient than with ndarrays, for example when dealing with missing values. If a label has no match during alignment, pandas returns NaN (not a number) for that label so that the operation does not fail.
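To make the first difference concrete, here is a minimal sketch comparing positional access on an ndarray with label access on a Series. The variable names and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# An ndarray is accessed by integer position, starting at zero.
arr = np.array([10, 20, 30])
first_from_array = arr[0]

# A Series can carry its own labels, here letters instead of numbers.
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
value_by_label = ser['b']          # look up by label
value_by_position = ser.iloc[0]    # positional access is still available via .iloc

# Labels can also be numbers in descending order.
descending = pd.Series([10, 20, 30], index=[3, 2, 1])
value_at_label_3 = descending.loc[3]   # label 3 refers to the first element
```

Using .loc and .iloc explicitly avoids any ambiguity between labels and positions when the labels are themselves integers.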
import numpy as np # linear algebra
import pandas as pd # data processing
Creating a Series using Pandas
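You can convert a list, a NumPy array, or a dictionary to a Series in the following manner:
labels = ['w', 'x', 'y', 'z']
list = [10, 20, 30, 40]  # note: this name shadows Python's built-in list
array = np.array([10, 20, 30, 40])
dict = {'w': 10, 'x': 20, 'y': 30, 'z': 40}  # note: this name shadows Python's built-in dict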
Using Lists
pd.Series(data=list)
0    10
1    20
2    30
3    40
dtype: int64
pd.Series(data=list, index=labels)
w    10
x    20
y    30
z    40
dtype: int64
pd.Series(list, labels)
w    10
x    20
y    30
z    40
dtype: int64
Using NumPy Arrays to create Series
pd.Series(array)
0    10
1    20
2    30
3    40
dtype: int64
pd.Series(array, labels)
w    10
x    20
y    30
z    40
dtype: int64
Using Dictionary to create series
pd.Series(dict)
w    10
x    20
y    30
z    40
dtype: int64
Using an Index
We shall now see how to index a Series, using the following two Series as examples:
sports1 = pd.Series([1, 2, 3, 4], index=['Cricket', 'Football', 'Basketball', 'Golf'])
sports1
Cricket       1
Football      2
Basketball    3
Golf          4
dtype: int64
sports2 = pd.Series([1, 2, 5, 4], index=['Cricket', 'Football', 'Baseball', 'Golf'])
sports2
Cricket     1
Football    2
Baseball    5
Golf        4
dtype: int64
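With these two Series we can sketch both label lookup and label alignment. The snippet below rebuilds sports1 and sports2 so it runs on its own. The names golf, total and filled are introduced here for illustration, and using add with fill_value=0 is just one possible way to treat a missing label as zero.

```python
import pandas as pd

sports1 = pd.Series([1, 2, 3, 4],
                    index=['Cricket', 'Football', 'Basketball', 'Golf'])
sports2 = pd.Series([1, 2, 5, 4],
                    index=['Cricket', 'Football', 'Baseball', 'Golf'])

# Look up a single value by its label.
golf = sports1['Golf']

# Arithmetic aligns on labels. Basketball and Baseball each appear in
# only one Series, so their sums are NaN instead of raising an error.
total = sports1 + sports2

# Alternative: treat a label missing from one Series as 0 before adding.
filled = sports1.add(sports2, fill_value=0)
```

Because NaN is a floating-point value, the result of the addition is a float Series even though both inputs hold integers.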
DataFrames
The DataFrame concept in pandas is similar to the data frame in the R programming language. A DataFrame is a collection of Series that share the same index.
from numpy.random import randn
np.random.seed(1)
dataframe = pd.DataFrame(randn(10, 5),
                         index='A B C D E F G H I J'.split(),
                         columns='Score1 Score2 Score3 Score4 Score5'.split())
dataframe
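The idea that a DataFrame is a set of Series sharing one index can be sketched by building a small DataFrame from a dictionary of Series. The names score1, score2 and frame and their values are hypothetical.

```python
import pandas as pd

# Two Series with overlapping, but not identical, labels.
score1 = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
score2 = pd.Series([4.0, 5.0], index=['A', 'B'])

# Each dictionary key becomes a column name. The rows are the union of
# the Series labels, and a label missing from a Series becomes NaN.
frame = pd.DataFrame({'Score1': score1, 'Score2': score2})
```

Here score2 has no value for label C, so that cell of the Score2 column is NaN.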

Selection and Indexing
Ways in which we can grab data from a DataFrame
dataframe['Score3']
A   -0.528172
B   -0.761207
C   -0.322417
D   -0.877858
E    0.901591
F   -0.935769
G   -0.687173
H    0.234416
I   -0.747158
J    2.100255
Name: Score3, dtype: float64
# Pass a list of column names in any order necessary
dataframe[['Score2', 'Score1']]

Each DataFrame column is a Series
type(dataframe['Score1'])
pandas.core.series.Series
Adding a new column to the DataFrame
dataframe['Score6'] = dataframe['Score1'] + dataframe['Score2']
dataframe

Removing Columns from DataFrame
dataframe.drop('Score6',axis=1) # Use axis=0 for dropping rows and axis=1 for dropping columns
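One detail worth sketching: drop returns a new DataFrame and leaves the original untouched unless the result is assigned back. The small frame below uses made-up values, and the names df, dropped, still_there and without_row_a exist only for this example.

```python
import pandas as pd

df = pd.DataFrame({'Score1': [1.0, 2.0, 3.0],
                   'Score2': [4.0, 5.0, 6.0]},
                  index=['A', 'B', 'C'])
df['Score6'] = df['Score1'] + df['Score2']

# drop returns a copy without the column; df itself is unchanged.
dropped = df.drop('Score6', axis=1)
still_there = 'Score6' in df.columns

# Assign the result back to make the removal stick.
df = df.drop(columns='Score6')

# axis=0 (the default) drops rows by index label instead.
without_row_a = df.drop('A', axis=0)
```

The columns='Score6' form is equivalent to passing the label with axis=1 and can be easier to read.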

Selecting Rows
dataframe.loc['F']
Score1   -0.683728
Score2   -0.122890
Score3   -0.935769
Score4   -0.267888
Score5    0.530355
Name: F, dtype: float64
Or select by integer position instead of label: use the iloc indexer instead of loc
dataframe.iloc[2]
Score1    1.462108
Score2   -2.060141
Score3   -0.322417
Score4   -0.384054
Score5    1.133769
Name: C, dtype: float64
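Both loc and iloc also accept a row selector and a column selector together, which lets us pick a single cell or a sub-block. The following is a small sketch on made-up data; the frame df and the result names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Score1': [1.0, 2.0, 3.0],
                   'Score2': [4.0, 5.0, 6.0],
                   'Score3': [7.0, 8.0, 9.0]},
                  index=['A', 'B', 'C'])

# Single cell: row label and column label with loc,
# or row position and column position with iloc.
cell = df.loc['B', 'Score2']
same_cell = df.iloc[1, 1]

# Sub-block: lists of row labels and column labels.
block = df.loc[['A', 'C'], ['Score1', 'Score3']]

# Slices: loc includes the end label, iloc excludes the end position.
label_slice = df.loc['A':'B']
pos_slice = df.iloc[0:2]
```

Note the slicing difference: both slices above return rows A and B, because the label slice is inclusive while the positional slice stops before position 2.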
Conditional Selection
As with NumPy, we can make conditional selections by passing a boolean mask inside brackets
dataframe>0.5

dataframe[dataframe>0.5]
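A common variation is to filter whole rows by a condition on one column, or on several columns at once. Here is a hedged sketch on a small made-up frame; df and the result names are not part of the tutorial's data.

```python
import pandas as pd

df = pd.DataFrame({'Score1': [0.9, -0.3, 1.2],
                   'Score2': [-1.0, 0.7, 0.8]},
                  index=['A', 'B', 'C'])

# A whole-frame mask keeps the shape and puts NaN where the test is False.
masked = df[df > 0.5]

# A mask built from one column keeps only the rows where it is True.
rows = df[df['Score1'] > 0.5]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = df[(df['Score1'] > 0.5) & (df['Score2'] > 0.5)]
```

Python's and/or keywords do not work element-wise on Series, which is why the & and | operators are used here.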

Missing Data
Methods to deal with missing data in Pandas
dataframe = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                          'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                          'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})
dataframe

dataframe.dropna()

dataframe.fillna(value=0)
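dropna and fillna take a few options that are often useful. The sketch below rebuilds the same data under the name df; the choice of thresh=2 and of filling with the column mean are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                   'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                   'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})

# Default: drop every row that has at least one missing value.
complete_rows = df.dropna()

# axis=1 drops columns with missing values instead of rows.
complete_cols = df.dropna(axis=1)

# thresh=2 keeps rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)

# Fill each column's gaps with that column's mean instead of a constant.
mean_filled = df.fillna(df.mean())
```

Like drop, both methods return new DataFrames, so df still contains its missing values afterwards.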

Groupby
The groupby method is used to group rows together and perform aggregate functions
# Create dataframe as given below
dat = {'CustID': ['1001', '1001', '1002', '1002', '1003', '1003'],
       'CustName': ['UIPat', 'DatRob', 'Goog', 'Chrysler', 'Ford', 'GM'],
       'Profitinlakhs': [2005, 3245, 1245, 8765, 5463, 3547]}
dataframe = pd.DataFrame(dat)
dataframe

We can now use the .groupby() method to group rows together based on a column name.
For example let’s group based on CustID. This will create a DataFrameGroupBy object:
CustID_grouped = dataframe.groupby("CustID")
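Now we can call aggregate methods on the grouped object. Passing numeric_only=True restricts the mean to numeric columns; recent pandas versions (2.0 and later) raise a TypeError if the mean is attempted on the string column CustName.
CustID_grouped.mean(numeric_only=True)

Or we can chain groupby and the aggregation in a single expression
dataframe.groupby('CustID').mean(numeric_only=True)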
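Other aggregations work the same way. As a sketch, we can select the numeric column first and then compute a sum, or several statistics at once with agg. The names df, grouped, totals and summary are introduced here for illustration.

```python
import pandas as pd

dat = {'CustID': ['1001', '1001', '1002', '1002', '1003', '1003'],
       'CustName': ['UIPat', 'DatRob', 'Goog', 'Chrysler', 'Ford', 'GM'],
       'Profitinlakhs': [2005, 3245, 1245, 8765, 5463, 3547]}
df = pd.DataFrame(dat)

grouped = df.groupby('CustID')

# Selecting the column first keeps the string column out of the aggregation.
totals = grouped['Profitinlakhs'].sum()

# agg computes several statistics per group in one call.
summary = grouped['Profitinlakhs'].agg(['mean', 'min', 'max'])
```

totals is a Series indexed by CustID, while summary is a DataFrame with one column per statistic.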

Merging, Joining, and Concatenating
There are 3 important ways of combining DataFrames together:
- Merging
- Joining
- Concatenating
Example DataFrames
dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                      'Sales': [13456, 45321, 54385, 53212],
                      'Priority': ['CAT0', 'CAT1', 'CAT2', 'CAT3'],
                      'Prime': ['yes', 'no', 'no', 'yes']},
                     index=[0, 1, 2, 3])
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                      'Sales': [13456, 54385, 53212, 4534],
                      'Payback': ['CAT4', 'CAT5', 'CAT6', 'CAT7'],
                      'Imp': ['yes', 'no', 'no', 'no']},
                     index=[4, 5, 6, 7])
dafa3 = pd.DataFrame({'CustID': ['101', '104', '105', '106'],
                      'Sales': [13456, 53212, 4534, 3241],
                      'Pol': ['CAT8', 'CAT9', 'CAT10', 'CAT11'],
                      'Level': ['yes', 'no', 'no', 'yes']},
                     index=[8, 9, 10, 11])
Concatenation
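Concatenation stacks DataFrames either along rows (axis=0, the default) or along columns (axis=1). The DataFrames do not need identical shapes: pandas aligns them on the other axis and fills any label that is missing from one of them with NaN, so matching column names (for axis=0) or matching index labels (for axis=1) keep the result free of gaps.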
pd.concat([dafa1,dafa2])

pd.concat([dafa1,dafa2,dafa3],axis=1)
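To see how concat treats labels that do not line up, here is a minimal sketch with two small made-up frames. The names left and right and their values are assumptions chosen to keep the output easy to follow.

```python
import pandas as pd

left = pd.DataFrame({'CustID': ['101', '102'],
                     'Sales': [13456, 45321]},
                    index=[0, 1])
right = pd.DataFrame({'CustID': ['103', '104'],
                      'Prime': ['no', 'yes']},
                     index=[1, 2])

# Row-wise: columns are the union of both frames, gaps become NaN,
# and the original index labels are kept (label 1 appears twice).
stacked = pd.concat([left, right])

# ignore_index=True replaces the old labels with a fresh 0..n-1 index.
renumbered = pd.concat([left, right], ignore_index=True)

# Column-wise: rows are aligned on the index, gaps become NaN.
side_by_side = pd.concat([left, right], axis=1)
```

In side_by_side only index label 1 exists in both frames, so the rows for labels 0 and 2 each contain NaN in the columns that came from the other frame.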

Merging
Just like joining tables in SQL, the merge function in pandas lets us combine DataFrames on the values of one or more key columns.
pd.merge(dafa1,dafa2,how='outer',on='CustID')
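The how argument decides which keys survive the merge. The sketch below uses trimmed-down copies of dafa1 and dafa2 (fewer columns, default index) to compare three settings; the result names inner, outer and left are introduced for this example.

```python
import pandas as pd

dafa1 = pd.DataFrame({'CustID': ['101', '102', '103', '104'],
                      'Sales': [13456, 45321, 54385, 53212],
                      'Prime': ['yes', 'no', 'no', 'yes']})
dafa2 = pd.DataFrame({'CustID': ['101', '103', '104', '105'],
                      'Sales': [13456, 54385, 53212, 4534],
                      'Imp': ['yes', 'no', 'no', 'no']})

# inner: only CustIDs present in both frames.
inner = pd.merge(dafa1, dafa2, how='inner', on='CustID')

# outer: every CustID from either frame, with NaN where a side is missing.
outer = pd.merge(dafa1, dafa2, how='outer', on='CustID')

# left: every CustID from the first frame only.
left = pd.merge(dafa1, dafa2, how='left', on='CustID')
```

Because both frames have a Sales column that is not part of the key, pandas keeps both and names them Sales_x and Sales_y by default.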

Joining
Join combines the columns of two DataFrames, which may have different index values, into a single DataFrame.
The main difference between merge and join is that merge combines two DataFrames on common columns by default, whereas join combines them on the row index by default.
daf3 = pd.DataFrame({'Q1': ['101', '102', '103'],
                     'Q2': ['201', '202', '203']},
                    index=['I0', 'I1', 'I2'])
dafa4 = pd.DataFrame({'Q3': ['301', '302', '303'],
                      'Q4': ['401', '402', '403']},
                     index=['I0', 'I2', 'I3'])
daf3.join(dafa4)
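join also takes a how argument. By default it keeps the index of the calling DataFrame (a left join), as the following sketch on the same two frames shows. The result names are introduced here for illustration.

```python
import pandas as pd

daf3 = pd.DataFrame({'Q1': ['101', '102', '103'],
                     'Q2': ['201', '202', '203']},
                    index=['I0', 'I1', 'I2'])
dafa4 = pd.DataFrame({'Q3': ['301', '302', '303'],
                      'Q4': ['401', '402', '403']},
                     index=['I0', 'I2', 'I3'])

# Default (how='left'): keep daf3's index; I1 has no match, so Q3/Q4 are NaN.
left_join = daf3.join(dafa4)

# how='outer': keep every index label from either frame.
outer_join = daf3.join(dafa4, how='outer')

# how='inner': keep only the labels present in both frames.
inner_join = daf3.join(dafa4, how='inner')
```

The label I3 exists only in dafa4, so it shows up in the outer join but not in the default left join.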

I hope you liked this tutorial on Pandas for Data Science. Leave a comment with the topic you would like covered in the next tutorial.