By Aman Kharwal

Merging Datasets

Merging datasets is one of the highest-performance features that pandas provides in Python. In this article, I will show how to merge datasets in Python with the help of examples and real-world scenarios.

For convenience, I will define a display function:

import pandas as pd
import numpy as np

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

Categories of Merging Datasets

Pandas provides a function, pd.merge(), which implements a number of categories of joins for merging datasets: one-to-one, many-to-one, and many-to-many. Now I will show simple examples of each category.

One to one Join

The one-to-one join is the simplest category of merging datasets. Let's go through the example of a one-to-one join below:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

df2

  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014

Now, to merge the above information into a single pandas DataFrame, the pd.merge() function will be used:

df3 = pd.merge(df1, df2)
df3
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
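Here pd.merge() discovered the shared "employee" column on its own, but in practice it is usually safer to name the join key explicitly with the on keyword, so the merge does not silently pick up an unintended common column. A minimal sketch, reusing the df1 and df2 frames defined above:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Name the join key explicitly instead of relying on column-name inference
df3 = pd.merge(df1, df2, on='employee')
print(df3)
```

The result is identical to the automatic merge; the benefit is that the code now documents its own join logic.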

Many to One

A many-to-one join is one in which one of the two key columns contains duplicate values. The output DataFrame of a many-to-one join will preserve those duplicate values as appropriate entries. You will understand this better by going through the example below:

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

df3

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014

df4

         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve

pd.merge(df3, df4)

  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
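When you intend a merge to be many-to-one, pandas can verify that assumption for you: the validate argument of pd.merge() raises a MergeError if the supposedly unique side actually contains duplicate keys. A short sketch with hypothetical employee/supervisor frames in the same spirit as the example above:

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# validate='many_to_one' asserts that 'group' is unique on the right side;
# if supervisors had a duplicate group, this would raise a MergeError
merged = pd.merge(employees, supervisors, on='group', validate='many_to_one')
print(merged)
```

This costs one extra keyword and catches data-quality problems at merge time rather than downstream.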

Many-to-many

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right DataFrames contains duplicates, then the result is a many-to-many merge. This will become clearer with the example below:

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

df5

         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization

pd.merge(df1, df5)

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
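One detail worth knowing for any of these join categories: when the two DataFrames share column names that are not the join key, pandas disambiguates them by appending _x and _y suffixes, which you can override with the suffixes argument. A hypothetical sketch (the 'rank' column is invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                     'rank': [1, 2]})
right = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                      'rank': [3, 4]})

# The overlapping 'rank' columns get explicit suffixes instead of _x/_y
merged = pd.merge(left, right, on='group', suffixes=('_L', '_R'))
print(merged.columns.tolist())  # → ['group', 'rank_L', 'rank_R']
```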

Merging Datasets Example: US States Data

Merging methods come up most often when you are combining data gathered from different sources. Here I will use some datasets about the United States and its state populations. You can download the datasets below.

Now, let's take a quick look at the first five rows of all three datasets:

pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')

display('pop.head()', 'areas.head()', 'abbrevs.head()')

pop.head()

  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0

areas.head()

        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707

abbrevs.head()

        state abbreviation
0     Alabama           AL
1      Alaska           AK
2     Arizona           AZ
3    Arkansas           AR
4  California           CA

From this information, we need to compute a relatively straightforward result: rank US states and territories by their 2010 population density. We clearly have the data here to find this result, but we'll start by merging the datasets.

merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop(columns='abbreviation') # drop duplicate info
merged.head()
  state/region     ages  year  population    state
0           AL  under18  2012   1117489.0  Alabama
1           AL    total  2012   4817528.0  Alabama
2           AL  under18  2010   1130966.0  Alabama
3           AL    total  2010   4785570.0  Alabama
4           AL  under18  2011   1125763.0  Alabama

Before moving forward, let’s double-check whether there were any mismatches here, which we can do by looking for rows with null values:

merged.isnull().any()
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

We can clearly see that some of the population info is null; let's figure out which rows these are:

merged[merged['population'].isnull()].head()
     state/region     ages  year  population state
2448           PR  under18  1990         NaN   NaN
2449           PR    total  1990         NaN   NaN
2450           PR    total  1991         NaN   NaN
2451           PR  under18  1991         NaN   NaN
2452           PR    total  1993         NaN   NaN

It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to the data not being available from the original source. We also see that some of the new state entries are null, which means there was no corresponding entry in the abbrevs key! Let's figure out which regions lack this match:

merged.loc[merged['state'].isnull(), 'state/region'].unique()
array(['PR', 'USA'], dtype=object)
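Another way to audit an outer merge like this is the indicator=True option, which adds a _merge column marking each row as 'left_only', 'right_only', or 'both'. A minimal sketch with small toy frames (not the real state data):

```python
import pandas as pd

codes = pd.DataFrame({'state/region': ['AL', 'PR', 'USA']})
names = pd.DataFrame({'abbreviation': ['AL', 'AK'],
                      'state': ['Alabama', 'Alaska']})

audited = pd.merge(codes, names, how='outer',
                   left_on='state/region', right_on='abbreviation',
                   indicator=True)
# Rows that found no partner show up as 'left_only' or 'right_only'
print(audited[audited['_merge'] != 'both'])
```

This achieves the same diagnosis as the isnull() checks above, with the mismatch direction labelled for you.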

We can fix these quickly by filling in appropriate entries:

merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

Now we can merge the result with the area data using a similar procedure. Examining our results, we will want to join on the state column in both:

final = pd.merge(merged, areas, on='state', how='left')
final.head()
  state/region     ages  year  population    state  area (sq. mi)
0           AL  under18  2012   1117489.0  Alabama        52423.0
1           AL    total  2012   4817528.0  Alabama        52423.0
2           AL  under18  2010   1130966.0  Alabama        52423.0
3           AL    total  2010   4785570.0  Alabama        52423.0
4           AL  under18  2011   1125763.0  Alabama        52423.0

Again, let’s check for nulls to see if there were any mismatches:

final.isnull().any()
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

There are nulls in the area column; we can take a look to see which regions were ignored here:

final['state'][final['area (sq. mi)'].isnull()].unique()
array(['United States'], dtype=object)

We see that our areas DataFrame does not contain the area of the United States as a whole. We could insert the appropriate value, but in this case I will just drop the null values because the population density of the entire United States is not relevant to our current discussion:

final.dropna(inplace=True)
final.head()

  state/region     ages  year  population    state  area (sq. mi)
0           AL  under18  2012   1117489.0  Alabama        52423.0
1           AL    total  2012   4817528.0  Alabama        52423.0
2           AL  under18  2010   1130966.0  Alabama        52423.0
3           AL    total  2010   4785570.0  Alabama        52423.0
4           AL  under18  2011   1125763.0  Alabama        52423.0

Now we have all the data we need. To answer the question of interest, let's first select the portion of the data corresponding to the year 2010 and the total population.

data2010 = final.query("year == 2010 & ages == 'total'").copy()  # copy so the later in-place set_index is safe
data2010.head()
    state/region   ages  year  population       state  area (sq. mi)
3             AL  total  2010   4785570.0     Alabama        52423.0
91            AK  total  2010    713868.0      Alaska       656425.0
101           AZ  total  2010   6408790.0     Arizona       114006.0
189           AR  total  2010   2922280.0    Arkansas        53182.0
197           CA  total  2010  37333601.0  California       163707.0

Now let’s compute the population density and display it in order. I will start by re-indexing our data on the state, and then compute the result:

data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']
density.sort_values(ascending=False, inplace=True)
density.head()
state
District of Columbia    8898.897059
Puerto Rico             1058.665149
New Jersey              1009.253268
Rhode Island             681.339159
Connecticut              645.600649
dtype: float64
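If you only want the extremes of the ranking, Series.nlargest() and nsmallest() avoid the full sort and the in-place mutation altogether. A sketch with a small synthetic density Series (the numbers are illustrative, rounded from the output above plus two invented entries):

```python
import pandas as pd

# Hypothetical population densities (people per sq. mi)
density = pd.Series({'District of Columbia': 8898.9,
                     'New Jersey': 1009.3,
                     'Wyoming': 5.8,
                     'Alaska': 1.1})

# nlargest/nsmallest return new sorted Series without mutating `density`
print(density.nlargest(2))   # densest regions
print(density.nsmallest(2))  # sparsest regions
```

The same approach applied to the real density Series would confirm that the District of Columbia is by far the densest region, with Alaska the sparsest.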


I hope you liked this article on merging datasets; this is one of the major problems beginners face while practicing data science. Feel free to ask questions on this topic, or any other topic you want, in the comments section below.
