Merging Datasets is one of the most high-performance features, which is provided by pandas in Python. In this article, I will show how we can merge datasets in Python with the help of examples and real-world scenarios.
For convenience, I will define a display function:
import pandas as pd
import numpy as np
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
Code language: Python (python)
Categories of Merging Datasets
Pandas is blessed with a function known as pd.merge(), which implements a number of categories for merging datasets which includes – one to one, many to one, and many to many. Now I will show simple examples for all the categories of merging datasets.
One to one Join
One to one join is the most simplest category of merging datasets. Let’s go through the example of one to one join below:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
Code language: Python (python)
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
Now to merge the above information in a single pandas dataframe, pd.merge function will be used:
df3 = pd.merge(df1, df2)
df3
Code language: Python (python)
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
Many to One
Many to one join consist of data where the two key columns contain duplicate values. The output dataframe in many to one join will preserve those duplicate values as appropriate entries. You will understand this more by going through the example below:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
Code language: Python (python)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3, df4)
Code language: Python (python)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
Many-to-many
Many-to-many joins are a bit confusing conceptually but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will get more clear from the example below:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Code language: Python (python)
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df5
group | skills | |
---|---|---|
0 | Accounting | math |
1 | Accounting | spreadsheets |
2 | Engineering | coding |
3 | Engineering | linux |
4 | HR | spreadsheets |
5 | HR | organization |
pd.merge(df1, df5)
Code language: Python (python)
employee | group | skills | |
---|---|---|---|
0 | Bob | Accounting | math |
1 | Bob | Accounting | spreadsheets |
2 | Jake | Engineering | coding |
3 | Jake | Engineering | linux |
4 | Lisa | Engineering | coding |
5 | Lisa | Engineering | linux |
6 | Sue | HR | spreadsheets |
7 | Sue | HR | organization |
Merging Datasets Example: US States Data
Merging Datasets methods are mostly used when you gather data by collecting different sources, I will use some datasets regarding the United Nations states and their populations. You can download the datasets below.
Now, lets take a quick look at the first five rows in all the three datasets:
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
display('pop.head()', 'areas.head()', 'abbrevs.head()')
Code language: Python (python)
pop.head()
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
areas.head()
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
abbrevs.head()
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
From this information, we need to compute a relatively straightforward result: rank US states and territories by their 2010 population density. We clearly have the data here to find this result, but we’ll have started with merging datasets to find the result.
merged = pd.merge(pop, abbrevs, how='outer',
left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', 1) # drop duplicate info
merged.head()
Code language: Python (python)
state/region | ages | year | population | state | |
---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama |
1 | AL | total | 2012 | 4817528.0 | Alabama |
2 | AL | under18 | 2010 | 1130966.0 | Alabama |
3 | AL | total | 2010 | 4785570.0 | Alabama |
4 | AL | under18 | 2011 | 1125763.0 | Alabama |
Before moving forward, let’s double-check whether there were any mismatches here, which we can do by looking for rows with null values:
merged.isnull().any()
Code language: Python (python)
state/region False ages False year False population True state True dtype: bool
We can clearly see that, some of the population
info is null; let’s figure out which these are:
merged[merged['population'].isnull()].head()
Code language: Python (python)
state/region | ages | year | population | state | |
---|---|---|---|---|---|
2448 | PR | under18 | 1990 | NaN | NaN |
2449 | PR | total | 1990 | NaN | NaN |
2450 | PR | total | 1991 | NaN | NaN |
2451 | PR | under18 | 1991 | NaN | NaN |
2452 | PR | total | 1993 | NaN | NaN |
It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to this data not being available from the original source. We see also that some of the new state
entries are also null, which means that there was no corresponding entry in the abbrevs
key! Let’s figure out which regions lack this match:
merged.loc[merged['state'].isnull(), 'state/region'].unique()
Code language: Python (python)
array(['PR', 'USA'], dtype=object)
We can fix these quickly by filling in appropriate entries:
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
Code language: Python (python)
state/region False ages False year False population True state False dtype: bool
Now we can merge the result with the area data using a similar procedure. Examining our results, we will want to join on the state
column in both:
final = pd.merge(merged, areas, on='state', how='left')
final.head()
Code language: Python (python)
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
Again, let’s check for nulls to see if there were any mismatches:
final.isnull().any()
Code language: Python (python)
state/region False ages False year False population True state False area (sq. mi) True dtype: bool
There are nulls in the area
column; we can take a look to see which regions were ignored here:
final['state'][final['area (sq. mi)'].isnull()].unique()
Code language: Python (python)
We see that our areas
DataFrame
does not contain the area of the United States as a whole. We could insert the appropriate value, but in this case I will just drop the null values because the population density of the entire United States is not relevant to our current discussion:
final.dropna(inplace=True)
final.head()
Code language: Python (python)
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
Now we have all the data we need. To answer the question of interest, let’s first select the portion of the data corresponding with the year 2000, and the total population.
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()
Code language: Python (python)
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
91 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |
101 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |
189 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |
197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |
Now let’s compute the population density and display it in order. I will start by re-indexing our data on the state, and then compute the result:
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']
density.sort_values(ascending=False, inplace=True)
density.head()
Code language: Python (python)
state District of Columbia 8898.897059 Puerto Rico 1058.665149 New Jersey 1009.253268 Rhode Island 681.339159 Connecticut 645.600649 dtype: float64
Also, read – Model Selection Technique in Machine Learning.
I hope you liked this article on Merging Datasets, this is one of the major problems beginners face while practicing data science. Feel free to asks questions on this topic or any other topic that you want in the comments section below.