Data Visualization with Seaborn

Before learning Seaborn, it helps to know why it exists. Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. Several valid complaints about Matplotlib come up often:

  • Before version 2.0, Matplotlib’s defaults were not exactly the best choices. They were based on MATLAB circa 1999, and this often shows.
  • Matplotlib’s API is relatively low level. Doing sophisticated statistical visualization is possible, but usually requires a lot of boilerplate code.
  • Matplotlib predated Pandas by more than a decade and thus was not designed for use with Pandas DataFrames. To visualize data from a Pandas DataFrame, you must extract each Series and often concatenate them together into the right format. It would be more helpful to have a plotting library that can intelligently use the DataFrame labels in a plot.

An answer to these problems is Seaborn. Seaborn provides an API on top of Matplotlib that offers rational choices for plot style and colour defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.

The Matplotlib team has been addressing this: it added the plt.style tools and has started to handle Pandas data more seamlessly. The 2.0 release of the library introduced a new default stylesheet that improves on the old defaults. But for all the reasons just discussed, Seaborn remains a handy add-on.
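As a small illustration of the DataFrame integration mentioned above, here is a minimal sketch. The toy data and its column names are made up for this example; the point is only that columns are passed by name and the axis labels are taken from them.

```python
import matplotlib
matplotlib.use("Agg")  # only needed in a plain script without a display; skip in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the column names are invented for this illustration
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [52, 58, 69, 80],
    "group": ["a", "a", "b", "b"],
})

# Columns are referenced by name; there is no manual extraction of each Series
ax = sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group")

# Seaborn labels the axes from the DataFrame column names
print(ax.get_xlabel(), ax.get_ylabel())
```

The same pattern (data=..., x=..., y=..., hue=...) is shared by most Seaborn plotting functions.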

Seaborn Vs. Matplotlib

Here is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors. We start with the typical imports:

import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
import pandas as pd

Now we create some random walk data:

# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)
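To see why a cumulative sum of random steps produces a random walk, here is a tiny sketch with hand-picked steps instead of random ones. The step values are arbitrary; each column is one walker, matching the six columns in the data above.

```python
import numpy as np

# Three time steps for two independent walkers (rows are steps, columns are walkers)
steps = np.array([[ 1, -1],
                  [ 2,  0],
                  [-1,  3]])

# Cumulative sum down the rows (axis 0) gives each walker's position over time
positions = np.cumsum(steps, axis=0)
print(positions)
```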

And do a simple plot:

# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')

Although the result contains all the information we’d like it to convey, it does so in a way that is not all that aesthetically pleasing. It even looks a bit old-fashioned in the context of 21st-century data visualization.

Now let’s take a look at how it works with Seaborn. As we will see, Seaborn has many high-level plotting routines of its own, but it can also overwrite Matplotlib’s default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn’s set() function (an alias of set_theme() in recent releases). By convention, Seaborn is imported as sns:

import seaborn as sns
sns.set()

Now let’s rerun the same two lines as before:

# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')
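The reason the same two lines look different is that Seaborn changes Matplotlib’s rc parameters. A rough way to inspect this, without touching any global state, is to ask Seaborn which parameters a given style would set. The two keys printed here are just a sample picked for illustration.

```python
import seaborn as sns

# axes_style() returns the Matplotlib rc parameters that a Seaborn style would set,
# without changing any global state
darkgrid = sns.axes_style("darkgrid")   # the style that sns.set() applies by default
white = sns.axes_style("white")

for key in ["axes.grid", "axes.facecolor"]:
    print(key, "| darkgrid:", darkgrid[key], "| white:", white[key])
```

The same function doubles as a context manager (with sns.axes_style('white'): ...), which applies a style temporarily to a single plot.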

Exploring Seaborn Visualization

The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.

Let’s take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is what Seaborn does under the hood), but the Seaborn API is much more convenient.

Histograms, KDE, and densities with Seaborn

Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib:

data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

for col in 'xy':
    # density=True replaces the normed argument, which newer Matplotlib releases no longer accept
    plt.hist(data[col], density=True, alpha=0.5)

Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with sns.kdeplot:

for col in 'xy':
    # fill=True replaces the deprecated shade argument
    sns.kdeplot(data[col], fill=True)
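For intuition about what a kernel density estimate computes, here is a minimal NumPy sketch, not Seaborn’s actual implementation. It places a Gaussian bump on every sample and averages them. The helper name gaussian_kde_1d and the fixed bandwidth of 0.4 are assumptions made for this illustration; Seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centred on every sample, evaluated on `grid`."""
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.RandomState(0)
samples = rng.randn(200)
grid = np.linspace(-6, 6, 601)

density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density estimate should integrate to roughly 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which is the main knob to be aware of when reading KDE plots.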

Histograms and KDE can be combined using histplot with kde=True (older Seaborn releases used distplot for this, which is now deprecated):

sns.histplot(data['y'], kde=True, stat='density')

If we pass both columns of the two-dimensional dataset to kdeplot as x and y, we will get a two-dimensional visualization of the data:

sns.kdeplot(data=data, x='x', y='y')

We can see the joint distribution and the marginal distributions together using sns.jointplot. For this plot, we’ll set the style to a white background:

with sns.axes_style('white'):
    sns.jointplot(data=data, x="x", y="y", kind='kde')

There are other parameters that can be passed to jointplot—for example, we can use a hexagonally based histogram instead:

with sns.axes_style('white'):
    sns.jointplot(data=data, x="x", y="y", kind='hex')

Pair plots with Seaborn

When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multidimensional data when you’d like to plot all pairs of values against each other.

We’ll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:

iris = sns.load_dataset("iris")

Visualizing the multidimensional relationships among the samples is as easy as calling sns.pairplot:

sns.pairplot(iris, hue='species', height=2.5)

Faceted histograms with Seaborn

Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this extremely simple. We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data:

tips = sns.load_dataset('tips')
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15))

Categorical plots with Seaborn

Categorical plots (called factor plots in older Seaborn releases, where the function was factorplot rather than catplot) can be useful for this kind of visualization as well. They allow you to view the distribution of a parameter within bins defined by any other parameter:

with sns.axes_style(style='ticks'):
    g = sns.catplot(data=tips, x="day", y="total_bill", hue="sex", kind="box")
    g.set_axis_labels("Day", "Total Bill")

Working on Real Data with Seaborn

Here we’ll look at using Seaborn to help visualize and understand finishing results from a marathon. We will start by downloading the data from the Web and loading it into Pandas (the code below assumes the file has been saved locally as marathon-data.csv):

data = pd.read_csv('marathon-data.csv')

By default, Pandas loaded the time columns as Python strings (type object); we can see this by looking at the dtypes attribute of the DataFrame:

data.dtypes
age        int64
gender    object
split     object
final     object
dtype: object

Let’s fix this by providing a converter for the times:

def convert_time(s):
    # Parse an "H:MM:SS" string into a Timedelta
    # (pd.datetools is no longer available in current Pandas)
    h, m, s = map(int, s.split(':'))
    return pd.Timedelta(hours=h, minutes=m, seconds=s)

data = pd.read_csv('marathon-data.csv',
                   converters={'split':convert_time, 'final':convert_time})
data.dtypes

That looks much better. For the purpose of our Seaborn plotting utilities, let’s next add columns that give the times in seconds:

# total_seconds() avoids relying on the platform-dependent astype(int) conversion
data['split_sec'] = pd.to_timedelta(data['split']).dt.total_seconds()
data['final_sec'] = pd.to_timedelta(data['final']).dt.total_seconds()
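The string-to-seconds conversion can also be checked on a few hand-written values. This sketch lets pd.to_timedelta do the parsing instead of a custom converter; the sample times are made up.

```python
import pandas as pd

# Made-up finishing times in "HH:MM:SS" form
times = pd.Series(["01:05:38", "02:09:28", "00:59:59"])

# to_timedelta parses the strings; total_seconds() returns floats
seconds = pd.to_timedelta(times).dt.total_seconds()
print(seconds.tolist())
```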

To get an idea of what the data looks like, we can plot a jointplot over the data:

with sns.axes_style('white'):
    g = sns.jointplot(x="split_sec", y="final_sec", data=data, kind='hex')
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')

The dotted line shows where someone’s time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down throughout the marathon. If you have run competitively, you’ll know that those who do the opposite—run faster during the second half of the race—are said to have “negative-split” the competition.

Let’s create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race.

data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']

Where this split difference is less than zero, the person negative-split the race by that fraction. Let’s do a distribution plot of this split fraction:

sns.histplot(data['split_frac'], kde=False)
plt.axvline(0, color="k", linestyle="--")

sum(data.split_frac < 0)

Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.
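To make the sign convention of the split fraction concrete, here is a short worked example with three hypothetical runners who all finish in 4 hours. The helper name split_fraction and the times are invented for illustration; the formula is the one used for the split_frac column.

```python
def split_fraction(split_sec, final_sec):
    """1 - 2 * (first-half time) / (finish time)."""
    return 1 - 2 * split_sec / final_sec

# Hypothetical runners, all finishing in 4 hours (14,400 seconds)
even = split_fraction(7200, 14400)       # both halves take 2:00
positive = split_fraction(6840, 14400)   # first half 1:54, second half 2:06 (slowed down)
negative = split_fraction(7560, 14400)   # first half 2:06, second half 1:54 (sped up)

print(even, positive, negative)
```

An even pace gives zero, slowing down gives a value of about +0.05, and speeding up gives about -0.05, so only the last runner counts as a negative split.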

Let’s see whether there is any correlation between this split fraction and other variables. We’ll do this using a PairGrid, which draws plots of all these correlations:

g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                 hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend()

The difference between men and women here is interesting. Let’s look at the density estimates of the split fractions for these two groups:

sns.kdeplot(data.split_frac[data.gender=='M'], label='men', fill=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', fill=True)
plt.xlabel('split_frac')
plt.legend()

The interesting thing here is that there are many more men than women who are running close to an even split! This almost looks like some kind of bimodal distribution among the men and women. Let’s see if we can suss out what’s going on by looking at the distributions as a function of age.

A nice way to compare distributions is to use a violin plot:

sns.violinplot(x="gender", y="split_frac", data=data,
               palette=["lightblue", "lightpink"])

This is yet another way to compare the distributions between men and women.

Let’s look a little deeper, and compare these violin plots as a function of age. We’ll start by creating a new column in the DataFrame that specifies the decade of age that each person is in:

data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))

men = (data.gender == 'M')
women = (data.gender == 'W')

with sns.axes_style(style=None):
    sns.violinplot(x="age_dec", y="split_frac", hue="gender", data=data,
                   split=True, inner="quartile",
                   palette=["lightblue", "lightpink"])
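The decade binning relies on floor division. A quick check on a few made-up ages, written in vectorized form rather than with map:

```python
import pandas as pd

ages = pd.Series([23, 39, 40, 81])   # made-up ages

# Floor division by 10 drops the last digit; multiplying by 10 restores the decade
decades = 10 * (ages // 10)
print(decades.tolist())
```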

Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).

Also, surprisingly, the 80-year-old women seem to outperform everyone in terms of their split time. This is probably because we’re estimating the distribution from small numbers, as there are only a handful of runners in that range:

(data.age > 80).sum()

Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We’ll use lmplot, which will automatically fit a linear regression to the data:

g = sns.lmplot(x='final_sec', y='split_frac', col='gender', data=data,
               markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":")
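The straight line in such a plot is a least-squares fit. As a rough standalone analogue, and not a reproduction of what Seaborn does internally, the sketch below fits a line with NumPy to synthetic data generated from a known slope and intercept. All numbers here are made up.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: a known line (slope 0.5, intercept 2) plus a little noise
x = np.linspace(0, 10, 50)
y = 0.5 * x + 2 + rng.normal(scale=0.1, size=x.size)

# Degree-1 polynomial fit, i.e. a least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))
```

The recovered slope and intercept should land close to the values used to generate the data, which is the same idea behind reading the trend line in the regression plot.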

Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.

I hope you liked this article on Data Visualization with Seaborn. Feel free to ask questions related to Seaborn or any other topic in the comments section.

Aman Kharwal