Univariate and Multivariate for Data Science

Univariate and multivariate analysis are two types of statistical analysis. In univariate analysis, we examine a single variable at a time, and in multivariate analysis, we examine two or more variables together. In this article, I’ll walk you through a tutorial on univariate and multivariate statistics for data science using Python.

Univariate and Multivariate Statistics for Data Science

While doing statistical analysis in data science we aim to perform these common tasks:

  • To understand the data, distribution and categories
  • To perform the Univariate Statistical Analysis on the data as part of the EDA
  • To perform Multivariate Statistical Analysis

In this article, I’ll walk you through a brief step-by-step statistical analysis so that you can understand in a practical way what univariate and multivariate analysis are and how to use them for data science.

The dataset I’m using here has 181 columns, which fall into 8 categories. Three of these categories (Vitals, Laboratories and Blood Gas Laboratories) contain 52, 60 and 16 features, respectively, and their features are reported as min and max values. Now let’s import the data and start with some statistical analysis with Python:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
train = pd.read_csv('TrainingWiDS2021.csv')
test = pd.read_csv('UnlabeledWiDS2021.csv')
train.info()
RangeIndex: 130157 entries, 0 to 130156
Columns: 181 entries, Unnamed: 0 to diabetes_mellitus
dtypes: float64(157), int64(18), object(6)
memory usage: 179.7+ MB

Note that most columns are numeric (float or int) and 6 columns are stored as the object type. Now let’s have a look at some descriptive statistics by using the describe function of pandas:

train.describe().T
descriptive statistics
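By default, describe() on a mixed DataFrame summarises only the numeric columns, so the object columns need a separate look. The sketch below shows one way to do that on a small made-up DataFrame; the column names gender and icu_type are hypothetical stand-ins, not taken from the dataset description above.

```python
import pandas as pd

# Hypothetical stand-in for the training data; column names are illustrative.
df = pd.DataFrame({
    "age": [25.0, 40.0, 67.0, 51.0],
    "bmi": [22.1, 30.5, 27.8, 24.3],
    "gender": ["F", "M", "F", "M"],
    "icu_type": ["MICU", "CCU", "MICU", "SICU"],
})

# Count columns per dtype (the same breakdown info() prints on its dtypes line).
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Keep only the non-numeric columns and summarise them
# (count, number of unique values, most frequent value and its frequency).
non_numeric = df.select_dtypes(exclude="number")
summary = non_numeric.describe().T
print(summary)
```

On the real data, the same two calls could be applied to train instead of the toy df.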

Observations:

  1. Starting with the target variable (diabetes_mellitus), it is binary. The dataset contains different types of data, but most columns are numeric.
  2. A few variables, such as age, have a minimum value of 0, which could indicate a missing value or an outlier. Likewise, the maximum value of some other variables, such as BMI, is very high. We need to check such variables carefully against the factors they depend on, for example height and weight in the case of BMI.
  3. The scale and spread differ widely from one variable to another, so scaling can be useful in this scenario.
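The checks suggested in these observations can be sketched on a toy DataFrame. This is only an illustration under stated assumptions: the weight column name, the units (height in cm, weight in kg), the choice to treat an age of 0 as missing, and the mismatch tolerance of 1.0 are all assumptions rather than facts about the dataset.

```python
import numpy as np
import pandas as pd

# Toy data; units (height in cm, weight in kg) are assumed for illustration.
df = pd.DataFrame({
    "age": [0, 34, 58, 71],
    "height": [170.0, 160.0, 180.0, 175.0],
    "weight": [70.0, 64.0, 81.0, 200.0],
    "bmi": [24.22, 25.0, 25.0, 35.0],
})

# Assumption: an age of 0 is a placeholder for a missing value.
df["age"] = df["age"].replace(0, np.nan)

# Recompute BMI = weight (kg) / height (m)^2 and flag rows where the
# reported BMI differs from the recomputed one by more than 1.0.
bmi_check = df["weight"] / (df["height"] / 100) ** 2
df["bmi_mismatch"] = (df["bmi"] - bmi_check).abs() > 1.0

# Standardise the numeric columns (z-score) so they share a comparable scale.
num_cols = ["age", "height", "weight", "bmi"]
scaled = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

print(df)
print(scaled.round(2))
```

Rows flagged in bmi_mismatch are candidates for a closer look rather than automatic removal.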

Univariate Analysis

Now let’s analyze the data using the univariate method, where we examine one variable at a time:

print(train.shape)
print(train.encounter_id.nunique())
print(train.hospital_id.nunique())
(130157, 181)
130157
204
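The output above shows as many distinct encounter_id values as rows, while hospital_id takes only 204 values. A minimal sketch of the same check on made-up data, using is_unique and value_counts, might look like this:

```python
import pandas as pd

# Made-up identifiers; only the column names come from the code above.
df = pd.DataFrame({
    "encounter_id": [101, 102, 103, 104, 105],
    "hospital_id": [7, 7, 9, 9, 9],
})

# True when every row has its own identifier (nunique equals the row count).
ids_are_unique = df["encounter_id"].is_unique

# Number of encounters contributed by each hospital.
rows_per_hospital = df["hospital_id"].value_counts()

print(ids_are_unique)
print(rows_per_hospital)
```

A column that is unique per row acts as an identifier and usually carries no predictive signal, whereas a grouping column such as hospital_id can be worth summarising per group.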

Let’s start by analyzing the target variable which is diabetes_mellitus:

sns.catplot(x ='diabetes_mellitus', kind ='count',data = train)
# Imbalance Ratio
train.diabetes_mellitus.value_counts(normalize=True)
0    0.783715
1    0.216285
Name: diabetes_mellitus, dtype: float64

As we can see, the target variable is imbalanced: about 78% of the rows belong to class 0 and only about 22% to class 1. So, to train a predictive model, you will need to balance this data or otherwise account for the imbalance.
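One common alternative to resampling is to weight the classes inversely to their frequency. The sketch below computes such weights with plain pandas on a toy target that roughly mirrors the 78/22 split; it is an illustrative option, not the method used in this article.

```python
import pandas as pd

# Toy target with roughly the same 78/22 split as reported above.
y = pd.Series([0] * 78 + [1] * 22, name="diabetes_mellitus")

counts = y.value_counts()
n_samples = len(y)
n_classes = counts.size

# weight_c = n_samples / (n_classes * count_c): rarer classes get larger weights.
class_weights = (n_samples / (n_classes * counts)).to_dict()
print(class_weights)
```

Many classifiers accept a dictionary like this through a class-weight style parameter, so the minority class contributes more to the loss without duplicating or dropping rows.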

So in univariate statistics, we repeat this process for every variable. Now let’s see what we do in multivariate statistics in the section below.
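Repeating the process for every variable can be automated with a simple loop. The following is a minimal sketch on a toy DataFrame; the choice of summary statistics (missing count, unique count, mean, skewness) is an assumption about what is worth tracking, not a fixed recipe.

```python
import pandas as pd

# Toy data standing in for a few numeric columns of the training set.
df = pd.DataFrame({
    "age": [25.0, 40.0, None, 51.0],
    "bmi": [22.1, 30.5, 27.8, 24.3],
    "diabetes_mellitus": [0, 1, 0, 0],
})

rows = []
# Visit each numeric column on its own and collect a few summary numbers.
for col in df.select_dtypes(include="number").columns:
    s = df[col]
    rows.append({
        "column": col,
        "missing": int(s.isna().sum()),
        "unique": int(s.nunique()),
        "mean": s.mean(),
        "skew": s.skew(),
    })

univariate_summary = pd.DataFrame(rows).set_index("column")
print(univariate_summary)
```

The resulting table gives one row per variable, which makes it easier to spot columns that need a closer individual look.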

Multivariate Analysis

Unlike univariate analysis, where we take one variable at a time and analyze it, in multivariate analysis we take several variables (features) at the same time and analyze the relationships between them. Multivariate analysis can help us find more complex patterns in the data and relate them to real-world scenarios.

Now let’s see how to analyze the data using the multivariate method of statistics. The pair plot is an efficient way to see the relationships between features:

# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(train[['height','bmi','apache_2_diagnosis','apache_3j_diagnosis','diabetes_mellitus']], hue='diabetes_mellitus')
plt.show()
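A pair plot can be complemented with numeric summaries of the same relationships. The sketch below, on made-up values, computes a correlation matrix and per-class feature means; it is an illustrative addition rather than part of the original analysis.

```python
import pandas as pd

# Made-up values; only the column names come from the pair plot above.
df = pd.DataFrame({
    "height": [160.0, 170.0, 180.0, 175.0, 165.0, 172.0],
    "bmi": [31.0, 24.0, 22.5, 35.2, 29.8, 23.1],
    "diabetes_mellitus": [1, 0, 0, 1, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Mean of each feature within each target class.
group_means = df.groupby("diabetes_mellitus")[["height", "bmi"]].mean()

print(corr.round(2))
print(group_means)
```

The correlation matrix summarises linear relationships between pairs of variables, while the grouped means show how each feature differs between the two target classes.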

So this is what we do in multivariate statistics while analyzing a dataset. You can find the complete code used for the univariate and multivariate analysis below.

#importing the data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
train = pd.read_csv('TrainingWiDS2021.csv')
test = pd.read_csv('UnlabeledWiDS2021.csv')
train.info()

#describing the data
train.describe().T

#univariate analysis
print(train.shape)
print(train.encounter_id.nunique())
print(train.hospital_id.nunique())
sns.catplot(x ='diabetes_mellitus', kind ='count',data = train)
# Imbalance Ratio
train.diabetes_mellitus.value_counts(normalize=True)

#multivariate analysis
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(train[['height','bmi','apache_2_diagnosis','apache_3j_diagnosis',
                    'diabetes_mellitus']], hue='diabetes_mellitus')
plt.show()

I hope you liked this article on univariate and multivariate statistics for data science. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.
