This article is based on Statistics tutorial to learn essential concepts of Statistics, that we need in Data Science. I will try to present the concepts in a fun and interactive way and I encourage you to play with the code to get a better grasp of the concepts.

### Download the data set

Lets start with importing the libraries and reading the data set

# Dependencies # Standard Dependencies import os import numpy as np import pandas as pd from math import sqrt # Visualization from pylab import * import matplotlib.mlab as mlab import matplotlib.pyplot as plt import seaborn as sns # Statistics from statistics import median from scipy import signal from scipy.special import factorial import scipy.stats as stats from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr from scipy.fftpack import fft, fftshift # Scikit-learn for Machine Learning models from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split # Seed for reproducability seed = 12345 np.random.seed(seed) from google.colab import files uploaded = files.upload() df = pd.read_csv("toy_dataset.csv")

## Discrete and Continuous Variables

- A discrete variable is a variable that can only take on a
**countable**number of values. If you can count a set of items, then itâ€™s a discrete variable. An example of a discrete variable is the outcome of a dice.

It can only have 1 of 6 different possible outcomes and is therefore discrete. A discrete random variable can have an infinite number of values. For example, the whole set of natural numbers (1,2,3,etc.) is countable and therefore discrete. - A continuous variable takes on an uncountable number of values. An example of a continuous variable is length. Length can be measured to an arbitrary degree and is therefore continuous.
- In statistics we represent a distribution of discrete variables with PMF’s (Probability Mass Functions) and CDF’s (Cumulative Distribution Functions). We represent distributions of continuous variables with PDF’s (Probability Density Functions) and CDF’s.
- The PMF defines the probability of all possible values x of the random variable. A PDF is the same but for continuous values. The CDF represents the probability that the random variable X will have an outcome less or equal to the value x. The name CDF is used for both discrete and continuous distributions.
- The functions that describe PMF’s, PDF’s and CDF’s can be quite daunting at first, but their visual counterpart often looks quite intuitive.

## PMF (Probability Mass Function)

Here we visualize a PMF of a binomial distribution. You can see that the possible values are all integers. For example, no values are between 50 and 51.

The PMF of a binomial distribution in function form:

# PMF Visualization n = 100 p = 0.5 fig, ax = plt.subplots(1, 1, figsize=(17,5)) x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p)) ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF') ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5) rv = binom(n, p) #ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen PMF') ax.legend(loc='best', frameon=False, fontsize='xx-large') plt.title('PMF of a binomial distribution (n=100, p=0.5)', fontsize='xx-large') plt.show()

## PDF (Probability Density Functions)

The PDF is the same as a PMF, but continuous. It can be said that the distribution has an infinite number of possible values. Here we visualize a simple normal distribution with a mean of 0 and standard deviation of 1.

PDF of a normal distribution in formula form:

# Plot normal distribution mu = 0 variance = 1 sigma = sqrt(variance) x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100) plt.figure(figsize=(16,5)) plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution') plt.title('Normal Distribution with mean = 0 and std = 1') plt.legend(fontsize='xx-large') plt.show()

## CDF (Cumulative Distribution Function)

The CDF maps the probability that a random variable X will take a value of less than or equal to a value x (P(X â‰¤ x)). CDF’s can be discrete or continuous. In this section we visualize the continuous case. You can see in the plot that the CDF accumulates all probabilities and is therefore bounded between 0 â‰¤ x â‰¤ 1.

The CDF of a normal distribution as a formula:

# Data X = np.arange(-2, 2, 0.01) Y = exp(-X ** 2) # Normalize data Y = Y / (0.01 * Y).sum() # Plot the PDF and CDF plt.figure(figsize=(15,5)) plt.title('Continuous Normal Distributions', fontsize='xx-large') plot(X, Y, label='Probability Density Function (PDF)') plot(X, np.cumsum(Y * 0.01), 'r', label='Cumulative Distribution Function (CDF)') plt.legend(fontsize='xx-large') plt.show()

## Distributions

- A Probability distribution tells us something about the likelihood of each value of the random variable.
- A random variable X is a function that maps events to real numbers.
- The visualizations in this section are of discrete distributions. Many of these distributions can however also be continuous.

## Uniform Distribution

A Uniform distribution is pretty straightforward. Every value has an equal change of occuring. Therefore, the distribution consists of random values with no patterns in them. In this example we generate random floating numbers between 0 and 1.

The PDF of a Uniform Distribution:

CDF:

# Uniform distribution (between 0 and 1) uniform_dist = np.random.random(1000) uniform_df = pd.DataFrame({'value' : uniform_dist}) uniform_dist = pd.Series(uniform_dist) plt.figure(figsize=(18,5)) sns.scatterplot(data=uniform_df) plt.legend(fontsize='xx-large') plt.title('Scatterplot of a Random/Uniform Distribution', fontsize='xx-large')

plt.figure(figsize=(18,5)) sns.distplot(uniform_df) plt.title('Random/Uniform distribution', fontsize='xx-large')

## Normal Distribution

A normal distribution (also called Gaussian or Bell Curve) is very common and convenient. This is mainly because of the Central Limit Theorem (CLT), which states that as the amount independent random samples (like multiple coin flips) goes to infinity the distribution of the sample mean tends towards a normal distribution.

PDF of a normal distribution:

CDF:

# Generate Normal Distribution normal_dist = np.random.randn(10000) normal_df = pd.DataFrame({'value' : normal_dist}) # Create a Pandas Series for easy sample function normal_dist = pd.Series(normal_dist) normal_dist2 = np.random.randn(10000) normal_df2 = pd.DataFrame({'value' : normal_dist2}) # Create a Pandas Series for easy sample function normal_dist2 = pd.Series(normal_dist) normal_df_total = pd.DataFrame({'value1' : normal_dist, 'value2' : normal_dist2})

# Scatterplot plt.figure(figsize=(18,5)) sns.scatterplot(data=normal_df) plt.legend(fontsize='xx-large') plt.title('Scatterplot of a Normal Distribution', fontsize='xx-large')

# Normal Distribution as a Bell Curve plt.figure(figsize=(18,5)) sns.distplot(normal_df) plt.title('Normal distribution (n=1000)', fontsize='xx-large')

## Binomial Distribution

A Binomial Distribution has a countable number of outcomes and is therefore discrete.

Binomial distributions must meet the following three criteria:

- The number of observations or trials is fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times.
- Each observation or trial is independent. In other words, none of your trials have an effect on the probability of the next trial.
- The probability of success is exactly the same from one trial to another.

An intuitive explanation of a binomial distribution is flipping a coin 10 times. If we have a fair coin our chance of getting heads (p) is 0.50. Now we throw the coin 10 times and count how many times it comes up heads. In most situations we will get heads 5 times, but there is also a change that we get heads 9 times. The PMF of a binomial distribution will give these probabilities if we say N = 10 and p = 0.5. We say that the x for heads is 1 and 0 for tails.

PMF:

CDF:

A **Bernoulli Distribution** is a special case of a Binomial Distribution.

All values in a Bernoulli Distribution are either 0 or 1.

For example, if we take an unfair coin which falls on heads 60 % of the time, we can describe the Bernoulli distribution as follows:

p (change of heads) = 0.6

1 – p (change of tails) = 0.4

heads = 1

tails = 0

Formally, we can describe a Bernoulli distribution with the following PMF (Probability Mass Function):

# Change of heads (outcome 1) p = 0.6 # Create Bernoulli samples bern_dist = bernoulli.rvs(p, size=1000) bern_df = pd.DataFrame({'value' : bern_dist}) bern_values = bern_df['value'].value_counts() # Plot Distribution plt.figure(figsize=(18,4)) bern_values.plot(kind='bar', rot=0) plt.annotate(xy=(0.85,300), s='Samples that came up Tails\nn = {}'.format(bern_values[0]), fontsize='large', color='white') plt.annotate(xy=(-0.2,300), s='Samples that came up Heads\nn = {}'.format(bern_values[1]), fontsize='large', color='white') plt.title('Bernoulli Distribution: p = 0.6, n = 1000')

## Poisson Distribution

The Poisson distribution is a discrete distribution and is popular for modelling the number of times an event occurs in an interval of time or space.

It takes a value lambda, which is equal to the mean of the distribution.

PMF:

CDF:Â

x = np.arange(0, 20, 0.1) y = np.exp(-5)*np.power(5, x)/factorial(x) plt.figure(figsize=(15,8)) plt.title('Poisson distribution with lambda=5', fontsize='xx-large') plt.plot(x, y, 'bs') plt.show()

## Log-Normal Distribution

A log-normal distribution is continuous. The main characteristic of a log-normal distribution is that it’s logarithm is normally distributed. It is also referred to as Galton’s distribution.

PDF:

CDF:

Where Phi is the CDF of the standard normal distribution.

# Specify standard deviation and mean std = 1 mean = 5 # Create log-normal distribution dist=lognorm(std,loc=mean) x=np.linspace(0,15,200) # Visualize log-normal distribution plt.figure(figsize=(15,6)) plt.xlim(5, 10) plt.plot(x,dist.pdf(x), label='log-normal PDF') plt.plot(x,dist.cdf(x), label='log-normal CDF') plt.legend(fontsize='xx-large') plt.title('Visualization of log-normal PDF and CDF', fontsize='xx-large') plt.show()

## Summary Statistics and Moments

**Mean, Median and Mode**

Note: The mean is also called the first moment.

## Moments

A moment is a quantitative measure that says something about the shape of a distribution. There are central moments and non-central moments. This section is focused on the central moments.

The 0th central moment is the total probability and is always equal to 1.

The 1st moment is the mean (expected value).

The 2nd central moment is the variance.

**Variance** = The average of the squared distance of the mean. Variance is interesting in a mathematical sense, but the standard deviation is often a much better measure of how spread out the distribution is.

**Standard Deviation** = The square root of the variance

The 3rd central moment is the skewness.

**Skewness** = A measure that describes the contrast of one tail versus the other tail. For example, if there are more high values in your distribution than low values then your distribution is ‘skewed’ towards the high values.

The 4th central moment is the kurtosis.

**Kurtosis** = A measure of how ‘fat’ the tails in the distribution are.

The higher the moment, the harder it is to estimate with samples. Larger samples are required in order to obtain good estimates.

# Summary print('Summary Statistics for a normal distribution: ') # Median medi = median(normal_dist) print('Median: ', medi) display(normal_df.describe()) # Standard Deviation std = sqrt(np.var(normal_dist)) print('The first four calculated moments of a normal distribution: ') # Mean mean = normal_dist.mean() print('Mean: ', mean) # Variance var = np.var(normal_dist) print('Variance: ', var) # Return unbiased skew normalized by N-1 skew = normal_df['value'].skew() print('Skewness: ', skew) # Return unbiased kurtosis over requested axis using Fisher's definition of kurtosis # (kurtosis of normal == 0.0) normalized by N-1 kurt = normal_df['value'].kurtosis() print('Kurtosis: ', kurt)

## Bias, MSE and SE

**Bias** is a measure of how far the sample mean deviates from the population mean. The sample mean is also called **Expected value**.

Formula for Bias:

The formula for expected value (EV) makes it apparent that the bias can also be formulated as the expected value minus the population mean:

# Take sample normal_df_sample = normal_df.sample(100) # Calculate Expected Value (EV), population mean and bias ev = normal_df_sample.mean()[0] pop_mean = normal_df.mean()[0] bias = ev - pop_mean print('Sample mean (Expected Value): ', ev) print('Population mean: ', pop_mean) print('Bias: ', bias)

#Output Sample mean (Expected Value): -0.09989709931399528 Population mean: -0.012836443467010622 Bias: -0.08706065584698465

**MSE (Mean Squared Error)** is a formula to measure how much estimators deviate from the true distribution. This can be very useful with for example, evaluating regression models.

**RMSE (Root Mean Squared Error)** is just the root of the MSE.

from math import sqrt Y = 100 # Actual Value YH = 94 # Predicted Value # MSE Formula def MSE(Y, YH): return np.square(YH - Y).mean() # RMSE formula def RMSE(Y, YH): return sqrt(np.square(YH - Y).mean()) print('MSE: ', MSE(Y, YH)) print('RMSE: ', RMSE(Y, YH))

#Output MSE: 36.0 RMSE: 6.0

The **Standard Error (SE)** measures how spread the distribution is from the sample mean.

The formula can also be defined as the standard deviation divided by the square root of the number of samples.

# Standard Error (SE) uni_sample = uniform_dist.sample(100) norm_sample = normal_dist.sample(100) print('Standard Error of uniform sample: ', sem(uni_sample)) print('Standard Error of normal sample: ', sem(norm_sample)) # The random samples from the normal distribution should have a higher standard error

#Output Standard Error of uniform sample: 0.028951451192463368 Standard Error of normal sample: 0.10887812225934831

## Sampling methods

**Non-Representative Sampling:**

- Convenience Sampling = Pick samples that are most convenient, like the top of a shelf or people that can be easily approached.
- Haphazard Sampling = Pick samples without thinking about it. This often gives the illusion take you are picking out samples at random.
- Purposive Sampling = Pick samples for a specific purpose. An example is to focus on extreme cases. This can be useful but is limited because it doesn’t allow you to make statements about the whole population.

**Representative Sampling:**

- Simple Random Sampling = Pick samples (psuedo)randomly.
- Systematic Sampling = Pick samples with a fixed interval. For example every 10th sample (0, 10, 20, etc.).
- Stratified Sampling = Pick the same amount of samples from different groups (strata) in the population.
- Cluster Sampling = Divide the population into groups (clusters) and pick samples from those groups.

# Note that we take very small samples just to illustrate the different sampling methods print('---Non-Representative samples:---\n') # Convenience samples con_samples = normal_dist[0:5] print('Convenience samples:\n\n{}\n'.format(con_samples)) # Haphazard samples (Picking out some numbers) hap_samples = [normal_dist[12], normal_dist[55], normal_dist[582], normal_dist[821], normal_dist[999]] print('Haphazard samples:\n\n{}\n'.format(hap_samples)) # Purposive samples (Pick samples for a specific purpose) # In this example we pick the 5 highest values in our distribution purp_samples = normal_dist.nlargest(n=5) print('Purposive samples:\n\n{}\n'.format(purp_samples)) print('---Representative samples:---\n') # Simple (pseudo)random sample rand_samples = normal_dist.sample(5) print('Random samples:\n\n{}\n'.format(rand_samples)) # Systematic sample (Every 2000th value) sys_samples = normal_dist[normal_dist.index % 2000 == 0] print('Systematic samples:\n\n{}\n'.format(sys_samples)) # Stratified Sampling # We will get 1 person from every city in the dataset # We have 8 cities so that makes a total of 8 samples df = pd.read_csv('toy_dataset.csv') strat_samples = [] for city in df['City'].unique(): samp = df[df['City'] == city].sample(1) strat_samples.append(samp['Income'].item()) print('Stratified samples:\n\n{}\n'.format(strat_samples)) # Cluster Sampling # Make random clusters of ten people (Here with replacement) c1 = normal_dist.sample(10) c2 = normal_dist.sample(10) c3 = normal_dist.sample(10) c4 = normal_dist.sample(10) c5 = normal_dist.sample(10) # Take sample from every cluster (with replacement) clusters = [c1,c2,c3,c4,c5] cluster_samples = [] for c in clusters: clus_samp = c.sample(1) cluster_samples.extend(clus_samp) print('Cluster samples:\n\n{}'.format(cluster_samples))

#Output ---Non-Representative samples:--- Convenience samples: 0 0.652346 1 0.871600 2 0.268216 3 0.947681 4 0.147268 dtype: float64 Haphazard samples: [-0.5010027615175823, -0.5066784173896016, 0.6814697928519314, 0.8359965545029714, -0.5374734651177141] Purposive samples: 3378 3.525865 2792 3.366626 490 3.260383 7169 3.189940 7057 3.082067 dtype: float64 ---Representative samples:--- Random samples: 3442 0.236063 9169 -0.042718 8511 -1.529128 6500 -0.787297 2141 1.134073 dtype: float64 Systematic samples: 0 0.652346 2000 -0.914345 4000 -0.198577 6000 -0.030696 8000 -1.084444 dtype: float64 Stratified samples: [29892.0, 110450.0, 85635.0, 122364.0, 83214.0, 74469.0, 113187.0, 92854.0] Cluster samples: [-1.1500083079696, 1.2888783369868333, 1.2540511669530716, -0.6914086007291129, 2.259205774351393]

### Covariance

- Covariance is a measure of how much two random variables vary together.
- Variance is similar to covariance in that variance shows you how much one variable varies.
- If two variables are independent, their covariance is 0. However, a covariance of 0 does not imply that the variables are independent.

# Covariance between Age and Income print('Covariance between Age and Income: ') df[['Age', 'Income']].cov()

#Output Covariance between Age and Income: Age Income Age 133.922426 -3.811863e+02 Income -381.186341 6.244752e+08

### Correlation

Correlation is a standardized version of covariance. Here it becomes more clear that Age and Income do not have a strong correlation in our dataset.

Formula for Pearson’s correlation coefficient:

# Correlation between two normal distributions # Using Pearson's correlation print('Pearson: ') df[['Age', 'Income']].corr(method='pearson')

#Output Pearson: Age Income Age 1.000000 -0.001318 Income -0.001318 1.000000

- Another method for calculating a correlation coefficient is ‘Spearman’s Rho’.
- The formula looks different but it will give similar results as Pearson’s method.
- In this example we see almost no difference, but this is partly because it is obvious that the Age and Income columns in our dataset have no correlation.
- Formula for Spearmans Rho:

# Using Spearman's rho correlation print('Spearman: ') df[['Age', 'Income']].corr(method='spearman')

#Output Spearman: Age Income Age 1.000000 -0.001452 Income -0.001452 1.000000

## Linear regression

Linear Regression can be performed through Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE).

Most Python libraries use OLS to fit linear models.

# Generate data x = np.random.uniform(low=20, high=260, size=100) y = 50000 + 2000*x - 4.5 * x**2 + np.random.normal(size=100, loc=0, scale=10000) # Plot data with Linear Regression plt.figure(figsize=(16,5)) plt.title('Well fitted but not well fitting: Linear regression plot on quadratic data', fontsize='xx-large') sns.regplot(x, y)

# Linear regression from scratch import random # Create data from regression xs = np.array(range(1,20)) ys = [0,8,10,8,15,20,26,29,38,35,40,60,50,61,70,75,80,88,96] # Put data in dictionary data = dict() for i in list(xs): data.update({xs[i-1] : ys[i-1]}) # Slope m = 0 # y intercept b = 0 # Learning rate lr = 0.0001 # Number of epochs epochs = 100000 # Formula for linear line def lin(x): return m * x + b # Linear regression algorithm for i in range(epochs): # Pick a random point and calculate vertical distance and horizontal distance rand_point = random.choice(list(data.items())) vert_dist = abs((m * rand_point[0] + b) - rand_point[1]) hor_dist = rand_point[0] if (m * rand_point[0] + b) - rand_point[1] < 0: # Adjust line upwards m += lr * vert_dist * hor_dist b += lr * vert_dist else: # Adjust line downwards m -= lr * vert_dist * hor_dist b -= lr * vert_dist # Plot data points and regression line plt.figure(figsize=(15,6)) plt.scatter(data.keys(), data.values()) plt.plot(xs, lin(xs)) plt.title('Linear Regression result') print('Slope: {}\nIntercept: {}'.format(m, b))

## 1 reply on “Statistics Tutorial for Data Science”

[…] Data Science is all about applying maths to data. If you don’t know about maths, you will face difficulties with data science. You can start learning Mathematics for Data Science using Numerical Python and Statistics. […]