Statistics for Machine Learning

The Role of Statistics in Machine Learning.

The statistics behind machine learning worry a lot of programmers. A solid knowledge of statistics will help you build machine learning models that are well suited to a given problem. In this article, I’ll walk you through the statistics you need for machine learning.

Machine learning is a branch of study in which a model learns automatically from data and experience, without its relationships being explicitly specified as they are in statistical models. Over time and with more data, the model’s predictions generally improve.

What is Statistics?


Statistics is a field of mathematics that deals with the collection, analysis, and interpretation of numerical data. It is mainly divided into two sub-branches:

  • Descriptive statistics: used to summarize data, for example the mean and standard deviation for continuous data (like age), or frequency and percentage for categorical data.
  • Inferential statistics: collecting the entire dataset (called the population in statistical terminology) is often not possible, so a subset of data points, called a sample, is collected and conclusions about the population are drawn from it. This is known as inferential statistics. Inferences are drawn using hypothesis testing, estimation of numeric characteristics, correlation of relationships within the data, etc.
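To make the descriptive measures above concrete, here is a minimal sketch using only Python’s standard library. The `ages` and `segments` values are made-up illustration data, not taken from any real dataset.

```python
import statistics
from collections import Counter

# Hypothetical data, invented purely for illustration.
ages = [23, 35, 31, 42, 28, 35, 50]
segments = ["retail", "retail", "business", "retail",
            "business", "premium", "retail"]

# Continuous data: mean and sample standard deviation (n - 1 denominator).
mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)

# Categorical data: frequency and percentage of each category.
frequency = Counter(segments)
percentage = {name: 100 * count / len(segments)
              for name, count in frequency.items()}

print(f"mean age: {mean_age:.2f}, standard deviation: {std_age:.2f}")
for name, count in frequency.items():
    print(f"{name}: {count} ({percentage[name]:.1f}%)")
```

The same summaries are available in libraries such as pandas, but the standard library is enough to see what each measure does.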

Statistical modeling involves applying statistics to data to find underlying hidden relationships by analyzing the significance of variables.

Why Statistics for Machine Learning?

Statistics is the science of data analysis. Classical or conventional statistics is inferential, which means it is used to draw conclusions about the data (its various parameters).

The main purpose of statistical modeling is to make inferences and understand the characteristics of variables. Machine learning models build on statistical algorithms and apply them to predictive analytics. In statistical models, a hypothesis is a testable way to confirm the validity of a specific algorithm.

In machine learning, statistics helps us with the following:

  • The average value of any KPI (such as sales, number of customers, NPS, revenue, market basket, etc.).
  • The mean, median, and mode are among the most widely used imputations for missing values.
  • These measures of central tendency help us validate data quality and align it with domain knowledge. For example, for a grocery store, the average sales volume increases during the holiday season compared to the rest of the year. When data is obtained and the averages are computed, if the central tendencies differ significantly from what is known, we need to check the quality of the data.
  • Linear regression, logistic regression, and neural networks rely on certain assumptions, which are validated with correlation analysis, checks for outliers, etc.
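As a rough illustration of the imputation point in the list above, the sketch below fills missing values with the mean, median, or mode. The `daily_sales` figures and the `impute` helper are hypothetical names made up for this example, and `None` is assumed to mark a missing entry.

```python
import statistics

# Hypothetical daily sales with missing entries marked as None.
daily_sales = [120.0, None, 135.0, 120.0, None, 410.0, 128.0]

# Central tendencies are computed from the observed values only.
observed = [value for value in daily_sales if value is not None]
fill_values = {
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "mode": statistics.mode(observed),
}


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if value is None else value for value in values]


imputed_with_median = impute(daily_sales, fill_values["median"])

print(fill_values)
print(imputed_with_median)
```

In this toy data the single large value pulls the mean well above the median, which is one reason the median is often preferred when the data is skewed or contains outliers.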

Important Concepts of Statistics

Now, let’s understand some important concepts of statistics:

  • Population – The universe of all possible data for a given scenario. For example, all customers with credit accounts at a bank.
  • Sample – A set of observations drawn from a given population. For example, all customers with credit accounts at a bank who opted for customer support services in the last quarter. These are the same customers as in the population above, but filtered down to those who opted for customer support services in the previous quarter, so they form a sample.
  • Parameter – A numerical summary or value associated with a population. For example, the average debt of all customers with credit accounts at a bank.
  • Statistic – A numerical summary or value associated with a sample. For example, the average number of calls made at the start of the week by customers opting for customer support.

A population is usually not fully observed, because when we work with data we do not have access to every entity in the problem space. Data is also often lost or dropped while it is being recorded. Therefore, every dataset is treated as a subset of a population.
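The difference between a parameter and a statistic can be sketched with a small simulation. Everything here is an assumption made for illustration: the population is generated artificially (10,000 simulated customer debts), and in real work the population values would normally not be available at all.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: simulated debt of 10,000 customers.
population = [rng.gauss(5000, 1500) for _ in range(10_000)]

# Parameter: a summary of the whole population (rarely computable in practice).
population_mean = statistics.mean(population)

# Statistic: the same summary computed on a random sample of 200 customers.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

# Standard error: a rough measure of how far the sample mean
# is expected to sit from the population mean.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
print(f"standard error of the mean:  {standard_error:.2f}")
```

The sample mean will not match the population mean exactly, and the standard error gives a sense of how large that gap typically is. Estimating the parameter from the statistic in this way is the core idea of inferential statistics.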

I hope you liked this article on statistics for machine learning. You can learn the implementation of statistics in machine learning with Python from here. Feel free to ask your valuable questions in the comments section below.
