What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is the stage of data analysis where we get to know a dataset before modelling it. Essentially, it means understanding what’s in the data we’re working with. In this article, I’ll walk you through what exploratory data analysis is, and the steps and techniques of EDA in the data science process.

Exploratory data analysis is the most important step in any data science task. The main objectives of EDA are:

  1. Analyze data distribution
  2. Detect outliers and anomalies
  3. Select the most important features
  4. Remove unnecessary columns
  5. Remove or fill in missing values
  6. Discover hidden patterns
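
A couple of these objectives (missing values and unnecessary columns) can be sketched with pandas as shown here. This is only a minimal illustration: the dataset, its column names, and the choice of median and mode for filling are my assumptions, not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are made up for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [23.0, 35.0, np.nan, 41.0, 29.0],
    "city": ["Delhi", None, "Mumbai", "Delhi", "Pune"],
    "constant": [1, 1, 1, 1, 1],
})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Remove columns that carry no information here: a row identifier and a constant.
cleaned = df.drop(columns=["id", "constant"])

# Fill missing numeric values with the median and missing categories with the mode.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna(cleaned["city"].mode()[0])

print(missing_counts)
print(cleaned)
```

Whether to drop or fill a missing value depends on the dataset, so treat the filling strategy above as one option among several.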

In the EDA process, we also perform feature selection and come to understand the data, primarily by visualizing it. Understanding each feature and analyzing the relationships between features are also important parts of exploratory data analysis.

EDA is the first step after data collection; after EDA, we move on to feature engineering and model selection. EDA therefore helps a lot in selecting the best features for the model and in choosing the best model to predict the labels.

Techniques of Exploratory Data Analysis

Let’s understand exploratory data analysis techniques by looking at which techniques to use depending on the type of data we’re working with:

Type of Data    | EDA Techniques You Should Use
----------------|------------------------------
Categorical     | Descriptive statistics
Univariate      | Line plots and histograms
Bivariate       | Scatter plots
Arrays          | Heatmap
Multivariate    | 3D or 2D point cloud with a third variable represented by different colours, shapes or sizes
Multiple groups | Box plots
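
To make the table above concrete, here is a rough sketch using pandas and matplotlib. The data is randomly generated and the column names (`group`, `x`, `y`) are hypothetical; it simply shows one plot per data type from the table.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column and two related numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=200),
    "x": rng.normal(50, 10, size=200),
})
df["y"] = 2 * df["x"] + rng.normal(0, 5, size=200)

# Categorical: descriptive statistics (frequency of each category).
category_counts = df["group"].value_counts()
print(category_counts)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate: histogram of a single numeric feature.
axes[0].hist(df["x"], bins=20)
axes[0].set_title("Histogram of x")

# Bivariate / multivariate: scatter plot of x and y, with group shown as colour.
group_codes = df["group"].astype("category").cat.codes
axes[1].scatter(df["x"], df["y"], c=group_codes, s=10)
axes[1].set_title("x vs y, coloured by group")

# Multiple groups: one box plot of x per group.
groups = sorted(df["group"].unique())
axes[2].boxplot([df.loc[df["group"] == g, "x"].to_numpy() for g in groups])
axes[2].set_xticks(range(1, len(groups) + 1))
axes[2].set_xticklabels(groups)
axes[2].set_title("x by group")

fig.tight_layout()
```

To see the figure, call `plt.show()` in an interactive session or save it with `fig.savefig("eda_plots.png")` (a hypothetical filename).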

Now let’s look at the most useful EDA techniques depending on the objective:

Objective                                                                       | EDA Techniques You Should Use
--------------------------------------------------------------------------------|------------------------------
Get an idea of the distribution of features                                     | Histogram
Outlier detection                                                               | Histogram, scatter plots, box plots
Understanding the relationship between two variables                            | 2D scatter plot and correlation
Visualize the relationship between two input variables and one output variable  | Heatmap
High-dimensional data visualization                                             | t-SNE or PCA + 2D/3D scatter plot
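
As an illustration of the last row, here is a minimal sketch of PCA done with NumPy's SVD, projecting five features down to two so they can be drawn on a 2D scatter plot. The data is synthetic and the variable names are my own; in practice you might prefer a library implementation of PCA or t-SNE.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 samples, 5 features driven by 2 hidden factors plus noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# Centre each feature, then take the SVD; the rows of vt are the principal directions.
X_centered = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal components to get 2D coordinates.
X_2d = X_centered @ vt[:2].T

# Share of the total variance captured by each component.
explained_ratio = s**2 / np.sum(s**2)

print(X_2d.shape)
print(explained_ratio[:2].sum())
```

The two columns of `X_2d` can then be passed to any scatter plot function. If the first two components explain only a small share of the variance, the 2D picture should be read with caution.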

Steps of EDA

This is not a strict rule, but it is preferable to follow the steps below when performing EDA on any type of dataset:

  1. Importing the data
  2. Understanding data distribution and missing values
  3. Understanding each feature
  4. Computing descriptive statistics
  5. Understanding correlation between features
  6. Detecting outliers
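
The steps above can be sketched end to end with pandas. The inline CSV, its column names, and the 1.5 × IQR rule for outliers are assumptions I’m making for illustration; with a real dataset you would load your own file and may choose a different outlier rule.

```python
import io
import pandas as pd

# 1. Import the data. An inline CSV stands in for a real file here.
csv_text = """height_cm,weight_kg,age,city
170,65,29,Delhi
165,58,34,Mumbai
180,82,41,Delhi
175,,25,Pune
160,52,38,Mumbai
172,70,30,Delhi
168,61,27,Pune
250,75,33,Delhi
"""
df = pd.read_csv(io.StringIO(csv_text))

# 2. Check data types and missing values.
print(df.dtypes)
missing = df.isna().sum()

# 3. Look at an individual feature, e.g. how often each city occurs.
city_counts = df["city"].value_counts()

# 4. Descriptive statistics for the numeric columns.
summary = df.describe()

# 5. Correlation between the numeric features.
corr = df.select_dtypes("number").corr()

# 6. Outliers via the 1.5 * IQR rule (the rule box plots commonly use for whiskers).
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing)
print(summary)
print(corr)
print(outliers)
```

In this toy data, the only row flagged as an outlier is the one with a height of 250 cm.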

Conclusion

EDA is the first and most important step in any data science task. You need a good understanding of statistics and visualization techniques to explore and understand your data. I hope you liked this article on what EDA is in data science. Feel free to ask your questions in the comments section below.

Aman Kharwal