Exploratory Data Analysis (EDA) is the process of examining a dataset to understand what it contains before building models on it. In this article, I’ll walk you through what exploratory data analysis is and the steps and techniques of EDA in the data science process.
What is Exploratory Data Analysis (EDA)?
Exploratory data analysis is the most important step in any data science task. The main objectives of the EDA are:
- Analyze data distribution
- Detect outliers and anomalies
- Select the most important features
- Remove unnecessary columns
- Remove or fill in missing values
- Discover hidden patterns
In the EDA process, we also do feature selection and understand the data primarily by visualizing it. Understanding each feature and analyzing the relationships between features are also important parts of exploratory data analysis.
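Two of the objectives above, removing unnecessary columns and filling in missing values, can be sketched with pandas. This is a minimal illustration, not a recipe: the dataset and its column names (`age`, `city`, `source`) are made up, and filling with the median or the mode is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "city": ["Delhi", "Pune", "Delhi", None, "Delhi"],
    "source": ["web", "web", "web", "web", "web"],  # same value in every row
})

# Count missing values per column before deciding how to handle them.
missing_before = df.isna().sum()

# Drop columns that hold a single value: they cannot help tell rows apart.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
cleaned = df.drop(columns=constant_cols)

# Fill numeric gaps with the median and categorical gaps with the most frequent value.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna(cleaned["city"].mode().iloc[0])

print(missing_before)
print(cleaned)
```

Whether to drop rows or fill values depends on how much data is missing and why, so it is worth looking at `missing_before` first.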
EDA is the first step after data collection. After EDA, we move on to feature engineering and model selection, so EDA helps a lot in selecting the best features and the best model to predict the labels.
Techniques of Exploratory Data Analysis
Let’s understand exploratory data analysis techniques by looking at which techniques to use depending on the type of data we’re working on:
Type of Data | EDA Techniques You Should Use |
---|---|
Categorical | Descriptive Statistics |
Univariate | Line Plot and Histograms |
Bivariate | Scatter Plots |
Arrays | Heatmap |
Multivariate | 3D or 2D point cloud with a third variable represented in different colours, shapes or sizes. |
Multiple groups | Box Plots |
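Here is a rough sketch of how several rows of the table map to code, assuming pandas and matplotlib are installed. The data is randomly generated, and the names `height`, `weight`, `group` and `plot_overview` are hypothetical, chosen only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random data: two numeric features and one categorical feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "group": rng.choice(["a", "b", "c"], 200),
})
df["weight"] = 0.9 * df["height"] - 90 + rng.normal(0, 5, 200)

# Categorical: descriptive statistics (here, frequency counts).
group_counts = df["group"].value_counts()


def plot_overview(data):
    """Draw one panel per data type from the table and return the figure."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Univariate: histogram of a single feature.
    axes[0, 0].hist(data["height"], bins=20)
    axes[0, 0].set_title("Histogram of height")

    # Bivariate / multivariate: 2D scatter plot, third variable shown as colour.
    colours = data["group"].astype("category").cat.codes
    axes[0, 1].scatter(data["height"], data["weight"], c=colours, s=10)
    axes[0, 1].set_title("height vs weight, coloured by group")

    # Arrays: heatmap of a 2D array (here, the correlation matrix).
    corr = data[["height", "weight"]].corr()
    image = axes[1, 0].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    axes[1, 0].set_xticks(range(len(corr.columns)))
    axes[1, 0].set_xticklabels(corr.columns)
    axes[1, 0].set_yticks(range(len(corr.columns)))
    axes[1, 0].set_yticklabels(corr.columns)
    fig.colorbar(image, ax=axes[1, 0])
    axes[1, 0].set_title("Correlation heatmap")

    # Multiple groups: one box plot per group.
    labels = sorted(data["group"].unique())
    samples = [data.loc[data["group"] == g, "height"].to_numpy() for g in labels]
    axes[1, 1].boxplot(samples)
    axes[1, 1].set_xticklabels(labels)
    axes[1, 1].set_title("height by group")

    fig.tight_layout()
    return fig
```

To look at the result, call `fig = plot_overview(df)` and then `fig.savefig("eda_plots.png")`, or use `plt.show()` with an interactive backend.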
Now let’s have a look at the most useful techniques of EDA we should use depending on the objective:
Objective | EDA Techniques You Should Use |
---|---|
Get an idea of the distribution of features | Histogram |
Outlier detection | Histogram, scatter plots, box plots |
Understanding the relationship between two variables | 2D scatter plot and correlation |
Visualize the relationship between two input variables and one output variable | Heatmap |
High-dimensional data visualization | t-SNE or PCA + 2D/3D scatter plot |
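For the last row of the table, here is a minimal sketch of PCA that reduces high-dimensional data to two columns which can then go into a 2D scatter plot. It uses NumPy's SVD directly rather than a library PCA class, and both the data and the helper name `pca_2d` are assumptions made for this example.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered features
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:2].T, explained[:2]


# Hypothetical data: 100 samples, 5 features, driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

points, explained_ratio = pca_2d(X)
print(points.shape)           # one 2D point per sample
print(explained_ratio.sum())  # share of variance kept by the two components
```

If the features are on very different scales, it is usually worth standardizing them before the projection, otherwise the largest-scale feature tends to dominate the components.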
Steps of EDA
This is not a strict rule, but it is preferable to follow the steps below while performing EDA on any type of dataset:
- Importing the data
- Understanding data distribution and missing values
- Understanding each feature
- Computing descriptive statistics
- Understanding the correlation between features
- Detecting outliers
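The steps above can be sketched in a few lines of pandas. This is only an illustration under some assumptions: a small inline CSV with hypothetical columns (`area`, `rooms`, `price`) stands in for a real file, and the 1.5 * IQR rule is just one common way to flag outliers.

```python
import io

import pandas as pd

# Step 1: import the data. The inline CSV stands in for a real file;
# with a file on disk, pass its path to pd.read_csv instead.
csv_text = """area,rooms,price
50,2,150
65,3,200
80,3,240
,4,310
120,5,360
95,4,2000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: check for missing values in each column.
missing = df.isna().sum()

# Steps 3 and 4: understand each feature through descriptive statistics.
summary = df.describe()

# Step 5: pairwise (Pearson) correlation between features.
correlation = df.corr()

# Step 6: flag outliers with the 1.5 * IQR rule (the rule behind box plot whiskers).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

print(missing)
print(summary)
print(correlation)
print(outliers)
```

In this toy data, the row with a price of 2000 is the only one flagged, since it lies far above the upper bound computed from the quartiles.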
Conclusion
EDA is the first and most important step in any data science task. You need to have a good understanding of statistics and visualization techniques to explore and understand your data. Hope you liked this article on what EDA is in data science. Please feel free to ask your valuable questions in the comments section below.