Stratified Sampling is a method of sampling from a population that can be divided into subpopulations, known as strata. In this article, I’m going to walk you through a data science tutorial on how to perform stratified sampling with Python.
Stratified Sampling in Data Science
In Data Science, an important goal in any estimation problem is to obtain an estimator of a parameter that reflects the salient characteristics of the data.
If the data set is homogeneous with respect to the characteristic under study, then simple random sampling can be expected to give a representative sample and, in turn, the sample mean will serve as a good estimator of the population mean.
However, the variance of the sample mean depends not only on the sample size and the sampling fraction but also on the variance of the population.
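To make that dependence concrete, here is the standard textbook expression for simple random sampling without replacement. The notation is my own assumption for this sketch: N is the population size, n the sample size, f = n/N the sampling fraction, and S^2 the population variance.

```latex
\operatorname{Var}(\bar{y}) = (1 - f)\,\frac{S^{2}}{n},
\qquad f = \frac{n}{N},
\qquad S^{2} = \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^{2}
```

For a fixed n and f, a larger population variance S^2 gives a less precise sample mean, which is why heterogeneous data calls for a different sampling scheme.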
To increase the precision of an estimator, we need a sampling scheme that accounts for the heterogeneity of the population. When the data set is heterogeneous with respect to the characteristic of interest, one such sampling procedure is stratified sampling.
In Data Science, the basic idea of stratified sampling is to:
- Divide the entire heterogeneous population into smaller groups or subpopulations (strata) such that the sampling units within each subpopulation are homogeneous with respect to the characteristic of interest.
- Treat each subpopulation as a separate population and draw a sample from each one.
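The two steps above can be sketched with pandas. This is only a minimal illustration: the population below is synthetic, and the column names `stratum` and `value` are hypothetical names I made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical heterogeneous population: three groups with very different means.
population = pd.DataFrame({
    "stratum": np.repeat(["low", "mid", "high"], [6000, 3000, 1000]),
    "value": np.concatenate([
        rng.normal(10, 1, 6000),
        rng.normal(50, 2, 3000),
        rng.normal(200, 5, 1000),
    ]),
})

# Step 1: divide the population into strata (groupby).
# Step 2: treat each stratum as its own population and draw a
# simple random sample of 1% from it (60, 30 and 10 rows here).
stratified_sample = population.groupby("stratum").sample(frac=0.01, random_state=42)

# The stratum shares in the sample match those in the population.
print(population["stratum"].value_counts(normalize=True))
print(stratified_sample["stratum"].value_counts(normalize=True))
```

Sampling the same fraction from every stratum is known as proportional allocation; other allocations are possible, but this one keeps the sample's composition identical to the population's.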
Stratified Sampling with Python
In this section, I will take you through how to perform stratified sampling with Python. I will use the California housing dataset for this task. Let’s start with the necessary data preparation:
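A minimal sketch of this preparation step is shown below. It assumes the data has a `median_income` column and bins it into five income categories; the bin edges, the `income_cat` column name, the helper function and the CSV path in the comment are all assumptions for illustration. A small synthetic stand-in is used so that the snippet runs without the real file.

```python
import numpy as np
import pandas as pd

def add_income_category(housing: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an `income_cat` column binned from `median_income`."""
    out = housing.copy()
    out["income_cat"] = pd.cut(
        out["median_income"],
        bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],  # assumed bin edges
        labels=[1, 2, 3, 4, 5],
    )
    return out

# With the real data this would be something like:
#   housing = pd.read_csv("housing.csv")  # hypothetical local path
# A synthetic stand-in keeps the sketch runnable on its own.
rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})
housing = add_income_category(housing)

# Number of rows that fall into each income category.
print(housing["income_cat"].value_counts().sort_index())
```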
Now we are ready to do stratified sampling with Python based on the categories of income in the dataset. For this, we can use the StratifiedShuffleSplit class of Scikit-Learn:
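Here is a sketch of how that split might look. The 80/20 split, the `random_state`, and the variable names `strat_train_set` and `strat_test_set` are my assumptions, and the data is again a synthetic stand-in with an `income_cat` column rather than the real housing file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the prepared housing data (assumed column names).
rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.lognormal(1.2, 0.5, size=5000)})
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# One stratified 80/20 split, using income_cat as the stratum label.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

def proportions(df: pd.DataFrame) -> pd.Series:
    """Share of rows in each income category."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

# The category shares in the test set should closely match the full data.
print(proportions(housing))
print(proportions(strat_test_set))
```

Comparing the two printed series is a quick way to check that the split preserved the income distribution.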
Now let’s visualize the training set that we have after performing stratified sampling with Python. As the data is based on geographical locations, I will use a scatter plot to visualize this dataset:
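A scatter plot along these lines could be drawn as in the sketch below. The column names `longitude`, `latitude`, `population` and `median_house_value` are assumed, and the training set here is random stand-in data, so the picture will not resemble the real map.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in for the stratified training set (assumed column names).
rng = np.random.default_rng(0)
n = 500
strat_train_set = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "population": rng.integers(100, 5000, n),
    "median_house_value": rng.uniform(50_000, 500_000, n),
})

ax = strat_train_set.plot(
    kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=strat_train_set["population"] / 100,  # marker size tracks population
    c="median_house_value", cmap="jet",     # jet runs from blue (low) to red (high)
    colorbar=True, label="population", figsize=(10, 7),
)
ax.legend()
# Call plt.show() when running this as a script to display the figure.
```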
The red marks represent expensive locations, the blue marks represent cheaper locations, and the larger circles indicate the areas with a larger population. I hope you liked this article on how to perform stratified sampling with Python. Feel free to ask your valuable questions in the comments section below.