Access to safe drinking water is one of the essential needs of all human beings. From a legal point of view, access to drinking water is one of the fundamental human rights. Many factors affect water quality, and analyzing it is also one of the major research areas in machine learning. So if you want to learn how to do water quality analysis with machine learning, this article is for you. In this article, I will walk you through water quality analysis with Machine Learning using Python.
Water Quality Analysis
One of the main areas of research in machine learning is the analysis of water quality. It is also known as water potability analysis because our task here is to understand all the factors that affect water potability and train a machine learning model that can classify whether a specific water sample is safe or unfit for consumption.
For the water quality analysis task, I will be using a Kaggle dataset that contains data on all of the major factors that affect the potability of water. All of the factors that affect water quality are very important, so we need to briefly explore each feature of this dataset before training a machine learning model to predict whether a water sample is safe or unsuitable for consumption. You can download the dataset I’m using for the water quality analysis task from here.
Water Quality Analysis using Python
I’ll start the water quality analysis task by importing the necessary Python libraries and the dataset:
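The import and loading code is not reproduced in this copy of the article. A minimal sketch of that step, assuming pandas and a local copy of the Kaggle CSV: the file name `water_potability.csv` and the helper `load_water_data` are assumptions, and the inline sample reuses the first two rows of the preview below so that the snippet runs without the file.

```python
import io

import pandas as pd

# Assumed file name for the downloaded Kaggle CSV; adjust to your local path.
CSV_PATH = "water_potability.csv"


def load_water_data(source=CSV_PATH):
    """Read the potability CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Stand-in for the real file: the first two preview rows, with blanks for NaN.
sample_csv = io.StringIO(
    "ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,"
    "Organic_carbon,Trihalomethanes,Turbidity,Potability\n"
    ",204.890455,20791.318981,7.300212,368.516441,564.308654,"
    "10.379783,86.990970,2.963135,0\n"
    "3.716080,129.422921,18630.057858,6.635246,,592.885359,"
    "15.180013,56.329076,4.500656,0\n"
)
sample = load_water_data(sample_csv)
print(sample.head())
```

With the real file in place, `data = load_water_data()` would give the DataFrame used in the rest of the article.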
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
I can see null values in the first preview of this dataset itself, so before we go ahead, let’s remove all the rows that contain null values:
data = data.dropna()
data.isnull().sum()
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
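Dropping rows discards every sample that has at least one missing value, so it can help to check how much data is lost. A minimal sketch, assuming pandas; the three-column frame is a cut-down stand-in that reuses values from the preview above.

```python
import numpy as np
import pandas as pd

# Three columns of the five preview rows; NaN marks the missing entries.
preview = pd.DataFrame({
    "ph": [np.nan, 3.716080, 8.099124, 8.316766, 9.092223],
    "Sulfate": [368.516441, np.nan, np.nan, 356.886136, 310.135738],
    "Potability": [0, 0, 0, 0, 0],
})

# Count missing values per column, then drop every incomplete row.
missing_per_column = preview.isnull().sum()
cleaned = preview.dropna()

print(missing_per_column)
print(f"Rows kept: {len(cleaned)} of {len(preview)}")
```

If too many rows disappear on the full dataset, filling the gaps (for example with a column median) is a common alternative to dropping them.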
The Potability column of this dataset is the column we need to predict because it contains values 0 and 1 that indicate whether the water is potable (1) or unfit (0) for consumption. So let’s see the distribution of 0 and 1 in the Potability column:
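The plot itself is not reproduced here. A minimal sketch of how the class counts could be inspected, assuming pandas; the tiny `data` frame is a made-up placeholder for the cleaned dataset, not real measurements.

```python
import pandas as pd

# Placeholder for the cleaned dataset; only the Potability column matters here.
data = pd.DataFrame({"Potability": [0, 0, 0, 1, 1]})

# Absolute counts and relative shares of each class.
counts = data["Potability"].value_counts().sort_index()
shares = data["Potability"].value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(2))
```

On the real data, `counts.plot(kind="bar")` would draw a comparable bar chart (this needs matplotlib installed).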

Note that this dataset is not balanced: there are more samples labelled 0 than samples labelled 1.
As mentioned above, none of the factors that affect water quality can be ignored, so let’s explore all the columns one by one. Let’s start by looking at the ph column:
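The figures for the individual columns are missing from this copy. A minimal sketch of one way to draw such a distribution, assuming matplotlib (the original figures may have been made with a different library); the helper `plot_feature_by_class` is hypothetical and the `demo` values are synthetic placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_by_class(data, column, target="Potability", bins=30):
    """Overlay one histogram of `column` per class of `target`."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for label, group in data.groupby(target):
        ax.hist(group[column], bins=bins, alpha=0.5, label=f"{target} = {label}")
    ax.set_xlabel(column)
    ax.set_ylabel("Number of samples")
    ax.set_title(f"Distribution of {column}")
    ax.legend()
    return fig


# Synthetic placeholder values, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, size=200),
    "Potability": rng.integers(0, 2, size=200),
})
fig = plot_feature_by_class(demo, "ph")
```

The same helper can be reused for the other columns by changing the column name, and `fig.savefig("ph.png")` or `plt.show()` displays the result.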

The ph column represents the pH value of the water, which is an important factor in evaluating the acid-base balance of the water. The pH value of drinking water should be between 6.5 and 8.5. Now let’s look at the second factor affecting water quality in the dataset:

The figure above shows the distribution of water hardness in the dataset. The hardness of water usually depends on its source, but water with a hardness of 120-200 milligrams per litre is drinkable. Now let’s take a look at the next factor affecting water quality:

The figure above represents the distribution of total dissolved solids in water in the dataset. All organic and inorganic minerals present in water are called dissolved solids. Water with a very high concentration of dissolved solids is highly mineralized. Now let’s take a look at the next factor affecting water quality:

The figure above represents the distribution of chloramine in water in the dataset. Chloramine and chlorine are disinfectants used in public water systems. Now let’s take a look at the next factor affecting water quality:

The figure above shows the distribution of sulfate in water in the dataset. Sulfates are substances naturally present in minerals, soil and rocks. Water containing less than 500 milligrams of sulfate per litre is safe to drink. Now let’s see the next factor:

The figure above represents the distribution of water conductivity in the dataset. Water conducts electricity because of the ions dissolved in it, while pure water is a poor conductor. Water with an electrical conductivity of less than 500 microsiemens per centimetre is drinkable. Now let’s see the next factor:

The figure above represents the distribution of organic carbon in water in the dataset. Organic carbon comes from the breakdown of natural organic materials and synthetic sources. Water containing less than 25 milligrams of organic carbon per litre is considered safe to drink. Now let’s take a look at the next factor that affects the quality of drinking water:

The figure above represents the distribution of trihalomethanes or THMs in water in the dataset. THMs are chemicals found in chlorine-treated water. Water containing less than 80 micrograms of THMs per litre is considered safe to drink. Now let’s take a look at the next factor in the dataset that affects drinking water quality:

The figure above represents the distribution of turbidity in water. The turbidity of water depends on the amount of solid matter present in suspension. Water with a turbidity of less than 5 NTU (nephelometric turbidity units) is considered drinkable.
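The guideline ranges quoted in this section can also be checked against the data directly. A minimal sketch, assuming pandas, using the pH range of 6.5 to 8.5 and the five ph values from the dataset preview; the helper `share_within_range` is hypothetical.

```python
import pandas as pd


def share_within_range(values, low, high):
    """Fraction of non-missing values that lie in the closed interval [low, high]."""
    values = values.dropna()
    return values.between(low, high).mean()


# The five ph values from the dataset preview (the first one is missing).
ph = pd.Series([None, 3.716080, 8.099124, 8.316766, 9.092223], dtype="float64")
result = share_within_range(ph, 6.5, 8.5)
print(result)  # two of the four known values fall inside the range: 0.5
```

On the full dataset, `share_within_range(data["ph"], 6.5, 8.5)` would report the share of samples inside the recommended pH range.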
Water Quality Prediction Model using Python
In the above section, we explored all the features that affect water quality. Now, the next step is to train a machine learning model for the task of water quality analysis using Python. For this task, I will be using the PyCaret library in Python. If you have never used this library before, you can easily install it on your system using the pip command:
- pip install pycaret
Before training a machine learning model, let’s have a look at the correlation of all the features with the ph column in the dataset:
correlation = data.corr()
correlation["ph"].sort_values(ascending=False)
ph                 1.000000
Hardness           0.108948
Organic_carbon     0.028375
Trihalomethanes    0.018278
Potability         0.014530
Conductivity       0.014128
Sulfate            0.010524
Chloramines       -0.024768
Turbidity         -0.035849
Solids            -0.087615
Name: ph, dtype: float64
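Since Potability is the prediction target, the same call can be pointed at that column instead. A minimal sketch, assuming pandas and NumPy; the helper `correlation_with_target` is hypothetical, and the random frame is a stand-in so the snippet runs on its own, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def correlation_with_target(data, target="Potability"):
    """Pearson correlation of every other column with `target`, sorted descending."""
    corr = data.corr()[target]
    # The target always correlates perfectly with itself, so leave it out.
    return corr.drop(labels=target).sort_values(ascending=False)


# Synthetic placeholder data, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, size=100),
    "Hardness": rng.normal(190.0, 30.0, size=100),
    "Potability": rng.integers(0, 2, size=100),
})
result = correlation_with_target(demo)
print(result)
```

On the real data, `correlation_with_target(data)` would rank the features by their linear relationship with potability.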
Now below is how you can see which machine learning algorithm is best for this dataset by using the PyCaret library in Python:
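from pycaret.classification import *

# silent=True skips the interactive data-type confirmation in PyCaret 2.x;
# the argument was removed in PyCaret 3.x, so drop it there.
clf = setup(data, target = "Potability", silent = True, session_id = 786)
compare_models()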

According to the above result, the random forest classification algorithm is best for training a machine learning model for the task of water quality analysis. So let’s train the model and examine its predictions:
model = create_model("rf")
predict = predict_model(model, data=data)
predict.head()

The predictions above look satisfactory, although they are made on the same data that was passed to the model for training.
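Scoring a model on rows it has not seen gives a less optimistic estimate of its quality. A minimal sketch of such a hold-out check, assuming scikit-learn rather than PyCaret; the helper `holdout_accuracy`, the split size and the synthetic `demo` frame are all assumptions for illustration, not the article's method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(data, target="Potability", test_size=0.2, seed=786):
    """Train a random forest on one split and score it on the unseen split."""
    X = data.drop(columns=[target])
    y = data[target]
    # Stratify so both splits keep the same share of potable samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic placeholder data, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["ph", "Hardness", "Solids"])
demo["Potability"] = (demo["ph"] + rng.normal(scale=0.5, size=300) > 0).astype(int)

score = holdout_accuracy(demo)
print(f"Hold-out accuracy on synthetic data: {score:.2f}")
```

Because the dataset is imbalanced, accuracy alone can be misleading, so precision and recall for the potable class are worth checking as well.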
Summary
So this is how you can analyze the quality of water and train a machine learning model to classify safe and unsafe water for drinking. I hope you liked this article on Water Quality Analysis with Machine Learning using Python. Feel free to ask your valuable questions in the comments section below.