Water Quality Analysis

Access to safe drinking water is one of the essential needs of all human beings. From a legal point of view, access to drinking water is one of the fundamental human rights. Many factors affect water quality, and analyzing it is also one of the major research areas in machine learning. So if you want to learn how to do water quality analysis with machine learning, this article is for you. In this article, I will walk you through water quality analysis with Machine Learning using Python.

Water Quality Analysis

One of the main areas of research in machine learning is the analysis of water quality. It is also known as water potability analysis because our task here is to understand all the factors that affect water potability and train a machine learning model that can classify whether a specific water sample is safe or unfit for consumption.

For the water quality analysis task, I will be using a Kaggle dataset that contains data on all of the major factors that affect the potability of water. All of the factors that affect water quality are very important, so we need to briefly explore each feature of this dataset before training a machine learning model to predict whether a water sample is safe or unsuitable for consumption. You can download the dataset I’m using for the water quality analysis task from here.

Water Quality Analysis using Python

I’ll start the water quality analysis task by importing the necessary Python libraries and the dataset:

         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  Organic_carbon  Trihalomethanes  Turbidity  Potability
0       NaN  204.890455  20791.318981     7.300212  368.516441    564.308654       10.379783        86.990970   2.963135           0
1  3.716080  129.422921  18630.057858     6.635246         NaN    592.885359       15.180013        56.329076   4.500656           0
2  8.099124  224.236259  19909.541732     9.275884         NaN    418.606213       16.868637        66.420093   3.055934           0
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516       18.436524       100.341674   4.628771           0
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813       11.558279        31.997993   4.075075           0
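
The loading code itself is not shown above, so here is a minimal sketch of how a preview like this could be produced with pandas. To keep it runnable without the download, the five preview rows are inlined as text; the file name in the comment is an assumption, not the actual name of the Kaggle file.

```python
import io

import pandas as pd

# The five preview rows shown above, inlined so the sketch runs without the file.
SAMPLE_CSV = """ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
"""

# With the downloaded file you would pass its path instead, for example
# pd.read_csv("water_potability.csv") (hypothetical file name).
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.head())          # first rows, as in the preview
print(data.isnull().sum())  # missing values per column
```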

I can see null values in the first preview of this dataset itself, so before we go ahead, let’s remove all the rows that contain null values:

data = data.dropna()
data.isnull().sum()
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
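
Dropping rows is the simplest fix, but it can throw away a lot of data, so it may be worth checking how many rows are lost. The sketch below does that on the ph and Sulfate values from the five preview rows; the variable names are mine and only used for illustration.

```python
import pandas as pd

# ph and Sulfate values from the five preview rows (None marks a missing value).
preview = pd.DataFrame({
    "ph": [None, 3.716080, 8.099124, 8.316766, 9.092223],
    "Sulfate": [368.516441, None, None, 356.886136, 310.135738],
})

rows_before = len(preview)
cleaned = preview.dropna()                 # drop every row with at least one null
rows_dropped = rows_before - len(cleaned)  # how much data the cleanup cost

print(f"dropped {rows_dropped} of {rows_before} rows")
```

On the preview rows alone, three of the five rows contain a null, which is why it helps to run the same check on the full dataset before deciding to drop.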

The Potability column of this dataset is the column we need to predict because it contains values 0 and 1 that indicate whether the water is potable (1) or unfit (0) for consumption. So let’s see the distribution of 0 and 1 in the Potability column:

water quality dataset distribution

Note that this dataset is not balanced: there are more samples labelled 0 than samples labelled 1.
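
If you prefer numbers to a chart, the class balance can also be read with value_counts. This is a minimal sketch on made-up labels, so the printed shares are not the real shares of this dataset; with the real data you would use data["Potability"] instead.

```python
import pandas as pd

# Made-up labels for illustration only; with the real data use data["Potability"].
labels = pd.Series([0, 0, 0, 1, 1], name="Potability")

counts = labels.value_counts()                # rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```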

As mentioned above, none of the factors that affect water quality can be ignored, so let’s explore all the columns one by one. Let’s start by looking at the ph column:

water quality analysis: ph value

The ph column represents the pH value of the water, which is an important factor in evaluating its acid-base balance. The pH value of drinking water should be between 6.5 and 8.5. Now let’s look at the second factor affecting water quality in the dataset:
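
To connect the 6.5 to 8.5 range with the data, you can check what share of samples falls inside it. The sketch below uses only the ph values from the five preview rows, so the result says nothing about the full dataset.

```python
import pandas as pd

# ph values from the five preview rows (None marks the missing value).
ph = pd.Series([None, 3.716080, 8.099124, 8.316766, 9.092223], name="ph")

# between() includes both endpoints by default; missing values count as False.
in_range = ph.between(6.5, 8.5)

# Share of in-range samples among those that actually have a measurement.
share = in_range.sum() / ph.notna().sum()
print(f"{share:.0%} of measured samples fall in the 6.5 to 8.5 range")
```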

hardness

The figure above shows the distribution of water hardness in the dataset. The hardness of water usually depends on its source, but water with a hardness of 120-200 milligrams per litre is drinkable. Now let’s take a look at the next factor affecting water quality:

water quality analysis: solids

The figure above represents the distribution of total dissolved solids in water in the dataset. All organic and inorganic minerals present in water are called dissolved solids. Water with a very high concentration of dissolved solids is highly mineralized. Now let’s take a look at the next factor affecting water quality:

chloramines

The figure above represents the distribution of chloramine in water in the dataset. Chloramine and chlorine are disinfectants used in public water systems. Now let’s take a look at the next factor affecting water quality:

water quality analysis: sulfate

The figure above shows the distribution of sulfate in water in the dataset. Sulfates are substances naturally present in minerals, soil and rocks. Water containing less than 500 milligrams of sulfate per litre is safe to drink. Now let’s see the next factor:

conductivity

The figure above represents the distribution of water conductivity in the dataset. The purest form of water is a poor conductor of electricity; it is the dissolved ions that make water conduct, so conductivity rises with their concentration. Water with an electrical conductivity of less than 500 μS/cm is drinkable. Now let’s see the next factor:

water quality analysis: organic carbon

The figure above represents the distribution of organic carbon in water in the dataset. Organic carbon comes from the breakdown of natural organic materials and synthetic sources. Water containing less than 25 milligrams of organic carbon per litre is considered safe to drink. Now let’s take a look at the next factor that affects the quality of drinking water:

Trihalomethanes

The figure above represents the distribution of trihalomethanes or THMs in water in the dataset. THMs are chemicals found in chlorine-treated water. Water containing less than 80 micrograms of THMs per litre is considered safe to drink. Now let’s take a look at the next factor in the dataset that affects drinking water quality:

water quality analysis: Turbidity

The figure above represents the distribution of turbidity in water. The turbidity of water depends on the amount of solid matter suspended in it. Water with a turbidity below 5 NTU (nephelometric turbidity units) is considered drinkable.
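
The ranges quoted in this section can be collected into one small check. This is only a sketch: the numbers are the ones given in this article, the bounds are treated as inclusive, and the names GUIDELINES and out_of_range are made up for illustration. It is a reading aid, not a replacement for the trained model or for official standards.

```python
# Ranges as quoted in this article: (lower bound, upper bound), None = no bound.
# Bounds are treated as inclusive here, which is an assumption of this sketch.
GUIDELINES = {
    "ph": (6.5, 8.5),
    "Hardness": (120, 200),
    "Sulfate": (None, 500),
    "Conductivity": (None, 500),
    "Organic_carbon": (None, 25),
    "Trihalomethanes": (None, 80),
    "Turbidity": (None, 5),
}


def out_of_range(sample):
    """Return the feature names whose values fall outside GUIDELINES."""
    flagged = []
    for feature, (low, high) in GUIDELINES.items():
        value = sample.get(feature)
        if value is None:
            continue  # skip features that were not measured
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(feature)
    return flagged


# Row 3 of the preview shown at the start of the article.
row3 = {
    "ph": 8.316766,
    "Hardness": 214.373394,
    "Sulfate": 356.886136,
    "Conductivity": 363.266516,
    "Organic_carbon": 18.436524,
    "Trihalomethanes": 100.341674,
    "Turbidity": 4.628771,
}
print(out_of_range(row3))  # ['Hardness', 'Trihalomethanes']
```

Solids and Chloramines are left out because no numeric range is quoted for them above.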

Water Quality Prediction Model using Python

In the above section, we explored all the features that affect water quality. Now, the next step is to train a machine learning model for the task of water quality analysis using Python. For this task, I will be using the PyCaret library in Python. If you have never used this library before, you can easily install it on your system using the pip command:

pip install pycaret

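Before training a machine learning model, let’s have a look at how the features in the dataset correlate with one another. The output below lists the correlation of every column with the ph column; replacing "ph" with "Potability" in the second line gives the correlations with the target instead: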

correlation = data.corr()
correlation["ph"].sort_values(ascending=False)
ph                 1.000000
Hardness           0.108948
Organic_carbon     0.028375
Trihalomethanes    0.018278
Potability         0.014530
Conductivity       0.014128
Sulfate            0.010524
Chloramines       -0.024768
Turbidity         -0.035849
Solids            -0.087615
Name: ph, dtype: float64
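
The output above ranks the columns against ph. To rank them against the target, the same call can be pointed at the Potability column. Here is a minimal sketch on made-up numbers, so the values it prints are purely illustrative and not results from this dataset.

```python
import pandas as pd

# Made-up values for illustration only; substitute the cleaned dataset.
toy = pd.DataFrame({
    "ph": [6.0, 7.0, 8.0, 9.0],
    "Turbidity": [5.0, 4.0, 3.0, 2.0],
    "Potability": [0, 0, 1, 1],
})

target_corr = (
    toy.corr()["Potability"]
    .drop("Potability")  # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(target_corr)
```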

Now below is how you can see which machine learning algorithm is best for this dataset by using the PyCaret library in Python:

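# Import only the PyCaret functions used below rather than everything with *
from pycaret.classification import setup, compare_models, create_model, predict_model
# silent=True skips the interactive data-type prompt in PyCaret 2.x;
# newer releases may not accept this argument, so drop it if setup() complains
clf = setup(data, target = "Potability", silent = True, session_id = 786)
compare_models()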
Model Selection for water quality analysis

According to the above result, the random forest classification algorithm is best for training a machine learning model for the task of water quality analysis. So let’s train the model and examine its predictions:

model = create_model("rf")
predict = predict_model(model, data=data)
predict.head()
water quality analysis with machine learning

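The above results look satisfactory, but keep in mind that these predictions are made on the same data the model was trained on, so they are more optimistic than what you should expect on new water samples. I hope you liked this Machine Learning project on Water Quality Analysis using Python.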
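
A fairer check is to score the model on rows it has never seen. The sketch below shows that idea with scikit-learn rather than PyCaret, so it is not the code used in this article, and it runs on synthetic data from make_classification, not on the water dataset. Treat the printed accuracies as an illustration of the gap between training and held-out scores only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with nine features; this is NOT the Kaggle water data.
X, y = make_classification(n_samples=500, n_features=9, random_state=786)

# Hold out 30% of the rows, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=786
)

model = RandomForestClassifier(random_state=786)
model.fit(X_train, y_train)

# Accuracy on rows the model has seen versus rows it has not.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on training rows: {train_acc:.3f}")
print(f"accuracy on held-out rows: {test_acc:.3f}")
```

The stratify argument matters for a dataset like this one, where the two classes are not equally frequent.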

Summary

So this is how you can analyze the quality of water and train a machine learning model to classify whether water is safe or unsafe to drink. I hope you liked this article on Water Quality Analysis with Machine Learning using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal