Rainfall Prediction with Machine Learning

Rainfall Prediction is one of the difficult and uncertain tasks that have a significant impact on human society. Timely and accurate forecasting can proactively help reduce human and financial loss. This study presents a set of experiments that involve the use of common machine learning techniques to create models that can predict whether it will rain tomorrow or not based on the weather data for that day in major cities in Australia.

I’ve always liked knowing the parameters meteorologists take into account before making a weather forecast, so I found the dataset interesting. From an expert’s point of view, however, this dataset is fairly straightforward. At the end of this article, you will learn:


  • How to balance an unbalanced dataset
  • How label encoding is done for categorical variables
  • How a sophisticated imputation method like MICE is used
  • How outliers can be detected and excluded from the data
  • How filter and wrapper methods are used for feature selection
  • How to compare the speed and performance of different popular models
  • Which metrics best judge performance on an unbalanced dataset

Let’s start this task of rainfall prediction by importing the data. You can download the dataset I am using in this task from here:

import pandas as pd
from google.colab import files

uploaded = files.upload()
full_data = pd.read_csv('weatherAUS.csv')
full_data.head()

Data Exploration

We will first check the number of rows and columns. Next, we’ll check the size of the dataset to decide if it needs size compression.
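Both checks are one-liners in pandas. A minimal sketch on a toy frame (on the real data the calls would simply be `full_data.shape` and `full_data.memory_usage(deep=True).sum()`):

```python
import pandas as pd

# Toy frame standing in for full_data; shape gives (rows, columns),
# memory_usage(deep=True) gives the in-memory size per column in bytes.
toy = pd.DataFrame({'MinTemp': [13.4, 7.4, 12.9],
                    'MaxTemp': [22.9, 25.1, 25.7],
                    'RainToday': ['No', 'No', 'Yes']})
print(toy.shape)                          # -> (3, 3)
print(toy.memory_usage(deep=True).sum())  # approximate size in bytes
```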

(142193, 24)

“RainToday” and “RainTomorrow” are objects (Yes / No). I will convert them to binary (1/0) for our convenience.

full_data['RainToday'].replace({'No': 0, 'Yes': 1}, inplace=True)
full_data['RainTomorrow'].replace({'No': 0, 'Yes': 1}, inplace=True)

Next, we will check if the dataset is unbalanced or balanced. If the data set is unbalanced, we need to either downsample the majority or oversample the minority to balance it.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5))
full_data.RainTomorrow.value_counts(normalize=True).plot(
    kind='bar', color=['skyblue', 'navy'], alpha=0.9, rot=0)
plt.title('RainTomorrow Indicator No(0) and Yes(1) in the Imbalanced Dataset')
plt.show()

We can observe that the presence of “0” and “1” is almost in the 78:22 ratio. So there is a class imbalance and we have to deal with it. To fight against the class imbalance, we will use here the oversampling of the minority class. Since the size of the dataset is quite small, majority class subsampling wouldn’t make much sense here.

Handling Class Imbalance For Rainfall Prediction

from sklearn.utils import resample

no = full_data[full_data.RainTomorrow == 0]
yes = full_data[full_data.RainTomorrow == 1]
yes_oversampled = resample(yes, replace=True, n_samples=len(no), random_state=123)
oversampled = pd.concat([no, yes_oversampled])

fig = plt.figure(figsize=(8, 5))
oversampled.RainTomorrow.value_counts(normalize=True).plot(
    kind='bar', color=['skyblue', 'navy'], alpha=0.9, rot=0)
plt.title('RainTomorrow Indicator No(0) and Yes(1) after Oversampling (Balanced Dataset)')
plt.show()

Now I will check the missing-data pattern in the dataset:

# Missing Data Pattern in Training Data
import seaborn as sns
sns.heatmap(oversampled.isnull(), cbar=False, cmap='PuBu')

Obviously, “Evaporation”, “Sunshine”, “Cloud9am”, “Cloud3pm” are the features with a high missing percentage. So we will check the details of the missing data for these 4 features.

total = oversampled.isnull().sum().sort_values(ascending=False)
percent = (oversampled.isnull().sum() / oversampled.isnull().count()).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing.head(4)
                  Total	Percent
Sunshine	104831	0.475140
Evaporation	95411	0.432444
Cloud3pm	85614	0.388040
Cloud9am	81339	0.368664

We observe that the 4 features have less than 50 per cent missing data. So instead of rejecting them completely, we’ll consider them in our model with proper imputation.

Imputation and Transformation

We will impute the categorical columns with mode, and then we will use the label encoder to convert them to numeric numbers. Once all the columns in the full data frame are converted to numeric columns, we will impute the missing values ​​using the Multiple Imputation by Chained Equations (MICE) package.

Then we will detect outliers using the interquartile range and remove them to get the final working dataset. Finally, we will check the correlation between the different variables, and if we find a pair of highly correlated variables, we will discard one while keeping the other.

Index(['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm'], dtype='object')
# Impute categorical variables with the mode
oversampled['Date'] = oversampled['Date'].fillna(oversampled['Date'].mode()[0])
oversampled['Location'] = oversampled['Location'].fillna(oversampled['Location'].mode()[0])
oversampled['WindGustDir'] = oversampled['WindGustDir'].fillna(oversampled['WindGustDir'].mode()[0])
oversampled['WindDir9am'] = oversampled['WindDir9am'].fillna(oversampled['WindDir9am'].mode()[0])
oversampled['WindDir3pm'] = oversampled['WindDir3pm'].fillna(oversampled['WindDir3pm'].mode()[0])
# Convert categorical features to numeric features with Label Encoding
from sklearn.preprocessing import LabelEncoder

lencoders = {}
for col in oversampled.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    oversampled[col] = lencoders[col].fit_transform(oversampled[col])
import warnings
warnings.filterwarnings("ignore")

# Multiple Imputation by Chained Equations (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

MiceImputed = oversampled.copy(deep=True)
mice_imputer = IterativeImputer()
MiceImputed.iloc[:, :] = mice_imputer.fit_transform(oversampled)

Thus, the dataframe no longer contains any NaN values. We will now detect and eliminate outliers from the dataset using the interquartile range.

# Detecting outliers with the IQR
Q1 = MiceImputed.quantile(0.25)
Q3 = MiceImputed.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Date             1535.000000
Location           25.000000
MinTemp             9.300000
MaxTemp            10.200000
Rainfall            2.400000
Evaporation         4.119679
Sunshine            5.947404
WindGustDir         9.000000
WindGustSpeed      19.000000
WindDir9am          8.000000
WindDir3pm          8.000000
WindSpeed9am       13.000000
WindSpeed3pm       11.000000
Humidity9am        26.000000
Humidity3pm        30.000000
Pressure9am         8.800000
Pressure3pm         8.800000
Cloud9am            4.000000
Cloud3pm            3.681346
Temp9am             9.300000
Temp3pm             9.800000
RainToday           1.000000
RISK_MM             5.200000
RainTomorrow        1.000000
dtype: float64
# Removing outliers from the dataset
MiceImputed = MiceImputed[~((MiceImputed < (Q1 - 1.5 * IQR)) |
                            (MiceImputed > (Q3 + 1.5 * IQR))).any(axis=1)]
MiceImputed.shape
(156852, 24)

After running the outlier-removal snippet, the dataset has the shape (156852, 24): every row containing at least one value outside the 1.5 × IQR fences has been dropped. We are now going to check for multicollinearity, that is, whether any feature is strongly correlated with another.

# Correlation Heatmap
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

corr = MiceImputed.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool is deprecated; use the builtin bool
f, ax = plt.subplots(figsize=(20, 20))
cmap = sns.diverging_palette(250, 25, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=None, center=0, square=True,
            annot=True, linewidths=.5, cbar_kws={"shrink": .9})

The following feature pairs are strongly correlated with each other:

  • MaxTemp and MinTemp
  • Pressure9am and Pressure3pm
  • Temp9am and Temp3pm
  • Evaporation and MaxTemp
  • MaxTemp and Temp3pm

However, none of these correlations reaches a perfect “1”, so we are not removing any feature.
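Such pairs can also be pulled out programmatically from the correlation matrix. A minimal sketch on toy data, with a hypothetical helper `correlated_pairs` (not part of the article's code):

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds threshold,
    scanning only the upper triangle to avoid duplicates."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 3))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# Toy data: x and y are strongly correlated, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'x': x,
                   'y': x + rng.normal(scale=0.1, size=200),
                   'z': rng.normal(size=200)})
print(correlated_pairs(df))  # only the ('x', 'y') pair survives the threshold
```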

However, we can dig deeper into the pairwise correlation between these highly correlated features by examining the following pair plot. Each of the paired plots shows clearly distinct clusters for RainTomorrow’s “yes” and “no” classes, with very minimal overlap between them.

sns.pairplot(data=MiceImputed,
             vars=('MaxTemp', 'MinTemp', 'Pressure9am', 'Pressure3pm',
                   'Temp9am', 'Temp3pm', 'Evaporation'),
             hue='RainTomorrow')

Feature Selection for Rainfall Prediction

I will use both the filter method and the wrapper method for feature selection to train our rainfall prediction model.

Selecting features with the filter method (chi-square test): before doing this, we must first normalize our data. We use MinMaxScaler instead of StandardScaler to avoid negative values, since the chi-square test requires non-negative inputs.

# Normalizing data to the [0, 1] range
from sklearn import preprocessing

r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(MiceImputed)
modified_data = pd.DataFrame(r_scaler.transform(MiceImputed),
                             index=MiceImputed.index, columns=MiceImputed.columns)
# Feature Importance using Filter Method (Chi-Square)
from sklearn.feature_selection import SelectKBest, chi2

X = modified_data.loc[:, modified_data.columns != 'RainTomorrow']
y = modified_data[['RainTomorrow']]
selector = SelectKBest(chi2, k=10)
selector.fit(X, y)
X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)])
Index(['Sunshine', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm',
       'Cloud9am', 'Cloud3pm', 'Temp3pm', 'RainToday', 'RISK_MM'],
      dtype='object')

We can observe that “Sunshine”, “Humidity9am”, “Humidity3pm”, “Pressure9am”, “Pressure3pm” have higher importance compared to other features.

Selecting features with the wrapper method (random forest):

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as rf

X = MiceImputed.drop('RainTomorrow', axis=1)
y = MiceImputed['RainTomorrow']
selector = SelectFromModel(rf(n_estimators=100, random_state=0))
selector.fit(X, y)
support = selector.get_support()
features = X.loc[:, support].columns.tolist()
print(features)
print(rf(n_estimators=100, random_state=0).fit(X, y).feature_importances_)
['Sunshine', 'Cloud3pm', 'RISK_MM']
[0.00205993 0.00215407 0.00259089 0.00367568 0.0102656  0.00252838
 0.05894157 0.00143001 0.00797518 0.00177178 0.00167654 0.0014278
 0.00187743 0.00760691 0.03091966 0.00830365 0.01193018 0.02113544
 0.04962418 0.00270103 0.00513723 0.00352198 0.76074491]

Training Rainfall Prediction Model with Different Models

We will divide the dataset into training (75%) and test (25%) sets respectively to train the rainfall prediction model. For best results, we will standardize our X_train and X_test data:

features = MiceImputed[['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
                        'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am',
                        'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
                        'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am',
                        'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday']]
target = MiceImputed['RainTomorrow']

# Split into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.25, random_state=12345)

# Standardize features: fit the scaler on the training set only, then apply
# the same transform to the test set to avoid data leakage
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
def plot_roc_cur(fper, tper):
    plt.plot(fper, tper, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
import time
from sklearn.metrics import (accuracy_score, roc_auc_score, cohen_kappa_score,
                             plot_confusion_matrix, roc_curve, classification_report)

def run_model(model, X_train, y_train, X_test, y_test, verbose=True):
    t0 = time.time()
    if verbose == False:
        model.fit(X_train, y_train, verbose=0)
    else:
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)
    coh_kap = cohen_kappa_score(y_test, y_pred)
    time_taken = time.time() - t0
    print("Accuracy = {}".format(accuracy))
    print("ROC Area under Curve = {}".format(roc_auc))
    print("Cohen's Kappa = {}".format(coh_kap))
    print("Time taken = {}".format(time_taken))
    print(classification_report(y_test, y_pred, digits=5))
    probs = model.predict_proba(X_test)
    probs = probs[:, 1]
    fper, tper, thresholds = roc_curve(y_test, probs)
    plot_roc_cur(fper, tper)
    plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues, normalize='all')
    return model, accuracy, roc_auc, coh_kap, time_taken

Plotting Decision Region for all Models


We can observe the difference in the class boundaries for the different models, including the ensemble ones (the plot is drawn on the training data only). CatBoost has the most distinct decision boundary of all the models, and the XGBoost and Random Forest models also have far fewer misclassified data points than the others.
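The snippet that produced this figure is not shown in the article. A minimal sketch of the same idea for a single model, using a synthetic two-feature dataset and a plain matplotlib `contourf` grid rather than the original plotting code:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for two weather features and a binary rain label
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Evaluate the classifier on a dense grid and colour each region by its prediction
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X[:, 0], X[:, 1], c=y, s=15, cmap='coolwarm', edgecolors='k')
plt.title('Decision regions (Random Forest, 2 features)')
plt.savefig('decision_regions.png')
```

Repeating this per classifier, each on the same grid, reproduces the side-by-side comparison in the figure.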

Rainfall Prediction Model Comparison

Now we need to decide which model performed best based on Accuracy, ROC_AUC, Cohen’s Kappa and total run time. One point to mention here: we could have considered F1 score a better metric than accuracy for judging model performance, but since we have already converted the unbalanced dataset into a balanced one, using accuracy to pick the best model is justified in this case.

For a more robust decision, we also chose Cohen’s Kappa, which is an ideal metric for selecting the best model in the case of unbalanced datasets. Let’s check which model worked well on which front:
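As a quick toy illustration (not the article's data) of why Cohen's Kappa is a stricter judge than accuracy on skewed labels, consider a degenerate classifier that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Labels in roughly the 78:22 split seen earlier; predictions are all "no rain"
y_true = [0] * 78 + [1] * 22
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.78 -- looks respectable
print(cohen_kappa_score(y_true, y_pred))  # 0.0  -- no better than chance
```

Accuracy rewards the majority guess; kappa corrects for the agreement expected by chance, so it exposes the model as useless.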

accuracy_scores = [accuracy_lr, accuracy_dt, accuracy_nn, accuracy_rf,
                   accuracy_lgb, accuracy_cb, accuracy_xgb]
roc_auc_scores = [roc_auc_lr, roc_auc_dt, roc_auc_nn, roc_auc_rf,
                  roc_auc_lgb, roc_auc_cb, roc_auc_xgb]
coh_kap_scores = [coh_kap_lr, coh_kap_dt, coh_kap_nn, coh_kap_rf,
                  coh_kap_lgb, coh_kap_cb, coh_kap_xgb]
tt = [tt_lr, tt_dt, tt_nn, tt_rf, tt_lgb, tt_cb, tt_xgb]

model_data = {'Model': ['Logistic Regression', 'Decision Tree', 'Neural Network',
                        'Random Forest', 'LightGBM', 'Catboost', 'XGBoost'],
              'Accuracy': accuracy_scores,
              'ROC_AUC': roc_auc_scores,
              'Cohen_Kappa': coh_kap_scores,
              'Time taken': tt}
data = pd.DataFrame(model_data)

fig, ax1 = plt.subplots(figsize=(12, 10))
ax1.set_title('Model Comparison: Accuracy and Time taken for execution', fontsize=13)
color = 'tab:green'
ax1.set_xlabel('Model', fontsize=13)
ax1.set_ylabel('Time taken', fontsize=13, color=color)
ax2 = sns.barplot(x='Model', y='Time taken', data=data, palette='summer')
ax1.tick_params(axis='y')
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Accuracy', fontsize=13, color=color)
ax2 = sns.lineplot(x='Model', y='Accuracy', data=data, sort=False, color=color)
ax2.tick_params(axis='y', color=color)
fig, ax3 = plt.subplots(figsize=(12, 10))
ax3.set_title('Model Comparison: Area under ROC and Cohens Kappa', fontsize=13)
color = 'tab:blue'
ax3.set_xlabel('Model', fontsize=13)
ax3.set_ylabel('ROC_AUC', fontsize=13, color=color)
ax4 = sns.barplot(x='Model', y='ROC_AUC', data=data, palette='winter')
ax3.tick_params(axis='y')
ax4 = ax3.twinx()
color = 'tab:red'
ax4.set_ylabel('Cohen_Kappa', fontsize=13, color=color)
ax4 = sns.lineplot(x='Model', y='Cohen_Kappa', data=data, sort=False, color=color)
ax4.tick_params(axis='y', color=color)
plt.show()

We can observe that XGBoost, CatBoost and Random Forest performed better compared to other models. However, if speed is an important thing to consider, we can stick with Random Forest instead of XGBoost or CatBoost.


I hope you liked this article on how we can create and compare different Rainfall prediction models. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.
