In this article, I will train a model to predict weather with machine learning. We will act as if we do not have access to a weather forecast. What we do have is more than a century of historical global temperature averages, including global maximum temperatures, global minimum temperatures, and global land and ocean temperatures. Because we have labeled historical data and the value we want to predict is continuous, this is a supervised regression machine learning problem.
Weather Dataset to Predict Weather
First of all, we need some data. The data I am using to predict weather with machine learning was compiled by one of the most prestigious research universities in the world, and we will assume that the data in the dataset is accurate. You can easily download this data from here.
Now, let’s get started with reading the dataset:
import pandas as pd

global_temp = pd.read_csv("GlobalTemperatures.csv")
print(global_temp.shape)
print(global_temp.columns)
global_temp.info()  # info() prints its summary itself and returns None
print(global_temp.isnull().sum())
Unfortunately, we’re not quite at the point where we can just feed the raw data into a model and have it send back a response. We will need to make some minor edits to get our data into a form that a machine learning model can use.
The exact steps in data preparation will depend on the model used and the data collected, but some amount of data manipulation will be required. First, I’ll create a function called wrangle() that takes our dataframe as its argument:
# Data Preparation
def wrangle(df):
    # Work on a copy so the original dataframe is left untouched
    df = df.copy()
    # Drop the four uncertainty columns
    df = df.drop(columns=["LandAverageTemperatureUncertainty",
                          "LandMaxTemperatureUncertainty",
                          "LandMinTemperatureUncertainty",
                          "LandAndOceanAverageTemperatureUncertainty"])
We want to make a copy of the dataframe so as not to corrupt the original. After that, we remove the uncertainty columns, which have high cardinality.
High cardinality refers to columns with a large number of distinct values, many of them rare or unique. Such columns are common in time-series datasets, so we deal with the problem directly by removing them from our dataset completely so that they do not confuse our model later.
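If you want to check cardinality yourself, a quick way is to count the distinct values in each column. The sketch below uses a small made-up dataframe; the column names season, reading_id and temperature are my own illustration and are not part of the weather dataset:

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "season": ["winter", "winter", "summer", "summer"],
    "reading_id": ["a1", "b2", "c3", "d4"],
    "temperature": [1.5, 2.0, 14.2, 14.2],
})

# Number of distinct values per column.
cardinality = toy.nunique()

# The same count as a share of the number of rows (1.0 means every value is unique).
unique_ratio = cardinality / len(toy)

print(cardinality)
print(unique_ratio)
```

A ratio close to 1.0 means that almost every row has its own value in that column.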
Now, still inside wrangle(), I will create a helper function to convert the temperatures from Celsius to Fahrenheit, and then convert the dt column into a DateTime object:
    def converttemp(x):
        # Celsius to Fahrenheit
        x = (x * 1.8) + 32
        return float(x)

    df["LandAverageTemperature"] = df["LandAverageTemperature"].apply(converttemp)
    df["LandMaxTemperature"] = df["LandMaxTemperature"].apply(converttemp)
    df["LandMinTemperature"] = df["LandMinTemperature"].apply(converttemp)
    df["LandAndOceanAverageTemperature"] = df["LandAndOceanAverageTemperature"].apply(converttemp)

    # Parse the date column and keep only the year as the index
    df["dt"] = pd.to_datetime(df["dt"])
    df["Month"] = df["dt"].dt.month
    df["Year"] = df["dt"].dt.year
    df = df.drop("dt", axis=1)
    df = df.drop("Month", axis=1)
    df = df[df.Year >= 1850]
    df = df.set_index(["Year"])
    df = df.dropna()
    return df

global_temp = wrangle(global_temp)
print(global_temp.head())
After applying our wrangle function to the global_temp dataframe, we get a new cleaned-up version of it with no missing values.
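As a side note, the same Celsius-to-Fahrenheit formula can also be applied to whole columns at once instead of value by value with apply(). Here is a minimal sketch on a made-up dataframe; the readings are hypothetical and only the formula and the column names come from the code above:

```python
import pandas as pd

# Made-up Celsius readings; only the column names mirror the dataset.
toy = pd.DataFrame({
    "LandAverageTemperature": [0.0, 15.0, 100.0],
    "LandMaxTemperature": [5.0, 20.0, -40.0],
})

# Arithmetic on a dataframe is applied element-wise to every column.
toy_f = toy * 1.8 + 32

print(toy_f)

# Count missing values per column to confirm none were introduced.
print(toy_f.isnull().sum())
```

Both approaches give the same numbers; the column-wise version is simply shorter and usually faster on large dataframes.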
Now, before moving forward with training a model to predict weather with machine learning, let’s visualize this data to find correlations between the columns:
import seaborn as sns
import matplotlib.pyplot as plt

corrMatrix = global_temp.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
As we can see, and as some of you have probably guessed, the columns that we have chosen to keep moving forward are highly correlated with each other.
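If you prefer numbers to a heatmap, you can also pull out just the correlations with the target and sort them. This is only a sketch on made-up data; the column names target, feature_a and feature_b are hypothetical stand-ins and not columns of our dataset:

```python
import pandas as pd

# Made-up data: feature_a rises with the target, feature_b mostly falls.
toy = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feature_b": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Take the target column of the correlation matrix, drop the target's
# correlation with itself, and sort from most positive to most negative.
target_corr = toy.corr()["target"].drop("target").sort_values(ascending=False)

print(target_corr)
```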
Separating Our Target to Predict Weather
Now we need to separate the data into features and targets. The target, also called y, is the value we want to predict, in this case the actual average land and ocean temperature. The features, also called x, are all the columns the model uses to make a prediction:
target = "LandAndOceanAverageTemperature"
y = global_temp[target]
x = global_temp[["LandAverageTemperature", "LandMaxTemperature", "LandMinTemperature"]]
Train Test Split
Now, to create a model to predict weather with machine learning, we need to split the data into a training set and a validation set by using the train_test_split function provided by scikit-learn:
from sklearn.model_selection import train_test_split

xtrain, xval, ytrain, yval = train_test_split(x, y, test_size=0.25, random_state=42)
print(xtrain.shape)
print(xval.shape)
print(ytrain.shape)
print(yval.shape)

(1494, 3)
(498, 3)
(1494,)
(498,)
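To see where these numbers come from, here is a small sketch that repeats the split on stand-in arrays with the same number of rows (1494 + 498 = 1992). The arrays x_demo and y_demo are placeholders I made up, not the real temperature data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as our cleaned dataset.
n_rows = 1992
x_demo = np.arange(n_rows * 3).reshape(n_rows, 3)
y_demo = np.arange(n_rows)

# Split twice with identical settings.
split_a = train_test_split(x_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.25, random_state=42)

xtr, xva, ytr, yva = split_a
print(xtr.shape, xva.shape)

# With the same random_state, both calls select the same validation rows.
same_split = np.array_equal(split_a[3], split_b[3])
print(same_split)
```

With test_size=0.25, a quarter of the rows go to the validation set, and fixing random_state makes the split reproducible from run to run.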
Baseline Mean Absolute Error
Before we can make and evaluate any predictions with our machine learning model, we need to establish a baseline: a sensible reference score that we hope to beat with our model. Here the baseline always predicts the mean of the training targets. If our model cannot improve on the baseline, then it has failed, and we should try a different model or admit that machine learning is not suitable for our problem:
from sklearn.metrics import mean_absolute_error

# Baseline: always predict the mean of the training targets
ypred = [ytrain.mean()] * len(ytrain)
print("Baseline MAE: ", round(mean_absolute_error(ytrain, ypred), 5))
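To make it clear what the mean absolute error of such a baseline actually measures, here is the same calculation done by hand on four made-up values. The numbers are purely illustrative and have nothing to do with our dataset:

```python
from statistics import mean

# Made-up target values, for illustration only.
y_toy = [10.0, 12.0, 14.0, 16.0]

# The baseline "model" predicts the mean of the targets for every row.
baseline_prediction = mean(y_toy)

# MAE is the average of the absolute differences between targets and predictions.
absolute_errors = [abs(value - baseline_prediction) for value in y_toy]
baseline_mae = mean(absolute_errors)

print("Baseline prediction:", baseline_prediction)
print("Baseline MAE:", baseline_mae)
```

The result is in the same unit as the target, so on our data the baseline MAE reads as degrees Fahrenheit.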
Training Model To Predict Weather
Now, to predict weather with machine learning, I will train a Random Forest, an algorithm that can handle both classification and regression tasks. Since our target is continuous, we use the regression variant:
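from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

forest = make_pipeline(
    # f_regression suits a continuous target; k="all" keeps every feature
    SelectKBest(score_func=f_regression, k="all"),
    StandardScaler(),
    RandomForestRegressor(
        n_estimators=100,
        max_depth=50,
        random_state=77,
        n_jobs=-1
    )
)
forest.fit(xtrain, ytrain)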
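Once a pipeline like this is fitted, you can look inside the forest to see which features it relies on most. The sketch below is not run on our weather data; it uses synthetic data that I generate on the spot, where the target depends mostly on the first of three features, and a simplified pipeline without the feature selection step:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends mostly on the first of three features.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * x_demo[:, 0] + 0.1 * x_demo[:, 1]

demo_forest = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=77),
)
demo_forest.fit(x_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
fitted_forest = demo_forest.named_steps["randomforestregressor"]

# One importance value per feature; together they sum to 1.
importances = fitted_forest.feature_importances_
print(importances)
```

On highly correlated features like ours, the importances tend to be shared between the columns, so they are best read as a rough guide.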
Model Evaluation of Machine Learning Model to Predict Weather
To put our predictions in perspective, we can calculate an accuracy figure by subtracting the mean absolute percentage error (MAPE) on the validation set from 100%:
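import numpy as np

# Predict on the validation set and compare against the true validation values
ypred = forest.predict(xval)
errors = abs(ypred - yval)
mape = 100 * (errors / yval)
accuracy = 100 - np.mean(mape)
print("Random Forest Model: ", round(accuracy, 2), "%")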
Random Forest Model: 99.52 %
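To show how this accuracy figure is put together, here is the same arithmetic worked through on three made-up temperatures. The values in actual and predicted are hypothetical and only serve to illustrate the formula:

```python
# Made-up actual and predicted temperatures, for illustration only.
actual = [50.0, 60.0, 80.0]
predicted = [49.0, 63.0, 80.0]

# Percentage error of each prediction, relative to its own actual value.
percentage_errors = [100 * abs(p - a) / a for a, p in zip(actual, predicted)]

# MAPE is the mean of those percentage errors.
mape = sum(percentage_errors) / len(percentage_errors)
accuracy = 100 - mape

print("MAPE:", round(mape, 2), "%")
print("Accuracy:", round(accuracy, 2), "%")
```

Each error is divided by the actual value from the same row, which is why the predictions and the actual values must both come from the validation set.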
Our model has learned to estimate the global land and ocean average temperature from the land temperature features with about 99.5% accuracy on the validation set. I hope you liked this article on how to build a model to predict weather with machine learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.