Weather Forecasting with Machine Learning

In this article, I will show how we can do weather forecasting with a machine learning algorithm, exploring the temperature data with some visualizations and clustering along the way.


Let's start this task by importing the libraries:

import numpy as np # For Linear Algebra
import pandas as pd # To Work With Data
# for visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime # Time Series analysis.

Download and read the data set

df = pd.read_csv("Weather.csv")

To look at the first 5 rows of the data:

df.head() # This will show us top 5 rows of the dataset by default.

We have got an unexpected column named Unnamed: 0. This is a very common problem: it appears when the CSV file has an index column with no name. Here is how we can get rid of it:

df = pd.read_csv("Weather.csv", index_col=0)
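To see where the stray column comes from, here is a minimal sketch that writes a tiny frame to CSV with its default index and reads it back both ways. The two-row sample is made up for illustration and is not taken from Weather.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real Weather.csv has one row per year.
sample = pd.DataFrame({"YEAR": [1901, 1902], "JAN": [17.99, 19.00]})

# to_csv writes the index as a first column with an empty header by default.
csv_text = sample.to_csv()

naive = pd.read_csv(io.StringIO(csv_text))               # index column becomes 'Unnamed: 0'
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index column is used as the index

print(list(naive.columns))
print(list(fixed.columns))
```

Saving with `to_csv(index=False)` in the first place would avoid the problem, but we usually do not control how a downloaded file was written.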

Now, we'll build a Date attribute (month and year) so that we can plot the temperature values along a timeline.

df1 = pd.melt(df, id_vars='YEAR', value_vars=df.columns[1:]) ## This will melt the data
df1['Date'] = df1['variable'] + ' ' + df1['YEAR'].astype(str)  
df1['Date'] = df1['Date'].apply(lambda x : datetime.strptime(x, '%b %Y')) ## Converting String to datetime object
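If melting is new to you, the following small sketch shows what it does to a wide table. The sample values and the name `tidy` are made up for illustration; only the column layout (a YEAR column followed by month columns) mirrors the dataset.

```python
from datetime import datetime
import pandas as pd

# Hypothetical wide sample with two month columns.
wide = pd.DataFrame({"YEAR": [1901, 1902], "JAN": [17.99, 19.00], "FEB": [19.43, 20.39]})

# One row per (year, month) pair: month names land in 'variable', readings in 'value'.
tidy = pd.melt(wide, id_vars="YEAR", value_vars=["JAN", "FEB"])

# '%b' matches abbreviated month names; Python's strptime ignores case here.
tidy["Date"] = (tidy["variable"] + " " + tidy["YEAR"].astype(str)).apply(
    lambda s: datetime.strptime(s, "%b %Y")
)

print(tidy.sort_values("Date"))
```

Each of the two years now contributes one row per month, which is the shape a time series plot needs.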

Temperature through time

df1.columns=['Year', 'Month', 'Temprature', 'Date']
df1.sort_values(by='Date', inplace=True) ## To get the time series right.
fig = go.Figure(layout = go.Layout(yaxis=dict(range=[0, df1['Temprature'].max()+1])))
fig.add_trace(go.Scatter(x=df1['Date'], y=df1['Temprature']))
fig.update_layout(title='Temperature Through Timeline:',
                 xaxis_title='Time', yaxis_title='Temperature in Degrees')
## Range selector buttons to switch between the whole series and a one-year view.
fig.update_layout(xaxis=dict(rangeselector=dict(
        buttons=list([dict(label="Whole View", step="all"),
                      dict(count=1,label="One Year View",step="year",stepmode="todate")
                     ]))))

On a closer look, by clicking on One Year View, we can see that the graph looks jagged. This is not a plotting problem: it is how the values really are, because the temperature rises and falls with the months within every year.


  • May 1921 was the hottest month in India in the recorded history. What could be the reason?
  • Dec, Jan and Feb are the coldest months. One could group them together as “Winter”.
  • Apr, May, Jun, Jul and Aug are the hottest months. One could group them together as “Summer”.

But this is not how seasons actually work. India has four main seasons, and this is how they are grouped:

  • Winter: December, January and February.
  • Summer (also called “Pre Monsoon Season”): March, April and May.
  • Monsoon: June, July, August and September.
  • Autumn (also called “Post Monsoon Season”): October and November.

We will stick to these seasons for our analysis.
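The grouping above can be written down as a simple lookup table. This is only a sketch: the name `SEASON_OF_MONTH` is my own, and the month abbreviations assume the upper-case column names used in the dataset.

```python
import pandas as pd

# Month abbreviation -> season, following the four-season grouping listed above.
SEASON_OF_MONTH = {
    "DEC": "Winter", "JAN": "Winter", "FEB": "Winter",
    "MAR": "Summer", "APR": "Summer", "MAY": "Summer",
    "JUN": "Monsoon", "JUL": "Monsoon", "AUG": "Monsoon", "SEP": "Monsoon",
    "OCT": "Autumn", "NOV": "Autumn",
}

months = pd.Series(["JAN", "MAY", "JUL", "NOV"])
seasons = months.map(SEASON_OF_MONTH)
print(seasons.tolist())
```

With a melted frame, the same `map` call on the month column would give a season label for every reading.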

Warmest/Coldest/Average:

fig = px.box(df1, x='Month', y='Temprature') ## A box plot shows the min, median and max for each month
fig.update_layout(title='Warmest, Coldest and Median Monthly Temperature.')


  • January has the coldest days in a year.
  • May has the hottest days in a year.
  • July is the month with the least standard deviation, which means the temperature in July varies the least. We can expect any day in July to be a warm day.
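The numbers behind these observations can also be computed directly instead of being read off a chart. Here is a minimal sketch on a made-up sample; the readings are invented, and only the column names follow the melted frame used in this article.

```python
import pandas as pd

# Hypothetical long-format sample: three years of readings for two months.
sample = pd.DataFrame({
    "Month": ["JAN", "JAN", "JAN", "JUL", "JUL", "JUL"],
    "Temprature": [17.0, 19.0, 18.0, 27.4, 27.6, 27.5],
})

# Coldest, typical and warmest reading per month, plus the spread.
stats = sample.groupby("Month")["Temprature"].agg(["min", "median", "max", "std"])
print(stats)
print("Least variable month:", stats["std"].idxmin())
```

The month with the smallest `std` is the one whose readings vary the least from year to year.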
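
Next, let's cluster the temperature readings with K-Means. To pick the number of clusters, we plot the sum of squared distances for several values of k:

from sklearn.cluster import KMeans
sse = []
target = df1['Temprature'].to_numpy().reshape(-1,1)
num_clusters = list(range(1, 10))

for k in num_clusters:
    km = KMeans(n_clusters=k)
    km.fit(target)
    sse.append(km.inertia_) ## Sum of squared distances to the closest cluster centre

fig = go.Figure(data=[
    go.Scatter(x = num_clusters, y=sse, mode='lines'),
    go.Scatter(x = num_clusters, y=sse, mode='markers')
])

fig.update_layout(title="Evaluation on number of clusters:",
                 xaxis_title = "Number of Clusters:",
                 yaxis_title = "Sum of Squared Distance")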
Three clusters seems a good choice here:
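To make the idea of picking the "elbow" concrete, here is a small sketch on synthetic data. The three centres (18, 24 and 30 degrees) and the noise level are assumptions made for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D readings drawn around three made-up centres.
rng = np.random.default_rng(0)
readings = np.concatenate(
    [rng.normal(centre, 0.5, 100) for centre in (18.0, 24.0, 30.0)]
).reshape(-1, 1)

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(readings)
    sse.append(km.inertia_)

# Drop in SSE gained by each extra cluster; the elbow is where the drops flatten out.
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: SSE drops by {drop:.1f}")
```

Going from 2 to 3 clusters should still reduce the error a lot, while a fourth cluster adds very little, which is the pattern we look for in the chart.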

km = KMeans(3)
km.fit(df1['Temprature'].to_numpy().reshape(-1,1))
df1.loc[:,'Temp Labels'] = km.labels_
fig = px.scatter(df1, 'Date', 'Temprature', color='Temp Labels')
fig.update_layout(title = "Temperature clusters.",
                 xaxis_title="Date", yaxis_title="Temperature")


  • Despite having 4 seasons, we can see 3 main clusters based on temperatures.
  • Jan, Feb and Dec are the coldest months.
  • Apr, May, Jun, Jul, Aug and Sep all have hotter temperatures.
  • Mar, Oct and Nov are the months with temperatures that are neither too hot nor too cold.
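One practical detail: the label numbers that K-Means assigns are arbitrary, so label 0 is not necessarily the coldest cluster. A possible way to get readable names is to rank the cluster centres, as in this sketch. The synthetic readings and the names "Cold", "Mild" and "Hot" are my own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic readings: 50 hot, then 50 cold, then 50 mild (made-up centres).
rng = np.random.default_rng(1)
readings = np.concatenate(
    [rng.normal(centre, 0.5, 50) for centre in (30.0, 18.0, 24.0)]
).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(readings)

# Rank the centres from coldest to hottest, then map each raw label to its rank.
order = np.argsort(km.cluster_centers_.ravel())
rank_of_label = np.empty_like(order)
rank_of_label[order] = np.arange(len(order))
ordered_labels = rank_of_label[km.labels_]

names = np.array(["Cold", "Mild", "Hot"])[ordered_labels]
print(names[0], names[50], names[100])
```

Passing such names to the `color` argument of the scatter plot would make the legend easier to read than raw label numbers.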

fig = px.histogram(x=df1['Temprature'], nbins=200, histnorm='density')
fig.update_layout(title='Frequency chart of temperature readings:',
                 xaxis_title='Temperature', yaxis_title='Density')

There is a cluster of readings between 26.2 and 27.5, and the most common monthly mean temperature over the record lies between 26.8 and 26.9.

Let’s see if we can get some insights from yearly mean temperature data. I am going to treat this as a time series as well.

Yearly average temperature

df['Yearly Mean'] = df.iloc[:,1:].mean(axis=1) ## Axis 1 for row wise and axis 0 for columns.
fig = go.Figure(data=[
    go.Scatter(name='Yearly Temperatures' , x=df['YEAR'], y=df['Yearly Mean'], mode='lines'),
    go.Scatter(name='Yearly Temperatures' , x=df['YEAR'], y=df['Yearly Mean'], mode='markers')
])
fig.update_layout(title='Yearly Mean Temperature :',
                 xaxis_title='Time', yaxis_title='Temperature in Degrees')

We can see that the issue of global warming is real.

  • The yearly mean temperature was not increasing until about 1980. Only after that can we see a gradual increase in the yearly mean temperature.
  • After 2015, the yearly temperature has increased drastically.
  • But there are some problems in this figure.
  • We are seeing a month-like up-down pattern in yearly temperatures as well.
  • This is hard to explain. With months, the pattern comes from the seasons, which arise as the Earth orbits the Sun on a tilted axis. No such cycle is expected in yearly temperatures.
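One common way to look past such year-to-year swings is a rolling mean. This is not part of the analysis above, just a sketch of the idea on synthetic data; the trend of 0.01 degrees per year, the noise level and the 10-year window are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic yearly means: a slow made-up upward trend plus year-to-year noise.
rng = np.random.default_rng(0)
years = np.arange(1901, 2018)
raw = pd.Series(24.0 + 0.01 * (years - 1901) + rng.normal(0, 0.3, len(years)), index=years)

# A centred 10-year rolling mean averages out the short-lived swings.
smoothed = raw.rolling(window=10, center=True).mean()

print("Mean year-to-year change, raw:     ", raw.diff().abs().mean())
print("Mean year-to-year change, smoothed:", smoothed.diff().abs().mean())
```

The smoothed series loses a few years at both ends, because a full window is not available there.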

Monthly temperatures through history

fig = px.line(df1, 'Year', 'Temprature', facet_col='Month', facet_col_wrap=4)
fig.update_layout(title='Monthly temperature through history:')

We can see clear positive trend lines. Let’s see if we could find any trend in seasonal mean temperatures.

Seasonal Weather Analysis

df['Winter'] = df[['DEC', 'JAN', 'FEB']].mean(axis=1)
df['Summer'] = df[['MAR', 'APR', 'MAY']].mean(axis=1)
df['Monsoon'] = df[['JUN', 'JUL', 'AUG', 'SEP']].mean(axis=1)
df['Autumn'] = df[['OCT', 'NOV']].mean(axis=1)
seasonal_df = df[['YEAR', 'Winter', 'Summer', 'Monsoon', 'Autumn']]
seasonal_df = pd.melt(seasonal_df, id_vars='YEAR', value_vars=seasonal_df.columns[1:])
seasonal_df.columns=['Year', 'Season', 'Temprature']

## trendline='ols' needs the statsmodels package to be installed.
fig = px.scatter(seasonal_df, 'Year', 'Temprature', facet_col='Season', facet_col_wrap=2, trendline='ols')
fig.update_layout(title='Seasonal mean temperatures through the years:')

We can again see a positive trend between temperature and time. The correlation between temperature and year is not very high, but it is not negligible either.
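The slope of a trend line and the strength of the correlation can also be put into numbers. The sketch below uses NumPy instead of the OLS fit that Plotly draws, and the data is synthetic: the trend of 0.01 degrees per year and the noise level are assumptions.

```python
import numpy as np

# Synthetic seasonal means with a made-up upward trend and noise.
rng = np.random.default_rng(42)
years = np.arange(1901, 2018)
temps = 26.0 + 0.01 * (years - 1901) + rng.normal(0, 0.4, len(years))

# Least-squares straight line: temps is approximately slope * years + intercept.
slope, intercept = np.polyfit(years, temps, deg=1)

# Pearson correlation between year and temperature.
r = np.corrcoef(years, temps)[0, 1]

print(f"slope: {slope:.4f} degrees per year, r: {r:.2f}, r^2: {r**2:.2f}")
```

A positive slope with a moderate `r` is what "a trend that is present but noisy" looks like in numbers.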

Let’s try to find out if we can get something out of an animation

px.scatter(df1, 'Month', 'Temprature', size='Temprature', animation_frame='Year')

At first look, we can see some fluctuations, but that doesn't give us much insight. However, if we drag the slider below the chart between the early years and the late years, we can notice the change.

But this is certainly not the best way to visualize it. Let’s find some better way.

Weather Forecasting with Machine Learning

Let's try to forecast the monthly mean temperature for the year 2018.

# I am using decision tree regressor for prediction as the data does not actually have a linear trend.
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split 
from sklearn.metrics import r2_score 

df2 = df1[['Year', 'Month', 'Temprature']].copy()
df2 = pd.get_dummies(df2)
y = df2[['Temprature']]
x = df2.drop(columns='Temprature')

dtr = DecisionTreeRegressor()
train_x, test_x, train_y, test_y = train_test_split(x,y,test_size=0.3)
dtr.fit(train_x, train_y)
pred = dtr.predict(test_x)
r2_score(test_y, pred)


A high r2 value means that our predictive model is working well. Now, let's see the forecasted data for 2018.
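For reference, the r2 score compares the model's squared errors with those of a baseline that always predicts the mean. Here is a small sketch of that formula; the function name `r2_by_hand` and the sample numbers are made up for illustration.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of the mean baseline
    return 1.0 - ss_res / ss_tot

actual = [18.0, 24.0, 30.0, 27.0]
perfect = r2_by_hand(actual, actual)
baseline = r2_by_hand(actual, [np.mean(actual)] * len(actual))
print(perfect, baseline)
```

A score near 1 means the model explains most of the variation, while a score near 0 means it does no better than always predicting the mean.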

next_Year = df1[df1['Year']==2017][['Year', 'Month']].copy()
next_Year['Year'] = 2018
## Same dummy columns, in the same order, as the training features.
next_Year = pd.get_dummies(next_Year).reindex(columns=x.columns, fill_value=0)
temp_2018 = dtr.predict(next_Year)

temp_2018 = pd.DataFrame({'Month':df1['Month'].unique(), 'Temprature':temp_2018})
temp_2018['Year'] = 2018
forecasted_temp = pd.concat([df1,temp_2018], sort=False).groupby(by='Year')['Temprature'].mean().reset_index()
fig = go.Figure(data=[
    go.Scatter(name='Yearly Mean Temperature', x=forecasted_temp['Year'], y=forecasted_temp['Temprature'], mode='lines'),
    go.Scatter(name='Yearly Mean Temperature', x=forecasted_temp['Year'], y=forecasted_temp['Temprature'], mode='markers')
])
fig.update_layout(title='Forecasted Temperature:',
                 xaxis_title='Time', yaxis_title='Temperature in Degrees')
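The forecast step only works if the rows for 2018 are encoded with the same dummy columns as the training data. Here is a minimal sketch of one way to guarantee that; the tiny frames and the names `train_features` and `new_features` are made up for illustration.

```python
import pandas as pd

# Hypothetical training rows and a single new row to predict for.
train = pd.DataFrame({"Year": [2016, 2016, 2017], "Month": ["JAN", "FEB", "MAR"]})
new = pd.DataFrame({"Year": [2018], "Month": ["FEB"]})

train_features = pd.get_dummies(train)

# On its own, get_dummies(new) would only produce Year and Month_FEB.
# Reindexing adds the missing month columns (filled with 0) in the training order.
new_features = pd.get_dummies(new).reindex(columns=train_features.columns, fill_value=0)

print(list(new_features.columns))
```

In this article all twelve months are present in both frames, so the columns happen to match anyway, but reindexing keeps the code safe when they do not.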
Aman Kharwal
