Linear Regression Model

Let’s train and run a Linear regression model to make Predictions, In this article, I will load the data, prepare it, create a scatter plot for visualization, and then train a linear regression model to make a prediction.

I will first create a function that will join our two datasets to be used in training our linear regression model. It’s the boring pandas code that will join the life satisfaction data from the OECD with GDP per capita data from the IMF.

You can download both these datasets below:

Now let’s, create a function, to move further with training our linear regression model.

# To support both python 2 and python 3 from __future__ import division, print_function, unicode_literals # Common imports import numpy as np import os # to make this notebook's output stable across runs np.random.seed(42) # To plot pretty figures %matplotlib inline import matplotlib import matplotlib.pyplot as plt plt.rcParams['axes.labelsize'] = 14 plt.rcParams['xtick.labelsize'] = 12 plt.rcParams['ytick.labelsize'] = 12 def prepare_country_stats(oecd_bli, gdp_per_capita): oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"] oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value") gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True) gdp_per_capita.set_index("Country", inplace=True) full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True) full_country_stats.sort_values(by="GDP per capita", inplace=True) remove_indices = [0, 1, 6, 8, 33, 34, 35] keep_indices = list(set(range(36)) - set(remove_indices)) return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]
Code language: Python (python)

Train and Run a Linear Regression Model

Now you are finally ready for training and running a linear regression model to make predictions. For example, say you want to know how happy Cypriots are, and the OECD data does not have the answer.

Fortunately, you can use your linear regression model to make a good prediction. Let’s train a Linear Regression Model:

import matplotlib import matplotlib.pyplot as plt import numpy as np import pandas as pd import sklearn.linear_model # Load the data oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',') gdp_per_capita = pd.read_csv("gdp_per_capita.csv",thousands=',',delimiter='\t', encoding='latin1', na_values="n/a") # Prepare the data country_stats = prepare_country_stats(oecd_bli, gdp_per_capita) X = np.c_[country_stats["GDP per capita"]] y = np.c_[country_stats["Life satisfaction"]] # Visualize the data country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction') # Select a linear model model = sklearn.linear_model.LinearRegression() # Train the model, y) # Make a prediction for Cyprus X_new = [[22587]] # Cyprus' GDP per capita print(model.predict(X_new)) # outputs [[ 5.96242338]]
Code language: Python (python)

linear regression

If you had used an instance-based learning algorithm instead, of a linear regression model, you have found that Slovenia has the closest GDP per capita to that of Cyprus, and since the Linear Regression Model tells you that Slovenians’ life satisfaction is 5.7, you would have predicted a life satisfaction pf 5.7 for Cyprus.

If you zoom out a bit and look at the two next-closest countries, you will find Portugal and Spain with life satisfaction of 5.1 and 6.5, respectively.

Averaging these three values, you get 5.77, which is very close to your Linear Model prediction. This simple algorithm is called k-Nearest Neighbors Regression.

Replacing the Linear Regression model with k-Nearest Neighbors regression in the above code is as simple as replacing these two lines:

import sklearn.linear_model model = sklearn.linear_model.LinearRegression()
Code language: Python (python)

with these two:

import sklearn.neighbors model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
Code language: Python (python)

Also, read – 10 Machine Learning Projects to Boost your Portfolio

If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc), get more or better quality training data, or perhaps select a more powerful model(e.g., a Polynomial Regression model).

Follow Us:

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1370

Leave a Reply