House Price Prediction with Python

Predicting house prices can help determine the selling price of a house in a particular region and can help people find the right time to buy a home. In this article, I will introduce you to a machine learning project on house price prediction with Python.

House Price Prediction

In this House Price Prediction task, we will use data from the California census to build a machine learning model that predicts house prices in the state. The data includes features such as population, median income, and median house prices for each block group in California.

Block groups are the smallest geographic units in the census data and typically have a population of 600 to 3,000 people. We will call them districts for short. Ultimately, our machine learning model should learn from this data and be able to predict the median house price in any district, given all the other metrics.

House Price Prediction with Python

Now that the problem statement is clear, I will take you through a machine learning project on house price prediction with Python. Let’s start by importing the necessary Python libraries and the dataset:

import pandas as pd
housing = pd.read_csv("housing.csv")
housing.head()
[Image: first five rows of the housing data]

Each row represents a district and there are 10 attributes in the dataset. Now let’s use the info() method, which is useful for getting a quick description of the data, especially the total number of rows, the type of each attribute, and the number of non-null values:

housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

There are 20,640 instances in the dataset. Note that the total_bedrooms attribute has only 20,433 non-null values, which means 207 districts are missing this value. We will have to deal with that later.
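As a quick check (not a required step), you can count the missing values directly:

housing["total_bedrooms"].isnull().sum()  # 207 districts with a missing total_bedrooms value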

All attributes are numeric except for the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but here it holds text categories. You can find out which categories exist in that column and how many districts belong to each category by using the value_counts() method:

housing.ocean_proximity.value_counts()

Another quick way to get a feel for what kind of data you’re dealing with is to plot a histogram for each numerical attribute:

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(10, 8))
plt.show()
[Image: histograms of all numerical attributes]

The next step in this task of House Price Prediction is to split the data into training and test sets. Creating a test set is theoretically straightforward: select some instances at random, typically 20% of the dataset (or less if your dataset is very large), and set them aside:

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

Let’s take a closer look at the histogram of median income: most median income values cluster around 1.5 to 6, but some go well beyond 6.

It is important to have a sufficient number of instances in your dataset for each stratum, otherwise, the estimate of the importance of a stratum may be biased. This means that you should not have too many strata and that each stratum should be large enough:

import numpy as np
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()
plt.show()
[Image: histogram of the income categories]

Stratified Sampling on Dataset

The next step is to perform stratified sampling on the dataset, so that the test set is representative of the income categories in the full dataset. For this, you can use the StratifiedShuffleSplit class of Scikit-Learn:

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64

Now you should remove the income_cat attribute we added, so that the data is back in its original form:

for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)
housing = strat_train_set.copy()

Now, before creating a machine learning model for house price prediction with Python, let’s visualize the data in terms of longitude and latitude:

housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population', figsize=(12, 8),
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()
plt.show()
[Image: scatter plot of districts coloured by median house value]

The graph shows house prices in California: red is expensive, blue is cheap, and larger circles indicate areas with a larger population.

Finding Correlations

Since the dataset is not too large, you can easily calculate the standard correlation coefficient between each pair of attributes using the corr() method:

corr_matrix = housing.corr(numeric_only=True)  # keep only the numeric columns (required in recent pandas versions, since ocean_proximity is text)
print(corr_matrix.median_house_value.sort_values(ascending=False))
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

The correlation coefficient ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation, and when it is close to -1, it means that there is a strong negative correlation. When it is close to 0, it means that there is no linear correlation.
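As a quick illustration on toy arrays (not the housing data), np.corrcoef() shows the three cases:

import numpy as np
x = np.arange(1000)
print(np.corrcoef(x, 2 * x + 1)[0, 1])   # 1.0: perfect positive linear relationship
print(np.corrcoef(x, -3 * x)[0, 1])      # -1.0: perfect negative linear relationship
noise = np.random.default_rng(42).normal(size=1000)
print(np.corrcoef(x, noise)[0, 1])       # close to 0: no linear relationship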

Another way to check the correlation between attributes is to use the pandas scatter_matrix() function, which plots each numeric attribute against every other numeric attribute:
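Plotting every numeric attribute against every other one would produce dozens of panels, so here I focus on a few attributes that seem most related to the median house value (this particular selection is just one reasonable choice):

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()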

[Image: scatter matrix of selected attributes]

And now let’s add three new columns to the dataset, rooms per household, bedrooms per room and population per household, and look at the correlation matrix again:

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

corr_matrix = housing.corr(numeric_only=True)  # again, keep only the numeric columns
print(corr_matrix["median_house_value"].sort_values(ascending=False))
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

Data Preparation

Now, this is the most important step before we train a machine learning model for the task of house price prediction. Let’s perform all the necessary data transformations.

There are several data transformation steps that need to be performed in the correct order, such as filling in the missing total_bedrooms values and encoding the ocean_proximity category. Fortunately, Scikit-Learn provides the Pipeline class to help you with such sequences of transformations. Here is a small pipeline for the numeric attributes, combined with an encoder for the categorical column:
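The exact transformers are a matter of choice; the sketch below assumes a median SimpleImputer plus a StandardScaler for the numeric columns and a OneHotEncoder for ocean_proximity, combined in a ColumnTransformer. It produces the housing_labels, housing_prepared and full_pipeline objects used in the next step:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Separate the predictors and the labels in the stratified training set
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# Numeric columns get imputation + scaling, ocean_proximity gets one-hot encoding
num_attribs = list(housing.drop("ocean_proximity", axis=1).columns)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # fill missing total_bedrooms with the median
    ("std_scaler", StandardScaler()),                # standardize the numeric attributes
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)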

Linear Regression for House Price Prediction with Python

Now I will use the linear regression algorithm for the task of house price prediction with Python:

from sklearn.linear_model import LinearRegression

# Train a linear regression model on the prepared training data
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Run the full preprocessing pipeline and the model on the first five training instances
data = housing.iloc[:5]
labels = housing_labels.iloc[:5]
data_preparation = full_pipeline.transform(data)
print("Predictions: ", lin_reg.predict(data_preparation))
Predictions:  [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]

I hope you liked this article on Machine Learning project on House Price Prediction with Python. Feel free to ask your valuable questions in the comments section below.
