Diamond Price Prediction with Machine Learning

Gemstones like diamonds are always in demand because of their value in the investment market. This makes it very important for diamond dealers to be able to predict their prices. In this article, I’ll walk you through the task of Diamond Price Prediction with machine learning using Python.

To predict diamond prices with machine learning, we need to build a model that estimates the price of a diamond from features such as its weight, quality, and measurements.

The dataset I’ll be using for the diamond price prediction task contains data for almost 54,000 diamonds. It is a very good dataset for beginners, as it contains almost all the important characteristics of diamonds, such as price, cut quality, carat weight, colour, clarity, length, width, and depth.

Diamond Price Prediction using Python

In this section, I will take you through the task of diamond price prediction with machine learning using the Python programming language. Let’s get started by importing the necessary libraries and the dataset:
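The loading code is not reproduced here, so the following is only a minimal sketch of what this step might look like. The file name diamonds.csv and the blank first header (which pandas reads as an "Unnamed: 0" column) are assumptions. To keep the example runnable without the file, the demo reads the first five rows of the dataset from an in-memory string.

```python
import io

import pandas as pd

# With the real file you would call pd.read_csv("diamonds.csv") instead;
# the file name is an assumption, not taken from the article.
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
    "4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63\n"
    "5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75\n"
)

data = pd.read_csv(sample_csv)

# Show the first rows to check that the columns were parsed as expected
print(data.head())
```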

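   Unnamed: 0  carat      cut color clarity  depth  table  price     x     y     z
0           1   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2           3   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4           5   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75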

There is a column named “table” in the dataset which refers to the flat facet of the diamond as seen when it is face up. The main purpose of this attribute is to refract light rays and allow rays reflected from and inside the diamond to meet the eyes of the observer. The ideal table size of a diamond will give it a stunning look. Now let’s move on to the next step which is data processing.

Data Processing:

I will now process the data, which involves three main tasks: data cleaning, identifying and removing outliers, and encoding categorical features.

The minimum value of “x”, “y”, and “z” is zero. A diamond cannot have a zero length, width, or depth, so these are erroneous values that describe dimensionless or two-dimensional diamonds. We need to filter out these bad data points:
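One way to do this filtering is sketched below. The helper name drop_zero_dimensions and the toy values are hypothetical; the idea is simply to keep only the rows where all three dimensions are non-zero.

```python
import pandas as pd


def drop_zero_dimensions(df, dims=("x", "y", "z")):
    """Return a copy of df without rows where any dimension is zero."""
    mask = (df[list(dims)] != 0).all(axis=1)
    return df[mask].copy()


# Toy frame (made-up values): three of the four rows contain a zero dimension
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 4.20],
    "y": [3.98, 3.84, 0.0, 4.23],
    "z": [2.43, 2.31, 2.31, 0.0],
    "price": [326, 326, 327, 334],
})

clean = drop_zero_dimensions(toy)
print(clean.shape)
```

On the real dataset you would call the same function on the full DataFrame and compare the shape before and after to see how many rows were dropped.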

Now let’s visualize the data to observe the outliers in the dataset:
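The plotting code is not included here. As a rough sketch, assuming matplotlib is available, you could draw each feature against the price and look for points that sit far from the main cloud. The function name plot_features_vs_price and the synthetic data (with one injected outlier) are hypothetical stand-ins for the real dataset.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_features_vs_price(df, features, target="price"):
    """Draw one scatter plot per feature against the target column."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(4 * len(features), 3.5), squeeze=False
    )
    for ax, col in zip(axes[0], features):
        ax.scatter(df[col], df[target], s=8, alpha=0.5)
        ax.set_xlabel(col)
        ax.set_ylabel(target)
    fig.tight_layout()
    return fig


# Synthetic stand-in data, not the real diamonds
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "y": rng.normal(5.7, 1.1, n),
    "z": rng.normal(3.5, 0.7, n),
})
toy["price"] = 900 * toy["y"] + rng.normal(0, 400, n)
toy.loc[0, ["y", "z"]] = [58.9, 31.8]  # one injected outlier

fig = plot_features_vs_price(toy, ["y", "z"])
# Use fig.savefig(...) or plt.show() to view the result
```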

[Figure: diamond price prediction, outliers in the dataset]

Some features contain data points that lie far from the rest of the data, and these will affect the outcome of our regression model:

  1. y and z have dimensional outliers that need to be eliminated.
  2. The depth should be capped, but we have to look at the regression line to be sure.
  3. The table feature should also be capped.

Now let’s remove all the outliers in the dataset:
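The exact cut-offs are not given here, so the bounds below are hypothetical and should be tuned from your own plots. The sketch keeps only the rows whose values fall strictly inside a (low, high) range for each listed feature.

```python
import pandas as pd

# Hypothetical (low, high) bounds per feature; these are assumptions
BOUNDS = {
    "depth": (45, 75),
    "table": (40, 80),
    "y": (0, 30),
    "z": (2, 30),
}


def remove_outliers(df, bounds):
    """Keep only rows whose values lie strictly inside every (low, high) range."""
    mask = pd.Series(True, index=df.index)
    for col, (low, high) in bounds.items():
        mask &= df[col].gt(low) & df[col].lt(high)
    return df[mask].copy()


# Toy frame (made-up values): only the first row is inside all bounds
toy = pd.DataFrame({
    "depth": [61.5, 59.8, 56.9, 62.4, 79.0],
    "table": [55.0, 61.0, 65.0, 95.0, 58.0],
    "y": [3.98, 58.9, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 31.8, 2.63, 2.75],
})

trimmed = remove_outliers(toy, BOUNDS)
print(len(toy), "->", len(trimmed))
```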

Now let’s have a look at the categorical features in the dataset:
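A minimal sketch of this check, using the first three rows of the dataset as a toy frame, is to select every column that is not numeric:

```python
import pandas as pd

# First three rows of the dataset, reduced to a few columns
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
    "price": [326, 326, 327],
})

# Every column that is not numeric is treated as categorical
object_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical variables:")
print(object_cols)
```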

Categorical variables:
['cut', 'color', 'clarity']

Now I will label encode these columns so that the object dtype is replaced with integer codes:
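As a sketch of this step, assuming scikit-learn's LabelEncoder is the encoder in use, the hypothetical helper label_encode below fits one encoder per column and keeps the encoders so the codes can be mapped back later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode(df, columns):
    """Return an encoded copy of df plus the fitted encoder for each column."""
    encoded = df.copy()
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(encoded[col])
        encoders[col] = enc
    return encoded, encoders


# Categorical values from the first five rows of the dataset
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
})

encoded, encoders = label_encode(toy, ["cut", "color", "clarity"])
print(encoded)
print(list(encoders["cut"].classes_))
```

Note that LabelEncoder assigns the integers in alphabetical order of the labels, so the codes do not reflect the quality ranking of cut, colour, or clarity.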

Finally, let’s have a look at the correlation between the features before training a model for the task of Diamond Price prediction:
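The correlation matrix itself can be computed with pandas. The sketch below uses synthetic stand-in data (the numbers are made up, not taken from the dataset) just to show the call and how to rank the features by their correlation with the price.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: price depends on carat, depth is unrelated noise
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.0, size=200)
toy = pd.DataFrame({
    "carat": carat,
    "x": 6.5 * np.cbrt(carat) + rng.normal(0, 0.05, size=200),
    "depth": rng.normal(61.7, 1.4, size=200),
    "price": 4000 * carat**2 + rng.normal(0, 300, size=200),
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()

# Rank the features by their correlation with the target
print(corr["price"].sort_values(ascending=False))
```

The resulting matrix can be passed to a heatmap function of your plotting library to get a picture like the one in this section.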

[Figure: correlation between the features]

Observations:

  1. x, y and z show a strong correlation with the target column.
  2. The depth, cut and table columns show a weak correlation. We could consider dropping them, but let’s keep them.

Final Step: Diamond Price Prediction Model

Now let’s move to the final step for the task of creating a machine learning model for predicting the price of diamonds. Below is the complete process that we need to follow in this step:

  1. Features and target configuration
  2. Create a pipeline of a standard scaler and a model for five different regressors
  3. Fit all models to the training data
  4. Compute the mean cross-validated negative root mean squared error on the training set for all models
  5. Choose the model with the best cross-validation score
  6. Fit the best model on the training set and evaluate it on the test set

Now let’s implement all the steps mentioned above to train a machine learning model for the task of Diamond Price Prediction:
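The steps above can be sketched as follows. This is not the exact training code: it uses synthetic data from make_regression as a stand-in for the diamond features and price, the hyperparameters are arbitrary, and XGBRegressor is included only if the xgboost package is installed. The scoring is set to neg_root_mean_squared_error on the assumption that the scores listed below are negative RMSE values, since they are on the same scale as the test RMSE reported at the end.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Step 1: features and target (synthetic stand-in for the diamond data)
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Step 2: one scaler + model pipeline per regressor
regressors = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=7),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=7),
    "KNeighbors": KNeighborsRegressor(),
}
try:
    from xgboost import XGBRegressor  # optional dependency
    regressors["XGBRegressor"] = XGBRegressor(random_state=7)
except ImportError:
    pass  # continue with the scikit-learn models only

pipelines = {
    name: Pipeline([("scaler", StandardScaler()), ("model", model)])
    for name, model in regressors.items()
}

# Steps 3 and 4: fit each pipeline and record its mean cross-validation score
cv_scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores = cross_val_score(
        pipe, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5
    )
    cv_scores[name] = scores.mean()
    print(f"{name}: {cv_scores[name]:f}")

# Steps 5 and 6: the score closest to zero wins; that pipeline is already fitted
best_name = max(cv_scores, key=cv_scores.get)
best_model = pipelines[best_name]
print("Best model:", best_name)
```

Because the scores are negated errors, the best model is the one with the highest (least negative) value.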

LinearRegression: -1348.811824 
DecisionTree: -749.317273 
RandomForest: -547.203809 
KNeighbors: -823.649442 
XGBRegressor: -545.458107 

From the above scores, XGBRegressor appears to be the model with the best negative root mean squared error (the score closest to zero). Let’s test this model on the test set and evaluate it with different metrics:
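The metrics reported below can be computed with scikit-learn as in this sketch. The helper name regression_report is hypothetical, the true values are the prices of the first five diamonds, and the predictions are made-up numbers. Adjusted R^2 uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE for a set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R^2": r2,
        "Adjusted R^2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }


y_true = np.array([326.0, 326.0, 327.0, 334.0, 335.0])
y_pred = np.array([325.0, 328.0, 327.0, 333.0, 337.0])  # made-up predictions

report = regression_report(y_true, y_pred, n_features=1)
for name, value in report.items():
    print(f"{name}: {value}")
```

With a fitted model you would pass y_test, the model's predictions on X_test, and the number of feature columns in X_test.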

R^2: 0.98108479806778
Adjusted R^2: 0.981072157032851
MAE: 278.09339997286685
MSE: 296738.36382521846
RMSE: 544.7369675588563

I hope you liked this article on how to train a model for the task of Diamond price prediction with Machine Learning using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal