Diamonds Analysis with Python

In this article, I will analyze a diamonds dataset with Python using data science tools. For this first problem, I chose a fairly simple dataset from Kaggle. You can easily download this dataset from here. Now let’s start with this simple project of Diamonds Analysis with Python.

I will start by importing the necessary libraries, NumPy and pandas, to read the data and get some insights into it:

import pandas as pd
import numpy as np

Data Science Project: Diamonds Analysis with Python

Read the diamonds file into a pandas DataFrame. Note: we did not have to clean or reshape the data in this file for this analysis. That is unusual in real data science work, where you will often spend a lot of time getting your data into the shape you need, sometimes as much time as the rest of the project. Now let’s read the data:

df = pd.read_csv("diamonds.csv")

Just to check that the file loaded as expected, let’s print the first five rows of the DataFrame:

print(df.head())
Unnamed: 0  carat      cut color clarity  ...  table  price     x     y     z
0           1   0.23    Ideal     E     SI2  ...   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium     E     SI1  ...   61.0    326  3.89  3.84  2.31
2           3   0.23     Good     E     VS1  ...   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium     I     VS2  ...   58.0    334  4.20  4.23  2.63
4           5   0.31     Good     J     SI2  ...   58.0    335  4.34  4.35  2.75
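The `Unnamed: 0` column in the output above looks like a row-number column that was saved without a header. As a minimal sketch, assuming that is the case, the snippet below shows how `index_col=0` turns such a column into the index, followed by a few quick sanity checks. The CSV text is a hand-typed stand-in built from the first three rows shown above, with only some of the columns, not the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for diamonds.csv: the first three rows printed above,
# with only a few of the columns, typed in by hand.
csv_text = """,carat,cut,color,clarity,price
1,0.23,Ideal,E,SI2,326
2,0.21,Premium,E,SI1,326
3,0.23,Good,E,VS1,327
"""

# Default read: the header-less first column is named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that first column as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())

# Quick sanity checks: size, column types, and missing values per column.
print(sample.shape)
print(sample.dtypes)
print(sample.isna().sum())
```

With the real file, the same idea would be `pd.read_csv("diamonds.csv", index_col=0)`, provided the first column really is just a row counter.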

Here we calculate some values from the column named price. Note that we can access a column as an attribute of the DataFrame object:

# Avoid naming a variable "sum": it would shadow Python's built-in function
total_price = df.price.sum()
print("Total $ Value of Diamonds: $", total_price)

mean_price = df.price.mean()
print("Mean $ Value of Diamonds: $", mean_price)

Total $ Value of Diamonds: $ 212135217
Mean $ Value of Diamonds: $ 3932.799721913237
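The raw output above is a little hard to read. As a small sketch, the snippet below computes both statistics in one call with `agg()` and formats them with thousands separators. It uses bracket access (`sample["price"]`), which also works for column names that contain spaces or clash with DataFrame methods. The `sample` frame is a hand-typed assumption holding only the five prices from the `head()` output, so the numbers differ from the full dataset:

```python
import pandas as pd

# Hand-typed sample: the five prices shown in the head() output.
sample = pd.DataFrame({"price": [326, 326, 327, 334, 335]})

# agg() computes several statistics at once and returns a Series.
stats = sample["price"].agg(["sum", "mean"])

# Format specifiers add thousands separators and fix the decimal places.
print(f"Total value of diamonds: ${stats['sum']:,.0f}")   # $1,648
print(f"Mean value of diamonds: ${stats['mean']:,.2f}")   # $329.60
```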

Now we call the pandas describe() method to summarize the carat column:

descrip = df.carat.describe()
print(descrip)
count    53940.000000
mean         0.797940
std          0.474011
min          0.200000
25%          0.400000
50%          0.700000
75%          1.040000
max          5.010000
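The 25%, 50% and 75% rows in this summary are quantiles of the column. A minimal sketch of how they relate to `quantile()`, and of how to ask `describe()` for other percentiles, is shown below. It uses a hand-typed sample of the five carat values from the `head()` output, which is an assumption for illustration and not the full dataset:

```python
import pandas as pd

# Hand-typed sample: the five carat values shown in the head() output.
sample = pd.DataFrame({"carat": [0.23, 0.21, 0.23, 0.29, 0.31]})

# describe() reports count, mean, std, min, quartiles and max.
summary = sample["carat"].describe()
print(summary)

# The same quartiles, computed directly.
quartiles = sample["carat"].quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() accepts other percentiles; the median (50%) is always included.
wide = sample["carat"].describe(percentiles=[0.1, 0.9])
print(wide)
```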

The following statement prints a description of all non-numeric columns in our DataFrame: in particular, the cut, color and clarity columns:

descrip = df.describe(include='object')
print(descrip)
          cut  color clarity
count   53940  53940   53940
unique      5      7       8
top     Ideal      G     SI1
freq    21551  11292   13065
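In this summary, `top` is the most frequent value in a column and `freq` is how often it occurs. To see the full distribution rather than only the winner, `value_counts()` is a natural next step. The sketch below uses a hand-typed sample of the five clarity values from the `head()` output (an assumption for illustration), so its counts do not match the table above:

```python
import pandas as pd

# Hand-typed sample: the five clarity values shown in the head() output.
sample = pd.DataFrame({"clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"]})

# Full frequency table, most common value first.
counts = sample["clarity"].value_counts()
print(counts)

# The same information as shares of the total.
shares = sample["clarity"].value_counts(normalize=True)
print(shares)

# describe() on a text column reports count, unique, top and freq.
summary = sample["clarity"].describe()
print(summary)
```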

Data Visualization with Matplotlib

Now we move to the data visualization part of our project on Diamonds Analysis with Python. Our first graph is a scatter plot of carat size against the clarity of the diamond:

import matplotlib.pyplot as plt
carat = df.carat
clarity = df.clarity
plt.scatter(clarity, carat)
plt.show()
scatter plot
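A scatter plot against a categorical axis stacks many points on top of each other, so it can help to put numbers next to it. As a rough sketch, the snippet below summarizes carat per clarity grade with `groupby()`. The `sample` frame is a hand-typed assumption built from the five rows of the `head()` output, not the full dataset:

```python
import pandas as pd

# Hand-typed sample: clarity and carat from the five head() rows.
sample = pd.DataFrame({
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
})

# One row per clarity grade: how many stones, their median and largest carat.
by_clarity = sample.groupby("clarity")["carat"].agg(["count", "median", "max"])
print(by_clarity)
```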

Now, the second visualization will be a bar plot to visualize the number of diamonds in each clarity type:

# Count the diamonds in each clarity grade once, then split labels and counts
clarity_counts = df["clarity"].value_counts()
clarityindexes = clarity_counts.index.tolist()
claritycount = clarity_counts.values.tolist()

print(clarityindexes)
print(claritycount)
plt.bar(clarityindexes, claritycount)
plt.show()
bar plot
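`value_counts()` sorts by frequency, so the bars appear from most to least common grade. If you would rather see them in grading order, one option is to reindex the counts before plotting. The order in `CLARITY_ORDER` follows the usual clarity scale from lowest to highest, which is my assumption since the dataset itself does not state it, and the `sample` frame is again a hand-typed stand-in:

```python
import pandas as pd

# Clarity grades from lowest to highest on the usual grading scale
# (an assumption: the dataset does not state this order).
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Hand-typed sample: the five clarity values shown in the head() output.
sample = pd.DataFrame({"clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"]})

counts = sample["clarity"].value_counts()

# Reorder the counts by grade; grades missing from the sample get 0.
ordered = counts.reindex(CLARITY_ORDER, fill_value=0)
print(ordered)

# These two lists can be passed to plt.bar(labels, heights).
labels = ordered.index.tolist()
heights = ordered.tolist()
```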

Diamonds Analysis with Python: Find Correlations

In this project on Diamonds Analysis with Python, the last plot I’m going to show you is a heatmap. It graphically shows the correlations between the numeric columns in our dataset. For this graph, we take all the numeric columns and compute a correlation matrix that shows how closely each pair of columns correlates.

To quickly generate this graph, we need another Python package called seaborn. Seaborn provides an API built on Matplotlib that integrates with pandas DataFrames, making it well suited to data science:

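import seaborn

# Drop the leftover row-number column so it does not enter the correlations
df = df.drop("Unnamed: 0", axis=1)
f, ax = plt.subplots(figsize=(10, 8))
# Correlate only the numeric columns (cut, color and clarity are text);
# the numeric_only argument needs a reasonably recent pandas
corr = df.corr(numeric_only=True)
print(corr)
# np.bool is unavailable in some NumPy releases; the built-in bool always works.
# An all-False mask hides no cells.
seaborn.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
                cmap=seaborn.diverging_palette(220, 10, as_cmap=True),
                square=True, ax=ax)
plt.show()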
diamonds analysis
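A correlation matrix is symmetric, so every pair appears twice in the heatmap. The `mask` argument of `seaborn.heatmap` hides the cells where the mask is `True`, which makes it possible to show each pair only once. Below is a minimal sketch of building such a mask with NumPy. The `sample` frame is a hand-typed assumption using three columns from the `head()` output, and the plotting call itself is left as a comment:

```python
import numpy as np
import pandas as pd

# Hand-typed sample: three numeric columns from the five head() rows.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
})

corr = sample.corr()

# True marks a cell to hide. k=1 starts one step above the diagonal,
# so the diagonal and the lower triangle stay visible.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)

# The mask would then be passed as: seaborn.heatmap(corr, mask=mask, ...)
```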

The first thing to notice on the graph is that the redder the colour, the higher the correlation between the two variables. The diagonal band from top left to bottom right shows that each variable is perfectly correlated with itself: carat with carat, for example. No surprise there. The x, y, and z variables are strongly correlated with each other, indicating that when the diamonds in our dataset grow in one dimension, they also grow in the other two.
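Colours give a quick overview, but for exact values it is easier to read them out of the correlation matrix. As a sketch, the snippet below ranks the columns by their correlation with price and pulls out the block for the three dimensions. The `sample` frame is a hand-typed assumption built from the five rows of the `head()` output, so its correlations are not those of the full dataset:

```python
import pandas as pd

# Hand-typed sample: the numeric columns visible in the five head() rows.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "table": [55.0, 61.0, 65.0, 58.0, 58.0],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

corr = sample.corr()

# Correlation of every other column with price, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)

# The 3x3 block for the dimensions x, y and z.
dims = corr.loc[["x", "y", "z"], ["x", "y", "z"]]
print(dims)
```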

And the price? As the carat and the dimensions increase, the price also increases, which makes sense. Interestingly, depth is not strongly correlated with price at all and, if anything, is slightly negatively correlated. I hope you liked this article on Diamonds Analysis with Python. Feel free to ask your questions in the comments section below. You can also follow me on Medium to learn more about Machine Learning and Python.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.