In this article, I will analyze diamonds with Python using data science tools. For this first problem, I chose a fairly simple dataset that you can download from Kaggle. Now let’s start with this simple project of Diamonds Analysis with Python.
I will start by importing the necessary libraries like NumPy and pandas to read and get some insights into the data:
import pandas as pd
import numpy as np
Data Science Project: Diamonds Analysis with Python
Read the diamond file into a pandas DataFrame. Note: We did not have to format or manipulate the data in this file for the task of diamond analysis with Python. This is not a normal situation in real data science. You will often spend a lot of time getting your data where you want it, sometimes as much time as the rest of the project. Now let’s read the data:
df = pd.read_csv("diamonds.csv")
Just to check for consistency, let’s print the first five rows of the DataFrame:
print(df.head())
   Unnamed: 0  carat      cut color clarity  ...  table  price     x     y     z
0           1   0.23    Ideal     E     SI2  ...   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium     E     SI1  ...   61.0    326  3.89  3.84  2.31
2           3   0.23     Good     E     VS1  ...   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium     I     VS2  ...   58.0    334  4.20  4.23  2.63
4           5   0.31     Good     J     SI2  ...   58.0    335  4.34  4.35  2.75
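The `Unnamed: 0` column in the output above is what pandas calls a column whose header field is empty, which suggests the file stores its row numbers as a first, unnamed column. As a minimal sketch, assuming that is the case, `index_col=0` tells `read_csv` to use that column as the index instead. The CSV text below is a hypothetical three-row excerpt built from the rows printed above, with only some of the columns kept:

```python
import io

import pandas as pd

# Hypothetical excerpt of the file: the empty first header field is what
# pandas labels "Unnamed: 0". Only a subset of the columns is kept here.
csv_text = """,carat,cut,color,clarity,price
1,0.23,Ideal,E,SI2,326
2,0.21,Premium,E,SI1,326
3,0.23,Good,E,VS1,327
"""

# Default read: the row-number column becomes a regular data column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
print(indexed.head())
```

If the file is read this way, there is no `Unnamed: 0` column left to drop later on.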
Here we are calculating some values from the column named price. Note that we can use the column as part of the DataFrame object:
# Use "total" rather than "sum" so Python's built-in sum() is not shadowed
total = df.price.sum()
print("Total $ Value of Diamonds: $", total)
mean = df.price.mean()
print("Mean $ Value of Diamonds: $", mean)
Total $ Value of Diamonds: $ 212135217
Mean $ Value of Diamonds: $ 3932.799721913237
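Accessing a column as an attribute (`df.price`) is shorthand for the bracket form (`df["price"]`). Here is a small sketch of that equivalence, using only the five prices printed earlier so that it runs without the full file. The name `sample` is introduced just for this illustration:

```python
import pandas as pd

# Prices of the first five diamonds shown in the head() output.
sample = pd.DataFrame({"price": [326, 326, 327, 334, 335]})

# Attribute access and bracket access return the same Series.
by_attribute = sample.price
by_bracket = sample["price"]
print(by_attribute.equals(by_bracket))

# Convert to plain Python numbers before formatting for display.
total = int(by_bracket.sum())
mean = float(by_bracket.mean())
print(f"Total: ${total:,}")
print(f"Mean: ${mean:,.2f}")
```

Bracket access is the safer habit in general: it also works for column names that contain spaces or punctuation (such as `Unnamed: 0`) or that collide with a DataFrame method name.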
Next, we call the pandas describe() method to summarize the data in the carat column:
descrip = df.carat.describe()
print(descrip)
count    53940.000000
mean         0.797940
std          0.474011
min          0.200000
25%          0.400000
50%          0.700000
75%          1.040000
max          5.010000
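Each row of this summary corresponds to a dedicated Series method, which is handy when only one statistic is needed. The sketch below checks that correspondence on the five carat values from the head() output. The variable names are illustrative, and the numbers describe this tiny sample, not the full dataset:

```python
import pandas as pd

# Carat values of the first five diamonds shown earlier.
carat = pd.Series([0.23, 0.21, 0.23, 0.29, 0.31], name="carat")

summary = carat.describe()

# Each entry of describe() matches a single-purpose method.
print(summary["count"], carat.count())
print(summary["mean"], carat.mean())
print(summary["std"], carat.std())            # sample standard deviation (ddof=1)
print(summary["50%"], carat.median())         # the 50% row is the median
print(summary["75%"], carat.quantile(0.75))
```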
The following statement prints a description of all non-numeric columns in our DataFrame, namely the cut, color and clarity columns:
descrip = df.describe(include='object')
print(descrip)
          cut  color clarity
count   53940  53940   53940
unique      5      7       8
top     Ideal      G     SI1
freq    21551  11292   13065
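For text columns, the unique, top and freq rows can be reproduced from value counts. The following sketch does this for the five clarity values from the head() output. The variable names are illustrative and the results describe only this small sample:

```python
import pandas as pd

# Clarity grades of the first five diamonds shown earlier.
clarity = pd.Series(["SI2", "SI1", "VS1", "VS2", "SI2"], name="clarity")

counts = clarity.value_counts()

unique = clarity.nunique()   # number of distinct values ("unique" row)
top = counts.idxmax()        # most frequent value ("top" row)
freq = int(counts.max())     # how often it occurs ("freq" row)
print(unique, top, freq)
```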
Data Visualization with Matplotlib
Now we move to the data visualization part of our project on Diamonds Analysis with Python. Our first graph is a scatter plot showing the clarity of the diamond versus the carat size of the diamond:
import matplotlib.pyplot as plt

carat = df.carat
clarity = df.clarity
plt.scatter(clarity, carat)
plt.show()
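Because clarity is stored as text, the grades appear along the x-axis in the order they are first encountered rather than from worst to best. One possible way to get a grade-ordered axis is to turn the column into an ordered categorical and sort by it before plotting. The grade order in `CLARITY_ORDER` is an assumption based on the usual clarity scale and is not stated in the dataset itself. The five-row `sample` stands in for the full DataFrame:

```python
import pandas as pd

# Assumed grade order from worst to best (not taken from the dataset itself).
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# First five rows from the head() output, reduced to the two plotted columns.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
})

# An ordered categorical sorts by grade instead of alphabetically.
sample["clarity"] = pd.Categorical(
    sample["clarity"], categories=CLARITY_ORDER, ordered=True
)
ordered = sample.sort_values("clarity", kind="stable")
print(ordered)
```

Passing `ordered["clarity"].astype(str)` and `ordered["carat"]` to `plt.scatter` should then lay out the grades in this order, since Matplotlib places string categories in order of first appearance.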
Now, the second visualization will be a bar plot to visualize the number of diamonds in each clarity type:
# Count the diamonds per clarity grade once and reuse the result
clarity_counts = df["clarity"].value_counts()
clarityindexes = clarity_counts.index.tolist()
claritycount = clarity_counts.values.tolist()
print(clarityindexes)
print(claritycount)
plt.bar(clarityindexes, claritycount)
plt.show()
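If shares are more useful than raw counts, `value_counts` accepts `normalize=True` and returns fractions that sum to 1. Here is a short sketch using the five clarity values from the head() output, so the fractions apply to that sample only:

```python
import pandas as pd

# Clarity grades of the first five diamonds shown earlier.
clarity = pd.Series(["SI2", "SI1", "VS1", "VS2", "SI2"], name="clarity")

# normalize=True returns the fraction of rows per grade instead of the count.
shares = clarity.value_counts(normalize=True)
print(shares)
```

The resulting index and values can be passed to `plt.bar` in the same way as the counts.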
Diamonds Analysis with Python: Find Correlations
In this project on Diamonds Analysis with Python, the last plot I’m going to show you is a heatmap. It is used to graphically show the correlations between the numeric columns in our dataset. In this graph, we take all the numeric columns and create a correlation matrix that shows how closely each pair of columns correlates.
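To make the idea of a correlation concrete, here is a minimal sketch of the Pearson coefficient, which is the default method pandas uses for correlations: the covariance of two columns divided by the product of their standard deviations. The helper `pearson` and the five-row `sample` (taken from the head() output) are introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Carat and price of the first five diamonds shown earlier.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "price": [326, 326, 327, 334, 335],
})


def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())


manual = pearson(sample["carat"], sample["price"])
builtin = sample["carat"].corr(sample["price"])
print(manual, builtin)
```

The result always lies between -1 and 1, and a column compared with itself gives exactly 1.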
To quickly generate this graph, we need another Python package called seaborn. Seaborn provides an API built on Matplotlib that integrates with pandas DataFrames, making it ideal for data science:
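import seaborn

# Drop the row-number column so it is not treated as a feature
df = df.drop("Unnamed: 0", axis=1)
f, ax = plt.subplots(figsize=(10, 8))
# numeric_only=True skips the text columns (cut, color, clarity),
# which recent pandas versions no longer ignore silently
corr = df.corr(numeric_only=True)
print(corr)
# np.bool was removed from recent NumPy releases, so use the built-in bool.
# An all-False mask hides no cells.
seaborn.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=seaborn.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)
plt.show()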
The first thing to notice on the graph is that the redder the colour, the higher the correlation between the two variables. The diagonal band running from the top left to the bottom right shows that each variable is perfectly correlated with itself: carat with carat, for example. No surprise there. The x, y, and z variables are strongly correlated with each other, indicating that when a diamond in our dataset is larger in one dimension, it tends to be larger in the other two as well.
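A correlation matrix is symmetric, so every pair appears twice, once on each side of the diagonal. One optional refinement is to hide the duplicates above the diagonal with a boolean mask. The sketch below builds such a mask for a three-column `sample` taken from the head() output. With only five rows the correlation values themselves are not meaningful, and the point is only the shape of the mask:

```python
import numpy as np
import pandas as pd

# Three numeric columns from the first five rows shown earlier.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
})
corr = sample.corr()

# True marks a cell to hide. k=1 selects everything strictly above the
# diagonal, so the diagonal itself stays visible.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)
```

Passing this array as `mask=mask` to `seaborn.heatmap` leaves the cells marked `True` blank.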
And the price? As the carat and size increase, the price also increases, which makes sense. Interestingly, depth is not strongly correlated with price at all and, in fact, is slightly negatively correlated. I hope you liked this article on Diamonds Analysis with Python.