Galaxy Classification with Machine Learning

The first galaxy was observed by the Persian astronomer Abd al-Rahman al-Sufi over 1,000 years ago, and it was first believed to be an unknown extended structure. It is now known as Messier 31, the famous Andromeda Galaxy. From that point on, such structures were observed and recorded more frequently, but it took more than nine centuries for astronomers to agree that they were not just astronomical objects but entire galaxies. In this article, I will introduce you to a Galaxy Classification Model with Machine Learning.

As more galaxies were discovered, astronomers noticed that their morphologies differed. They began grouping previously reported and newly discovered galaxies by morphological features, which eventually formed a meaningful classification scheme.

Galaxy Classification Model

Astronomy has evolved massively in parallel with advances in computing. Sophisticated computational techniques such as machine learning models are far more practical now, thanks to dramatic gains in computer performance and the huge amount of data available to us today.

For a long time, galaxy classification was done by hand by large groups of experienced people, who cross-checked one another's results. With this inspiration, I will introduce you to a Galaxy Classification Model with Machine Learning.

The dataset that I am using is very large, so you will need to be patient while downloading it. The dataset can be downloaded from here.

Exploring The Data

Now, let’s start this task of creating a Galaxy Classification Model by importing all the necessary packages, then reading the data and exploring it to have a quick look at what we are going to work with:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
cf.go_offline()
%matplotlib inline

# Reading the data (files.upload() opens a file picker in Google Colab)
from google.colab import files
uploaded = files.upload()
zoo = pd.read_csv('GalaxyZoo1_DR_table2.csv')
zoo.head()

The first column is a unique identifier, which cannot be a feature for our model, and the second and third columns are the absolute positions of the galaxies, which do not correlate with our classes/targets, so we can remove all three:

data = zoo.drop(['OBJID','RA','DEC'],axis=1)

Because this is a classification model, we have to check for class imbalance. In any classification task, even a binary one, class imbalance can have a major effect during the training phase and ultimately on accuracy. To plot the value counts for the three class columns, we can use the code below:

plt.figure(figsize=(10,7))
plt.title('Count plot for Galaxy types')
# Melt the three one-hot class columns into long form so each class gets its own bars
countplt = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']]
sns.countplot(x="variable",hue='value', data=pd.melt(countplt))
plt.xlabel('Classes')
plt.show()
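
To put numbers on what the count plot shows, the class shares can also be computed directly. This is a minimal sketch on a made-up one-hot array standing in for the SPIRAL, ELLIPTICAL and UNCERTAIN columns; the helper name class_shares and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

# Toy one-hot labels; columns stand for SPIRAL, ELLIPTICAL, UNCERTAIN.
# The values are invented for illustration only.
toy_labels = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
])

def class_shares(one_hot):
    """Return the fraction of rows that belong to each class."""
    counts = one_hot.sum(axis=0)
    return counts / counts.sum()

shares = class_shares(toy_labels)
for name, share in zip(["SPIRAL", "ELLIPTICAL", "UNCERTAIN"], shares):
    print(f"{name}: {share:.0%}")
```

On the real data, the same function could be applied to data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].values. A share far above the others is a hint that plain accuracy may flatter the model.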

Splitting The Data

For any machine learning model that learns from data, it is conventional to divide the original data into a training set and a test set, typically allocating 80% to the training set and 20% to the test set. As a rule of thumb, the dataset should have at least 1,000 data points to reduce the risk of overfitting and to give the model enough examples to train on. So now let’s split the data into training and test sets:

# Features are everything except the three class columns; targets are the one-hot class columns
X = data.drop(['SPIRAL','ELLIPTICAL','UNCERTAIN'],axis=1).values
y = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].values
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=101)
# Normalising the data: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
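
The scaling step can be reproduced by hand to see why the scaler is fitted on the training set only. The sketch below uses plain NumPy on invented numbers; min_max_fit and min_max_apply are hypothetical helpers that mirror what MinMaxScaler does with its default range of 0 to 1, not the library's own code.

```python
import numpy as np

def min_max_fit(train):
    """Learn the per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid dividing by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def min_max_apply(data, col_min, col_range):
    """Scale data with statistics learned from the training set."""
    return (data - col_min) / col_range

# Invented feature values: three training rows, two test rows.
train = np.array([[10.0, 0.2], [20.0, 0.4], [30.0, 0.8]])
test = np.array([[25.0, 0.5], [40.0, 0.1]])

col_min, col_range = min_max_fit(train)
train_scaled = min_max_apply(train, col_min, col_range)
test_scaled = min_max_apply(test, col_min, col_range)

print(train_scaled)  # every training column spans 0 to 1
print(test_scaled)   # test values may fall outside 0 to 1
```

Values outside the 0 to 1 range in the test set are expected: the test rows are scaled with the training minimum and maximum, so no information about the test set leaks into preprocessing.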

Building Neural Networks for Galaxy Classification Model

Sequential, in Keras, lets us build a multilayer perceptron layer by layer. Each layer is added with a number of units as the first argument of Dense, which sets how many densely connected neurons the layer has. Now let’s build the neural network using TensorFlow and Keras:

import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()

# Two hidden layers with ReLU activations
model.add(Dense(10,activation='relu'))
model.add(Dense(5,activation='relu'))

# Output layer: one unit per class (SPIRAL, ELLIPTICAL, UNCERTAIN)
model.add(Dense(3, activation = 'softmax'))

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
# Start the timer used to report the training time
start = time.perf_counter()
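
To make the output layer and the loss less of a black box, here is a small NumPy sketch of what softmax and categorical cross-entropy compute for a batch. The logits and labels are made-up numbers, and both functions are simplified stand-ins written for illustration, not the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Two invented samples with three classes each.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.1, 0.2, 3.0]])
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.round(3))
print(f"loss = {loss:.4f}")
```

The loss shrinks toward zero as the probability given to the correct class approaches 1, which is what the optimizer pushes for during training.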

Now let’s fit the model to the data. This will take some time, because the dataset is very large and neural networks are slow to train:

model.fit(x=X_train,y=y_train,epochs=20)
print('\nTIME ELAPSED {}Seconds'.format(time.perf_counter() - start))
Epoch 1/20
16699/16699 [==============================] - 9s 551us/step - loss: 0.2877 - accuracy: 0.8750
Epoch 2/20
16699/16699 [==============================] - 9s 538us/step - loss: 0.2618 - accuracy: 0.8881
Epoch 3/20
16699/16699 [==============================] - 9s 551us/step - loss: 0.2595 - accuracy: 0.8891
Epoch 4/20
16699/16699 [==============================] - 9s 539us/step - loss: 0.2549 - accuracy: 0.8898
Epoch 5/20
16699/16699 [==============================] - 9s 537us/step - loss: 0.2470 - accuracy: 0.8916
Epoch 6/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2422 - accuracy: 0.8920
Epoch 7/20
16699/16699 [==============================] - 9s 541us/step - loss: 0.2387 - accuracy: 0.8929
Epoch 8/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2332 - accuracy: 0.8943
Epoch 9/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2297 - accuracy: 0.8952
Epoch 10/20
16699/16699 [==============================] - 9s 545us/step - loss: 0.2256 - accuracy: 0.8977
Epoch 11/20
16699/16699 [==============================] - 9s 546us/step - loss: 0.2235 - accuracy: 0.8986
Epoch 12/20
16699/16699 [==============================] - 11s 688us/step - loss: 0.2222 - accuracy: 0.8990
Epoch 13/20
16699/16699 [==============================] - 11s 644us/step - loss: 0.2217 - accuracy: 0.8994
Epoch 14/20
16699/16699 [==============================] - 9s 542us/step - loss: 0.2210 - accuracy: 0.8994
Epoch 15/20
16699/16699 [==============================] - 10s 571us/step - loss: 0.2208 - accuracy: 0.8995
Epoch 16/20
16699/16699 [==============================] - 10s 608us/step - loss: 0.2203 - accuracy: 0.8996
Epoch 17/20
16699/16699 [==============================] - 9s 565us/step - loss: 0.2201 - accuracy: 0.8993
Epoch 18/20
16699/16699 [==============================] - 9s 561us/step - loss: 0.2196 - accuracy: 0.8995
Epoch 19/20
16699/16699 [==============================] - 10s 602us/step - loss: 0.2192 - accuracy: 0.8998
Epoch 20/20
16699/16699 [==============================] - 10s 591us/step - loss: 0.2189 - accuracy: 0.8999

TIME ELAPSED 189.8537580230004Seconds
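
The 16699 at the start of each progress line is the number of batches per epoch, not the number of galaxies. Here is a quick back-of-the-envelope check. It assumes the Keras default batch size of 32, since the fit call sets none, and it estimates the training set as four times the 133,589 test rows shown in the classification report further down, because the split is 80/20. The training-set size is therefore an estimate, not a value taken from the notebook.

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_test = 133589                 # support shown in the classification report
n_train_estimate = 4 * n_test   # rough estimate from the 80/20 split
print(steps_per_epoch(n_train_estimate))
```

Under these assumptions, the estimate agrees with the step count in the training log.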

Now let’s plot the accuracy of the neural network at each epoch:

mod_history = pd.DataFrame(model.history.history)
plt.figure(figsize=(10,7))
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
plt.title('Model History')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.plot(mod_history['accuracy'],color='orange',lw=2)

From this accuracy graph, we can see that accuracy climbs quickly over the first few epochs and then changes very little; in the log above, it moves only from 0.8990 at epoch 12 to 0.8999 at epoch 20. Now let’s compute a confusion matrix for our model and print a classification report:

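# predict_classes was removed from recent TensorFlow releases; take the argmax of the predicted probabilities instead
y_pred = model.predict(X_test).argmax(axis=1)
from sklearn.metrics import confusion_matrix,classification_report
# y_test is one-hot encoded, so argmax converts it back to class indices
print(confusion_matrix(y_test.argmax(axis=1),y_pred))
print(classification_report(y_test.argmax(axis=1),y_pred))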
            precision    recall  f1-score   support

           0       0.84      0.93      0.88     38281
           1       0.90      0.77      0.83     12554
           2       0.93      0.90      0.92     82754

    accuracy                           0.90    133589
   macro avg       0.89      0.87      0.87    133589
weighted avg       0.90      0.90      0.90    133589
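
The numbers in a report like this can be derived from the confusion matrix itself. The sketch below does so in NumPy for a small invented set of labels; confusion and per_class_metrics are hypothetical helpers, and the toy data has nothing to do with the galaxy results above.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

def per_class_metrics(matrix):
    """Precision, recall and F1 for each class from a confusion matrix."""
    true_pos = np.diag(matrix).astype(float)
    precision = true_pos / matrix.sum(axis=0)  # column sums: predicted counts
    recall = true_pos / matrix.sum(axis=1)     # row sums: actual counts
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# One-hot targets are converted to class indices with argmax, as in the article.
one_hot = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
                    [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]])
y_true = one_hot.argmax(axis=1)
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])  # invented predictions

matrix = confusion(y_true, y_pred, n_classes=3)
precision, recall, f1 = per_class_metrics(matrix)
accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(precision.round(2), recall.round(2), f1.round(2), round(accuracy, 2))
```

Precision asks how many of the galaxies predicted as a class really belong to it, while recall asks how many galaxies of a class were found. The macro average in the report is the plain mean of the per-class values.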

Well, this is very basic astronomical data with features that I can’t even begin to interpret. But still, we got very good results. If I had an astronomy background to study, organize and add more features, I am sure this model would work even better than it did.

I hope you liked this article on Galaxy Classification model with Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.

Aman Kharwal