The first galaxy was observed by the Persian astronomer Abd al-Rahman al-Sufi over 1,000 years ago. It was first believed to be an unknown extended structure, and it is now known as Messier 31, the famous Andromeda Galaxy. From that point on, such structures were observed and recorded more frequently, but it took more than nine centuries for astronomers to agree that they were not just astronomical objects but entire galaxies. In this article, I will introduce you to a Galaxy Classification Model with Machine Learning.
As more galaxies were discovered and catalogued, astronomers noticed their differing morphologies. They began grouping previously reported and newly discovered galaxies by morphological features, which led to a meaningful classification scheme.
Galaxy Classification Model
Astronomy has evolved enormously in parallel with advances in computing. Sophisticated computational techniques such as machine learning models are far more practical now, thanks to the dramatic increase in computer performance and the huge amount of data available today.
In the past, galaxy classification was done by hand by large groups of experienced people, who evaluated the results by cross-checking one another's classifications. With this as inspiration, I will introduce you to a Galaxy Classification Model with Machine Learning.
The dataset that I am using is very large, so please be patient while downloading it. The dataset can be downloaded from here.
Exploring The Data
Now, let’s start this task of creating a Galaxy Classification Model by importing the necessary packages, then reading the data and taking a quick look at what we are going to work with:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
cf.go_offline()
%matplotlib inline
#Reading the data
from google.colab import files
uploaded = files.upload()
zoo = pd.read_csv('GalaxyZoo1_DR_table2.csv')
zoo.head()

The first column is a unique identifier which cannot be a feature for our model, and the second and third columns are the absolute positions of galaxies which do not correlate with our classes/targets, so we can remove them all:
data = zoo.drop(['OBJID','RA','DEC'],axis=1)
Because this is a classification task, we have to check for class imbalance. Even in a binary classification problem, class imbalance can have a major effect during training and ultimately on accuracy. To plot the value counts for the three class columns, we can use the code below:
plt.figure(figsize=(10,7))
plt.title('Count plot for Galaxy types ')
countplt = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']]
sns.countplot(x="variable",hue='value', data=pd.melt(countplt))
plt.xlabel('Classes')
plt.show()
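The count plot shows the imbalance visually; the same check can also be done as plain numbers. Below is a minimal sketch of that idea in dependency-free Python. The rows in `toy_rows` are hypothetical values (not taken from the Galaxy Zoo table), the helper name `class_shares` is my own, and I am assuming each row flags exactly one of the three classes.

```python
# Toy one-hot labels: hypothetical values, NOT taken from the Galaxy Zoo table.
# Assumption: each row flags exactly one of the three classes.
CLASSES = ["SPIRAL", "ELLIPTICAL", "UNCERTAIN"]
toy_rows = [
    {"SPIRAL": 1, "ELLIPTICAL": 0, "UNCERTAIN": 0},
    {"SPIRAL": 0, "ELLIPTICAL": 0, "UNCERTAIN": 1},
    {"SPIRAL": 0, "ELLIPTICAL": 1, "UNCERTAIN": 0},
    {"SPIRAL": 0, "ELLIPTICAL": 0, "UNCERTAIN": 1},
    {"SPIRAL": 0, "ELLIPTICAL": 0, "UNCERTAIN": 1},
]


def class_shares(rows, classes):
    """Return the fraction of rows flagged with each class."""
    total = len(rows)
    return {name: sum(row[name] for row in rows) / total for name in classes}


shares = class_shares(toy_rows, CLASSES)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

On the real DataFrame, the mean of each 0/1 column gives the same kind of share, so `data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].mean()` should produce the equivalent numbers.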

Splitting The Data
For any machine learning model that learns from data, it is conventional to divide the original data into a training set and a test set, commonly with 80% of the data for training and 20% for testing. The dataset should also have at least 1,000 data points, to reduce the risk of overfitting and to give the model more to train on. So now let’s split the data into training and test sets and scale the features:
X = data.drop(['SPIRAL','ELLIPTICAL','UNCERTAIN'],axis=1).values
y = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].values
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=101)
# normalising the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
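Note that the scaler is fitted on the training set only and then reused on the test set. To make that concrete, here is a minimal sketch of min-max scaling written in plain Python. The function names `fit_min_max` and `apply_min_max` and the two-feature rows are hypothetical, chosen only for illustration; this shows the idea, not scikit-learn's internals.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows only."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each column with (x - min) / (max - min) using the learned bounds."""
    return [
        [(x - lo) / (hi - lo) if hi != lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical two-feature rows, not taken from the Galaxy Zoo table.
train_rows = [[10.0, 0.2], [20.0, 0.4], [30.0, 0.8]]
test_rows = [[40.0, 0.5]]

mins, maxs = fit_min_max(train_rows)                    # bounds come from train only
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)      # reuse the same bounds

print(train_scaled)
print(test_scaled)
```

Because the bounds come from the training rows, a test value can land outside the 0 to 1 range (the first test feature here does). That is expected, and it is the price of not letting information from the test set leak into preprocessing.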
Building Neural Networks for Galaxy Classification Model
Sequential, in Keras, lets us build a multilayer perceptron layer by layer. Each layer is added with Dense, whose first argument is the number of units, that is, the number of densely connected neurons in that layer. Now let’s build the neural network using TensorFlow and Keras:
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(10,activation='relu'))
model.add(Dense(5,activation='relu'))
# three output units, one per class column (SPIRAL, ELLIPTICAL, UNCERTAIN)
model.add(Dense(3, activation = 'softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
# start the timer used to report the training time
start = time.perf_counter()
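The last layer uses softmax and the model is compiled with categorical cross-entropy. As a rough illustration of what those two choices compute for a single galaxy, here is a small sketch in plain Python. The raw scores in `logits` are made up, and the functions are simplified stand-ins of my own, not Keras's implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Loss for one sample: minus the log-probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# Hypothetical raw scores from the last layer, in the order
# SPIRAL, ELLIPTICAL, UNCERTAIN (the column order used to build y).
logits = [2.0, 0.5, 0.1]
probs = softmax(logits)

loss_if_spiral = categorical_crossentropy([1, 0, 0], probs)
loss_if_uncertain = categorical_crossentropy([0, 0, 1], probs)

print([round(p, 3) for p in probs])
print(round(loss_if_spiral, 3), round(loss_if_uncertain, 3))
```

The loss is small when the network puts most of its probability on the true class and large when it does not, which is what the optimizer pushes against during training.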
Now let’s fit the data to our neural network. This will take some time to run, because the dataset is very large:
model.fit(x=X_train,y=y_train,epochs=20)
print('\nTIME ELAPSED {}Seconds'.format(time.perf_counter() - start))
Epoch 1/20
16699/16699 [==============================] - 9s 551us/step - loss: 0.2877 - accuracy: 0.8750
Epoch 2/20
16699/16699 [==============================] - 9s 538us/step - loss: 0.2618 - accuracy: 0.8881
Epoch 3/20
16699/16699 [==============================] - 9s 551us/step - loss: 0.2595 - accuracy: 0.8891
Epoch 4/20
16699/16699 [==============================] - 9s 539us/step - loss: 0.2549 - accuracy: 0.8898
Epoch 5/20
16699/16699 [==============================] - 9s 537us/step - loss: 0.2470 - accuracy: 0.8916
Epoch 6/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2422 - accuracy: 0.8920
Epoch 7/20
16699/16699 [==============================] - 9s 541us/step - loss: 0.2387 - accuracy: 0.8929
Epoch 8/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2332 - accuracy: 0.8943
Epoch 9/20
16699/16699 [==============================] - 9s 540us/step - loss: 0.2297 - accuracy: 0.8952
Epoch 10/20
16699/16699 [==============================] - 9s 545us/step - loss: 0.2256 - accuracy: 0.8977
Epoch 11/20
16699/16699 [==============================] - 9s 546us/step - loss: 0.2235 - accuracy: 0.8986
Epoch 12/20
16699/16699 [==============================] - 11s 688us/step - loss: 0.2222 - accuracy: 0.8990
Epoch 13/20
16699/16699 [==============================] - 11s 644us/step - loss: 0.2217 - accuracy: 0.8994
Epoch 14/20
16699/16699 [==============================] - 9s 542us/step - loss: 0.2210 - accuracy: 0.8994
Epoch 15/20
16699/16699 [==============================] - 10s 571us/step - loss: 0.2208 - accuracy: 0.8995
Epoch 16/20
16699/16699 [==============================] - 10s 608us/step - loss: 0.2203 - accuracy: 0.8996
Epoch 17/20
16699/16699 [==============================] - 9s 565us/step - loss: 0.2201 - accuracy: 0.8993
Epoch 18/20
16699/16699 [==============================] - 9s 561us/step - loss: 0.2196 - accuracy: 0.8995
Epoch 19/20
16699/16699 [==============================] - 10s 602us/step - loss: 0.2192 - accuracy: 0.8998
Epoch 20/20
16699/16699 [==============================] - 10s 591us/step - loss: 0.2189 - accuracy: 0.8999
TIME ELAPSED 189.8537580230004Seconds
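The 16699 in each log line is the number of batches per epoch, not the number of galaxies. Since no batch_size was passed to model.fit, Keras uses its default of 32. The sketch below is a quick sanity check of my own (not part of the original workflow) that this step count is consistent with the 80/20 split, using the 133,589 test rows reported as the total support in the classification report later in this article.

```python
import math

BATCH_SIZE = 32           # Keras default for model.fit when batch_size is not passed
STEPS_PER_EPOCH = 16699   # from the training log
TEST_ROWS = 133589        # total "support" in the classification report
TRAIN_TO_TEST_RATIO = 4   # an 80% / 20% split means four training rows per test row

# steps = ceil(train_rows / batch_size), so train_rows must lie in this range:
min_train_rows = (STEPS_PER_EPOCH - 1) * BATCH_SIZE + 1
max_train_rows = STEPS_PER_EPOCH * BATCH_SIZE

# Rough training-set size implied by the split. This is an estimate:
# rounding inside the split can shift it by a few rows.
estimated_train_rows = TEST_ROWS * TRAIN_TO_TEST_RATIO

print(min_train_rows, max_train_rows)
print(estimated_train_rows)
print(math.ceil(estimated_train_rows / BATCH_SIZE))
```

The estimate falls inside the range implied by the log, so each epoch really is one pass over roughly 534 thousand training rows in batches of 32.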
Now let’s plot the training accuracy of the neural network at each epoch:
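mod_history = pd.DataFrame(model.history.history)
# 'seaborn-whitegrid' was renamed in newer matplotlib releases, so use
# seaborn's own white grid style, and set it before creating the figure
sns.set_style('whitegrid')
plt.figure(figsize=(10,7))
plt.title('Model History')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.plot(mod_history['accuracy'],color='orange',lw=2)
plt.show()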

From this accuracy graph, we can see that after a certain epoch, i.e. approximately from the 6th epoch, the accuracy changes very little, rising only from about 0.892 to 0.900 over the remaining epochs. Now let’s compute a confusion matrix for our model and print a classification report:
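# predict_classes was removed from recent TensorFlow releases;
# take the argmax of the predicted class probabilities instead
y_pred = model.predict(X_test).argmax(axis=1)
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test.argmax(axis=1),y_pred))
print(classification_report(y_test.argmax(axis=1),y_pred))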
              precision    recall  f1-score   support

           0       0.84      0.93      0.88     38281
           1       0.90      0.77      0.83     12554
           2       0.93      0.90      0.92     82754

    accuracy                           0.90    133589
   macro avg       0.89      0.87      0.87    133589
weighted avg       0.90      0.90      0.90    133589
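In this report, classes 0, 1 and 2 correspond to SPIRAL, ELLIPTICAL and UNCERTAIN, because that is the column order used to build y. To show how the two average rows relate to the per-class rows, here is a short sketch that recomputes them from the rounded numbers printed above. The helper names are my own, and since the inputs are already rounded, the results only match the report to two decimals.

```python
# Per-class rows copied from the classification report:
# (precision, recall, support) for class indices 0, 1, 2.
report = {
    0: (0.84, 0.93, 38281),
    1: (0.90, 0.77, 12554),
    2: (0.93, 0.90, 82754),
}
PRECISION, RECALL, SUPPORT = 0, 1, 2

total_support = sum(row[SUPPORT] for row in report.values())


def macro_average(column):
    """Plain mean over classes: every class counts equally."""
    return sum(row[column] for row in report.values()) / len(report)


def weighted_average(column):
    """Mean over classes weighted by support (number of true samples)."""
    return sum(row[column] * row[SUPPORT] for row in report.values()) / total_support


print(total_support)
print(round(macro_average(PRECISION), 2), round(macro_average(RECALL), 2))
print(round(weighted_average(PRECISION), 2), round(weighted_average(RECALL), 2))
```

The macro recall is lower than the weighted recall because the smallest class (class 1, with 12,554 samples) has the weakest recall, and the macro average gives it the same weight as the much larger classes. This is the class imbalance from earlier showing up in the metrics.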
Well, this is very basic astronomical data with features that I can’t even begin to interpret, but we still got very good results. If I had an astronomy background to study, organize, and add more features, this model would surely perform even better than it did.
I hope you liked this article on Galaxy Classification model with Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.