Data ETL Pipeline using Python

Developing data ETL pipelines is one of the most valuable skills for data engineers. ETL stands for Extract, Transform, Load: data is extracted from a source, transformed in some way, and then loaded into a database. So, if you want to learn how to develop a data ETL pipeline, this article is for you. In this article, I will take you through how to develop a data ETL pipeline using Python.
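Conceptually, any ETL pipeline boils down to three functions chained together. Here is a minimal sketch of that idea (the function names and the hard-coded records are purely illustrative, not part of any library):

```python
def extract():
    # Extract: pull raw records from a source (a hard-coded list here,
    # standing in for an API call, file read, or database query)
    return [{"name": "shirt", "price": "20"}, {"name": "bag", "price": "35"}]

def transform(records):
    # Transform: clean/convert each record, e.g. cast prices to float
    return [{"name": r["name"], "price": float(r["price"])} for r in records]

def load(records, store):
    # Load: write the transformed records to a destination (a list here)
    store.extend(records)
    return store

store = []
load(transform(extract()), store)
print(store)  # [{'name': 'shirt', 'price': 20.0}, {'name': 'bag', 'price': 35.0}]
```

The pipeline we build below follows exactly this shape, with Keras as the source and SQLite as the destination.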

To develop a Data ETL pipeline using Python, the first step is to collect data from a data source. Let’s use the Fashion-MNIST dataset provided by the Keras library to keep things beginner-friendly:

import tensorflow.keras as keras

# Extract: download the Fashion-MNIST images and labels
(xtrain, ytrain), (xtest, ytest) = keras.datasets.fashion_mnist.load_data()

Before moving forward, let’s have a look at the shape of the data:

print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

Output:

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)

Now let’s clean and transform the data. Here we will normalize the pixel values to be between 0 and 1 and reshape the data into a 4D tensor:

import numpy as np

xtrain = xtrain.astype('float32') / 255
xtest = xtest.astype('float32') / 255

xtrain = np.reshape(xtrain, (xtrain.shape[0], 28, 28, 1))
xtest = np.reshape(xtest, (xtest.shape[0], 28, 28, 1))

print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

Output:

(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)

Now let’s load the data into a database. We can use SQLite to create a database and load the data into it:

import sqlite3

conn = sqlite3.connect('fashion_mnist.db')

conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
             image BLOB NOT NULL,
             label INTEGER NOT NULL);''')

# Insert the training images; .tobytes() serializes each array to a BLOB,
# and int() converts the numpy label to a plain Python integer, since
# sqlite3 does not accept numpy arrays or numpy integers as parameters
for i in range(xtrain.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [xtrain[i].tobytes(), int(ytrain[i])])

conn.commit()

# Insert the test images the same way
for i in range(xtest.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [xtest[i].tobytes(), int(ytest[i])])

conn.commit()

conn.close()

In the above code:

  1. The first line imports the sqlite3 module, which lets us work with SQLite databases from Python;
  2. We then open a connection to the database file (creating it if it doesn’t exist);
  3. Next, we create a table called “images” with an auto-incrementing id, an image BLOB, and an integer label;
  4. We loop over the training images and insert each one, serialized to bytes, into the “images” table along with its label;
  5. We call the commit() method to save the changes to the database;
  6. Another loop inserts the test images and their labels the same way;
  7. We call commit() again to persist those changes;
  8. Finally, we close the connection to the database.

So this is how we can create a Data ETL pipeline using Python. Our ETL pipeline takes the Fashion MNIST dataset and stores it in an SQLite database so that we can easily access and manipulate the data later.
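One design note: inserting 70,000 rows one execute() call at a time is slow. If performance matters, the standard executemany() method batches all the inserts into a single call. Here is a sketch on a small made-up batch (an in-memory database and random arrays, so the snippet is self-contained):

```python
import sqlite3
import numpy as np

conn = sqlite3.connect(':memory:')  # in-memory DB, just for this sketch
conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
             image BLOB NOT NULL,
             label INTEGER NOT NULL);''')

# Tiny stand-in batch: three 28x28x1 float32 "images" with labels
images = np.random.rand(3, 28, 28, 1).astype('float32')
labels = np.array([0, 1, 2])

# Build all (blob, label) pairs first, then insert them in one call
rows = [(img.tobytes(), int(lbl)) for img, lbl in zip(images, labels)]
conn.executemany('INSERT INTO images (image, label) VALUES (?, ?)', rows)
conn.commit()

print(conn.execute('SELECT COUNT(*) FROM images').fetchone()[0])  # 3
```

The same pattern applies to the full pipeline above: build the list of rows from xtrain and ytrain, then hand it to executemany() once.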

Now, this is how you can read the data you stored in the SQLite database:

import sqlite3
import pandas as pd

conn = sqlite3.connect('fashion_mnist.db')

# Option 1: fetch the rows with a cursor
cursor = conn.cursor()
cursor.execute('SELECT * FROM images')
rows = cursor.fetchall()

# Option 2: load the table straight into a pandas DataFrame
data = pd.read_sql_query('SELECT * FROM images', conn)

conn.close()
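Note that each image comes back from the database as raw bytes, so you need np.frombuffer to turn a BLOB back into an array. Here is a minimal sketch of that round trip (using an in-memory database and one fake image so the snippet is self-contained):

```python
import sqlite3
import numpy as np

# Self-contained setup: store one fake 28x28x1 float32 image
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE images (id INTEGER PRIMARY KEY, image BLOB, label INTEGER)')
original = np.random.rand(28, 28, 1).astype('float32')
conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
             [original.tobytes(), 7])

# Read the row back and rebuild the array from the BLOB;
# the dtype and shape must match what was written
blob, label = conn.execute('SELECT image, label FROM images').fetchone()
image = np.frombuffer(blob, dtype='float32').reshape(28, 28, 1)

print(image.shape, label)                 # (28, 28, 1) 7
print(np.array_equal(image, original))    # True
```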

Summary

ETL stands for Extract, Transform, Load: data is extracted from a source, transformed in some way, and then loaded into a database. I hope you liked this article on developing a data ETL pipeline using Python. Feel free to ask valuable questions in the comments section below.

Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.
