Developing Data ETL pipelines is one of the most valuable skills for data engineers. ETL stands for Extract, Transform, Load: data is extracted from a source, transformed in some way, and then loaded into a database. So, if you want to learn how to develop a Data ETL pipeline, this article is for you. In this article, I will take you through how to develop a Data ETL pipeline using Python.
Data ETL Pipeline using Python
To develop a Data ETL pipeline using Python, the first step is to collect data from a data source. Let’s use the Fashion-MNIST dataset provided by the Keras library to keep things beginner-friendly:
import tensorflow.keras as keras

(xtrain, ytrain), (xtest, ytest) = keras.datasets.fashion_mnist.load_data()
Before moving forward, let’s have a look at the shape of the data:
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)
(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
Now let’s clean and transform the data. Here we will normalize the pixel values to be between 0 and 1 and reshape the data into a 4D tensor:
import numpy as np

xtrain = xtrain.astype('float32') / 255
xtest = xtest.astype('float32') / 255

xtrain = np.reshape(xtrain, (xtrain.shape[0], 28, 28, 1))
xtest = np.reshape(xtest, (xtest.shape[0], 28, 28, 1))

print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)
(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)
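As an optional sanity check (my addition, not part of the original walkthrough), you can confirm that the normalization produced values in the expected range and dtype:

print(xtrain.min(), xtrain.max())  # should print 0.0 1.0
print(xtrain.dtype)                # should print float32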
Now let’s load the data into a database. We can use SQLite to create a database and load the data into it:
import sqlite3

conn = sqlite3.connect('fashion_mnist.db')

conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
              image BLOB NOT NULL,
              label INTEGER NOT NULL);''')

# Serialize each image to raw bytes and cast the label to a plain Python int,
# since sqlite3 cannot bind NumPy arrays or NumPy integer types directly.
for i in range(xtrain.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [sqlite3.Binary(xtrain[i].tobytes()), int(ytrain[i])])
conn.commit()

for i in range(xtest.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [sqlite3.Binary(xtest[i].tobytes()), int(ytest[i])])
conn.commit()

conn.close()
In the above code:
- The first line imports sqlite3, the built-in module that lets us work with SQLite databases in Python;
- We then create a connection to the database (the fashion_mnist.db file is created if it does not already exist);
- Next, we create a table in the database called “images”, with an auto-incrementing id, an image BLOB, and an integer label;
- We loop through each image in the training data and insert it into the “images” table along with its label (a batched alternative using executemany() is sketched after this list);
- We call the commit() method to save the changes we made to the database;
- We then loop through each image in the test data and insert it into the “images” table along with its label;
- We call commit() again to save those changes;
- Finally, we close the connection to the database.
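As promised above, here is a batched alternative. This is my sketch rather than the original code: sqlite3’s executemany() inserts all rows in a single call, which is usually much faster than calling execute() once per row. Run it before the connection is closed:

# Build a (blob, label) tuple for every training image, then insert them in one batch.
train_rows = [(xtrain[i].tobytes(), int(ytrain[i])) for i in range(xtrain.shape[0])]
conn.executemany('INSERT INTO images (image, label) VALUES (?, ?)', train_rows)
conn.commit()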
So this is how we can create a Data ETL pipeline using Python. Our ETL pipeline takes the Fashion-MNIST dataset and stores it in an SQLite database so that we can easily access and manipulate the data later.
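If you plan to reuse this pipeline, one possible way to organize it (a sketch of one structure, not the only one) is to wrap each stage in its own function, so the code reads as extract → transform → load. This assumes the “images” table created above already exists:

import sqlite3
import numpy as np
import tensorflow.keras as keras

def extract():
    # Extract: pull the raw Fashion-MNIST arrays from Keras.
    return keras.datasets.fashion_mnist.load_data()

def transform(images):
    # Transform: scale pixels to [0, 1] and add a channel dimension.
    images = images.astype('float32') / 255
    return np.reshape(images, (images.shape[0], 28, 28, 1))

def load(conn, images, labels):
    # Load: write each image/label pair into the images table.
    for image, label in zip(images, labels):
        conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                     [sqlite3.Binary(image.tobytes()), int(label)])
    conn.commit()

(xtrain, ytrain), (xtest, ytest) = extract()
conn = sqlite3.connect('fashion_mnist.db')
load(conn, transform(xtrain), ytrain)
load(conn, transform(xtest), ytest)
conn.close()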
Now, this is how you can read the data you stored in the SQLite database:
import sqlite3
import pandas as pd

conn = sqlite3.connect('fashion_mnist.db')

# Fetch the raw rows with a cursor...
cursor = conn.cursor()
cursor.execute('SELECT * FROM images')
rows = cursor.fetchall()

# ...or read the whole table into a pandas DataFrame.
data = pd.read_sql_query('SELECT * FROM images', conn)

conn.close()
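One caveat worth noting (my addition, not from the original walkthrough): the image column comes back as raw bytes, so to turn a row back into an array you need to reverse the tobytes() call from the loading step. A minimal sketch, assuming the float32 blobs written above:

import numpy as np

# Each row is (id, image_blob, label); frombuffer() reverses tobytes().
row_id, image_blob, label = rows[0]
image = np.frombuffer(image_blob, dtype=np.float32).reshape(28, 28, 1)
print(image.shape, label)  # (28, 28, 1) and the integer class label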
Summary
ETL stands for Extract, Transform, Load: data is extracted from a source, transformed in some way, and then loaded into a database. I hope you liked this article on developing a Data ETL pipeline using Python. Feel free to ask valuable questions in the comments section below.