Convert Text into Numerical Data using Python

Text analysis is one of the major applications of machine learning. Converting textual data into numerical data is known as vectorization. It is an important step because machine learning algorithms work only on numerical data and cannot be applied to raw text directly. In this article, I will take you through how to convert text into numerical data using Python.

Convert Text into Numerical Data

Machine learning algorithms can only process numerical data in the form of arrays, so they cannot be used directly on textual data. This is why text, images, audio, or any other type of data must first be converted into numerical data before a machine learning algorithm can use it. For text, this conversion is known as vectorization.
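To make the idea concrete, here is a minimal sketch of vectorization in plain Python, without any library. The helper name bag_of_words and the two sample sentences are my own assumptions for illustration; the sketch simply counts how often each word of a shared vocabulary appears in each text.

```python
import re

def bag_of_words(texts):
    """Hypothetical helper: turn a list of texts into word-count vectors."""
    # Lowercase each text and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
    # The vocabulary is the sorted set of all tokens seen in any text.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok in toks:
            row[index[tok]] += 1  # count each occurrence of the word
        vectors.append(row)
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each text becomes one row of numbers, and each position in the row stands for one word of the vocabulary.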

Converting textual data to numerical data is not a difficult task, as the Scikit-learn library in Python provides several tools for it. In the section below, I’ll walk you through how to convert text to numerical data using Python.

Convert Text into Numerical Data using Python

Now that it is clear why textual data must be converted into numerical data before using machine learning algorithms, let’s see how to do it with the Scikit-learn library in Python. I will start by importing the CountVectorizer class from Scikit-learn and creating an instance of it:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

Now I will store some texts into a Python list:

text = ["Hi, how are you", "I hope you are doing good", "My name is Aman Kharwal"]

Now I will fit the CountVectorizer on the list so that it learns the vocabulary of the texts:

vect.fit(text)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
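After fitting, the learned vocabulary can be inspected through the vocabulary_ attribute, which maps each word to its column index. The snippet below is a small self-contained sketch that repeats the steps above; the variable name vocab is my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hi, how are you", "I hope you are doing good", "My name is Aman Kharwal"]
vect = CountVectorizer()
vect.fit(text)

# vocabulary_ maps each learned word to its column index;
# sort by index so the words appear in column order.
vocab = dict(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(vocab)
```

Note that the words are lowercased and that the single-letter word "I" is missing: the default token_pattern only keeps tokens of two or more characters.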

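Now let’s transform the texts into an array of numerical data. The transform method returns a sparse matrix, and toarray() converts it into a regular NumPy array in which each row represents one text and each column counts one word of the vocabulary: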

train = vect.transform(text)
train.toarray()
array([[0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]])
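The same fitted vectorizer can also be applied to text it has not seen before. The sketch below uses a made-up sentence (new_text is my own example) to show two things: a repeated word gets a count higher than 1, and words outside the learned vocabulary are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hi, how are you", "I hope you are doing good", "My name is Aman Kharwal"]
vect = CountVectorizer().fit(text)

# A hypothetical new sentence: "you" appears twice, while
# "and", "python" and "fun" are not in the learned vocabulary.
new_text = ["You are you, and Python is fun"]
counts = vect.transform(new_text).toarray()
print(counts)  # [[0 1 0 0 0 0 0 1 0 0 0 2]]
```

The row still has 12 columns, one per word learned during fit, which is why the vectorizer should be fitted once and then reused on new data.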

Now that the textual data has been converted into numerical data, we can also present it as a pandas DataFrame with one column per word:

import pandas as pd
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
data = pd.DataFrame(train.toarray(), columns=vect.get_feature_names_out())
data

Summary

The process of converting textual data into numerical data is known as vectorization in machine learning. I hope you liked this article on how to convert textual data into numerical data using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
Coder with the ♥️ of a Writer || Data Scientist | Solopreneur | Founder