Process of NLP using Python

Natural Language Processing (NLP) is a subset of Artificial Intelligence where we aim to train computers to understand human languages. Some real-world applications of NLP are chatbots, Siri, and Google Translate. While working on any problem based on NLP, we should follow a process to turn a textual dataset into features a model can learn from. So, if you want to understand the process of solving any problem based on NLP, this article is for you. In this article, I will take you through the complete process of NLP using Python.

Process of NLP

To explain the process of NLP, I will take you through the sentiment classification task using Python. The steps to solve this NLP problem are:

  1. Finding a dataset for sentiment classification
  2. Preparing the dataset by tokenization, stopwords removal, and stemming
  3. Text vectorization
  4. Training a classification model for sentiment classification


Step 1: Finding a Dataset

The first step while working on any NLP problem is to find a textual dataset. For this problem, we need a dataset containing text about people's sentiments towards a product or service. If the dataset you find is already labelled, perfect! If you find an unlabelled textual dataset, you can learn how to add labels to a dataset for sentiment classification from here.

I found an ideal dataset based on movie reviews for the sentiment classification task on Kaggle. You can download the dataset from here.

As we have found a dataset for sentiment classification, let’s move further by importing the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import nltk
nltk.download('stopwords')

data = pd.read_csv("IMDB Dataset.csv")
print(data.head())
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
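
Before preparing the reviews, it's worth taking a quick look at the dataset itself. The checks below are not part of the original steps; they just confirm the size of the data, the balance of the two sentiment classes, and that there are no missing values:

# Optional sanity checks on the raw dataset
print(data.shape)                        # number of reviews and columns
print(data["sentiment"].value_counts())  # how many positive vs. negative reviews
print(data.isnull().sum())               # confirm there are no missing values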

Step 2: Data Preparation, Tokenization, Stopwords Removal and Stemming

Our textual dataset needs preparation before being used for any problem based on NLP. Here we will:

  1. remove links and all the special characters from the review column
  2. tokenize and remove the stopwords from the review column
  3. stem the words in the review column

import re
import string
import nltk
from nltk.corpus import stopwords

stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words("english"))

def clean(text):
    # lowercase the review text
    text = str(text).lower()
    # remove text in square brackets, links, HTML tags, punctuation,
    # newlines and words containing digits
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # remove stopwords
    text = " ".join(word for word in text.split(' ') if word not in stopword)
    # stem the remaining words
    text = " ".join(stemmer.stem(word) for word in text.split(' '))
    return text

data["review"] = data["review"].apply(clean)

Before moving forward, let’s have a quick look at a word cloud of the review column:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = " ".join(i for i in data.review)
wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Step 3: Text Vectorization

The next step is text vectorization, which means transforming the text into numerical feature vectors that a machine learning model can work with. Here I will first perform text vectorization on the feature column (the review column) and then split the data into training and test sets:

x = np.array(data["review"])
y = np.array(data["sentiment"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=42)
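
To get a feel for what the vectorizer produced, you can inspect the resulting matrix. This is just a quick check, not part of the original steps:

# The result is a sparse document-term matrix: one row per review,
# one column per word in the learned vocabulary
print(X.shape)                      # (number of reviews, vocabulary size)
print(len(cv.vocabulary_))          # number of unique tokens learned
print(X_train.shape, X_test.shape)  # the 80/20 split of the same matrix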

Step 4: Text Classification

The final step in the process of NLP is to classify or cluster texts. As we are working on the problem of sentiment classification, we will now train a text classification model. Here, I will use scikit-learn’s Passive Aggressive classifier for sentiment classification:

from sklearn.linear_model import PassiveAggressiveClassifier
model = PassiveAggressiveClassifier()
model.fit(X_train,y_train)
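
The next step tests the model on a custom input; if you also want a number for how well it generalizes, you can score it on the held-out test set first (an optional check, not part of the original steps):

# Optional: accuracy on the 20% of reviews the model has not seen during training
print("Test accuracy:", model.score(X_test, y_test))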

The dataset we used to train a sentiment classification model contains movie reviews. So let’s test the model by giving a movie review as an input:

user = input("Enter a Text: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)
Enter a Text: one of the worst movies I have ever seen!
['negative']

So this is how you can solve any problem based on NLP using the Python programming language.

Summary

While working on any NLP problem, we need to:

  1. find a textual dataset
  2. prepare the dataset by tokenization, stopwords removal, and stemming
  3. perform text vectorization
  4. train a text classification or clustering model

I hope you liked this article on the complete process of NLP using Python. Feel free to ask valuable questions in the comments section below.
