Natural Language Processing (NLP) is a subfield of Artificial Intelligence where we aim to train computers to understand human languages. Some real-world applications of NLP are chatbots, Siri, and Google Translate. While working on any NLP problem, we should follow a process to prepare a vocabulary of words from a textual dataset. So, if you want to understand the process of solving an NLP problem end to end, this article is for you. In this article, I will take you through the complete process of NLP using Python.
Process of NLP
To explain the process of NLP, I will take you through the sentiment classification task using Python. The steps to solve this NLP problem are:
- Finding a dataset for sentiment classification
- Preparing the dataset by tokenization, stopwords removal, and stemming
- Text vectorization
- Training a classification model for sentiment classification
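The steps above can be sketched end to end on a tiny hand-made dataset before we dive into the real one. The reviews and labels below are invented purely for illustration; the actual movie-review dataset comes in Step 1:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# A tiny invented corpus standing in for the real movie-review dataset
reviews = [
    "a wonderful and touching film",
    "an awful boring mess",
    "truly wonderful acting",
    "boring and awful script",
]
labels = ["positive", "negative", "positive", "negative"]

# Vectorize the text, then fit a classifier on the counts
cv = CountVectorizer()
X = cv.fit_transform(reviews)
model = PassiveAggressiveClassifier(random_state=42)
model.fit(X, labels)

# Predict the sentiment of an unseen review
print(model.predict(cv.transform(["what a wonderful film"])))
```

This is exactly the shape of the pipeline we will build below, just with a real dataset and a proper cleaning step in between.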
Process of NLP using Python
Step 1: Finding a Dataset
The first step while working on any NLP problem is to find a textual dataset. For this problem, we need a dataset containing text about people's sentiments towards a product or service. If the dataset you find is labelled, it's perfect! If you find an unlabelled textual dataset, you can learn how to add labels to a dataset for sentiment classification from here.
I found an ideal dataset based on movie reviews for the sentiment classification task on Kaggle. You can download the dataset from here.
As we have found a dataset for sentiment classification, let’s move further by importing the necessary Python libraries and the dataset:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import nltk
nltk.download('stopwords')

data = pd.read_csv("IMDB Dataset.csv")
print(data.head())
```
```
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
```
Step 2: Data Preparation (Tokenization, Stopword Removal, and Stemming)
Our textual dataset needs preparation before it can be used for any NLP problem. Here we will:
- remove links and all the special characters from the review column
- tokenize and remove the stopwords from the review column
- stem the words in the review column
```python
import re
import string
import nltk
from nltk.corpus import stopwords

stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                              # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    text = " ".join(word for word in text.split(' ') if word not in stopword)
    text = " ".join(stemmer.stem(word) for word in text.split(' '))
    return text

data["review"] = data["review"].apply(clean)
```
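To see what stopword removal and stemming actually do, here is a minimal sketch on a single sentence. To keep it self-contained, it uses a small hand-picked stopword set instead of NLTK's full downloaded English list, which the article's code uses:

```python
import nltk

stemmer = nltk.SnowballStemmer("english")

# A tiny hand-picked stopword set, for illustration only;
# the article uses NLTK's full English stopword list instead
tiny_stopwords = {"one", "of", "the", "i", "have", "ever"}

sentence = "one of the worst movies i have ever seen"
tokens = [w for w in sentence.split() if w not in tiny_stopwords]
stems = [stemmer.stem(w) for w in tokens]
print(stems)  # ['worst', 'movi', 'seen']
```

Note how "movies" is reduced to the stem "movi": stems are not always dictionary words, but they let different inflections of the same word share one vocabulary entry.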
Before moving forward, let’s have a quick look at the wordcloud of the review column:
```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = " ".join(i for i in data.review)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```
Step 3: Text Vectorization
The next step is text vectorization: transforming all the text tokens into numerical vectors. Here I will first perform text vectorization on the feature column (the review column) and then split the data into training and test sets:
```python
x = np.array(data["review"])
y = np.array(data["sentiment"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
```
Step 4: Text Classification
The final step in the process of NLP is to classify or cluster texts. Since we are working on a sentiment classification problem, we will now train a text classification model. Here's how to train a text classification model for sentiment classification:
```python
from sklearn.linear_model import PassiveAggressiveClassifier

model = PassiveAggressiveClassifier()
model.fit(X_train, y_train)
```
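The code above trains the model but never measures it, so it is worth checking accuracy on held-out data (in the article's code, `model.score(X_test, y_test)` on the split from Step 3 does this). Here is the pattern on a small invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

# Invented mini-dataset mirroring the article's Step 3/4 code
texts = ["great film", "terrible film", "great plot",
         "terrible plot", "great acting", "terrible acting"]
labels = ["positive", "negative"] * 3

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Train on the first four reviews, hold out the last two
model = PassiveAggressiveClassifier(random_state=42)
model.fit(X[:4], labels[:4])

# Score the model on the reviews it never saw during training
predictions = model.predict(X[4:])
print(accuracy_score(labels[4:], predictions))
```

Accuracy on data the model was trained on is misleadingly high, which is exactly why Step 3 set aside a test split.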
The dataset we used to train a sentiment classification model contains movie reviews. So let’s test the model by giving a movie review as an input:
```python
user = input("Enter a Text: ")
# Store the vectorized input in a new variable so the
# original DataFrame `data` is not overwritten
user_vector = cv.transform([user]).toarray()
output = model.predict(user_vector)
print(output)
```
```
Enter a Text: one of the worst movies I have ever seen!
['negative']
```
So this is how you can solve any problem of NLP using the Python programming language.
While working on any problem of NLP, we first need to:
- find a textual dataset
- prepare the dataset with tokenization, stopword removal, and stemming
- perform text vectorization
- and finally, classify or cluster the texts
I hope you liked this article on the complete process of NLP using Python. Feel free to ask valuable questions in the comments section below.