Topic Modelling means assigning topic labels to a collection of text documents. The goal of topic modelling is to identify topics present in the text documents. So, if you want to learn how to perform topic modelling, this article is for you. In this article, I will take you through the task of Topic Modelling with Machine Learning using Python.
Topic Modelling
Topic Modelling is a Natural Language Processing technique to uncover hidden topics from text documents. It helps identify topics of the text documents to find relationships between the content of a text document and the topic.
To identify topics of any text document, we need to use algorithms that can analyze the frequency of words to identify relationships between the content and topics. To solve this problem, we need to have textual data. I found an ideal dataset for this task. You can download the dataset from here.
In the section below, I will take you through how to perform topic modelling with Machine Learning using the Python programming language.
Topic Modelling using Python
Let’s start this task by importing the necessary Python libraries and the dataset:
import numpy as np import pandas as pd import nltk from nltk.corpus import stopwords from nltk.stem.wordnet import WordNetLemmatizer import string from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer from sklearn.decomposition import LatentDirichletAllocation nltk.download('punkt') nltk.download('stopwords') nltk.download('wordnet') nltk.download('omw-1.4') data = pd.read_csv("articles.csv", encoding = 'latin1') print(data.head())
Article \ 0 Data analysis is the process of inspecting and... 1 The performance of a machine learning algorith... 2 You must have seen the news divided into categ... 3 When there are only two classes in a classific... 4 The Multinomial Naive Bayes is one of the vari... Title 0 Best Books to Learn Data Analysis 1 Assumptions of Machine Learning Algorithms 2 News Classification with Machine Learning 3 Multiclass Classification Algorithms in Machin... 4 Multinomial Naive Bayes in Machine Learning
As we are working on a Natural Language Processing problem, we need to clean the textual content by removing punctuation and stopwords. Here’s how we can clean the textual data:
def preprocess_text(text): # Convert text to lowercase text = text.lower() # Remove punctuation text = text.translate(str.maketrans('', '', string.punctuation)) # Tokenize text tokens = nltk.word_tokenize(text) # Remove stopwords stop_words = set(stopwords.words("english")) tokens = [word for word in tokens if word not in stop_words] # Lemmatize tokens lemma = WordNetLemmatizer() tokens = [lemma.lemmatize(word) for word in tokens] # Join tokens to form preprocessed text preprocessed_text = ' '.join(tokens) return preprocessed_text data['Article'] = data['Article'].apply(preprocess_text)
Now we need to convert the textual data into a numerical representation. We can use text vectorization here:
vectorizer = TfidfVectorizer() x = vectorizer.fit_transform(data['Article'].values)
Now we will use an algorithm to identify relationships between the textual data to assign topic labels. We can use the Latent Dirichlet Allocation algorithm for this task. Latent Dirichlet Allocation (LDA) is a generative probabilistic algorithm used to uncover the underlying topics in a corpus of textual data. Let’s use the LDA algorithm to assign topic labels:
lda = LatentDirichletAllocation(n_components=5, random_state=42) lda.fit(x) topic_modelling = lda.transform(x) topic_labels = np.argmax(topic_modelling, axis=1) data['topic_labels'] = topic_labels
Now here’s the final data with topic labels:
print(data.head())
Article \ 0 data analysis process inspecting exploring dat... 1 performance machine learning algorithm particu... 2 must seen news divided category go news websit... 3 two class classification problem problem binar... 4 multinomial naive bayes one variant naive baye... Title topic_labels 0 Best Books to Learn Data Analysis 2 1 Assumptions of Machine Learning Algorithms 3 2 News Classification with Machine Learning 1 3 Multiclass Classification Algorithms in Machin... 3 4 Multinomial Naive Bayes in Machine Learning 1
So this is how you can assign topic labels with Machine Learning using the Python programming language.
Summary
Topic Modelling is a Natural Language Processing technique to uncover hidden topics from text documents. It helps identify topics of the text documents to find relationships between the content of a text document and the topic. I hope you liked this article on Topic Modelling with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.