Topic Modelling using Python

Topic Modelling means assigning topic labels to a collection of text documents. The goal of topic modelling is to identify topics present in the text documents. So, if you want to learn how to perform topic modelling, this article is for you. In this article, I will take you through the task of Topic Modelling with Machine Learning using Python.

Topic Modelling

Topic Modelling is a Natural Language Processing technique to uncover hidden topics from text documents. It helps identify topics of the text documents to find relationships between the content of a text document and the topic.

To identify topics of any text document, we need to use algorithms that can analyze the frequency of words to identify relationships between the content and topics. To solve this problem, we need to have textual data. I found an ideal dataset for this task. You can download the dataset from here.

In the section below, I will take you through how to perform topic modelling with Machine Learning using the Python programming language.

Topic Modelling using Python

Let’s start this task by importing the necessary Python libraries and the dataset:

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

data = pd.read_csv("articles.csv", encoding = 'latin1')
print(data.head())
                                             Article  \
0  Data analysis is the process of inspecting and...   
1  The performance of a machine learning algorith...   
2  You must have seen the news divided into categ...   
3  When there are only two classes in a classific...   
4  The Multinomial Naive Bayes is one of the vari...   

                                               Title  
0                  Best Books to Learn Data Analysis  
1         Assumptions of Machine Learning Algorithms  
2          News Classification with Machine Learning  
3  Multiclass Classification Algorithms in Machin...  
4        Multinomial Naive Bayes in Machine Learning  

As we are working on a Natural Language Processing problem, we need to clean the textual content by removing punctuation and stopwords. Here’s how we can clean the textual data:

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens
    lemma = WordNetLemmatizer()
    tokens = [lemma.lemmatize(word) for word in tokens]
    # Join tokens to form preprocessed text
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

data['Article'] = data['Article'].apply(preprocess_text)

Now we need to convert the textual data into a numerical representation. We can use text vectorization here:

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(data['Article'].values)

Now we will use an algorithm to identify relationships between the textual data to assign topic labels. We can use the Latent Dirichlet Allocation algorithm for this task. Latent Dirichlet Allocation (LDA) is a generative probabilistic algorithm used to uncover the underlying topics in a corpus of textual data. Let’s use the LDA algorithm to assign topic labels:

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(x)

topic_modelling = lda.transform(x)

topic_labels = np.argmax(topic_modelling, axis=1)
data['topic_labels'] = topic_labels

Now here’s the final data with topic labels:

print(data.head())
                                             Article  \
0  data analysis process inspecting exploring dat...   
1  performance machine learning algorithm particu...   
2  must seen news divided category go news websit...   
3  two class classification problem problem binar...   
4  multinomial naive bayes one variant naive baye...   

                                               Title  topic_labels  
0                  Best Books to Learn Data Analysis             2  
1         Assumptions of Machine Learning Algorithms             3  
2          News Classification with Machine Learning             1  
3  Multiclass Classification Algorithms in Machin...             3  
4        Multinomial Naive Bayes in Machine Learning             1  

So this is how you can assign topic labels with Machine Learning using the Python programming language.

Summary

Topic Modelling is a Natural Language Processing technique to uncover hidden topics from text documents. It helps identify topics of the text documents to find relationships between the content of a text document and the topic. I hope you liked this article on Topic Modelling with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1501

Leave a Reply