Topic Modeling with Machine Learning

Every day, businesses process large volumes of unstructured text, from customer emails to online reviews. To make sense of all this text, we use topic modeling. In this article, I’ll introduce you to Topic Modeling with Machine Learning using Python.

Machine Learning Project on Topic Modeling

Topic modeling can be seen as a machine learning task that takes the huge volume of text generated by modern computing and web technology and represents it in a low-dimensional form, surfacing the hidden concepts, important characteristics, or latent variables of the data, depending on the context of the application.


In the section below, I will take you through a Machine Learning project on Topic Modeling with Python using the BERTopic library. You can install this library with the pip command: pip install bertopic==0.3.4. When working with BERTopic, be sure to select a GPU runtime; otherwise, the algorithm may take a long time to create the document embeddings. If your device does not have a GPU, you can use Google Colab for this task.
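Before training, you can quickly check whether a GPU is visible (a minimal sketch, assuming PyTorch is installed; it is pulled in by BERTopic’s sentence-transformers backend):

import torch

# True means the document embeddings can be computed on a GPU
print(torch.cuda.is_available())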

Topic Modeling with Machine Learning using Python

Now let’s start with the task of Topic Modeling with Machine Learning using Python by importing the necessary Python libraries and the dataset:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus, stripping headers, footers and quotes
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
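As a quick sanity check, you can confirm the corpus size:

# Number of newsgroup posts loaded (around 18,000)
print(len(docs))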

For this example, I am using the popular 20 Newsgroups dataset provided by Scikit-learn, which contains around 18,000 newsgroup articles on 20 topics. Since the documents are in English, I will select English as the main language for the model:

# Instantiate BERTopic for English documents and fit it on the corpus
model = BERTopic(language="english")
topics, probs = model.fit_transform(docs)
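The fit_transform method returns a topic identifier for each document along with the computed topic probabilities. A quick way to inspect them (an illustrative check, not part of the original walkthrough):

# Each document is assigned a topic id; -1 marks outliers
print(topics[:10])
# One probability distribution per document
print(len(probs))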

Now let’s extract the most frequent topics:

model.get_topic_freq().head()

   Topic  Count
0     -1   7288
1     49   3992
2     30    701
3     27    684
4     11    568

You can see -1 in the first row; topic -1 refers to all the outliers and should generally be ignored.
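If you want to work only with real topics, you can filter the outliers out of the frequency table first (a minimal sketch using the pandas DataFrame that get_topic_freq returns):

# Drop the outlier topic (-1) before analyzing topic sizes
freq = model.get_topic_freq()
freq = freq[freq.Topic != -1]
print(freq.head())

Next, let’s take a look at the most common topic generated: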

# Top 10 terms of topic 49, with their c-TF-IDF scores
model.get_topic(49)[:10]
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

Note that the model is stochastic, which means that the topics may differ from one run to another. Now let’s take a look at the full list of supported languages:

from bertopic import languages
print(languages)
['Afrikaans', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Azerbaijani', 'Basque', 'Belarusian', 'Bengali', 'Bengali Romanize', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Burmese zawgyi font', 'Catalan', 'Chinese (Simplified)', 'Chinese (Traditional)', 'Croatian', 'Czech', 'Danish', 'Dutch', 'English', 'Esperanto', 'Estonian', 'Filipino', 'Finnish', 'French', 'Galician', 'Georgian', 'German', 'Greek', 'Gujarati', 'Hausa', 'Hebrew', 'Hindi', 'Hindi Romanize', 'Hungarian', 'Icelandic', 'Indonesian', 'Irish', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kazakh', 'Khmer', 'Korean', 'Kurdish (Kurmanji)', 'Kyrgyz', 'Lao', 'Latin', 'Latvian', 'Lithuanian', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Marathi', 'Mongolian', 'Nepali', 'Norwegian', 'Oriya', 'Oromo', 'Pashto', 'Persian', 'Polish', 'Portuguese', 'Punjabi', 'Romanian', 'Russian', 'Sanskrit', 'Scottish Gaelic', 'Serbian', 'Sindhi', 'Sinhala', 'Slovak', 'Slovenian', 'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swedish', 'Tamil', 'Tamil Romanize', 'Telugu', 'Telugu Romanize', 'Thai', 'Turkish', 'Ukrainian', 'Urdu', 'Urdu Romanize', 'Uyghur', 'Uzbek', 'Vietnamese', 'Welsh', 'Western Frisian', 'Xhosa', 'Yiddish']
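If your documents are not in English, you can pass any of these languages when creating the model (a brief sketch of the language parameter shown above; BERTopic also documents a "multilingual" option for mixed-language corpora, which you may want to verify in your installed version):

# For non-English corpora, pass one of the supported languages
german_model = BERTopic(language="german")
# For corpora mixing several languages
multilingual_model = BERTopic(language="multilingual")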

Now let’s take a look at the topic probabilities to understand how confident BERTopic is that certain topics can be found in a document:

model.visualize_distribution(probs[0])
[Figure: probability distribution of topics for the first document]

Topic Reduction

Finally, we can also reduce the number of topics after training a BERTopic model. The advantage of doing this is that you can decide on the number of topics after knowing how many were actually created.

It is difficult to predict, before training your model, how many topics your documents contain and how many will be retrieved. Instead, we can decide afterwards how many topics look realistic:

new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=60)
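If you prefer not to pick a number yourself, BERTopic can also merge topics automatically based on their similarity (a hedged sketch; nr_topics="auto" is documented by the library for this purpose):

# Let BERTopic decide how far to merge similar topics
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics="auto")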

The reason for passing docs, topics, and probs as parameters is that these values are not stored inside BERTopic. If you had a million documents, it would be very inefficient to keep them in the model instead of in a dedicated database.

We can now use the update_topics function to update the topic representation with new parameters for the TF-IDF vectorization:

model.update_topics(docs, topics, n_gram_range=(1, 3), stop_words="english")
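After updating, the topic representations may include n-grams of up to three words; you can re-inspect a topic to see the effect (a quick check, reusing get_topic from earlier):

# The top terms may now include bigrams and trigrams
model.get_topic(49)[:10]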

Topic Search

After training our model, we can use the find_topics function to search for topics similar to a search_term. Here, we will look for topics that are closely related to the search term "vehicle". Then we extract the most similar topic and check the results:

# Find the five topics most similar to "vehicle", then inspect the best match (topic 28 in this run)
similar_topics, similarity = model.find_topics("vehicle", top_n=5)
print(similar_topics)
model.get_topic(28)
[('car', 0.043816884839494336),
 ('dealer', 0.018083187684167435),
 ('ford', 0.008460673652078586),
 ('bought', 0.007589563051028973),
 ('dealership', 0.0071675465843055045),
 ('odometer', 0.0071675465843055045),
 ('consumer', 0.006287931894176063),
 ('salesman', 0.005942906070333744),
 ('dealers', 0.005691645933436952),
 ('mazda', 0.005379727257156868)]
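You can also pair each similar topic id with its similarity score (a small illustrative loop; the variable names come from the find_topics call above):

# Pair each topic id with its similarity to the search term
for topic_id, score in zip(similar_topics, similarity):
    print(topic_id, round(score, 3))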

You can get the full code used in this article for the task of Topic Modeling with Machine Learning using Python below:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus, stripping headers, footers and quotes
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Train the topic model on the documents
model = BERTopic(language="english")
topics, probs = model.fit_transform(docs)

# Most frequent topics (-1 holds the outliers)
model.get_topic_freq().head()

# Top terms of the most frequent topic
model.get_topic(49)[:10]

# Full list of supported languages
from bertopic import languages
print(languages)

# Probability distribution of topics for the first document
model.visualize_distribution(probs[0])

# Reduce the number of topics after training
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=60)
# Update topic representations with n-grams and English stop words
model.update_topics(docs, topics, n_gram_range=(1, 3), stop_words="english")

# Search for topics similar to "vehicle"
similar_topics, similarity = model.find_topics("vehicle", top_n=5)
print(similar_topics)
model.get_topic(28)
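If you want to reuse the trained model later, BERTopic also provides save and load methods (a minimal sketch; the file name "my_topics_model" is just an example):

# Persist the trained model to disk and restore it later
model.save("my_topics_model")
loaded_model = BERTopic.load("my_topics_model")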

I hope you liked this article on Topic Modeling with Machine Learning using the Python programming language. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.