WordCloud with Python

You have probably seen a cloud filled with words in many analysis tasks and machine learning projects. A WordCloud represents the importance of each word in a text by its frequency: the more often a term appears, the larger it is drawn. In this article, I will take you through a detailed understanding of a WordCloud. By the end, you will be able to create your own customised WordCloud.

WordClouds are mostly used in Natural Language Processing, which is a field of Artificial Intelligence. The idea is to show the most used words in a paragraph, a website, social media posts, or even a speech, so that the main focus of the text stands out.
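
To make the idea concrete, here is a minimal sketch of the frequency counting that underlies a word cloud, using only the standard library. The sample sentence and the helper name `relative_frequencies` are made up for illustration. The wordcloud package has its own tokenization and scaling, which this sketch does not try to reproduce.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Count words and scale each count by the most frequent word's count."""
    # Lowercase the text and treat runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


# Hypothetical sample text, not taken from the dataset.
sample = "Ripe cherry and plum. Cherry notes linger, cherry on the finish."
freqs = relative_frequencies(sample)

# Show the three highest-scoring words.
for word, score in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 2))
```

A word that scores 1.0 here is the one a word cloud would draw largest, and the other words shrink in proportion to their scores.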

Exploring The Data

The dataset I will use in this article is based on wine reviews; you can download the dataset from here. Let’s first explore the data to know what we are going to work with, and then we will move on to WordClouds. I will start by importing all the libraries that we need for this task:

# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
%matplotlib inline

Now let’s import the dataset using the pandas library and have a look at the first five rows of the data:

df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)
df.head()
Output

A Basic WordCloud

A WordCloud shows the most frequent words in the text we are analyzing. Now let’s set up a basic WordCloud from a single review:

# Start with one review:
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
simple wordcloud

Now let’s manipulate some arguments like font size, maximum words, and background colour:

# Lower max_font_size, change the maximum number of words and lighten the background:
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
simple WordCloud

Now let’s combine all the reviews of wine we have in the data to set up and create a big WordCloud:

# Combine all the reviews into one string:
text = " ".join(review for review in df.description)

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordcloud

We can see in the above figure that full-bodied and black cherry are the most used terms in the data. Now let’s put this WordCloud in the shape of a bottle of wine.
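
Before moving on, the effect of the stopword list can be illustrated with a small standard-library sketch. The two reviews, the stopword set, and the helper `top_words` are all hypothetical stand-ins. The article itself builds its set from `STOPWORDS` and leaves the counting to the wordcloud package.

```python
from collections import Counter

# Hypothetical reviews standing in for the description column.
reviews = [
    "This wine has ripe black cherry flavors.",
    "Drink now. A full wine with cherry and spice.",
]

# Same idea as the article: join every review into one string.
text = " ".join(reviews)

# A small made-up stopword set, including the domain words added in the article.
stopwords = {"this", "has", "a", "with", "and", "drink", "now", "wine", "flavors"}


def top_words(text, stopwords, n=3):
    """Return the n most common words that are not in the stopword set."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    kept = [token for token in tokens if token and token not in stopwords]
    return Counter(kept).most_common(n)


print(top_words(text, set()))      # "wine" ranks near the top
print(top_words(text, stopwords))  # "cherry" leads once stopwords are dropped
```

Words such as "wine" appear in almost every review, so leaving them in would crowd out the descriptive terms we actually care about.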

Creating WordCloud with Shapes

If you want to create a WordCloud in a particular shape, you first need a PNG file of that shape. Since we are working with wine reviews, I will use the shape of a wine bottle. The code below expects the image at img/wine_mask.png.


Images have different structures, so they will produce different results. I will prepare the WordCloud according to the shape of the bottle. If I used another shape, I would need to prepare the mask differently. So the code below is only meant to perform at its best for the shape that I have chosen.

wine_mask = np.array(Image.open("img/wine_mask.png"))
wine_mask
array([[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
…,
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0]], dtype=uint8)
def transform_format(val):
    if val == 0:
        return 255
    else:
        return val
      
# Transform your mask into a new one that will work with the function:
transformed_wine_mask = np.ndarray((wine_mask.shape[0],wine_mask.shape[1]), np.int32)

for i in range(len(wine_mask)):
    transformed_wine_mask[i] = list(map(transform_format, wine_mask[i]))
# Check the expected result of your mask
transformed_wine_mask
array([[255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]])
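
The point of the transformation is that the wordcloud package treats pure white (255) pixels as the area where no words are drawn, so the background zeros have to become 255. As an alternative to the row-by-row loop, the same replacement can be sketched with `np.where`. The tiny array below is made up for illustration and assumes a two-dimensional mask like the one shown above.

```python
import numpy as np

# A tiny made-up mask: 0 marks background, other values mark the shape.
wine_mask = np.array([[0, 0, 7], [0, 200, 0]], dtype=np.uint8)

# Replace every 0 with 255 in one step and keep all other values unchanged.
transformed = np.where(wine_mask == 0, 255, wine_mask).astype(np.int32)

print(transformed)
```

This avoids the Python-level loop over rows, which can matter for large mask images.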
# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=transformed_wine_mask,
               stopwords=stopwords, contour_width=3, contour_color='firebrick')

# Generate a wordcloud
wc.generate(text)

# store to file
wc.to_file("img/wine.png")

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
wordcloud

We have now created a WordCloud in the shape of a wine bottle. It seems that the wine reviews most often mention black cherry, fruit flavors and the full-bodied character of the wine. I hope you liked this article. Feel free to ask your questions in the comments section below.

Aman Kharwal