You have probably seen a cloud filled with words in many data analysis tasks and machine learning projects. A WordCloud represents the importance of each word in a text by its frequency: the more often a word appears, the larger it is drawn. In this article, I will take you through a detailed understanding of a WordCloud. By the end of this article, you will be able to create your own customised WordCloud.
WordClouds are mostly used in Natural Language Processing, which is a field of Artificial Intelligence. The idea is to show the most used words in a paragraph, a website, social media posts, or even a speech, so that the main focus of the text stands out.
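To make the idea of frequency concrete, here is a minimal sketch of the counting step that sits behind any word cloud. It uses only the Python standard library; the `word_frequencies` helper and the sample sentence are my own illustrations and are not part of the wordcloud package or the wine dataset.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, not taken from the wine dataset.
sample = "Ripe cherry and plum. Cherry notes linger; soft plum, bright cherry."

freqs = word_frequencies(sample)
top_word, top_count = freqs.most_common(1)[0]

# Scale every count against the most frequent word (1.0 = most frequent).
relative = {word: count / top_count for word, count in freqs.items()}

print(top_word, top_count)
```

A word cloud library would then draw each word at a size that grows with values like the ones in `relative`.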
Exploring The Data
The dataset I will use in this article is based on wine reviews; you can download the dataset from here. Now let’s explore the data to see what we are going to work with, and then we will move on to WordClouds. I will start by importing all the libraries that we need for this task:
# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
# Jupyter/IPython magic; leave this line out in a plain Python script
%matplotlib inline
Now let’s import the dataset using the pandas library and have a look at the first five rows of the data:
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)
df.head()
A Basic WordCloud
A WordCloud is a visualisation, used mostly in NLP, that shows the most frequent words in the text we are analyzing. Now let’s set up a basic WordCloud from a single review:
# Start with one review (generate() expects a string, not a whole column):
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Now let’s manipulate some arguments like font size, maximum words, and background colour:
# Lower max_font_size, change the maximum number of words and lighten the background:
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Now let’s combine all the wine reviews in the data into a single text and create a big WordCloud:
# Combine all the reviews into one string:
text = " ".join(review for review in df.description)

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

# Generate a word cloud image:
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
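The effect of the stop-word list can be sketched without the wordcloud library at all. The example below is only an illustration under my own assumptions: the two reviews are made up, `count_words` is a hypothetical helper, and the stop-word set is a tiny stand-in for the much larger `STOPWORDS` set that wordcloud ships with.

```python
import re
from collections import Counter

# Hypothetical reviews standing in for df.description.
reviews = [
    "This wine has ripe cherry flavors. Drink now.",
    "A dry wine with cherry and oak flavors.",
]

# Small illustrative stop-word set, including the custom words from the article.
stopwords = {"this", "has", "a", "with", "and",
             "drink", "now", "wine", "flavor", "flavors"}


def count_words(texts, stopwords):
    """Join all texts, split them into lowercase words and drop stop words."""
    text = " ".join(texts)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stopwords)


counts = count_words(reviews, stopwords)
print(counts.most_common(3))
```

Without the custom entries, generic words such as "wine" and "flavors" would dominate the counts and hide the descriptive terms.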
In the resulting word cloud, we can see that “full-bodied” and “black cherry” are the most used terms in the data. Now let’s put all these words into the shape of a bottle of wine.
Creating WordCloud with Shapes
If you want to create a WordCloud in a particular shape, you first need to find a PNG file of that shape. In our case, since we are using wine reviews, I will use the shape of a wine bottle. You can download the shape below.
Because every image is structured differently, different images will give different results. I will prepare the WordCloud according to the shape of the bottle; if I used another shape, I would need to prepare the data accordingly. So the code below is only meant to perform at its best for the shape that I have chosen.
wine_mask = np.array(Image.open("img/wine_mask.png"))
wine_mask
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
def transform_format(val):
    # Turn background pixels (0) into white (255); keep every other value
    if val == 0:
        return 255
    else:
        return val

# Transform your mask into a new one that will work with the function:
transformed_wine_mask = np.ndarray((wine_mask.shape[0], wine_mask.shape[1]), np.int32)

for i in range(len(wine_mask)):
    transformed_wine_mask[i] = list(map(transform_format, wine_mask[i]))

# Check the expected result of your mask
transformed_wine_mask
array([[255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]])
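The row-by-row loop works, but the same transformation can probably be written as a single vectorized NumPy call. The sketch below uses a tiny hypothetical mask (`toy_mask`) instead of the real bottle image, and it assumes, as in the mask shown earlier, that 0 marks the background. As far as I understand the wordcloud documentation, white (255) entries are treated as masked out, which is why the zeros are replaced with 255.

```python
import numpy as np

# Hypothetical 3x3 mask standing in for wine_mask:
# 0 = background, any other value = part of the shape.
toy_mask = np.array([[0, 0, 0],
                     [0, 7, 0],
                     [0, 9, 0]], dtype=np.uint8)

# Replace every 0 with 255 in one step; all other values are kept as they are.
transformed = np.where(toy_mask == 0, 255, toy_mask).astype(np.int32)

print(transformed)
```

For a large image this avoids calling a Python function once per pixel, so it should be noticeably faster than the loop.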
# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=transformed_wine_mask,
               stopwords=stopwords, contour_width=3, contour_color='firebrick')

# Generate a wordcloud
wc.generate(text)

# Store to file
wc.to_file("img/wine.png")

# Show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Now we have created a WordCloud in the shape of a wine bottle. It seems that the wine reviews most often mention black cherry, fruit flavors, and the full-bodied character of the wine. I hope you liked this article; feel free to ask your valuable questions in the comments section below.