Word Embeddings in Machine Learning

Word embeddings, also called word vectors, represent each word as a numeric vector whose values reflect how the word is used and what it means. The vectors are learned from the contexts in which the words appear.

Words that appear in similar contexts will have similar vectors. For example, the vectors for “leopard”, “lion” and “tiger” will be close to each other, while they will be far from “planet” and “castle”.

Word Embeddings in Action

Even better, the relationships between words can be examined with arithmetic on their vectors. Subtracting the vector for “male” from the vector for “female” gives another vector. If you add that to the vector for “king”, the result is close to the vector for “queen”.
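
The arithmetic can be sketched with a few hand-made vectors. The 2-dimensional values below are invented for illustration (one axis loosely standing for royalty, the other for gender). Real embeddings are learned and have hundreds of dimensions, so this is only a toy sketch, not spaCy output.

```python
import numpy as np

# Toy 2-D "embeddings" invented for illustration: [royalty, gender]
toy_vectors = {
    "king":   np.array([0.9, -0.8]),
    "queen":  np.array([0.9, 0.8]),
    "male":   np.array([0.1, -0.9]),
    "female": np.array([0.1, 0.9]),
    "castle": np.array([0.7, 0.0]),
}

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

# king - male + female should land near queen
target = toy_vectors["king"] - toy_vectors["male"] + toy_vectors["female"]

# Rank the words that were not part of the arithmetic by similarity to the target
candidates = [w for w in toy_vectors if w not in ("king", "male", "female")]
nearest = max(candidates, key=lambda w: cosine_similarity(toy_vectors[w], target))
print(nearest)  # queen
```

With learned embeddings the match is approximate rather than exact, which is why the result is described as close to “queen” and not equal to it.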

These vectors can be used as features for machine learning models. Word embeddings will generally improve the performance of your models over a bag-of-words encoding. spaCy provides pretrained embeddings learned with a model called Word2Vec. You can access them by loading one of spaCy's large models, such as en_core_web_lg. The vectors are then available on each token through the .vector attribute.

import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

Each word gets its own 300-dimensional vector. However, we only have document-level labels, and our models won't be able to use word-level embeddings directly. So you need a vector representation for the whole document.

text = "These vectors can be used as features for machine learning models."
# Disable the other pipeline components: only the word vectors are needed,
# and skipping them speeds this part up a bit
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([token.vector for token in nlp(text)])
vectors.shape
(12, 300)

There are many ways to combine all the word embeddings into a single document vector that we can use for training the model. A simple and surprisingly effective approach is to average the vectors of all the words in the document. Then you can use these document vectors for modelling.
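
As a minimal sketch of the averaging step, assume the word vectors of a document are stacked in a NumPy array with one row per token. The random values below are placeholders standing in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for real embeddings: 12 tokens, 300 dimensions each
word_vectors = rng.normal(size=(12, 300))

# Average over the token axis to get one vector for the whole document
doc_vector = word_vectors.mean(axis=0)
print(doc_vector.shape)  # (300,)
```

The result has the same dimensionality no matter how many tokens the document contains, which is what lets documents of different lengths share one feature matrix.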

spaCy computes this average for you and exposes it as doc.vector. Here is an example of loading the spam data and converting it to document vectors. The dataset I am using here can be downloaded from here.

import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('spam.csv')

# Only the word vectors are needed, so disable the other pipeline components
with nlp.disable_pipes(*nlp.pipe_names):
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
doc_vectors.shape
(5572, 300)

Classification Models for Word Embeddings

With document vectors, you can train scikit-learn models, xgboost models, or any other standard approach to modelling.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    doc_vectors, spam.label, test_size=0.1, random_state=1
)

Here is an example using a Support Vector Machine (SVM). Scikit-learn provides a linear SVM classifier called LinearSVC. It works the same way as other scikit-learn models.

from sklearn.svm import LinearSVC

# dual=False solves the primal problem, which is usually faster
# when there are more samples than features, as there are here
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")
Accuracy: 97.312%

Document Similarity

Documents with similar content usually have similar vectors, so you can find similar documents by measuring the similarity between their vectors. A common metric for this is cosine similarity, which is the cosine of the angle between two vectors, a and b.

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)
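
To get a feel for the range of the metric, here is a small sketch with hand-picked 2-D vectors (made up for illustration, not real embeddings). It repeats the function above so that it runs on its own.

```python
import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

a = np.array([1.0, 2.0])

# Same direction, different length: similarity of 1
same_direction = cosine_similarity(a, 3 * a)
# Perpendicular vectors: similarity of 0
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))
# Opposite direction: similarity of -1
opposite = cosine_similarity(a, -a)

print(same_direction, orthogonal, opposite)
```

Because the metric ignores vector length, a short message and a long document can still score as similar if their vectors point in the same direction.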

I hope you liked this article on Word Embeddings in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.

Aman Kharwal