This article is a tutorial on NLP with Python. Here you will learn how to use spaCy, one of the best-known NLP libraries, to carry out some of the most important tasks in working with text data.
Introduction to Spacy for NLP with Python
Data comes in many different forms: timestamps, sensor readings, images, category labels, and more. But text is still among the most valuable data for those who know how to use it.
spaCy is one of the best-known Python libraries for NLP. It relies on language-specific models that come in different sizes. By the end of this article, you will be able to use spaCy to:
- Perform basic text processing and pattern matching.
- Build machine learning models with text.
- Represent text with word embeddings that numerically capture the meaning of words and documents.
NLP with Python using Spacy
Let’s see how to work with spaCy for NLP with Python. First, load a language model (the shorthand 'en' is deprecated; use the small English model, which you can install with python -m spacy download en_core_web_sm):

import spacy
nlp = spacy.load('en_core_web_sm')
With the above model loaded, you can process text like this:
doc = nlp("Tea is healthy and calming, don't you think?")
Tokenization:
Processing text with the model returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark:
for token in doc:
    print(token)
Tea is healthy and calming , do n't you think ?
Iterating over a document gives you token objects. Each of these tokens carries additional information.
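For instance, every token exposes lexical attributes you can inspect directly. A minimal sketch (using a tokenizer-only pipeline here, since these attributes don't require a trained model):

```python
import spacy

# A blank English pipeline provides tokenization and lexical attributes
nlp = spacy.blank("en")
doc = nlp("Tea is healthy and calming, don't you think?")

for token in doc:
    # text, whether it is alphabetic, and whether it is punctuation
    print(token.text, token.is_alpha, token.is_punct)
```

Note how the contraction "don't" is split into "do" and "n't", each its own token.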
Text Preprocessing:
There are a few types of preprocessing that improve the way we model with words. The first is “lemmatizing”. The “lemma” of a word is its base form. For example, “walk” is the lemma of the word “walking”. So when you lemmatize the word walking, you convert it to walk. Let’s see how to use spaCy for text preprocessing:
print("Token\t\tLemma\t\tStopword")
print("-" * 40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")
Token Lemma Stopword
----------------------------------------
Tea tea False
is be True
healthy healthy False
and and True
calming calm False
, , False
do do True
n't not True
you -PRON- True
think think False
? ? False
Why are lemmas and stop word identification important? Linguistic data contains a lot of noise mixed in with the informative content. In the sentence above, the important words are “tea”, “healthy”, and “calming”. Removing stop words can help a predictive model focus on the relevant words.
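Putting the two ideas together, you can filter a document down to its informative tokens with a list comprehension. A minimal sketch (using a tokenizer-only pipeline so it runs without a downloaded model; stop-word flags come from the language data):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tea is healthy and calming, don't you think?")

# Drop stop words and punctuation, keeping only the informative tokens
content = [token.text for token in doc
           if not token.is_stop and not token.is_punct]
print(content)
```

This keeps words like "Tea", "healthy", and "calming" while discarding "is", "and", and the punctuation.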
Pattern Matching:
Another common NLP task is to match tokens or phrases in chunks of text or entire documents. You can do pattern matches with regular expressions, but spaCy’s matching capabilities tend to be easier to use.
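For example, a token-level Matcher lets you describe each token with attribute dictionaries instead of a regex. A minimal sketch (the "iPhone model" pattern and sample sentence are invented for illustration):

```python
import spacy
from spacy.matcher import Matcher

# A tokenizer-only pipeline is enough for rule-based matching on lexical attributes
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match the word "iphone" (case-insensitive) followed by a number
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("IPHONE_MODEL", [pattern])

doc = nlp("The iPhone 11 and iPhone 12 were announced.")
for match_id, start, end in matcher(doc):
    print(doc[start:end])
```

Each pattern is a list of dictionaries, one per token, which is often easier to read and maintain than the equivalent regular expression.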
For matching individual tokens, you need to create a Matcher. When you want to match a list of terms, it is easier and more efficient to use PhraseMatcher:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
Then you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)
Then you create a document from the text to search and use the phrase matcher to find where the terms appear in it:
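A minimal sketch of this step (the sample sentence below is an invented stand-in for the review text; a tokenizer-only pipeline is used here so the sketch runs without a downloaded model, and the code is identical with the full model loaded above):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)

# Invented sample text containing some of the terms
text_doc = nlp("Glowing review: the iPhone 11 beats the Galaxy Note on camera.")
for match_id, start, end in matcher(text_doc):
    # match_id maps back to the rule name via the string store
    print(nlp.vocab.strings[match_id], text_doc[start:end])
```

Each match gives the rule's name ("TerminologyList") plus the start and end token positions of the matched span, which produces output like the line below.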
TerminologyList iPhone 11
So this is how you can use spaCy for NLP with Python. I hope you liked this article on spaCy for NLP with Python. Feel free to ask your valuable questions in the comments section below.