spaCy is an open-source Python library for advanced natural language processing (NLP). If you work with a lot of text, you will eventually want to know more about it. For example, what is it about? What do the words mean in context? Who is doing what to whom? Which companies and products are mentioned? Which texts are similar to each other? In this article, I will take you through spaCy in Machine Learning.
spaCy is specially designed for production use and helps you create applications that process and “understand” large volumes of text. It can be used to create systems for extracting information or understanding natural language, or for preprocessing text for deep learning.
Installing spaCy is as easy as installing any other Python package. You can install it with pip from your terminal: pip install spacy.
You will also need at least one of spaCy's language models. spaCy can analyze text in many languages, including English, German, Spanish and French, each with its own models. We're going to work with English text for this simple analysis, so go ahead and download spaCy's small English model, again from the command line: python -m spacy download en_core_web_sm.
Text processing now comes down to loading your language model and passing strings directly to it. Let's see what it does with a sample review:
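If you would rather check for the model from Python than from the shell, the following is a minimal sketch. It assumes a spaCy version where spacy.util.is_package and spacy.cli.download are available; the helper name ensure_model is hypothetical and not part of spaCy.

```python
from spacy.cli import download
from spacy.util import is_package


def ensure_model(name: str = "en_core_web_sm") -> bool:
    """Hypothetical helper: return True if the model package was already installed."""
    already_installed = is_package(name)
    if not already_installed:
        # Same effect as running: python -m spacy download <name>
        download(name)
    return already_installed
```

Calling ensure_model() once before spacy.load("en_core_web_sm") avoids the error raised when the model package is missing. After a fresh download you may need to restart the Python process before the model can be loaded.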
import spacy

nlp = spacy.load("en_core_web_sm")
review = "I'am so happy I went to this awesome Vegas buffet!"
doc = nlp(review)
To see the result, we loop over the tokens of the document returned above:
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.is_stop)

I'am PROPN I'am False
so ADV so True
happy ADJ happy False
I PRON -PRON- True
went VERB go False
to ADP to True
this DET this True
awesome ADJ awesome False
Vegas PROPN Vegas False
buffet NOUN buffet False
! PUNCT ! False
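The is_stop flag in the output above is commonly used to strip filler words before further analysis. Here is a minimal sketch of that idea. It assumes only tokenization and lexical attributes are needed, so it uses spacy.blank("en") instead of the downloaded model; the variable name content_words is illustrative.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only, no trained model.
nlp = spacy.blank("en")
doc = nlp("I'am so happy I went to this awesome Vegas buffet!")

# Keep tokens that are neither stop words nor punctuation.
content_words = [
    token.text for token in doc if not token.is_stop and not token.is_punct
]
print(content_words)
```

The same list comprehension works on a document produced by the full en_core_web_sm pipeline, where you could also keep token.lemma_ instead of token.text.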
spaCy does not split the original text into a separate list of strings, but the tokens are accessible by index and by slice:
Output: I’am so happy I went
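The output above is consistent with taking the first five tokens of the document. The original snippet is not shown, so the following is only a sketch under the assumption that the slice was doc[:5]. It uses spacy.blank("en") because slicing needs nothing beyond the tokenizer.

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")  # tokenizer only; slicing does not need a trained model
doc = nlp("I'am so happy I went to this awesome Vegas buffet!")

first_token = doc[0]  # indexing returns a single Token
first_five = doc[:5]  # slicing returns a Span, a view onto the Doc rather than a copy
print(first_five.text)
```

Because a Span is a view, it keeps access to the token attributes of the parent document, so you can still loop over first_five and read attributes such as token.is_stop.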
NLP poses many distinct challenges, both syntactic and semantic. spaCy identifies the dependency relation of each token as the text passes through the language model. Let's check the dependencies in our review:
for token in doc:
    print(token.text, token.dep_)

I'am ROOT
so advmod
happy amod
I nsubj
went ccomp
to prep
this det
awesome amod
Vegas compound
buffet pobj
! punct
It looks somewhat interesting, but visualizing these relationships reveals an even fuller story. Start by loading a submodule called displaCy to help with visualization:
from spacy import displacy

displacy.serve(doc)
Then we need to render the dependency tree from the document:
Named Entity Recognition with spaCy
Machine learning practitioners often seek to identify key elements and individuals in unstructured text. This task, called Named Entity Recognition (NER), runs automatically as the text passes through the language model. To see which tokens it identifies as named entities in our restaurant review, simply browse doc.ents:
for ent in doc.ents:
    print(ent.text, ent.label_)
It recognizes “Vegas” as a named entity, but what does the label “GPE” mean? If you don’t know what any of the abbreviations mean, just ask spaCy to explain it to you:
Countries, cities, states
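The explanation above comes from spaCy's built-in glossary, which spacy.explain looks up. A minimal sketch follows; the extra labels PROPN and nsubj are taken from the earlier outputs in this article, and the variable name explanations is illustrative.

```python
import spacy

# spacy.explain looks up a label in spaCy's glossary; no model needs to be loaded.
labels = ["GPE", "PROPN", "nsubj"]
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label}: {meaning}")
```

The same call works for part-of-speech tags, dependency labels and entity types alike.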
Additionally, displaCy can highlight named entities when the style argument is set to 'ent':
The coloured texts represent named entities by type. Consider this more complicated example with four different types of entities:
document = nlp("One year ago, I visited the Eiffel Tower with Jeff in Paris, France")
displacy.serve(document, style='ent')
I hope you liked this article on spaCy in Machine Learning.