Named Entity Recognition with Python

In machine learning, the recognition of named entities is an essential subtask of natural language processing. It tries to recognize and classify multi-word phrases with special meaning, e.g. people, organizations, places, dates, etc. In this article, I will introduce you to a machine learning project on Named Entity Recognition with Python.

Named Entity Recognition

Named entity recognition comes from information retrieval (IE). IE’s job is to transform unstructured data into structured information. In Named Entity Recognition, unstructured data is the text written in natural language and we want to extract important information in a well-defined format eg. relational database.

Also, Read – 100+ Machine Learning Projects Solved and Explained.

For example, we want to monitor the news for mentions of Covid-19 patients and for each patient we need the name of the responsible medical organization, location and date.

The Named Entity Recognition task attempts to correctly detect and classify text expressions into a set of predefined classes. Classes can vary, but very often classes like people (PER), organizations (ORG) or places (LOC) are used.

There is an increase in the use of named entity recognition in information retrieval. Modern systems like Apache Lucene allow us to extend the query with custom properties. Since named entities are very important in many systems, it is essential to allow the user to use them.

This also applies to search engines like Google or Yahoo, which try to handle the query containing or asking for named entities differently, for example, they show a box with basic information about the named entities with a link to a database of knowledge. Also, the results of named entities are classified differently.

Named Entity Recognition with Python

Now, in this section, I will take you through a Machine Learning project on Named Entity Recognition with Python. I will start this task by importing the necessary Python libraries and the dataset:

import pandas as pd
data = pd.read_csv('ner_dataset.csv', encoding= 'unicode_escape')
data.head()
Sentence #WordPOSTag
0Sentence: 1ThousandsNNSO
1NaNofINO
2NaNdemonstratorsNNSO
3NaNhaveVBPO
4NaNmarchedVBNO

I will train a neural network for the Named Entity Recognition (NER) task. So we need to make some modifications to the data to prepare it so that it can easily fit into a neutral network. I’ll start this step by extracting the mappings needed to train the neural network:

from itertools import chain
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    
    idx2tok = {idx:tok for  idx, tok in enumerate(vocab)}
    tok2idx = {tok:idx for  idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok


token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)

Now, I’m going to transform the columns in the data to extract the sequential data from our neural network:

data_fillna = data.fillna(method='ffill', axis=0)
# Groupby and collect columns
data_group = data_fillna.groupby(
['Sentence #'],as_index=False
)['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx'].agg(lambda x: list(x))

I will now divide the data into training and test sets. I am going to create a function to split the data as LSTM layers only accept sequences of the same length. Thus, each sentence that appears as an integer in the data must be completed with the same length:

train_tokens length: 32372 
train_tokens length: 32372 
test_tokens length: 4796 
test_tags: 4796 
val_tokens: 10791 
val_tags: 10791

Training a Neural Network for NER

I will now proceed to train the neural network architecture of our model. So let’s start by importing all the packages we need to train our neural network. Next, I’ll create layers that will take the dimensions of the LSTM layer and give the maximum length and maximum tags as output:

Now I will create a helper function that will help us to give the summary of each layer of the neural network model for the task of recognizing named entities with Python:

Now I will create a function to train our model:

Testing the Named Entity Recognition Model

Now, I will use the spacy library in Python to test our NER model. I will add input of some lines about my self and let’s see what we will get after running the code:

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Aman Kharwal \n I am from India \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)
Named Entity Recognition

So or trained Neural network performs very well. I hope you liked this article on Machine Learning project on Named Entity Recognition with Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1498

Leave a Reply