Summarize Text with Machine Learning

Text Summarization involves condensing a piece of text into a shorter version, reducing the size of the original text while preserving key information and the meaning of the content. Since manual text synthesis is a long and generally laborious task, task automation is gaining in popularity and therefore a strong motivation for academic research. In this article, I will take you through the task of Natural Language Processing to summarize text with Machine Learning.

In Machine Learning, there are important applications for text summarization in various Natural Language Processing related tasks such as text classification, answering questions, legal text synthesis, news synthesis, and headline generation which can be achieved with Machine Learning. The intention to summarize a text is to create an accurate and fluid summary containing only the main points described in the document.

Also, Read – Scraping YouTube with Python.

Types of Approaches to Summarize Text

Before I dive into showing you how we can summarize text using machine learning and python, it is important to understand what are the types of text summarization to understand how the process works, so that we can use logic while using machine learning techniques to summarize the text.

Generally, Text Summarization is classified into two main types: Extraction Approach and Abstraction Approach. Now let’s go through both these approaches before we dive into the coding part.

The Extractive Approach

The Extractive approach takes sentences directly from the document according to a scoring function to form a cohesive summary. This method works by identifying the important sections of the text cropping and assembling parts of the content to produce a condensed version.

The Abstractive Approach

The Abstraction approach aims to produce a summary by interpreting the text using advanced natural language techniques to generate a new, shorter text – parts of which may not appear in the original document, which conveys the most information.

In this article, I will be using the extractive approach to summarize text using Machine Learning and Python. I will use the TextRank algorithm which is an extractive and unsupervised machine learning algorithm for text summarization.

Summarize Text with Machine Learning

So now, I hope you know what text summarization is and how it works. Now, without wasting any time let’s see how we can summarize text using machine learning. The dataset that I will use in this task can be downloaded from here. Now, let’s import the necessary packages that we need to get started with the task:

import pandas as pd
import numpy as np
import nltk'punkt')
import re
from nltk.corpus import stopwordsCode language: JavaScript (javascript)

Now, as I have imported the necessary packages, the next step is to look at the data to get some idea of what we are going to work with:

from google.colab import files
uploaded = files.upload()
df = pd.read_csv("tennis.csv")
df.head()Code language: JavaScript (javascript)
df['article_text'][1]Code language: CSS (css)
"BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with a forehand volley winner to break Zverev for the second time in the semifinal. He came through two rounds of qualifying last weekend to reach the Basel main draw, including beating Zverev's older brother, Mischa. Federer had an easier time than in his only previous match against Medvedev, a three-setter at Shanghai two weeks ago."

Now, I will split the sequences into the data by tokenizing them using a list:

from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:

sentences = [y for x in sentences for y in x]Code language: JavaScript (javascript)

Now I am going to use the Glove method for word representation, it is an unsupervised learning algorithm developed by Stanford University to generate word integrations by aggregating the global word-to-word co-occurrence matrix from a corpus. To implement this method you have to download a file from here and store it into the same directory where your python file is:

word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs

clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]Code language: JavaScript (javascript)

Now, I will create vectors for the sentences:

sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    v = np.zeros((100,))
  sentence_vectors.append(v)Code language: PHP (php)

Finding Similarities to Summarize Text

The next step is to find similarities between the sentences, and I will use the cosine similarity approach for this task. Let’s create an empty similarity matrix for this task and fill it with cosine similarities of sentences:

sim_mat = np.zeros([len(sentences), len(sentences)])
from sklearn.metrics.pairwise import cosine_similarity
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]Code language: JavaScript (javascript)

Now I am going to convert the sim_mat similarity matrix into the graph, the nodes in this graph will represent the sentences and the edges will represent the similarity scores between the sentences:

import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)Code language: JavaScript (javascript)

Now, let’s summarize text:

ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
for i in range(5):
  print('\n')Code language: PHP (php)


Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. I think every person has different interests. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life. I think everyone just thinks because we're tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do. There are so many other things that we're interested in, that we do.'

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

So, I will congratulate you as you have made a text summarization system using machine learning which will reduce your effort to find important and the most relevant information from any topic.

Also, Read – What is Cloud Computing in Machine Learning.

I hope you liked this article on summarize text using machine learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.

Follow Us:

Articles: 76


  1. Dear Aman,

    Thank you so much for providing this article. It is very useful post.

    The file “glove.6B.100d.txt” appears to be missing from the directory. Can you please re-upload and update the link if you still have the file?

Leave a Reply