Tokenization is usually the first step after collecting a textual dataset in any problem based on Natural Language Processing. Sentence and word tokenization are two different tokenization strategies that you should know. In this article, I will take you through an introduction to sentence and word tokenization and their implementation using Python.
What is Tokenization?
Tokenization is the process of breaking a piece of text up into sentences or words. When we break textual data down into sentences or words, the resulting units are known as tokens. There are two strategies for tokenization of a textual dataset:
- Sentence Tokenization: It means breaking a piece of text into sentences. For example, when you tokenize a paragraph, it splits the paragraph into sentences, each of which is a token. In many natural language processing problems, splitting text data into sentences is very useful. Sentences usually end with punctuation such as a full stop, question mark, or exclamation mark, so a sentence tokenizer looks for these boundary marks; a good tokenizer also handles cases such as abbreviations and decimal numbers, where a full stop does not end a sentence.
- Word Tokenization: Word tokenization is the most common form of tokenization. It means splitting the complete textual data into individual words. For example, when you tokenize a paragraph, it splits the paragraph into words, each of which is a token. Words are typically separated by spaces, so a basic word tokenizer splits on whitespace; practical tokenizers also separate punctuation marks from the words they are attached to.
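In their simplest form, the two strategies above can be sketched with plain Python string splitting. This is only a naive illustration of the idea, not what a real tokenizer does:

```python
text = "Tokenization is useful. It is the first step."

# Naive sentence tokenization: treat every full stop as a boundary.
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)  # ['Tokenization is useful', 'It is the first step']

# Naive word tokenization: split on whitespace.
words = text.split()
print(words)  # ['Tokenization', 'is', 'useful.', 'It', 'is', 'the', 'first', 'step.']
```

Notice that the whitespace split leaves the full stop glued to 'useful.' and 'step.', which is exactly why real tokenizers do more than split on spaces.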
I hope you now understand sentence and word tokenization. In the section below, I will take you through their implementation using Python.
Sentence and Word Tokenization using Python
Sentence tokenization means splitting the textual data into sentences. Here is the implementation of sentence tokenization using Python:
```python
import nltk
nltk.download('punkt')  # punkt is NLTK's pretrained sentence tokenizer model
from nltk.tokenize import sent_tokenize

sentence = "Hi, My name is Aman, I hope you like my work. You can follow me on Instagram for more resources. My username is 'the.clever.programmer'."
print(sent_tokenize(sentence))
```
['Hi, My name is Aman, I hope you like my work.', 'You can follow me on Instagram for more resources.', "My username is 'the.clever.programmer'."]
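For comparison, naively splitting this same sentence on every full stop would break the username 'the.clever.programmer' into pieces, while the pretrained punkt model above kept it intact. A quick sketch in plain Python:

```python
sentence = ("Hi, My name is Aman, I hope you like my work. "
            "You can follow me on Instagram for more resources. "
            "My username is 'the.clever.programmer'.")

# Naive approach: every '.' is treated as a sentence boundary.
naive = [s.strip() for s in sentence.split(".") if s.strip()]
print(len(naive))   # 5 pieces instead of 3 sentences
print(naive[-2:])   # ['clever', "programmer'"]
```

This is the main reason to use a trained sentence tokenizer instead of a simple string split.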
Word tokenization means splitting the textual data into words. Here is the implementation of word tokenization using Python:
```python
from nltk.tokenize import TreebankWordTokenizer

word_token = TreebankWordTokenizer()
print(word_token.tokenize(sentence))
```
['Hi', ',', 'My', 'name', 'is', 'Aman', ',', 'I', 'hope', 'you', 'like', 'my', 'work.', 'You', 'can', 'follow', 'me', 'on', 'Instagram', 'for', 'more', 'resources.', 'My', 'username', 'is', "'the.clever.programmer", "'", '.']
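If you do not want the NLTK dependency, a rough word tokenizer can be built with a regular expression that captures runs of word characters and individual punctuation marks. This is a simplified sketch of my own and does not replicate Treebank's rules (for example, it handles contractions and quotes differently):

```python
import re

def simple_word_tokenize(text):
    # \w+ matches runs of letters/digits/underscore; [^\w\s] matches any
    # single character that is neither a word character nor whitespace,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hi, My name is Aman."))
# ['Hi', ',', 'My', 'name', 'is', 'Aman', '.']
```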
Sentence and word tokenization are two different strategies that you should know: sentence tokenization divides text data into sentences, and word tokenization divides it into words. I hope you liked this article on sentence and word tokenization using Python. Please feel free to ask valuable questions in the comments section below.