Tokenization is usually the first step after collecting a textual dataset in any problem based on Natural Language Processing. Sentence and word tokenization are two different tokenization strategies that you should know. In this article, I will take you through an introduction to sentence and word tokenization and their implementation using Python.
What is Tokenization?
Tokenization is the process of breaking a piece of text up into sentences or words. When we break textual data down into sentences or words, the resulting units are known as tokens. There are two strategies for tokenization of a textual dataset:
- Sentence Tokenization: It means breaking a piece of text into sentences. For example, when you tokenize a paragraph, it splits the paragraph into sentences, each of which is a token. In many natural language processing problems, splitting text data into sentences is very useful. Sentences usually end with punctuation such as a full stop, question mark, or exclamation mark, so a sentence tokenizer looks for these boundary marks; a good tokenizer also handles cases such as abbreviations and decimal numbers, where a full stop does not end a sentence.
- Word Tokenization: Word tokenization is the most common form of tokenization. It means splitting the complete textual data into individual words. For example, when you tokenize a paragraph, it splits the paragraph into words, each of which is a token. Words are typically separated by spaces, so a basic word tokenizer splits on whitespace; practical tokenizers also separate punctuation marks from the words they are attached to.
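In their simplest form, the two strategies above can be sketched with plain Python string splitting. This is only a naive illustration of the idea, not what a real tokenizer does:

```python
text = "Tokenization is useful. It is the first step."

# Naive sentence tokenization: treat every full stop as a boundary.
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)  # ['Tokenization is useful', 'It is the first step']

# Naive word tokenization: split on whitespace.
words = text.split()
print(words)  # ['Tokenization', 'is', 'useful.', 'It', 'is', 'the', 'first', 'step.']
```

Notice that the whitespace split leaves the full stop glued to 'useful.' and 'step.', which is exactly why real tokenizers do more than split on spaces.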
I hope you now understand sentence and word tokenization. In the section below, I will take you through their implementation using Python.
Sentence and Word Tokenization using Python
Sentence tokenization means splitting the textual data into sentences. Here is the implementation of sentence tokenization using Python:
```python
import nltk
nltk.download('punkt')  # punkt is NLTK's pretrained sentence tokenizer model
from nltk.tokenize import sent_tokenize

sentence = "Hi, My name is Aman, I hope you like my work. You can follow me on Instagram for more resources. My username is 'the.clever.programmer'."
print(sent_tokenize(sentence))
```
['Hi, My name is Aman, I hope you like my work.', 'You can follow me on Instagram for more resources.', "My username is 'the.clever.programmer'."]
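For comparison, naively splitting this same sentence on every full stop would break the username 'the.clever.programmer' into pieces, while the pretrained punkt model above kept it intact. A quick sketch in plain Python:

```python
sentence = ("Hi, My name is Aman, I hope you like my work. "
            "You can follow me on Instagram for more resources. "
            "My username is 'the.clever.programmer'.")

# Naive approach: every '.' is treated as a sentence boundary.
naive = [s.strip() for s in sentence.split(".") if s.strip()]
print(len(naive))   # 5 pieces instead of 3 sentences
print(naive[-2:])   # ['clever', "programmer'"]
```

This is the main reason to use a trained sentence tokenizer instead of a simple string split.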
Word tokenization means splitting the textual data into words. Here is the implementation of word tokenization using Python:
```python
from nltk.tokenize import TreebankWordTokenizer

word_token = TreebankWordTokenizer()
print(word_token.tokenize(sentence))
```
['Hi', ',', 'My', 'name', 'is', 'Aman', ',', 'I', 'hope', 'you', 'like', 'my', 'work.', 'You', 'can', 'follow', 'me', 'on', 'Instagram', 'for', 'more', 'resources.', 'My', 'username', 'is', "'the.clever.programmer", "'", '.']
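If you do not want the NLTK dependency, a rough word tokenizer can be built with a regular expression that captures runs of word characters and individual punctuation marks. This is a simplified sketch of my own and does not replicate Treebank's rules (for example, it handles contractions and quotes differently):

```python
import re

def simple_word_tokenize(text):
    # \w+ matches runs of letters/digits/underscore; [^\w\s] matches any
    # single character that is neither a word character nor whitespace,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hi, My name is Aman."))
# ['Hi', ',', 'My', 'name', 'is', 'Aman', '.']
```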
Sentence and word tokenization are two different strategies that you should know: sentence tokenization divides text data into sentences, and word tokenization divides it into words. I hope you liked this article on sentence and word tokenization using Python. Please feel free to ask valuable questions in the comments section below.