One of the most important ways to rescale data in machine learning is term frequency-inverse document frequency, also known as the tf-idf method. In this article, I will walk you through what the tf-idf method is and how to implement it using the Python programming language.
What is tf-idf?
The intuition of the tf-idf method is to give high weight to any term that often appears in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in many documents, it is likely to be very descriptive of the contents of that document.
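To make the intuition concrete, here is a minimal sketch that computes a weight by hand. It assumes the textbook definition, tf(t, d) × log(N / df(t)), which is simpler than the smoothed variant Scikit-learn uses by default. The three-sentence corpus and the helper name `tf_idf` are made up for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """Textbook tf-idf: relative term frequency times log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    idf = math.log(len(corpus_tokens) / df) if df else 0.0
    return tf * idf

# Hypothetical toy corpus, already split into lowercase tokens
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "the bird sang a song".split(),
]
doc = corpus[0]

weight_the = tf_idf("the", doc, corpus)  # "the" occurs in every document
weight_cat = tf_idf("cat", doc, corpus)  # "cat" occurs only in this document
print(weight_the, weight_cat)
```

Under this definition, "the" gets a weight of zero because it appears in every document, while "cat" gets a positive weight because it is specific to the first one.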
Scikit-learn implements the tf-idf method in two classes: TfidfTransformer, which takes the sparse matrix output produced by CountVectorizer and transforms it, and TfidfVectorizer, which takes text data and performs both the bag-of-words feature extraction and the tf-idf transformation.
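The relationship between the two classes can be sketched as follows. This assumes default parameters for all three estimators and uses a small made-up list of documents; with those defaults, the two routes should yield the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical documents, for illustration only
docs = [
    "the pretty cat",
    "a furry kitten",
    "the cat and the kitten",
]

# Route 1: bag-of-words counts first, then tf-idf scaling of those counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: raw text straight to tf-idf features
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

TfidfTransformer is the natural choice when a count matrix already exists; TfidfVectorizer is more convenient when starting from raw text.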
Why Use Tf-Idf Vectorization?
Suppose a search engine has a database with thousands of cat descriptions, and a user who wants to find furry cats issues the query “furry cat”. The search engine needs to decide which results to return from the database.
If the search engine has documents that match the exact query, there is no doubt, but what if it needs to decide between partial matches? To simplify, let’s say it has to choose between these two descriptions:
- “The pretty cat”
- “A furry kitten”
Counting matching words alone does not settle it: each description matches exactly one word of the query (“cat” in the first, “furry” in the second), so the search engine has no reason to prefer either. How can TF-IDF help it choose the second description instead of the first?
The TF is the same for every word, so it makes no difference here. However, one would expect the terms “cat” and “kitten” to appear in many documents (a high document frequency implies a low IDF), while the term “furry” will appear in fewer documents (a higher IDF). Thus, the TF-IDF for “cat” and “kitten” is low while the TF-IDF for “furry” is larger; that is to say, in our database the word “furry” has more discriminating power than “cat” or “kitten”.
If we use TF-IDF to weight the different words that match the query, “furry” would be more relevant than “cat”, and so we could choose “A furry kitten” as the best match.
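The ranking argument can be sketched in a few lines of plain Python. This is a simplification, not how a real search engine or Scikit-learn scores documents: each description is scored by summing the textbook IDF, log(N / df), of the query words it contains (every matching word occurs once, so TF plays no role). The four extra descriptions are invented so that “cat” and “kitten” are common while “furry” is rare.

```python
import math

# Two descriptions from the text plus four hypothetical ones
descriptions = [
    "the pretty cat",
    "a furry kitten",
    "a small cat",
    "the sleepy kitten",
    "a black cat",
    "the playful kitten",
]
tokenized = [d.split() for d in descriptions]
n_docs = len(tokenized)

def idf(term):
    """Textbook inverse document frequency; 0 for unseen terms."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def score(query, tokens):
    """Sum the IDF of every query word present in the description."""
    return sum(idf(t) for t in query.split() if t in tokens)

query = "furry cat"
scores = {d: score(query, t) for d, t in zip(descriptions, tokenized)}
best = max(scores, key=scores.get)
print(best)
```

With this toy data, “furry” occurs in one description out of six while “cat” occurs in three, so the single match on “furry” outweighs the single match on “cat”.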
Implementation with Python
Now let’s see how to implement the tf-idf method using the Python programming language. The example below shows the implementation of tf-idf vectorization using Scikit-learn:
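A minimal sketch of such an implementation, assuming a hypothetical four-sentence toy corpus (not necessarily the data behind the shape reported in this article) and TfidfVectorizer with its default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: four short documents
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Learn the vocabulary and idf weights, then transform the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Rows are documents, columns are vocabulary terms
print(X.shape)
```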
Output: (4, 9)
Keep in mind that tf-idf scaling is intended to find words that distinguish documents, but it is a purely unsupervised technique, so a high weight does not necessarily mean a word is useful for a particular prediction task. Features with low tf-idf are those that are either very commonly used across documents, or used only sparingly and only in very long documents.
I hope you liked this article on TF-IDF vectorization in Machine Learning. Feel free to ask your questions in the comments section below.