Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. It is valuable for businesses to spot potential issues or opportunities, such as unexpectedly high or low CTRs. If you want to learn how to detect anomalies in search queries, this article is for you. In this article, I’ll take you through the task of Search Queries Anomaly Detection with Machine Learning using Python.
Search Queries Anomaly Detection: Process We Can Follow
Search Queries Anomaly Detection is a technique to identify unusual or unexpected patterns in search query data. Below is the process we can follow for the task of Search Queries Anomaly Detection:
- Gather historical search query data from the source, such as a search engine or a website’s search functionality.
- Conduct an initial analysis to understand the distribution of search queries, their frequency, and any noticeable patterns or trends.
- Create relevant features or attributes from the search query data that can aid in anomaly detection.
- Choose an appropriate anomaly detection algorithm. Common methods include statistical approaches like Z-score analysis and machine learning algorithms like Isolation Forests or One-Class SVM.
- Train the selected model on the prepared data.
- Apply the trained model to the search query data to identify anomalies or outliers.
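The steps above can be sketched end-to-end with the Z-score approach mentioned in step 4. This is a minimal illustration on made-up data (the column names and click counts here are hypothetical, not from the article's dataset):

```python
import pandas as pd

# Toy stand-in for collected search-query data (hypothetical values)
df = pd.DataFrame({
    "query": [f"query {i}" for i in range(10)],
    "clicks": [100, 110, 95, 105, 98, 102, 97, 103, 99, 2000],
})

# Z-score analysis: flag queries whose clicks sit far from the mean
z = (df["clicks"] - df["clicks"].mean()) / df["clicks"].std(ddof=0)
df["anomaly"] = z.abs() > 2

print(df[df["anomaly"]]["query"].tolist())  # → ['query 9']
```

Here the one query with an extreme click count is flagged, while the rest pass the threshold; the real article below uses an Isolation Forest instead, which handles multiple features at once.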
So, the process starts with collecting a dataset based on search queries. I found an ideal dataset for this task. You can download the dataset from here.
Search Queries Anomaly Detection using Python
Now, let’s get started with the task of Search Queries Anomaly Detection by importing the necessary Python libraries and the dataset:
import pandas as pd
from collections import Counter
import re
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

queries_df = pd.read_csv("Queries.csv")
print(queries_df.head())
                                 Top queries  Clicks  Impressions     CTR  Position
0                number guessing game python    5223        14578  35.83%      1.61
1                        thecleverprogrammer    2809         3456  81.28%      1.02
2           python projects with source code    2077        73380   2.83%      5.94
3  classification report in machine learning    2012         4959  40.57%      1.28
4                      the clever programmer    1931         2528  76.38%      1.09
Exploratory Data Analysis
Let’s have a look at the column insights before moving forward:

print(queries_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
None
Now, let’s convert the CTR column from a percentage string to a float:
# Cleaning CTR column queries_df['CTR'] = queries_df['CTR'].str.rstrip('%').astype('float') / 100
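To see that this conversion behaves as expected, here is a quick sanity check on two CTR values taken from the head of the data shown earlier:

```python
import pandas as pd

# Two CTR strings from the head of the data, run through the same cleaning step
sample = pd.DataFrame({"CTR": ["35.83%", "81.28%"]})
sample["CTR"] = sample["CTR"].str.rstrip("%").astype("float") / 100

print([round(v, 4) for v in sample["CTR"]])  # → [0.3583, 0.8128]
```

The percentage strings become fractions between 0 and 1, which is what the Isolation Forest later expects as a numeric feature.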
Now, let’s analyze common words in each search query:
# Function to clean and split the queries into words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    return words

# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in queries_df['Top queries']:
    word_counts.update(clean_and_split(query))

word_freq_df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency'])

# Plotting the word frequencies
fig = px.bar(word_freq_df, x='Word', y='Frequency', title='Top 20 Most Common Words in Search Queries')
fig.show()
Now, let’s have a look at the top queries by clicks and impressions:
# Top queries by Clicks and Impressions
top_queries_clicks_vis = queries_df.nlargest(10, 'Clicks')[['Top queries', 'Clicks']]
top_queries_impressions_vis = queries_df.nlargest(10, 'Impressions')[['Top queries', 'Impressions']]

# Plotting
fig_clicks = px.bar(top_queries_clicks_vis, x='Top queries', y='Clicks', title='Top Queries by Clicks')
fig_impressions = px.bar(top_queries_impressions_vis, x='Top queries', y='Impressions', title='Top Queries by Impressions')
fig_clicks.show()
fig_impressions.show()
Now, let’s analyze the queries with the highest and lowest CTRs:
# Queries with highest and lowest CTR
top_ctr_vis = queries_df.nlargest(10, 'CTR')[['Top queries', 'CTR']]
bottom_ctr_vis = queries_df.nsmallest(10, 'CTR')[['Top queries', 'CTR']]

# Plotting
fig_top_ctr = px.bar(top_ctr_vis, x='Top queries', y='CTR', title='Top Queries by CTR')
fig_bottom_ctr = px.bar(bottom_ctr_vis, x='Top queries', y='CTR', title='Bottom Queries by CTR')
fig_top_ctr.show()
fig_bottom_ctr.show()
Now, let’s have a look at the correlation between different metrics:
# Correlation matrix visualization
correlation_matrix = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']].corr()

fig_corr = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix')
fig_corr.show()
In this correlation matrix:
- Clicks and Impressions are positively correlated, meaning queries with more Impressions tend to receive more Clicks.
- Clicks and CTR have a weak positive correlation, implying that queries with more Clicks tend to have slightly higher Click-Through Rates.
- Clicks and Position are weakly negatively correlated. Since a larger Position value means a lower ranking, queries ranking further down the results tend to receive fewer Clicks.
- Impressions and CTR are negatively correlated, indicating that queries with more Impressions tend to have a lower Click-Through Rate.
- Impressions and Position are positively correlated, indicating that queries with larger Position values (lower rankings) accumulate more Impressions in this data.
- CTR and Position have a strong negative correlation, meaning that queries ranking further down the results (larger Position values) get much lower Click-Through Rates.
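To make the sign convention concrete, here is a tiny illustration with made-up numbers (these Positions and CTRs are hypothetical, not from the dataset): when CTR falls as the Position value grows, the Pearson correlation comes out negative.

```python
import pandas as pd

# Hypothetical numbers: CTR shrinking as the Position value (rank number) grows
toy = pd.DataFrame({
    "Position": [1, 2, 5, 9, 20],
    "CTR": [0.80, 0.40, 0.10, 0.03, 0.01],
})

corr = toy["Position"].corr(toy["CTR"])
print(corr < 0)  # → True: a negative correlation, as in the matrix above
```

Reading the real matrix the same way avoids the trap of interpreting "higher Position" as "better ranking" when Position 1 is actually the top spot.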
Detecting Anomalies in Search Queries
Now, let’s detect anomalies in search queries. You can use various techniques for anomaly detection. A simple and effective method is the Isolation Forest algorithm, which works well with different data distributions and is efficient with large datasets:
from sklearn.ensemble import IsolationForest

# Selecting relevant features
features = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']]

# Initializing Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.01)  # contamination is the expected proportion of outliers

# Fitting the model
iso_forest.fit(features)

# Predicting anomalies
queries_df['anomaly'] = iso_forest.predict(features)

# Filtering out the anomalies
anomalies = queries_df[queries_df['anomaly'] == -1]
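Beyond the -1/1 labels from predict, scikit-learn's IsolationForest also exposes a decision_function that scores every row, with lower scores being more anomalous; this is useful for ranking the flagged queries rather than just listing them. The sketch below runs on synthetic data standing in for the four features (the numbers are hypothetical, not the article's dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for [Clicks, Impressions, CTR, Position] (hypothetical values)
normal = rng.normal(loc=[100, 1000, 0.1, 5], scale=[10, 100, 0.01, 1], size=(50, 4))
outlier = np.array([[5000.0, 70000.0, 0.9, 30.0]])  # one extreme query
X = np.vstack([normal, outlier])

iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = iso.decision_function(X)  # lower score = more anomalous

print(labels[-1], scores.argmin())  # → -1 50 (the extreme row is flagged and ranked worst)
```

Sorting the real queries_df by these scores would surface the most extreme queries first, which can be handier than a flat anomaly list when there are many flags to review.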
Here’s how to analyze the detected anomalies to understand their nature and whether they represent true outliers or data errors:
print(anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])
                          Top queries  Clicks  Impressions     CTR  Position
0         number guessing game python    5223        14578  0.3583      1.61
1                 thecleverprogrammer    2809         3456  0.8128      1.02
2    python projects with source code    2077        73380  0.0283      5.94
4               the clever programmer    1931         2528  0.7638      1.09
15         rock paper scissors python    1111        35824  0.0310      7.19
21              classification report     933        39896  0.0234      7.53
34           machine learning roadmap     708        42715  0.0166      8.97
82                           r2 score     367        56322  0.0065      9.33
167               text to handwriting     222        11283  0.0197     28.52
929                     python turtle      52        18228  0.0029     18.75
The anomalies in our search query data are not just statistical outliers; they can point to areas for growth, optimization, and strategic focus. Some of these anomalies may reflect emerging trends or topics of growing interest, and staying responsive to them will help maintain and grow the website’s relevance and user engagement.
So, Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. It is valuable for businesses to spot potential issues or opportunities, such as unexpectedly high or low CTRs. I hope you liked this article on Search Queries Anomaly Detection with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.