Anomaly Detection means identifying unexpected events in a dataset that differ from the norm. It is very often applied to unlabeled data. The task rests on two key assumptions: first, that anomalies occur very rarely in the data, and second, that their features differ significantly from those of normal instances.
Data Exploration
In this article, I will take you through the problem of Anomaly Detection with Machine Learning. The dataset I will use in this article can be downloaded from here. Now let’s import the necessary libraries and have a quick look at some insights from the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from sklearn.ensemble import IsolationForest
Distribution of Sales
df = pd.read_excel("Superstore.xls")
df['Sales'].describe()
count     9994.000000
mean       229.858001
std        623.245101
min          0.444000
25%         17.280000
50%         54.490000
75%        209.940000
max      22638.480000
Name: Sales, dtype: float64
plt.scatter(range(df.shape[0]), np.sort(df['Sales'].values))
plt.xlabel('index')
plt.ylabel('Sales')
plt.title("Sales distribution")
sns.despine()

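# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(df['Sales'], kde=True)
plt.title("Distribution of Sales")
sns.despine()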

The sales distribution in the dataset is very far from a normal distribution: it has a long, thin positive tail, and most of its mass is concentrated on the left side of the plot above. In other words, sales are strongly right-skewed, with a small number of very large values.
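To put a number on that asymmetry, one option is the sample skewness, which pandas exposes as `Series.skew()`. The sketch below uses synthetic log-normal values as a stand-in for the `Sales` column (an assumption, so that it runs without the Excel file); with the real data you would call `df['Sales'].skew()` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Sales']: log-normal values are strictly
# positive and right-skewed. The parameters are illustrative assumptions.
rng = np.random.default_rng(42)
sales_like = pd.Series(rng.lognormal(mean=4.0, sigma=1.5, size=10_000), name="Sales")

skewness = sales_like.skew()        # > 0 indicates a longer right tail
mean_value = sales_like.mean()      # pulled upwards by the large values
median_value = sales_like.median()  # robust to the tail

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_value:.2f}, median: {median_value:.2f}")
```

A positive skewness together with a mean well above the median is the same pattern that `describe()` shows for the real column (mean 229.86 against a median of 54.49).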
Profit Distribution
df['Profit'].describe()
count     9994.000000
mean        28.656896
std        234.260108
min      -6599.978000
25%          1.728750
50%          8.666500
75%         29.364000
max       8399.976000
Name: Profit, dtype: float64
plt.scatter(range(df.shape[0]), np.sort(df['Profit'].values))
plt.xlabel('index')
plt.ylabel('Profit')
plt.title("Profit distribution")
sns.despine()

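# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(df['Profit'], kde=True)
plt.title("Distribution of Profit")
sns.despine()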

The profit distribution has both a positive and a negative tail, but the positive tail is longer than the negative one. This shows that the profit distribution is positively skewed.
So we now have two regions where profit values have a very low probability of occurring: one in the left tail and another in the right tail.
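One rough way to see those two low-probability regions is to compare the empirical density in the outermost histogram bins with the density at the peak. The sketch below is only an illustration: it uses synthetic heavy-tailed values in place of the `Profit` column (an assumption), and the bin count of 50 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic stand-in for df['Profit'] (assumption): centred near a small
# positive value, with heavy tails on both sides.
profit_like = 10 + 30 * rng.standard_t(df=3, size=10_000)

# density=True normalises the histogram so that it integrates to 1
density, edges = np.histogram(profit_like, bins=50, density=True)

left_tail_density = density[0]    # leftmost bin: large losses
right_tail_density = density[-1]  # rightmost bin: large profits
peak_density = density.max()      # the most common profit range

print(f"left tail:  {left_tail_density:.2e}")
print(f"right tail: {right_tail_density:.2e}")
print(f"peak:       {peak_density:.2e}")
```

Both tail densities come out far below the peak, which is what makes values in those regions candidates for anomalies.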
Anomaly Detection of Sales
In Machine Learning, the Isolation Forest algorithm detects outliers by returning an anomaly score for each instance. It is a tree-based model: each split is made by first selecting a random feature and then selecting a random split value between the minimum and maximum values of that feature. Anomalies tend to be isolated in fewer splits than normal instances. Now let’s go through this algorithm:
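To make the splitting idea concrete, here is a simplified one-feature sketch. It is not scikit-learn's implementation (which also subsamples the data and limits tree depth); the helper names `isolation_depth` and `average_depth` and the synthetic data are assumptions made for illustration. It counts how many random splits are needed before a chosen point is alone in its partition.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    current = np.asarray(values, dtype=float)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = current.min(), current.max()
        if low == high:  # remaining values are identical, cannot split
            break
        split = rng.uniform(low, high)  # random value between min and max
        # Keep only the side of the split that contains the target
        if target < split:
            current = current[current < split]
        else:
            current = current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# Illustrative data (assumption): 200 typical values plus one extreme value
values = np.append(rng.normal(loc=50, scale=10, size=200), 500.0)

def average_depth(target, n_trees=200):
    """Average the isolation depth over many independent random trees."""
    return np.mean([isolation_depth(values, target, rng) for _ in range(n_trees)])

typical_depth = average_depth(values[0])
extreme_depth = average_depth(500.0)
print(f"typical point: {typical_depth:.1f} splits")
print(f"extreme point: {extreme_depth:.1f} splits")
```

The extreme value is separated after far fewer splits on average, and a short average path is what the algorithm turns into a high anomaly score. scikit-learn's `IsolationForest` builds on the same idea; it is applied to the Sales column as follows: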
# Fit an Isolation Forest on the Sales column (scikit-learn expects a 2-D array)
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(df['Sales'].values.reshape(-1, 1))
# Score an evenly spaced grid spanning the observed Sales range
xx = np.linspace(df['Sales'].min(), df['Sales'].max(), len(df)).reshape(-1, 1)
anomaly_score = isolation_forest.decision_function(xx)  # lower score = more anomalous
outlier = isolation_forest.predict(xx)  # -1 = outlier, 1 = inlier
plt.figure(figsize=(10, 4))
plt.plot(xx, anomaly_score, label='anomaly score')
# Shade the grid regions that the model labels as outliers
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier == -1, color='r',
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Sales')
plt.show()

According to the output above, sales that exceed roughly 1000 would be considered outliers.
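Rather than reading the cut-off from the plot, it can also be estimated numerically as the smallest grid value above the median that `predict` labels `-1`. This is a minimal sketch under stated assumptions: it uses synthetic log-normal data in place of the `Sales` column so that it runs without the dataset, fixes `random_state` for repeatability, and the names `upper_cutoff` and `share_flagged` are hypothetical. The resulting number depends on the data and on the `contamination` setting, so it will not match the value read from the plot.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for df['Sales'] (illustrative assumption)
rng = np.random.default_rng(42)
sales_like = rng.lognormal(mean=4.0, sigma=1.5, size=2_000)

# random_state fixes the random splits so the result is repeatable
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(sales_like.reshape(-1, 1))

# Evaluate the model on an evenly spaced grid over the observed range
grid = np.linspace(sales_like.min(), sales_like.max(), 2_000)
labels = model.predict(grid.reshape(-1, 1))  # 1 = inlier, -1 = outlier

# Smallest flagged grid value above the median: a rough upper cut-off
median = np.median(sales_like)
flagged_high = grid[(labels == -1) & (grid > median)]
upper_cutoff = flagged_high.min()

# Labels for the observed values themselves, not just the grid
observed_labels = model.predict(sales_like.reshape(-1, 1))
share_flagged = np.mean(observed_labels == -1)

print(f"approximate upper cut-off: {upper_cutoff:.1f}")
print(f"share of observations flagged: {share_flagged:.1%}")
```

Calling `predict` on the observed values, as in the last step, is also how you would mark the actual anomalous rows of the dataset instead of only shading regions of a grid.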
Anomaly Detection of Profit
Now let’s use the same Isolation Forest algorithm to detect anomalies in profit:
# Fit an Isolation Forest on the Profit column (scikit-learn expects a 2-D array)
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(df['Profit'].values.reshape(-1, 1))
# Score an evenly spaced grid spanning the observed Profit range
xx = np.linspace(df['Profit'].min(), df['Profit'].max(), len(df)).reshape(-1, 1)
anomaly_score = isolation_forest.decision_function(xx)  # lower score = more anomalous
outlier = isolation_forest.predict(xx)  # -1 = outlier, 1 = inlier
plt.figure(figsize=(10, 4))
plt.plot(xx, anomaly_score, label='anomaly score')
# Shade the grid regions that the model labels as outliers
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier == -1, color='r',
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Profit')
plt.show()

According to the output above, profit that lies below roughly -100 or exceeds roughly 100 would be considered an outlier. I hope you liked this article on Anomaly Detection with Machine Learning. Feel free to ask your questions in the comments section below. You can also follow me on Medium to read more articles.