In this article, you will learn what data leakage is and how to avoid it. If you don't know how to prevent it, leakage will creep into your projects frequently and ruin your models in subtle and dangerous ways. It is, therefore, one of the most important concepts for every machine learning practitioner.
What is Data Leakage?
Data leakage occurs when your training data contains information about the target that will not be available when the model is used to make predictions. This leads to high performance on the training set (and possibly even on the validation data), but the model will perform poorly in production.
In simple words, data leakage makes a machine learning model look very accurate until you start making predictions with it, at which point the model becomes very inaccurate.
Data Leakage is of two types: target leakage and train-test contamination.
Target leakage
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order in which data becomes available, not merely whether a feature helps make good predictions.
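To make this concrete, here is a hypothetical illustration (the column names and data below are made up for demonstration, not taken from any real dataset): suppose we want to predict who will get pneumonia, and our historical data includes a took_antibiotic_medicine column. People usually take antibiotics after getting pneumonia, so the feature looks like a near-perfect predictor during training but is unknown at the moment a prediction is actually needed.

import pandas as pd

# Hypothetical patient records (made-up data): whether someone took
# antibiotics is only recorded AFTER a pneumonia diagnosis, so the
# feature "leaks" the target it is supposed to predict
patients = pd.DataFrame({
    'age': [34, 51, 47, 62],
    'took_antibiotic_medicine': [False, True, False, True],
    'got_pneumonia': [False, True, False, True],
})

# The leaky feature matches the target perfectly in historical data,
# but its value does not exist yet when we need to make a prediction
print((patients.took_antibiotic_medicine == patients.got_pneumonia).mean())

A model trained with this column would score brilliantly in validation and fail badly in production, which is exactly the pattern described above.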
Train-Test Contamination
A different type of leak occurs when you are not careful to distinguish training data from validation data. Validation is meant to measure how well the model performs on data it has not seen before. You can subtly corrupt this process if the validation data affects the preprocessing behaviour, for example by fitting an imputer or scaler before splitting the data. This is referred to as train-test contamination.
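As a concrete illustration, here is a minimal sketch (using made-up data from scikit-learn's make_classification, so the exact numbers are only for demonstration): fitting a scaler on the full dataset before splitting lets statistics from the validation rows leak into the training data, while fitting it inside a pipeline on the training data only avoids this.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Contaminated: the scaler is fit on ALL rows, so the validation
# data influences the preprocessing applied to the training data
X_scaled = StandardScaler().fit_transform(X_demo)
X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y_demo, random_state=0)

# Safe: split first, then fit the preprocessing inside a pipeline
# so it only ever sees the training rows
X_train, X_valid, y_train, y_valid = train_test_split(X_demo, y_demo, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_valid, y_valid))

The same principle is why a pipeline is passed to cross_val_score later in this article: each fold's preprocessing is fit only on that fold's training data.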
Data Leakage in Action
Here you will learn one way to detect and remove target leakage. I will use a dataset of credit card applications and skip the basic data setup code. The end result is that information about each credit card application is stored in a DataFrame X, which I will use to predict which applications were accepted, stored in a Series y. You can download the dataset from here:
import pandas as pd

# Read the data, parsing 'yes'/'no' columns as booleans
data = pd.read_csv('AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()
Number of rows in the dataset: 1319
   reports       age  income     share  expenditure  owner  selfemp  dependents  months  majorcards  active
0        0  37.66667  4.5200  0.033270   124.983300   True    False           3      54           1      12
1        0  33.25000  2.4200  0.005217     9.854167  False    False           3      34           1      13
2        0  33.66667  4.5000  0.004156    15.000000   True    False           4      58           1       5
3        0  30.50000  2.5400  0.065214   137.869200  False    False           0      25           1       7
4        0  32.16667  9.7867  0.067051   546.503300   True    False           2      64           1       5
Since this is a small dataset, I will use cross-validation to ensure accurate measures of model quality:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline
# (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
Cross-validation accuracy: 0.979525
With experience, you will find that it is very rare to find models that are accurate 98% of the time. It happens, but it is uncommon enough that we should inspect the data more closely for target leakage. Here is a summary of the data:
card: 1 if credit card application accepted, 0 if not
reports: Number of major derogatory reports
age: Age in years plus twelfths of a year
income: Yearly income (divided by 10,000)
share: Ratio of monthly credit card expenditure to yearly income
expenditure: Average monthly credit card expenditure
owner: 1 if owns home, 0 if rents
selfemp: 1 if self-employed, 0 if not
dependents: 1 + number of dependents
months: Months living at current address
majorcards: Number of major credit cards held
active: Number of active credit accounts
Some variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying? At this point, basic data comparisons can be very helpful:
# Compare expenditure between accepted and rejected applicants
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It is not surprising that our model appeared to have high accuracy. But this also looks like a case of target leakage, where expenditure probably means expenditure on the card they applied for.
Since share is partially determined by expenditure, it should be excluded too. The variables active and majorcards are a little less clear, but from the description, they sound concerning. In most situations, it is better to be safe than sorry if you cannot track down the people who created the data to find out more. I will run a model without target leakage as follows:
# Drop leaky predictors from the dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())
Cross-val accuracy: 0.830924
This accuracy is quite a bit lower, which may be disappointing. However, we can expect this model to be right about 80% of the time on new applications, whereas the leaky model would likely do much worse than that in production, despite its higher apparent cross-validation score.
Data leakage can be a million-dollar mistake in many machine learning tasks. I hope you liked this article on how to handle data leakage in machine learning tasks. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.