Data Leakage in Machine Learning

In this article, you will learn what data leakage is and how to avoid it. If you don’t know how to prevent it, leaks will frequently occur and ruin your models in subtle and dangerous ways. It is, therefore, one of the most important concepts for all machine learning practitioners.

What is Data Leakage?

Data leakage occurs when the training data contains information about the target, but similar data will not be available when the model is used to make predictions. This leads to high performance on the training set (and possibly on the validation data as well), but the model will perform poorly in production.

In simple words, data leakage makes a machine learning model look very accurate until you start making predictions with it on new data, at which point it turns out to be very inaccurate.

There are two main types of data leakage: target leakage and train-test contamination.

Target Leakage

Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It’s important to think about target leakage in terms of the timing or chronological order in which data becomes available, and not just whether a feature helps make good predictions.
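
The timing argument can be made concrete with a small sketch. The feature log, its dates, and the decision time below are made up for illustration and are not part of the credit card dataset; the idea is only that any value recorded after the moment of prediction cannot legitimately be used as a predictor.

```python
import pandas as pd

# Hypothetical log of when each feature value was recorded for one application
# (names and dates are invented for this illustration).
feature_log = pd.DataFrame({
    "feature": ["income", "reports", "expenditure"],
    "recorded_at": pd.to_datetime(["2023-01-05", "2023-01-07", "2023-03-01"]),
})

# Assumed moment at which the model has to make its prediction.
decision_time = pd.Timestamp("2023-01-10")

# A feature recorded after the decision cannot be known at prediction time,
# so using it as a predictor would be target leakage.
is_late = feature_log["recorded_at"] > decision_time
leaky = feature_log.loc[is_late, "feature"].tolist()

print("Features unavailable at prediction time:", leaky)
```

In practice such timestamps are often missing, which is why it helps to reason about how each column was collected.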

Train-Test Contamination

A different type of leak occurs when you are not careful to distinguish training data from validation data. Validation is meant to be a measure of how well the model performs on data it has not seen before. You can subtly corrupt this process if the validation data affects the preprocessing behaviour, for example by fitting an imputer or a scaler on the full dataset before splitting it. This is referred to as train-test contamination.
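
As a minimal sketch of the difference, the snippet below runs the same preprocessing in two ways on synthetic data (the arrays X_demo and y_demo are invented here and have nothing to do with the credit card dataset). In the first variant the imputer and scaler see every row, including the rows later used for validation. In the second they sit inside a scikit-learn pipeline, so they are refitted on the training part of each fold only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of values

# Contaminated: imputer and scaler are fitted on ALL rows,
# so statistics from the validation folds leak into preprocessing.
X_contaminated = StandardScaler().fit_transform(
    SimpleImputer().fit_transform(X_demo)
)
contaminated_scores = cross_val_score(
    LogisticRegression(), X_contaminated, y_demo, cv=5
)

# Clean: preprocessing lives inside the pipeline, so cross_val_score
# refits it on the training part of each fold only.
clean_pipeline = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(clean_pipeline, X_demo, y_demo, cv=5)

print("Contaminated CV accuracy: %f" % contaminated_scores.mean())
print("Clean CV accuracy: %f" % clean_scores.mean())
```

On a toy problem like this the two numbers will probably be close. The gap tends to matter more with complex preprocessing such as feature selection or target-based encodings, so the pipeline form is the safer habit.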

Data Leakage in Action

Here you will learn one way to detect and remove target leakage. I will use a dataset about credit card applications and skip the basic data set-up code. The end result is that the information about each credit card application is stored in a DataFrame X. I will use it to predict which applications were accepted, which is stored in a Series y. You can download the dataset from here:

import pandas as pd

# Read the data
data = pd.read_csv('AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()
Number of rows in the dataset: 1319
   reports       age  income     share  expenditure  owner  selfemp  dependents  months  majorcards  active
0        0  37.66667  4.5200  0.033270   124.983300   True    False           3      54           1      12
1        0  33.25000  2.4200  0.005217     9.854167  False    False           3      34           1      13
2        0  33.66667  4.5000  0.004156    15.000000   True    False           4      58           1       5
3        0  30.50000  2.5400  0.065214   137.869200  False    False           0      25           1       7
4        0  32.16667  9.7867  0.067051   546.503300   True    False           2      64           1       5

Since this is a small dataset, I will use cross-validation to ensure accurate measures of model quality:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
Cross-validation accuracy: 0.979525

With experience, you will find that it is very rare to find models that are accurate 98% of the time. It does happen, but it is uncommon enough that we should inspect the data more closely for target leakage. Here is a summary of the variables in the data:

card: 1 if credit card application accepted, 0 if not
reports: Number of major derogatory reports
age: Age in years plus twelfths of a year
income: Yearly income (divided by 10,000)
share: Ratio of monthly credit card expenditure to yearly income
expenditure: Average monthly credit card expenditure
owner: 1 if owns home, 0 if rents
selfemp: 1 if self-employed, 0 if not
dependents: 1 + number of dependents
months: Months living at current address
majorcards: Number of major credit cards held
active: Number of active credit accounts

Some variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying? At this point, basic data comparisons can be very helpful:

expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02

As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It is not surprising that our model appeared to have high accuracy. But this also looks like a case of target leakage, where expenditure probably means expenditure on the card they applied for.
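
The comparison above checks one suspicious column by hand. A rough way to screen every predictor at once is to score each one on its own and look for columns that are implausibly predictive. The sketch below is only one possible heuristic, run on an invented toy table (X_toy and y_toy are hypothetical stand-ins, with an expenditure column built to mimic the leak described here); a high single-feature score is a reason to investigate a column, not proof of a leak.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only
rng = np.random.default_rng(0)
n = 300
y_toy = pd.Series(rng.random(n) < 0.7)  # True = application accepted
X_toy = pd.DataFrame({
    "income": rng.normal(3.0, 1.0, n),
    "age": rng.normal(35.0, 10.0, n),
    # Built to be leaky: spending is zero unless the card was granted
    "expenditure": np.where(y_toy, rng.gamma(2.0, 50.0, n), 0.0),
})

# Cross-validated accuracy of a shallow tree trained on each column alone
single_feature_scores = pd.Series({
    col: cross_val_score(
        DecisionTreeClassifier(max_depth=2, random_state=0),
        X_toy[[col]], y_toy, cv=5, scoring="accuracy",
    ).mean()
    for col in X_toy.columns
})

# Columns at the top of this list deserve a closer look
print(single_feature_scores.sort_values(ascending=False))
```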

Since share is partly determined by expenditure, it should be excluded too. The variables active and majorcards are a little less clear, but from their descriptions they look concerning. In most situations, it’s better to be safe than sorry if you can’t track down the people who created the data to find out more. I will run a model without target leakage as follows:

# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())
Cross-val accuracy: 0.830924

This accuracy is quite a bit lower, which can be disappointing. However, we can expect the model to be right about 80% of the time when used on new applications, whereas the leaky model would likely do much worse than that, despite its higher apparent score in cross-validation.

Data leakage can be a million-dollar mistake in many machine learning tasks. I hope you liked this article on how to handle data leakage in machine learning tasks.
