How to prevent data leakage?

May 11, 2021

Data leakage in machine learning is a serious problem. Before looking at how to prevent it, let's see why it happens.

Data preprocessing is the first and a crucial step in machine learning. It involves the following steps:

Data collection.
Identifying the missing data, and handling it.
Encoding the categorical data if present.
Splitting the data into training and testing datasets.
Feature Scaling.

Note that feature scaling is done after the split. To see why, first consider what feature scaling does.

Feature scaling, in simple terms, brings feature values into a specific range. The reason for doing this is that features with larger values would otherwise carry more weight, which directly affects the coefficients the model learns and its overall sensitivity. Without feature scaling, many machine learning algorithms tend to treat larger values as more important and smaller values as less important, regardless of the units of those values.
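To make the effect concrete, here is a minimal sketch using a distance calculation, the kind of computation distance-based algorithms rely on. The samples, the feature names (age and income), and the assumed feature ranges are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical samples: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)  # very different age, similar income
c = (26, 58_000)  # similar age, different income

def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# On raw values, income dominates because its numbers are much larger.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Assumed (min, max) range of each feature, used to bring both into [0, 1].
ranges = ((18, 80), (20_000, 120_000))

def scale(p):
    """Min-max scale each feature with the assumed ranges above."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(p, ranges))

scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(raw_ab < raw_ac)        # unscaled: a looks closer to b
print(scaled_ab > scaled_ac)  # scaled: a is closer to c
```

On the raw values the 35-year age gap between a and b barely registers next to the income difference. Once both features share a range, the age gap counts as much as it should.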

There are two ways to scale:

1. Standardizing

This technique re-scales a feature so that its distribution has a mean of 0 and a variance equal to 1.

2. Normalizing

This technique re-scales a feature or observation so that its values lie between 0 and 1.
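Both techniques can be sketched in a few lines using only Python's standard library. The function names `standardize` and `normalize` and the sample values are assumptions made for this illustration. Standardizing is taken here to mean subtracting the mean and dividing by the standard deviation, and normalizing to mean min-max scaling.

```python
import statistics

# Hypothetical feature column.
values = [2.0, 4.0, 6.0, 8.0, 10.0]

def standardize(xs):
    """Re-scale to mean 0 and variance 1: (x - mean) / std."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    # Note: a constant feature has std == 0 and would need special handling.
    return [(x - mean) / std for x in xs]

def normalize(xs):
    """Re-scale to the range [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    # Note: a constant feature has hi == lo and would need special handling.
    return [(x - lo) / (hi - lo) for x in xs]

standardized = standardize(values)
normalized = normalize(values)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both functions depend on statistics computed from the data they are given (mean and standard deviation, or minimum and maximum), which is why it matters which rows those statistics come from.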

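Data leakage happens when feature scaling is done before splitting the dataset. Standardizing and normalizing rely on statistics such as the mean, variance, standard deviation, minimum, and maximum. When scaling is done before the split, these statistics are computed over the whole dataset, so information from the test data influences how the training data is transformed. This is where the leakage happens. To prevent it, split the data first, compute the scaling statistics on the training set only, and apply those same statistics to the test set.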
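The following is a minimal sketch of leakage-free standardization, assuming a single hypothetical feature column that has already been shuffled. The data and the 80/20 split are made up for illustration. The last lines compute the "leaky" statistics purely for comparison.

```python
import statistics

# Hypothetical feature column, assumed to be already shuffled.
data = [10, 12, 14, 16, 18, 20, 22, 24, 60, 80]

# 1. Split first (80% train, 20% test).
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 2. Compute the scaling statistics on the training set only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)

# 3. Apply the training statistics to both sets.
train_scaled = [(x - train_mean) / train_std for x in train]
test_scaled = [(x - train_mean) / train_std for x in test]

# For comparison only: statistics over the whole dataset, which include
# the test rows. Scaling with these is what leaks test information.
full_mean = statistics.fmean(data)
full_std = statistics.pstdev(data)

print(train_mean)  # 17.0
print(full_mean)   # 27.6
```

The two means differ because the whole-dataset mean is pulled up by the test rows. If scikit-learn is used, the same pattern is usually written by calling `StandardScaler().fit(X_train)` and then `transform` on both `X_train` and `X_test`, where `X_train` and `X_test` are whatever names the split produced.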


Written by Sri Vigneshwar DJ