Handling Missing Data — Data Preprocessing

In this blog, let us learn how to handle missing data in a dataset. Data preprocessing is a crucial step in Machine Learning. It is like the foundation: if we mess up the preprocessing step, then no matter which algorithm you use or how much you tune its hyperparameters, your model is likely to perform poorly. So let’s see how to handle it.

Sri Vigneshwar DJ
2 min read · May 9, 2021

To make it easier to understand, we will implement it both with a library (sklearn) and without one.

With Library:

Step 1 — Importing the Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2 - Importing the Dataset

dataset = pd.read_csv("your_file_path")
X = dataset.iloc[:, :-1].values  # all columns except the last one (features)
y = dataset.iloc[:, -1].values  # last column (target); let's not worry about y here

Step 3 - Let’s see the missing data

Screenshot from notebook
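If you want to see the same thing in code, counting the missing entries per column is one option. Here is a minimal sketch on a small made-up DataFrame; the column names (Country, Age, Salary, Purchased) and the values are assumptions for illustration, not the dataset from the screenshot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the CSV file used in this post
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Number of missing values in each column
missing_per_column = dataset.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = dataset[dataset.isnull().any(axis=1)]
print(rows_with_missing)
```

On your own data, replace the toy DataFrame with the one returned by pd.read_csv.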

Step 4 - Handling using Library

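from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])  # learn the mean of each of the two numeric columns
X[:, 1:3] = imputer.transform(X[:, 1:3])  # fill the gaps in place and keep the other columns of X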

The sklearn library provides the SimpleImputer class to handle missing data. Here, the named argument missing_values specifies which placeholder marks a missing value (np.nan in our case), and strategy specifies which statistic is used to fill it (here, the mean of each column).
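To try the imputer without a CSV file, here is a small self-contained sketch. The array below is made up for illustration (a label column followed by two numeric columns, assumed to be age and salary), and the explicit cast to float is just a precaution for the object-typed array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is a label, columns 1 and 2 are numeric
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, np.nan],
    ["Germany", np.nan, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Cast the numeric slice to float, then learn the column means and fill the gaps in one call
numeric_part = X[:, 1:3].astype(float)
X[:, 1:3] = imputer.fit_transform(numeric_part)

print(imputer.statistics_)  # the per-column means learned during fit
print(X)
```

Besides 'mean', strategy also accepts 'median', 'most_frequent' and 'constant', which can be a better fit for skewed or categorical columns.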

Read more about the SimpleImputer - https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

Here we can see that the missing value is filled with the mean of the Salary column.
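To convince yourself that the filled value really is the column mean, you can reproduce the same fill by hand with NumPy. This is only a sketch on made-up salary numbers, not the values from the screenshot.

```python
import numpy as np

# Hypothetical salary column with one missing entry
salary = np.array([72000.0, np.nan, 54000.0, 61000.0])

# Mean of the values that are present (NaN entries are ignored)
mean_salary = np.nanmean(salary)

# Replace each NaN with that mean and leave the other values untouched
filled = np.where(np.isnan(salary), mean_salary, salary)

print(mean_salary)
print(filled)
```

This mirrors what the 'mean' strategy does for a single column: compute the mean over the observed values and write it into the empty cells.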
