Some Basics For Filling in Missing Values

Jacob Heyman
3 min read · Mar 22, 2021


Missing values are inevitable in real data, and how you deal with them depends on your specific dataset. Luckily, pandas offers several options for dropping or filling in missing values, which can speed up the data cleaning process. In this blog I will show a few code examples of how to deal with missing values.

Just get rid of them

The quickest way to deal with missing values is just to drop them all from your data set.

df.dropna()

This simple line of code returns a copy of your data set without any row that contains at least one NaN value. This is a perfect solution if you only have a few missing values. If you only want to drop rows in which every value is NaN, you can use

df.dropna(how='all')

Alternatively, to drop columns that contain missing values, you can use

df.dropna(axis=1)
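To make the difference between these calls concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration, not data from this post):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has a single gap, row 2 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, np.nan, np.nan, 8.0],
    "c": [9.0, 10.0, np.nan, 12.0],
})

# Default (how='any'): drop every row that has at least one NaN.
any_dropped = df.dropna()

# how='all': drop only the rows where every value is NaN.
all_dropped = df.dropna(how="all")

# axis=1: drop every column that still contains a NaN.
cols_dropped = all_dropped.dropna(axis=1)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(cols_dropped.columns.tolist())
```

None of these calls change df itself; each one returns a new DataFrame, so the result has to be assigned to a name if you want to keep it.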

These methods work well only when a small share of your data is missing. If you have a lot of missing values, dropping them can throw away much of your dataset, and there are other options that keep those rows.
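One way to judge whether "a few" or "a lot" of values are missing is to count them per column before deciding. The sketch below uses a made-up DataFrame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "score" is half missing, "rank" is complete.
df = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, np.nan],
    "rank": [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Fraction of missing values in each column (mean of True/False flags).
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```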

Filling missing data

What if you are missing a lot of data in a specific column, but do not want to drop the rows and shrink your data? Pandas offers several different methods for filling in the missing values.

df.fillna(0)

This is a simple solution that fills all the NaN values with 0. You can also manually specify which value to fill each column with.

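df.fillna({5: 2, 7: 2.5})

Here the dictionary keys are column labels, not row numbers: missing values in column 5 are filled with 2, and missing values in column 7 are filled with 2.5. Columns that are not listed are left untouched.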
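As a small sketch of both forms of fillna, assuming a made-up DataFrame whose columns happen to be labeled 5, 7 and 9:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with integer column labels 5, 7 and 9.
df = pd.DataFrame({
    5: [1.0, np.nan],
    7: [np.nan, 4.0],
    9: [np.nan, 6.0],
})

# A single scalar fills every NaN in the DataFrame.
filled_zero = df.fillna(0)

# A dictionary maps column labels to fill values; column 9 is not
# listed, so its NaN is left in place.
filled_by_col = df.fillna({5: 2, 7: 2.5})

print(filled_zero)
print(filled_by_col)
```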

Another option is to use forward or backward fills. These are particularly useful for ordered data in which each value is similar to its adjacent rows. Recently I have been working on cleaning some game player data where several match scores are missing. I used a forward fill to fill in the missing values, because the score only changed by 10–20 points each match.

df['sr_finish'] = df['sr_finish'].ffill()

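Here I selected my target column and forward filled its NaN values, so each gap takes the last value that came before it. Alternatively, you can use method='bfill', or the equivalent bfill() method, to do a backfill. Note that pandas 2.1 and later deprecate the method argument of fillna in favor of the dedicated ffill() and bfill() methods. A limit can also be added to cap the number of consecutive values filled in.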

df['sr_finish'].ffill(limit=2)

This fills at most two consecutive missing values after each existing value; any further NaN values in the same gap are left as they are.
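The behavior of forward fill, backward fill and limit can be sketched on a short made-up series. The values below are assumptions for illustration, not the actual player data, and the example uses the ffill() and bfill() methods rather than the method argument:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a gap of three missing matches.
scores = pd.Series([100.0, np.nan, np.nan, np.nan, 150.0], name="sr_finish")

# Forward fill: every gap takes the last known value before it.
forward = scores.ffill()

# With limit=2, only the first two NaN values of the gap are filled.
limited = scores.ffill(limit=2)

# Backward fill: every gap takes the next known value after it.
backward = scores.bfill()

print(forward.tolist())
print(limited.tolist())
print(backward.tolist())
```

Which direction makes sense depends on the data: a forward fill assumes the last observed score still holds, while a backfill assumes the next observed score already applied.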

The fillna method can also be used to fill in the missing values with averages:

df.fillna(df.mean())

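This will fill in the NaN values of each column with that column's own mean. If your DataFrame also has non-numeric columns, df.mean(numeric_only=True) restricts the calculation to the numeric ones.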
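Here is a minimal sketch of a mean fill on a made-up DataFrame (hypothetical column names). It passes numeric_only=True because the toy data includes a text column, which is an addition to the one-liner above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column.
df = pd.DataFrame({
    "score": [10.0, np.nan, 30.0],
    "kills": [2.0, 4.0, np.nan],
    "player": ["ann", "bob", None],
})

# Column means of the numeric columns only: score -> 20.0, kills -> 3.0.
column_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean; "player" has no
# mean, so its missing value is left in place.
filled = df.fillna(column_means)

print(filled)
```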

Cleaning is an art

The method you choose for filling in missing values will shape your data. Data cleaning is often the most time-consuming and difficult part of an analysis, and knowing the different options can help speed it up. I recommend trying out a few different filling methods to see which one works best for your specific data.


Written by Jacob Heyman

Data Scientist With A Background in Biology
