Visualizing Missing Data

Jacob Heyman
3 min read · Mar 14, 2021

Revisiting a simple skill with some new tricks

Missing Data

Missing values are an inevitable part of data science. Finding them and deciding what to do with them can make or break a project. At Flatiron, finding missing values is one of the first methods we learn.

df.isnull().sum()

This simple line of code is a standard first look at your data. It counts the missing values in each column. This is great, but what if you want to know more specifics about the missing values? Are there patterns in the missing data? Is there a specific column or row with a disproportionate amount of missing data? In this blog, I will show some code I have been using to visualize missing values and analyze the data.
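As a quick illustration of what that line returns, here is a minimal sketch on a made-up DataFrame. The column names and values are invented for this example and are not taken from the vaccination data used later. It also shows one possible way to count missing values per row, which speaks to the "column or row" question above.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "doses": [100.0, np.nan, 250.0, np.nan],
    "vaccine": ["x", None, None, None],
})

# One count per column (this is what df.isnull().sum() returns)
missing_per_column = df.isnull().sum()

# One count per row: sum the boolean mask across columns instead
missing_per_row = df.isnull().sum(axis=1)

# A single grand total: sum the per-column counts
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(missing_per_row)
print(total_missing)
```

Sorting `missing_per_row` in descending order is one simple way to spot rows that are mostly empty.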

Plot it

The first visualization is super simple: plot the output of isnull() as a heat-map. Plotting the missing values this way gives a quick look at where they are and how they are distributed throughout the data.

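import matplotlib.pyplot as plt
import seaborn as sns
# series holds the vaccination data; isnull() marks each cell as missing or not
sns.heatmap(series.isnull(), yticklabels=False, cbar=False, cmap='tab20c_r')
plt.title('Missing Data: Series')
plt.show()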

Here is a heat-map depicting the distribution of missing values in a series of vaccination data. As you can see, with just a few simple lines of code it is clear that there are a lot of missing values. This is a neat trick, but what if we want a better feel for just how many missing values there are?

Quantifying the missing values

The next step is my new favorite method for really seeing just how many missing values there are in your data.

rows = series.shape[0]
null_total = series.isnull().sum()
missing_percent = (null_total/rows)*100
missing_percent

With this code you can find the percentage of missing values in each column.

Now we can see the percentages of missing data for each column in the dataset. The only downside is that this list isn’t the most interpretable. To create a better visualization, these percentages can be turned into a dataframe:

pd.DataFrame(missing_percent, columns=["missing_percent"])

Now we have a nice neat table that shows us the percentages of missing values for each column.
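The steps above can also be folded into one small helper. This is only a sketch under my own assumptions: `missing_summary` is a hypothetical function name, the toy data is invented, and the sorting is an extra I added so the worst columns show up first. It relies on the fact that the mean of a boolean mask is the fraction of True values, so `isnull().mean() * 100` gives the same percentage as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        # Mean of a boolean mask = fraction of missing cells in the column
        "missing_percent": df.isnull().mean() * 100,
    })
    # Columns with the largest share of missing values come first
    return summary.sort_values("missing_percent", ascending=False)

# Toy data invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "doses": [100.0, np.nan, 250.0, np.nan],
    "vaccine": ["x", None, None, None],
})

summary = missing_summary(toy)
print(summary)
```

Keeping the raw count next to the percentage makes the table easier to sanity-check against the output of `isnull().sum()`.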

These are some simple tricks for initial data analysis. I hope that you find this code useful. It is most definitely one of my favorite new coding tricks, and I will be using it in all future projects.
