Visualizing Missing Data
Revisiting a simple skill with some new tricks
Missing Data
Dealing with missing values is an inevitability in data science. Finding those missing values and deciding what to do with them can make or break a project. At Flatiron, finding missing values is one of the first techniques we learn.
df.isnull().sum()
This simple line of code is a standard part of any initial look through your data. It finds the missing values and returns a count of them for each column. This is great, but what if you want more specifics about the missing values? Are there patterns in the missing data? Is there a specific column or row with a disproportionate amount of missing data? In this blog, I will show some code I have been using to help visualize missing values and analyze the data.
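To make the per-column, per-row, and overall counts concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are hypothetical and chosen only for illustration; they are not from the vaccination data used later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["NYC", "Boston", None, "Chicago"],
})

# Missing values per column (a Series indexed by column name).
per_column = df.isnull().sum()

# A single grand total across the whole DataFrame.
total_missing = int(per_column.sum())

# Missing values per row, to spot rows that are mostly empty.
per_row = df.isnull().sum(axis=1)

print(per_column)
print("Total missing:", total_missing)
print(per_row)
```

Passing axis=1 to sum() is what switches the count from columns to rows, which is a quick way to check whether a handful of rows account for most of the gaps.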
Plot it
The first visualization is super simple. Just plot the output of isnull() as a heatmap. Plotting the missing values this way gives a quick look at where they are and how they are distributed throughout the data.
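import matplotlib.pyplot as plt
import seaborn as sns
# Heatmap of the boolean isnull() mask; row labels and color bar are hidden
sns.heatmap(series.isnull(), yticklabels=False, cbar=False, cmap='tab20c_r')
plt.title('Missing Data: Series')
plt.show()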
This produces a heatmap depicting the distribution of missing values in a series of vaccination data. With just a few simple lines of code, it is clear that there are a lot of missing values. This is a neat trick, but what if we want a better feel for just how many missing values there are?
Quantifying the missing values
The next step is my new favorite method for really seeing just how many missing values there are in your data.
rows = series.shape[0]
null_total = series.isnull().sum()
missing_percent = (null_total / rows) * 100
missing_percent
With this code you can find the percentage of missing values.
Now we can see the percentage of missing data for each column in the dataset. The only downside is that this list isn’t the most interpretable. To create a better visualization, these percentages can be turned into a DataFrame.
pd.DataFrame(missing_percent, columns=["missing_percent"])
Now we have a nice neat table that shows us the percentages of missing values for each column.
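One way to make that table easier to scan is to keep the raw counts next to the percentages and sort so the worst columns come first. The sketch below is my assumption about how the steps above could be extended, not a required part of the workflow. The `series` DataFrame here is a small stand-in built from made-up values, and `missing_count` and `summary` are names introduced just for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for the `series` DataFrame; values are invented for illustration.
series = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "dose_1": [0.9, np.nan, 0.7, np.nan],
    "dose_2": [np.nan, np.nan, 0.5, np.nan],
})

# isnull().mean() gives the fraction of missing rows per column,
# which is equivalent to isnull().sum() divided by the number of rows.
missing_percent = series.isnull().mean() * 100

# Combine counts and percentages, with the most incomplete columns on top.
summary = pd.DataFrame({
    "missing_count": series.isnull().sum(),
    "missing_percent": missing_percent.round(1),
}).sort_values("missing_percent", ascending=False)

print(summary)
```

Sorting in descending order puts the columns that need the most attention at the top, which helps when a dataset has dozens of columns.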
These are some simple tricks for initial data analysis. I hope that you find this code useful. It is most definitely one of my favorite new coding tricks, and I will be using it in all future projects.