Visualizations I think are neat

Finding some references for data visualizations for future projects

Jacob Heyman
6 min read · Nov 19, 2020

My struggle with finding the right visualizations

Project week at Flatiron continues to be my favorite part of the bootcamp; however, my use of visualizations has been pretty stale and limited to the examples shown in class. A good data science project should always have banging visualizations to match the analysis. In order to up my “viz” game, I began creating a notebook of different visualizations I found on Seaborn’s and Plotly’s websites. My goal was to create a reference for myself to use on future projects. In this blog I will show some of the more interesting visualizations I have found and plan to use in upcoming projects.

Jointplot but make it fancy

Comparing the relationship between features can help paint a picture of the nature of your data. There are many ways to display these feature relationships, like scatter plots or kernel density plots. A jointplot has the added benefit of not only visualizing the relationship between two variables but also displaying each variable’s distribution. A standard jointplot is a scatterplot with marginal histograms along the edges. I found that a hexplot creates an interesting display of density. Here is some simple code for a hexplot using the seaborn iris dataset.

# imports used throughout this post
import seaborn as sns
df = sns.load_dataset("iris")

# hexplot for petal length compared to sepal length
sns.jointplot(x=df["petal_length"], y=df["sepal_length"],
              color='orchid', kind='hex',
              marginal_kws=dict(bins=30, rug=True))
sns.jointplot(x=df["sepal_width"], y=df["sepal_length"],
              color='orchid', kind='hex',
              marginal_kws=dict(bins=30, rug=True))

To make the hex plot, use the kind='hex' parameter. I also set the number of bins for the marginal histograms to 30 and included rugplot ticks. In these graphs, you can see the dark hexagons representing dense regions of related data, which match the marginal distributions.

Jointgrid

For the next visualization in my notebook, I played around with a JointGrid from seaborn. This guy is similar to the jointplot we did above, with the added flexibility of selecting what kinds of graphs you want in the main plot and the marginal plots.

plot = sns.JointGrid(data=df, x='sepal_width', y='sepal_length',
                     hue='species', palette='Set2',
                     height=7, marginal_ticks=True)
plot.plot(sns.scatterplot, sns.histplot)

Here we colored the points by assigning ‘hue’ to the target variable. We also set the x and y axes to the two features we are comparing. This visualization gives a good display of the different target classes and their distributions across the two selected variables.
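Because JointGrid separates the center plot from the margins, you can also style them independently with plot_joint and plot_marginals instead of a single .plot() call. A quick sketch using the same iris dataframe:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# plot_joint and plot_marginals let you pick different plot types for the
# center and the margins, rather than passing both to one .plot() call
g = sns.JointGrid(data=df, x="sepal_width", y="sepal_length",
                  hue="species", palette="Set2", height=7)
g.plot_joint(sns.kdeplot, fill=True, alpha=0.5)  # density contours in the center
g.plot_marginals(sns.histplot, kde=True)         # histograms with KDE on the margins
```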

Some weird things you can do with displots

Who doesn’t love a good kernel density plot? With seaborn’s displot, you can plot a multitude of different distribution visualizations. Displot has the added benefit of allowing you to plot bivariate distributions, just like we did above in the JointGrid.

sns.displot(data=df,
            x='sepal_length', hue='species',
            y='sepal_width', kind='kde',
            height=9, palette='Set2',
            rug=True)

Do I find this visualization super useful for this dataset? Not really, but I do think it is neat looking. You can see two and a half distinct groupings of data distributions between the three target species. For kind I used ‘kde’ for a kernel density estimation. Other options are ‘hist’ for a histogram and ‘ecdf’ for an empirical cumulative distribution, which shows the proportion of observations at or below each value. I added rug ticks to show the instances of those data points.
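As a small sketch of that ‘ecdf’ option: it only works for univariate data, so there is no y= here, just a running proportion for each species.

```python
import seaborn as sns

df = sns.load_dataset("iris")

# kind='ecdf' plots, for each x value, the proportion of observations
# at or below it; ecdf is univariate, so no y= argument
g = sns.displot(data=df, x="sepal_length", hue="species",
                kind="ecdf", palette="Set2")
```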

What a Clustermap

We all love a good heatmap to show the correlation of features, but what if you want to look at the hierarchical clusters of the data? The clustermap offers the advantage of displaying clusters along both the columns and rows of the dataset. You also have the option to choose what kind of distance metric to cluster on. For example, correlation:

df1 = df.drop('species', axis=1)
sns.clustermap(df1, metric="correlation",
               standard_scale=1, cmap='coolwarm')

The clustermap adds dendrograms to show the clustering. I also changed the colormap to coolwarm with cmap= to better visualize the differences in the plot.

In addition to correlation, you can also cluster on Euclidean distance:

sns.clustermap(df1, metric="euclidean",
               standard_scale=1, cmap='coolwarm')

I personally found these clustermaps a little difficult to work with and interpret, but the Euclidean version does show some interesting clustering of the rows of the data. I will definitely need to look further into labeling and interpretation.
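One starting point for that interpretation: clustermap returns a ClusterGrid object whose dendrograms record the reordered row and column indices, so you can at least recover which rows and features ended up grouped together. A sketch:

```python
import seaborn as sns

df = sns.load_dataset("iris")
df1 = df.drop("species", axis=1)

# clustermap returns a ClusterGrid; its dendrogram objects store the
# row/column order produced by the hierarchical clustering
g = sns.clustermap(df1, metric="euclidean", standard_scale=1, cmap="coolwarm")
row_order = g.dendrogram_row.reordered_ind  # rows, in clustered order
col_order = g.dendrogram_col.reordered_ind  # columns, in clustered order
print([df1.columns[i] for i in col_order])
```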

Pairplot relationship goals

Pairplots are my favorite visualization to use in an EDA. They show a matrix of selected features and their relation to each other. In the pairplot below I color-coded the target to see the relationship of the different targets across multiple feature combinations. I also used the markers= parameter to change the shape of the scatter dots.

sns.pairplot(df, kind="scatter",
             hue="species", markers=["o", "s", "D"],
             palette="Spectral")

Here we can see clear differentiation between the three target variables. The pairplot really helps paint a picture of how the data is distributed.
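Two pairplot parameters worth knowing once datasets get wider: corner=True drops the redundant upper triangle, and vars= restricts the matrix to a subset of features. A sketch:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# corner=True keeps only the lower triangle; vars= picks three features
g = sns.pairplot(df, hue="species", corner=True,
                 vars=["sepal_length", "petal_length", "petal_width"],
                 palette="Spectral")
```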

PCA PSA

This week we started learning about principal component analysis (PCA) and the goal of reducing the dimensionality of a dataset. Using PCA you can identify the principal components of your data and plot them to get a two-dimensional look. The visualization below shows a 2D representation of the first two principal components.

# additional imports for the PCA plots
import plotly.express as px
from sklearn.decomposition import PCA

X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
pca = PCA(n_components=2)
components = pca.fit_transform(X)
fig = px.scatter(components, x=0, y=1, color=df['species'])
fig.show()

Looking at the top two components lets you observe a simplified display of the component variance. What if you want to get a better picture of the variance? You can take a look at the PCA in three dimensions. These neat 3D graphs from Plotly really help demonstrate the dimensionality of the dataset.

X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
pca = PCA(n_components=4)
components = pca.fit_transform(X)
total_var = pca.explained_variance_ratio_.sum() * 100
fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=df['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()

Plotly lets you pan and rotate through the data, allowing you to see the different distributions of the PCA.
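To decide how many components are actually worth plotting, you can inspect explained_variance_ratio_ directly; for iris the first component alone captures the bulk of the variance. A sketch:

```python
import seaborn as sns
from sklearn.decomposition import PCA

df = sns.load_dataset("iris")
X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]

# each entry is the share of total variance captured by that component
pca = PCA(n_components=4).fit(X)
for i, r in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC {i}: {r:.1%}")
```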

That’s a wrap

Again, these are just some visualizations that I found and thought were cool. I definitely need to try them out on some other datasets and play around with the parameters. I collected them as a source of inspiration for my next project, and I hope some of these plots may also be useful to someone else.

Written by Jacob Heyman

Data Scientist With A Background in Biology
