Some code snippets for analyzing your image dataset

A different kind of EDA

Exploratory data analysis for images is quite different from working with a standard tabular dataset: you are working with a stack of images, not a dataframe. In this blog I'll show some code snippets I have used to display images and identify differences between the classes in my dataset.

EDA with the Alzheimer's data set

I'm working with an Alzheimer's dataset. Alzheimer's is categorized into increasing stages of deterioration of the hippocampus and grey matter in the brain. …
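As a preview of the kind of snippet I mean, here is a minimal sketch. It assumes the images are organized into one subfolder per class (the alzheimers/ directory name is just a placeholder) and displays the first image from each class with matplotlib:

import os
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img

data_dir = 'alzheimers'  # placeholder path: one subfolder per class
class_names = sorted(os.listdir(data_dir))

fig, axes = plt.subplots(1, len(class_names), figsize=(4 * len(class_names), 4))
for ax, class_name in zip(axes, class_names):
    class_dir = os.path.join(data_dir, class_name)
    first_image = sorted(os.listdir(class_dir))[0]
    ax.imshow(load_img(os.path.join(class_dir, first_image)))  # load_img returns a PIL image
    ax.set_title(class_name)
    ax.axis('off')
plt.show()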


Image Classification

The use of image classification in the medical field is a growing area of study and interest. From identifying tumors in MRIs to creating AI that can detect cancer cells in blood, there are many applications for image classification. Creating these AIs will help with early detection, more accurate diagnoses, and easier access to higher-quality medicine anywhere on the globe. In this blog I will show the first few steps of prepping data for an image classifier.

Keras Image Data Generator

To read in images I used the Keras ImageDataGenerator. In this blog I am using a Kaggle dataset for…
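As a rough sketch of what that looks like (the directory path, image size, and batch size below are placeholders, not the actual Kaggle dataset layout):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and hold out 20% of the images for validation
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

# flow_from_directory expects one subfolder per class inside 'data/train'
train_gen = datagen.flow_from_directory(
    'data/train',
    target_size=(176, 176),
    batch_size=32,
    class_mode='categorical',
    subset='training')

val_gen = datagen.flow_from_directory(
    'data/train',
    target_size=(176, 176),
    batch_size=32,
    class_mode='categorical',
    subset='validation')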


A list of some shortcuts I think are essential

Jupyter Notebook

Jupyter Notebook is an open-source application that allows you to create notebooks with live code. The notebooks help you stay organized with easy-to-annotate code, and I find them essential for any project. I got my first experience with Jupyter in the Flatiron data science program, where I got a crash course in how to use the notebooks. In this blog I will share some Jupyter shortcuts and some tricks I wish I had known when I first started using the notebooks. …
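Most of the shortcuts are keyboard commands, but some of the tricks involve IPython magic commands. As a small taste (not necessarily the ones covered in the full post; the last line assumes pandas has been imported as pd):

%lsmagic                       # list every available magic command
%%time                         # put at the top of a cell to time the whole cell
%timeit sorted(range(1000))    # time a single statement over many runs
%who                           # list the variables currently defined in the notebook
?pd.read_csv                   # show the docstring for any function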


Prepping a Kaggle Data Set for Logistic Regression

The Data

For this quick walkthrough, I chose a stroke dataset from Kaggle. More than 700,000 people in the US suffer from a stroke each year. There are multiple factors that contribute to someone's risk of having a stroke, and understanding a patient's potential risk may help physicians administer precautionary care. With the power of machine learning, stroke patient data can be used to build a stroke risk classification model. This model could be deployed as an app or tool to help the user understand their risk of having a stroke. …
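A minimal sketch of that prep, assuming hypothetical file and column names (a stroke.csv file with a binary stroke target; the real Kaggle columns may differ), could look like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('stroke.csv')  # placeholder file name for the Kaggle data

# One-hot encode categorical features and separate the target
X = pd.get_dummies(df.drop(columns=['stroke']), drop_first=True)
y = df['stroke']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scale the features so no single feature dominates the model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))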


A quick review of some essential Genome Sequence analysis terminology

Next Generation Sequence Analysis

Next Generation Sequencing (NGS) is the current method for sequencing genomic data. The process is a massively parallel sequencing of DNA that allows for fast and accurate genome sequencing. An accurate genome is essential for multiple applications across research and medicine. When working with NGS data, understanding the terminology is essential before beginning any analysis. In this short blog I will cover those essential terms so anyone can get a start on understanding NGS.

NGS Terminology

Contig: a contiguous stretch of sequence assembled by joining a collection of overlapping sequence reads or clones


Some EDA and Visualizations made with Tableau

Overwatch

Overwatch is my favorite game. The number of hours I have personally logged in this game over the years borders on embarrassing. My obsession with this game should explain why an Overwatch Kaggle dataset got me a little more excited about data than one should be. The dataset comprised one player's data over the course of 2017. In this blog I will show you some of my cleaning and EDA, as well as visualizations I made using Tableau.
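To give a flavor of the cleaning side, here is a small sketch (the file name and column names are hypothetical, not the actual Kaggle columns):

import pandas as pd

df = pd.read_csv('overwatch_2017.csv')  # placeholder file name

# First pass: shape, dtypes, and missing values
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Example cleaning step: parse a hypothetical date column and
# aggregate win rate by hero to export for Tableau
df['date'] = pd.to_datetime(df['date'])
win_rate_by_hero = df.groupby('hero')['result'].apply(lambda s: (s == 'win').mean())
win_rate_by_hero.to_csv('win_rate_by_hero.csv')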


A quick guide to pooling layers when building a CNN

What are Pooling Layers

Pooling layers are an essential component of a convolutional neural net's architecture. Pooling layers act to subsample the input image. Subsampling the image helps alleviate the computational load of the CNN and can help reduce overfitting. Pooling layers operate similarly to convolutional layers in that each neuron is connected to the outputs of neurons from the previous layer. Pooling layers differ from convolutional layers in that they have no weights; pooling only aggregates values using an aggregation function. When building a CNN two options for pooling are…
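A minimal Keras sketch showing where pooling layers sit in a small CNN (the input shape and layer sizes are placeholders):

from tensorflow.keras import layers, models

# MaxPooling2D keeps the largest value in each 2x2 window, halving the height
# and width of the feature maps; AveragePooling2D would take the mean instead.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.summary()

Note in the summary that the pooling layers add zero trainable parameters, which is exactly the "no weights" point above.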


Missing values in data are an inevitability, and how you choose to deal with them is entirely dependent on your specific data. Lucky for us, Pandas offers multiple options for filling in the missing values, which can expedite any data cleaning process. In this blog I will show a few code examples of how to deal with missing values.

Just get rid of them
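The simplest option is presumably just dropping the rows (or columns) that contain missing values. A minimal sketch with a toy dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [34, np.nan, 52], 'bmi': [22.1, 30.5, np.nan]})

df.dropna()                  # drop any row containing a missing value
df.dropna(axis=1)            # drop any column containing a missing value
df.dropna(subset=['age'])    # only drop rows where 'age' is missing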


Revisiting a simple skill with some new tricks

Missing Data

Dealing with missing values in data is an inevitability in data science. Finding those missing values and determining what to do with them will make or break any project. At Flatiron, finding missing values is one of the first skills we learn.

df.isnull().sum()

This simple line of code is a standard when taking your initial look through your data. It finds all the missing values and returns a count of the missing values in each column. This is great, but what if you want to know more specifics about the…
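As a quick illustration of df.isnull().sum() on a toy dataframe (the column names are just examples):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [34, np.nan, 52, 41],
                   'bmi': [22.1, 30.5, np.nan, np.nan]})

print(df.isnull().sum())
# age    1
# bmi    2
# dtype: int64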


Creating a dashboard for the Kings County Housing Data Set

Looking Back on a Previous Project

In Phase 2 of Flatiron's Data Science Program, we were given the Kings County housing data set and asked to create a linear regression model to predict the prices of housing listings from a holdout set of data. The project was framed around a friendly competition to see who could create the best predictive model with the lowest amount of error.

Looking back at my own work, I can see that I was so focused on producing a working model that I totally ignored understanding the data as housing listings. …

Jacob Heyman

Data Scientist With A Background in Biology
