A Quick Dive into CRISPR and the Use of Machine Learning for Genome Engineering

A review of CRISPR and the modeling process

What the heck is CRISPR?

So what does this have to do with machine learning?

The first steps of modeling

In this figure we can see two potential thresholds for splitting sgRNA efficiency into high and low classes. By predicting whether the sgRNA efficiency is high or low, a classification model can be used to predict the efficiency of target sites on the genome.

One major issue with this method is that CRISPR data tends to be imbalanced. To get the best results from a model, the threshold between high and low needs to be adjusted. Resampling is also used to counteract class imbalance: oversampling the minority class helps improve the model's accuracy. The classification models are not perfect at predicting target efficiency, but they act as a good baseline while more accurate regression algorithms are researched.
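To make the thresholding and oversampling steps concrete, here is a minimal sketch using only the Python standard library. The efficiency values, the guide names, the 0.8 threshold, and the function names are all made up for illustration; they are not taken from the paper or from any real dataset.

```python
import random

def label_by_threshold(efficiencies, threshold):
    """Return 1 (high efficiency) when a value is >= threshold, else 0 (low)."""
    return [1 if e >= threshold else 0 for e in efficiencies]

def oversample_minority(features, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for x, y in zip(features, labels):
        by_class[y].append(x)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    if not by_class[minority]:
        raise ValueError("minority class is empty; pick a different threshold")
    n_extra = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(n_extra)]
    return list(features) + extra, list(labels) + [minority] * n_extra

# Hypothetical measured efficiencies for eight guides (illustrative only).
efficiencies = [0.05, 0.10, 0.22, 0.31, 0.35, 0.48, 0.81, 0.93]
guides = ["guide_%d" % i for i in range(len(efficiencies))]

labels = label_by_threshold(efficiencies, threshold=0.8)
# labels is [0, 0, 0, 0, 0, 0, 1, 1]: six low guides, two high guides

balanced_guides, balanced_labels = oversample_minority(guides, labels)
```

In a real workflow the oversampling would normally be applied only to the training split, so that duplicated guides do not leak into the data used for evaluation.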

To best predict sgRNA efficiency, the right features need to be selected and tailored for the ML model. One advantage of machine learning with CRISPR is the universal nature of sequence information. Knowing how nucleotides interact, and having a databank of CRISPR interactions from previous experiments, provides more training data for modeling. To make the model's predictions more accurate, other features also need to be considered, such as sgRNA size, along with choosing the right epigenetic data to include without overfitting the model.

Selecting features is only part of the ML process; the features also need to be modified so that the ML algorithms can understand the data. DNA is a string whose information is encoded in the order of the four nucleotides (A, C, G, and T). To help the algorithm, the sequence data needs to be converted into numeric values with one-hot encoding.

This figure shows the different methods of string processing. Panel D shows the one-hot encoding conversion of the nucleotides into an array of binary numbers.
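As a rough illustration of one-hot encoding, the sketch below turns each nucleotide into a four-element binary row. The A, C, G, T column order, the function names, and the example sequence are my own choices for the example, not something specified in the paper.

```python
NUCLEOTIDES = "ACGT"  # assumed column order for the encoding

def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element binary rows, one row per base."""
    encoded = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base {base!r}")
        encoded.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return encoded

def flatten(matrix):
    """Concatenate the rows into one feature vector for models that expect flat input."""
    return [value for row in matrix for value in row]

guide = "GATTACA"  # toy sequence, far shorter than a real sgRNA
matrix = one_hot_encode(guide)
# matrix[0] is [0, 0, 1, 0] because the first base is G
feature_vector = flatten(matrix)  # 7 bases x 4 columns = 28 values
```

Keeping the position of every base in the vector is what lets a model learn that a specific nucleotide at a specific position matters.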

When building features from sequence data, the grouping of the nucleotides can also be considered. Nucleotide pair length and known domains in the genome can likewise be factors in feature creation and selection.
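One simple way to capture nucleotide groupings is to count short runs of bases (k-mers) and to summarize overall composition. The sketch below is only an example of that idea: the choice of k=2, the GC-content feature, and the function names are assumptions on my part.

```python
from itertools import product

def kmer_counts(sequence, k=2):
    """Count every overlapping k-mer in the sequence (all 4**k k-mers start at zero)."""
    sequence = sequence.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
    return counts

def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

guide = "GATTACA"  # toy sequence for illustration
dinucleotides = kmer_counts(guide, k=2)
# GATTACA contains GA, AT, TT, TA, AC and CA once each
gc = gc_content(guide)  # 2 of 7 bases are G or C
```

Features like these can sit alongside the one-hot columns, although every extra column is another chance to overfit on a small dataset.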

What algorithm is right for us?

When we consider the sgRNA efficiency from above, the interactions between the features play a big role. One of the best ways to model the order of interactions between features is a decision tree. The tree splits the data into distinct groups that help predict the class of the sgRNA (high or low efficiency).

Here is an example of a decision tree that shows the nucleotide content of the sgRNA and the splits between features. The first split is on the percentage of certain nucleotides, creating pure nodes. The next split is on the position of specific nucleotides. This method helps identify which features most influence the sgRNA efficiency.
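To show how a tree decides where to split, here is a small sketch that scores every candidate split by weighted Gini impurity and keeps the best one. Gini is one common splitting criterion; I am assuming it here for illustration, and the two features (GC fraction and a 0/1 flag for a G at some position) and their values are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels; 0.0 means the node is pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (impurity, feature_index, threshold) for the lowest weighted impurity."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical guides: [GC fraction, 1 if a G sits at a chosen position else 0]
rows = [[0.30, 0], [0.35, 1], [0.40, 0], [0.55, 1], [0.60, 0], [0.65, 1]]
labels = [0, 0, 0, 1, 1, 1]  # 0 = low efficiency, 1 = high efficiency

impurity, feature_index, threshold = best_split(rows, labels)
# For this toy data the best split is feature 0 (GC fraction) at 0.40,
# which gives two pure nodes (impurity 0.0).
```

A full tree repeats this search inside each resulting group, which is how later splits end up conditional on earlier ones.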

The best way to utilize a tree method for sgRNA is to use an ensemble method like random forest. This method combines a multitude of short trees in order to output a more accurate class prediction based on the combined vote or average of all of the short trees.
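The ensemble idea can be sketched in plain Python as well. This is a deliberately simplified stand-in for a random forest: every tree is a single split (a stump), each one is trained on a bootstrap sample with a random subset of features, and the forest predicts by majority vote. The data, the feature meanings, and all function names are hypothetical, and in practice a library implementation such as scikit-learn's RandomForestClassifier would be the usual choice.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree that may only use the given feature indices."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (gini(labels), None, None, majority, majority)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    if f is None:  # no useful split was found; fall back to the majority class
        return left_label
    return left_label if row[f] <= t else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feats = rng.sample(range(n_features), max(1, n_features // 2))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]

# Hypothetical guides: [GC fraction, 1 if a G sits at a chosen position else 0]
rows = [[0.30, 0], [0.35, 0], [0.40, 0], [0.45, 0],
        [0.55, 1], [0.60, 1], [0.65, 1], [0.70, 1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = low efficiency, 1 = high efficiency

forest = fit_forest(rows, labels)
low_guess = predict_forest(forest, [0.25, 0])
high_guess = predict_forest(forest, [0.75, 1])
```

Because each stump sees slightly different rows and features, a single odd sample has less influence on the final vote than it would on one tree.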

The limitations of current ML with CRISPR

The wrap up


O’Brien, Aidan R, et al. “Domain-Specific Introduction to Machine Learning Terminology, Pitfalls and Opportunities in CRISPR-Based Gene Editing.” Briefings in Bioinformatics, 2020, doi:10.1093/bib/bbz145.

Data Scientist With A Background in Biology