A quick dive into CRISPR and the use of machine learning for genome engineering
A review of CRISPR and the modeling process
What the heck is CRISPR?
Not to nerd out, but CRISPR is one of the coolest new discoveries in biology. CRISPR, which stands for clustered regularly interspaced short palindromic repeats, is a tool used in a multitude of fields, including molecular biology and gene therapy. It was derived from a natural immune defense of certain bacteria, which targets foreign genetic material, such as viral DNA, and cleaves it. What makes CRISPR so amazing is its ability to target very specific locations on the genome with a ‘programmable’ single guide RNA (sgRNA) that pairs with the genome target. Once the sgRNA is paired with its target, the CRISPR-associated protein 9 (Cas9) acts as a pair of molecular scissors, creating a double-strand break in the DNA. This break triggers the cell’s DNA repair pathways. The cool thing about this process is that it can be exploited to replace the excised DNA with a new strand of engineered DNA. CRISPR-Cas9 can be used to replace large chunks of DNA or to repair single nucleotide polymorphisms (mutations that can lead to diseases like cystic fibrosis). There are many other applications for CRISPR outside of DNA insertion as well: CRISPR can be used to control gene expression, as a genetic marker, or as a diagnostic tool.
So what does this have to do with machine learning?
In the quick intro to CRISPR above, we talked about how CRISPR uses engineered sgRNA to bind and cut target locations on the genome. Designing the sgRNA to find the specific target is no easy task. DNA and the systems and pathways that interact with it are highly complex. To create an accurate CRISPR system, a multitude of factors need to be considered for the target to be reached. To understand this complex interplay of factors, researchers have turned to machine learning (ML), using models to predict experimental success. Using existing research data, an ML model can be trained on samples to learn the relationships between the features and the outcome. The model can then be used to predict sgRNA design effectiveness, increasing the efficiency of CRISPR in vivo experiments. sgRNA effectiveness does not have to be the only prediction target; ML can also be used to locate unintended binding sites on the DNA. This ML mapping of off-target genome regions can be combined with the on-target sgRNA predictions to better predict the efficiency of the CRISPR-Cas9 system.
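To make the idea of an unintended binding site concrete, here is a minimal sketch that scans a sequence for windows that nearly match a guide. This is a naive mismatch count, not an ML model, and everything in it is an assumption for illustration: the function names, the toy 7-letter guide (real guides are longer), the made-up genome string, and the mismatch budget. It also only looks at one strand.

```python
def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_near_matches(guide, genome, max_mismatches=2):
    """Return (position, mismatches) for each window within the mismatch budget."""
    k = len(guide)
    hits = []
    for i in range(len(genome) - k + 1):
        d = hamming(guide, genome[i:i + k])
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


# Hypothetical toy data: a short "guide" and a short stretch of "genome".
guide = "GATTACA"
genome = "CCGATTACATTGATCACAGG"

hits = find_near_matches(guide, genome)
print(hits)  # [(2, 0), (11, 1)]
```

The hit with zero mismatches is the intended target, and the hit with one mismatch is the kind of near-match that could become an off-target site. An ML model would go further and learn which near-matches are actually likely to be cut.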
The first steps of modeling
To train an ML model for CRISPR-Cas9, the training data needs a labeled target variable. The label could be knockdown efficiency, cleavage efficiency, or expression measured by fluorescence. Classification algorithms are used for discrete variables, while regression algorithms are used for continuous variables. sgRNA efficiency is a continuous variable on a range of 0% to 100%. Due to the extreme complexity of the features in the system and the limited sample size, researchers sometimes choose to represent the continuous values discretely: the sgRNA efficiency is binned into classes in order to train a classification model.
In this figure we can see two potential thresholds for splitting sgRNA efficiency into high and low. By predicting whether the sgRNA efficiency is high or low, a classification model can be used to predict the efficiency of target sites on the genome.
One major issue with this method is that CRISPR data tends to be imbalanced. To get the best results from a model, the threshold between high and low needs to be adjusted. Resampling of the data is also used to counteract class imbalance; oversampling the minority class can help improve the model’s accuracy. The classification models are not perfect at predicting target efficiency, but they act as a good baseline while more accurate regression algorithms are researched.
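The thresholding step can be sketched in a few lines. The efficiency values and the two cut-offs (50% and 70%) below are hypothetical, chosen only to show how moving the threshold changes which guides count as high.

```python
def label_efficiency(efficiencies, threshold=50.0):
    """Map continuous efficiencies (0-100 %) to 'high' / 'low' classes."""
    return ["high" if e >= threshold else "low" for e in efficiencies]


# Hypothetical measured efficiencies (percent) for five sgRNAs.
efficiencies = [12.0, 48.5, 50.0, 73.2, 91.0]

labels_at_50 = label_efficiency(efficiencies, threshold=50.0)
labels_at_70 = label_efficiency(efficiencies, threshold=70.0)

print(labels_at_50)  # ['low', 'low', 'high', 'high', 'high']
print(labels_at_70)  # ['low', 'low', 'low', 'high', 'high']
```

Notice that the same data gives a 2-to-3 split at one threshold and a 3-to-2 split at the other, so the choice of cut-off directly shapes how balanced the classes are.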
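Here is a rough sketch of random oversampling, assuming the simplest version of the idea: duplicate randomly chosen minority-class samples until every class is as large as the biggest one. The guide names, labels, and the `oversample_minority` helper are made up for illustration.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical imbalanced data: four low-efficiency guides, two high.
guides = ["g1", "g2", "g3", "g4", "g5", "g6"]
labels = ["low", "low", "low", "low", "high", "high"]

balanced_guides, balanced_labels = oversample_minority(guides, labels)
print(Counter(balanced_labels))  # both classes now have 4 samples
```

In practice this would normally be applied to the training split only, so that duplicated samples do not leak into the data used to evaluate the model.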
To best predict sgRNA efficiency, the right data features need to be selected and tailored to train the ML model. One advantage in machine learning with CRISPR is the universal nature of sequence information. Knowing how nucleotides interact and having a databank of CRISPR interactions from previous experiments creates more training data for modeling. To train the model to make more accurate predictions, other features need to be considered, such as sgRNA size, along with selecting the right epigenetic data to include without overfitting the model.
Selecting features is only part of the ML process; the features also need to be transformed so that the ML algorithms can work with the data. DNA is a string whose information is encoded in the order of the 4 nucleotides. To help the algorithm, the sequence data needs to be converted into numeric values, for example with one-hot encoding.
This figure shows the different methods of string processing. D shows the one-hot encoding conversion of the nucleotides into an array of binary values.
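One-hot encoding is easy to sketch by hand. The column order A, C, G, T and the function name below are assumptions; any fixed order works as long as it is used consistently.

```python
NUCLEOTIDES = "ACGT"  # assumed column order for the encoding


def one_hot_encode(sequence):
    """Turn a DNA string into a list of 4-element binary rows, one per base."""
    encoded = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        encoded.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return encoded


matrix = one_hot_encode("GATC")
for base, row in zip("GATC", matrix):
    print(base, row)
# G [0, 0, 1, 0]
# A [1, 0, 0, 0]
# T [0, 0, 0, 1]
# C [0, 1, 0, 0]

# Many algorithms expect one flat feature vector per guide.
flat = [bit for row in matrix for bit in row]
```

Each base becomes a row with exactly one 1 in it, so no nucleotide is treated as numerically "bigger" than another.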
When considering features for sequence data, the grouping of the nucleotides can also be considered. Nucleotide pair length and known domains in the genome can also be factors in feature creation and selection.
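As a small sketch of grouping-based features, the snippet below counts overlapping pairs of nucleotides and computes GC content (the fraction of G and C bases). These two features and the toy sequence are assumptions picked for illustration, not a list taken from a specific study.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq, k=2):
    """Count every overlapping run of k nucleotides (k=2 gives pairs)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


guide = "GAGACCTT"  # hypothetical toy sequence

pairs = kmer_counts(guide, k=2)
print(gc_content(guide))  # 0.5
print(pairs["GA"])        # 2
```

Changing `k` changes the group length, which is one way the grouping of nucleotides can itself become a modeling choice.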
What algorithm is right for us?
When we first started talking about the modeling process, we touched on how the target variable could be either discrete or continuous. One common method used in CRISPR modeling is the support vector machine (SVM). The SVM is appealing because it can be used for both classification and regression, by implicitly mapping the features into a higher-dimensional (potentially infinite-dimensional) space where they can be separated more easily. A pitfall of this method is the algorithm’s inability to show which features played a role in the decision process.
When we consider the sgRNA efficiency from above, the interactions between features play a big role. One of the best ways to model interactions between features is a decision tree. The tree splits the data into distinct groups that help predict the class of the sgRNA (high or low efficiency).
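The "move to a higher dimension" idea can be shown without an SVM library. In this toy sketch (all values are made up), a single feature cannot be split into two classes with one cut, but adding a squared copy of the feature makes a clean cut possible. A real SVM does this kind of mapping implicitly through a kernel; the explicit `lift` function here is only an illustration.

```python
def lift(x):
    """Map a 1-D feature x to the 2-D point (x, x**2)."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """True if one cut along `values` puts each class on its own side."""
    sorted_labels = [l for _, l in sorted(zip(values, labels))]
    # Separable if the label changes at most once along the sorted axis.
    changes = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return changes <= 1


# Hypothetical feature values; class 1 sits at both extremes.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, 0, 0, 1]

squared = [lift(x)[1] for x in xs]  # the new, second dimension

print(separable_by_threshold(xs, ys))       # False
print(separable_by_threshold(squared, ys))  # True
```

The catch mentioned above still applies: once the data lives in the lifted space, it is harder to say which original feature drove a given prediction.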
Here is an example of a decision tree that shows the nucleotide content of the sgRNA and the splits between features. The first split is on the percentage of certain nucleotides, creating pure nodes. The next split is on the position of specific nucleotides. This method helps identify which features most influence the sgRNA efficiency.
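To see how a tree picks a split, here is a minimal sketch that scores candidate thresholds with Gini impurity, one common purity measure (the source does not say which criterion was used, so treat that choice as an assumption). The GC fractions and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_high = sum(labels) / n
    return 1.0 - p_high ** 2 - (1.0 - p_high) ** 2


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: GC fraction of six guides, and whether efficiency was high (1) or low (0).
gc = [0.30, 0.35, 0.40, 0.55, 0.60, 0.65]
is_high = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(gc, is_high)
print(round(threshold, 3), impurity)  # 0.475 0.0
```

An impurity of 0.0 means both sides of the split are pure nodes. A full tree repeats this search on each side, and across all features, until it stops improving.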
The best way to utilize a tree method for sgRNA is to use an ensemble method like random forest. This method builds a multitude of short trees and outputs a more accurate class prediction by aggregating the predictions of all of the trees (for example, by majority vote or averaging).
The limitations of current ML with CRISPR
The modeling process is limited by the data available to train the model. CRISPR is a relatively new tool, and while some pathways have large data sets available, others are limited by small amounts of available data. With models limited by how much experimental data exists, the chance of biased, under-fit models rises. As experimental data increases, the models for CRISPR interactions will improve.
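The ensemble idea can be sketched with the shortest possible trees: one-split "stumps". Each stump below is trained on a random resample of the data and a randomly chosen feature, and the forest predicts by majority vote. This is a simplified stand-in for a real random forest, and the two features (GC fraction and a 0/1 flag for a G at the last position) and all values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None  # (errors, threshold, label_below, label_above)
    for t in sorted({r[feature] for r in rows}):
        below = [l for r, l in zip(rows, labels) if r[feature] <= t]
        above = [l for r, l in zip(rows, labels) if r[feature] > t]
        lab_below = Counter(below).most_common(1)[0][0]
        lab_above = Counter(above).most_common(1)[0][0] if above else lab_below
        errors = sum(l != lab_below for l in below) + sum(l != lab_above for l in above)
        if best is None or errors < best[0]:
            best = (errors, t, lab_below, lab_above)
    _, t, lab_below, lab_above = best
    return feature, t, lab_below, lab_above


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [lb if row[f] <= t else la for f, t, lb, la in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical features per guide: (GC fraction, 1 if the last base is G else 0).
rows = [(0.30, 0), (0.35, 0), (0.40, 0), (0.55, 1), (0.60, 1), (0.65, 1)]
labels = ["low", "low", "low", "high", "high", "high"]

forest = fit_forest(rows, labels)
print(predict(forest, (0.20, 0)), predict(forest, (0.70, 1)))
```

Any single stump can be thrown off by an unlucky resample, but the vote across many of them is much more stable, which is the main appeal of the ensemble.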
The wrap up
The use of machine learning in CRISPR research is growing rapidly. Model-driven selection of sgRNA is improving experimental design and creating larger data sets to further improve ML models. The complexity of the data and unknown environmental factors still need further research in order to create new, more complex models. ML models and experimental design will continue to build off each other, expediting experiments and reducing the guesswork of sgRNA engineering.
O’Brien, Aidan R, et al. “Domain-Specific Introduction to Machine Learning Terminology, Pitfalls and Opportunities in CRISPR-Based Gene Editing.” Briefings in Bioinformatics, 2020, doi:10.1093/bib/bbz145.