
Abstract

Motivation: CRISPR/Cas9 is driving a broad range of innovative applications, from basic biology to biotechnology and medicine. One of its outstanding problems is off-target editing, which must be resolved and, ideally, avoided altogether when the system is used.

Results: We developed an ensemble learning method to detect the off-target sites of a single guide RNA (sgRNA) among its thousands of genome-wide candidates. Nucleotide mismatches between on-target and off-target sites have been studied recently. We confirm that there is strong mismatch enrichment, with clear preferences, in the regions close to the 5′ end of the off-target sequences. Compared with the on-target sites, sequences of no-editing sites can also be characterized by changes in GC composition and by position-specific mismatch binary features. In this novel feature space, an ensemble strategy was applied to train a prediction model. The model achieved a mean Area Under the Receiver Operating Characteristic curve of 0.99 and a mean Area Under the Precision-Recall curve of 0.45 in cross-validation on large datasets, outperforming state-of-the-art methods in various test scenarios. Our predicted off-target sites also correspond very well to those detected by high-throughput sequencing techniques. In particular, two case studies on selecting sgRNAs to cure hearing loss and retinal degeneration partly demonstrate the effectiveness of our method.

Availability and implementation: The Python and MATLAB versions of the source code for detecting off-target sites of a given sgRNA, together with the supplementary files, are freely available at https://github.com/penn-hui/OfftargetPredict.

Supplementary information: Supplementary data are available at Bioinformatics online.
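To make the feature description in the Results concrete, the following is a minimal sketch of how position-specific mismatch binary features and a GC composition change could be computed for one on-target/candidate pair. It is an illustration under assumptions, not the authors' implementation: the function names, the feature ordering, and the two 20-nt sequences are hypothetical, and the published code in the repository may encode these features differently.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def mismatch_flags(on_target: str, candidate: str) -> List[int]:
    """One binary flag per position: 1 where the candidate differs from the on-target site."""
    if len(on_target) != len(candidate):
        raise ValueError("sequences must have equal length")
    return [int(a != b) for a, b in zip(on_target.upper(), candidate.upper())]


def encode_pair(on_target: str, candidate: str) -> List[float]:
    """Concatenate the per-position mismatch flags with the GC composition change.

    The layout (flags first, GC change last) is an assumption made for this sketch.
    """
    flags = mismatch_flags(on_target, candidate)
    gc_change = gc_content(candidate) - gc_content(on_target)
    return [float(f) for f in flags] + [gc_change]


# Illustrative sequences only (hypothetical, not taken from the paper's datasets).
on_target = "GAGTCCGAGCAGAAGAAGAA"
candidate = "GAGTTAGAGCAGAAGAAGAA"  # differs at 0-based positions 4 and 5

features = encode_pair(on_target, candidate)
mismatch_positions = [i for i, f in enumerate(features[:-1]) if f == 1.0]
print(mismatch_positions, round(features[-1], 3))
```

A vector of this kind could then be passed to any ensemble classifier; which learner and which additional features the authors use is not specified in the abstract.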
anzsrc-for: 46 Information and computing sciences, 570, Base Composition, Genome, anzsrc-for: 01 Mathematical Sciences, High-Throughput Nucleotide Sequencing, 3 Good Health and Well Being, anzsrc-for: 49 Mathematical sciences, RNA, Guide, CRISPR-Cas Systems, Machine Learning, 3102 Bioinformatics and Computational Biology, anzsrc-for: 06 Biological Sciences, RNA, anzsrc-for: 31 Biological Sciences, anzsrc-for: 3102 Bioinformatics and Computational Biology, CRISPR-Cas Systems, anzsrc-for: 08 Information and Computing Sciences, Guide, Software, 31 Biological Sciences, Biotechnology
| Indicator | Value |
| --- | --- |
| Selected citations | 48 |
| Popularity | Top 10% |
| Influence | Top 10% |
| Impulse | Top 10% |
