publication . Article . Other literature type . Preprint . 2017

Iterative random forests to discover predictive and stable high-order interactions.

Basu, Sumanta; Kumbier, Karl; Brown, James B; Yu, Bin
Open Access English
  • Published: 20 Nov 2017 Journal: Proceedings of the National Academy of Sciences of the United States of America, volume 115, issue 8, pages 1943-1948 (issn: 0027-8424, eissn: 1091-6490)
  • Publisher: National Academy of Sciences
  • Country: Algeria
Abstract
Significance: We developed a predictive, stable, and interpretable tool: the iterative random forest algorithm (iRF). iRF discovers high-order interactions among biomolecules with the same order of computational cost as random forests. We demonstrate the efficacy of iRF by finding known and promising interactions among biomolecules, of up to fifth and sixth order, in two data examples in transcriptional regulation and alternative splicing.
Subjects
free text keywords: Animals; Drosophila; Computational Biology; Gene Expression Regulation, Developmental; Alternative Splicing; Algorithms; Models, Genetic; Gene Regulatory Networks; Genome-Wide Association Study; Human Genome; Genetics; Biotechnology; 1.1 Normal biological development and functioning; Generic Health Relevance; Biological Sciences; Systems Biology; Physical Sciences; Statistics; high-order interaction; random forests; stability; interpretable machine learning; genomics; Multidisciplinary
Funded by
NSF| Emerging Frontiers of Science of Information
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 0939370
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Computing and Communication Foundations
NIH| Nonparametric methods for functional and translational genomics
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 5R00HG006698-04
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
NIH| Removing statistical bottle-necks in data analysis for the ENCODE Consortium
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 1U01HG007031-01
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
NIH| Biomedical Big Data Training Program at UC Berkeley
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 1T32LM012417-01
  • Funding stream: NATIONAL LIBRARY OF MEDICINE