publication . Article . Other literature type . 2016

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

Thompson, Jeffrey A.; Tan, Jie; Greene, Casey S.
Open Access English
  • Published: 01 Jan 2016 Journal: PeerJ, volume 4 (issn: 2167-8359, eissn: 2167-8359)
  • Publisher: PeerJ Inc.
Abstract
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simp...
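Of the methods the abstract names, quantile normalization is the simplest to illustrate. The following is a minimal sketch, assuming a genes-by-samples matrix layout; it is not the authors' TDM method or their code, and the function name `quantile_normalize_to_reference` and the toy arrays are hypothetical. It maps each RNA-seq sample onto the average sorted distribution of a microarray reference set, so that a value at a given rank takes the reference value at the same rank.

```python
import numpy as np


def quantile_normalize_to_reference(target, reference):
    """Map each column (sample) of `target` onto the value distribution of `reference`.

    Both inputs are genes x samples arrays. This is a generic illustration,
    not the TDM transformation described in the paper.
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Reference distribution: sort each reference sample, then average across samples.
    ref_quantiles = np.sort(reference, axis=0).mean(axis=1)

    # Resample the reference quantiles in case the two matrices differ in gene count.
    ref_probs = np.linspace(0.0, 1.0, ref_quantiles.size)
    tgt_probs = np.linspace(0.0, 1.0, target.shape[0])
    ref_at_target = np.interp(tgt_probs, ref_probs, ref_quantiles)

    # Rank of every value within its own column (0 = smallest; ties broken by position).
    ranks = np.argsort(np.argsort(target, axis=0), axis=0)

    # Replace each value with the reference value at the same rank.
    return ref_at_target[ranks]


# Toy data (hypothetical): 3 genes x 2 samples on each platform.
microarray = np.array([[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]])
rnaseq = np.array([[100.0, 5.0], [1.0, 50.0], [10.0, 500.0]])
normalized = quantile_normalize_to_reference(rnaseq, microarray)
```

After the call, every column of `normalized` contains the same set of values as the averaged reference distribution, arranged according to the gene ranks in the corresponding RNA-seq sample.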
Subjects
acm: ComputingMethodologies_PATTERNRECOGNITION
free text keywords: Computational Biology, RNA-sequencing, Quantile normalization, Training, Medicine, Machine learning, Distribution, Microarray, R, Normalization, Bioinformatics, Genomics, Nonparanormal transformation, Cross-platform normalization, Gene expression
Funded by
NIH| SYNERGY: The Dartmouth Center for Clinical and Translational Science
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 5UL1TR001086-03
  • Funding stream: NATIONAL CENTER FOR ADVANCING TRANSLATIONAL SCIENCES
NIH| Cancer Center Support Grant
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 3P30CA023108-38S2
  • Funding stream: NATIONAL CANCER INSTITUTE
NIH| Quantitative Biology Research Institute
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 5P20GM103534-04
  • Funding stream: NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES