publication . Article . 2017

ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics

Jiangming Sun
  • Published: 01 Mar 2017
Abstract
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL) including structure, target information and activity annotations. Our aspiration is to create a...
Subjects
free text keywords: Physical and Theoretical Chemistry, Library and Information Sciences, Computer Graphics and Computer-Aided Design, Computer Science Applications, ChEMBL, Chemogenomics, Big data, Cheminformatics, Data point, Computer science, Data science, Predictive modelling, PubChem, Quantitative structure–activity relationship, Data mining, Bioinformatics, Database, Bioactivity, Chemical structure, Molecular fingerprints, Search engine, QSAR
Related Organizations
Funded by
EC | ExCAPE
Project
ExCAPE
Exascale Compound Activity Prediction Engine
  • Funder: European Commission (EC)
  • Project Code: 671555
  • Funding stream: H2020 | RIA
Communities
FET H2020 | FET HPC: HPC Core Technologies, Programming Environments and Algorithms for Extreme Parallelism and Extreme Data Applications
FET H2020 | FET HPC: Exascale Compound Activity Prediction Engine