
Compendium of primary head and neck cancer gene expression datasets with accompanying clinical data

Research data > Dataset, 30 Jun 2023. Publisher: Zenodo

Authors: Brennan, Kevin; Gentles, Andrew

doi: 10.5281/zenodo.7679088, 10.5281/zenodo.7679087


Abstract

Data Processing Methods

Curating HNC gene expression datasets

Head and neck cancer (HNC) gene expression studies were primarily accessed from GEO and ArrayExpress. Relevant studies were identified by searching for the term "Cancer" in combination with each of the terms "Head and neck", "Oral", "Laryngeal", "Oropharyngeal", and "Hypopharyngeal", and by reviewing all datasets retrieved by these searches. For GEO searches, datasets were restricted to those with a minimum of ten samples. We identified additional datasets by searching the reports associated with these datasets, as well as additional review articles, until we were unable to identify any further suitable datasets.

Clinical data were accessed from the metadata that accompanied each dataset within the databases, as well as from relevant reports and correspondence with authors. All clinical metadata related to survival (any survival measure) and lymph node metastasis (LNM) were accessed. Data indicating tumor grade were also accessed where available. Other variables that were accessed included demographic information (patient age, sex, and reported ancestry (race or ethnicity)), clinicopathological variables (tumor subsite, HPV status, and the measure of HPV status), details of the patient study (country of sample collection), and data pertaining to HNC-related risk habits (smoking and alcohol consumption status and intensity measures). For the TCGA study, HPV status data were accessed from a publication that applied VirusScan(1) to detect HPV RNA within raw RNA sequencing reads, representing the most complete source of HPV status data in terms of patient numbers.

To spot-check the accuracy of clinical data, patient sex was inferred from the ratio of expression of the XIST and RPS4Y1 genes and compared with the clinical annotation of sex. This resulted in the exclusion of two studies that had inconsistent clinical data.
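As a rough illustration of the sex-based spot check described above, the following Python sketch infers sex from the log2 ratio of XIST to RPS4Y1 expression and reports how often the inferred sex disagrees with the clinical annotation. The zero threshold, the function names, and the record keys are assumptions made for illustration only; the record does not state the cutoff or the code that was actually used.

```python
def infer_sex(xist_log2, rps4y1_log2, threshold=0.0):
    """Infer sex from the log2 ratio of XIST to RPS4Y1 expression.

    On the log2 scale the ratio is a difference. The zero threshold is an
    assumption for illustration, not a value taken from the dataset record.
    """
    return "female" if (xist_log2 - rps4y1_log2) > threshold else "male"


def sex_mismatch_rate(samples):
    """Fraction of annotated samples whose inferred sex disagrees with the annotation.

    `samples` is a list of dicts with hypothetical keys:
    'xist' and 'rps4y1' (log2 expression) and 'sex' ('female', 'male', or None).
    """
    annotated = [s for s in samples if s["sex"] is not None]
    if not annotated:
        return 0.0
    mismatches = sum(
        infer_sex(s["xist"], s["rps4y1"]) != s["sex"] for s in annotated
    )
    return mismatches / len(annotated)


# Toy example with made-up values.
toy_samples = [
    {"xist": 9.5, "rps4y1": 3.0, "sex": "female"},
    {"xist": 2.5, "rps4y1": 10.0, "sex": "male"},
    {"xist": 9.0, "rps4y1": 2.0, "sex": "male"},  # annotation disagrees with expression
    {"xist": 3.0, "rps4y1": 9.0, "sex": None},    # no annotation, so it is skipped
]
mismatch_rate = sex_mismatch_rate(toy_samples)
```

A study with a high mismatch rate would be a candidate for closer review; what rate counts as "inconsistent" is a judgment call that the record does not specify.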
Processing gene expression data (meta-analysis datasets)

Gene expression datasets generated using Affymetrix arrays (N=21) were processed as follows. To ensure accurate annotation of microarray probes, raw data (.CEL files) were accessed and processed using the 'affy' R package in combination with platform-specific custom CDF files from Brainarray (http://brainarray.mbni.med.umich.edu/). Expression datasets were normalized using the mas5 algorithm. Samples were then restricted to primary tumors, followed by quantile normalization of the expression data. Probe-level data were then summarized to gene-level data using the WGCNA package(2), with the default 'MaxMean' method for probe selection. For each gene, this method selects the probe with the maximum mean expression across all samples as the representative measure of the gene. Summarized gene data were log2 transformed and converted to standard gene expression scores: for each gene, the score for each patient sample was calculated by subtracting the gene's mean expression and dividing by its standard deviation. The statistical pipelines used to perform meta-analyses were applied to these standard scores.

Eight datasets were generated using non-Affymetrix microarrays (manufactured by Agilent, Illumina, and the German Cancer Research Center). These datasets were downloaded from GEO as series matrix files using the GEOquery R package and preprocessed as follows. Gene names were converted to Entrez IDs using the array annotation 'Platform' files that accompanied each dataset. Where Entrez IDs were not included in the annotation file, gene names were converted to Entrez IDs using biomaRt(3). Datasets were restricted to primary tumors and were filtered to remove samples with missing data for ten percent or more of genes, and to remove genes with missing data for ten percent or more of samples.
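The gene-level summarization and standard-score steps described above for the Affymetrix datasets can be sketched in a few lines. This is a minimal Python illustration, not the authors' pipeline: the original work used the WGCNA R package, and the function names, the toy probe annotation, and the use of the sample (n - 1) standard deviation are all assumptions.

```python
import math
import statistics


def max_mean_collapse(probe_values, probe_to_gene):
    """For each gene, keep the probe with the highest mean expression across samples.

    probe_values: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene ID (hypothetical annotation).
    Returns a dict mapping gene ID -> expression values of the selected probe.
    """
    best = {}  # gene ID -> (mean expression, values) of the best probe seen so far
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without an annotation are dropped
        mean = statistics.fmean(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}


def standard_scores(values):
    """Log2 transform, then subtract the gene's mean and divide by its standard deviation.

    Assumes strictly positive inputs and a gene that varies across samples;
    a constant gene would give a zero standard deviation.
    """
    logged = [math.log2(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1), an assumption
    return [(x - mean) / sd for x in logged]


# Toy example with made-up values: two probes map to geneA, one to geneB.
probes = {
    "p1": [2.0, 4.0, 8.0],
    "p2": [16.0, 32.0, 64.0],
    "p3": [1.0, 2.0, 4.0],
}
annotation = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}

genes = max_mean_collapse(probes, annotation)
scores = {gene: standard_scores(values) for gene, values in genes.items()}
```

In the toy data, probe p2 is selected for geneA because its mean expression is higher than that of p1, and each gene's scores end up with a mean of zero and a standard deviation of one across samples.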
The non-Affymetrix datasets were then quantile normalized. For genes with multiple probes, the WGCNA package was used to identify the probe with the maximum mean expression across samples, which was selected as the representative measure for each gene. Datasets were then log2 transformed, if not already in log2 space, and converted to standard gene expression scores as described for Affymetrix-based datasets.

Preprocessed TCGA bulk RNA-Seq data (gene-level HTSeq counts) were downloaded using TCGAbiolinks(4). TCGA data were processed for meta-analyses using an approach consistent with that used for array-based datasets: the dataset was restricted to primary tumor samples and then quantile normalized. Gene names were converted from Ensembl IDs to Entrez IDs using biomaRt(5). Ensembl ID-level data were summarized to Entrez gene-level data using the WGCNA 'collapseRows' function, with the default 'MaxMean' method used to select the feature with higher expression where an Entrez ID matched multiple Ensembl IDs. The dataset was then log2 transformed and converted to standard gene expression scores as described for Affymetrix-based datasets.

For applications other than meta-analyses, TCGA RNA-Seq data were processed using an alternative normalization approach in order to process primary HNC and tumor-adjacent normal samples in parallel, because quantile normalization assumes similar data distributions across samples(6). HTSeq counts were converted to standard scores such that expression data for each HNC sample had a mean of zero and a standard deviation of 1. Standard scores were then log2 transformed and batch corrected (correcting for sample plate) using ComBat(7). Gene names were converted from Ensembl IDs to Entrez IDs using biomaRt(3). Ensembl ID-level data were summarized to Entrez gene-level data using the WGCNA 'collapseRows' function(2), with the default 'MaxMean' method used to select the feature with higher expression where an Entrez ID matched multiple Ensembl IDs.

Bibliography

1. Cao S, Wylie KM, Wyczalkowski MA, Karpova A, Ley J, Sun S, et al. Dynamic host immune response in virus-associated cancers. Commun Biol. 2019;
2. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics [Internet]. 2008 Dec 29 [cited 2022 Feb 11];9(1):1–13. Available from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559
3. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal [Internet]. 2013;6(269):pl1. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23550210 and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4160307
4. Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res [Internet]. 2016 May 5 [cited 2022 Feb 11];44(8):e71. Available from: https://pubmed.ncbi.nlm.nih.gov/26704973/
5. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics [Internet]. 2005 Aug 15 [cited 2022 Feb 11];21(16):3439–40. Available from: https://academic.oup.com/bioinformatics/article/21/16/3439/215235
6. Zhao Y, Wong L, Goh WW Bin. How to do quantile normalization correctly for gene expression data analyses. Sci Rep [Internet]. 2020 Sep 23 [cited 2022 Feb 11];10(1):1–11. Available from: https://www.nature.com/articles/s41598-020-72664-6
7. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

We assembled a compendium of 30 primary HNC gene expression datasets with accompanying clinical data, representing the largest such resource for HNC. This resource was specifically built to identify genes associated with two outcome variables: patient survival and lymph node metastasis (LNM) status. Meta-analyses were applied to uniformly preprocessed gene expression data, as in our PRECOG resource (Gentles et al, Nat Med, 2016). Briefly, datasets were quality controlled, normalized, log transformed, and standardized to calculate gene expression profiles. Clinical data were manually curated and included survival and LNM status as well as variables relevant to HNC prognosis, such as tumor grade, tumor subanatomic location, and HPV status. The resulting 30 cleaned studies included 2,134 HNC tumors. 1,666 patients (across 17 cohorts) had survival outcome data and 1,490 patients (21 cohorts) had LNM status. Fully processed datasets are provided here as a resource to enable efficient meta-analyses of gene expression data in head and neck cancer.

Related Organizations

Stanford University
United States

Keywords

Head and neck cancer, gene expression, meta-analysis

Impact by BIP!

	citations	0
	popularity	Average
	influence	Average
	impulse	Average