TCGADownloadHelper: simplifying TCGA data extraction and preprocessing

Publication: Article, Other literature type, 02 May 2025

Publisher: Frontiers Media SA

Journal: Frontiers in Genetics, volume 16 (eissn: 1664-8021,

Authors: Alexandra Anke Baumann; Olaf Wolkenhauer; Markus Wolfien

doi: 10.3389/fgene.2025.1569290

pmid: 40385985

pmc: PMC12081331

Abstract

The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilitate TCGA data handling, they lack a straightforward combination of all required steps. To address this, we developed a streamlined pipeline using the Genomic Data Commons (GDC) portal’s cart system for file selection and the GDC Data Transfer Tool for data downloads. We use the Sample Sheet provided by the GDC portal to replace the default 36-character opaque file IDs and filenames with human-readable case IDs. We developed a pipeline integrating customizable Python scripts in a Jupyter Notebook and a Snakemake pipeline for ID mapping along with automating data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelper). Our pipeline simplifies the data download process by modifying manifest files to focus on specific subsets, facilitating the handling of multimodal data sets related to single patients. The pipeline essentially reduced the effort required to preprocess data. Overall, this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By establishing a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. It is particularly beneficial for researchers new to genomic data analysis, offering them a practical framework prior to conducting their TCGA studies.
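The ID mapping and manifest subsetting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the linked repository). It assumes the Sample Sheet is a tab-separated file with `File ID`, `File Name`, and `Case ID` columns and that the manifest is tab-separated with the file ID in its first column, which is how the GDC portal typically exports them. The function names, the `<Case ID>_<File Name>` naming scheme, and all example IDs are hypothetical.

```python
import csv
import io

# Made-up example data; real files are exported from the GDC portal cart.
SAMPLE_SHEET = (
    "File ID\tFile Name\tCase ID\n"
    "aaaaaaaa-0000-0000-0000-000000000001\tfirst.vcf.gz\tTCGA-XX-0001\n"
    "aaaaaaaa-0000-0000-0000-000000000002\tsecond.bam\tTCGA-XX-0002, TCGA-XX-0003\n"
)
MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "aaaaaaaa-0000-0000-0000-000000000001\tfirst.vcf.gz\tabc\t10\treleased\n"
    "aaaaaaaa-0000-0000-0000-000000000002\tsecond.bam\tdef\t20\treleased\n"
)


def build_rename_map(sample_sheet_text):
    """Map each file ID to a readable name of the form '<Case ID>_<File Name>'."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    rename_map = {}
    for row in reader:
        # Assumption: if several cases are listed (comma-separated), use the first.
        case_id = row["Case ID"].split(",")[0].strip()
        rename_map[row["File ID"]] = f"{case_id}_{row['File Name']}"
    return rename_map


def filter_manifest(manifest_text, keep_file_ids):
    """Return the manifest restricted to the given file IDs, keeping the header."""
    lines = manifest_text.splitlines()
    header, rows = lines[0], lines[1:]
    kept = [line for line in rows if line.split("\t")[0] in keep_file_ids]
    return "\n".join([header] + kept) + "\n"


rename_map = build_rename_map(SAMPLE_SHEET)
subset = filter_manifest(MANIFEST, {"aaaaaaaa-0000-0000-0000-000000000001"})
```

In a real run, the filtered manifest would be passed to the GDC Data Transfer Tool, and the rename map would then be applied to the downloaded files, which the tool usually places in one directory per file ID.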

Related Organizations

Leibniz Association
Germany
Ostschweizer Fachhochschule OST
Switzerland
San Raffaele Scientific Institute
Italy
Stellenbosch University
South Africa
Leibniz-Institute for Food Systems Biology at the Technical University of Munich
Germany

Keywords

lung cancer, the cancer genome atlas (TCGA), Genetics, genomic data commons (GDC) portal, sample preprocessing, Jupyter Notebook, QH426-470

Impact by BIP!

citations: 1
popularity: Average
influence: Average
impulse: Average

Access routes: Green, gold

Related to Research communities

European University for Smart Urban Coastal Sustainability

Cancer Research