Downloads TCGA data and puts it into a nice HDF file


Downloads TCGA data from the Broad Institute's GDAC Firehose pipeline and stores it in convenient HDF files.

We've already done the hard work! Feel free to download the HDF files directly from Zenodo:

https://doi.org/10.5281/zenodo.4434748

  • tcga_omic.tar.gz : multi-omic data (1.6GB)

  • tcga_clinical.tar.gz : clinical annotations (6.2MB)
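
If you just want the prepackaged data, a minimal Python sketch along the following lines will fetch and unpack the multi-omic archive. The direct download URL is an assumption here (it follows Zenodo's usual record/files pattern), and the HDF file names inside the tarball are whatever the archive contains -- adjust as needed.

import tarfile
import urllib.request

# Assumed Zenodo file URL; check the record page if the download fails
url = "https://zenodo.org/record/4434748/files/tcga_omic.tar.gz"
archive_path = "tcga_omic.tar.gz"

urllib.request.urlretrieve(url, archive_path)

# Unpack the archive and list the HDF file(s) it contains
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(".")
    print(tar.getnames())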

The Cancer Genome Atlas (TCGA)

  • TCGA includes multi-omic and clinical data, collected from 11,000+ patients divided into 38 cancer types.

  • It is a really compelling dataset for supervised or unsupervised machine learning tasks.

  • You can access TCGA data via FireBrowse or the firehose_get command line tool.

  • Unfortunately, these tools (FireBrowse/Firehose) are inconvenient for the uninitiated.

    • Unless you have a lot of bioinformatics background knowledge, it can be difficult to understand which data you should actually download.

    • Once you've downloaded the data, it exists as several GB of zipped text files with long, complicated names.

This repository contains a complete workflow for (i) downloading useful kinds of TCGA data and (ii) storing it in a sensible format -- an HDF file.

Setup and execution

If you really want to run the code in this repo for yourself, then take the following steps:

  1. Clone the repository: git clone git@github.com:dpmerrell/tcga-pipeline

  2. Set up your Python environment and install the dependencies: pip install -r requirements.txt. (I recommend doing this in a virtual environment.)

  3. Make sure you have plenty of disk space. The downloaded, unzipped, and partially processed data occupies about 280GB on disk. (In contrast, the final HDF files take <4GB unzipped.)

  4. Make sure you have plenty of time. The downloads take a while -- consider running the workflow overnight.

  5. If you're feeling brave, adjust the parameters in the config.yaml file.

  6. Run the Snakemake workflow: snakemake --cores 1. Using more cores doesn't necessarily buy you any speed, since most of the time is spent downloading data.

Structure of the data

[Figure: TCGA data schematic]

Multi-omic data

This workflow produces an HDF file of multi-omic data, with the structure illustrated above.

  • /omic_data: HDF5 group containing the multi-omic data, along with row and column information.

  • /barcodes: HDF5 group containing full TCGA barcodes for the data samples, i.e., the barcode for each sample and each omic type (when it exists). Full barcodes are potentially useful for modeling batch effects.

Omic data feature names take the following form:

GENE_DATATYPE

Possible values for DATATYPE include mutation, cnv, methylation, mrnaseq, and rppa.

There are many missing values, indicated by NaNs -- not all measurements were taken for all patients.
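
For example, here is a minimal sketch for loading the multi-omic data with h5py. The dataset names follow the merging script shown further down this page (scripts/merge_all_omic_data.py); the file name is a placeholder.

import h5py
import numpy as np

with h5py.File("tcga_omic.hdf", "r") as f:   # placeholder file name
    # features x patients matrix of floats; NaN marks missing measurements
    data = f["omic_data/data"][:, :]
    # row labels of the form GENE_DATATYPE, e.g. "TP53_mutation"
    features = f["omic_data/features"][:].astype(str)
    # column labels: patient IDs, plus one cancer-type label per patient
    patients = f["omic_data/instances"][:].astype(str)
    cancer_types = f["omic_data/cancer_types"][:].astype(str)
    # full TCGA barcodes: one row per omic type, one column per patient ("" if absent)
    barcodes = f["barcodes/data"][:, :].astype(str)

print(data.shape, "fraction missing:", np.isnan(data).mean())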

Clinical data

This workflow also produces an HDF file of clinical data. Its structure is also shown in the figure.
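
A similar sketch for the clinical file, assuming the dataset names used by scripts/merge_clinical_data.py below (again, the file name is a placeholder):

import h5py

with h5py.File("tcga_clinical.hdf", "r") as f:   # placeholder file name
    # 2-D array of strings: one row per entry of /index, one column per patient
    clinical = f["data"][:, :].astype(str)
    fields = f["index"][:].astype(str)               # clinical annotation names
    patients = f["columns"][:].astype(str)           # patient IDs
    cancer_types = f["cancer_types"][:].astype(str)  # one cancer type per patient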

Some provenance details

We download data from particular points in the Broad Institute GDAC Firehose pipeline.

  • Copy number variation

  • Somatic Mutation annotations

    • Mutation_Packager_Oncotated_Calls node in this DAG

  • Methylation

    • Methylation_Preprocess node in this DAG

    • We recover the barcodes for methylation samples by inspecting the corresponding entries in "humanmethylation450" files whenever they exist, falling back to "humanmethylation27" files otherwise.

  • Gene expression

    • mRNAseq_Preprocess node in this DAG

    • We recover the barcodes for RNAseq samples by inspecting the corresponding entries in "illuminahiseq" files whenever they exist, falling back to "illuminaga" files otherwise.

  • Reverse Phase Protein Array

  • Clinical Data

Licensing/Legal stuff

(c) David Merrell 2021

The software in this repository is distributed under an MIT license. See LICENSE.txt for details.

Note: downloading data from the Broad TCGA GDAC site constitutes agreement to the TCGA Data Usage Policy:

https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy

See also this note from the NIH: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga

  • Guidelines for citing TCGA in your research

  • Data usage in publications: "...all TCGA data are available without restrictions on their use in publications or presentations."

Code Snippets

scripts/merge_all_omic_data.py -- merges the per-omic-type HDF files into the combined multi-omic HDF file:

import argparse
import numpy as np
import h5py


def get_patients_and_features(hdf_path_list):

    print("Getting patients and features")

    features = []
    ctype_to_patients = {}
    for hdf_path in hdf_path_list:

        with h5py.File(hdf_path, "r") as f:

            features += list(f["index"][:].astype(str))
            new_patients = f["columns"][:]
            new_ctypes = f["cancer_types"][:]

            for pat, ct in zip(new_patients, new_ctypes):
                if ct not in ctype_to_patients.keys():
                    ctype_to_patients[ct] = set([pat])
                else:
                    ctype_to_patients[ct].add(pat)

    return ctype_to_patients, features



def initialize_output(output_path, all_patients, all_ctypes, features):

    print("Initializing output")

    f_out = h5py.File(output_path, 'w')

    # OMIC DATA GROUP
    index = f_out.create_dataset("omic_data/features", shape=(len(features),),
                                 dtype=h5py.string_dtype('utf-8'))
    index[:] = features

    dataset = f_out.create_dataset("omic_data/data", shape=(len(features), len(all_patients)))
    dataset[:,:] = np.nan

    columns = f_out.create_dataset("omic_data/instances", shape=(len(all_patients),),
                                   dtype=h5py.string_dtype('utf-8'))
    columns[:] = all_patients

    ctypes = f_out.create_dataset("omic_data/cancer_types", shape=(len(all_ctypes),),
                                  dtype=h5py.string_dtype('utf-8'))
    ctypes[:] = all_ctypes

    # BARCODE GROUP
    omic_types = [str(feat).split("_")[-1] for feat in features]
    unique_omic_types = sorted(set(omic_types))

    barcodes = f_out.create_dataset("barcodes/data", shape=(len(unique_omic_types), len(all_patients)),
                                    dtype=h5py.string_dtype('utf-8'))
    barcodes[:,:] = ""

    barcode_columns = f_out.create_dataset("barcodes/instances", shape=(len(all_patients),),
                                           dtype=h5py.string_dtype('utf-8'))
    barcode_columns[:] = all_patients

    barcode_index = f_out.create_dataset("barcodes/features", shape=(len(unique_omic_types),),
                                    dtype=h5py.string_dtype('utf-8'))
    barcode_index[:] = unique_omic_types

    return f_out


def compute_chunks(idx_ls):

    idx_ls_srt = sorted(idx_ls)
    chunks = []

    cur_start = idx_ls_srt[0]
    cur_end = idx_ls_srt[0]
    for idx in idx_ls_srt[1:]:
        if idx == cur_end + 1:
            cur_end += 1
        else:
            chunks.append([cur_start, cur_end])
            cur_start = idx
            cur_end = idx
    chunks.append([cur_start, cur_end])

    return chunks


def add_dataset(input_path, patient_to_col, f_out, leading, lagging):

    print("Adding data from ", input_path)
    with h5py.File(input_path, "r") as f_in:

        # Update the leading row index
        leading += len(f_in["index"]) 

        # Need to build maps between
        # columns of f_in and columns of f_out.
        # The columns of f_in are a subset of the 
        # columns of f_out.
        new_data = f_in["data"][:,:]
        in_out_idx = [patient_to_col[pat] for pat in f_in["columns"][:]]
        out_in_idx = {out_idx: in_idx for in_idx, out_idx in enumerate(in_out_idx)}

        # Writing to HDF is much more efficient if we
        # write "chunks" of columns at a time
        # (rather than individual columns)
        out_column_chunks = compute_chunks(in_out_idx)
        for chunk in out_column_chunks:
            mapped_chunk = [out_in_idx[out_idx] for out_idx in range(chunk[0],chunk[1]+1)]
            f_out["omic_data/data"][lagging:leading, chunk[0]:chunk[1]+1] = new_data[:,mapped_chunk]

        barcode_omics = f_out["barcodes/features"][:].astype(str)
        omic_type = f_in["index"][:1].astype(str)[0].split("_")[-1]
        barcode_omic_row = np.argwhere(barcode_omics == omic_type)[0][0]
        f_out["barcodes/data"][barcode_omic_row, in_out_idx] = f_in["barcodes"][:]

    # Update the lagging row index
    lagging = leading

    return leading, lagging


def add_all_datasets(input_path_list, ctype_to_patients, features, output_path):

    # Some bookkeeping data structures
    ctypes = sorted(list(ctype_to_patients.keys()))
    all_patients = sum([sorted(list(ctype_to_patients[k])) for k in ctypes], start=[])
    patient_to_ctype = {pat: k for k in ctypes for pat in ctype_to_patients[k]}
    all_ctypes = [patient_to_ctype[pat] for pat in all_patients] 
    patient_to_col = {pat: idx for idx, pat in enumerate(all_patients)}

    f_out = initialize_output(output_path, all_patients, all_ctypes, features)

    leading = 0
    lagging = 0
    for i, input_path in enumerate(input_path_list):
        leading, lagging = add_dataset(input_path, patient_to_col, f_out, 
                                       leading, lagging)

    f_out.close()

    return


if __name__=="__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument("--hdf-path-list", nargs="+", help="list of HDF files to merge")
    argparser.add_argument("--output", help="path to the output: a combined HDF file")

    args = argparser.parse_args()

    ctype_to_patients, features = get_patients_and_features(args.hdf_path_list) 

    add_all_datasets(args.hdf_path_list, ctype_to_patients, features, args.output)
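
(For reference, this script can also be invoked by hand, along the lines of python scripts/merge_all_omic_data.py --hdf-path-list mutation.hdf cnv.hdf --output tcga_omic.hdf, where the file names are placeholders; the Snakemake snippets further down show how the pipeline actually calls it.)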

scripts/merge_clinical_data.py -- merges the per-cohort clinical tables into the clinical HDF file:

import argparse
import pandas as pd
import h5py


def standardize_patient_id(patient_id):
    std_id = patient_id.upper()
    std_id = std_id.replace("_","-")
    std_id = "-".join(std_id.split("-")[:3])
    return std_id

def read_clinical_data(data_filename):

    index_col = "bcr_patient_barcode"

    # read the clinical table (tab-separated; skip the extra header rows)
    df = pd.read_csv(data_filename, sep="\t", skiprows=[1,2])

    # index the rows by the barcode column
    df.set_index(index_col, inplace=True)

    # standardize the patient IDs and put the columns in sorted order
    df.columns = df.columns.map(standardize_patient_id)
    srt_cols = sorted(df.columns)
    df = df[srt_cols]

    return df


def read_all_data(data_filenames, all_cancer_types):

    ctype_ls = []
    combined = pd.DataFrame()

    for data_file, cancer_type in zip(data_filenames, all_cancer_types):
        new_df = read_clinical_data(data_file)
        ctype_ls += [cancer_type]*new_df.shape[1]
        combined = pd.concat((combined, new_df), axis=1)

    return combined, ctype_ls


def store_all_data(combined_data, cancer_type_list, output_filename):

    combined_data = combined_data.astype(str)

    with h5py.File(output_filename, "w") as out_f:

        dset = out_f.create_dataset("data", combined_data.shape,
                                    dtype=h5py.string_dtype('utf-8'))
        dset[:,:] = combined_data.values

        rowset = out_f.create_dataset("index", combined_data.index.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        rowset[:] = combined_data.index.values

        colset = out_f.create_dataset("columns", combined_data.columns.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        colset[:] = combined_data.columns.values

        ctype_set = out_f.create_dataset("cancer_types", (len(cancer_type_list),),
                                         dtype=h5py.string_dtype('utf-8'))
        ctype_set[:] = cancer_type_list

    return 


if __name__=="__main__":

    parser = argparse.ArgumentParser(description="Get data from text files; merge it; and put it in an HDF file")
    parser.add_argument("--csv-files", nargs="+", help="a list of paths to clinical CSV files")
    parser.add_argument("--cancer-types", nargs="+", help="a list of cancer type names")
    parser.add_argument("hdf_file", help="path to the output HDF file")

    args = parser.parse_args()

    combined_df, ctype_ls = read_all_data(args.csv_files, args.cancer_types) 

    store_all_data(combined_df, ctype_ls, args.hdf_file)
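
(As above, this script can be run on its own, e.g. python scripts/merge_clinical_data.py tcga_clinical.hdf --csv-files BRCA.clin.tsv LUAD.clin.tsv --cancer-types BRCA LUAD, with placeholder file names; the corresponding Snakemake rule appears below.)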

Selected shell commands from main/Snakefile:

From line 76 of main/Snakefile:

shell:
    "python {input.src} {input.hdf} {output.img} {params.opts}"

From line 84 of main/Snakefile:

shell:
    "python scripts/merge_all_omic_data.py --hdf-path-list {input} --output {output}"

From line 105 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input.csv_files} --cancer-types {params.ctypes}"

From line 118 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input.restored} --cancer-types {params.ctypes}"

From line 143 of main/Snakefile:

shell:
    "python {input.script} {input.data_tsv} {input.barcoded_tsv} {output} {input.barcode_backup_tsv}"

From line 163 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input} --cancer-types {params.ctypes}"

From line 179 of main/Snakefile:

shell:
    "python {input.script} {input.annotations_txt} {input.mutsig} {output}"

From line 188 of main/Snakefile:

shell:
    "python scripts/merge_clinical_data.py {output} --csv-files {input} --cancer-types {CANCER_TYPES}"

From line 201 of main/Snakefile:

shell:
    "tar -xvzf {input} --directory {wildcards.run_type}__{TIMESTAMP}/{wildcards.cancer_type}/{TIMESTAMP_ABBR}/"

From line 213 of main/Snakefile:

shell:
    "./{input} -b -tasks {wildcards.data_type} {wildcards.run_type} {TIMESTAMP} {wildcards.cancer_type}"

From line 226 of main/Snakefile:

shell:
    "./{input} -b -tasks {wildcards.analysis_id} analyses {TIMESTAMP} {wildcards.cancer_type}"

From line 235 of main/Snakefile:

shell:
    "unzip {input}"

From line 242 of main/Snakefile:

shell:
    "wget {FIREHOSE_GET_URL}"

(13 additional snippets from the Snakefile are not shown here.)
