Downloads TCGA data and puts it into a nice HDF file


Downloads TCGA data from the Broad Institute's GDAC Firehose pipeline and stores it in convenient HDF files.

We've already done the hard work! Feel free to download the HDF files directly from Zenodo:

https://doi.org/10.5281/zenodo.4434748

  • tcga_omic.tar.gz : multi-omic data (1.6GB)

  • tcga_clinical.tar.gz : clinical annotations (6.2MB)
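
If you just want the prepackaged data, a minimal Python sketch along the following lines will fetch and unpack the multi-omic archive. The direct download URL is an assumption here (it follows Zenodo's usual record/files pattern), and the HDF file names inside the tarball are whatever the archive contains -- adjust as needed.

import tarfile
import urllib.request

# Assumed Zenodo file URL; check the record page if the download fails
url = "https://zenodo.org/record/4434748/files/tcga_omic.tar.gz"
archive_path = "tcga_omic.tar.gz"

urllib.request.urlretrieve(url, archive_path)

# Unpack the archive and list the HDF file(s) it contains
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(".")
    print(tar.getnames())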

The Cancer Genome Atlas (TCGA)

  • TCGA includes multi-omic and clinical data, collected from 11,000+ patients divided into 38 cancer types.

  • It is a really compelling dataset for supervised or unsupervised machine learning tasks.

  • You can access TCGA data via FireBrowse or the firehose_get command line tool.

  • Unfortunately, these tools (FireBrowse/Firehose) are inconvenient for the uninitiated.

    • Unless you have a lot of bioinformatics background knowledge, it can be difficult to understand which data you should actually download.

    • Once you've downloaded the data, it exists as several GB of zipped text files with long, complicated names.

This repository contains a complete workflow for (i) downloading useful kinds of TCGA data and (ii) storing it in a sensible format -- an HDF file.

Setup and execution

If you really want to run the code in this repo for yourself, then take the following steps:

  1. Clone the repository: git clone git@github.com:dpmerrell/tcga-pipeline

  2. Set up your Python environment and install the dependencies: pip install -r requirements.txt. (I recommend doing this in a virtual environment.)

  3. Make sure you have plenty of disk space. The downloaded, unzipped, and partially processed data occupies about 280GB on disk. (In contrast, the final HDF files take <4GB unzipped.)

  4. Make sure you have plenty of time. The downloads take a while -- consider running the workflow overnight.

  5. If you're feeling brave, adjust the parameters in the config.yaml file.

  6. Run the Snakemake workflow: snakemake --cores 1. Using more cores doesn't necessarily buy you any speed, since most of the time is spent downloading data.

Structure of the data

[Figure: TCGA data schematic]

Multi-omic data

This workflow produces an HDF file of multi-omic data, with the structure illustrated above.

  • /omic_data: HDF5 group containing the multi-omic data, along with row and column information.

  • /barcodes: HDF5 group containing full TCGA barcodes for the data samples, i.e., the barcode for each sample and each omic type (when it exists). Full barcodes are potentially useful for modeling batch effects.

Omic data feature names take the following form:

GENE_DATATYPE

Possible values for DATATYPE include mutation, cnv, methylation, mrnaseq, and rppa.

There are many missing values, indicated by NaNs -- not all measurements were taken for all patients.
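
For example, here is a minimal sketch for loading the multi-omic data with h5py. The dataset names follow the merging script shown further down this page (scripts/merge_all_omic_data.py); the file name is a placeholder.

import h5py
import numpy as np

with h5py.File("tcga_omic.hdf", "r") as f:   # placeholder file name
    # features x patients matrix of floats; NaN marks missing measurements
    data = f["omic_data/data"][:, :]
    # row labels of the form GENE_DATATYPE, e.g. "TP53_mutation"
    features = f["omic_data/features"][:].astype(str)
    # column labels: patient IDs, plus one cancer-type label per patient
    patients = f["omic_data/instances"][:].astype(str)
    cancer_types = f["omic_data/cancer_types"][:].astype(str)
    # full TCGA barcodes: one row per omic type, one column per patient ("" if absent)
    barcodes = f["barcodes/data"][:, :].astype(str)

print(data.shape, "fraction missing:", np.isnan(data).mean())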

Clinical data

This workflow also produces an HDF file of clinical data. Its structure is also shown in the figure.
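
A similar sketch for the clinical file, assuming the dataset names used by scripts/merge_clinical_data.py below (again, the file name is a placeholder):

import h5py

with h5py.File("tcga_clinical.hdf", "r") as f:   # placeholder file name
    # 2-D array of strings: one row per entry of /index, one column per patient
    clinical = f["data"][:, :].astype(str)
    fields = f["index"][:].astype(str)               # clinical annotation names
    patients = f["columns"][:].astype(str)           # patient IDs
    cancer_types = f["cancer_types"][:].astype(str)  # one cancer type per patient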

Some provenance details

We download data from particular points in the Broad Institute GDAC Firehose pipeline.

  • Copy number variation

  • Somatic Mutation annotations

    • Mutation_Packager_Oncotated_Calls node in this DAG

  • Methylation

    • Methylation_Preprocess node in this DAG

    • We recover the barcodes for methylation samples by inspecting the corresponding entries in "humanmethylation450" files whenever they exist, falling back to "humanmethylation27" files otherwise.

  • Gene expression

    • mRNAseq_Preprocess node in this DAG

    • We recover the barcodes for RNAseq samples by inspecting the corresponding entries in "illuminahiseq" files whenever they exist, falling back to "illuminaga" files otherwise.

  • Reverse Phase Protein Array

  • Clinical Data

Licensing/Legal stuff

(c) David Merrell 2021

The software in this repository is distributed under an MIT license. See LICENSE.txt for details.

Note: downloading data from the Broad TCGA GDAC site constitutes agreement to the TCGA Data Usage Policy:

https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy

See also this note from the NIH: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga

  • Guidelines for citing TCGA in your research

  • Data usage in publications: "...all TCGA data are available without restrictions on their use in publications or presentations."

Code Snippets

scripts/merge_all_omic_data.py -- merges the per-omic-type HDF files into the combined multi-omic HDF file:

import argparse
import numpy as np
import h5py


def get_patients_and_features(hdf_path_list):

    print("Getting patients and features")

    features = []
    ctype_to_patients = {}
    for hdf_path in hdf_path_list:

        with h5py.File(hdf_path, "r") as f:

            features += list(f["index"][:].astype(str))
            new_patients = f["columns"][:]
            new_ctypes = f["cancer_types"][:]

            for pat, ct in zip(new_patients, new_ctypes):
                if ct not in ctype_to_patients.keys():
                    ctype_to_patients[ct] = set([pat])
                else:
                    ctype_to_patients[ct].add(pat)

    return ctype_to_patients, features



def initialize_output(output_path, all_patients, all_ctypes, features):

    print("Initializing output")

    f_out = h5py.File(output_path, 'w')

    # OMIC DATA GROUP
    index = f_out.create_dataset("omic_data/features", shape=(len(features),),
                                 dtype=h5py.string_dtype('utf-8'))
    index[:] = features

    dataset = f_out.create_dataset("omic_data/data", shape=(len(features), len(all_patients)))
    dataset[:,:] = np.nan

    columns = f_out.create_dataset("omic_data/instances", shape=(len(all_patients),),
                                   dtype=h5py.string_dtype('utf-8'))
    columns[:] = all_patients

    ctypes = f_out.create_dataset("omic_data/cancer_types", shape=(len(all_ctypes),),
                                  dtype=h5py.string_dtype('utf-8'))
    ctypes[:] = all_ctypes

    # BARCODE GROUP
    omic_types = [str(feat).split("_")[-1] for feat in features]
    unique_omic_types = sorted(set(omic_types))

    barcodes = f_out.create_dataset("barcodes/data", shape=(len(unique_omic_types), len(all_patients)),
                                    dtype=h5py.string_dtype('utf-8'))
    barcodes[:,:] = ""

    barcode_columns = f_out.create_dataset("barcodes/instances", shape=(len(all_patients),),
                                           dtype=h5py.string_dtype('utf-8'))
    barcode_columns[:] = all_patients

    barcode_index = f_out.create_dataset("barcodes/features", shape=(len(unique_omic_types),),
                                    dtype=h5py.string_dtype('utf-8'))
    barcode_index[:] = unique_omic_types

    return f_out


def compute_chunks(idx_ls):

    idx_ls_srt = sorted(idx_ls)
    chunks = []

    cur_start = idx_ls_srt[0]
    cur_end = idx_ls_srt[0]
    for idx in idx_ls_srt[1:]:
        if idx == cur_end + 1:
            cur_end += 1
        else:
            chunks.append([cur_start, cur_end])
            cur_start = idx
            cur_end = idx
    chunks.append([cur_start, cur_end])

    return chunks


def add_dataset(input_path, patient_to_col, f_out, leading, lagging):

    print("Adding data from ", input_path)
    with h5py.File(input_path, "r") as f_in:

        # Update the leading row index
        leading += len(f_in["index"]) 

        # Need to build maps between
        # columns of f_in and columns of f_out.
        # The columns of f_in are a subset of the 
        # columns of f_out.
        new_data = f_in["data"][:,:]
        in_out_idx = [patient_to_col[pat] for pat in f_in["columns"][:]]
        out_in_idx = {out_idx: in_idx for in_idx, out_idx in enumerate(in_out_idx)}

        # Writing to HDF is much more efficient if we
        # write "chunks" of columns at a time
        # (rather than individual columns)
        out_column_chunks = compute_chunks(in_out_idx)
        for chunk in out_column_chunks:
            mapped_chunk = [out_in_idx[out_idx] for out_idx in range(chunk[0],chunk[1]+1)]
            f_out["omic_data/data"][lagging:leading, chunk[0]:chunk[1]+1] = new_data[:,mapped_chunk]

        barcode_omics = f_out["barcodes/features"][:].astype(str)
        omic_type = f_in["index"][:1].astype(str)[0].split("_")[-1]
        barcode_omic_row = np.argwhere(barcode_omics == omic_type)[0][0]
        f_out["barcodes/data"][barcode_omic_row, in_out_idx] = f_in["barcodes"][:]

    # Update the lagging row index
    lagging = leading

    return leading, lagging


def add_all_datasets(input_path_list, ctype_to_patients, features, output_path):

    # Some bookkeeping data structures
    ctypes = sorted(list(ctype_to_patients.keys()))
    all_patients = sum([sorted(list(ctype_to_patients[k])) for k in ctypes], start=[])
    patient_to_ctype = {pat: k for k in ctypes for pat in ctype_to_patients[k]}
    all_ctypes = [patient_to_ctype[pat] for pat in all_patients] 
    patient_to_col = {pat: idx for idx, pat in enumerate(all_patients)}

    f_out = initialize_output(output_path, all_patients, all_ctypes, features)

    leading = 0
    lagging = 0
    for i, input_path in enumerate(input_path_list):
        leading, lagging = add_dataset(input_path, patient_to_col, f_out, 
                                       leading, lagging)

    f_out.close()

    return


if __name__=="__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument("--hdf-path-list", nargs="+", help="list of HDF files to merge")
    argparser.add_argument("--output", help="path to the output: a combined HDF file")

    args = argparser.parse_args()

    ctype_to_patients, features = get_patients_and_features(args.hdf_path_list) 

    add_all_datasets(args.hdf_path_list, ctype_to_patients, features, args.output)
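
(For reference, this script can also be invoked by hand, along the lines of python scripts/merge_all_omic_data.py --hdf-path-list mutation.hdf cnv.hdf --output tcga_omic.hdf, where the file names are placeholders; the Snakemake snippets further down show how the pipeline actually calls it.)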

scripts/merge_clinical_data.py -- merges the per-cohort clinical tables into the clinical HDF file:

import argparse
import pandas as pd
import h5py


def standardize_patient_id(patient_id):
    std_id = patient_id.upper()
    std_id = std_id.replace("_","-")
    std_id = "-".join(std_id.split("-")[:3])
    return std_id

def read_clinical_data(data_filename):

    index_col = "bcr_patient_barcode"

    # read the clinical table (tab-separated; skip the extra header rows)
    df = pd.read_csv(data_filename, sep="\t", skiprows=[1,2])

    # index the rows by the barcode column
    df.set_index(index_col, inplace=True)

    # standardize the patient IDs and put the columns in sorted order
    df.columns = df.columns.map(standardize_patient_id)
    srt_cols = sorted(df.columns)
    df = df[srt_cols]

    return df


def read_all_data(data_filenames, all_cancer_types):

    ctype_ls = []
    combined = pd.DataFrame()

    for data_file, cancer_type in zip(data_filenames, all_cancer_types):
        new_df = read_clinical_data(data_file)
        ctype_ls += [cancer_type]*new_df.shape[1]
        combined = pd.concat((combined, new_df), axis=1)

    return combined, ctype_ls


def store_all_data(combined_data, cancer_type_list, output_filename):

    combined_data = combined_data.astype(str)

    with h5py.File(output_filename, "w") as out_f:

        dset = out_f.create_dataset("data", combined_data.shape,
                                    dtype=h5py.string_dtype('utf-8'))
        dset[:,:] = combined_data.values

        rowset = out_f.create_dataset("index", combined_data.index.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        rowset[:] = combined_data.index.values

        colset = out_f.create_dataset("columns", combined_data.columns.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        colset[:] = combined_data.columns.values

        ctype_set = out_f.create_dataset("cancer_types", (len(cancer_type_list),),
                                         dtype=h5py.string_dtype('utf-8'))
        ctype_set[:] = cancer_type_list

    return 


if __name__=="__main__":

    parser = argparse.ArgumentParser(description="Get data from text files; merge it; and put it in an HDF file")
    parser.add_argument("--csv-files", nargs="+", help="a list of paths to clinical CSV files")
    parser.add_argument("--cancer-types", nargs="+", help="a list of cancer type names")
    parser.add_argument("hdf_file", help="path to the output HDF file")

    args = parser.parse_args()

    combined_df, ctype_ls = read_all_data(args.csv_files, args.cancer_types) 

    store_all_data(combined_df, ctype_ls, args.hdf_file)
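
(As above, this script can be run on its own, e.g. python scripts/merge_clinical_data.py tcga_clinical.hdf --csv-files BRCA.clin.tsv LUAD.clin.tsv --cancer-types BRCA LUAD, with placeholder file names; the corresponding Snakemake rule appears below.)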

Selected shell commands from main/Snakefile:

From line 76 of main/Snakefile:

shell:
    "python {input.src} {input.hdf} {output.img} {params.opts}"

From line 84 of main/Snakefile:

shell:
    "python scripts/merge_all_omic_data.py --hdf-path-list {input} --output {output}"

From line 105 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input.csv_files} --cancer-types {params.ctypes}"

From line 118 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input.restored} --cancer-types {params.ctypes}"

From line 143 of main/Snakefile:

shell:
    "python {input.script} {input.data_tsv} {input.barcoded_tsv} {output} {input.barcode_backup_tsv}"

From line 163 of main/Snakefile:

shell:
    "python {input.script} {wildcards.omic_type} {output} --csv-files {input} --cancer-types {params.ctypes}"

From line 179 of main/Snakefile:

shell:
    "python {input.script} {input.annotations_txt} {input.mutsig} {output}"

From line 188 of main/Snakefile:

shell:
    "python scripts/merge_clinical_data.py {output} --csv-files {input} --cancer-types {CANCER_TYPES}"

From line 201 of main/Snakefile:

shell:
    "tar -xvzf {input} --directory {wildcards.run_type}__{TIMESTAMP}/{wildcards.cancer_type}/{TIMESTAMP_ABBR}/"

From line 213 of main/Snakefile:

shell:
    "./{input} -b -tasks {wildcards.data_type} {wildcards.run_type} {TIMESTAMP} {wildcards.cancer_type}"

From line 226 of main/Snakefile:

shell:
    "./{input} -b -tasks {wildcards.analysis_id} analyses {TIMESTAMP} {wildcards.cancer_type}"

From line 235 of main/Snakefile:

shell:
    "unzip {input}"

From line 242 of main/Snakefile:

shell:
    "wget {FIREHOSE_GET_URL}"

(13 additional snippets from the Snakefile are not shown here.)
