Downloads TCGA data from the Broad Institute's GDAC Firehose pipeline and stores it in convenient HDF files.
We've already done the hard work! Feel free to download the HDF files directly from Zenodo: https://doi.org/10.5281/zenodo.4434748

- `tcga_omic.tar.gz`: multi-omic data (1.6GB)
- `tcga_clinical.tar.gz`: clinical annotations (6.2MB)
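If you'd rather script that step, here is a minimal sketch, assuming you have already fetched the two archives from the Zenodo record above (the destination directory name is arbitrary and not part of this workflow):

```python
import tarfile

# Unpack the two Zenodo archives listed above.
# "tcga_data" is a hypothetical destination directory.
for archive in ("tcga_omic.tar.gz", "tcga_clinical.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path="tcga_data")
```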
The Cancer Genome Atlas (TCGA)

- TCGA includes multi-omic and clinical data collected from 11,000+ patients, divided into 38 cancer types.
- It is a compelling dataset for supervised and unsupervised machine learning tasks.
- You can access TCGA data via FireBrowse or the `firehose_get` command line tool.
- Unfortunately, these tools (FireBrowse/Firehose) are inconvenient for the uninitiated. Unless you have a lot of bioinformatics background knowledge, it can be difficult to know which data you should actually download.
- Once you've downloaded the data, it exists as several GB of zipped text files with long, complicated names.

This repository contains a complete workflow for (i) downloading useful kinds of TCGA data and (ii) storing it in a sensible format: an HDF file.
Setup and execution

If you really want to run the code in this repo for yourself, take the following steps:

- Clone the repository: `git clone git@github.com:dpmerrell/tcga-pipeline`.
- Set up your Python environment and install the dependencies: `pip install -r requirements.txt`. (I recommend doing this in a virtual environment.)
- Make sure you have plenty of disk space. The downloaded, unzipped, and partially processed data occupies about 280GB on disk. (In contrast, the final HDF files take <4GB unzipped.)
- Make sure you have plenty of time. The downloads take a while; consider running the workflow overnight.
- If you're feeling brave, adjust the parameters in the `config.yaml` file.
- Run the Snakemake workflow: `snakemake --cores 1`. Using more cores doesn't necessarily buy you any speed when downloading data.
Structure of the data

Multi-omic data

This workflow produces an HDF file of multi-omic data with the following structure:

- `/omic_data`: HDF5 group containing the multi-omic data, along with row and column information.
- `/barcodes`: HDF5 group containing full TCGA barcodes for the data samples, i.e., the barcode for each sample and each omic type (when it exists). Full barcodes are potentially useful for modeling batch effects.
Omic data feature names take the form `GENE_DATATYPE`. Possible values for `DATATYPE` include `mutation`, `cnv`, `methylation`, `mrnaseq`, and `rppa`. There are many missing values, indicated by `NaN`s -- not all measurements were taken for all patients.
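As a quick illustration, here is a minimal sketch of reading the merged omic file with `h5py` and `numpy`. The dataset names follow the merge script reproduced under "Code Snippets" below; the file path and the example feature name are hypothetical.

```python
import h5py
import numpy as np

with h5py.File("tcga_omic.hdf", "r") as f:               # hypothetical path to the omic HDF file
    features = f["omic_data/features"][:].astype(str)    # feature names, e.g. "TP53_mutation"
    patients = f["omic_data/instances"][:].astype(str)   # one column per patient
    ctypes = f["omic_data/cancer_types"][:].astype(str)  # cancer type for each patient

    # Select the rows belonging to one data type, e.g. mRNA-seq
    mrnaseq_rows = np.flatnonzero(np.char.endswith(features, "_mrnaseq"))
    mrnaseq_data = f["omic_data/data"][mrnaseq_rows, :]

# Missing measurements are stored as NaN
print("patients:", len(patients))
print("fraction of mRNA-seq values missing:", np.isnan(mrnaseq_data).mean())
```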
Clinical data

This workflow also produces an HDF file of clinical data. It stores the clinical annotations as strings, along with the corresponding annotation names, patient IDs, and cancer types.
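For instance, here is a minimal sketch of loading the clinical file into a pandas DataFrame. The dataset names follow the `merge_clinical_data.py` script reproduced under "Code Snippets" below; the file path is hypothetical.

```python
import h5py
import pandas as pd

with h5py.File("tcga_clinical.hdf", "r") as f:        # hypothetical path to the clinical HDF file
    data = f["data"][:].astype(str)                   # clinical annotations, stored as strings
    annotations = f["index"][:].astype(str)           # row names: clinical attributes
    patients = f["columns"][:].astype(str)            # column names: patient IDs
    cancer_types = f["cancer_types"][:].astype(str)   # cancer type for each patient

clinical_df = pd.DataFrame(data, index=annotations, columns=patients)
print(clinical_df.shape)
```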
Some provenance details

We download data from particular nodes of the Broad Institute GDAC Firehose pipeline DAG:

- Copy number variation: the `CopyNumber_Gistic2` node.
- Somatic mutation annotations: the `Mutation_Packager_Oncotated_Calls` node.
- Methylation: the `Methylation_Preprocess` node. We recover the barcodes for methylation samples by inspecting the corresponding entries in "humanmethylation450" files whenever they exist, falling back to "humanmethylation27" otherwise (see the sketch after this list).
- Gene expression: the `mRNAseq_Preprocess` node. We recover the barcodes for RNAseq samples by inspecting the corresponding entries in "illuminahiseq" files whenever they exist, falling back to "illuminaga" otherwise.
- Reverse phase protein array: the `RPPA_AnnotateWithGene` node.
- Clinical data: the `Clinical_Pick_Tier1` node.
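Here is a minimal sketch of the barcode-recovery preference just described. It is illustrative only; the function and dictionary names are hypothetical, not the pipeline's actual code.

```python
def recover_barcode(patient_id, preferred_barcodes, fallback_barcodes):
    """Return the preferred platform's barcode when it exists, else the fallback's.

    e.g. methylation: preferred = "humanmethylation450", fallback = "humanmethylation27"
         RNAseq:      preferred = "illuminahiseq",       fallback = "illuminaga"
    """
    if patient_id in preferred_barcodes:
        return preferred_barcodes[patient_id]
    return fallback_barcodes.get(patient_id)  # None if neither platform covered this patient
```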
Licensing/Legal stuff
(c) David Merrell 2021
The software in this repository is distributed under an MIT license. See LICENSE.txt for details.

Note: downloading data from the Broad TCGA GDAC site constitutes agreement to the TCGA Data Usage Policy: https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy

See also this note from the NIH: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga

- Guidelines for citing TCGA in your research
- Data usage in publications: "...all TCGA data are available without restrictions on their use in publications or presentations."
Code Snippets
scripts/merge_all_omic_data.py:

```python
import argparse

import numpy as np
import h5py


def get_patients_and_features(hdf_path_list):
    print("Getting patients and features")
    features = []
    ctype_to_patients = {}
    for hdf_path in hdf_path_list:
        with h5py.File(hdf_path, "r") as f:
            features += list(f["index"][:].astype(str))
            new_patients = f["columns"][:]
            new_ctypes = f["cancer_types"][:]
            for pat, ct in zip(new_patients, new_ctypes):
                if ct not in ctype_to_patients.keys():
                    ctype_to_patients[ct] = set([pat])
                else:
                    ctype_to_patients[ct].add(pat)
    return ctype_to_patients, features


def initialize_output(output_path, all_patients, all_ctypes, features):
    print("Initializing output")
    f_out = h5py.File(output_path, 'w')

    # OMIC DATA GROUP
    index = f_out.create_dataset("omic_data/features", shape=(len(features),),
                                 dtype=h5py.string_dtype('utf-8'))
    index[:] = features

    dataset = f_out.create_dataset("omic_data/data",
                                   shape=(len(features), len(all_patients)))
    dataset[:, :] = np.nan

    columns = f_out.create_dataset("omic_data/instances", shape=(len(all_patients),),
                                   dtype=h5py.string_dtype('utf-8'))
    columns[:] = all_patients

    ctypes = f_out.create_dataset("omic_data/cancer_types", shape=(len(all_ctypes),),
                                  dtype=h5py.string_dtype('utf-8'))
    ctypes[:] = all_ctypes

    # BARCODE GROUP
    omic_types = [str(feat).split("_")[-1] for feat in features]
    unique_omic_types = sorted(set(omic_types))

    barcodes = f_out.create_dataset("barcodes/data",
                                    shape=(len(unique_omic_types), len(all_patients)),
                                    dtype=h5py.string_dtype('utf-8'))
    barcodes[:, :] = ""

    barcode_columns = f_out.create_dataset("barcodes/instances", shape=(len(all_patients),),
                                           dtype=h5py.string_dtype('utf-8'))
    barcode_columns[:] = all_patients

    barcode_index = f_out.create_dataset("barcodes/features", shape=(len(unique_omic_types),),
                                         dtype=h5py.string_dtype('utf-8'))
    barcode_index[:] = unique_omic_types

    return f_out


def compute_chunks(idx_ls):
    idx_ls_srt = sorted(idx_ls)
    chunks = []
    cur_start = idx_ls_srt[0]
    cur_end = idx_ls_srt[0]
    for idx in idx_ls_srt[1:]:
        if idx == cur_end + 1:
            cur_end += 1
        else:
            chunks.append([cur_start, cur_end])
            cur_start = idx
            cur_end = idx
    chunks.append([cur_start, cur_end])
    return chunks


def add_dataset(input_path, patient_to_col, f_out, leading, lagging):
    print("Adding data from ", input_path)
    with h5py.File(input_path, "r") as f_in:

        # Update the leading row index
        leading += len(f_in["index"])

        # Need to build maps between
        # columns of f_in and columns of f_out.
        # The columns of f_in are a subset of the
        # columns of f_out.
        new_data = f_in["data"][:, :]
        in_out_idx = [patient_to_col[pat] for pat in f_in["columns"][:]]
        out_in_idx = {out_idx: in_idx for in_idx, out_idx in enumerate(in_out_idx)}

        # Writing to HDF is much more efficient if we
        # write "chunks" of columns at a time
        # (rather than individual columns)
        out_column_chunks = compute_chunks(in_out_idx)
        for chunk in out_column_chunks:
            mapped_chunk = [out_in_idx[out_idx] for out_idx in range(chunk[0], chunk[1]+1)]
            f_out["omic_data/data"][lagging:leading, chunk[0]:chunk[1]+1] = new_data[:, mapped_chunk]

        barcode_omics = f_out["barcodes/features"][:].astype(str)
        omic_type = f_in["index"][:1].astype(str)[0].split("_")[-1]
        barcode_omic_row = np.argwhere(barcode_omics == omic_type)[0][0]
        f_out["barcodes/data"][barcode_omic_row, in_out_idx] = f_in["barcodes"][:]

    # Update the lagging row index
    lagging = leading

    return leading, lagging


def add_all_datasets(input_path_list, ctype_to_patients, features, output_path):

    # Some bookkeeping data structures
    ctypes = sorted(list(ctype_to_patients.keys()))
    all_patients = sum([sorted(list(ctype_to_patients[k])) for k in ctypes], start=[])
    patient_to_ctype = {pat: k for k in ctypes for pat in ctype_to_patients[k]}
    all_ctypes = [patient_to_ctype[pat] for pat in all_patients]
    patient_to_col = {pat: idx for idx, pat in enumerate(all_patients)}

    f_out = initialize_output(output_path, all_patients, all_ctypes, features)

    leading = 0
    lagging = 0
    for i, input_path in enumerate(input_path_list):
        leading, lagging = add_dataset(input_path, patient_to_col, f_out,
                                       leading, lagging)

    f_out.close()
    return


if __name__ == "__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument("--hdf-path-list", nargs="+", help="list of HDF files to merge")
    argparser.add_argument("--output", help="path to the output: a combined HDF file")

    args = argparser.parse_args()

    ctype_to_patients, features = get_patients_and_features(args.hdf_path_list)
    add_all_datasets(args.hdf_path_list, ctype_to_patients, features, args.output)
```
scripts/merge_clinical_data.py:

```python
import argparse

import pandas as pd
import h5py


def standardize_patient_id(patient_id):
    std_id = patient_id.upper()
    std_id = std_id.replace("_", "-")
    std_id = "-".join(std_id.split("-")[:3])
    return std_id


def read_clinical_data(data_filename):
    index_col = "bcr_patient_barcode"

    # standardize the gene IDs
    df = pd.read_csv(data_filename, sep="\t", skiprows=[1, 2])

    # index the df by gene
    df.set_index(index_col, inplace=True)

    # standardize the patient IDs and put the columns in sorted order
    df.columns = df.columns.map(standardize_patient_id)
    srt_cols = sorted(df.columns)
    df = df[srt_cols]

    return df


def read_all_data(data_filenames, all_cancer_types):
    ctype_ls = []
    combined = pd.DataFrame()
    for data_file, cancer_type in zip(data_filenames, all_cancer_types):
        new_df = read_clinical_data(data_file)
        ctype_ls += [cancer_type]*new_df.shape[1]
        combined = pd.concat((combined, new_df), axis=1)
    return combined, ctype_ls


def store_all_data(combined_data, cancer_type_list, output_filename):
    combined_data = combined_data.astype(str)
    with h5py.File(output_filename, "w") as out_f:
        dset = out_f.create_dataset("data", combined_data.shape,
                                    dtype=h5py.string_dtype('utf-8'))
        dset[:, :] = combined_data.values

        rowset = out_f.create_dataset("index", combined_data.index.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        rowset[:] = combined_data.index.values

        colset = out_f.create_dataset("columns", combined_data.columns.shape,
                                      dtype=h5py.string_dtype('utf-8'))
        colset[:] = combined_data.columns.values

        ctype_set = out_f.create_dataset("cancer_types", (len(cancer_type_list),),
                                         dtype=h5py.string_dtype('utf-8'))
        ctype_set[:] = cancer_type_list
    return


if __name__ == "__main__":

    parser = argparse.ArgumentParser("Get data from text files; merge it; and put it in an HDF file")
    parser.add_argument("--csv-files", nargs="+", help="a list of paths to clinical CSV files")
    parser.add_argument("--cancer-types", nargs="+", help="a list of cancer type names")
    parser.add_argument("hdf_file", help="path to the output HDF file")

    args = parser.parse_args()

    combined_df, ctype_ls = read_all_data(args.csv_files, args.cancer_types)
    store_all_data(combined_df, ctype_ls, args.hdf_file)
```
Shell directives from the individual rules in the Snakefile:

```
shell: "python {input.src} {input.hdf} {output.img} {params.opts}"

shell: "python scripts/merge_all_omic_data.py --hdf-path-list {input} --output {output}"

shell: "python {input.script} {wildcards.omic_type} {output} --csv-files {input.csv_files} --cancer-types {params.ctypes}"

shell: "python {input.script} {wildcards.omic_type} {output} --csv-files {input.restored} --cancer-types {params.ctypes}"

shell: "python {input.script} {input.data_tsv} {input.barcoded_tsv} {output} {input.barcode_backup_tsv}"

shell: "python {input.script} {wildcards.omic_type} {output} --csv-files {input} --cancer-types {params.ctypes}"

shell: "python {input.script} {input.annotations_txt} {input.mutsig} {output}"

shell: "python scripts/merge_clinical_data.py {output} --csv-files {input} --cancer-types {CANCER_TYPES}"

shell: "tar -xvzf {input} --directory {wildcards.run_type}__{TIMESTAMP}/{wildcards.cancer_type}/{TIMESTAMP_ABBR}/"

shell: "./{input} -b -tasks {wildcards.data_type} {wildcards.run_type} {TIMESTAMP} {wildcards.cancer_type}"

shell: "./{input} -b -tasks {wildcards.analysis_id} analyses {TIMESTAMP} {wildcards.cancer_type}"

shell: "unzip {input}"

shell: "wget {FIREHOSE_GET_URL}"
```