Project: Evaluation of trans regulation in cancer: a systems biology approach


This pipeline produces differential expression and co-expression matrices for normal and cancer tissues from the TCGA dataset, downloaded from GDC, and from the Toil Xena dataset, which integrates TCGA and GTEx data.

Folders in this repository

  • bin: It contains the GDC Data Transfer Tool (gdc-client) command-line client for Linux

  • config: It contains the config.yaml file used for the Snakemake pipeline

  • input: It contains the GENCODE annotation files originally used for TCGA data (v22, according to the GDC reference files page). It also contains GENCODE v37 (April 2021) to filter genes and BioMart Ensembl Genes 80 to add GC content for QC purposes.

  • workflow: It contains all the rules and scripts for the Snakemake pipeline

The pipeline

These are the main steps in the pipeline:

  1. Get data:

    • Get data from Xena: It downloads counts, sample information and annotations directly from the Xena-Toil S3 bucket.

      • Rules file: xena.smk
    • Get data from GDC: It queries GDC to create a manifest and uses the gdc-client tool to download files.

      • Rules file: gdc.smk

      • Script: queryGDC.py

  2. Get raw matrix: This step integrates, for each tissue, raw counts with their respective annotation to obtain a raw matrix.

    • Rules: raw.smk

    • Scripts: addAnnotations.R and getRawMatrix.R

  3. QC: It generates NOISeq plots, filters genes with low expression (mean raw counts < 10), generates PCA and density plots (expression per sample), and removes outlier samples whose mean expression lies more than 2 standard deviations above or below the grand mean (a minimal sketch of this filtering appears after this list).

    • Rules: qc.smk

    • Scripts: NOISeqPlots.R, filterLowExpression.R, PCA.R, densityPlot.R and filterOutliers.R

  4. Normalization: On the first run, it tests different normalization methods and integrates the results. Once the normalization steps are defined for each tissue, it produces the normalized data and runs ARSyN for batch correction.

    • Rules: normalization.smk

    • Scripts: normalizationPlots.R, normalizationTest.R, userNormalization.R, and runArsyn.R. After this step, we obtain a final expression matrix for normal and cancer tissues that share the same gene set.

  5. Get final output:

    • Co-expression computation: It generates PCA and density plots after normalization and runs the ARACNE algorithm inside a Singularity image to obtain a co-expression matrix for each tissue and condition. It can also compute a Pearson correlation matrix if those output files are requested.

      • Rules: correlation.smk

      • Scripts: aracne_matrix.py, getPearsonMatrix.R

    • Differential expression: Like the co-expression computation, this step runs after normalization; it computes gene differential expression and its associated plots.

      • Rules: deg.smk

      • Scripts: deg.R

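The sample filter in step 3 is easier to see in code than in prose. Below is a minimal, self-contained R sketch of the two QC criteria on simulated counts; it only illustrates the rule and is not the pipeline's own filterLowExpression.R/filterOutliers.R scripts.

## Minimal sketch of the step-3 filters (simulated data, base R only)
set.seed(42)
counts <- matrix(rpois(200 * 10, lambda = 50), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))

## Gene filter: drop genes with mean raw count < 10
counts <- counts[rowMeans(counts) >= 10, , drop = FALSE]

## Sample filter: flag samples whose mean expression deviates
## more than 2 SD from the grand mean, then drop them
sample_means <- colMeans(counts)
grand_mean <- mean(sample_means)
grand_sd <- sd(sample_means)
is_outlier <- abs(sample_means - grand_mean) > 2 * grand_sd

outliers <- data.frame(sample = names(sample_means)[is_outlier],
                       sd_from_mean = (sample_means[is_outlier] - grand_mean) / grand_sd)
clean_counts <- counts[, !is_outlier, drop = FALSE]
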
How to run it

Unfortunately, the HTSeq raw counts used in this pipeline were removed by GDC in March 2022 (Data Release 32). This means that the GDC query used here (in the Get data from GDC step) will not return any results. However, all the data generated by this pipeline, including raw counts, can be found here:

Code Snippets

script:
    "../scripts/aracne_matrix.py"
script:
    "../scripts/getPearsonMatrix.R"
script:
    "../scripts/deg.R"
SnakeMake From line 12 of rules/deg.smk
shell:
    """
    mkdir -p {output[0]};
    ./bin/gdc-client download -d {output[0]} -m {input[0]} --retry-amount 3;
    touch {output[1]}
    """
SnakeMake From line 7 of rules/gdc.smk
script:
    "../scripts/queryGDC.py"
SnakeMake From line 25 of rules/gdc.smk
script:
    "../scripts/normalizationPlots.R"
script:
    "../scripts/normalizationTest.R"
script:
    "../scripts/userNormalization.R"
script:
    "../scripts/runArsyn.R"
script:
    "../scripts/NOISeqPlots.R"
SnakeMake From line 11 of rules/qc.smk
script:
    "../scripts/filterLowExpression.R"
SnakeMake From line 22 of rules/qc.smk
script:
    "../scripts/PCA.R"
SnakeMake From line 37 of rules/qc.smk
script:
    "../scripts/densityPlot.R"
SnakeMake From line 51 of rules/qc.smk
script:
    "../scripts/filterOutliers.R"
SnakeMake From line 62 of rules/qc.smk
script:
    "../scripts/addAnnotations.R"
SnakeMake From line 12 of rules/raw.smk
script:
        "../scripts/getRawMatrix.R"
SnakeMake From line 30 of rules/raw.smk
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz -O {output}"
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGTEX_phenotype.txt.gz -O {output}"
SnakeMake From line 15 of rules/xena.smk
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/probeMap%2Fgencode.v23.annotation.gene.probemap -O {output}"
SnakeMake From line 24 of rules/xena.smk
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
## Get the Work and Data dir
library(readr)
library(dplyr)
library(rtracklayer)

RDATADIR <- paste(snakemake@params[["tissue_dir"]], "rdata", sep="/")
dir.create(RDATADIR)

IS_XENA <- snakemake@params[["is_xena"]]
##############################################################################
## Get annotation data
## 1. Read the original annotation data
## 2. Read the new annotation data
## 3. Keep only genes/protein-coding present in original and new data
## 4. Query bioMart ensembl 80 for GC content 
## 5. Save data
###############################################################################
if(!IS_XENA) {
  cat("Getting annotation file \n")
  cat("Reading original file \n")
  ## Annotation file used for RNA-seq pipeline according to:
  ## https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files
  annot <- rtracklayer::import('input/gencode.v22.annotation.gtf.gz')
  annot <- as.data.frame(annot)

  ## Only protein coding genes
  annot <- annot %>% dplyr::select(gene_id, seqnames, start, end, width, type, 
                                   gene_type, gene_name) %>% 
    filter(type == "gene" & gene_type == "protein_coding")

  annot <- annot %>% mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1))) %>%
    select(-gene_type)

} else {
  annot <- read_tsv(snakemake@input[["xena_annot"]],
                    col_names  = c("gene_id", "gene_name", "seqnames", "start", "end", "strand"), 
                    skip = 1)
  annot <- annot %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1)), 
                                   width = end - start)
  join_by <- "ensembl_id"
}

cat("Reading new file \n")
## Newest gencode file. April, 2021.
annot_new <-  rtracklayer::import('input/gencode.v37.annotation.gtf.gz')
annot_new <- as.data.frame(annot_new)
annot_new <- annot_new %>% dplyr::select(gene_id, gene_name, type, gene_type) %>% 
  dplyr::filter(type == "gene" & gene_type == "protein_coding")

annot_new <- annot_new %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1)))

## Keep genes that remain in the newest annotation file
## but get the newest names and keep only conventional chromosomes
## remove duplicates
annot <- annot %>% dplyr::select(-gene_name) %>%
  inner_join(annot_new %>% dplyr::select(ensembl_id, gene_name, gene_type), by = "ensembl_id") %>%
  filter(seqnames %in% paste0("chr", c(as.character(1:22), "X", "Y"))) %>% distinct()

cat('Annotation file new/old merge: ', paste(dim(annot), collapse=", "), '\n')

## We need GC content per gene for normalization. 
## Query Biomart 80 (according to gencode.v22.annotation.gtf, version 79 was used,
## but it is no longer accessible on the website)
## http://may2015.archive.ensembl.org/biomart
biomart <- read_tsv("input/Biomart_Ensembl80_GRCh38_p2.txt", 
                    col_names = c("ensembl_id", "version", "gc"), skip = 1)

## Get only genes matching ensemblID
biomart <- biomart %>% mutate(gene_id = paste(ensembl_id, version, sep="."))

if(!IS_XENA) {
  annot <- annot %>%
    inner_join(biomart %>% dplyr::select(gene_id, gc), by = "gene_id")
} else {
  annot <- annot %>%
    inner_join(biomart %>% dplyr::select(ensembl_id, gc), by = "ensembl_id")
}

annot <- annot %>% mutate(chr = gsub("chr", "", seqnames)) %>%
  dplyr::rename(length = width) %>%
  dplyr::select(-seqnames) %>%   dplyr::select(gene_id, chr, everything())

cat('Annotation file. Final dimension: ', paste(dim(annot), collapse=", "), '\n')
save(annot, file=snakemake@output[["annot_rdata"]], compress="xz")
write_tsv(annot, snakemake@output[["annot_tsv"]])
cat('annot.RData saved \n')

##############################################################################
## Merging count and annotation
## 1. Build counts matrix. M=normal|tumour
## 2. Build targets matrix. targets=normal+tumor
## 3. Check M and targets integrity
## 4. Filter by annotation file 
## 5. Save the clean data
##############################################################################
{
  cat('Merging counts and annotations \n')

  normal_samples <- list(matrix=read_tsv(snakemake@input[["normal_matrix"]]),
                         targets=read_tsv(snakemake@input[["normal_targets"]])) 
  cancer_samples <- list(matrix=read_tsv(snakemake@input[["cancer_matrix"]]),
                         targets=read_tsv(snakemake@input[["cancer_targets"]])) 
  ## Raw counts
  M <- normal_samples$matrix %>% inner_join(cancer_samples$matrix, by = "gene_id")
  cat('Total number of features and samples: ', paste(dim(M), collapse=" ,"), '\n')

  ## Samples
  targets <- bind_rows(normal_samples$targets, cancer_samples$targets)

  ## Check M and targets integrity (the gene_id column is excluded from the column count)
  stopifnot(nrow(targets) == (ncol(M)-1))
  cat('Number of counts columns match sample number\n')

  ## Filter counts by annotation data
  cat('Adding biomart data\n')
  M <- M %>% semi_join(annot, by = "gene_id")
  annot <- annot %>% semi_join(M, by = "gene_id")

  cat('Total number of annotated (genes/protein-coding) features:', nrow(M), '\n')
  cat('Total number of samples:', ncol(M)-1, '\n')

  no_gc <- sum(is.na(annot$gc))
  cat("There are",no_gc, "entries with no GC info \n")

  ids <- annot %>% filter(!is.na(gc) & !is.na(length)) %>%
    select(gene_id) %>% unlist(use.names = F)

  M <- M %>% filter(gene_id %in% ids) %>% arrange(gene_id)
  annot <- annot %>% filter(gene_id %in% ids) %>% arrange(gene_id)
  cat("Non GC and lenght annotated genes removed.\n")

  ## Save it as a matrix
  ids <- M$gene_id
  MM <- M %>% select(-gene_id) %>% as.matrix() 
  rownames(MM) <- ids
  MM <- MM[,targets$id]

  ## Make sure they are factors
  targets$group <- factor(targets$group, levels=c("cancer", "normal"))

  ##Save clean data
  cat('Saving raw full data \n')
  full <- list(M = MM, annot = annot, targets = targets)

  save(full, file=snakemake@output[["raw_rdata"]], compress="xz")
  cat("raw_full.RData saved \n")
}
from MultiAracne import Aracne
import sys
import pathlib

sys.stdout = open(snakemake.log[0], "w")

exp_matrix = snakemake.input[0]
outmatrix = snakemake.output[0]
outdir = snakemake.params[0] + "/correlation/output_" + snakemake.params[1] + "_" + snakemake.params[2]
procs = int(snakemake.threads)

print(f"Saving files in {outdir}")
pathlib.Path(outdir).mkdir(parents=True, exist_ok=True)

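# Run ARACNE with the requested number of processes, join the resulting .adj files,
# and convert the joined adjacency file into a co-expression matrix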
ma = Aracne(exp_matrix)
ma.run(processes=procs, outdir=outdir, pval=1)
ma.join_adj(outdir, outdir+"/matrix.adj")
Aracne.adj_to_matrix(outdir+"/matrix.adj", outmatrix)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(limma)
library(edgeR)
library(Glimma)
library(readr)
library(dplyr)
library(janitor)
library(ggplot2)
library(ggthemes)

DEGDIR <- paste(snakemake@params[["deg_dir"]])
dir.create(DEGDIR)

TISSUE <- snakemake@params[["tissue"]]

cat("Loading files\n")
load(snakemake@input[["full"]])
load(snakemake@input[["annot"]])

full$targets <- full$targets %>% 
  dplyr::mutate(group = factor(group, levels=c("normal", "cancer")))

cat("Generating model matrix\n")
mm <- model.matrix(~1+group, data = full$targets)
rownames(mm) <- full$targets %>% pull(id)
full$M <- full$M[, rownames(mm)]

cat("Getting voom transformation\n")
y <- voom(full$M, mm, plot = F, save.plot = T)

p <- ggplot() +
  geom_point(aes(x = y$voom.xy$x, y = y$voom.xy$y), size = 0.5) + 
  geom_line(aes(x = y$voom.line$x, y = y$voom.line$y), size = 1, color = "red") +
  xlab(y$voom.xy$xlab) +
  ylab(y$voom.xy$ylab) + 
  theme_bw() +
  ggtitle(TISSUE)

cat("Saving voom plot\n")
png(file = paste0(DEGDIR, "/voom.png"), width = 800, height = 600)
print(p)
dev.off()

cat("Getting linear adjustment\n")
fit <- lmFit(y, mm)
lfc_fit <- eBayes(fit)

cat("Getting Glimma interactive plot plot\n")
dt <- decideTests(lfc_fit, lfc = 2) 

annot <- annot %>% filter(gene_id %in% rownames(full$M)) %>% 
  select(gene_id, ensembl_id, chr, start, end, gene_name) %>%
  arrange(gene_id)

rownames(annot) <- annot$gene_id

glMDPlot(path = DEGDIR, lfc_fit, 
         counts = full$M, groups = full$targets %>% pull(group), 
         status = dt, anno = annot)

p <- ggplot() +
  geom_density(aes(x = lfc_fit$coefficients[,2]), size = 0.5) + 
  theme_bw() +
  xlab("lfc")
  ggtitle(TISSUE)

png(file = paste0(DEGDIR, "/density.png"), width = 800, height = 600)
print(p)
dev.off()

cat("Saving deg results\n")
topTable(lfc_fit, sort.by = "P", n = Inf, adjust.method="BH") %>%
  janitor::clean_names() %>% mutate(gene_id = rownames(.)) %>%
  inner_join(annot %>% 
               select(gene_id, ensembl_id, chr, gene_name), by ="gene_id") %>% 
  select(-gene_id) %>%
  select(ensembl_id, everything()) %>%
  mutate(adj_p_val = ifelse(adj_p_val > 0.05, 1, adj_p_val), 
                           log_fc = ifelse(adj_p_val > 0.05, 0, log_fc)) %>%
  write_tsv(snakemake@output[["deg_results"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(readr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(ggthemes)
TISSUE <- snakemake@params[["tissue"]]
w <- 1024
h <- 1024
p <- 24

cat("Loading data \n")
load(snakemake@input[[1]])

cat("Calculating mean \n")
melted_data <- melt(log2(full$M+1))
means <- colMeans(full$M)
mm <- mean(means)
sd <- sd(means)

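## Flag samples whose mean expression lies more than 2 SD below or above the grand mean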
down_outliers <- which(means < mm-2*sd)
up_outliers <- which(means > mm+2*sd)

if(length(down_outliers) > 0 | length(up_outliers) > 0 ) {
  cat("Outliers found, saving samples names \n")
} else {
  cat("Outliers not found\n")
}

outliers <- data.frame(sample = c(names(down_outliers), names(up_outliers)),
           mean = c(means[down_outliers], means[up_outliers]))
outliers <- outliers %>% mutate(sd_from_mean = (mean - mm)/sd )

write_tsv(outliers, file=snakemake@output[["outliers"]])

pl <- ggplot(data = melted_data, aes(x=value, group=Var2, colour=Var2)) + 
  geom_density(show.legend = F) + 
  theme_hc(base_size = 20) +
  xlab("log2(expr+1)") + 
  ggtitle(TISSUE)

png(file = snakemake@output[["density"]], width = w, height = h/2, pointsize = p)
print(pl)
dev.off()

cat("Density plot for all samples generated\n")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(dplyr)

##########################################################################
load(snakemake@input[[1]])
{ 
  ### We keep only genes with mean expression count > 10 
  exp_genes <- apply(full$M, 1, function(x) mean(x)>10)
  egtable <- table(exp_genes)
  cat("There are", egtable[[1]], "genes with mean expression count < 10", egtable[[2]], "with mean count > 10 \n")
  exp_genes <- names(exp_genes[exp_genes == TRUE])

  non_zero_genes <- apply(full$M, 1, function(x) sum(x == 0) < ncol(full$M)/2)
  egtable <- table(non_zero_genes)
  cat("There are", egtable[[1]], "genes with 0 values in more than 50% samples", egtable[[2]], 
      " genes with 0 values in less than 50% samples \n")
  non_zero_genes <- names(non_zero_genes[non_zero_genes == TRUE])

  genes_pass <- intersect(exp_genes, non_zero_genes)

  ##Filtering low expression genes
  full <- list(M = round(full$M[genes_pass, full$targets$id]),
                 annot = full$annot %>% filter(gene_id %in% genes_pass),
                 targets = full$targets)

  full$annot <- full$annot %>% arrange(gene_id)
  full$M <- full$M[order(row.names(full$M)), full$targets$id]

  stopifnot(all(full$annot$gene_id == rownames(full$M)))
  cat(paste0("Genes in matrix and annotation match positions \n"))

  stopifnot(all(full$targets$id == colnames(full$M)))
  cat(paste0("Samples in matrix and annotation match positions \n"))

  row.names(full$annot) <- full$annot$gene_id

  cat("Saving full_filtered.RData \n") 
  save(full, file=snakemake@output[[1]], compress="xz")
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(readr)
library(dplyr)

cat("Loading files \n")
load(snakemake@input[[1]])
outliers <- read_tsv(snakemake@input[["outliers"]])

cat(nrow(outliers), " outliers will be removed.\n")
targets <- full$targets %>% 
  filter(!id %in% (outliers %>% pull(sample)))

M <- full$M[, targets %>% pull(id)]

full <- list(annot = full$annot, M = M, targets = targets)

dimcancer <- nrow(full$targets %>% filter(group == "cancer"))
dimnormal <- nrow(full$targets %>% filter(group == "normal"))
cat("Final dimensions: ", dimcancer, " cancer and ", dimnormal,
" normal samples \n")

cat("Saving RData \n")
save(full, file=snakemake@output[[1]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(data.table)
library(dplyr)
library(magick)
library(ComplexHeatmap)

cat("Reading matrix \n")
expr_file <- snakemake@input[["expr_matrix"]]
annot_file <-  snakemake@input[["annot"]]

cat("Getting annotations \n")
chr_pal <- c("#D909D1", "#0492EE", "#D2656C", "#106F35", "#5BD2AE", "#199F41", 
             "#FE0F43", "#00FFCC", "#F495C5", "#E1BF5D", "#5F166F", "#088ACA",
             "#41CFE0", "#0F0A71", "#FFFF99", "#B06645", "#651532", "#B925AE",
             "#86C91F", "#CB97E8", "#130B9E", "#EF782B", "#79A5B9", "#F7AA7C")

names(chr_pal) <- c("22","11","12","13","14","15","16","17","18","19","1" ,"2" ,"3" ,"4" ,"5" ,
                    "6" ,"7" ,"X" ,"8" ,"9" ,"20","10","21", "Y")

load(annot_file)
expr_matrix <- fread(expr_file)
annot <- annot %>% 
  mutate(chr = factor(chr, levels=c(as.character(1:22), "X", "Y"))) %>%
  arrange(chr, start)
genes <- expr_matrix %>% pull(gene)
expr_matrix <- expr_matrix %>% select(-gene) %>% as.matrix()
rownames(expr_matrix) <- genes

annot <- annot %>% filter(gene_id %in% genes)
expr_matrix <- expr_matrix[annot$gene_id, ]

expr_matrix <- t(expr_matrix)
cat("Calculating correlation \n")
corr <- cor(expr_matrix, expr_matrix, method="pearson")

chrs <- annot %>% dplyr::select(chr) %>% dplyr::rename(Chr = chr)

cat("Building heatmap \n")
ha <- rowAnnotation(df = chrs, 
                    name = "Chr", show_annotation_name = F,
                    col = list(Chr = chr_pal[as.character(chrs$Chr)]), 
                    width = unit(0.5, "cm"),
                    annotation_legend_param = list(
                      title = "Chr", title_gp = gpar(fontsize = 14), grid_height = unit(0.5, "cm")))


ht <- Heatmap(corr, cluster_rows = F, cluster_columns = F, show_row_names = F, 
        show_column_names = F,  heatmap_legend_param = list( title_gp = gpar(fontsize = 14),
                                                             title = "Pearson \nCorrelation",
                                                             legend_height = unit(4, "cm")), 
        right_annotation = ha)

cat("Saving image \n")
png(snakemake@output[["plot"]], width = 1200, height = 1200)
draw(ht, heatmap_legend_side = "right", annotation_legend_side = "right")
dev.off()

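## Write only the upper triangle so each gene pair appears once in the output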
colnames(corr) <- genes
corr[lower.tri(corr)] <- NA
fwrite(corr, file=snakemake@output[["csv"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
## Get the Work and Data dir
library(data.table)
library(readr)
library(dplyr)
library(janitor)
library(rtracklayer)

TISSUE <- snakemake@params[["tissue"]]
TYPE <-  snakemake@params[["type"]]
IS_XENA <- snakemake@params[["is_xena"]]
MCCORES <- as.numeric(snakemake@threads[[1]])
# ###############################################################################
# ## Get and check count matrices
# ## 1. Find files
# ## 2. Read data
# ## 3. Check sample sizes
# ## 4. Check genes order
# ## 5. Build target matrix
# ###############################################################################

RDATADIR <- paste(snakemake@params[["tissue_dir"]], "results", sep="/")
dir.create(RDATADIR)

if(!IS_XENA) {

  RAWDIR <- paste(snakemake@params[["tissue_dir"]], "raw", sep="/")

  cat(paste0("Checking ", TYPE, " samples \n"))

  ## Find the files
  files_to_read <- list.files(path = paste0(RAWDIR,"/",  TYPE), 
                      pattern = "\\.htseq.counts.gz$", full.names = T, recursive = T)

  ## Read the data
  all_files <- lapply(files_to_read, function(file) {
    data <- read_tsv(file, col_names = c("gene_id", "raw_counts"))
    data <- data %>% filter(!startsWith(gene_id, "_" ))
    return(data)
  })

  ## Check samples sizes
  size <- unique(do.call(rbind,lapply(all_files, dim)))
  stopifnot(nrow(size)==1)
  cat(paste0(TYPE, " samples have the same size \n"))

  ## Check genes order in samples
  genes <- do.call(cbind, lapply(all_files, function(x) dplyr::select(x, "gene_id")))
  genes <- t(unique(t(genes)))
  stopifnot(dim(genes)==c(size[1,1], 1))
  cat(paste0("Genes in ", TYPE, " samples match positions \n"))

  ## Build targets matrix
  targets <- data.frame(id = paste(TISSUE, TYPE, 1:length(files_to_read), sep = "_"),
                        file = unlist(lapply(strsplit(files_to_read, "/"), "[[", 10)),
                        file_id = unlist(lapply(strsplit(files_to_read, "/"), "[[", 9)),
                        group = TYPE, stringsAsFactors = FALSE)

  ## Rename columns in counts matrix
  matrix <- bind_cols(lapply(all_files, function(x) dplyr::select(x, "raw_counts")))
  colnames(matrix) <- targets$id

  matrix <- matrix %>% mutate(gene_id = genes[,1]) %>%
    dplyr::select(gene_id, everything())

} else {
  XENA_COUNTS <- snakemake@input[[1]]
  XENA_SAMPLES <- snakemake@input[[2]]
  PRIMARY <- snakemake@params[["primary"]]
  extended_type <- snakemake@params[["extended_type"]]

  matrix_samples <- read_tsv(XENA_SAMPLES) %>% clean_names()
  matrix_counts <-  fread(XENA_COUNTS, nThread = MCCORES)

  # When the normal samples should come from TCGA in UCSC Xena, the extended type 
  # for normal should be: "Solid Tissue Normal". This is only for testing
  # When the normal samples should come from GTEX in UCSC Xena, the extended type 
  # for normal should be: "Normal Tissue". This is the usual pipeline
  if(is.null(extended_type)) {
    extended_type <- ifelse(TYPE == "normal", "Normal Tissue", "Primary Tumor")  
  }

  cat("Extented type: ", extended_type, "\n")
  cat("Primary disease or tissue: ", PRIMARY, "\n")

  camelTissue <- paste0(toupper(substring(TISSUE, 1, 1)), substring(TISSUE, 2))

  cat("Tissue name: ", camelTissue, "\n")
  targets <- matrix_samples %>% dplyr::filter(primary_site == camelTissue &
                                                primary_disease_or_tissue == PRIMARY &
                                                sample_type == extended_type)
  cat("Got", nrow(targets), " samples\n")

  cat("Getting count matrix\n")
  ### Expected counts from XENA are in log2(expected_count+1)
  ### get them back to expected_count for downstream pipeline
  matrix <- matrix_counts %>% dplyr::select_if(names(.) %in% c("sample", targets$sample)) %>%
    dplyr::select(sample, everything()) %>% dplyr::mutate(across(-sample, ~ 2^.x - 1))
  matrix[matrix < 0] <- 0

  targets <- targets %>% dplyr::filter(sample %in% names(matrix)) %>%
    dplyr::mutate(id = paste(TISSUE, TYPE, 1:nrow(.), sep = "_"), group = TYPE) %>% 
    dplyr::rename(file = sample) %>% select(id, file, group)

  matrix <- matrix %>% dplyr::select(sample, targets$file) 
  colnames(matrix) <- c("gene_id", targets$id)
}

cat(paste0("Matrices for ", TYPE, " ready.\n"))
cat("Saving matrix\n")
write_tsv(matrix, snakemake@output[["matrix"]])
cat("Saving samples\n")
write_tsv(targets, snakemake@output[["samples"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(dplyr)
library(NOISeq)
library(ggplot2)

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", snakemake@params[["plots_type"]], sep="/")
dir.create(PLOTSDIR, recursive = TRUE)
w <- 1024
h <- 1024
p <- 24

load(snakemake@input[[1]])
##########################################
## EXPLORATORY ANALYSIS (NOISeq package)
##########################################
{
  ## Reading data into NOISeq package -> mydata
  mydata <- NOISeq::readData(
    data = full$M, 
    length = full$annot %>% select(gene_id, length) %>% as.data.frame(), 
    biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
    chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), 
    factors = full$targets %>% select(group) %>% as.data.frame(),
    gc = full$annot %>% select(gene_id, gc) %>% as.data.frame())

}
##########################################
## Plots
##########################################
{
  # Biodetection plot. Per group.
  mybiodetection <- dat(mydata, type="biodetection", factor="group", k=0)
  png(filename=paste(PLOTSDIR, "biodetection.Rd_%03d.png", sep="/"), 
      width=w, height=h, pointsize=p)
  explo.plot(mybiodetection)
  dev.off()
  cat("Biodetection plots generated\n")

  ## Count distribution per biotype. Using count per million, only for one sample
  mycountsbio <- dat(mydata, factor = NULL, type = "countsbio")
  png(filename=paste(PLOTSDIR, "countsbio.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = 1, samples = 1, plottype = "boxplot")
  dev.off()
  cat("Counts distribution plot per biotype and one sample generated\n")
  #What about expression level?

  ## Count distribution per sample
  mycountsbio <- dat(mydata, factor = NULL, type = "countsbio")
  png(paste(PLOTSDIR, "protein_coding_boxplot.png", sep="/"), width=w*2, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "boxplot")
  dev.off()
  cat("Counts distribution plot for protein coding and all samples generated\n")

  png(paste(PLOTSDIR, "protein_coding_barplot.png", sep="/"), width=w*2, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "barplot")
  dev.off()
  cat("Counts distribution barplot for protein coding biotype and all samples generated\n")

  mycountsbio <- dat(mydata, factor = "group", type = "countsbio")
  ## Count distribution per Experimental factors
  png(paste(PLOTSDIR, "protein_coding_boxplot_group.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "boxplot")
  dev.off()
  cat("Counts distribution boxplot for protein coding biotype and group generated\n")

  png(paste(PLOTSDIR, "protein_coding_barplot_group.png", sep="/"),
      width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "barplot")
  dev.off()
  cat("Counts distribution barplot for protein coding biotype and group generated\n")
  # How much sensitivity do we lose?

}
##########################################
## Bias
##########################################
{
  ## Length bias detection
  mylengthbias <- dat(mydata, factor="group", type="lengthbias")
  png(paste(PLOTSDIR, "length_bias.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mylengthbias, samples = NULL, toplot = "global")
  dev.off()
  cat("Lenght bias plot generated\n")
  # Do we see a clear pattern?

  ## GC bias
  mygcbias <- dat(mydata, factor = "group", type="GCbias")
  png(paste(PLOTSDIR, "gc_bias.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mygcbias, samples = NULL, toplot = "global")
  dev.off()
  cat("GC bias plot generated\n")
  # Do we see a clear pattern?

  ## RNA composition
  mycomp <- dat(mydata, type="cd")
  png(paste(PLOTSDIR, "rna_composition.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycomp, samples=1:12)
  dev.off()
  cat("RNA composition plot generated\n")
  # Are samples comparable?
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(ggplot2)
library(reshape2)
library(grid)
library(png)
library(gridExtra)
library(readr)
library(dplyr)

###############################################################################
args <- commandArgs(trailingOnly = T)

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/")
PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/")
w <- 1024
h <- 1024
p <- 24

{ 
  cat("Integrating plots \n.")

  length_norm <- c("no","full", "loess", "median", "upper")
  gc_norm <- c("no", "full", "loess", "median", "upper")
  between_norm <- c("no", "full", "median", "tmm", "upper")

  ## This function will retrieve plots for one normalization set and will create 
  ## one single plot
  savePlots <- function(step1, step2, step3) {

    plotname <- paste(step1, step2, step3, sep = "_")  
    pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), 
                  paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), 
                  paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png"))

    thePlots <- lapply (pngPlots, function(pngFile) {
      rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE)
    })

    plotpath <- paste(PLOTSNORMDIR, paste(plotname, "png", sep ="."), sep="/")
    png(plotpath,  height=h/2, width=w*(3/2))
    par(oma = c(0, 0, 1.5, 0))
    plot.new()
    do.call(grid.arrange, c(thePlots,  ncol = 3, nrow=1))
    mtext(plotname, outer = TRUE, cex = 1.5)
    dev.off()

    unlink(pngPlots)
    return(plotpath)
  } 

  df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, 
                                   bn = between_norm, stringsAsFactors = F)

  cat("Getting plots for all ", nrow(df_normalizations), "normalization combinations.\n")

  plots <- lapply(1:nrow(df_normalizations),  function(i){
                         gcn <- df_normalizations[i, "gcn"]
                         ln <- df_normalizations[i, "ln"]
                         bn <- df_normalizations[i, "bn"]
                         gcp <- NA
                         tryCatch(expr = {
                           gcp <- savePlots(paste("gc", gcn, sep = "_"), 
                                     paste("length", ln, sep = "_"), 
                                     paste("between", bn, sep =  "_"))
                         }, error = function(cond){

                           cat(cond$message, "\n")
                         })
                         lp <- NA
                         tryCatch(expr = {
                           lp <- savePlots(paste("length", ln, sep = "_"), 
                                     paste("gc", gcn, sep = "_"), 
                                     paste("between", bn, sep =  "_"))
                         }, error = function(cond){
                           cat(cond$message, "\n")
                         })
              return(list(gcp, lp))         
          })

  plots <- sort(unlist(plots))

  thePlots <- lapply (plots, function(pngFile) {
    rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE)
  })

  cat("Saving normalization_plots.pdf\n") 
  pdf(paste(PLOTSDIR, "normalization_plots.pdf", sep="/"), title="Normalization Plots")
  print(marrangeGrob(thePlots, nrow = 3, ncol = 1, top = NULL))
  dev.off()
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(reshape2)
library(grid)
library(png)
library(gridExtra)
library(parallel)
library(readr)
library(dplyr)
library(NOISeq)
library(EDASeq)

###############################################################################
MCCORES <- as.numeric(snakemake@threads[[1]])
PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/")
PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/")
dir.create(PLOTSNORMDIR)

w <- 1024
h <- 1024
p <- 24

load(snakemake@input[[1]])

cat("Testing normalization methods\n.")
rawEDA <- EDASeq::newSeqExpressionSet(
  counts = full$M,
  featureData = data.frame(full$annot, row.names = full$annot$gene_id),
  phenoData = data.frame(
    conditions = full$targets$group,
    row.names = full$targets$id))

length_norm <- c("no", "full", "loess", "median", "upper")
gc_norm <- c("no", "full", "loess", "median", "upper")
between_norm <-c("no", "full", "median", "tmm", "upper")

## This function gets the relevant statistics for the regression methods for GC and Length bias
getRegressionStatistics <- function(regressionmodel) {
  name <- names(regressionmodel)
  rsquared <- summary(regressionmodel[[1]])$r.squared
  fstatistic <- summary(regressionmodel[[1]])$fstatistic
  pvalue <- signif(pf(q = fstatistic[1], df1 = fstatistic[2], 
                      df2 = fstatistic[3], lower.tail = FALSE), 2)
  return(list("name" = name, "r2" = rsquared, "p" = pvalue))
}

## This function will retrieve the NOISeq results to test the normalization combination
getNOISeqResults <- function(step1, step2, step3, n_counts, m10_data) {
  ### Check the NOISEq results 
  mydata <- NOISeq::readData(
    data = n_counts, 
    length = m10_data$annot %>% select(gene_id, length) %>% as.data.frame(), 
    biotype = m10_data$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
    chromosome = m10_data$annot %>% select(chr, start, end) %>% as.data.frame(),  
    factors = m10_data$targets %>% select(group) %>% as.data.frame(), 
    gc = m10_data$annot %>% select(gene_id, gc)%>% as.data.frame())

  nsamples <- dim(m10_data$targets)[1]

  ### Length bias 
  mylengthbias <- dat(mydata, factor="group", norm = TRUE, type="lengthbias")
  l_stats_1 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[1])
  l_stats_2 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[2])

  ## GC Bias
  mygcbias <- dat(mydata, factor = "group", norm = TRUE, type ="GCbias")
  gc_stats_1 <- getRegressionStatistics(mygcbias@dat$RegressionModels[1])
  gc_stats_2 <- getRegressionStatistics(mygcbias@dat$RegressionModels[2])

  #RNA Composition
  myrnacomp <- dat(mydata, norm = TRUE, type="cd")
  dtable <- table(myrnacomp@dat$DiagnosticTest[,  "Diagnostic Test"])
  if (is.na(dtable["PASSED"])) dtable <- data.frame(PASSED = 0)

  cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " calculations done\n")
  plotname <- paste(step1, step2, step3, sep = "_")  
  pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), 
                paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), 
                paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png"))

  png(pngPlots[1], width=w/2, height=h/2)
  explo.plot(mylengthbias, samples = NULL, toplot = "global")
  dev.off()

  png(pngPlots[2], width=w/2, height=h/2)
  explo.plot(mygcbias, samples = NULL, toplot = "global")
  dev.off()

  png(pngPlots[3],width=w/2, height=h/2)
  explo.plot(myrnacomp, samples = 1:12)
  dev.off()

  cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " plots saved\n")

  norm_set_results <- tibble(step1, step2, step3, 
                                 l_stats_1$r2, l_stats_1$p, l_stats_2$r2, l_stats_2$p,
                                 gc_stats_1$r2, gc_stats_1$p, gc_stats_2$r2, gc_stats_2$p, 
                                 dtable["PASSED"], dtable["PASSED"]/nsamples)

  colnames(norm_set_results) <- c("step1", "step2", "step3", 
                                  paste("length", l_stats_1$name, "R2", sep = "_"), paste("length", l_stats_1$name, "p-value", sep = "_"),  
                                  paste("length", l_stats_2$name, "R2", sep = "_"), paste("length", l_stats_2$name, "p-value", sep = "_"), 
                                  paste("gc", gc_stats_1$name, "R2", sep = "_"), paste("gc", gc_stats_1$name, "p-value", sep = "_"), 
                                  paste("gc", gc_stats_2$name, "R2", sep = "_"), paste("gc", gc_stats_2$name, "p-value", sep = "_"),
                                  "rna_passed_samples", "rna_passed_proportion")
  return(norm_set_results)
}

df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, 
                                bn = between_norm, stringsAsFactors = F)

cat("Testing all ", nrow(df_normalizations), "normalization combinations\n")


## Normalizations with GC step first
gc_norms <- mclapply(X = 1:nrow(df_normalizations), 
                      mc.cores = MCCORES, 
                      mc.preschedule = FALSE,
                      FUN = function(i){

    gcn <- df_normalizations[i, "gcn"]
    ln <- df_normalizations[i, "ln"]
    bn <- df_normalizations[i, "bn"]

    tryCatch(
      expr = {

        if(gcn == "no") {
          gcn_data <- counts(rawEDA)
        } else {
          gcn_data <- withinLaneNormalization(counts(rawEDA), full$annot$gc, which = gcn)  
        }

        if(ln == "no") {
          ln_data <- gcn_data
        } else {
          ln_data <- withinLaneNormalization(gcn_data, full$annot$length, which = ln)
        }

        if (bn == "no") {
          between_data <- ln_data
        } else if (bn == "tmm") {
          between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0)
        } else {
          between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE)
        }
        cat("Testing with GC normalization: ", gcn, ", length normalization: ", ln, " and between lane normalization: ", bn, "\n")
        norm_noiseq_results <- getNOISeqResults(paste("gc", gcn, sep = "_"), paste("length", ln, sep = "_"), paste("between", bn, sep =  "_"), 
                                                between_data, full)
        return(norm_noiseq_results)
      },
      error = function(cond) {

        norm_noiseq_results <- tibble(step1 = paste("gc", gcn, sep = "_"), 
                                   step2 = paste("length", ln, sep = "_"),
                                   step3 = paste("between", bn, sep =  "_"), 
                                   error = cond$message)
        return(norm_noiseq_results)

      })
})

save(gc_norms, file=snakemake@output[["gc_norms"]])
gc_norms <- bind_rows(gc_norms)

## Normalizations with Length step first
ln_norms <- mclapply(X = 1:nrow(df_normalizations), 
                      mc.preschedule = FALSE,
                      mc.cores = MCCORES,
                      FUN = function(i){

    gcn <- df_normalizations[i, "gcn"]
    ln <- df_normalizations[i, "ln"]
    bn <- df_normalizations[i, "bn"]

    tryCatch(
      expr = {

        if(ln == "no") {
          ln_data <- counts(rawEDA)
        } else {
          ln_data <- withinLaneNormalization(counts(rawEDA), full$annot$length, which = ln)
        }

        if(gcn == "no") {
          gcn_data <- ln_data
        } else {
          gcn_data <- withinLaneNormalization(ln_data, full$annot$gc, which = gcn)
        }

        if (bn == "no") {
          between_data <- ln_data
        } else if (bn == "tmm") {
          between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0)
        } else {
          between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE)
        }
        cat("Testing with length normalization: ", ln, "GC normalization: ", gcn, " and between lane normalization: ", bn, "\n")
        norm_noiseq_results <- getNOISeqResults(paste("length", ln, sep = "_"), paste("gc", gcn, sep = "_"), paste("between", bn, sep =  "_"), 
                                                between_data, full)
        return(norm_noiseq_results)
      },
      error = function(cond) {

        norm_noiseq_results <- tibble(step1 = paste("length", ln, sep = "_"),
                                      step2 = paste("gc", gcn, sep = "_"), 
                                      step3 = paste("between", bn, sep =  "_"), 
                                      error = cond$message)
        return(norm_noiseq_results)

      })
})

save(ln_norms, file=snakemake@output[["ln_norms"]])
ln_norms <- bind_rows(ln_norms)

normalization_results <- bind_rows(gc_norms, ln_norms)

cat("End of normalization testing\n")
cat("Saving normalization_results.tsv\n") 
write_tsv(normalization_results, snakemake@output[["results"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(NOISeq)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(ggpubr)

w <- 1024
h <- 1024
p <- 24

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", 
                 snakemake@params[["plots_type"]], sep="/")
dir.create(PLOTSDIR, recursive = TRUE)

cat("Loading data")
load(snakemake@input[[1]])

cat("Performing PCA")
mydata <- NOISeq::readData(
  data = full$M, 
  length = full$annot %>% select(gene_id, length) %>% as.data.frame(), 
  biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
  chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), 
  factors = full$targets %>% select(group) %>% as.data.frame(),
  gc = full$annot %>% select(gene_id, gc) %>% as.data.frame())

pca_dat <- dat(mydata, type = "PCA", logtransf = F)
pca_results <- pca_dat@dat$result 
pca_factors <- pca_dat@dat$factors


cat("Getting variance plot")
pc_eigenvalues <- tibble(pc = factor(1:nrow(pca_results$var.exp)), 
                         pct = pca_results$var.exp[,1], 
                         pct_cum = pca_results$var.exp[,2])

g1 <- ggplot(data = pc_eigenvalues[1:50,], aes(x = pc)) +
  geom_col(aes(y = pct)) +
  geom_line(aes(y = pct_cum, group = 1)) + 
  geom_point(aes(y = pct_cum)) +
  labs(x = "Principal component", y = "Fraction variance explained") +
  theme_hc(base_size = 20) +
  theme(axis.text.x = element_text(size=12)) +
  ggtitle("PCA Variance")

ggexport(g1, width = w, height = h/2, pointsize = p, 
         filename = snakemake@output[["variance"]])

cat("Getting scores plot")
pc_scores <- tibble(pc = factor(1:nrow(pca_results$var.exp)),
                    name = rownames(pca_factors),
                    group = factor(pca_factors$group, levels = c("normal", "cancer"), labels = c("Normal", "Cancer")), 
                    pc1 = pca_results$scores[,1],
                    pc2 = pca_results$scores[,2],
                    pc3 = pca_results$scores[,3])

color_pal <- c("#e3a098", "#a32e27")
xrange <- range(pc_scores %>% select(pc1))
yrange <- range(pc_scores %>% select(pc2, pc3))

g1 <- ggplot(pc_scores, aes(x=pc1, y=pc2, color=group)) +
  geom_point(size=2) +
  labs(x = "PC1", y = "PC2") +
  scale_color_manual(values = color_pal) +
  theme_hc(base_size = 20) +
  theme(legend.title = element_blank()) + 
  xlim(xrange) +
  ylim(yrange)

g2 <- ggplot(pc_scores, aes(x=pc1, y=pc3, color=group)) +
  geom_point(size=2) +
  labs(x = "PC1", y = "PC3") +
  scale_color_manual(values = color_pal) +
  theme_hc(base_size = 20) +
  theme(legend.title = element_blank()) +
  xlim(xrange) +
  ylim(yrange)

fig <- ggarrange(g1,  g2, nrow = 1)
fig <- annotate_figure(fig,
                top = text_grob("PCA Scores", size = 25))

ggexport(fig, width = w, height = h/2, pointsize = p, 
         filename = snakemake@output[["score"]])

cat("Getting loading plot")
pc_loadings <- tibble(gene = rownames(pca_results$loadings), 
                      pc1 = pca_results$loadings[, 1],
                      pc2 = pca_results$loadings[, 2])

top_genes <- pc_loadings %>% 
  pivot_longer(matches("pc"), names_to = "pc", values_to = "loading")%>% 
  group_by(pc) %>% 
  arrange(desc(abs(loading))) %>% 
  slice(1:10) %>% 
  pull(gene) %>% 
  unique()

top_loadings <- pc_loadings %>% 
  filter(gene %in% top_genes)

g1 <- ggplot(top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = pc1, yend = pc2), 
               arrow = arrow(length = unit(0.1, "in")),
               color = "brown") +
  geom_text(aes(x = pc1, y = pc2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02)) +
  ggtitle("PCA Loadings") +
  theme_hc(base_size = 15) 

ggexport(g1, width = w/2, height = h/2, pointsize = p, 
         filename = snakemake@output[["loading"]])
import requests
import json
import sys
import pandas as pd
from io import StringIO
import pathlib

logdir = snakemake.params[0] + "/log"
manifestdir = snakemake.params[0] + "/manifests"
pathlib.Path(logdir).mkdir(parents=True, exist_ok=True)
pathlib.Path(manifestdir).mkdir(parents=True, exist_ok=True)

sys.stderr = open(snakemake.log[0], "w")

primary_site = snakemake.params[1]
sample_type =snakemake.params[2]
sst = sample_type

if (sample_type == "cancer"):
    sample_type = "primary tumor"
elif (sample_type == "normal"):
    sample_type = "solid tissue normal"
else:
    sys.exit("Incorrect sample type")

print("Getting data for: " + primary_site + ", " + sample_type)

# Fields for the query
fields = [
    "cases.case_id",
    "file_name"
    ]

fields = ",".join(fields)

# Endpoints used
files_endpt = "https://api.gdc.cancer.gov/files"
manifest_endpt = "https://api.gdc.cancer.gov/manifest"

# Filters to get RNA seq data
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.project.primary_site",
            "value": [primary_site]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "cases.samples.sample_type",
            "value": [sample_type]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.analysis.workflow_type",
            "value": ["HTSeq - Counts"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.data_type",
            "value": ["Gene Expression Quantification"]  
            }
        }
    ]
}

params = {
    "filters": json.dumps(filters),
    "fields": fields,
    "format": "tsv",
    "size": "2000"
    }

# Getting data
response = requests.get(files_endpt, params = params)
data = response.content.decode("utf-8")

print("RNA-seq query done")

if(not data.strip()):
    sys.exit("No records found for " + primary_site + ", " + sample_type)

df = pd.read_csv(StringIO(data), sep ="\t")

df.columns = [col + "_rna" for col in df.columns] 

# Save file ids and file names just in case
df.to_csv(snakemake.output[0], sep="\t", index=False)  

# Getting RNASeq files manifest for future download
params = { "ids" : df["id_rna"].tolist() }
response = requests.post(manifest_endpt, data= json.dumps(params), headers = {"Content-Type":
    "application/json"})

print("Got RNASeq manifest")

f = open(snakemake.output[1], "w")
f.write(response.content.decode("utf-8"))
f.close()

print("RNASeq manifest written")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(NOISeq)
library(readr)
library(dplyr)

STEP1 = snakemake@params[["step1"]]
STEP2 = snakemake@params[["step2"]]
STEP3 = snakemake@params[["step3"]]
##########################################
## ARSyN to reduce batch effect
##########################################
{
  cat("Loading data\n")
  load(snakemake@input[[1]])

  mydata <- NOISeq::readData(
    data = full$M, 
    factors = full$targets %>% select(group) %>% as.data.frame())

  cat("Performing ARSyN for batch correction")
  myARSyN <- ARSyNseq(mydata, norm = "n", logtransf = FALSE)

  ##Saving everything
  full_arsyn <- list(M = assayData(myARSyN)$exprs, annot = full$annot, 
                                targets = full$targets)

  stopifnot(nrow(full$M) == nrow(full_arsyn$annot))
  stopifnot(all(rownames(full_arsyn$M) == rownames(full_arsyn$annot)))

  full <- full_arsyn
  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn.RData", sep = "_"), "\n")
  save(full, file=snakemake@output[["arsyn_rdata"]], compress="xz")

  cat("Generating data matrices with arsyn for Aracne\n")
  ## Data matrices for Aracne
  #normal samples
  normal <- as_tibble(full$M[,full$targets$group == "normal"])
  normal <- bind_cols(gene=as.character(full$annot$gene_id), normal)

  #cancer samples
  cancer <- as_tibble(full$M[,full$targets$group == "cancer"])
  cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer)

  #gene_ids
  symbols <- full$annot %>% select(gene_id)

  cat("Saving arsyn data\n")

  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn_normal.tsv", sep = "_"), "\n")
  write_tsv(normal, file=snakemake@output[["normal_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn_cancer.tsv", sep = "_"), "\n")
  write_tsv(cancer, file=snakemake@output[["cancer_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn_genelist.txt", sep = "_"), "\n")
  write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F)
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Useful libraries
library(ggplot2)
library(reshape2)
library(readr)
library(dplyr)
library(EDASeq)
library(NOISeq)
library(DESeq2)

STEP1 = snakemake@params[["step1"]]
STEP2 = snakemake@params[["step2"]]
STEP3 = snakemake@params[["step3"]]
is_xena = snakemake@params[["xena"]]

RDATA <-paste(snakemake@params[["tissue_dir"]], "rdata", sep="/")
w <- 1024
h <- 1024
p <- 24
##########################################################################
{##### USER SELECTED NORMALIZATION
  cat("Loading data\n")
  load(snakemake@input[[1]])
  cat("Performing normalization. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n")

  first <- strsplit(STEP1, "-")[[1]]
  second <- strsplit(STEP2, "-")[[1]]

  if(first[1] == "length") {
    if(first[2] != "no") {
      norm_counts <- withinLaneNormalization(full$M, full$annot$length, which = first[2])  
    } else {
      norm_counts <- full$M
    }
    if(second[1] == "gc") {
      if(second[2] != "no") {
        norm_counts <- withinLaneNormalization(norm_counts, full$annot$gc, which = second[2])    
      }
    }
  } else if (first[1] == "gc") {
    if(first[2] != "no") {
      norm_counts <- withinLaneNormalization(full$M, full$annot$gc, which = first[2])
    } else {
      norm_counts <- full$M
    }
    if(second[1] == "length") {
      if(second[2] != "no") {
        norm_counts <- withinLaneNormalization(norm_counts, full$annot$length, which = second[2])  
      }
    }
  } else {
    norm_counts <- full$M
  }

  if(STEP3 == "no") {
    norm_counts <- full$M
  } else if (STEP3 == "tmm") {
    norm_counts <- tmm(norm_counts, long = 1000, lc = 0, k = 0)
  } else {
    norm_counts <- betweenLaneNormalization(norm_counts, which = STEP3, offset = FALSE)
  }
  cat("Normalization done. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n")

  # Saving normalized data
  full$M <- norm_counts

  norm_data_cpm10 <- filtered.data(full$M, factor=full$targets$group, 
                                   norm=TRUE, cv.cutoff=100, cpm=10)

  filtered <- nrow(full$M) - nrow(norm_data_cpm10)
  cat("After normalization. There are", filtered, "genes with counts per million mean < 10", 
      nrow(norm_data_cpm10), "with counts per million mean > 10 \n")

  ### There is a problem with xena expression matrix. Values for
  ### genes "ENSG00000203811.1" and "ENSG00000203852.3" are exactly
  ### the same and it affects correlation calculation
  if(is_xena) {
    genes <- rownames(norm_data_cpm10)
    if("ENSG00000203811.1" %in% genes){
      genes <- genes[genes != "ENSG00000203811.1"]
      norm_data_cpm10 <- norm_data_cpm10[genes,]
    }
  }

  full <-list(M = norm_data_cpm10, 
                    annot = full$annot %>% filter(gene_id %in% rownames(norm_data_cpm10)),
                    targets = full$targets)



  stopifnot(nrow(full$M) == nrow(full$annot))
  stopifnot(rownames(full$M) == rownames(full$annot))

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_full.RData", sep = "_"), "\n")
  save(full, file=snakemake@output[["norm_rdata"]], compress="xz")

  cat("Generating data matrices for Aracne\n")
  ## Data matrices for Aracne

  #normal samples
  normal <- as_tibble(full$M[, full$targets$group == "normal"])
  normal <- bind_cols(gene=as.character(full$annot$gene_id), normal)

  #cancer samples
  cancer <- as_tibble(full$M[, full$targets$group == "cancer"])
  cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer)

  #gene_ids
  symbols <- full$annot %>% select(gene_id)

  cat("Saving data\n")

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_normal.tsv", sep = "_"), "\n")
  write_tsv(normal, file=snakemake@output[["normal_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_cancer.tsv", sep = "_"), "\n")
  write_tsv(cancer, file=snakemake@output[["cancer_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_genelist.txt", sep = "_"), "\n")
  write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F)
}
URL: https://github.com/ddiannae/tcga-xena-pipeline
Name: tcga-xena-pipeline