Project: Evaluation of trans regulation in cancer: a systems biology approach


This pipeline produces differential expression and co-expression matrices for normal and cancer tissues from the TCGA dataset, downloaded from GDC, and from the Toil Xena dataset, which integrates TCGA and GTEx data.

Folders in this repository

  • bin: It contains the GDC Data Transfer Tool (gdc-client) command-line client for Linux

  • config: It contains the config.yaml file used for the Snakemake pipeline

  • input: It contains the GENCODE annotation files originally used for TCGA data (v22, according to the GDC reference files page). It also contains GENCODE v37 (April 2021) to filter genes and BioMart Ensembl Genes 80 to add GC content for QC purposes.

  • workflow: It contains all the rules and scripts for the Snakemake pipeline

The pipeline

These are the main steps in the pipeline:

  1. Get data:

    • Get data from Xena: It downloads counts, sample information and annotations directly from the Xena-Toil S3 bucket.

      • Rules file: xena.smk
    • Get data from GDC: It queries GDC to create a manifest and uses the gdc-client tool to download files.

      • Rules file: gdc.smk

      • Script: queryGDC.py

  2. Get raw matrix: This step integrates, for each tissue, raw counts with their respective annotation to obtain a raw matrix.

    • Rules: raw.smk

    • Scripts: addAnnotations.R and getRawMatrix.R

  3. QC: It generates NOISeq plots, filters genes with low expression (mean raw counts < 10), generates PCA and density plots (expression per sample), and removes outlier samples whose mean expression lies more than 2 standard deviations above or below the grand mean (a minimal sketch of this filtering appears after this list).

    • Rules: qc.smk

    • Scripts: NOISeqPlots.R, filterLowExpression.R, PCA.R, densityPlot.R and filterOutliers.R

  4. Normalization: On the first run, it tests different normalization methods and integrates the results. Once the normalization steps are defined for each tissue, it produces the normalized data and runs ARSyN for batch correction.

    • Rules: normalization.smk

    • Scripts: normalizationPlots.R, normalizationTest.R, userNormalization.R, and runArsyn.R. After this step, we obtain a final expression matrix for normal and cancer tissues that share the same gene set.

  5. Get final output:

    • Co-expression computation: It generates PCA and density plots after normalization and runs the ARACNE algorithm inside a Singularity image to obtain a co-expression matrix for each tissue and condition. It can also compute a Pearson correlation matrix if those output files are requested.

      • Rules: correlation.smk

      • Scripts: aracne_matrix.py, getPearsonMatrix.R

    • Differential expression: Like the co-expression computation, this step runs after normalization; it computes gene differential expression and its associated plots.

      • Rules: deg.smk

      • Scripts: deg.R

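The sample filter in step 3 is easier to see in code than in prose. Below is a minimal, self-contained R sketch of the two QC criteria on simulated counts; it only illustrates the rule and is not the pipeline's own filterLowExpression.R/filterOutliers.R scripts.

## Minimal sketch of the step-3 filters (simulated data, base R only)
set.seed(42)
counts <- matrix(rpois(200 * 10, lambda = 50), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))

## Gene filter: drop genes with mean raw count < 10
counts <- counts[rowMeans(counts) >= 10, , drop = FALSE]

## Sample filter: flag samples whose mean expression deviates
## more than 2 SD from the grand mean, then drop them
sample_means <- colMeans(counts)
grand_mean <- mean(sample_means)
grand_sd <- sd(sample_means)
is_outlier <- abs(sample_means - grand_mean) > 2 * grand_sd

outliers <- data.frame(sample = names(sample_means)[is_outlier],
                       sd_from_mean = (sample_means[is_outlier] - grand_mean) / grand_sd)
clean_counts <- counts[, !is_outlier, drop = FALSE]
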
How to run it

Unfortunately, the HTSeq raw counts used in this pipeline were removed by GDC in March 2022 (Data Release 32). This means that the GDC query used here (in the Get data from GDC step) will not return any results. However, all the data generated by this pipeline, including raw counts, can be found here:

Code Snippets

script:
    "../scripts/aracne_matrix.py"
script:
    "../scripts/getPearsonMatrix.R"
script:
    "../scripts/deg.R"
SnakeMake From line 12 of rules/deg.smk
shell:
    """
    mkdir -p {output[0]};
    ./bin/gdc-client download -d {output[0]} -m {input[0]} --retry-amount 3;
    touch {output[1]}
    """
SnakeMake From line 7 of rules/gdc.smk
script:
    "../scripts/queryGDC.py"
SnakeMake From line 25 of rules/gdc.smk
script:
    "../scripts/normalizationPlots.R"
script:
    "../scripts/normalizationTest.R"
script:
    "../scripts/userNormalization.R"
script:
    "../scripts/runArsyn.R"
script:
    "../scripts/NOISeqPlots.R"
SnakeMake From line 11 of rules/qc.smk
script:
    "../scripts/filterLowExpression.R"
SnakeMake From line 22 of rules/qc.smk
script:
    "../scripts/PCA.R"
SnakeMake From line 37 of rules/qc.smk
script:
    "../scripts/densityPlot.R"
SnakeMake From line 51 of rules/qc.smk
script:
    "../scripts/filterOutliers.R"
SnakeMake From line 62 of rules/qc.smk
script:
    "../scripts/addAnnotations.R"
SnakeMake From line 12 of rules/raw.smk
script:
        "../scripts/getRawMatrix.R"
SnakeMake From line 30 of rules/raw.smk
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz -O {output}"
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGTEX_phenotype.txt.gz -O {output}"
SnakeMake From line 15 of rules/xena.smk
shell:
    "mkdir -p {params.xenadir};"
    "wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/probeMap%2Fgencode.v23.annotation.gene.probemap -O {output}"
SnakeMake From line 24 of rules/xena.smk
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
## Get the Work and Data dir
library(readr)
library(dplyr)
library(rtracklayer)

RDATADIR <- paste(snakemake@params[["tissue_dir"]], "rdata", sep="/")
dir.create(RDATADIR)

IS_XENA <- snakemake@params[["is_xena"]]
##############################################################################
## Get annotation data
## 1. Read the original annotation data
## 2. Read the new annotation data
## 3. Keep only genes/protein-coding present in original and new data
## 4. Query bioMart ensembl 80 for GC content 
## 5. Save data
###############################################################################
if(!IS_XENA) {
  cat("Getting annotation file \n")
  cat("Reading original file \n")
  ## Annotation file used for RNA-seq pipeline according to:
  ## https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files
  annot <- rtracklayer::import('input/gencode.v22.annotation.gtf.gz')
  annot <- as.data.frame(annot)

  ## Only protein coding genes
  annot <- annot %>% dplyr::select(gene_id, seqnames, start, end, width, type, 
                                   gene_type, gene_name) %>% 
    filter(type == "gene" & gene_type == "protein_coding")

  annot <- annot %>% mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1))) %>%
    select(-gene_type)

} else {
  annot <- read_tsv(snakemake@input[["xena_annot"]],
                    col_names  = c("gene_id", "gene_name", "seqnames", "start", "end", "strand"), 
                    skip = 1)
  annot <- annot %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1)), 
                                   width = end - start)
  join_by <- "ensembl_id"
}

cat("Reading new file \n")
## Newest gencode file. April, 2021.
annot_new <-  rtracklayer::import('input/gencode.v37.annotation.gtf.gz')
annot_new <- as.data.frame(annot_new)
annot_new <- annot_new %>% dplyr::select(gene_id, gene_name, type, gene_type) %>% 
  dplyr::filter(type == "gene" & gene_type == "protein_coding")

annot_new <- annot_new %>% dplyr::mutate(ensembl_id = unlist(lapply(strsplit(gene_id, "\\."), "[[", 1)))

## Keep genes that remain in the newest annotation file
## but get the newest names and keep only conventional chromosomes
## remove duplicates
annot <- annot %>% dplyr::select(-gene_name) %>%
  inner_join(annot_new %>% dplyr::select(ensembl_id, gene_name, gene_type), by = "ensembl_id") %>%
  filter(seqnames %in% paste0("chr", c(as.character(1:22), "X", "Y"))) %>% distinct()

cat('Annotation file new/old merge: ', paste(dim(annot), collapse=", "), '\n')

## We need GC content per gene for normalization. 
## Query Biomart 80 (according to gencode.v22.annotation.gtf, version 79 was used,
## but it is no longer accessible on the website)
## http://may2015.archive.ensembl.org/biomart
biomart <- read_tsv("input/Biomart_Ensembl80_GRCh38_p2.txt", 
                    col_names = c("ensembl_id", "version", "gc"), skip = 1)

## Get only genes matching ensemblID
biomart <- biomart %>% mutate(gene_id = paste(ensembl_id, version, sep="."))

if(!IS_XENA) {
  annot <- annot %>%
    inner_join(biomart %>% dplyr::select(gene_id, gc), by = "gene_id")
} else {
  annot <- annot %>%
    inner_join(biomart %>% dplyr::select(ensembl_id, gc), by = "ensembl_id")
}

annot <- annot %>% mutate(chr = gsub("chr", "", seqnames)) %>%
  dplyr::rename(length = width) %>%
  dplyr::select(-seqnames) %>%   dplyr::select(gene_id, chr, everything())

cat('Annotation file. Final dimension: ', paste(dim(annot), collapse=", "), '\n')
save(annot, file=snakemake@output[["annot_rdata"]], compress="xz")
write_tsv(annot, snakemake@output[["annot_tsv"]])
cat('annot.RData saved \n')

##############################################################################
## Merging count and annotation
## 1. Build counts matrix. M=normal|tumour
## 2. Build targets matrix. targets=normal+tumor
## 3. Check M and targets integrity
## 4. Filter by annotation file 
## 5. Save the clean data
##############################################################################
{
  cat('Merging counts and annotations \n')

  normal_samples <- list(matrix=read_tsv(snakemake@input[["normal_matrix"]]),
                         targets=read_tsv(snakemake@input[["normal_targets"]])) 
  cancer_samples <- list(matrix=read_tsv(snakemake@input[["cancer_matrix"]]),
                         targets=read_tsv(snakemake@input[["cancer_targets"]])) 
  ## Raw counts
  M <- normal_samples$matrix %>% inner_join(cancer_samples$matrix, by = "gene_id")
  cat('Total number of features and samples: ', paste(dim(M), collapse=" ,"), '\n')

  ## Samples
  targets <- bind_rows(normal_samples$targets, cancer_samples$targets)

  ## Check M and targets integrity (the gene_id column is excluded from the column count)
  stopifnot(nrow(targets) == (ncol(M)-1))
  cat('Number of counts columns match sample number\n')

  ## Filter counts by annotation data
  cat('Adding biomart data\n')
  M <- M %>% semi_join(annot, by = "gene_id")
  annot <- annot %>% semi_join(M, by = "gene_id")

  cat('Total number of annotated (genes/protein-coding) features:', nrow(M), '\n')
  cat('Total number of samples:', ncol(M)-1, '\n')

  no_gc <- sum(is.na(annot$gc))
  cat("There are",no_gc, "entries with no GC info \n")

  ids <- annot %>% filter(!is.na(gc) & !is.na(length)) %>%
    select(gene_id) %>% unlist(use.names = F)

  M <- M %>% filter(gene_id %in% ids) %>% arrange(gene_id)
  annot <- annot %>% filter(gene_id %in% ids) %>% arrange(gene_id)
  cat("Non GC and lenght annotated genes removed.\n")

  ## Save it as a matrix
  ids <- M$gene_id
  MM <- M %>% select(-gene_id) %>% as.matrix() 
  rownames(MM) <- ids
  MM <- MM[,targets$id]

  ## Make sure they are factors
  targets$group <- factor(targets$group, levels=c("cancer", "normal"))

  ##Save clean data
  cat('Saving raw full data \n')
  full <- list(M = MM, annot = annot, targets = targets)

  save(full, file=snakemake@output[["raw_rdata"]], compress="xz")
  cat("raw_full.RData saved \n")
}
from MultiAracne import Aracne
import sys
import pathlib

sys.stdout = open(snakemake.log[0], "w")

exp_matrix = snakemake.input[0]
outmatrix = snakemake.output[0]
outdir = snakemake.params[0] + "/correlation/output_" + snakemake.params[1] + "_" + snakemake.params[2]
procs = int(snakemake.threads)

print(f"Saving files in {outdir}")
pathlib.Path(outdir).mkdir(parents=True, exist_ok=True)

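# Run ARACNE with the requested number of processes, join the resulting .adj files,
# and convert the joined adjacency file into a co-expression matrix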
ma = Aracne(exp_matrix)
ma.run(processes=procs, outdir=outdir, pval=1)
ma.join_adj(outdir, outdir+"/matrix.adj")
Aracne.adj_to_matrix(outdir+"/matrix.adj", outmatrix)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(limma)
library(edgeR)
library(Glimma)
library(readr)
library(dplyr)
library(janitor)
library(ggplot2)
library(ggthemes)

DEGDIR <- paste(snakemake@params[["deg_dir"]])
dir.create(DEGDIR)

TISSUE <- snakemake@params[["tissue"]]

cat("Loading files\n")
load(snakemake@input[["full"]])
load(snakemake@input[["annot"]])

full$targets <- full$targets %>% 
  dplyr::mutate(group = factor(group, levels=c("normal", "cancer")))

cat("Generating model matrix\n")
mm <- model.matrix(~1+group, data = full$targets)
rownames(mm) <- full$targets %>% pull(id)
full$M <- full$M[, rownames(mm)]

cat("Getting voom transformation\n")
y <- voom(full$M, mm, plot = F, save.plot = T)

p <- ggplot() +
  geom_point(aes(x = y$voom.xy$x, y = y$voom.xy$y), size = 0.5) + 
  geom_line(aes(x = y$voom.line$x, y = y$voom.line$y), size = 1, color = "red") +
  xlab(y$voom.xy$xlab) +
  ylab(y$voom.xy$ylab) + 
  theme_bw() +
  ggtitle(TISSUE)

cat("Saving voom plot\n")
png(file = paste0(DEGDIR, "/voom.png"), width = 800, height = 600)
print(p)
dev.off()

cat("Getting linear adjustment\n")
fit <- lmFit(y, mm)
lfc_fit <- eBayes(fit)

cat("Getting Glimma interactive plot plot\n")
dt <- decideTests(lfc_fit, lfc = 2) 

annot <- annot %>% filter(gene_id %in% rownames(full$M)) %>% 
  select(gene_id, ensembl_id, chr, start, end, gene_name) %>%
  arrange(gene_id)

rownames(annot) <- annot$gene_id

glMDPlot(path = DEGDIR, lfc_fit, 
         counts = full$M, groups = full$targets %>% pull(group), 
         status = dt, anno = annot)

p <- ggplot() +
  geom_density(aes(x = lfc_fit$coefficients[,2]), size = 0.5) + 
  theme_bw() +
  xlab("lfc")
  ggtitle(TISSUE)

png(file = paste0(DEGDIR, "/density.png"), width = 800, height = 600)
print(p)
dev.off()

cat("Saving deg results\n")
topTable(lfc_fit, sort.by = "P", n = Inf, adjust.method="BH") %>%
  janitor::clean_names() %>% mutate(gene_id = rownames(.)) %>%
  inner_join(annot %>% 
               select(gene_id, ensembl_id, chr, gene_name), by ="gene_id") %>% 
  select(-gene_id) %>%
  select(ensembl_id, everything()) %>%
  mutate(adj_p_val = ifelse(adj_p_val > 0.05, 1, adj_p_val), 
                           log_fc = ifelse(adj_p_val > 0.05, 0, log_fc)) %>%
  write_tsv(snakemake@output[["deg_results"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(readr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(ggthemes)
TISSUE <- snakemake@params[["tissue"]]
w <- 1024
h <- 1024
p <- 24

cat("Loading data \n")
load(snakemake@input[[1]])

cat("Calculating mean \n")
melted_data <- melt(log2(full$M+1))
means <- colMeans(full$M)
mm <- mean(means)
sd <- sd(means)

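## Flag samples whose mean expression lies more than 2 SD below or above the grand mean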
down_outliers <- which(means < mm-2*sd)
up_outliers <- which(means > mm+2*sd)

if(length(down_outliers) > 0 | length(up_outliers) > 0 ) {
  cat("Outliers found, saving samples names \n")
} else {
  cat("Outliers not found\n")
}

outliers <- data.frame(sample = c(names(down_outliers), names(up_outliers)),
           mean = c(means[down_outliers], means[up_outliers]))
outliers <- outliers %>% mutate(sd_from_mean = (mean - mm)/sd )

write_tsv(outliers, file=snakemake@output[["outliers"]])

pl <- ggplot(data = melted_data, aes(x=value, group=Var2, colour=Var2)) + 
  geom_density(show.legend = F) + 
  theme_hc(base_size = 20) +
  xlab("log2(expr+1)") + 
  ggtitle(TISSUE)

png(file = snakemake@output[["density"]], width = w, height = h/2, pointsize = p)
print(pl)
dev.off()

cat("Density plot for all samples generated\n")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(dplyr)

##########################################################################
load(snakemake@input[[1]])
{ 
  ### We keep only genes with mean expression count > 10 
  exp_genes <- apply(full$M, 1, function(x) mean(x)>10)
  egtable <- table(exp_genes)
  cat("There are", egtable[[1]], "genes with mean expression count < 10", egtable[[2]], "with mean count > 10 \n")
  exp_genes <- names(exp_genes[exp_genes == TRUE])

  non_zero_genes <- apply(full$M, 1, function(x) sum(x == 0) < ncol(full$M)/2)
  egtable <- table(non_zero_genes)
  cat("There are", egtable[[1]], "genes with 0 values in more than 50% samples", egtable[[2]], 
      " genes with 0 values in less than 50% samples \n")
  non_zero_genes <- names(non_zero_genes[non_zero_genes == TRUE])

  genes_pass <- intersect(exp_genes, non_zero_genes)

  ##Filtering low expression genes
  full <- list(M = round(full$M[genes_pass, full$targets$id]),
                 annot = full$annot %>% filter(gene_id %in% genes_pass),
                 targets = full$targets)

  full$annot <- full$annot %>% arrange(gene_id)
  full$M <- full$M[order(row.names(full$M)), full$targets$id]

  stopifnot(all(full$annot$gene_id == rownames(full$M)))
  cat(paste0("Genes in matrix and annotation match positions \n"))

  stopifnot(all(full$targets$id == colnames(full$M)))
  cat(paste0("Samples in matrix and annotation match positions \n"))

  row.names(full$annot) <- full$annot$gene_id

  cat("Saving full_filtered.RData \n") 
  save(full, file=snakemake@output[[1]], compress="xz")
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(readr)
library(dplyr)

cat("Loading files \n")
load(snakemake@input[[1]])
outliers <- read_tsv(snakemake@input[["outliers"]])

cat(nrow(outliers), " outliers will be removed.\n")
targets <- full$targets %>% 
  filter(!id %in% (outliers %>% pull(sample)))

M <- full$M[, targets %>% pull(id)]

full <- list(annot = full$annot, M = M, targets = targets)

dimcancer <- nrow(full$targets %>% filter(group == "cancer"))
dimnormal <- nrow(full$targets %>% filter(group == "normal"))
cat("Final dimensions: ", dimcancer, " cancer and ", dimnormal,
" normal samples \n")

cat("Saving RData \n")
save(full, file=snakemake@output[[1]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(data.table)
library(dplyr)
library(magick)
library(ComplexHeatmap)

cat("Reading matrix \n")
expr_file <- snakemake@input[["expr_matrix"]]
annot_file <-  snakemake@input[["annot"]]

cat("Getting annotations \n")
chr_pal <- c("#D909D1", "#0492EE", "#D2656C", "#106F35", "#5BD2AE", "#199F41", 
             "#FE0F43", "#00FFCC", "#F495C5", "#E1BF5D", "#5F166F", "#088ACA",
             "#41CFE0", "#0F0A71", "#FFFF99", "#B06645", "#651532", "#B925AE",
             "#86C91F", "#CB97E8", "#130B9E", "#EF782B", "#79A5B9", "#F7AA7C")

names(chr_pal) <- c("22","11","12","13","14","15","16","17","18","19","1" ,"2" ,"3" ,"4" ,"5" ,
                    "6" ,"7" ,"X" ,"8" ,"9" ,"20","10","21", "Y")

load(annot_file)
expr_matrix <- fread(expr_file)
annot <- annot %>% 
  mutate(chr = factor(chr, levels=c(as.character(1:22), "X", "Y"))) %>%
  arrange(chr, start)
genes <- expr_matrix %>% pull(gene)
expr_matrix <- expr_matrix %>% select(-gene) %>% as.matrix()
rownames(expr_matrix) <- genes

annot <- annot %>% filter(gene_id %in% genes)
expr_matrix <- expr_matrix[annot$gene_id, ]

expr_matrix <- t(expr_matrix)
cat("Calculating correlation \n")
corr <- cor(expr_matrix, expr_matrix, method="pearson")

chrs <- annot %>% dplyr::select(chr) %>% dplyr::rename(Chr = chr)

cat("Building heatmap \n")
ha <- rowAnnotation(df = chrs, 
                    name = "Chr", show_annotation_name = F,
                    col = list(Chr = chr_pal[as.character(chrs$Chr)]), 
                    width = unit(0.5, "cm"),
                    annotation_legend_param = list(
                      title = "Chr", title_gp = gpar(fontsize = 14), grid_height = unit(0.5, "cm")))


ht <- Heatmap(corr, cluster_rows = F, cluster_columns = F, show_row_names = F, 
        show_column_names = F,  heatmap_legend_param = list( title_gp = gpar(fontsize = 14),
                                                             title = "Pearson \nCorrelation",
                                                             legend_height = unit(4, "cm")), 
        right_annotation = ha)

cat("Saving image \n")
png(snakemake@output[["plot"]], width = 1200, height = 1200)
draw(ht, heatmap_legend_side = "right", annotation_legend_side = "right")
dev.off()

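## Write only the upper triangle so each gene pair appears once in the output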
colnames(corr) <- genes
corr[lower.tri(corr)] <- NA
fwrite(corr, file=snakemake@output[["csv"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
## Get the Work and Data dir
library(data.table)
library(readr)
library(dplyr)
library(janitor)
library(rtracklayer)

TISSUE <- snakemake@params[["tissue"]]
TYPE <-  snakemake@params[["type"]]
IS_XENA <- snakemake@params[["is_xena"]]
MCCORES <- as.numeric(snakemake@threads[[1]])
# ###############################################################################
# ## Get and check count matrices
# ## 1. Find files
# ## 2. Read data
# ## 3. Check sample sizes
# ## 4. Check genes order
# ## 5. Build target matrix
# ###############################################################################

RDATADIR <- paste(snakemake@params[["tissue_dir"]], "results", sep="/")
dir.create(RDATADIR)

if(!IS_XENA) {

  RAWDIR <- paste(snakemake@params[["tissue_dir"]], "raw", sep="/")

  cat(paste0("Checking ", TYPE, " samples \n"))

  ## Find the files
  files_to_read <- list.files(path = paste0(RAWDIR,"/",  TYPE), 
                      pattern = "\\.htseq.counts.gz$", full.names = T, recursive = T)

  ## Read the data
  all_files <- lapply(files_to_read, function(file) {
    data <- read_tsv(file, col_names = c("gene_id", "raw_counts"))
    data <- data %>% filter(!startsWith(gene_id, "_" ))
    return(data)
  })

  ## Check samples sizes
  size <- unique(do.call(rbind,lapply(all_files, dim)))
  stopifnot(nrow(size)==1)
  cat(paste0(TYPE, " samples have the same size \n"))

  ## Check genes order in samples
  genes <- do.call(cbind, lapply(all_files, function(x) dplyr::select(x, "gene_id")))
  genes <- t(unique(t(genes)))
  stopifnot(dim(genes)==c(size[1,1], 1))
  cat(paste0("Genes in ", TYPE, " samples match positions \n"))

  ## Build targets matrix
  targets <- data.frame(id = paste(TISSUE, TYPE, 1:length(files_to_read), sep = "_"),
                        file = unlist(lapply(strsplit(files_to_read, "/"), "[[", 10)),
                        file_id = unlist(lapply(strsplit(files_to_read, "/"), "[[", 9)),
                        group = TYPE, stringsAsFactors = FALSE)

  ## Rename columns in counts matrix
  matrix <- bind_cols(lapply(all_files, function(x) dplyr::select(x, "raw_counts")))
  colnames(matrix) <- targets$id

  matrix <- matrix %>% mutate(gene_id = genes[,1]) %>%
    dplyr::select(gene_id, everything())

} else {
  XENA_COUNTS <- snakemake@input[[1]]
  XENA_SAMPLES <- snakemake@input[[2]]
  PRIMARY <- snakemake@params[["primary"]]
  extended_type <- snakemake@params[["extended_type"]]

  matrix_samples <- read_tsv(XENA_SAMPLES) %>% clean_names()
  matrix_counts <-  fread(XENA_COUNTS, nThread = MCCORES)

  # When the normal samples should come from TCGA in UCSC Xena, the extended type 
  # for normal should be: "Solid Tissue Normal". This is only for testing
  # When the normal samples should come from GTEX in UCSC Xena, the extended type 
  # for normal should be: "Normal Tissue". This is the usual pipeline
  if(is.null(extended_type)) {
    extended_type <- ifelse(TYPE == "normal", "Normal Tissue", "Primary Tumor")  
  }

  cat("Extented type: ", extended_type, "\n")
  cat("Primary disease or tissue: ", PRIMARY, "\n")

  camelTissue <- paste0(toupper(substring(TISSUE, 1, 1)), substring(TISSUE, 2))

  cat("Tissue name: ", camelTissue, "\n")
  targets <- matrix_samples %>% dplyr::filter(primary_site == camelTissue &
                                                primary_disease_or_tissue == PRIMARY &
                                                sample_type == extended_type)
  cat("Got", nrow(targets), " samples\n")

  cat("Getting count matrix\n")
  ### Expected counts from XENA are in log2(expected_count+1)
  ### get them back to expected_count for downstream pipeline
  matrix <- matrix_counts %>% dplyr::select_if(names(.) %in% c("sample", targets$sample)) %>%
    dplyr::select(sample, everything()) %>% dplyr::mutate(across(-sample, ~ 2^.x - 1))
  matrix[matrix < 0] <- 0

  targets <- targets %>% dplyr::filter(sample %in% names(matrix)) %>%
    dplyr::mutate(id = paste(TISSUE, TYPE, 1:nrow(.), sep = "_"), group = TYPE) %>% 
    dplyr::rename(file = sample) %>% select(id, file, group)

  matrix <- matrix %>% dplyr::select(sample, targets$file) 
  colnames(matrix) <- c("gene_id", targets$id)
}

cat(paste0("Matrices for ", TYPE, " ready.\n"))
cat("Saving matrix\n")
write_tsv(matrix, snakemake@output[["matrix"]])
cat("Saving samples\n")
write_tsv(targets, snakemake@output[["samples"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(dplyr)
library(NOISeq)
library(ggplot2)

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", snakemake@params[["plots_type"]], sep="/")
dir.create(PLOTSDIR, recursive = TRUE)
w <- 1024
h <- 1024
p <- 24

load(snakemake@input[[1]])
##########################################
## EXPLORATORY ANALYSIS (NOISeq package)
##########################################
{
  ## Reading data into NOISeq package -> mydata
  mydata <- NOISeq::readData(
    data = full$M, 
    length = full$annot %>% select(gene_id, length) %>% as.data.frame(), 
    biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
    chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), 
    factors = full$targets %>% select(group) %>% as.data.frame(),
    gc = full$annot %>% select(gene_id, gc) %>% as.data.frame())

}
##########################################
## Plots
##########################################
{
  # Biodetection plot. Per group.
  mybiodetection <- dat(mydata, type="biodetection", factor="group", k=0)
  png(filename=paste(PLOTSDIR, "biodetection.Rd_%03d.png", sep="/"), 
      width=w, height=h, pointsize=p)
  explo.plot(mybiodetection)
  dev.off()
  cat("Biodetection plots generated\n")

  ## Count distribution per biotype. Using count per million, only for one sample
  mycountsbio <- dat(mydata, factor = NULL, type = "countsbio")
  png(filename=paste(PLOTSDIR, "countsbio.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = 1, samples = 1, plottype = "boxplot")
  dev.off()
  cat("Counts distribution plot per biotype and one sample generated\n")
  #What about expression level?

  ## Count distribution per sample
  mycountsbio <- dat(mydata, factor = NULL, type = "countsbio")
  png(paste(PLOTSDIR, "protein_coding_boxplot.png", sep="/"), width=w*2, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "boxplot")
  dev.off()
  cat("Counts distribution plot for protein coding and all samples generated\n")

  png(paste(PLOTSDIR, "protein_coding_barplot.png", sep="/"), width=w*2, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "barplot")
  dev.off()
  cat("Counts distribution barplot for protein coding biotype and all samples generated\n")

  mycountsbio <- dat(mydata, factor = "group", type = "countsbio")
  ## Count distribution per Experimental factors
  png(paste(PLOTSDIR, "protein_coding_boxplot_group.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "boxplot")
  dev.off()
  cat("Counts distribution boxplot for protein coding biotype and group generated\n")

  png(paste(PLOTSDIR, "protein_coding_barplot_group.png", sep="/"),
      width=w, height=h, pointsize=p)
  explo.plot(mycountsbio, toplot = "protein_coding",
      samples = NULL, plottype = "barplot")
  dev.off()
  cat("Counts distribution barplot for protein coding biotype and group generated\n")
  # How much sensitivity do we lose?

}
##########################################
## Bias
##########################################
{
  ## Length bias detection
  mylengthbias <- dat(mydata, factor="group", type="lengthbias")
  png(paste(PLOTSDIR, "length_bias.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mylengthbias, samples = NULL, toplot = "global")
  dev.off()
  cat("Lenght bias plot generated\n")
  # Do we see a clear pattern?

  ## GC bias
  mygcbias <- dat(mydata, factor = "group", type="GCbias")
  png(paste(PLOTSDIR, "gc_bias.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mygcbias, samples = NULL, toplot = "global")
  dev.off()
  cat("GC bias plot generated\n")
  # Do we see a clear pattern?

  ## RNA composition
  mycomp <- dat(mydata, type="cd")
  png(paste(PLOTSDIR, "rna_composition.png", sep="/"), width=w, height=h, pointsize=p)
  explo.plot(mycomp, samples=1:12)
  dev.off()
  cat("RNA composition plot generated\n")
  # Are samples comparable?
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(ggplot2)
library(reshape2)
library(grid)
library(png)
library(gridExtra)
library(readr)
library(dplyr)

###############################################################################
args <- commandArgs(trailingOnly = T)

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/")
PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/")
w <- 1024
h <- 1024
p <- 24

{ 
  cat("Integrating plots \n.")

  length_norm <- c("no","full", "loess", "median", "upper")
  gc_norm <- c("no", "full", "loess", "median", "upper")
  between_norm <- c("no", "full", "median", "tmm", "upper")

  ## This function will retrieve plots for one normalization set and will create 
  ## one single plot
  savePlots <- function(step1, step2, step3) {

    plotname <- paste(step1, step2, step3, sep = "_")  
    pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), 
                  paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), 
                  paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png"))

    thePlots <- lapply (pngPlots, function(pngFile) {
      rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE)
    })

    plotpath <- paste(PLOTSNORMDIR, paste(plotname, "png", sep ="."), sep="/")
    png(plotpath,  height=h/2, width=w*(3/2))
    par(oma = c(0, 0, 1.5, 0))
    plot.new()
    do.call(grid.arrange, c(thePlots,  ncol = 3, nrow=1))
    mtext(plotname, outer = TRUE, cex = 1.5)
    dev.off()

    unlink(pngPlots)
    return(plotpath)
  } 

  df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, 
                                   bn = between_norm, stringsAsFactors = F)

  cat("Getting plots for all ", nrow(df_normalizations), "normalization combinations.\n")

  plots <- lapply(1:nrow(df_normalizations),  function(i){
                         gcn <- df_normalizations[i, "gcn"]
                         ln <- df_normalizations[i, "ln"]
                         bn <- df_normalizations[i, "bn"]
                         gcp <- NA
                         tryCatch(expr = {
                           gcp <- savePlots(paste("gc", gcn, sep = "_"), 
                                     paste("length", ln, sep = "_"), 
                                     paste("between", bn, sep =  "_"))
                         }, error = function(cond){

                           cat(cond$message, "\n")
                         })
                         lp <- NA
                         tryCatch(expr = {
                           lp <- savePlots(paste("length", ln, sep = "_"), 
                                     paste("gc", gcn, sep = "_"), 
                                     paste("between", bn, sep =  "_"))
                         }, error = function(cond){
                           cat(cond$message, "\n")
                         })
              return(list(gcp, lp))         
          })

  plots <- sort(unlist(plots))

  thePlots <- lapply (plots, function(pngFile) {
    rasterGrob(readPNG(pngFile, native = FALSE), interpolate = FALSE)
  })

  cat("Saving normalization_plots.pdf\n") 
  pdf(paste(PLOTSDIR, "normalization_plots.pdf", sep="/"), title="Normalization Plots")
  print(marrangeGrob(thePlots, nrow = 3, ncol = 1, top = NULL))
  dev.off()
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(reshape2)
library(grid)
library(png)
library(gridExtra)
library(parallel)
library(readr)
library(dplyr)
library(NOISeq)
library(EDASeq)

###############################################################################
MCCORES <- as.numeric(snakemake@threads[[1]])
PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", sep="/")
PLOTSNORMDIR <- paste(PLOTSDIR, "normalization", sep = "/")
dir.create(PLOTSNORMDIR)

w <- 1024
h <- 1024
p <- 24

load(snakemake@input[[1]])

cat("Testing normalization methods\n.")
rawEDA <- EDASeq::newSeqExpressionSet(
  counts = full$M,
  featureData = data.frame(full$annot, row.names = full$annot$gene_id),
  phenoData = data.frame(
    conditions = full$targets$group,
    row.names = full$targets$id))

length_norm <- c("no", "full", "loess", "median", "upper")
gc_norm <- c("no", "full", "loess", "median", "upper")
between_norm <-c("no", "full", "median", "tmm", "upper")

## This function gets the relevant statistics for the regression methods for GC and Length bias
getRegressionStatistics <- function(regressionmodel) {
  name <- names(regressionmodel)
  rsquared <- summary(regressionmodel[[1]])$r.squared
  fstatistic <- summary(regressionmodel[[1]])$fstatistic
  pvalue <- signif(pf(q = fstatistic[1], df1 = fstatistic[2], 
                      df2 = fstatistic[3], lower.tail = FALSE), 2)
  return(list("name" = name, "r2" = rsquared, "p" = pvalue))
}

## This function will retrieve the NOISeq results to test the normalization combination
getNOISeqResults <- function(step1, step2, step3, n_counts, m10_data) {
  ### Check the NOISEq results 
  mydata <- NOISeq::readData(
    data = n_counts, 
    length = m10_data$annot %>% select(gene_id, length) %>% as.data.frame(), 
    biotype = m10_data$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
    chromosome = m10_data$annot %>% select(chr, start, end) %>% as.data.frame(),  
    factors = m10_data$targets %>% select(group) %>% as.data.frame(), 
    gc = m10_data$annot %>% select(gene_id, gc)%>% as.data.frame())

  nsamples <- dim(m10_data$targets)[1]

  ### Length bias 
  mylengthbias <- dat(mydata, factor="group", norm = TRUE, type="lengthbias")
  l_stats_1 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[1])
  l_stats_2 <- getRegressionStatistics(mylengthbias@dat$RegressionModels[2])

  ## GC Bias
  mygcbias <- dat(mydata, factor = "group", norm = TRUE, type ="GCbias")
  gc_stats_1 <- getRegressionStatistics(mygcbias@dat$RegressionModels[1])
  gc_stats_2 <- getRegressionStatistics(mygcbias@dat$RegressionModels[2])

  #RNA Composition
  myrnacomp <- dat(mydata, norm = TRUE, type="cd")
  dtable <- table(myrnacomp@dat$DiagnosticTest[,  "Diagnostic Test"])
  if (is.na(dtable["PASSED"])) dtable <- data.frame(PASSED = 0)

  cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " calculations done\n")
  plotname <- paste(step1, step2, step3, sep = "_")  
  pngPlots <- c(paste0(PLOTSNORMDIR, "/", plotname, "_length_bias.png"), 
                paste0(PLOTSNORMDIR, "/", plotname, "_gc_bias.png"), 
                paste0(PLOTSNORMDIR, "/", plotname, "_rna_composition.png"))

  png(pngPlots[1], width=w/2, height=h/2)
  explo.plot(mylengthbias, samples = NULL, toplot = "global")
  dev.off()

  png(pngPlots[2], width=w/2, height=h/2)
  explo.plot(mygcbias, samples = NULL, toplot = "global")
  dev.off()

  png(pngPlots[3],width=w/2, height=h/2)
  explo.plot(myrnacomp, samples = 1:12)
  dev.off()

  cat("Step1:", step1, ", step2:", step2, " step3: ", step3, " plots saved\n")

  norm_set_results <- tibble(step1, step2, step3, 
                                 l_stats_1$r2, l_stats_1$p, l_stats_2$r2, l_stats_2$p,
                                 gc_stats_1$r2, gc_stats_1$p, gc_stats_2$r2, gc_stats_2$p, 
                                 dtable["PASSED"], dtable["PASSED"]/nsamples)

  colnames(norm_set_results) <- c("step1", "step2", "step3", 
                                  paste("length", l_stats_1$name, "R2", sep = "_"), paste("length", l_stats_1$name, "p-value", sep = "_"),  
                                  paste("length", l_stats_2$name, "R2", sep = "_"), paste("length", l_stats_2$name, "p-value", sep = "_"), 
                                  paste("gc", gc_stats_1$name, "R2", sep = "_"), paste("gc", gc_stats_1$name, "p-value", sep = "_"), 
                                  paste("gc", gc_stats_2$name, "R2", sep = "_"), paste("gc", gc_stats_2$name, "p-value", sep = "_"),
                                  "rna_passed_samples", "rna_passed_proportion")
  return(norm_set_results)
}

df_normalizations <- expand.grid(gcn = gc_norm, ln = length_norm, 
                                bn = between_norm, stringsAsFactors = F)

cat("Testing all ", nrow(df_normalizations), "normalization combinations\n")


## Normalizations with GC step first
gc_norms <- mclapply(X = 1:nrow(df_normalizations), 
                      mc.cores = MCCORES, 
                      mc.preschedule = FALSE,
                      FUN = function(i){

    gcn <- df_normalizations[i, "gcn"]
    ln <- df_normalizations[i, "ln"]
    bn <- df_normalizations[i, "bn"]

    tryCatch(
      expr = {

        if(gcn == "no") {
          gcn_data <- counts(rawEDA)
        } else {
          gcn_data <- withinLaneNormalization(counts(rawEDA), full$annot$gc, which = gcn)  
        }

        if(ln == "no") {
          ln_data <- gcn_data
        } else {
          ln_data <- withinLaneNormalization(gcn_data, full$annot$length, which = ln)
        }

        if (bn == "no") {
          between_data <- ln_data
        } else if (bn == "tmm") {
          between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0)
        } else {
          between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE)
        }
        cat("Testing with GC normalization: ", gcn, ", length normalization: ", ln, " and between lane normalization: ", bn, "\n")
        norm_noiseq_results <- getNOISeqResults(paste("gc", gcn, sep = "_"), paste("length", ln, sep = "_"), paste("between", bn, sep =  "_"), 
                                                between_data, full)
        return(norm_noiseq_results)
      },
      error = function(cond) {

        norm_noiseq_results <- tibble(step1 = paste("gc", gcn, sep = "_"), 
                                   step2 = paste("length", ln, sep = "_"),
                                   step3 = paste("between", bn, sep =  "_"), 
                                   error = cond$message)
        return(norm_noiseq_results)

      })
})

save(gc_norms, file=snakemake@output[["gc_norms"]])
gc_norms <- bind_rows(gc_norms)

## Normalizations with Length step first
ln_norms <- mclapply(X = 1:nrow(df_normalizations), 
                      mc.preschedule = FALSE,
                      mc.cores = MCCORES,
                      FUN = function(i){

    gcn <- df_normalizations[i, "gcn"]
    ln <- df_normalizations[i, "ln"]
    bn <- df_normalizations[i, "bn"]

    tryCatch(
      expr = {

        if(ln == "no") {
          ln_data <- counts(rawEDA)
        } else {
          ln_data <- withinLaneNormalization(counts(rawEDA), full$annot$length, which = ln)
        }

        if(gcn == "no") {
          gcn_data <- ln_data
        } else {
          gcn_data <- withinLaneNormalization(ln_data, full$annot$gc, which = gcn)
        }

        if (bn == "no") {
          between_data <- ln_data
        } else if (bn == "tmm") {
          between_data <- tmm(ln_data, long = 1000, lc = 0, k = 0)
        } else {
          between_data <- betweenLaneNormalization(ln_data, which = bn, offset = FALSE)
        }
        cat("Testing with length normalization: ", ln, "GC normalization: ", gcn, " and between lane normalization: ", bn, "\n")
        norm_noiseq_results <- getNOISeqResults(paste("length", ln, sep = "_"), paste("gc", gcn, sep = "_"), paste("between", bn, sep =  "_"), 
                                                between_data, full)
        return(norm_noiseq_results)
      },
      error = function(cond) {

        norm_noiseq_results <- tibble(step1 = paste("length", ln, sep = "_"),
                                      step2 = paste("gc", gcn, sep = "_"), 
                                      step3 = paste("between", bn, sep =  "_"), 
                                      error = cond$message)
        return(norm_noiseq_results)

      })
})

save(ln_norms, file=snakemake@output[["ln_norms"]])
ln_norms <- bind_rows(ln_norms)

normalization_results <- bind_rows(gc_norms, ln_norms)

cat("End of normalization testing\n")
cat("Saving normalization_results.tsv\n") 
write_tsv(normalization_results, snakemake@output[["results"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(NOISeq)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(ggpubr)

w <- 1024
h <- 1024
p <- 24

PLOTSDIR <-paste(snakemake@params[["tissue_dir"]], "plots", 
                 snakemake@params[["plots_type"]], sep="/")
dir.create(PLOTSDIR, recursive = TRUE)

cat("Loading data")
load(snakemake@input[[1]])

cat("Performing PCA")
mydata <- NOISeq::readData(
  data = full$M, 
  length = full$annot %>% select(gene_id, length) %>% as.data.frame(), 
  biotype = full$annot %>% select(gene_id, gene_type) %>% as.data.frame(), 
  chromosome = full$annot %>% select(chr, start, end) %>% as.data.frame(), 
  factors = full$targets %>% select(group) %>% as.data.frame(),
  gc = full$annot %>% select(gene_id, gc) %>% as.data.frame())

pca_dat <- dat(mydata, type = "PCA", logtransf = F)
pca_results <- pca_dat@dat$result 
pca_factors <- pca_dat@dat$factors


cat("Getting variance plot")
pc_eigenvalues <- tibble(pc = factor(1:nrow(pca_results$var.exp)), 
                         pct = pca_results$var.exp[,1], 
                         pct_cum = pca_results$var.exp[,2])

g1 <- ggplot(data = pc_eigenvalues[1:50,], aes(x = pc)) +
  geom_col(aes(y = pct)) +
  geom_line(aes(y = pct_cum, group = 1)) + 
  geom_point(aes(y = pct_cum)) +
  labs(x = "Principal component", y = "Fraction variance explained") +
  theme_hc(base_size = 20) +
  theme(axis.text.x = element_text(size=12)) +
  ggtitle("PCA Variance")

ggexport(g1, width = w, height = h/2, pointsize = p, 
         filename = snakemake@output[["variance"]])

cat("Getting scores plot")
pc_scores <- tibble(pc = factor(1:nrow(pca_results$var.exp)),
                    name = rownames(pca_factors),
                    group = factor(pca_factors$group, levels = c("normal", "cancer"), labels = c("Normal", "Cancer")), 
                    pc1 = pca_results$scores[,1],
                    pc2 = pca_results$scores[,2],
                    pc3 = pca_results$scores[,3])

color_pal <- c("#e3a098", "#a32e27")
xrange <- range(pc_scores %>% select(pc1))
yrange <- range(pc_scores %>% select(pc2, pc3))

g1 <- ggplot(pc_scores, aes(x=pc1, y=pc2, color=group)) +
  geom_point(size=2) +
  labs(x = "PC1", y = "PC2") +
  scale_color_manual(values = color_pal) +
  theme_hc(base_size = 20) +
  theme(legend.title = element_blank()) + 
  xlim(xrange) +
  ylim(yrange)

g2 <- ggplot(pc_scores, aes(x=pc1, y=pc3, color=group)) +
  geom_point(size=2) +
  labs(x = "PC1", y = "PC3") +
  scale_color_manual(values = color_pal) +
  theme_hc(base_size = 20) +
  theme(legend.title = element_blank()) +
  xlim(xrange) +
  ylim(yrange)

fig <- ggarrange(g1,  g2, nrow = 1)
fig <- annotate_figure(fig,
                top = text_grob("PCA Scores", size = 25))

ggexport(fig, width = w, height = h/2, pointsize = p, 
         filename = snakemake@output[["score"]])

cat("Getting loading plot")
pc_loadings <- tibble(gene = rownames(pca_results$loadings), 
                      pc1 = pca_results$loadings[, 1],
                      pc2 = pca_results$loadings[, 2])

top_genes <- pc_loadings %>% 
  pivot_longer(matches("pc"), names_to = "pc", values_to = "loading")%>% 
  group_by(pc) %>% 
  arrange(desc(abs(loading))) %>% 
  slice(1:10) %>% 
  pull(gene) %>% 
  unique()

top_loadings <- pc_loadings %>% 
  filter(gene %in% top_genes)

g1 <- ggplot(top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = pc1, yend = pc2), 
               arrow = arrow(length = unit(0.1, "in")),
               color = "brown") +
  geom_text(aes(x = pc1, y = pc2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02)) +
  ggtitle("PCA Loadings") +
  theme_hc(base_size = 15) 

ggexport(g1, width = w/2, height = h/2, pointsize = p, 
         filename = snakemake@output[["loading"]])
import requests
import json
import sys
import pandas as pd
from io import StringIO
import pathlib

logdir = snakemake.params[0] + "/log"
manifestdir = snakemake.params[0] + "/manifests"
pathlib.Path(logdir).mkdir(parents=True, exist_ok=True)
pathlib.Path(manifestdir).mkdir(parents=True, exist_ok=True)

sys.stderr = open(snakemake.log[0], "w")

primary_site = snakemake.params[1]
sample_type =snakemake.params[2]
sst = sample_type

if (sample_type == "cancer"):
    sample_type = "primary tumor"
elif (sample_type == "normal"):
    sample_type = "solid tissue normal"
else:
    sys.exit("Incorrect sample type")

print("Getting data for: " + primary_site + ", " + sample_type)

# Fields for the query
fields = [
    "cases.case_id",
    "file_name"
    ]

fields = ",".join(fields)

# Endpoints used
files_endpt = "https://api.gdc.cancer.gov/files"
manifest_endpt = "https://api.gdc.cancer.gov/manifest"

# Filters to get RNA seq data
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.project.primary_site",
            "value": [primary_site]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "cases.samples.sample_type",
            "value": [sample_type]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.analysis.workflow_type",
            "value": ["HTSeq - Counts"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.data_type",
            "value": ["Gene Expression Quantification"]  
            }
        }
    ]
}

params = {
    "filters": json.dumps(filters),
    "fields": fields,
    "format": "tsv",
    "size": "2000"
    }

# Getting data
response = requests.get(files_endpt, params = params)
data = response.content.decode("utf-8")

print("RNA-seq query done")

if(not data.strip()):
    sys.exit("No records found for " + primary_site + ", " + sample_type)

df = pd.read_csv(StringIO(data), sep ="\t")

df.columns = [col + "_rna" for col in df.columns] 

# Save file ids and file names just in case
df.to_csv(snakemake.output[0], sep="\t", index=False)  

# Getting RNASeq files manifest for future download
params = { "ids" : df["id_rna"].tolist() }
response = requests.post(manifest_endpt, data= json.dumps(params), headers = {"Content-Type":
    "application/json"})

print("Got RNASeq manifest")

f = open(snakemake.output[1], "w")
f.write(response.content.decode("utf-8"))
f.close()

print("RNASeq manifest written")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(NOISeq)
library(readr)
library(dplyr)

STEP1 = snakemake@params[["step1"]]
STEP2 = snakemake@params[["step2"]]
STEP3 = snakemake@params[["step3"]]
##########################################
## ARSyN to reduce batch effect
##########################################
{
  cat("Loading data\n")
  load(snakemake@input[[1]])

  mydata <- NOISeq::readData(
    data = full$M, 
    factors = full$targets %>% select(group) %>% as.data.frame())

  cat("Performing ARSyN for batch correction")
  myARSyN <- ARSyNseq(mydata, norm = "n", logtransf = FALSE)

  ##Saving everything
  full_arsyn <- list(M = assayData(myARSyN)$exprs, annot = full$annot, 
                                targets = full$targets)

  stopifnot(nrow(full$M) == nrow(full_arsyn$annot))
  stopifnot(all(rownames(full_arsyn$M) == rownames(full_arsyn$annot)))

  full <- full_arsyn
  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn.RData", sep = "_"), "\n")
  save(full, file=snakemake@output[["arsyn_rdata"]], compress="xz")

  cat("Generating data matrices with arsyn for Aracne\n")
  ## Data matrices for Aracne
  #normal samples
  normal <- as_tibble(full$M[,full$targets$group == "normal"])
  normal <- bind_cols(gene=as.character(full$annot$gene_id), normal)

  #cancer samples
  cancer <- as_tibble(full$M[,full$targets$group == "cancer"])
  cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer)

  #gene_ids
  symbols <- full$annot %>% select(gene_id)

  cat("Saving arsyn data\n")

  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn_normal.tsv", sep = "_"), "\n")
  write_tsv(normal, file=snakemake@output[["normal_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_arsyn_cancer.tsv", sep = "_"), "\n")
  write_tsv(cancer, file=snakemake@output[["cancer_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3,  "norm_arsyn_genelist.txt", sep = "_"), "\n")
  write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F)
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Useful libraries
library(ggplot2)
library(reshape2)
library(readr)
library(dplyr)
library(EDASeq)
library(NOISeq)
library(DESeq2)

STEP1 = snakemake@params[["step1"]]
STEP2 = snakemake@params[["step2"]]
STEP3 = snakemake@params[["step3"]]
is_xena = snakemake@params[["xena"]]

RDATA <-paste(snakemake@params[["tissue_dir"]], "rdata", sep="/")
w <- 1024
h <- 1024
p <- 24
##########################################################################
{##### USER SELECTED NORMALIZATION
  cat("Loading data\n")
  load(snakemake@input[[1]])
  cat("Performing normalization. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n")

  first <- strsplit(STEP1, "-")[[1]]
  second <- strsplit(STEP2, "-")[[1]]

  if(first[1] == "length") {
    if(first[2] != "no") {
      norm_counts <- withinLaneNormalization(full$M, full$annot$length, which = first[2])  
    } else {
      norm_counts <- full$M
    }
    if(second[1] == "gc") {
      if(second[2] != "no") {
        norm_counts <- withinLaneNormalization(norm_counts, full$annot$gc, which = second[2])    
      }
    }
  } else if (first[1] == "gc") {
    if(first[2] != "no") {
      norm_counts <- withinLaneNormalization(full$M, full$annot$gc, which = first[2])
    } else {
      norm_counts <- full$M
    }
    if(second[1] == "length") {
      if(second[2] != "no") {
        norm_counts <- withinLaneNormalization(norm_counts, full$annot$length, which = second[2])  
      }
    }
  } else {
    norm_counts <- full$M
  }

  if(STEP3 == "no") {
    norm_counts <- full$M
  } else if (STEP3 == "tmm") {
    norm_counts <- tmm(norm_counts, long = 1000, lc = 0, k = 0)
  } else {
    norm_counts <- betweenLaneNormalization(norm_counts, which = STEP3, offset = FALSE)
  }
  cat("Normalization done. Step1: ", STEP1, ", Step2: ", STEP2, " Step3: ", STEP3, "\n")

  # Saving normalized data
  full$M <- norm_counts

  norm_data_cpm10 <- filtered.data(full$M, factor=full$targets$group, 
                                   norm=TRUE, cv.cutoff=100, cpm=10)

  filtered <- nrow(full$M) - nrow(norm_data_cpm10)
  cat("After normalization. There are", filtered, "genes with counts per million mean < 10", 
      nrow(norm_data_cpm10), "with counts per million mean > 10 \n")

  ### There is a problem with xena expression matrix. Values for
  ### genes "ENSG00000203811.1" and "ENSG00000203852.3" are exactly
  ### the same and it affects correlation calculation
  if(is_xena) {
    genes <- rownames(norm_data_cpm10)
    if("ENSG00000203811.1" %in% genes){
      genes <- genes[genes != "ENSG00000203811.1"]
      norm_data_cpm10 <- norm_data_cpm10[genes,]
    }
  }

  full <-list(M = norm_data_cpm10, 
                    annot = full$annot %>% filter(gene_id %in% rownames(norm_data_cpm10)),
                    targets = full$targets)



  stopifnot(nrow(full$M) == nrow(full$annot))
  stopifnot(rownames(full$M) == rownames(full$annot))

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_full.RData", sep = "_"), "\n")
  save(full, file=snakemake@output[["norm_rdata"]], compress="xz")

  cat("Generating data matrices for Aracne\n")
  ## Data matrices for Aracne

  #normal samples
  normal <- as_tibble(full$M[, full$targets$group == "normal"])
  normal <- bind_cols(gene=as.character(full$annot$gene_id), normal)

  #cancer samples
  cancer <- as_tibble(full$M[, full$targets$group == "cancer"])
  cancer <- bind_cols(gene=as.character(full$annot$gene_id), cancer)

  #gene_ids
  symbols <- full$annot %>% select(gene_id)

  cat("Saving data\n")

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_normal.tsv", sep = "_"), "\n")
  write_tsv(normal, file=snakemake@output[["normal_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_cancer.tsv", sep = "_"), "\n")
  write_tsv(cancer, file=snakemake@output[["cancer_matrix"]])

  cat("Saving", paste(STEP1, STEP2, STEP3, "norm_genelist.txt", sep = "_"), "\n")
  write_tsv(symbols, file=snakemake@output[["gene_list"]], col_names = F)
}
URL: https://github.com/ddiannae/tcga-xena-pipeline
Name: tcga-xena-pipeline