Ribosome Profiling Data Analysis with Snakemake: public_riboseq Workflow

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. Insert your code into the respective folders, i.e. scripts , rules , and envs . Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file.

Usage

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

Create a new github repository using this workflow as a template .
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.

Step 3: Install Snakemake

Install Snakemake using conda :

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation .

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here .

Code Snippets

library("IHW")
library("DESeq2")
library("ggplot2")
library("RColorBrewer")
library("pheatmap")
library("tximport")
#library("biomart")

print(snakemake@output[[1]])

#Carrega os nomes dos arquivos de expressão de cada biblioteca
tx2gene <- read.csv(snakemake@params[["transcripts_to_genes"]], header = FALSE, sep=",")
head(tx2gene)
print("\n\n")
head(snakemake@params[["samples_file"]])
print("\n\n")
samples_table <- read.csv(snakemake@params[["samples_file"]], header = TRUE, sep=",")
samples_table <- samples_table[!duplicated(samples_table$sample), ]
samples <- unique(samples_table$sample)
files <- file.path("results/kallisto",samples, "abundance.tsv")
names(files) <- samples
files
print("\n\n")
#Importa os resultados de expressão de cada biblioteca
txi.kallisto.tsv <- tximport(files, type = "kallisto",  tx2gene = tx2gene, dropInfReps = TRUE)
print(samples)
head(txi.kallisto.tsv$counts)
#colnames(txi.kallisto.tsv$counts) <- samples
dim(txi.kallisto.tsv$counts)

colDatak <-data.frame(row.names=colnames(txi.kallisto.tsv$counts), condition=as.factor(samples_table$condition), type=as.factor(samples_table$type))
colDatak

ddsk <- DESeqDataSetFromTximport(txi = txi.kallisto.tsv, colData = colDatak, design = ~ type + condition + type:condition)

#Cria objeto do DESeq2
ddsk <- DESeq(ddsk)

print("Criar gráfico do Heatmap")
#Cria gráfico do Heatmap
vsdk <- rlog(ddsk, blind=FALSE)
sampleDistsk <- dist(t(assay(vsdk)))
sampleDistMatrixk <- as.matrix(sampleDistsk)
rownames(sampleDistMatrixk) <- paste(vsdk$type, vsdk$condition, sep="-")
#colnames(sampleDistMatrixk)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
print("Heatmap")
png("Heatmap.png")
pheatmap(sampleDistMatrixk,
         clustering_distance_rows=sampleDistsk,
         clustering_distance_cols=sampleDistsk,
         col=colors)
dev.off()

#Cria PCA
pcaDatak <- plotPCA(vsdk, intgroup=c("type", "condition"), returnData=TRUE)
percentVark <- round(100 * attr(pcaDatak, "percentVar"))
png("PCA.png")
ggplot(pcaDatak, aes(PC1, PC2, color=condition, shape=type)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVark[1],"% variance")) +
  ylab(paste0("PC2: ",percentVark[2],"% variance")) +
  coord_fixed()
dev.off()
print("Results")

ddsk$group <- factor(paste0(ddsk$type, "_", ddsk$condition))
design(ddsk) <- ~ group
ddsk <- DESeq(ddsk)
print(ddsk$group)

#Comparações que serão feitas
res_trap_npc_neuron<-results(ddsk, contrast=c("group","trap_Neuron","trap_NPC"),  filterFun=ihw)
res_rna_npc_neuron<-results(ddsk, contrast=c("group","rna_Neuron","rna_NPC"),  filterFun=ihw)
res_trap_npc_rna_npc<-results(ddsk, contrast=c("group","trap_NPC","rna_NPC"),  filterFun=ihw)
res_trap_neuron_rna_neuron<-results(ddsk, contrast=c("group","trap_Neuron","rna_Neuron"),  filterFun=ihw)

design(ddsk) <- ~ type + condition + type:condition
ddsk <- DESeq(ddsk, test = "LRT", reduced = ~type + condition)
resk <- results(ddsk,  filterFun=ihw)

res_neuron <- results(ddsk, name="typetrap.conditionNPC", test="Wald")

print(resultsNames(ddsk))

#Adiciona anotação para os genes
split_gene_id <- function(word){
    strsplit(word, ".", fixed = TRUE)[[1]][1]
    #unlist(strsplit(my.string, ".", fixed = TRUE))[1]
}

gene_id = apply(as.matrix(rownames(resk)), 1, split_gene_id)

rownames(resk) <- gene_id

resk$ensembl_id <- rownames(resk)
#martk <- biomaRt::useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
#genes.tablek <- biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name", "description", "gene_biotype"), mart = martk)
#resk <- merge(x = as.matrix(resk), y = genes.tablek, by.x = "ensembl_id", by.y = "ensembl_gene_id", all.x = T, all.y = F )
counts_normalizedk<-counts(ddsk, normalized=TRUE)
counts_rawk<-counts(ddsk)
Tabelao<-cbind(res_trap_npc_neuron,res_rna_npc_neuron,res_trap_npc_rna_npc,res_trap_neuron_rna_neuron,resk,res_neuron,counts_rawk,counts_normalizedk)

write.table(as.data.frame(Tabelao),file=snakemake@output[[1]], sep = "\t")

R ggplot2 DESeq2 kallisto RColorBrewer PCAtools pheatmap tximport IHW From line 2 of scripts/kallisto_deseq2.R

wrapper:
    "v0.75.0/bio/sra-tools/fasterq-dump"

SnakeMake From line 35 of workflow/Snakefile

shell:
    "trim_galore {params.trim} --cores {threads} --output_dir {output} {input} 2> {log}"

SnakeMake Trim_Galore From line 51 of workflow/Snakefile

shell:
    """
    set +e
    FQ1=`echo {input} | awk '{{for(X=1;X<=NF;X++){{OUT=OUT $X"/*_1_val_1.fq.gz "}}}}END{{print OUT}}'`
    FQ2=`echo {input} | awk '{{for(X=1;X<=NF;X++){{OUT=OUT $X"/*_2_val_2.fq.gz "}}}}END{{print OUT}}'`
    echo $FQ1
    echo $FQ2
    cat $FQ1 > {output.fq1} 2> {log}
    cat $FQ2 > {output.fq2} 2>> {log}
    """

SnakeMake From line 66 of workflow/Snakefile

shell:
    """
    out=`echo {output.unaligned_fq1} | sed 's/.1.fq.gz/.%.fq.gz/'`
    bowtie2 {params.bowtie2} -p {threads} --un-conc-gz $out -1 {input.fq1} -2 {input.fq2} -S {output.aligned} 2> {log}
    """

SnakeMake Bowtie 2 From line 94 of workflow/Snakefile

shell:
    "kallisto quant -t {threads} {params.extra} " 
    "-g {params.gtf} "
    "-i {params.index} -o {output} {input} > {log} 2>&1"

SnakeMake Quant kallisto From line 115 of workflow/Snakefile

script:
    "./scripts/kallisto_deseq2.R"

SnakeMake DESeq2 From line 130 of workflow/Snakefile

__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = "--outdir {}".format(outdir)

extra = snakemake.params.get("extra", "")

with tempfile.TemporaryDirectory() as tmp:
    shell(
        "fasterq-dump --temp {tmp} --threads {snakemake.threads} "
        "{extra} {outdir} {snakemake.wildcards.accession} {log}"
    )