RNA-seq Pipeline for the PI3K Project

public public 1yr ago 0 bookmarks

This is a RNA-seq pipeline structured by snakemake template. It is for the PI3K project stand on the standard RNA-seq pipeline. Insert your code into the respective folders, i.e. scripts , rules , and envs . Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file.

Authors

  • chaodi (@dic)

Usage

Running on respublica by: snakemake --use-conda -c "qsub -l h_vmem={params.mem} -l mem_free={params.mem} -pe smp {threads} -V -cwd -e qsub/{params.jobName}.e -o qsub/{params.jobName}.o" -j -p

Code Snippets

26
27
script:
    "../scripts/count-matrix.py"
45
46
script:
    "../scripts/deseq2_allsample_plot.R"
64
65
script:
    "../scripts/deseq2_diffexp.R"
15
16
17
18
19
20
shell:
    "factor=`cat {input.rpm_factors} |  grep {wildcards.sample} | cut -f2` && "
    "genomeCoverageBed -split -bg -ibam {input.bam} -scale $factor > bw_rpm/{wildcards.sample}.bg && "
    "bedtools sort -i bw_rpm/{wildcards.sample}.bg > bw_rpm/{wildcards.sample}.sort.bg && "
    "bedGraphToBigWig bw_rpm/{wildcards.sample}.sort.bg STAR_index/chrNameLength.txt {output} && "
    "rm bw_rpm/{wildcards.sample}.bg bw_rpm/{wildcards.sample}.sort.bg"
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
shell:
    '''
    rm -f {output};
    for i in {input}; do
        sampleName="$(basename $i .Log.final.out)";
        cat $i | grep 'Number of input reads' | awk '{{print $6}}' > foo1;
        cat $i | grep 'Uniquely mapped reads number' | awk '{{print $6}}' > foo2;
        cat $i | grep 'Number of reads mapped to multiple loci' | awk '{{print $9}}' > foo3;
        cat $i | grep "Uniquely mapped reads %" | awk '{{print $6}}' > foo4;
        cat $i | grep "% of reads mapped to multiple loci"|awk '{{print $9}}' > foo5;
        paste foo1 foo2 foo3 foo4 foo5 | awk '{{print "'$sampleName'\t"$1"\t"$2+$3"\t"$4+$5}}' >> {output.mappedReads}
    done
    cat {output.mappedReads} | awk '{{print $1"\t"1000000/$2}}' > {output.rpmFactor}
    sed -i '1isample\ttotal_reads\tmapped_reads\t%mapped' {output.mappedReads};
    rm -f foo*
    '''
16
17
18
19
20
21
22
shell:
    "STAR --runThreadN {threads} --genomeDir {params.genome_dir} "
    "--outFileNamePrefix STAR_align/{wildcards.sample}. --outSAMtype BAM SortedByCoordinate "
    "--outSAMmapqUnique 255 --outFilterMultimapNmax 1 "
    "--quantMode GeneCounts --sjdbGTFfile {params.annotation} "
    "--readFilesIn {input} --outSJfilterReads Unique --readFilesCommand gunzip -c && "
    "mv STAR_align/{wildcards.sample}.Aligned.sortedByCoord.out.bam STAR_align/{wildcards.sample}.bam"
33
34
shell:
    "samtools index {input} {output}"
16
17
18
19
shell:
    "trim_galore --cores {threads} --gzip --fastqc --paired -o {params.outdir} {input} &> {log} && "
    "mv trimmed_fq/{wildcards.sample}_1_val_1.fq.gz {output.fq1} &>> {log} && "
    "mv trimmed_fq/{wildcards.sample}_2_val_2.fq.gz {output.fq2} &>> {log}"  
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd

def get_count_column(strand):
    if strand == "no":
        return 1 #non stranded protocol
    elif strand == "yes":
        return 2 #3rd column in STAR output {sample}.ReadsPerGene.out.tab
    elif strand == "reverse":
        return 3 #4th column, usually for Illumina truseq
    else:
        raise ValueError("'strand' column should be empty or has the value of 'none', 'yes' or 'reverse'!!!")

counts = [pd.read_table(count_file, index_col=0, usecols=[0, get_count_column(strand)], 
          header=None, skiprows=4) 
          for count_file, strand in zip(snakemake.input, snakemake.params.strand_list)]

for df, sample in zip(counts, snakemake.params.sample_list):
    df.columns = [sample]

matrix = pd.concat(counts, axis=1)
matrix.index.name = "gene"
# collapse technical replicates
# matrix = matrix.groupby(matrix.columns, axis=1).sum()
matrix.to_csv(snakemake.output[0], sep="\t")
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
library(DESeq2)
library(pheatmap)
library(genefilter)
library(ggplot2)
library(RColorBrewer)

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

# colData and countData must have the same sample order, but this is ensured
# by the way we create the count matrix
cts <- read.table(snakemake@input[["count_table"]], header=TRUE, row.names="gene", check.names=FALSE)
coldata <- read.table(snakemake@params[["sample_table"]], header=TRUE, row.names="sample", check.names=FALSE)

dds <- DESeqDataSetFromMatrix(countData=cts, colData=coldata, design = ~ cell_type + condition)
dds$condition <- relevel(dds$condition, "Pre") # use "Pre" as the reference
# Using a grouping variable as contrast 
dds$group <- factor(paste0(dds$cell_type, dds$condition))
design(dds) <- ~ group

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]
# normalization and pre-processing
dds <- DESeq(dds, parallel=parallel)

# raw count normalization
norm_counts <- counts(dds, normalized=TRUE) 
# count transformation, log2 scale, either rlog or vst
vsd <- vst(dds, blind=FALSE)

## visualizations ##
# pca plot
pdf(snakemake@output[["pca_plot"]])
pcaData <- plotPCA(vsd, intgroup=c("condition", "cell_type"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=cell_type)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
dev.off()

# heatmap of the count matrix
#select <- order(rowMeans(norm_counts), decreasing=TRUE)[1:30] # most highly expressed
select <- head(order(-rowVars(assay(vsd))),35) # most variable genes
pdata <- assay(vsd)[select,]
df <- as.data.frame(colData(dds)[,c("condition","cell_type")])
rownames(df) <- rownames(colData(dds))
colnames(df) <- c("condition","cell_type")
pdf(snakemake@output[["heatmap_exp"]])
pheatmap(pdata, cluster_rows=TRUE, show_rownames=TRUE,
         cluster_cols=TRUE, annotation_col=df)
dev.off()

# heatmap of sample-sample distances
pdf(snakemake@output[["heatmap_dist"]])
sampleDists = dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$cell_type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)
dev.off()
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
library(DESeq2)
library(pheatmap)
library(genefilter)
library(ggplot2)
library(RColorBrewer)
library(fdrtool)
library(EnhancedVolcano)

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

parallel <- FALSE
if (snakemake@threads > 1) {
    library("BiocParallel")
    # setup parallelization
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

# colData and countData must have the same sample order, but this is ensured
# by the way we create the count matrix
cts <- read.table(snakemake@input[["count_table"]], header=TRUE, row.names="gene", check.names=FALSE)
coldata <- read.table(snakemake@params[["sample_table"]], header=TRUE, row.names="sample", check.names=FALSE)

dds <- DESeqDataSetFromMatrix(countData=cts, colData=coldata, design = ~ cell_type + condition)
dds$condition <- relevel(dds$condition, "Pre") # use "Pre" as the reference
# Using a grouping variable as contrast 
dds$group <- factor(paste0(dds$cell_type, dds$condition))
design(dds) <- ~ group

# remove uninformative columns
dds <- dds[rowSums(counts(dds)) > 1, ]
# normalization and pre-processing
dds <- DESeq(dds, parallel=parallel)

# raw count normalization
norm_counts <- counts(dds, normalized=TRUE) 
# count transformation, log2 scale, either rlog or vst
vsd <- vst(dds, blind=FALSE)

# get the current contrast/cell_type from snakemake output, e.g., "CD8Tnn_Post_vs_Pre"
output_file <- snakemake@output[["table"]]
comp = gsub(".diffexp.tsv", "", tail(unlist(strsplit(output_file, "/")),1))
cell = gsub("_Post_vs_Pre", "", comp)
# get Post_vs_Pre by groups(i.e., cell_types)
res <- results(dds, contrast = c("group", paste0(cell,"Post"), paste0(cell,"Pre")), parallel = parallel)

# use fdrtool to correct the overestimated p-value,
# https://www.huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html
res <- res[!is.na(res$pvalue),]
res <- res[!is.na(res$padj),]
res <- res[,-which(names(res)=="padj")]
FDR.res <- fdrtool(res$stat, statistic="normal", plot=F)
res[,"padj"]  <- p.adjust(FDR.res$pval, method = "BH")
message(comp, paste0(" : # Up = ", length(res[which(res$padj<=0.1 & res$log2FoldChange>0),]$padj)),
        paste0(" # Down = ", length(res[which(res$padj<=0.1 & res$log2FoldChange<0),]$padj)))

# shrink fold changes for lowly expressed genes
res <- lfcShrink(dds, contrast = c("group", paste0(cell,"Post"), paste0(cell,"Pre")), res=res, type="ashr")
# extract the current cell_type samples
df_vsd = as.data.frame(assay(vsd))
df_vsd_cell = df_vsd[,grep(paste0(cell,'$'),colnames(df_vsd))]
# merge with normalized count data and output the table
resdata <- merge(df_vsd_cell, as.data.frame(res), by="row.names",sort=FALSE)
names(resdata)[1] <- "Gene"
#print(head(resdata))
write.table(resdata, file=snakemake@output[["table"]], sep="\t", quote=FALSE, row.names=FALSE)

## basic plots for Data quality assessment
# M-A plot, points are red with padj < 0.1, points fall out of the window are open triangles 
pdf(snakemake@output[["ma_plot"]])
plotMA(res, main=comp, colLine="red")
dev.off()
ShowHide 4 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/chaodi51/20190815_PIK3CD-140106977_RNA-seq
Name: 20190815_pik3cd-140106977_rna-seq
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...