Metatranscriptomics - Snakemake pipeline as part of the Bpexa HS Leiden bioinformatics project


DiFlex is a de novo meta-transcriptome assembly and differential gene expression pipeline. As input, the pipeline takes raw Illumina paired-end RNA-seq libraries. The output contains the transcriptome assembly and the differential gene expression between treatments. The main tools used by the pipeline are FastQC, Trimmomatic and Trinity. DiFlex is a flexible, open-source, user-friendly pipeline.

Repo content

  • config.yaml - The configuration file for parameter specification of the Snakemake workflow.

  • db/ - Databases: adapters and the SortMeRNA database.

  • envs/ - Conda environments in yaml format. These environments will be automatically installed when the pipeline is run.

  • report/ - Reports on the results and the auto-generated Snakemake report.

  • rules/ - Modular Snakemake rules called by the Snakefile.

  • sample-data/ - Tiny sample dataset with raw and Trimmomatic-filtered reads from two types of marine samples. Can be used for testing the workflow; small enough to run on a laptop.

  • samples.csv - Tabular file with sample information and paths to the associated input files. Used by the Snakefile to initiate the pipeline.

  • schemas/ - Schema for the config file, used to restrict the allowed parameter formats and to set default values.

  • Snakefile - The main Snakefile that ties the modular rules together.

  • src/ - Source code.

Getting started

Requirements

The pipeline was created and tested on a Linux system.
The pipeline requires the file samples.csv. Its first line must contain the following columns: run, sample type, sample, forward and reverse. The forward and reverse columns hold the paths to the read files of each sample; a minimal sketch is shown below. To run the pipeline, the required files need to be in the same path as given in config.yaml. DiFlex makes use of the tool Snakemake.
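
For illustration only (run, sample and file names are hypothetical; the C and MA sample types match the comparison in the example output below), samples.csv could look like this:

run,sample type,sample,forward,reverse
run1,C,C1,data/C1_R1.fastq.gz,data/C1_R2.fastq.gz
run1,C,C2,data/C2_R1.fastq.gz,data/C2_R2.fastq.gz
run1,MA,MA1,data/MA1_R1.fastq.gz,data/MA1_R2.fastq.gz
run1,MA,MA2,data/MA2_R1.fastq.gz,data/MA2_R2.fastq.gz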

Installation

DiFlex requires Snakemake and Conda; their installation instructions can be found at the following links.

You can either clone DiFlex, like so:

git clone https://github.com/ejongepier/metatrans-smk-hs.git 

or download and extract the DiFlex zip archive.
DiFlex is then immediately ready to use.

Usage

Quick start

The file samples.csv needs to be in the same path as given in config.yaml. To run the pipeline, simply run the following command:

snakemake --use-conda --cores <ncores> 

Where <ncores> is the number of CPU cores/jobs that can be run in parallel on your system.
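
Before a full run, a dry run shows which jobs Snakemake would execute without actually running them (a standard Snakemake option, not specific to DiFlex):

snakemake -n --use-conda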

Customizing

You can customize the pipeline through config.yaml, for example by changing the number of threads per process. For details on how to customize samples.csv, see the pipeline's built-in help, which can be invoked with the following command:

snakemake --use-conda --cores <ncores> help 
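
The exact keys are defined in config.yaml and validated against schemas/; the excerpt below is purely hypothetical and only illustrates the kind of settings involved:

# hypothetical excerpt - consult config.yaml and schemas/ for the real keys
run: run1      # run name used in the results/{run}/ output paths
threads: 8     # number of threads per process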

Report

The report shows the workflow graph of the pipeline, along with statistics and the configuration. The statistics include the runtime and creation date; the configuration section shows the contents of config.yaml.
After running the pipeline, the report can be generated with the following command:

snakemake --report reports/report.zip

Output

The results directory contains the output files per rule. The words between {} are wildcards whose values depend on what the user specified in samples.csv.
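
For example, with run run1 and sample C1 from the hypothetical samples.csv above, results/{run}/fastqc/raw/{sample}/ expands to results/run1/fastqc/raw/C1/.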

The rule FastQC creates a FastQC report in HTML format for each sample. FastQC is a quality control tool for high-throughput sequence data. It provides a set of modular analyses through which it is possible to quickly see whether there are any problems with the data. There are two FastQC rules: one for the raw data and one for the trimmed data. The input of both rules is the read files of the samples, and the output is a directory containing the analysis results for every sample.

The rule Trimmomatic performs adapter trimming and quality filtering on the raw reads and writes the trimmed reads in FastQ format.

The rule Trinity assembly creates the Trinity assembly directory. It looks for overlap between the sequences; the overlaps determine the order in which the sequences originally occurred, and from this the original transcript sequences are reconstructed. The input is the Trimmomatic-trimmed libraries. The output directory contains two files: Trinity.fasta and Trinity_stats.txt. Trinity.fasta is a FASTA file with the assembled transcripts; Trinity_stats.txt contains the statistics of the assembled transcripts.
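
Two quick sanity checks on the assembly output (the run name run1 is illustrative): count the assembled transcripts and view the statistics file:

grep -c ">" results/run1/trinity_output/trinity_assemble/Trinity.fasta
cat results/run1/trinity_output/trinity_assemble/Trinity_stats.txt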

Trinity DE performs the differential expression analysis on the trimmed data. First, the trimmed reads are aligned to the assembly using bowtie2 and an abundance estimation is done using RSEM. This output is used to build count matrices for the isoform and gene data. Afterwards, the DE analysis is performed for both the isoform and the gene data, each using its respective counts matrix. Both analyses are performed using edgeR. A differential expression analysis is also made using the TMM-normalized expression matrices of both the isoform and gene data.
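
As a quick way to inspect the edgeR output, the DE_results tables can be filtered on the FDR column. A sketch assuming the standard layout of Trinity's edgeR DE_results files, where the first line is a header and FDR is the last column:

awk 'NR > 1 && $NF < 0.01' trinity-de.gene.counts.matrix.C_vs_MA.edgeR.DE_results | wc -l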

Overview of output files:

  • Quality control of raw input libraries generated with fastqc:

    • results/{run}/fastqc/raw/{sample}/
  • Adapter and quality trimmed read libraries generated with trimmomatic:

    • results/{run}/trimmomatic/{sample}.R1.paired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R2.paired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R1.unpaired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R2.unpaired.fastq.gz

  • Quality control of trimmed input libraries generated with fastqc:

    • results/{run}/fastqc/trimmed/{sample}/
  • Transcriptome assembly of trimmed input libraries generated with Trinity:

    • results/{run}/trinity_output/trinity_assemble/Trinity.fasta

    • results/{run}/trinity_output/trinity_assemble/Trinity_stats.txt

  • Differential gene and isoform expression of the assembly data with Trinity DE:

    • diffExpr.P0.01_C1.0.matrix.log2.centered.genes_vs_samples_heatmap.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.sample_cor_matrix.pdf

    • trinity-de.gene.counts.matrix.C_vs_MA.edgeR.DE_results.MA_n_Volcano.pdf

    • trinity-de.gene.TMM.EXPR.matrix.log2.prcomp.principal_components.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.genes_vs_samples_heatmap.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.sample_cor_matrix.pdf

    • trinity-de.isoform.counts.matrix.C_vs_MA.edgeR.DE_results.MA_n_Volcano.pdf

    • trinity-de.isoform.TMM.EXPR.matrix.log2.prcomp.principal_components.pdf

Authors

Code Snippets

FastQC quality control on the raw libraries:
shell:
	"""
	mkdir -p {output} 
	fastqc -o {output} {input} --threads {threads}
	"""

FastQC quality control on the trimmed libraries:
shell:
	"""
	mkdir -p {output}
	fastqc -o {output} {input} --threads {threads}
	"""

Read statistics of the raw, trimmed and filtered libraries with seqkit:
shell:
    """
    echo "{input.raw} {input.trim} {input.filt}" | tr " " "\n" > results/{wildcards.run}/reads_files.tmp
    seqkit stats \
        --infile-list results/{wildcards.run}/reads_files.tmp \
        --out-file {output} \
        -T > {log}
    rm -f results/{wildcards.run}/reads_files.tmp
    """

Alignment statistics with samtools:
shell: 
    """samtools flagstats -@ {threads} -O tsv {params.bam} > {output}"""

Collecting read counts from the samtools flagstat output (Python run block):
run: 
    # Collect one line per BAM: run, sample, method ("trinity") and read count
    bam_list = []
    for bam in input:
        run = bam.split("/")[1]      # path layout: results/{run}/...
        sample = bam.split("/")[4]
        with open(bam, "r") as f:
            # take the count from a fixed line of the flagstat TSV output
            res = f.readlines()[13].split("\t")[0]
            bam_list.append(f"{run}\t{sample}\ttrinity\t{res}")
    with open(output[0], "w") as outf:
        for x in bam_list:
            outf.write(f"{x}\n")

Reformatting the seqkit statistics into a per-sample read count table (Python run block):
run: 
    # Reformat the seqkit stats table into run / sample / method / read count
    read_list = []
    with open(input[0], "r") as f:
        for line in f:
            sline = line.split("\t")
            if "results/" in line:
                if "trimmomatic" in line:
                    # trimmed reads: results/{run}/trimmomatic/{sample}...
                    read_list.append(f"{wildcards.run}\t{sline[0].split('/')[3]}\t{sline[0].split('/')[2]}\t{sline[3]}")
                else:
                    # other output under results/: sample name spans two path parts
                    read_list.append(f"{wildcards.run}\t{sline[0].split('/')[3]}.{sline[0].split('/')[4]}\t{sline[0].split('/')[2]}\t{sline[3]}")
            else:
                # raw input libraries live outside results/
                read_list.append(f"{wildcards.run}\t{sline[0].split('/')[-1]}\traw\t{sline[3]}")
    with open(output[0], "w") as outf:
        # read_list[0] holds the seqkit header line, so skip it
        for x in read_list[1:]:
            outf.write(f"{x}\n")

Concatenating the seqkit and alignment read count tables:
shell: 
    """cat {input.seqkit} {input.bam} > {output.read_info}"""

Plotting the read counts with the plot_reads.r script:
shell: 
    """Rscript scripts/plot_reads.r {input.reads_results} {output.plot} results/{wildcards.run}/plots > {log}"""

Downloading the SortMeRNA database:
shell: 
    """
    mkdir -p 'db/sortmerna/download'
    wget -nc -nv -P 'db/sortmerna/download/' 'https://github.com/biocore/sortmerna/releases/download/v4.3.3/database.tar.gz'
    tar -xzvf db/sortmerna/download/database.tar.gz -C db/sortmerna/download
    mv 'db/sortmerna/download/{params.db}' '{input}' 
    rm -r 'db/sortmerna/download'
    """

rRNA filtering with SortMeRNA:
shell: 
    """
    sortmerna \
    --reads {input.fow} \
    --reads {input.rev} \
    --ref {input.ref} \
    --workdir {output.work} \
    --num_alignments {params.num_alignments} \
    --threads {threads} \
    -m {resources.mem_mb} \
    -{params.paired_optie} -v -fastx -other > {log}
    """

Re-pairing the filtered reads with reformat.sh (BBTools):
shell:
    """
    reformat.sh -Xmx{resources.mem_mb}m in={input.other} int=t zl=4 out1={output.paired_left} out2={output.paired_right} verifypairing > {log}
    """

Adapter and quality trimming with Trimmomatic:
shell:
    """
    trimmomatic PE \
      -threads {threads} \
      -phred33 \
      {input} \
      {output.fwd_p} {output.fwd_u} \
      {output.rev_p} {output.rev_u} \
      ILLUMINACLIP:{params.adapters}:2:30:10 \
      LEADING:3 TRAILING:3 \
      SLIDINGWINDOW:4:28 MINLEN:36 2> {log}
    """

Writing the Trinity samples file from samples.csv (Python run block):
run: 
	# Build the tab-separated samples file that Trinity expects:
	# condition, sample, left reads, right reads
	samples_list = []
	i = 0
	with open(input.samples, "r") as in_sample:
		in_sample.readline()  # skip the samples.csv header
		samples_list = in_sample.readlines()

	with open(output.assemble_samples, "w") as out:
		for line_str in samples_list:
			line = line_str.split(",")
			# columns 2 and 3 of samples.csv are the sample type and sample
			out.write(f"{line[1]}\t{line[2]}\t{input.left[i]}\t{input.right[i]}\n")
			i += 1

De novo assembly with Trinity and assembly statistics with TrinityStats.pl:
	shell:
		"""
		Trinity \
  		  --normalize_reads \
		  --seqType fq \
		  --max_memory {params.max_memory} \
		  --CPU {threads} \
		  --samples_file results/{wildcards.run}/trinity_output/assemble_samples.txt \
		  --output {output.out_dir} \
		  --min_kmer_cov {params.min_kmer_cov} > {log}

		TrinityStats.pl \
		  {output.out_fasta} \
		  > {output.out_stats}
		"""

Writing the per-sample alignment file for abundance estimation (Python run block):
run:
	# Rewrite the assembly samples file so the abundance estimation of
	# each sample is written to its own directory under trinity_de/
	samples_list = []
	with open(input.assem_samples, "r") as in_sample:
		samples_list = in_sample.readlines()
	with open(output.align_samples, "w") as out_sample:
		for line_str in samples_list:
			line = line_str.split("\t")
			out_sample.write(f"{line[0]}\tresults/{wildcards.run}/trinity_output/trinity_de/{line[1]}\t{line[2]}\t{line[3]}")

Alignment with bowtie2 and abundance estimation with RSEM:
	shell:
		"""
		align_and_estimate_abundance.pl \
    	  --transcripts {input.assembly} \
    	  --seqType fq \
    	  --samples_file {input.align_samples} \
		  --thread_count {threads} \
    	  --prep_reference \
    	  --est_method RSEM \
    	  --aln_method bowtie2 \
    	  --gene_trans_map {params.gene_trans_map} > {log}
		"""

Collecting the paths of the RSEM isoform results files:
shell: 
	"""ls results/{wildcards.run}/trinity_output/trinity_de/*/RSEM.isoforms.results > {output.isoform_path}"""

Building the isoform count and expression matrices:
	shell: 
		"""
		abundance_estimates_to_matrix.pl \
  			--est_method RSEM \
  			--out_prefix {params.edgeR_dir}/trinity-de \
  			--gene_trans_map none \
  			--name_sample_by_basedir \
  			--quant_files {input.isoform_path} > {log}
		"""

Building the gene count and expression matrices using the gene-to-transcript map:
	shell: 
		"""
		abundance_estimates_to_matrix.pl \
  		  --est_method RSEM \
  		  --out_prefix {params.edgeR_dir}/trinity-de \
  		  --gene_trans_map {input.gene_trans_map} \
  		  --name_sample_by_basedir \
  		  --quant_files {input.isoform_path} > {log}
		"""

Extracting the unique sample type/sample pairs from samples.csv:
shell: 
	"""
	cat {input.samples} | awk -F "," '{{print $2"\t"$3}}' | awk 'NR!=1 {{print}}' | sort | uniq > {output.sample_file}
	"""

Differential expression analysis of the isoform matrix with edgeR:
	shell: 
		"""
		run_DE_analysis.pl \
    		--matrix {input.matrix} \
    		--method edgeR \
    		--samples_file {input.sample_file} \
    		--output {params.isoform_dir} > {log}
		"""

Differential expression analysis of the gene matrix with edgeR:
	shell: 
		"""
		run_DE_analysis.pl \
    		--matrix {input.matrix} \
    		--method edgeR \
    		--samples_file {input.sample_file} \
    		--output {params.gene_dir} > {log}
		"""

Extracting significant isoforms with analyze_diff_expr.pl and computing principal components with PtR:
shell:
	"""
	cd {params.isoform_dir}

	analyze_diff_expr.pl \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		-P {params.P} \
		-C {params.C} > {params.prefix}/{log.diff_expr}

	PtR \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		--log2 \
		--prin_comp {params.PCA_components} > {params.prefix}/{log.PtR}
	"""

Extracting significant genes with analyze_diff_expr.pl and computing principal components with PtR:
shell:
	"""
	cd {params.gene_dir}

	analyze_diff_expr.pl \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		-P {params.P} \
		-C {params.C} > {params.prefix}/{log.diff_expr}

	PtR \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		--log2 \
		--prin_comp {params.PCA_components} > {params.prefix}/{log.PtR}
	 """

The plot_reads.r script (scripts/plot_reads.r):
# plot_reads.r: summarize and plot read counts per processing step
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)

# args[1]: tab-separated table with columns run, sample, method, nreads
inreads <- read.delim(args[1], header = F, sep = "\t")
colnames(inreads) <- c("run", "sample", "method", "nreads")

# strip fastq extensions from the sample names
inreads$sample <- gsub(".fq.gz", "", inreads$sample)
inreads$sample <- gsub(".fastq.gz", "", inreads$sample)

# total read count per run and method
res <- aggregate(inreads$nreads, by=list(inreads$run,inreads$method), FUN=sum)
colnames(res) <- c("run", "method", "nreads")

# combined bar plot of total read counts per method (args[2]: output pdf)
complot <- ggplot(res, aes(x=method, y=nreads)) +
  geom_bar(stat="identity", color = "black", size = 0.1) + 
  labs(x = "Method", y = "Read count") +
  theme_gray()

ggsave(args[2], plot = complot, device = "pdf")

# per-method bar plot, faceted by sample, written to args[3]
gen_met_plot <- function(pldata, plmethod) {
  p = ggplot(pldata, aes(x=method, y=nreads))
  p = p + scale_fill_manual(name = paste("Read count per sample from ", plmethod), labels = unique(pldata$sample), values = pldata$nreads)
  p = p + facet_wrap(pldata$method ~ pldata$sample)
  p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
  p = p + labs(x = "", y = "Read count")

  ggsave(paste0(args[3], "/", plmethod, "_plot.pdf"), plot = p, device = "pdf", scale = 1.5)
}

gen_met_plot(inreads[inreads$method == "raw", ], "raw")
gen_met_plot(inreads[inreads$method == "trimmomatic", ], "trimmomatic")
gen_met_plot(inreads[inreads$method == "sortmerna", ], "sortmerna")
gen_met_plot(inreads[inreads$method == "trinity", ], "trinity")

The help rule from the Snakefile:
shell:
	"cat {input}"

URL: https://github.com/ejongepier/metatrans-smk-hs
Name: metatrans-smk-hs
Version: 1
Copyright: Public Domain
License: None