Metatranscriptomics - Snakemake pipeline as part of the Bpexa HS Leiden bioinformatics project


DiFlex is a de novo meta-transcriptome assembly and differential gene expression pipeline. As input, the pipeline takes raw Illumina paired-end RNA-seq libraries. The output contains the transcriptome assembly and the differential gene expression between treatments. The main tools used by the pipeline are FastQC, Trimmomatic and Trinity. DiFlex is a flexible, open-source, user-friendly pipeline.

Repo content

  • config.yaml - The configuration file for parameter specification of the Snakemake workflow.

  • db/ - Databases: adapters and the SortMeRNA database.

  • envs/ - Conda environments in yaml format. These environments will be automatically installed when the pipeline is run.

  • report/ - Reports on the results and the auto-generated Snakemake report.

  • rules/ - Modular Snakemake rules called by the Snakefile.

  • sample-data/ - Tiny sample dataset with raw and Trimmomatic-filtered reads from two types of marine samples. Can be used for testing the workflow; small enough to run on a laptop.

  • samples.csv - Tabular file with sample information and paths to the associated input files. Used by the Snakefile to initiate the pipeline.

  • schemas/ - Schema for the config file, used to restrict the allowed parameter formats and to set default values.

  • Snakefile - The main Snakefile that ties the modular rules together.

  • src/ - Source code.

Getting started

Requirements

The pipeline was created and tested on a Linux system.
The pipeline requires the file samples.csv. Its first line must contain the following columns: run, sample type, sample, forward and reverse. The forward and reverse columns hold the paths to the read files of each sample; a minimal sketch is shown below. To run the pipeline, the required files need to be in the same path as given in config.yaml. DiFlex makes use of the tool Snakemake.
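
For illustration only (run, sample and file names are hypothetical; the C and MA sample types match the comparison in the example output below), samples.csv could look like this:

run,sample type,sample,forward,reverse
run1,C,C1,data/C1_R1.fastq.gz,data/C1_R2.fastq.gz
run1,C,C2,data/C2_R1.fastq.gz,data/C2_R2.fastq.gz
run1,MA,MA1,data/MA1_R1.fastq.gz,data/MA1_R2.fastq.gz
run1,MA,MA2,data/MA2_R1.fastq.gz,data/MA2_R2.fastq.gz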

Installation

DiFlex requires Snakemake and Conda; their installation instructions can be found at the following links.

You can either clone DiFlex, like so:

git clone https://github.com/ejongepier/metatrans-smk-hs.git 

or download and extract the DiFlex zip archive.
DiFlex is then immediately ready to use.

Usage

Quick start

The file samples.csv needs to be in the same path as given in config.yaml. To run the pipeline, simply run the following command:

snakemake --use-conda --cores <ncores> 

Where <ncores> is the number of CPU cores/jobs that can be run in parallel on your system.
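
Before a full run, a dry run shows which jobs Snakemake would execute without actually running them (a standard Snakemake option, not specific to DiFlex):

snakemake -n --use-conda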

Customizing

You can customize the pipeline through config.yaml, for example by changing the number of threads per process. For details on how to customize samples.csv, see the pipeline's built-in help, which can be invoked with the following command:

snakemake --use-conda --cores <ncores> help 
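
The exact keys are defined in config.yaml and validated against schemas/; the excerpt below is purely hypothetical and only illustrates the kind of settings involved:

# hypothetical excerpt - consult config.yaml and schemas/ for the real keys
run: run1      # run name used in the results/{run}/ output paths
threads: 8     # number of threads per process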

Report

The report shows the workflow graph of the pipeline, along with statistics and the configuration. The statistics include the runtime and creation date; the configuration section shows the contents of config.yaml.
After running the pipeline, the report can be generated with the following command:

snakemake --report reports/report.zip

Output

The results directory contains the output files per rule. The words between {} are wildcards whose values depend on what the user specified in samples.csv.
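
For example, with run run1 and sample C1 from the hypothetical samples.csv above, results/{run}/fastqc/raw/{sample}/ expands to results/run1/fastqc/raw/C1/.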

The rule FastQC creates a FastQC report in HTML format for each sample. FastQC is a quality control tool for high-throughput sequence data. It provides a set of modular analyses through which it is possible to quickly see whether there are any problems with the data. There are two FastQC rules: one for the raw data and one for the trimmed data. The input of both rules is the read files of the samples, and the output is a directory containing the analysis results for every sample.

The rule Trimmomatic performs adapter trimming and quality filtering on the raw reads and writes the trimmed reads in FastQ format.

The rule Trinity assembly creates the Trinity assembly directory. It looks for overlap between the sequences; the overlaps determine the order in which the sequences originally occurred, and from this the original transcript sequences are reconstructed. The input is the Trimmomatic-trimmed libraries. The output directory contains two files: Trinity.fasta and Trinity_stats.txt. Trinity.fasta is a FASTA file with the assembled transcripts; Trinity_stats.txt contains the statistics of the assembled transcripts.
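
Two quick sanity checks on the assembly output (the run name run1 is illustrative): count the assembled transcripts and view the statistics file:

grep -c ">" results/run1/trinity_output/trinity_assemble/Trinity.fasta
cat results/run1/trinity_output/trinity_assemble/Trinity_stats.txt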

Trinity DE performs the differential expression analysis on the trimmed data. First, the trimmed reads are aligned to the assembly using bowtie2 and an abundance estimation is done using RSEM. This output is used to build count matrices for the isoform and gene data. Afterwards, the DE analysis is performed for both the isoform and the gene data, each using its respective counts matrix. Both analyses are performed using edgeR. A differential expression analysis is also made using the TMM-normalized expression matrices of both the isoform and gene data.
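
As a quick way to inspect the edgeR output, the DE_results tables can be filtered on the FDR column. A sketch assuming the standard layout of Trinity's edgeR DE_results files, where the first line is a header and FDR is the last column:

awk 'NR > 1 && $NF < 0.01' trinity-de.gene.counts.matrix.C_vs_MA.edgeR.DE_results | wc -l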

Overview of output files:

  • Quality control of raw input libraries generated with fastqc:

    • results/{run}/fastqc/raw/{sample}/
  • Adapter and quality trimmed read libraries generated with trimmomatic:

    • results/{run}/trimmomatic/{sample}.R1.paired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R2.paired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R1.unpaired.fastq.gz

    • results/{run}/trimmomatic/{sample}.R2.unpaired.fastq.gz

  • Quality control of trimmed input libraries generated with fastqc:

    • results/{run}/fastqc/trimmed/{sample}/
  • Transcriptome assembly of trimmed input libraries generated with Trinity:

    • results/{run}/trinity_output/trinity_assemble/Trinity.fasta

    • results/{run}/trinity_output/trinity_assemble/Trinity_stats.txt

  • Differential gene and isoform expression of the assembly data with Trinity DE:

    • diffExpr.P0.01_C1.0.matrix.log2.centered.genes_vs_samples_heatmap.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.sample_cor_matrix.pdf

    • trinity-de.gene.counts.matrix.C_vs_MA.edgeR.DE_results.MA_n_Volcano.pdf

    • trinity-de.gene.TMM.EXPR.matrix.log2.prcomp.principal_components.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.genes_vs_samples_heatmap.pdf

    • diffExpr.P0.01_C1.0.matrix.log2.centered.sample_cor_matrix.pdf

    • trinity-de.isoform.counts.matrix.C_vs_MA.edgeR.DE_results.MA_n_Volcano.pdf

    • trinity-de.isoform.TMM.EXPR.matrix.log2.prcomp.principal_components.pdf

Authors

Code Snippets

FastQC quality control on the raw libraries:
shell:
	"""
	mkdir -p {output} 
	fastqc -o {output} {input} --threads {threads}
	"""

FastQC quality control on the trimmed libraries:
shell:
	"""
	mkdir -p {output}
	fastqc -o {output} {input} --threads {threads}
	"""

Read statistics of the raw, trimmed and filtered libraries with seqkit:
shell:
    """
    echo "{input.raw} {input.trim} {input.filt}" | tr " " "\n" > results/{wildcards.run}/reads_files.tmp
    seqkit stats \
        --infile-list results/{wildcards.run}/reads_files.tmp \
        --out-file {output} \
        -T > {log}
    rm -f results/{wildcards.run}/reads_files.tmp
    """

Alignment statistics with samtools:
shell: 
    """samtools flagstats -@ {threads} -O tsv {params.bam} > {output}"""

Collecting read counts from the samtools flagstat output (Python run block):
run: 
    # Collect one line per BAM: run, sample, method ("trinity") and read count
    bam_list = []
    for bam in input:
        run = bam.split("/")[1]      # path layout: results/{run}/...
        sample = bam.split("/")[4]
        with open(bam, "r") as f:
            # take the count from a fixed line of the flagstat TSV output
            res = f.readlines()[13].split("\t")[0]
            bam_list.append(f"{run}\t{sample}\ttrinity\t{res}")
    with open(output[0], "w") as outf:
        for x in bam_list:
            outf.write(f"{x}\n")

Reformatting the seqkit statistics into a per-sample read count table (Python run block):
run: 
    # Reformat the seqkit stats table into run / sample / method / read count
    read_list = []
    with open(input[0], "r") as f:
        for line in f:
            sline = line.split("\t")
            if "results/" in line:
                if "trimmomatic" in line:
                    # trimmed reads: results/{run}/trimmomatic/{sample}...
                    read_list.append(f"{wildcards.run}\t{sline[0].split('/')[3]}\t{sline[0].split('/')[2]}\t{sline[3]}")
                else:
                    # other output under results/: sample name spans two path parts
                    read_list.append(f"{wildcards.run}\t{sline[0].split('/')[3]}.{sline[0].split('/')[4]}\t{sline[0].split('/')[2]}\t{sline[3]}")
            else:
                # raw input libraries live outside results/
                read_list.append(f"{wildcards.run}\t{sline[0].split('/')[-1]}\traw\t{sline[3]}")
    with open(output[0], "w") as outf:
        # read_list[0] holds the seqkit header line, so skip it
        for x in read_list[1:]:
            outf.write(f"{x}\n")

Concatenating the seqkit and alignment read count tables:
shell: 
    """cat {input.seqkit} {input.bam} > {output.read_info}"""

Plotting the read counts with the plot_reads.r script:
shell: 
    """Rscript scripts/plot_reads.r {input.reads_results} {output.plot} results/{wildcards.run}/plots > {log}"""

Downloading the SortMeRNA database:
shell: 
    """
    mkdir -p 'db/sortmerna/download'
    wget -nc -nv -P 'db/sortmerna/download/' 'https://github.com/biocore/sortmerna/releases/download/v4.3.3/database.tar.gz'
    tar -xzvf db/sortmerna/download/database.tar.gz -C db/sortmerna/download
    mv 'db/sortmerna/download/{params.db}' '{input}' 
    rm -r 'db/sortmerna/download'
    """

rRNA filtering with SortMeRNA:
shell: 
    """
    sortmerna \
    --reads {input.fow} \
    --reads {input.rev} \
    --ref {input.ref} \
    --workdir {output.work} \
    --num_alignments {params.num_alignments} \
    --threads {threads} \
    -m {resources.mem_mb} \
    -{params.paired_optie} -v -fastx -other > {log}
    """

Re-pairing the filtered reads with reformat.sh (BBTools):
shell:
    """
    reformat.sh -Xmx{resources.mem_mb}m in={input.other} int=t zl=4 out1={output.paired_left} out2={output.paired_right} verifypairing > {log}
    """

Adapter and quality trimming with Trimmomatic:
shell:
    """
    trimmomatic PE \
      -threads {threads} \
      -phred33 \
      {input} \
      {output.fwd_p} {output.fwd_u} \
      {output.rev_p} {output.rev_u} \
      ILLUMINACLIP:{params.adapters}:2:30:10 \
      LEADING:3 TRAILING:3 \
      SLIDINGWINDOW:4:28 MINLEN:36 2> {log}
    """

Writing the Trinity samples file from samples.csv (Python run block):
run: 
	# Build the tab-separated samples file that Trinity expects:
	# condition, sample, left reads, right reads
	samples_list = []
	i = 0
	with open(input.samples, "r") as in_sample:
		in_sample.readline()  # skip the samples.csv header
		samples_list = in_sample.readlines()

	with open(output.assemble_samples, "w") as out:
		for line_str in samples_list:
			line = line_str.split(",")
			# columns 2 and 3 of samples.csv are the sample type and sample
			out.write(f"{line[1]}\t{line[2]}\t{input.left[i]}\t{input.right[i]}\n")
			i += 1

De novo assembly with Trinity and assembly statistics with TrinityStats.pl:
	shell:
		"""
		Trinity \
  		  --normalize_reads \
		  --seqType fq \
		  --max_memory {params.max_memory} \
		  --CPU {threads} \
		  --samples_file results/{wildcards.run}/trinity_output/assemble_samples.txt \
		  --output {output.out_dir} \
		  --min_kmer_cov {params.min_kmer_cov} > {log}

		TrinityStats.pl \
		  {output.out_fasta} \
		  > {output.out_stats}
		"""

Writing the per-sample alignment file for abundance estimation (Python run block):
run:
	# Rewrite the assembly samples file so the abundance estimation of
	# each sample is written to its own directory under trinity_de/
	samples_list = []
	with open(input.assem_samples, "r") as in_sample:
		samples_list = in_sample.readlines()
	with open(output.align_samples, "w") as out_sample:
		for line_str in samples_list:
			line = line_str.split("\t")
			out_sample.write(f"{line[0]}\tresults/{wildcards.run}/trinity_output/trinity_de/{line[1]}\t{line[2]}\t{line[3]}")

Alignment with bowtie2 and abundance estimation with RSEM:
	shell:
		"""
		align_and_estimate_abundance.pl \
    	  --transcripts {input.assembly} \
    	  --seqType fq \
    	  --samples_file {input.align_samples} \
		  --thread_count {threads} \
    	  --prep_reference \
    	  --est_method RSEM \
    	  --aln_method bowtie2 \
    	  --gene_trans_map {params.gene_trans_map} > {log}
		"""

Collecting the paths of the RSEM isoform results files:
shell: 
	"""ls results/{wildcards.run}/trinity_output/trinity_de/*/RSEM.isoforms.results > {output.isoform_path}"""

Building the isoform count and expression matrices:
	shell: 
		"""
		abundance_estimates_to_matrix.pl \
  			--est_method RSEM \
  			--out_prefix {params.edgeR_dir}/trinity-de \
  			--gene_trans_map none \
  			--name_sample_by_basedir \
  			--quant_files {input.isoform_path} > {log}
		"""

Building the gene count and expression matrices using the gene-to-transcript map:
	shell: 
		"""
		abundance_estimates_to_matrix.pl \
  		  --est_method RSEM \
  		  --out_prefix {params.edgeR_dir}/trinity-de \
  		  --gene_trans_map {input.gene_trans_map} \
  		  --name_sample_by_basedir \
  		  --quant_files {input.isoform_path} > {log}
		"""

Extracting the unique sample type/sample pairs from samples.csv:
shell: 
	"""
	cat {input.samples} | awk -F "," '{{print $2"\t"$3}}' | awk 'NR!=1 {{print}}' | sort | uniq > {output.sample_file}
	"""

Differential expression analysis of the isoform matrix with edgeR:
	shell: 
		"""
		run_DE_analysis.pl \
    		--matrix {input.matrix} \
    		--method edgeR \
    		--samples_file {input.sample_file} \
    		--output {params.isoform_dir} > {log}
		"""

Differential expression analysis of the gene matrix with edgeR:
	shell: 
		"""
		run_DE_analysis.pl \
    		--matrix {input.matrix} \
    		--method edgeR \
    		--samples_file {input.sample_file} \
    		--output {params.gene_dir} > {log}
		"""

Extracting significant isoforms with analyze_diff_expr.pl and computing principal components with PtR:
shell:
	"""
	cd {params.isoform_dir}

	analyze_diff_expr.pl \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		-P {params.P} \
		-C {params.C} > {params.prefix}/{log.diff_expr}

	PtR \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		--log2 \
		--prin_comp {params.PCA_components} > {params.prefix}/{log.PtR}
	"""

Extracting significant genes with analyze_diff_expr.pl and computing principal components with PtR:
shell:
	"""
	cd {params.gene_dir}

	analyze_diff_expr.pl \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		-P {params.P} \
		-C {params.C} > {params.prefix}/{log.diff_expr}

	PtR \
		--matrix {params.prefix}/{input.matrix} \
		--samples {params.prefix}/{input.sample_file} \
		--log2 \
		--prin_comp {params.PCA_components} > {params.prefix}/{log.PtR}
	 """

The plot_reads.r script (scripts/plot_reads.r):
# plot_reads.r: summarize and plot read counts per processing step
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)

# args[1]: tab-separated table with columns run, sample, method, nreads
inreads <- read.delim(args[1], header = F, sep = "\t")
colnames(inreads) <- c("run", "sample", "method", "nreads")

# strip fastq extensions from the sample names
inreads$sample <- gsub(".fq.gz", "", inreads$sample)
inreads$sample <- gsub(".fastq.gz", "", inreads$sample)

# total read count per run and method
res <- aggregate(inreads$nreads, by=list(inreads$run,inreads$method), FUN=sum)
colnames(res) <- c("run", "method", "nreads")

# combined bar plot of total read counts per method (args[2]: output pdf)
complot <- ggplot(res, aes(x=method, y=nreads)) +
  geom_bar(stat="identity", color = "black", size = 0.1) + 
  labs(x = "Method", y = "Read count") +
  theme_gray()

ggsave(args[2], plot = complot, device = "pdf")

# per-method bar plot, faceted by sample, written to args[3]
gen_met_plot <- function(pldata, plmethod) {
  p = ggplot(pldata, aes(x=method, y=nreads))
  p = p + scale_fill_manual(name = paste("Read count per sample from ", plmethod), labels = unique(pldata$sample), values = pldata$nreads)
  p = p + facet_wrap(pldata$method ~ pldata$sample)
  p = p + geom_bar(stat = "identity", position = "stack", color = "black", size = 0.1)
  p = p + labs(x = "", y = "Read count")

  ggsave(paste0(args[3], "/", plmethod, "_plot.pdf"), plot = p, device = "pdf", scale = 1.5)
}

gen_met_plot(inreads[inreads$method == "raw", ], "raw")
gen_met_plot(inreads[inreads$method == "trimmomatic", ], "trimmomatic")
gen_met_plot(inreads[inreads$method == "sortmerna", ], "sortmerna")
gen_met_plot(inreads[inreads$method == "trinity", ], "trinity")

The help rule from the Snakefile:
shell:
	"cat {input}"

URL: https://github.com/ejongepier/metatrans-smk-hs
Name: metatrans-smk-hs
Version: 1
Copyright: Public Domain
License: None