Repository to process ITS amplicon sequencing data and produce figures for Schneider et al. (2021)


This repository contains code to analyse ITS amplicon sequencing data with DADA2 for comparison with RNA-seq data from the same samples. It uses the data collected for Haas et al. (2018), but processes them with DADA2 denoising and Swarm clustering instead of the original OTU clustering.

Repository contents

The repository consists of two units that are run separately, and an additional folder with R scripts for analysis and plotting of the results:

1 demultiplex_wf -> download and demultiplex raw data

The folder contains instructions on how to run the workflow as a Docker container. It will download the raw data from the ENA, demultiplex the sequences into one file per sample, and concatenate technical replicates.
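
The exact commands are documented in that folder; as a rough sketch, assuming the image is built from a Dockerfile in demultiplex_wf/ (the image tag and mount path below are placeholders):

# build the container image from the demultiplex_wf folder (tag is a placeholder)
docker build -t demultiplex_wf demultiplex_wf/
# run it with the data folder mounted so the demultiplexed files persist on the host
docker run --rm -v $(pwd)/demultiplex_wf/data:/data demultiplex_wf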

2 workflow -> ITS amplicon sequencing data preprocessing workflow

The folder contains a snakemake workflow that will reproduce the preprocessing of the demultiplexed ITS amplicon sequencing data as used in the study.

How to use it

The workflow is run through snakemake, from the root folder of the repository (where this readme sits). To continue with the demultiplexed data, we first need to move it out of the demultiplex_wf subfolder:

mkdir $(pwd)/data
mv $(pwd)/demultiplex_wf/data/ $(pwd)/

Next we will create a conda environment ("its_wf") needed to execute the snakemake workflow, and then activate it:

conda env create -n its_wf -f $(pwd)/environment.yml
conda activate its_wf

Once the conda environment has been created successfully, we can execute the workflow with the following command:

snakemake -s $(pwd)/workflow/Snakefile -pr -j 4 --use-conda

This will output the final count matrix and other results (such as sequences for every Swarm OTU and taxonomic assignments) into the results/ folder.
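
To preview the planned jobs without executing anything, snakemake's standard dry-run flag can be added first:

# dry run: print the jobs that would be executed, then exit
snakemake -s $(pwd)/workflow/Snakefile -n -pr -j 4 --use-conda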

analysis -> R scripts to analyse ITS and RNA-seq data

The folder contains scripts to reproduce the figures in the publication.

Code Snippets

workflow/scripts/filter.R (lines 3-26):

require(dada2, quietly=TRUE)

# Parameters: set fallback values for any parameter not provided via the Snakefile
if (is.null(snakemake@params$maxN)) {snakemake@params$maxN <- 0}
if (is.null(snakemake@params$truncQ)) {snakemake@params$truncQ <- 2}
if (is.null(snakemake@params$truncLen)) {snakemake@params$truncLen <- 0}
if (is.null(snakemake@params$maxEE_R1)) {snakemake@params$maxEE_R1 <- Inf}
if (is.null(snakemake@params$maxEE_R2)) {snakemake@params$maxEE_R2 <- Inf}
if (is.null(snakemake@params$rm_phix)) {snakemake@params$rm_phix <- TRUE}
if (is.null(snakemake@params$minLen)) {snakemake@params$minLen <- 0}

maxEE <- c(snakemake@params$maxEE_R1, snakemake@params$maxEE_R2)

# Performance
threads <- snakemake@threads

out <- filterAndTrim(fwd = snakemake@input$R1, filt = snakemake@output$R1,
              rev = snakemake@input$R2, filt.rev = snakemake@output$R2,
              maxN = snakemake@params$maxN, truncQ = snakemake@params$truncQ,
              truncLen = snakemake@params$truncLen, maxEE = maxEE,
              rm.phix = snakemake@params$rm_phix, minLen = snakemake@params$minLen,
              multithread = threads, verbose = TRUE)

saveRDS(out, snakemake@output$rds)

workflow/scripts/ProcessITSx.R (lines 3-52):

itsx_in <- snakemake@input$its_in
itsx_out <- snakemake@input$its_out
sq.nc <- snakemake@input$seqtab
out <- snakemake@output$seqtab
out_mock <- snakemake@output$mock
threads <- snakemake@threads
mock <- snakemake@params$mock

mock_samples <- strsplit(mock, ",")[[1]]

library(dada2); packageVersion("dada2")
library(dplyr)
library(Biostrings)

dna <- readDNAStringSet(itsx_in)
dna_itsx <- as.character(readDNAStringSet(itsx_out))
seqtab.nc <- readRDS(sq.nc)
colnames(seqtab.nc) <- names(dna)
# Keep only sequences for which ITSx extracted an ITS region
seqtab.nc <- seqtab.nc[,colnames(seqtab.nc)%in%names(dna_itsx)]

# Replace sequence IDs with the ITSx-extracted sequences,
# matched by name rather than by position
colnames(seqtab.nc) <- dna_itsx[colnames(seqtab.nc)]
seqtab.nc <- t(seqtab.nc)

# Remove sequences below 50bp (apply the same threshold to the
# sequence set and the count table)
if (any(nchar(rownames(seqtab.nc)) < 50)) {
  dna_itsx <- dna_itsx[!(nchar(dna_itsx) < 50)]
  seqtab.nc <- seqtab.nc[nchar(rownames(seqtab.nc)) >= 50,]
}


# At this point we can sum up the counts of identical sequences
# (ITSx trimming can make previously distinct sequences identical)
seqtab.nc2 <- cbind.data.frame(sequence=rownames(seqtab.nc), seqtab.nc)
seqtab.nc3 <- group_by(seqtab.nc2, sequence) %>%
  summarise(across(everything(), sum))

if (length(intersect(colnames(seqtab.nc3), mock_samples)) > 0) {
  mock_cols <- append(c("sequence"), intersect(colnames(seqtab.nc3), mock_samples))
  mock <- seqtab.nc3 %>% select(all_of(mock_cols))
  seqtab.nc4 <- seqtab.nc3 %>% select(!any_of(mock_cols))
  saveRDS(mock, out_mock)
} else {
  seqtab.nc4 <- seqtab.nc3[,colnames(seqtab.nc3)!="sequence"]
}

seqtab.nc4 <- as.data.frame(seqtab.nc4)
rownames(seqtab.nc4) <- seqtab.nc3$sequence
seqtab.nc4 <- data.matrix(t(seqtab.nc4))

saveRDS(seqtab.nc4, out)

workflow/scripts/revcomp.py (lines 3-21):

from argparse import ArgumentParser
from Bio.Seq import reverse_complement


def revcomp(seq):
    return reverse_complement(seq)


def main(args):
    r = revcomp(args.seq)
    print(r, end="")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("seq", type=str,
                        help="Input fasta string")
    args = parser.parse_args()
    main(args)
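
A quick usage check (the input sequence is an arbitrary example):

python workflow/scripts/revcomp.py CCGTA
# prints TACGG; the script omits the trailing newline so the
# output can be captured cleanly with $(...)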

workflow/Snakefile (lines 34-40):

shell:
    """
    fastq-dump {params.spots} --split-3 --gzip -O {params.data_dir} \
        {params.acc} > {log} 2>&1
    mv {params.data_dir}/{params.acc}_1.fastq.gz {output.R1}
    mv {params.data_dir}/{params.acc}_2.fastq.gz {output.R2}
    """

workflow/Snakefile (lines 60-61):

run:
    symlink(input[0], output[0])

workflow/Snakefile (lines 81-82):

script:
    "scripts/filter.R"

workflow/Snakefile (lines 96-100):

shell:
    """
    python workflow/scripts/revcomp.py {params.fwd} > {output.fwd_rc}
    python workflow/scripts/revcomp.py {params.rev} > {output.rev_rc}
    """

workflow/Snakefile (lines 126-134):

shell:
    """
    A=$(cat {input.fwd_rc})
    a=$(cat {input.rev_rc})

    cutadapt -g {params.FWD} -a $a -G {params.REV} -A $A -j {threads} \
     -n {params.n} -o {output.R1} -p {output.R2} --minimum-length {params.min_len} \
     {input.R1} {input.R2} > {log} 2>&1
    """

workflow/Snakefile (lines 161-162):

script:
    "scripts/filter.R"

workflow/Snakefile (lines 203-207):

shell:
    """
    Rscript --vanilla workflow/scripts/runDada2.R {params.fw_dir} {params.rv_dir} \
        {params.out_dir} {threads} > {log} 2>&1
    """

workflow/Snakefile (lines 229-233):

shell:
    """
    ITSx -i {input} -o {params.prefix}/itsx --cpu {threads} --multi_thread T \
        --preserve T --partial 50 --minlen 50 > {log} 2>&1
    """

workflow/Snakefile (lines 256-257):

script:
    "scripts/ProcessITSx.R"

workflow/Snakefile (lines 276-280):

shell:
    """
    Rscript --vanilla workflow/scripts/MergeLibraries.R {params.output_dir} \
        {params.dirnames} > {log} 2>&1 
    """

workflow/Snakefile (lines 298-302):

shell:
    """
    swarm -t {threads} -d 3 -z --output-file {output.txt} \
        --seeds {output.seeds} {input} > {log} 2>&1
    """

workflow/Snakefile (lines 321-325):

shell:
    """
    Rscript --vanilla workflow/scripts/processSwarm.R {params.swarm_res} \
        {params.swarm_in} {params.swarm_res} > {log} 2>&1
    """

workflow/Snakefile (lines 335-342):

shell:
    """
    curl -o {params.dir}/unite.zip -L {params.unite_url} > {log} 2>&1
    unzip -d {params.dir} {params.dir}/unite.zip >> {log} 2>&1
    cat {params.dir}/sh*.fasta > {params.dir}/unite
    rm -rf {params.dir}/developer {params.dir}/unite.zip {params.dir}/*.fasta
    mv {params.dir}/unite {params.dir}/unite.fasta
    """

workflow/Snakefile (lines 357-361):

shell:
    """
    Rscript --vanilla workflow/scripts/rundada2TAX.R {input[0]} {input[1]} \
        {output[0]} {threads} > {log} 2>&1
    """
Repository: https://github.com/andnischneider/its_workflow
License: MIT