Amplicon sequencing analysis workflow using DADA2 and QIIME2

Version: 2.6.1

Introduction

nf-core/ampliseq is a bioinformatics analysis pipeline for amplicon sequencing. It supports denoising of any amplicon and currently offers taxonomic assignment of 16S, ITS, CO1 and 18S amplicons; phylogenetic placement is also possible. Paired-end Illumina as well as single-end Illumina, PacBio and IonTorrent data are supported. By default, the pipeline analyses 16S rRNA gene amplicons sequenced paired-end on Illumina instruments.

A video about the relevance, usage and output of the pipeline (version 2.1.0; 26th Oct. 2021) can be found on YouTube and bilibili ; the slides are deposited at figshare .

nf-core/ampliseq workflow overview

The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website .

Pipeline summary

By default, the pipeline currently performs the following:

  • Sequencing quality control ( FastQC )

  • Trimming of reads ( Cutadapt )

  • Infer Amplicon Sequence Variants (ASVs) ( DADA2 )

  • Predict whether ASVs are ribosomal RNA sequences ( Barrnap )

  • Phylogenetic placement ( EPA-NG )

  • Taxonomic classification using DADA2, SINTAX or QIIME2

  • Exclude unwanted taxa; produce absolute and relative feature/taxa count tables and plots; plot alpha rarefaction curves; compute alpha and beta diversity indices and plots thereof ( QIIME2 )

  • Call differentially abundant taxa ( ANCOM )

  • Overall pipeline run summaries ( MultiQC )

Usage

Note If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
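
For example, a minimal test run could look like the following sketch (assuming Docker is available; replace <OUTDIR> with a writable path):

nextflow run nf-core/ampliseq -profile test,docker --outdir <OUTDIR>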

First, you need to know whether the sequencing files at hand are expected to contain primer sequences (usually yes) and, if so, which primer sequences. In the example below, the paired-end sequencing data was produced with the 515f (GTGYCAGCMGCCGCGGTAA) and 806r (GGACTACNVGGGTWTCTAAT) primers targeting the V4 region of the 16S rRNA gene. Please note that those sequences should not contain any sequencing adapter sequences, only the sequence that matches the biological amplicon.

Next, the data needs to be organized in a folder, here named data , or detailed in a samplesheet (see input documentation ).
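
For illustration, a samplesheet is a tab-separated table; the sketch below assumes the column layout described in the input documentation (sampleID, forwardReads, reverseReads, run) and uses placeholder file paths:

sampleID	forwardReads	reverseReads	run
sample1	data/sample1_R1.fastq.gz	data/sample1_R2.fastq.gz	A
sample2	data/sample2_R1.fastq.gz	data/sample2_R2.fastq.gz	A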

Now, you can run the pipeline using:

nextflow run nf-core/ampliseq \
 -profile <docker/singularity/.../institute> \
 --input "data" \
 --FW_primer "GTGYCAGCMGCCGCGGTAA" \
 --RV_primer "GGACTACNVGGGTWTCTAAT" \
 --outdir <OUTDIR>

Note Adding metadata will considerably increase the output, see metadata documentation .
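
As a sketch, metadata is likewise a tab-separated table whose first column is named ID and matches the sample IDs of the samplesheet; any further columns can be used for grouping in diversity and ANCOM analyses (the condition column here is a made-up example):

ID	condition
sample1	control
sample2	treated

It is passed with --metadata metadata.tsv; see the metadata documentation for the exact requirements.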

Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs .
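
For example, the parameters above could be collected once in a YAML file and passed with -params-file instead of individual command-line flags (a sketch; parameter names as in the parameter documentation):

# params.yml
input: "data"
FW_primer: "GTGYCAGCMGCCGCGGTAA"
RV_primer: "GGACTACNVGGGTWTCTAAT"
outdir: "results"

nextflow run nf-core/ampliseq -profile docker -params-file params.yml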

For more details, please refer to the usage documentation and the parameter documentation .

Pipeline output

To see the results of a test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation .

Credits

nf-core/ampliseq was originally written by Daniel Straub ( @d4straub ) and Alexander Peltzer ( @apeltzer ) for use at the Quantitative Biology Center (QBiC) and Microbial Ecology, Center for Applied Geosciences , part of Eberhard Karls Universität Tübingen (Germany). Daniel Lundin @erikrikarddaniel ( Linnaeus University, Sweden ) joined before pipeline release 2.0.0 and helped to improve the pipeline considerably.

We thank everyone who provided extensive assistance in the development of this pipeline.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines .

For further information or help, don't hesitate to get in touch on the Slack #ampliseq channel (you can join with this invite ).

Citations

If you use nf-core/ampliseq for your analysis, please cite the ampliseq article as follows:

Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline

Daniel Straub, Nia Blackwell, Adrian Langarica-Fuentes, Alexander Peltzer, Sven Nahnsen, Sara Kleindienst

Frontiers in Microbiology 2020, 11:2652 doi: 10.3389/fmicb.2020.550420 .

You can cite the nf-core/ampliseq Zenodo record for a specific version using the following DOI: 10.5281/zenodo.1493841

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x .

Code Snippets

"""
add_sh_to_taxonomy.py ${sh_info.join(' ')} $asvtable $blastfile $outtable $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
END_VERSIONS
"""
"""
IFS=',' read -r -a kingdom <<< \"$kingdom\"

for KINGDOM in "\${kingdom[@]}"
do
    barrnap \\
        --threads $task.cpus \\
        $args \\
        --kingdom \$KINGDOM \\
        --outseq Filtered.\${KINGDOM}.fasta \\
        < $fasta \\
        > rrna.\${KINGDOM}.gff

    #this fails when processing an empty file, so it requires a workaround!
    if [ -s Filtered.\${KINGDOM}.fasta ]; then
        grep -h '>' Filtered.\${KINGDOM}.fasta | sed 's/^>//' | sed 's/:\\+/\\t/g' | awk '{print \$2}' | sort -u >\${KINGDOM}.matches.txt
    else
        touch \${KINGDOM}.matches.txt
    fi
done

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    barrnap: \$(echo \$(barrnap --version 2>&1) | sed "s/^.*barrnap //g")
END_VERSIONS
"""
"""
summarize_barrnap.py $predictions

if [[ \$(wc -l < summary.tsv ) -le 1 ]]; then
    touch WARNING_no_rRNA_found_warning.txt
else
    touch no_warning.txt
fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$( python --version )
END_VERSIONS
"""
"""
combine_table.r ${table} ${seq} ${tax}
mv combined_ASV_table.tsv ${outfile}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
END_VERSIONS
"""
"""
#!/usr/bin/env Rscript
standard <- read.table(\"${files[0]}\", header = TRUE, sep = "\\t", stringsAsFactors = FALSE)
doubleprimer <- read.table(\"${files[1]}\", header = TRUE, sep = "\\t", stringsAsFactors = FALSE)
colnames(doubleprimer) <- c("sample", "cutadapt_doubleprimer_total_processed", "cutadapt_doubleprimer_reverse_complemented", "cutadapt_doubleprimer_passing_filters", "cutadapt_doubleprimer_passing_filters_percent")

#merge
df <- merge(standard, doubleprimer, by = "sample")

#filter columns
remove_columns <- c("cutadapt_doubleprimer_total_processed")
for(column in remove_columns) df[column]<-NULL

#write
write.table(df, file = \"cutadapt_summary.tsv\", quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\\t")

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")) ), "versions.yml")
"""
"""
cp $files cutadapt_summary.tsv
"""
"""
cutadapt_summary.py $mode *.cutadapt.log > ${name}_summary.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$( python --version )
END_VERSIONS
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))
set.seed($seed) # Initialize random number generator for reproducibility

#add "Species" if not already in taxlevels
taxlevels <- $taxlevels
if ( !"Species" %in% taxlevels ) { taxlevels <- c(taxlevels,"Species") }

taxtable <- readRDS(\"$taxtable\")

#remove Species annotation from assignTaxonomy
taxa_nospecies <- taxtable[,!colnames(taxtable) %in% 'Species']

tx <- addSpecies(taxa_nospecies, \"$database\", $args, verbose=TRUE)

# Create a table with specified column order
tmp <- data.frame(row.names(tx)) # To separate ASV_ID from sequence
expected_order <- c("ASV_ID",taxlevels,"confidence")
taxa <- as.data.frame( subset(tx, select = expected_order) )
taxa\$sequence <- tmp[,1]
row.names(taxa) <- row.names(tmp)

#rename Species annotation to Species_exact
colnames(taxa)[which(names(taxa) == "Species")] <- "Species_exact"

#add Species annotation from assignTaxonomy again, after "Genus" column
if ( "Species" %in% colnames(taxtable) ) {
    taxtable <- data.frame(taxtable)
    taxa_export <- data.frame(append(taxa, list(Species=taxtable\$Species), after=match("Genus", names(taxa))))
} else {
    taxa_export <- taxa
}

write.table(taxa_export, file = \"$outfile\", sep = "\\t", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '')

write.table('addSpecies\t$args\ntaxlevels\t$taxlevels\nseed\t$seed', file = "addSpecies.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

errF = readRDS("${errormodel[0]}")
errR = readRDS("${errormodel[1]}")

filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE))
filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE))

#denoising
sink(file = "${meta.run}.dada.log")
dadaFs <- dada(filtFs, err = errF, $args, multithread = $task.cpus)
saveRDS(dadaFs, "${meta.run}_1.dada.rds")
dadaRs <- dada(filtRs, err = errR, $args, multithread = $task.cpus)
saveRDS(dadaRs, "${meta.run}_2.dada.rds")
sink(file = NULL)

#make table
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, verbose=TRUE)
saveRDS(mergers, "${meta.run}.mergers.rds")
seqtab <- makeSequenceTable(mergers)
saveRDS(seqtab, "${meta.run}.seqtab.rds")

write.table('dada\t$args', file = "dada.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
write.table('mergePairs\t$args2', file = "mergePairs.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

errF = readRDS("${errormodel}")

filtFs <- sort(list.files("./filtered/", pattern = ".fastq.gz", full.names = TRUE))

#denoising
sink(file = "${meta.run}.dada.log")
dadaFs <- dada(filtFs, err = errF, $args, multithread = $task.cpus)
saveRDS(dadaFs, "${meta.run}.dada.rds")
sink(file = NULL)

#make table
seqtab <- makeSequenceTable(dadaFs)
saveRDS(seqtab, "${meta.run}.seqtab.rds")

#dummy file to fulfill output rules
saveRDS("dummy", "dummy_${meta.run}.mergers.rds")

write.table('dada\t$args', file = "dada.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))
set.seed($seed) # Initialize random number generator for reproducibility

fnFs <- sort(list.files(".", pattern = "_1.filt.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(".", pattern = "_2.filt.fastq.gz", full.names = TRUE))

sink(file = "${meta.run}.err.log")
errF <- learnErrors(fnFs, $args, multithread = $task.cpus, verbose = TRUE)
saveRDS(errF, "${meta.run}_1.err.rds")
errR <- learnErrors(fnRs, $args, multithread = $task.cpus, verbose = TRUE)
saveRDS(errR, "${meta.run}_2.err.rds")
sink(file = NULL)

pdf("${meta.run}_1.err.pdf")
plotErrors(errF, nominalQ = TRUE)
dev.off()
svg("${meta.run}_1.err.svg")
plotErrors(errF, nominalQ = TRUE)
dev.off()

pdf("${meta.run}_2.err.pdf")
plotErrors(errR, nominalQ = TRUE)
dev.off()
svg("${meta.run}_2.err.svg")
plotErrors(errR, nominalQ = TRUE)
dev.off()

sink(file = "${meta.run}_1.err.convergence.txt")
dada2:::checkConvergence(errF)
sink(file = NULL)

sink(file = "${meta.run}_2.err.convergence.txt")
dada2:::checkConvergence(errR)
sink(file = NULL)

write.table('learnErrors\t$args', file = "learnErrors.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))
set.seed($seed) # Initialize random number generator for reproducibility

fnFs <- sort(list.files(".", pattern = ".filt.fastq.gz", full.names = TRUE))

sink(file = "${meta.run}.err.log")
errF <- learnErrors(fnFs, $args, multithread = $task.cpus, verbose = TRUE)
saveRDS(errF, "${meta.run}.err.rds")
sink(file = NULL)

pdf("${meta.run}.err.pdf")
plotErrors(errF, nominalQ = TRUE)
dev.off()
svg("${meta.run}.err.svg")
plotErrors(errF, nominalQ = TRUE)
dev.off()

sink(file = "${meta.run}.err.convergence.txt")
dada2:::checkConvergence(errF)
sink(file = NULL)

write.table('learnErrors\t$args', file = "learnErrors.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

out <- filterAndTrim($in_and_out,
    $trunc_args,
    $args,
    compress = TRUE,
    multithread = $task.cpus,
    verbose = TRUE)
out <- cbind(out, ID = row.names(out))

write.table( out, file = "${meta.id}.filter_stats.tsv", sep = "\\t", row.names = FALSE, quote = FALSE, na = '')
write.table(paste('filterAndTrim\t$trunc_args','$args',sep=","), file = "filterAndTrim.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(digest))

#combine stats files
for (data in sort(list.files(".", pattern = ".stats.tsv", full.names = TRUE))) {
    if (!exists("stats")){ stats <- read.csv(data, header=TRUE, sep="\\t") }
    if (exists("stats")){
        temp <-read.csv(data, header=TRUE, sep="\\t")
        stats <-unique(rbind(stats, temp))
        rm(temp)
    }
}
write.table( stats, file = "DADA2_stats.tsv", sep = "\\t", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '')

#combine dada-class objects
files <- sort(list.files(".", pattern = ".ASVtable.rds", full.names = TRUE))
if ( length(files) == 1 ) {
    ASVtab = readRDS(files[1])
} else {
    ASVtab <- mergeSequenceTables(tables=files, repeats = "error", orderBy = "abundance", tryRC = FALSE)
}
saveRDS(ASVtab, "DADA2_table.rds")

df <- t(ASVtab)
colnames(df) <- gsub('_1.filt.fastq.gz', '', colnames(df))
colnames(df) <- gsub('.filt.fastq.gz', '', colnames(df))
df <- data.frame(sequence = rownames(df), df, check.names=FALSE)
# Create an md5 sum of the sequences as ASV_ID and rearrange columns
df\$ASV_ID <- sapply(df\$sequence, digest, algo='md5', serialize = FALSE)
df <- df[,c(ncol(df),3:ncol(df)-1,1)]

# file to publish
write.table(df, file = "DADA2_table.tsv", sep = "\\t", row.names = FALSE, quote = FALSE, na = '')

# Write fasta file with ASV sequences to file
write.table(data.frame(s = sprintf(">%s\n%s", df\$ASV_ID, df\$sequence)), 'ASV_seqs.fasta', col.names = FALSE, row.names = FALSE, quote = FALSE, na = '')

# Write ASV file with ASV abundances to file
df\$sequence <- NULL
write.table(df, file = "ASV_table.tsv", sep="\\t", row.names = FALSE, quote = FALSE, na = '')

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript

suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(ShortRead))

readfiles <- sort(list.files(".", pattern = ".fastq.gz", full.names = TRUE))

#make list of number of sequences
readfiles_length <- countLines(readfiles) / 4
sum_readfiles_length <- sum(readfiles_length)

#use only the first x files when read number gets above 2147483647, read numbers above that do not fit into an INT and crash the process!
if ( sum_readfiles_length > 2147483647 ) {
    max_files = length(which(cumsum(readfiles_length) <= 2147483647 ))
    write.table(max_files, file = paste0("WARNING Only ",max_files," of ",length(readfiles)," files and ",sum(readfiles_length[1:max_files])," of ",sum_readfiles_length," reads were used for ${meta} plotQualityProfile.txt"), row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
    readfiles <- readfiles[1:max_files]
} else {
    max_files <- length(readfiles)
    write.table(max_files, file = paste0(max_files," files were used for ${prefix} plotQualityProfile.txt"), row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
}

plot <- plotQualityProfile(readfiles$args)
data <- plot\$data

#aggregate data for each sequencing cycle
df <- data.frame(Cycle=character(), Count=character(), Median=character(), stringsAsFactors=FALSE)
cycles <- sort(unique(data\$Cycle))
for (cycle in cycles) {
    subdata <- data[data[, "Cycle"] == cycle, ]
    score <- list()
    #convert to list to calculate median
    for (j in 1:nrow(subdata)) {score <- unlist(c(score, rep(subdata\$Score[j], subdata\$Count[j])))}
    temp = data.frame(Cycle=cycle, Count=sum(subdata\$Count), Median=median(score), stringsAsFactors=FALSE)
    df <- rbind(df, temp)
}

#write output
write.table( t(df), file = paste0("${prefix}_qual_stats",".tsv"), sep = "\t", row.names = TRUE, col.names = FALSE, quote = FALSE)
pdf(paste0("${prefix}_qual_stats",".pdf"))
plot
dev.off()
svg(paste0("${prefix}_qual_stats",".svg"))
plot
dev.off()

write.table(paste0('plotQualityProfile\t$args\nmax_files\t',max_files), file = "${prefix}_plotQualityProfile.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")),paste0("    ShortRead: ", packageVersion("ShortRead")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

seqtab = readRDS("${seqtab}")

#remove chimera
seqtab.nochim <- removeBimeraDenovo(seqtab, $args, multithread=$task.cpus, verbose=TRUE)
if ( ${no_samples} == 1 ) { rownames(seqtab.nochim) <- "${first_sample}" }
saveRDS(seqtab.nochim,"${meta.run}.ASVtable.rds")

write.table('removeBimeraDenovo\t$args', file = "removeBimeraDenovo.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

#combine filter_and_trim files
for (data in list.files("./filter_and_trim_files", full.names=TRUE)){
    if (!exists("filter_and_trim")){ filter_and_trim <- read.csv(data, header=TRUE, sep="\\t") }
    if (exists("filter_and_trim")){
        tempory <-read.csv(data, header=TRUE, sep="\\t")
        filter_and_trim <-unique(rbind(filter_and_trim, tempory))
        rm(tempory)
    }
}
rownames(filter_and_trim) <- filter_and_trim\$ID
filter_and_trim["ID"] <- NULL
#write.table( filter_and_trim, file = "${meta.run}.filter_and_trim.tsv", sep = "\\t", row.names = TRUE, quote = FALSE, na = '')

#read data
dadaFs = readRDS("${denoised[0]}")
dadaRs = readRDS("${denoised[1]}")
mergers = readRDS("$mergers")
seqtab.nochim = readRDS("$seqtab_nochim")

#track reads through pipeline
getN <- function(x) sum(getUniques(x))
if ( nrow(filter_and_trim) == 1 ) {
    track <- cbind(filter_and_trim, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
} else {
    track <- cbind(filter_and_trim, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
}
colnames(track) <- c("DADA2_input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sub(pattern = "_1.fastq.gz\$", replacement = "", rownames(track)) #this is when cutadapt is skipped!
track <- cbind(sample = sub(pattern = "(.*?)\\\\..*\$", replacement = "\\\\1", rownames(track)), track)
write.table( track, file = "${meta.run}.stats.tsv", sep = "\\t", row.names = FALSE, quote = FALSE, na = '')

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

#combine filter_and_trim files
for (data in list.files("./filter_and_trim_files", full.names=TRUE)){
    if (!exists("filter_and_trim")){ filter_and_trim <- read.csv(data, header=TRUE, sep="\\t") }
    if (exists("filter_and_trim")){
        tempory <-read.csv(data, header=TRUE, sep="\\t")
        filter_and_trim <-unique(rbind(filter_and_trim, tempory))
        rm(tempory)
    }
}
rownames(filter_and_trim) <- filter_and_trim\$ID
filter_and_trim["ID"] <- NULL
#write.table( filter_and_trim, file = "${meta.run}.filter_and_trim.tsv", sep = "\\t", row.names = TRUE, quote = FALSE, na = '')

#read data
dadaFs = readRDS("${denoised[0]}")
seqtab.nochim = readRDS("$seqtab_nochim")

#track reads through pipeline
getN <- function(x) sum(getUniques(x))
if ( nrow(filter_and_trim) == 1 ) {
    track <- cbind(filter_and_trim, getN(dadaFs), rowSums(seqtab.nochim))
} else {
    track <- cbind(filter_and_trim, sapply(dadaFs, getN), rowSums(seqtab.nochim))
}
colnames(track) <- c("DADA2_input", "filtered", "denoised", "nonchim")
track <- cbind(sample = sub(pattern = "(.*?)\\\\..*\$", replacement = "\\\\1", rownames(track)), track)
write.table( track, file = "${meta.run}.stats.tsv", sep = "\\t", row.names = FALSE, quote = FALSE, na = '')

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))
set.seed($seed) # Initialize random number generator for reproducibility

taxlevels <- $taxlevels

seq <- getSequences(\"$fasta\", collapse = TRUE, silence = FALSE)
taxa <- assignTaxonomy(seq, \"$database\", taxLevels = taxlevels, $args, multithread = $task.cpus, verbose=TRUE, outputBootstraps = TRUE)

# (1) Make a data frame, add ASV_ID from seq
tx <- data.frame(ASV_ID = names(seq), taxa, sequence = row.names(taxa\$tax), row.names = names(seq))

# (2) Set confidence to the bootstrap for the most specific taxon
# extract columns with taxonomic values
tax <- tx[ , grepl( "tax." , names( tx ) ) ]
# find first occurrence of NA
res <- max.col(is.na(tax), ties = "first")
# correct if no NA is present in column to NA
if(any(res == 1)) is.na(res) <- (res == 1) & !is.na(tax[[1]])
# find index of last entry before NA, which is the bootstrap value
res <- res-1
# if NA choose last entry
res[is.na(res)] <- ncol(tax)
# extract bootstrap values
boot <- tx[ , grepl( "boot." , names( tx ) ) ]
boot\$last_tax <- res
valid_boot <- apply(boot,1,function(x) x[x[length(x)]][1]/100 )
# replace missing bootstrap values (NA) with 0
valid_boot[is.na(valid_boot)] <- 0
# add bootstrap values to column confidence
tx\$confidence <- valid_boot

# (3) Reorder columns before writing to file
expected_order <- c("ASV_ID",paste0("tax.",taxlevels),"confidence","sequence")
expected_order <- intersect(expected_order,colnames(tx))
taxa_export <- subset(tx, select = expected_order)
colnames(taxa_export) <- sub("tax.", "", colnames(taxa_export))
rownames(taxa_export) <- names(seq)

write.table(taxa_export, file = \"$outfile\", sep = "\\t", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '')

# Save a version with rownames for addSpecies
taxa_export <- cbind( ASV_ID = tx\$ASV_ID, taxa\$tax, confidence = tx\$confidence)
saveRDS(taxa_export, "ASV_tax.rds")

write.table('assignTaxonomy\t$args\ntaxlevels\t$taxlevels\nseed\t$seed', file = "assignTaxonomy.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
filt_codons.py -f ${fasta} -t ${asv} -p ASV_codon ${args}
filt_codon_stats.py ASV_codon_filtered.table.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
END_VERSIONS
"""
"""
#!/usr/bin/env Rscript

#load packages
suppressPackageStartupMessages(library(Biostrings))

#read abundance file, first column is ASV_ID
$read_table
colnames(table)[1] <- "ASV_ID"

#read fasta file of ASV sequences
seq <- readDNAStringSet("$fasta")
seq <- data.frame(ID=names(seq), sequence=paste(seq))

#filter
filtered_seq <- seq[nchar(seq\$sequence) %in% $min_len_asv:$max_len_asv,]
list <- filtered_seq[, "ID", drop = FALSE]
filtered_table <- merge(table, list, by.x="ASV_ID", by.y="ID", all.x=FALSE, all.y=TRUE)

#report
distribution_before <- table(nchar(seq\$sequence))
distribution_before <- data.frame(Length=names(distribution_before),Counts=as.vector(distribution_before))
distribution_after <- table(nchar(filtered_seq\$sequence))
distribution_after <- data.frame(Length=names(distribution_after),Counts=as.vector(distribution_after))

#write
write.table(filtered_table, file = "ASV_table.len.tsv", row.names=FALSE, sep="\t", col.names = TRUE, quote = FALSE, na = '')
write.table(data.frame(s = sprintf(">%s\n%s", filtered_seq\$ID, filtered_seq\$sequence)), 'ASV_seqs.len.fasta', col.names = FALSE, row.names = FALSE, quote = FALSE, na = '')
write.table(distribution_before, file = "ASV_len_orig.tsv", row.names=FALSE, sep="\t", col.names = TRUE, quote = FALSE, na = '')
write.table(distribution_after, file = "ASV_len_filt.tsv", row.names=FALSE, sep="\t", col.names = TRUE, quote = FALSE, na = '')

#stats
stats <- as.data.frame( t( rbind( colSums(table[-1]), colSums(filtered_table[-1]) ) ) )
stats\$ID <- rownames(stats)
colnames(stats) <- c("lenfilter_input","lenfilter_output", "sample")
write.table(stats, file = "stats.len.tsv", row.names=FALSE, sep="\t")

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    Biostrings: ", packageVersion("Biostrings")) ), "versions.yml")
"""
"""
#!/usr/bin/env Rscript

#load packages
suppressPackageStartupMessages(library(Biostrings))

kingdom <- as.list(strsplit("$kingdom", ",")[[1]])

df = read.table("$barrnap_summary", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# keep only ASV_ID & eval columns & sort
df <- subset(df, select = c(ASV_ID,mito_eval,euk_eval,arc_eval,bac_eval))

# choose kingdom (column) with lowest evalue
df[is.na(df)] <- 1
df\$result = colnames(df[,2:5])[apply(df[,2:5],1,which.min)]
df\$result = gsub("_eval", "", df\$result)

# filter ASVs
df_filtered = subset(df, df\$result %in% kingdom)
id_filtered = subset(df_filtered, select = c(ASV_ID))

#error if all ASVs are removed
if ( nrow(df_filtered) == 0 ) stop("Chosen kingdom(s) by --filter_ssu has no matches. Please choose a different kingdom (domain) or omit filtering.")

#read abundance file, first column is ASV_ID
table <- read.table(file = "$table", sep = '\t', comment.char = "", header=TRUE)
colnames(table)[1] <- "ASV_ID"

#read fasta file of ASV sequences
seq <- readDNAStringSet("$fasta")
seq <- data.frame(ID=names(seq), sequence=paste(seq))

#check if all ids match
if(!all(id_filtered\$ID %in% seq\$ID))  {stop(paste(paste(files,sep=","),"and","$fasta","dont share all IDs, exit."), call.=FALSE)}
if(!all(id_filtered\$ID %in% table\$ASV_ID))  {stop(paste(paste(files,sep=","),"and","$table","dont share all IDs, exit"), call.=FALSE)}

#merge
filtered_table <- merge(table, id_filtered, by.x="ASV_ID", by.y="ASV_ID", all.x=FALSE, all.y=TRUE)
filtered_seq <- merge(seq, id_filtered, by.x="ID", by.y="ASV_ID", all.x=FALSE, all.y=TRUE)

#write
write.table(filtered_table, file = "ASV_table.ssu.tsv", row.names=FALSE, sep="\t", col.names = TRUE, quote = FALSE, na = '')
write.table(data.frame(s = sprintf(">%s\n%s", filtered_seq\$ID, filtered_seq\$sequence)), 'ASV_seqs.ssu.fasta', col.names = FALSE, row.names = FALSE, quote = FALSE, na = '')

#stats
stats <- as.data.frame( t( rbind( colSums(table[-1]), colSums(filtered_table[-1]) ) ) )
stats\$ID <- rownames(stats)
colnames(stats) <- c("ssufilter_input","ssufilter_output", "sample")
write.table(stats, file = "stats.ssu.tsv", row.names=FALSE, sep="\t")

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    Biostrings: ", packageVersion("Biostrings")) ), "versions.yml")
"""
"""
filter_stats.py $unfiltered $filtered

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
END_VERSIONS
"""
"""
cat $fastain | sed '/^>/s/\t/ /g' > input.mod.fasta

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    sed: \$(sed --version 2>&1 | sed -n 1p | sed 's/sed (GNU sed) //')
END_VERSIONS
"""
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(stringr)) # provides str_split(), used further below

tax = read.table("$tax", header = TRUE, sep = "\t", stringsAsFactors = FALSE, comment.char = '', quote = '')

df <- data.frame(
    name=character(),
    LWR=character(),
    fract=character(),
    aLWR=character(),
    afract=character(),
    taxopath=character(),
    stringsAsFactors=FALSE)

for (asvid in unique(tax\$name)) {
    # subset per ASV ID
    temp = subset(tax, tax\$name == asvid)
    maxLWR = max(temp\$LWR)
    temp = subset(temp, temp\$LWR == maxLWR)

    if ( nrow(temp) == 1 ) {
        # STEP 1: if there is just 1 maximum, choose this row
        print( paste ( asvid,"was added in STEP 1" ) )
        temp = temp[which.max(temp\$LWR),]
        df <- rbind(df, temp)
    } else {
        # STEP2: be conservative, choose the taxonomy with less entries
        # count number of entries in "taxopath", simplified: number of semicolons
        temp\$taxlevels = nchar( temp\$taxopath ) - nchar(gsub(';', '', temp\$taxopath )) + 1
        minTaxlevels = min(temp\$taxlevels)
        temp = subset(temp, temp\$taxlevels == minTaxlevels)
        temp\$taxlevels = NULL
        if ( nrow(temp) == 1 ) {
            df <- rbind(df, temp)
            print( paste ( asvid,"was added in STEP 2" ) )
        } else {
            # STEP 3: if number of entries is same, remove last entry
            # then, check if reduced entries are identical > if yes, choose any row, if no, repeat
            # at that step the taxonomies have same length
            print( paste ( asvid,"enters STEP 3" ) )
            list_taxonpath <- str_split( temp\$taxopath, ";")
            df_taxonpath <- as.data.frame(do.call(rbind, list_taxonpath))
            for (i in ncol(df_taxonpath):0) {
                # choose first column and change taxon to reduced overlap
                if ( length(unique(df_taxonpath[,i])) == 1 ) {
                    if (i>1) {
                        temp\$taxopath <- apply(df_taxonpath[1,1:i], 1, paste, collapse=";")[[1]]
                    } else if (i==1) {
                        temp\$taxopath <- df_taxonpath[1,1]
                    }
                    df <- rbind(df, temp[1,])
                    print( paste ( asvid,"was added with",temp\$taxopath[[1]],"in STEP 3a" ) )
                    break
                } else if (i==0) {
                    # if all fails, i.e. there is no consensus on any taxonomic level, use NA
                    temp\$taxopath <- "NA"
                    df <- rbind(df, temp[1,])
                    print( paste ( asvid,"was added with",temp\$taxopath[[1]],"in STEP 3b" ) )
                }
            }
        }
    }
}

# output the cleaned, unified, unique taxonomic classification per ASV
write.table(df, file = "${meta.id}.taxonomy.per_query_unique.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\\t")

# output ASV taxonomy table for QIIME2
df_clean <- subset(df, select = c("name","taxopath"))
colnames(df_clean) <- c("ASV_ID","taxonomy")
write.table(df_clean, file = "${meta.id}.taxonomy.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\\t")

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2")) ), "versions.yml")
"""
"""
${params.dada_ref_databases[params.dada_ref_taxonomy]["fmtscript"]}

#Giving out information
echo -e "--dada_ref_taxonomy: ${params.dada_ref_taxonomy}\n" >ref_taxonomy.${suffix}.txt
echo -e "Title: ${params.dada_ref_databases[params.dada_ref_taxonomy]["title"]}\n" >>ref_taxonomy.${suffix}.txt
echo -e "Citation: ${params.dada_ref_databases[params.dada_ref_taxonomy]["citation"]}\n" >>ref_taxonomy.${suffix}.txt
echo "All entries: ${params.dada_ref_databases[params.dada_ref_taxonomy]}" >>ref_taxonomy.${suffix}.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bash: \$(bash --version | sed -n 1p | sed 's/GNU bash, version //g')
END_VERSIONS
"""
"""
${params.qiime_ref_databases[params.qiime_ref_taxonomy]["fmtscript"]}

#Giving out information
echo -e "--qiime_ref_taxonomy: ${params.qiime_ref_taxonomy}\n" >ref_taxonomy.txt
echo -e "Title: ${params.qiime_ref_databases[params.qiime_ref_taxonomy]["title"]}\n" >>ref_taxonomy.txt
echo -e "Citation: ${params.qiime_ref_databases[params.qiime_ref_taxonomy]["citation"]}\n" >>ref_taxonomy.txt
echo "All entries: ${params.qiime_ref_databases[params.qiime_ref_taxonomy]}" >>ref_taxonomy.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bash: \$(bash --version | sed -n 1p | sed 's/GNU bash, version //g')
END_VERSIONS
"""
"""
${params.sintax_ref_databases[params.sintax_ref_taxonomy]["fmtscript"]}

#Giving out information
echo -e "--sintax_ref_taxonomy: ${params.sintax_ref_taxonomy}\n" >ref_taxonomy_sintax.txt
echo -e "Title: ${params.sintax_ref_databases[params.sintax_ref_taxonomy]["title"]}\n" >>ref_taxonomy_sintax.txt
echo -e "Citation: ${params.sintax_ref_databases[params.sintax_ref_taxonomy]["citation"]}\n" >>ref_taxonomy_sintax.txt
echo "All entries: ${params.sintax_ref_databases[params.sintax_ref_taxonomy]}" >>ref_taxonomy_sintax.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bash: \$(bash --version | sed -n 1p | sed 's/GNU bash, version //g')
END_VERSIONS
"""
"""
add_full_sequence_to_taxfile.py $taxtable $fastafile $outfile

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
END_VERSIONS
"""
"""
convert_sintax_output.py -i $taxtable -f $fastafile -o $outfile -t $taxlevels -d \"${params.sintax_ref_databases[params.sintax_ref_taxonomy]["dbversion"]}\"

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
END_VERSIONS
"""
"""
ITSx \\
    -i $fasta \\
    $args \\
    --cpu $task.cpus \\
    -o ASV_ITS_seqs

if [ ! -s $outfile ]; then
    echo "ERROR: No ITS regions found by ITSx. You might want to modify --cut_its and/or --its_partial" >&2
    exit 1
fi

echo -e "ITSx\t$args" > ITSx.args.txt
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    ITSx: \$( ITSx -h 2>&1 > /dev/null | tail -n 2 | head -n 1 | cut -f 2 -d ' ' )
END_VERSIONS
"""
"""
#!/usr/bin/env Rscript
x <- read.table(\"file1.tsv\", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
y <- read.table(\"file2.tsv\", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

#merge
df <- merge(x, y, by = "sample", all = TRUE)

#write
write.table(df, file = \"overall_summary.tsv\", quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")

writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")) ), "versions.yml")
"""
"""
metadata_all.r ${metadata}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
END_VERSIONS
"""
"""
metadata_pairwise.r ${metadata}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
END_VERSIONS
"""
"""
novaseq_err_pe.r ${errormodel[0]} ${errormodel[1]} ${meta.run}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version | sed -n 1p | sed 's/R version //g' | sed 's/\\s.*\$//')
END_VERSIONS
"""
"""
novaseq_err_se.r ${errormodel} ${meta.run}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version | sed -n 1p | sed 's/R version //g' | sed 's/\\s.*\$//')
END_VERSIONS
"""
"""
#If the input is a QIIME2 file, then (1) the first line and (2) the leading character (#) of the second line need to be removed
if [ "$source" == 'QIIME2' ]
then
    tail -n +2 "$abund" > "${abund}.tmp" && mv "${abund}.tmp" "$abund"
fi

picrust2_pipeline.py \\
    $args \\
    -s $seq \\
    -i $abund \\
    -o all_output \\
    -p $task.cpus \\
    --in_traits EC,KO \\
    --verbose

#Add descriptions to identifiers
add_descriptions.py -i all_output/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                -o EC_pred_metagenome_unstrat_descrip.tsv
add_descriptions.py -i all_output/KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                -o KO_pred_metagenome_unstrat_descrip.tsv
add_descriptions.py -i all_output/pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                -o METACYC_path_abun_unstrat_descrip.tsv

echo "$message" > "${message}.txt"
echo -e "picrust\t$args" > "picrust.args.txt"

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    picrust2: \$( picrust2_pipeline.py -v | sed -e "s/picrust2_pipeline.py //g" )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

maxdepth=\$(count_table_minmax_reads.py $stats maximum 2>&1)

#check values
if [ \"\$maxdepth\" -gt \"75000\" ]; then maxdepth=\"75000\"; fi
if [ \"\$maxdepth\" -gt \"5000\" ]; then maxsteps=\"250\"; else maxsteps=\$((maxdepth/20)); fi
qiime diversity alpha-rarefaction  \\
    --i-table ${table}  \\
    --i-phylogeny ${tree}  \\
    --p-max-depth \$maxdepth  \\
    --m-metadata-file ${metadata}  \\
    --p-steps \$maxsteps  \\
    --p-iterations 10  \\
    --o-visualization alpha-rarefaction.qzv
qiime tools export --input-path alpha-rarefaction.qzv  \\
    --output-path alpha-rarefaction

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime composition add-pseudocount \\
    --i-table ${table} \\
    --o-composition-table comp-${table}
qiime composition ancom \\
    --i-table comp-${table} \\
    --m-metadata-file ${metadata} \\
    --m-metadata-column ${table.baseName} \\
    --o-visualization comp-${table.baseName}.qzv
qiime tools export --input-path comp-${table.baseName}.qzv \\
    --output-path ancom/Category-${table.baseName}-ASV

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"
mkdir ancom

# Sum data at the specified level
qiime taxa collapse \\
    --i-table ${table} \\
    --i-taxonomy ${taxonomy} \\
    --p-level ${taxlevel} \\
    --o-collapsed-table lvl${taxlevel}-${table}

# Extract summarised table and output a file with the number of taxa
qiime tools export \\
    --input-path lvl${taxlevel}-${table} \\
    --output-path exported/
biom convert \\
    -i exported/feature-table.biom \\
    -o ${table.baseName}-level-${taxlevel}.feature-table.tsv \\
    --to-tsv

if [ \$(grep -v '^#' -c ${table.baseName}-level-${taxlevel}.feature-table.tsv) -lt 2 ]; then
    echo ${taxlevel} > ancom/\"WARNING Summing your data at taxonomic level ${taxlevel} produced less than two rows (taxa), ANCOM can't proceed -- did you specify a bad reference taxonomy?\".txt
else
    qiime composition add-pseudocount \\
            --i-table lvl${taxlevel}-${table} \\
            --o-composition-table comp-lvl${taxlevel}-${table}
    qiime composition ancom \\
            --i-table comp-lvl${taxlevel}-${table} \\
            --m-metadata-file ${metadata} \\
            --m-metadata-column ${table.baseName} \\
            --o-visualization comp-lvl${taxlevel}-${table.baseName}.qzv
    qiime tools export --input-path comp-lvl${taxlevel}-${table.baseName}.qzv \\
            --output-path ancom/Category-${table.baseName}-level-${taxlevel}
fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime taxa barplot  \\
    --i-table ${table}  \\
    --i-taxonomy ${taxonomy}  \\
    ${metadata_cmd}  \\
    --o-visualization taxa-bar-plots.qzv  \\
    --verbose
qiime tools export \\
    --input-path taxa-bar-plots.qzv  \\
    --output-path barplot${suffix}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime feature-classifier classify-sklearn  \\
    --i-classifier ${trained_classifier}  \\
    --p-n-jobs ${task.cpus}  \\
    --i-reads ${repseq}  \\
    --o-classification taxonomy.qza  \\
    --verbose
qiime metadata tabulate  \\
    --m-input-file taxonomy.qza  \\
    --o-visualization taxonomy.qzv  \\
    --verbose
#produce "taxonomy/taxonomy.tsv"
qiime tools export \\
    --input-path taxonomy.qza  \\
    --output-path taxonomy
qiime tools export \\
    --input-path taxonomy.qzv  \\
    --output-path taxonomy
cp taxonomy/taxonomy.tsv .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime diversity adonis \\
    --p-n-jobs $task.cpus \\
    --i-distance-matrix ${core} \\
    --m-metadata-file ${metadata} \\
    --o-visualization ${core.baseName}_adonis.qzv \\
    $args \\
    --p-formula "${formula}"
qiime tools export \\
    --input-path ${core.baseName}_adonis.qzv \\
    --output-path adonis/${core.baseName}-${formula}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime diversity alpha-group-significance \\
    --i-alpha-diversity ${core} \\
    --m-metadata-file ${metadata} \\
    --o-visualization ${core.baseName}-vis.qzv
qiime tools export \\
    --input-path ${core.baseName}-vis.qzv \\
    --output-path "alpha_diversity/${core.baseName}"

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime diversity beta-group-significance \\
    --i-distance-matrix ${core} \\
    --m-metadata-file ${metadata} \\
    --m-metadata-column \"${category}\" \\
    --o-visualization ${core.baseName}-${category}.qzv \\
    --p-pairwise
qiime tools export \\
    --input-path ${core.baseName}-${category}.qzv \\
    --output-path beta_diversity/${core.baseName}-${category}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"
mkdir beta_diversity

qiime emperor plot \\
    --i-pcoa ${core} \\
    --m-metadata-file ${metadata} \\
    --o-visualization ${core.baseName}-vis.qzv
qiime tools export \\
    --input-path ${core.baseName}-vis.qzv \
    --output-path beta_diversity/${core.baseName}-PCoA

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

mindepth=\$(count_table_minmax_reads.py $stats minimum 2>&1)
if [ \"\$mindepth\" -lt \"$mindepth\" ]; then mindepth=$mindepth; fi

# report the rarefaction depth and return warning, if needed
if [ \"\$mindepth\" -lt \"1000\" ]; then
    echo \$mindepth >\"WARNING The sampling depth of \$mindepth seems too small for rarefaction.txt\"
elif [ \"\$mindepth\" -lt \"5000\" ]; then
    echo \$mindepth >\"WARNING The sampling depth of \$mindepth is very small for rarefaction.txt\"
elif [ \"\$mindepth\" -lt \"10000\" ]; then
    echo \$mindepth >\"WARNING The sampling depth of \$mindepth is quite small for rarefaction.txt\"
else
    echo \$mindepth >\"Use the sampling depth of \$mindepth for rarefaction.txt\"
fi

qiime diversity core-metrics-phylogenetic \\
    --m-metadata-file ${metadata} \\
    --i-phylogeny ${tree} \\
    --i-table ${table} \\
    --p-sampling-depth \$mindepth \\
    --output-dir diversity_core \\
    --p-n-jobs-or-threads ${task.cpus} \\
    --verbose

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

#produce raw count table in biom format "table/feature-table.biom"
qiime tools export \\
    --input-path ${table} \\
    --output-path table
cp table/feature-table.biom .

#produce raw count table "table/feature-table.tsv"
biom convert \\
    -i table/feature-table.biom \\
    -o feature-table.tsv \\
    --to-tsv

#produce representative sequence fasta file "sequences.fasta"
qiime feature-table tabulate-seqs \\
    --i-data ${repseq} \\
    --o-visualization rep-seqs.qzv
qiime tools export \\
    --input-path rep-seqs.qzv \\
    --output-path representative_sequences
cp representative_sequences/sequences.fasta rep-seq.fasta
cp representative_sequences/*.tsv .

##on several taxa level
array=(\$(seq ${tax_agglom_min} 1 ${tax_agglom_max}))
for i in \${array[@]}
do
    #collapse taxa
    qiime taxa collapse \\
        --i-table ${table} \\
        --i-taxonomy ${taxonomy} \\
        --p-level \$i \\
        --o-collapsed-table table-\$i.qza
    #export to biom
    qiime tools export \\
        --input-path table-\$i.qza \\
        --output-path table-\$i
    #convert to tab separated text file
    biom convert \\
        -i table-\$i/feature-table.biom \\
        -o abs-abund-table-\$i.tsv --to-tsv
done

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

#convert to relative abundances
qiime feature-table relative-frequency \\
    --i-table ${table} \\
    --o-relative-frequency-table relative-table-ASV.qza

#export to biom
qiime tools export \\
    --input-path relative-table-ASV.qza \\
    --output-path relative-table-ASV

#convert to tab separated text file "rel-table-ASV.tsv"
biom convert \\
    -i relative-table-ASV/feature-table.biom \\
    -o rel-table-ASV.tsv --to-tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

##on several taxa level
array=(\$(seq ${tax_agglom_min} 1 ${tax_agglom_max}))

for i in \${array[@]}
do
    #collapse taxa
    qiime taxa collapse \\
        --i-table ${table} \\
        --i-taxonomy ${taxonomy} \\
        --p-level \$i \\
        --o-collapsed-table table-\$i.qza
    #convert to relative abundances
    qiime feature-table relative-frequency \\
        --i-table table-\$i.qza \\
        --o-relative-frequency-table relative-table-\$i.qza
    #export to biom
    qiime tools export \\
        --input-path relative-table-\$i.qza \\
        --output-path relative-table-\$i
    #convert to tab separated text file
    biom convert \\
        -i relative-table-\$i/feature-table.biom \\
        -o rel-table-\$i.tsv --to-tsv
done

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

### Import
qiime tools import \\
    --type \'FeatureData[Sequence]\' \\
    --input-path ${database[0]} \\
    --output-path ref-seq.qza
qiime tools import \\
    --type \'FeatureData[Taxonomy]\' \\
    --input-format HeaderlessTSVTaxonomyFormat \\
    --input-path ${database[1]} \\
    --output-path ref-taxonomy.qza
#Extract sequences based on primers
qiime feature-classifier extract-reads \\
    --i-sequences ref-seq.qza \\
    --p-f-primer ${meta.FW_primer} \\
    --p-r-primer ${meta.RV_primer} \\
    --o-reads ${meta.FW_primer}-${meta.RV_primer}-ref-seq.qza \\
    --quiet

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime feature-table filter-samples \\
    --i-table "${table}" \\
    --m-metadata-file "${metadata}" \\
    --p-where \"${category}<>\'\'\" \\
    --o-filtered-table "filtered_${category}.qza"

qiime feature-table group \\
    --i-table "filtered_${category}.qza" \\
    --p-axis 'sample' \\
    --m-metadata-file "${metadata}" \\
    --m-metadata-column "${category}" \\
    --p-mode 'sum' \\
    --o-grouped-table "${category}"

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime feature-table filter-samples \\
    --i-table ${table} \\
    --m-metadata-file ${metadata} \\
    $args \\
    --o-filtered-table ${prefix}.qza

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

if ! [ \"${exclude_taxa}\" = \"none\" ]; then
    #filter sequences
    qiime taxa filter-seqs \\
        --i-sequences ${repseq} \\
        --i-taxonomy ${taxonomy} \\
        --p-exclude ${exclude_taxa} --p-mode contains \\
        --o-filtered-sequences tax_filtered-sequences.qza
    #filter abundance table
    qiime taxa filter-table \\
        --i-table ${table} \\
        --i-taxonomy ${taxonomy} \
        --p-exclude ${exclude_taxa} \\
        --p-mode contains \\
        --o-filtered-table tax_filtered-table.qza
    filtered_table="tax_filtered-table.qza"
    filtered_sequences="tax_filtered-sequences.qza"
else
    filtered_table=${table}
    filtered_sequences=${repseq}
fi
qiime feature-table filter-features \\
    --i-table \$filtered_table \\
    --p-min-frequency ${min_frequency} \\
    --p-min-samples ${min_samples} \\
    --o-filtered-table filtered-table.qza

qiime feature-table filter-seqs \\
    --i-data \$filtered_sequences \\
    --i-table filtered-table.qza \\
    --o-filtered-data filtered-sequences.qza

#produce raw count table in biom format "table/feature-table.biom"
qiime tools export \\
    --input-path filtered-table.qza  \\
    --output-path table
#produce raw count table
biom convert \\
    -i table/feature-table.biom \\
    -o table/feature-table.tsv  \\
    --to-tsv
cp table/feature-table.tsv filtered-table.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
# remove first line if needed
sed '/^# Constructed from biom file/d' "$asv" > biom-table.txt

# load into QIIME2
biom convert -i biom-table.txt -o table.biom --table-type="OTU table" --to-hdf5
qiime tools import \\
    --input-path table.biom \\
    --type 'FeatureTable[Frequency]' \\
    --input-format BIOMV210Format \\
    --output-path table.qza

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
qiime tools import \\
    --input-path "$seq" \\
    --type 'FeatureData[Sequence]' \\
    --output-path rep-seqs.qza

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
parse_dada2_taxonomy.r $tax

qiime tools import \\
    --type 'FeatureData[Taxonomy]' \\
    --input-format HeaderlessTSVTaxonomyFormat \\
    --input-path tax.tsv \\
    --output-path taxonomy.qza

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
qiime tools import \\
    --type 'Phylogeny[Rooted]' \\
    --input-path $tree \\
    --output-path tree.qza

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

#Train classifier
qiime feature-classifier fit-classifier-naive-bayes \\
    --i-reference-reads ${meta.FW_primer}-${meta.RV_primer}-ref-seq.qza \\
    --i-reference-taxonomy ref-taxonomy.qza \\
    --o-classifier ${meta.FW_primer}-${meta.RV_primer}-classifier.qza \\
    --quiet

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
export XDG_CONFIG_HOME="\${PWD}/HOME"

qiime alignment mafft \\
    --i-sequences ${repseq} \\
    --o-alignment aligned-rep-seqs.qza \\
    --p-n-threads ${task.cpus}
qiime alignment mask \\
    --i-alignment aligned-rep-seqs.qza \\
    --o-masked-alignment masked-aligned-rep-seqs.qza
qiime phylogeny fasttree \\
    --i-alignment masked-aligned-rep-seqs.qza \\
    --p-n-threads ${task.cpus} \\
    --o-tree unrooted-tree.qza
qiime phylogeny midpoint-root \\
    --i-tree unrooted-tree.qza \\
    --o-rooted-tree rooted-tree.qza
qiime tools export \\
    --input-path rooted-tree.qza  \\
    --output-path phylogenetic_tree
cp phylogenetic_tree/tree.nwk .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    qiime2: \$( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
"""
"""
if [ ! -f ${meta.id}.fastq.gz ]; then
    $args $reads ${meta.id}.fastq.gz
else
    touch ${meta.id}.fastq.gz
fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    sed: \$(sed --version 2>&1 | sed -n 1p | sed 's/sed (GNU sed) //')
END_VERSIONS
"""
"""
[ -f "${meta.id}_1.fastq.gz" ] || $args "${reads[0]}" "${meta.id}_1.fastq.gz"
[ -f "${meta.id}_2.fastq.gz" ] || $args "${reads[1]}" "${meta.id}_2.fastq.gz"

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    sed: \$(sed --version 2>&1 | sed -n 1p | sed 's/sed (GNU sed) //')
END_VERSIONS
"""
"""
sbdiexport.R $args $asvtable $taxonomytable $metadata

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
END_VERSIONS
"""
"""
if [[ $workflow.manifest.version == *dev ]]; then
    ampliseq_version="v$workflow.manifest.version, revision: ${workflow.scriptId.substring(0,10)}"
else
    ampliseq_version="v$workflow.manifest.version"
fi

sbdiexportreannotate.R \"$dbversion\" $taxonomytable $taxonomymethod \"\$ampliseq_version\" $predictions

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    R: \$(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
END_VERSIONS
"""
"""
trunclen.py $qual_stats $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version 2>&1 | sed 's/Python //g')
    pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
END_VERSIONS
"""
"""
cutadapt \\
    --cores $task.cpus \\
    $args \\
    $trimmed \\
    $reads \\
    > ${prefix}.cutadapt.log

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cutadapt: \$(cutadapt --version)
END_VERSIONS
"""
"""
touch ${prefix}.cutadapt.log
touch ${trimmed}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cutadapt: \$(cutadapt --version)
END_VERSIONS
"""
"""
epa-ng \\
    $args \\
    --threads $task.cpus \\
    $queryarg \\
    $refalnarg \\
    $reftreearg \\
    $bfastarg \\
    $binaryarg

if [ -e epa_result.jplace ]; then
    gzip epa_result.jplace
    cp epa_result.jplace.gz ${prefix}.epa_result.jplace.gz
fi
[ -e epa_info.log ]      && cp epa_info.log ${prefix}.epa_info.log

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    epang: \$(echo \$(epa-ng --version 2>&1) | sed 's/^EPA-ng v//')
END_VERSIONS
"""
"""
epa-ng \\
    $args \\
    --split $refaln $fullaln

gzip -c query.fasta > ${prefix}.query.fasta.gz; rm query.fasta
gzip -c reference.fasta > ${prefix}.reference.fasta.gz; rm reference.fasta

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    epang: \$(echo \$(epa-ng --version 2>&1) | sed 's/^EPA-ng v//')
END_VERSIONS
"""
"""
printf "%s %s\\n" $rename_to | while read old_name new_name; do
    [ -f "\${new_name}" ] || ln -s \$old_name \$new_name
done
fastqc $args --threads $task.cpus $renamed_files

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
END_VERSIONS
"""
"""
touch ${prefix}.html
touch ${prefix}.zip

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
END_VERSIONS
"""
"""
gappa \\
    examine assign \\
    $args \\
    --threads $task.cpus \\
    --jplace-path $jplace \\
    --taxon-file $taxonomy \\
    --file-prefix ${prefix}.

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gappa: \$(echo \$(gappa --version 2>&1 | sed 's/v//' ))
END_VERSIONS
"""
"""
gappa \\
    examine \\
    graft \\
    $args \\
    --threads $task.cpus \\
    --file-prefix ${prefix}. \\
    --jplace-path $jplace

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gappa: \$(echo \$(gappa --version 2>&1 | sed 's/v//' ))
END_VERSIONS
"""
"""
gappa \\
    examine \\
    heat-tree \\
    --threads $task.cpus \\
    --file-prefix ${prefix}. \\
    --jplace-path $jplace \\
    $args \\
    --log-file ${prefix}.log

grep '^ *At' ${prefix}.log > ${prefix}.colours.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gappa: \$(echo \$(gappa --version 2>&1 | sed 's/v//' ))
END_VERSIONS
"""
"""
esl-alimask \\
    $args \\
    $fmask_rfarg \\
    $fmask_allarg \\
    $gmask_rfarg \\
    $gmask_allarg \\
    $pmask_rfarg \\
    $pmask_allarg \\
    -o ${prefix}.masked.sthlm \\
    $unmaskedaln \\
    $maskfile

gzip ${prefix}.*mask*

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    hmmer/easel: \$(esl-reformat -h | grep -o '^# Easel [0-9.]*' | sed 's/^# Easel *//')
END_VERSIONS
"""
"""
esl-reformat \\
    $args \\
    $seqfile \\
    $postproc \\
    | gzip -c > ${prefix}.${suffix}.gz

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    hmmer/easel: \$(esl-reformat -h | grep -o '^# Easel [0-9.]*' | sed 's/^# Easel *//')
END_VERSIONS
"""
"""
hmmalign \\
    $args \\
    $hmm \\
    $fasta | gzip -c > ${prefix}.sthlm.gz

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    hmmer: \$(hmmalign -h | grep -o '^# HMMER [0-9.]*' | sed 's/^# HMMER *//')
END_VERSIONS
"""
"""
hmmbuild \\
    $args \\
    --cpu $task.cpus \\
    -n ${prefix}  \\
    -o ${prefix}.hmmbuild.txt \\
    ${mxfileopt} \\
    ${prefix}.hmm \\
    $alignment

gzip ${prefix}.hmm

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    hmmer: \$(echo \$(hmmbuild -h | grep HMMER | sed 's/# HMMER //' | sed 's/ .*//' 2>&1))
END_VERSIONS
"""
"""
mafft \\
    --thread ${task.cpus} \\
    ${args} \\
    ${add} \\
    ${fasta} \\
    > ${prefix}.fas

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    mafft: \$(mafft --version 2>&1 | sed 's/^v//' | sed 's/ (.*)//')
END_VERSIONS
"""
"""
multiqc \\
    --force \\
    $args \\
    $config \\
    $extra_config \\
    .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""
"""
touch multiqc_data
touch multiqc_plots
touch multiqc_report.html

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""
"""
vsearch \\
    --sintax $queryfasta \\
    --db $db \\
    --threads $task.cpus \\
    $args \\
    --tabbedout ${prefix}.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    vsearch: \$(vsearch --version 2>&1 | head -n 1 | sed 's/vsearch //g' | sed 's/,.*//g' | sed 's/^v//' | sed 's/_.*//')
END_VERSIONS
"""
"""
vsearch \\
    --usearch_global $queryfasta \\
    --db $db \\
    --id $idcutoff \\
    --threads $task.cpus \\
    $args \\
    ${columns} \\
    ${outfmt} ${prefix}.${out_ext}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    vsearch: \$(vsearch --version 2>&1 | head -n 1 | sed 's/vsearch //g' | sed 's/,.*//g' | sed 's/^v//' | sed 's/_.*//')
END_VERSIONS
"""