Snakemake Pipeline for Raw FASTQ Processing with Terra Compatibility
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Here is a snakemake pipeline that processes raw fastq files and outputs several tables for further processing in R.
I think a path to make this run on Terra is to use the snakemake offical docker container: https://hub.docker.com/r/snakemake/snakemake
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | library(DECIPHER) library(dada2) library(Biostrings) seqtab.nochim <- readRDS(snakemake@input[["seqtab"]]) dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs load(snakemake@input[["decipher_db"]]) # CHANGE TO THE PATH OF YOUR TRAINING SET ids <- IdTaxa(dna, trainingSet, strand = "both", processors = snakemake@threads, verbose = FALSE) # use all processors ranks <- c("domain", "phylum", "class", "order", "family", "genus") # ranks of interest # Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy taxid <- t(sapply(ids, function(x) { m <- match(ranks, x$rank) taxa <- x$taxon[m] taxa[startsWith(taxa, "unclassified_")] <- NA taxa })) colnames(taxid) <- ranks rownames(taxid) <- getSequences(seqtab.nochim) saveRDS(taxid, snakemake@output[["taxonomy_decipher"]]) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | library(dada2) library(Biostrings) seqtab.nochim <- readRDS(snakemake@input[["seqtab"]]) dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs species_assignment <- assignSpecies( dna, snakemake@input[["silva_species_db"]], allowMultiple = TRUE, tryRC = TRUE, n = 1000, verbose = FALSE) saveRDS(species_assignment, snakemake@output[["taxonomy_silva_species"]]) |
41 42 | wrapper: "0.79.0/bio/dada2/filter-trim" |
62 63 | wrapper: "0.79.0/bio/dada2/learn-errors" |
78 79 | wrapper: "0.79.0/bio/dada2/dereplicate-fastq" |
96 97 | wrapper: "0.79.0/bio/dada2/sample-inference" |
112 113 | wrapper: "0.79.0/bio/dada2/make-table" |
125 126 | wrapper: "0.79.0/bio/dada2/remove-chimeras" |
138 139 | wrapper: "0.79.0/bio/dada2/collapse-nomismatch" |
150 151 | script: "scripts/assign_taxonomy_decipher.R" |
162 163 | script: "scripts/assign_taxonomy_silva_species.R" |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( seqtab = readRDS(snakemake@input[[1]]) ) # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using default parameters from dada2::collapseNoMismatch()") } # Collapse sequences taxa<-do.call(collapseNoMismatch, args) # Store the resulting table as a RDS file saveRDS(taxa, snakemake@output[[1]],compress = T) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( fls = unlist(snakemake@input)) # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using default parameters from dada2::derepFastq()") } # Dereplicate uniques<-do.call(derepFastq, args) # Store as RDS file saveRDS(uniques,snakemake@output[[1]]) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( fwd = snakemake@input[["fwd"]], filt = snakemake@output[["filt"]], multithread=snakemake@threads ) # Test if paired end input is passed if(!is.null(snakemake@input[["rev"]]) & !is.null(snakemake@output[["filt_rev"]])){ args<-c(args, rev = snakemake@input[["rev"]], filt.rev = snakemake@output[["filt_rev"]] ) } # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Check if 'compress=' option is passed if(!is.null(extra[["compress"]])){ stop("Remove the `compress=` option from `params`.\n", "The `compress` option is implicitly set here from the file extension.") } else { # Check if output files are given as compressed files # ex: in se version, all(TRUE, NULL) gives TRUE compressed <- c( endsWith(args[["filt"]], '.gz'), if(is.null(args[["filt.rev"]])) NULL else {endsWith(args[["filt.rev"]], 'gz')} ) if ( all(compressed) ) { extra[["compress"]] <- TRUE } else if ( any(compressed) ) { stop("Either all or no fastq output should be compressed. Please check `output.filt` and `output.filt_rev` for consistency.") } else { extra[["compress"]] <- FALSE } } # Add them to the list of arguments args<-c(args, extra) } else { message("No optional parameters. Using default parameters from dada2::filterAndTrim()") } # Call the function with arguments filt.stats<-do.call(filterAndTrim, args) # Write processed reads report write.table(filt.stats, snakemake@output[["stats"]], sep="\t", quote=F) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( fls = snakemake@input, multithread=snakemake@threads ) # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using defaults parameters from dada2::learnErrors()") } # Learn errors rates for both read types err<-do.call(learnErrors, args) # Plot estimated versus observed error rates to validate models perr<-plotErrors(err, nominalQ = TRUE) # Save the plots library(ggplot2) ggsave(snakemake@output[["plot"]], perr, width = 8, height = 8, dpi = 300) # Store the estimated errors as RDS files saveRDS(err, snakemake@output[["err"]],compress = T) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # If names are provided use them nm<-if(is.null(snakemake@params[["names"]])) NULL else snakemake@params[["names"]] # From a list of n lists to one named list of n elements smps<-setNames( object=unlist(snakemake@input), nm=nm ) # Read the RDS into the list smps<-lapply(smps, readRDS) # Prepare arguments (no matter the order) args<-list( samples = smps) # Check if extra params are passed (apart from [["names"]]) if(length(snakemake@params) > 1 ){ # Keeping only the named elements of the list for do.call() (apart from [["names"]]) extra<-snakemake@params[ names(snakemake@params) != "" & names(snakemake@params) != "names" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using default parameters from dada2::makeSequenceTable()") } # Make table seqTab<-do.call(makeSequenceTable, args) # Store the table as a RDS file saveRDS(seqTab, snakemake@output[[1]],compress = T) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( unqs = readRDS(snakemake@input[[1]]), multithread=snakemake@threads ) # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using default parameters from dada2::removeBimeraDenovo()") } # Remove chimeras seqTab_nochimeras<-do.call(removeBimeraDenovo, args) # Store the estimated errors as RDS files saveRDS(seqTab_nochimeras, snakemake@output[[1]],compress = T) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | log.file<-file(snakemake@log[[1]],open="wt") sink(log.file) sink(log.file,type="message") library(dada2) # Prepare arguments (no matter the order) args<-list( derep = readRDS(snakemake@input[["derep"]]), err = readRDS(snakemake@input[["err"]]), multithread = snakemake@threads ) # Check if extra params are passed if(length(snakemake@params) > 0 ){ # Keeping only the named elements of the list for do.call() extra<-snakemake@params[ names(snakemake@params) != "" ] # Add them to the list of arguments args<-c(args, extra) } else{ message("No optional parameters. Using default parameters from dada2::dada()") } # Learn errors rates for both read types inferred_composition<-do.call(dada, args) # Store the inferred sample composition as RDS files saveRDS(inferred_composition, snakemake@output[[1]],compress = T) # Proper syntax to close the connection for the log file # but could be optional for Snakemake wrapper sink(type="message") sink() |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/elsherbini/amplicon_for_terra
Name:
amplicon_for_terra
Version:
1
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
- Future updates
Related Workflows
ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...
Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...
ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics
RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...
This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...