Metagenomic pipeline managed by snakemake

public public 1yr ago Version: 1.0 0 bookmarks

1 - Download this repository

Download this repository to get the files describing the pipeline rules:

wget https://github.com/arthurvinx/Medusa/archive/refs/heads/main.zip
unzip main.zip
cd Medusa-main/

2 - Create the expected folder structure

Go to the folder containing the Snakefile (via command line) and create the expected folder structure with:

mkdir -p ./Pipeline/{result,data/{merged,assembled,collapsed,removal/{index,reference},raw,trimmed},alignment/{db,index},taxonomic/db,functional/db}

3 - Install the conda package manager

An Anaconda environment was created to ease the installation of the software required by this pipeline. The environment recipe is available at the Anaconda cloud: https://anaconda.org/arthurvinx/medusapipeline.

A conda package manager, such as miniconda, must be installed to get this environment and to install the Snakemake workflow management system.

Download and install the latest miniconda release for Linux (adapt the commands if needed):

cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
chmod u+x Miniconda3-py39_4.9.2-Linux-x86_64.sh
./Miniconda3-py39_4.9.2-Linux-x86_64.sh

Close the terminal after the installation, open it again, and then check if the installation was successful with:

conda -V

4 - Get the pipeline environment

Get the pipeline environment from the Anaconda cloud:

conda activate base
conda install anaconda-client -y
conda env create arthurvinx/medusaPipeline
conda activate medusaPipeline
pip3 install -U plyvel --no-cache-dir --no-deps --force-reinstall

5 - Install Snakemake

The recommended way to install Snakemake is via the conda package manager. The following commands will create a conda environment with the full version of Snakemake. More details can be found at the Snakemake installation guide .

conda activate base
conda install -c conda-forge mamba -y
mamba create -c bioconda -c conda-forge -n snakemake snakemake

Check whether your installation succeeded by typing:

conda activate snakemake
snakemake --help

6 - Move the input files to the expected location

Move your raw FASTQ files to the inputDIR specified in the Snakefile. By default, the inputDIR is "Pipeline/data/raw" and paired-end filenames are expected to present the suffixes "_1.fastq" and "_2.fastq". Alternatively, you may change the inputDIR editing the Snakefile. It is worth to mention that all paths in the Snakefile are interpreted relative to the directory Snakemake is executed in.

7 - Run snakemake

To start this pipeline, go to the folder containing the Snakefile (via command line) and run:

snakemake --cores

This will use all available cores whenever possible. Alternatively, you may define the number of available cores as seen in the Snakemake command line interface guide .

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
args <- commandArgs(trailingOnly = TRUE)

library(dplyr)
library(tidyr)
library(data.table)
fixCollapsed <- function(df){
    colnames(df) <- c("key", "value")
    df <- df %>%
        mutate(key = strsplit(key, "; ")) %>%
        unnest(key)
    df <- df[, c(2, 1)]
    return(df)
}
fixDuplicated <- function(df){
    df <- df %>%
        group_by(key) %>%
        summarise(value = paste(value, collapse = "; "))
    values <- strsplit(df$value, "; ")
    values <- lapply(values, unique)
    values <- sapply(values, paste, collapse = "; ")
    df$value <- values
    return(df)
}
removeUnknown <- function(df){
    idx <- grepl("^-", df$key)
    df <- df[!idx,]
    return(df)
}
df <- fread(args[2], stringsAsFactors = FALSE,
    head = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>%
    fixCollapsed() %>%
    fixDuplicated() %>%
    removeUnknown() %>%
    fwrite(file = args[1], sep = "\t",
        nThread = as.integer(args[4]))
df <- fread(args[3], stringsAsFactors = FALSE,
    head = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>%
    fixCollapsed() %>%
    fixDuplicated() %>%
    removeUnknown() %>%
    fwrite(file = args[1], sep = "\t",
        append = TRUE, nThread = as.integer(args[4]))
51
52
53
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input} -o {output.trimmed} -q {phredQuality} -w {threads} -h {output.html} -j {output.json}"
65
66
67
68
69
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {output.f} \
    -O {output.r} -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json}"
74
75
76
77
78
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {referenceDIR} \
    ftp://ftp.ensembl.org/pub/release-{ensemblRelease}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    && pigz -d {output} -p {threads}"
SnakeMake From line 74 of main/Snakefile
87
88
89
90
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2-build --large-index {input} {params.indexPrefix} --threads {threads} \
    && rm {input}"
100
101
102
103
104
105
106
107
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -U {input.trimmed} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 4 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {input.trimmed}"
120
121
122
123
124
125
126
127
128
129
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -1 {input.f} -2 {input.r} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 12 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/1$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_1.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/2$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_2.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {removalDIR}/{wildcards.id}_unaligned.fastq {input.f} {input.r}"
140
141
142
143
144
145
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {mergedDIR}/{wildcards.id}_unmerged_1.fastq \
    -O {mergedDIR}/{wildcards.id}_unmerged_2.fastq -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json} \
    -m --merged_out {output.merged}"
153
154
155
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -r {input.reads} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"
164
165
166
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -1 {input.f} -2 {input.r} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"
178
179
180
181
182
183
184
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"
197
198
199
200
201
202
203
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.f} -j {input.r} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"
215
216
217
218
219
220
221
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"
226
227
228
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"
SnakeMake From line 226 of main/Snakefile
233
234
235
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"
SnakeMake From line 233 of main/Snakefile
240
241
242
243
244
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {NRDIR} \
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
    && pigz -d {output} -p {threads}"
SnakeMake From line 240 of main/Snakefile
250
251
252
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && diamond makedb --in {input} -d {output} --threads {threads}"
261
262
263
264
265
266
267
268
269
270
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
    && tar -xf {taxonomicDIR}/db/taxdump.tar.gz nodes.dmp names.dmp \
    && mv nodes.dmp {taxonomicDIR}/db && mv names.dmp {taxonomicDIR}/db \
    && rm {taxonomicDIR}/db/taxdump.tar.gz \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz \
    && pigz -d {taxonomicDIR}/db/prot.accession2taxid.gz -p {threads} \
    && kaiju-convertNR -t {output.nodes} -g {taxonomicDIR}/db/prot.accession2taxid -e $(conda info --base)/envs/medusaPipeline/bin/kaiju-excluded-accessions.txt -a -o {output.convertedNR} -i {input} \
    && rm {taxonomicDIR}/db/prot.accession2taxid {input}"
SnakeMake From line 261 of main/Snakefile
278
279
280
281
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkbwt -n {threads} -a ACDEFGHIKLMNPQRSTVWY -o {taxonomicDIR}/db/kaijuNR {input} \
    && rm {input}"
SnakeMake From line 278 of main/Snakefile
289
290
291
292
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkfmi {taxonomicDIR}/db/kaijuNR \
    && rm {input.bwt} {input.sa}"
SnakeMake From line 289 of main/Snakefile
302
303
304
305
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 --un {output.unaligned} --threads {threads}"
315
316
317
318
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 -F 15 --range-culling --un {output.unaligned} --threads {threads}"
323
324
325
326
327
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {functionalDIR} \
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz \
    && pigz -d {output} -p {threads}"
SnakeMake From line 323 of main/Snakefile
334
335
336
337
338
339
340
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && awk -F \"\t\" '{{if(($7!=\"\") && ($18!=\"\")){{print $18\"\t\"$7}}}}' {input} > {functionalDIR}/genbank2GO.txt \
    && awk -F \"\t\" '{{if(($4!=\"\") && ($7!=\"\")){{print $4\"\t\"$7}}}}' {input} > {functionalDIR}/refseq2GO.txt \
    && Rscript createDictionary.R {functionalDIR}/NR2GO.txt {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt {threads} \
    && annotate createdb {functionalDIR}/NR2GO.txt NR2GO 0 1 -d {functionalDIR}/db \
    && rm {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt"
SnakeMake From line 334 of main/Snakefile
349
350
351
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matches} {output.GO} NR2GO -l 1 -d {functionalDIR}/db"
SnakeMake From line 349 of main/Snakefile
360
361
362
shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matchesContigs} {output.contigsGO} NR2GO -l 1 -d {functionalDIR}/db"
SnakeMake From line 360 of main/Snakefile
ShowHide 19 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/arthurvinx/Medusa
Name: medusa
Version: 1.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...