Metagenomic pipeline managed by snakemake

public 1yr ago Version: 1.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

1 - Download this repository

Download this repository to get the files describing the pipeline rules:

wget https://github.com/arthurvinx/Medusa/archive/refs/heads/main.zip
unzip main.zip
cd Medusa-main/

2 - Create the expected folder structure

Go to the folder containing the Snakefile (via command line) and create the expected folder structure with:

mkdir -p ./Pipeline/{result,data/{merged,assembled,collapsed,removal/{index,reference},raw,trimmed},alignment/{db,index},taxonomic/db,functional/db}

3 - Install the conda package manager

An Anaconda environment was created to ease the installation of the software required by this pipeline. The environment recipe is available at the Anaconda cloud: https://anaconda.org/arthurvinx/medusapipeline.

A conda package manager, such as miniconda, must be installed to get this environment and to install the Snakemake workflow management system.

Download and install the latest miniconda release for Linux (adapt the commands if needed):

cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
chmod u+x Miniconda3-py39_4.9.2-Linux-x86_64.sh
./Miniconda3-py39_4.9.2-Linux-x86_64.sh

Close the terminal after the installation, open it again, and then check if the installation was successful with:

conda -V

4 - Get the pipeline environment

Get the pipeline environment from the Anaconda cloud:

conda activate base
conda install anaconda-client -y
conda env create arthurvinx/medusaPipeline
conda activate medusaPipeline
pip3 install -U plyvel --no-cache-dir --no-deps --force-reinstall

5 - Install Snakemake

The recommended way to install Snakemake is via the conda package manager. The following commands will create a conda environment with the full version of Snakemake. More details can be found at the Snakemake installation guide .

conda activate base
conda install -c conda-forge mamba -y
mamba create -c bioconda -c conda-forge -n snakemake snakemake

Check whether your installation succeeded by typing:

conda activate snakemake
snakemake --help

6 - Move the input files to the expected location

Move your raw FASTQ files to the inputDIR specified in the Snakefile. By default, the inputDIR is "Pipeline/data/raw" and paired-end filenames are expected to present the suffixes "_1.fastq" and "_2.fastq". Alternatively, you may change the inputDIR editing the Snakefile. It is worth to mention that all paths in the Snakefile are interpreted relative to the directory Snakemake is executed in.

7 - Run snakemake

To start this pipeline, go to the folder containing the Snakefile (via command line) and run:

snakemake --cores

This will use all available cores whenever possible. Alternatively, you may define the number of available cores as seen in the Snakemake command line interface guide .

Code Snippets

args <- commandArgs(trailingOnly = TRUE)

library(dplyr)
library(tidyr)
library(data.table)
fixCollapsed <- function(df){
    colnames(df) <- c("key", "value")
    df <- df %>%
        mutate(key = strsplit(key, "; ")) %>%
        unnest(key)
    df <- df[, c(2, 1)]
    return(df)
}
fixDuplicated <- function(df){
    df <- df %>%
        group_by(key) %>%
        summarise(value = paste(value, collapse = "; "))
    values <- strsplit(df$value, "; ")
    values <- lapply(values, unique)
    values <- sapply(values, paste, collapse = "; ")
    df$value <- values
    return(df)
}
removeUnknown <- function(df){
    idx <- grepl("^-", df$key)
    df <- df[!idx,]
    return(df)
}
df <- fread(args[2], stringsAsFactors = FALSE,
    head = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>%
    fixCollapsed() %>%
    fixDuplicated() %>%
    removeUnknown() %>%
    fwrite(file = args[1], sep = "\t",
        nThread = as.integer(args[4]))
df <- fread(args[3], stringsAsFactors = FALSE,
    head = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>%
    fixCollapsed() %>%
    fixDuplicated() %>%
    removeUnknown() %>%
    fwrite(file = args[1], sep = "\t",
        append = TRUE, nThread = as.integer(args[4]))

R dplyr data.table tidyr From line 1 of main/createDictionary.R

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input} -o {output.trimmed} -q {phredQuality} -w {threads} -h {output.html} -j {output.json}"

SnakeMake fastp From line 51 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {output.f} \
    -O {output.r} -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json}"

SnakeMake fastp From line 65 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {referenceDIR} \
    ftp://ftp.ensembl.org/pub/release-{ensemblRelease}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    && pigz -d {output} -p {threads}"

SnakeMake From line 74 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2-build --large-index {input} {params.indexPrefix} --threads {threads} \
    && rm {input}"

SnakeMake Bowtie 2 From line 87 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -U {input.trimmed} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 4 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {input.trimmed}"

SnakeMake SAMtools Bowtie 2 From line 100 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -1 {input.f} -2 {input.r} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 12 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/1$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_1.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/2$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_2.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {removalDIR}/{wildcards.id}_unaligned.fastq {input.f} {input.r}"

SnakeMake SAMtools Bowtie 2 From line 120 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {mergedDIR}/{wildcards.id}_unmerged_1.fastq \
    -O {mergedDIR}/{wildcards.id}_unmerged_2.fastq -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json} \
    -m --merged_out {output.merged}"

SnakeMake fastp From line 140 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -r {input.reads} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"

SnakeMake MEGAHIT From line 153 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -1 {input.f} -2 {input.r} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"

SnakeMake MEGAHIT From line 164 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

SnakeMake Krona Kaiju From line 178 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.f} -j {input.r} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

SnakeMake Krona Kaiju From line 197 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

SnakeMake Krona Kaiju From line 215 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"

SnakeMake From line 226 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"

SnakeMake From line 233 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {NRDIR} \
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
    && pigz -d {output} -p {threads}"

SnakeMake From line 240 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && diamond makedb --in {input} -d {output} --threads {threads}"

SnakeMake Diamond From line 250 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
    && tar -xf {taxonomicDIR}/db/taxdump.tar.gz nodes.dmp names.dmp \
    && mv nodes.dmp {taxonomicDIR}/db && mv names.dmp {taxonomicDIR}/db \
    && rm {taxonomicDIR}/db/taxdump.tar.gz \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz \
    && pigz -d {taxonomicDIR}/db/prot.accession2taxid.gz -p {threads} \
    && kaiju-convertNR -t {output.nodes} -g {taxonomicDIR}/db/prot.accession2taxid -e $(conda info --base)/envs/medusaPipeline/bin/kaiju-excluded-accessions.txt -a -o {output.convertedNR} -i {input} \
    && rm {taxonomicDIR}/db/prot.accession2taxid {input}"

SnakeMake From line 261 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkbwt -n {threads} -a ACDEFGHIKLMNPQRSTVWY -o {taxonomicDIR}/db/kaijuNR {input} \
    && rm {input}"

SnakeMake From line 278 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkfmi {taxonomicDIR}/db/kaijuNR \
    && rm {input.bwt} {input.sa}"

SnakeMake From line 289 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 --un {output.unaligned} --threads {threads}"

SnakeMake Diamond From line 302 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 -F 15 --range-culling --un {output.unaligned} --threads {threads}"

SnakeMake Diamond From line 315 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {functionalDIR} \
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz \
    && pigz -d {output} -p {threads}"

SnakeMake From line 323 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && awk -F \"\t\" '{{if(($7!=\"\") && ($18!=\"\")){{print $18\"\t\"$7}}}}' {input} > {functionalDIR}/genbank2GO.txt \
    && awk -F \"\t\" '{{if(($4!=\"\") && ($7!=\"\")){{print $4\"\t\"$7}}}}' {input} > {functionalDIR}/refseq2GO.txt \
    && Rscript createDictionary.R {functionalDIR}/NR2GO.txt {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt {threads} \
    && annotate createdb {functionalDIR}/NR2GO.txt NR2GO 0 1 -d {functionalDIR}/db \
    && rm {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt"

SnakeMake From line 334 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matches} {output.GO} NR2GO -l 1 -d {functionalDIR}/db"

SnakeMake From line 349 of main/Snakefile

shell: "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matchesContigs} {output.contigsGO} NR2GO -l 1 -d {functionalDIR}/db"