MAG Snakemake Workflow

public public 1yr ago Version: v1.0 0 bookmarks

MAG Snakemake Workflow

Contents

Overview

This pipeline has the scripts and modules used to generate, quality assess, and taxonomically classify skin MAGs which the pipeline will generate using different per sample and co-assembly approaches. The prok.Snakefile and euk.Snakefile detail the prokaryotic and eukaryotic analyses respectively. Note that for the eukaryotic analyses, EukRep and EukCC need to already have been run. To use the pipeline a file called coassembly_runs.txt and runs.txt must be produced which detail the co-assembly approaches and the per sample approaches used in our pipeline. The metagenomes must be put in the directory data/singlerun and data/coassembly for the single run and co-assembly approaches respectively. A sample coassembly_runs.txt and runs.txt file has been provided. This code has been adapted from https://github.com/Finn-Lab/MAG_Snakemake_wf/ accompanying the paper (https://www.nature.com/articles/s41596-021-00508-2). The viral analysis was done via the VIRify pipeline (https://github.com/EBI-Metagenomics/emg-viral-pipeline).

System Requirements

Hardware Requirements

HPC with at least 500 gigabytes of memory

Software Requirements

  • parallel-fastq-dump v0.6.6

  • KneadData v0.7.4 (Bowtie2 v2.3.5.1, Trimmommatic v0.39)

  • SPAdes v.3.13.0

  • metaWRAP v1.2.1

  • INFERNAL v1.1.2

  • tRNAScan-SE v2.0

  • BBMap v.37.62

  • GUNC v1.0.1

  • CAT v5.2.1

  • CheckM v1.1.2

  • Mash v.2.0

  • MUMmer v3.23

  • QUASTv5.0.2

  • dRep v2.3.2

  • GTDB-Tkv1.0.2

  • ncbi-genome-downloadv0.2.12

  • EukRep v.0.6.7

  • EukCC v0.2

Running pipeline

Submitting jobs

To run pipeline on the runs specified in runs.txt and coassembly_runs.txt, submit jobs with SLURM scheduler:

snakemake -s prok.Snakefile --use-singularity --restart-times 3 -k -j 50 --cluster-config clusterconfig.yaml --cluster "sbatch -n {cluster.nCPU} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -t {cluster.time}"

Submit jobs with LSF scheduler:

snakemake -s prok.Snakefile --use-singularity --restart-times 3 -k --jobs 50 --cluster-config clusterconfig.yaml --cluster "bsub -n {cluster.nCPU} -M {cluster.mem} -e {cluster.error} -o {cluster.output}"

Code Snippets

61
62
63
    shell:"""
       scp {input.tsv} {output.tsv}
"""
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
run:
    os.system("mkdir -p " + str(params.odir))
    infile=str(input)
    outfile=str(output)
    qc=str(params.qc)
    with open(infile) as inp:
         LoL=[x.strip().split('\t') for x in inp]
         row1=LoL[1]
         completeness=LoL[1][0]
         contamination=LoL[1][1]
         run=str(params.run)
         g = open(outfile, "w")
         if float(completeness)>50 and float(contamination)<5:
              g.write(run+','+completeness+','+contamination+'\n')
              g.close()
              h = open(params.qc, "a")
              h.write(run+','+completeness+','+contamination+'\n')
              h.close()
              f = str(params.f)
              odir=str(params.odir)
              os.system("scp "+f+ " "+str(odir))
         else:
              g.write("")
              g.close()
104
shell: "echo -e 'genome,completeness,contamination' | cat - {input} > {output}"
117
118
119
120
121
122
123
124
   shell:"""
        dRep dereplicate -p {threads} \
            {params.outfolder} -g {params.infolder}/*.fa \
            -pa 0.9 -sa 0.95 -nc 0.3 \
            -cm larger \
            --genomeInfo {input.genome_info} \
            -comp 50 -con 5 
"""
135
136
shell:
     "mash dist -p {threads} {input.db} {input.bins} > {output}"
146
147
148
149
shell:
    """
    sort -gk3 {input.mashdist}|sed -n 1p >{output}
    """
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
shell:
    """
    mkdir -p {params.genomesdir}
    while read col1 col2 rem
      do
        f="$(basename -- ${{col1}})"
        echo {params.genomesdir}/${{f%%.gz}}
        if [ ! -f {params.genomesdir}/${{f%%.gz}} ]; then
           echo "copying file over"
           scp ${{col1}} {params.genomesdir}
           gunzip {params.genomesdir}/${{f}}
        fi  
        echo 'dnadiff ${{col1}} ${{col2}} -p ${{col1%%.fasta}}_${{col2%%.fa}}_'
        dnadiff {params.genomesdir}/${{f%%.gz}} ${{col2}} -p {params.outdir}/{params.bins}
      done < {input.bestmash}
    """
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
run:
    outfile = str(output)
    f = open(input.dnadiff)
    data = f.read()
    first_line = data.split("\n", 1)[0]
    a = first_line.split(" ")
    ref = a[0]
    quer = a[1]
    with open(outfile, "w") as outf:
        path_dna = input.dnadiff
        base = os.path.basename(path_dna)
        base = base.split(".report")[0]
        with open(path_dna) as f:
            for line in f:
                if "TotalBases" in line:
                    cols = line.split()
                    lenref = int(cols[1])
                    lenquer = int(cols[2])
                if "AlignedBases" in line:
                    cols = line.split()
                    aliref = cols[1].split("(")[-1].split("%")[0]
                    alique = cols[2].split("(")[-1].split("%")[0]
                if "AvgIdentity" in line:
                    cols = line.split()
                    ident = float(cols[1])
        line = "%s\t%s\t%i\t%.2f\t%i\t%.2f\t%.2f" % (ref, quer, lenref, float(aliref), lenquer, float(alique), float(ident))
        outf.writelines(line + "\n")
218
219
220
221
shell:
    """
    cat {input}>{output}
    """
32
33
34
35
36
37
38
shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """
54
55
56
57
58
59
60
shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """
67
68
69
70
71
72
73
74
shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem}\
    -a {input.fasta} --maxbin2 --metabat2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {input.fwd} {input.rev}
    touch {output.outfile}
    """
 93
 94
 95
 96
 97
 98
 99
100
shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem} \
    -a {params.assembly} --metabat2 --maxbin2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {params.reads}
    touch {output}
    """
35
36
37
38
39
shell:
    """
    cat {input.r1} > {output.r1}
    cat {input.r2} > {output.r2}
    """
51
52
53
54
55
56
shell:
    """
    repair.sh in={input.fwd} in2={input.rev} out={output.fwd} out2={output.rev} repair
    rm -f {input.fwd}
    rm -f  {input.rev}
    """
69
70
71
72
73
74
75
76
77
78
79
shell:
    """
    rm -f {params.fwd}
    rm -f {params.rev}
    rm -f {output.fwd}
    rm -f {output.rev}
    gzip {input.fwd}
    gzip {input.rev}
    mv {params.fwd} {output.fwd}
    mv {params.rev} {output.rev} 
    """
89
90
91
92
93
94
95
96
97
98
run:
    outfile = str(output)
    if os.path.exists(outfile):
        os.remove(outfile)
    with open(outfile, "w") as outf:
        outf.writelines("Run\tReadcount\n")
        for run in input:
            readcount = int(linecount(run))
            line = run + "\t" + str(readcount)
            outf.writelines(line + "\n")
29
30
31
32
33
34
shell:
    """
     rm -rf {params.outdir}
     dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """
45
46
47
48
49
50
shell:
    """
    awk 'FNR!=1' {input.checkm} {input.checkm_coas}>{params.checkm}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm} > {output}
    rm {params.checkm}
    """
66
67
68
69
70
71
72
73
74
shell:
    """
    rm -rf {params.outdir}
    mkdir -p {params.indir}
    rsync {params.singlerun_dRep}/* {params.indir}
    rsync {params.all_coas}/* {params.indir}
    dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """
88
89
90
91
92
93
94
shell:
    """
    real=$(realpath {input.gtdbrelease})
    rm -rf {params.outdir}
    export GTDBTK_DATA_PATH=${{real}}
    gtdbtk classify_wf --cpus {threads} --genome_dir {params.indir} --out_dir {params.outdir} -x {params.ext}
    """
102
103
104
105
shell:
    """
    Rscript scripts/plotting/plot_gtdb.R {input}
    """
30
31
32
33
34
35
36
37
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    rm -rf {output}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
     touch {output}
    """
48
49
50
51
52
53
shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """
72
73
74
75
76
77
78
shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """
86
87
88
89
shell:
    """
    touch {output}
    """
101
102
103
104
105
106
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """
117
118
119
120
121
122
123
124
125
shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """
30
31
32
33
34
35
36
shell:
    """
    checkm data setRoot ~/checkm_database/ 
    rm -rf {params.outdir}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
    touch {output}
    """
47
48
49
50
51
52
shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """
72
73
74
75
76
77
78
shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """
86
87
88
89
shell:
    """
    touch {output}
    """
101
102
103
104
105
106
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """
117
118
119
120
121
122
123
124
125
shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """
135
136
137
138
shell:
    """
    Rscript scripts/plotting/plot_checkm_mags.R {input.sr} {input.coas}
    """
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
library(ggplot2)


args <- commandArgs(trailingOnly = TRUE)

checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE)
checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE)

checkm_sr$Status2="Single run"
checkm_coas$Status2="Co-assembly"

checkm=rbind.data.frame(checkm_sr,checkm_coas)
checkm$Status=factor(checkm$Status, levels=c("Single run", "Co-assembly"))

setwd("data/figures/")
ggplot(checkm, aes(x=Status, y=completeness, fill=Status)) +
  geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Completeness") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+theme(legend.position="none")
ggsave("checkm_completeness.png",plot = last_plot())

ggplot(checkm, aes(x=Status, y=contamination, fill=Status))+geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Contamination") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+theme(legend.position="none")+ylim(0,10)
ggsave("checkm_contam.png",width=5,height=5)
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
library(RColorBrewer)
library(tidyr)
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)
gtdb.taxa=read.delim(args[1],header=TRUE,stringsAsFactor=FALSE)

ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
gtdb.taxa = separate(data = gtdb.taxa, col = classification, sep = ";", into = ranks)
gtdb.bac = gtdb.taxa[which(gtdb.taxa$Domain == "d__Bacteria"),]


# calculate bacteria proportions
if(exists("gtdb.fi.bac")){
  rm(gtdb.fi.bac)
}
for (r in ranks[2:(length(ranks)-1)]){
  total = nrow(gtdb.bac)   #total number of genomes
  rank.present = table(gtdb.bac[!grepl("__$", gtdb.bac[,r]),r])   #summary of genus
  total.rank = sum(rank.present) #sum of genus counts 
  gtdb.ranks = data.frame(sort(rank.present)[(length(rank.present)-6):length(rank.present)])
  gtdb.ranks$Prop = gtdb.ranks$Freq/total*100
  other.name = paste(tolower(substr(r, 1, 1)), "__Other", sep="")
  novel.name = paste(tolower(substr(r, 1, 1)), "__Novel", sep="")

  other.freq = total.rank-sum(gtdb.ranks$Freq)
  novel.freq=total-total.rank  #novel genomes

  other.prop = other.freq/total*100
  novel.prop = novel.freq/total*100

  other.ranks = data.frame(Var1=other.name, Freq=other.freq, Prop=other.prop)
  novel.ranks = data.frame(Var1=novel.name, Freq=novel.freq, Prop=novel.prop)

  ranks.fi = rbind(other.ranks, gtdb.ranks)
  ranks.fi = rbind(novel.ranks, ranks.fi)

  ranks.fi$Rank = r
  ranks.fi$Colour=c("#999999","#CAB2D6","#A6CEE3","#FF7F00","#FB9A99","#E31A1C","#B2DF8A","#33A02C","#1F78B4")

  if(exists("gtdb.fi.bac")){
    gtdb.fi.bac = rbind(gtdb.fi.bac, ranks.fi)
  } else {
    gtdb.fi.bac = ranks.fi
  }
}

gtdb.fi.bac$Rank=factor(gtdb.fi.bac$Rank,levels=c("Phylum","Class","Order","Family","Genus"))
# plot stacked plot taxa counts
p<-print(ggplot(gtdb.fi.bac, aes(x=Rank, y=Prop, fill=Var1)) 
      + geom_bar(stat="identity", colour="darkgrey", alpha=0.5, size=0.2, width=0.7)
      + theme_bw()
      + ylab("Proportion of species (%)")
      + coord_flip()
      + scale_fill_manual(values=as.vector(gtdb.fi.bac$Colour), name="Taxa")
      #+ guides(fill=FALSE)
      + scale_x_discrete(limits=ranks[(length(ranks)-1):2])
      + theme(legend.position="right")
      + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
      + theme(axis.title.x = element_text(size=14))
      + theme(axis.text.y = element_text(size=12))
      + theme(axis.title.y = element_blank())
      + theme(axis.text.x = element_text(size=12)))



ggsave("data/figures/gtdb_bacteria.png",p,width=15)
ShowHide 28 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Finn-Lab/skin_microbiome
Name: skin_microbiome
Version: v1.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...