Recovery of prokaryotic genomes from shotgun metagenomic sequencing data

public public 1yr ago 0 bookmarks

MAG Snakemake Workflow

Contents

Overview

This pipeline can be used for recovery and quality assessment of prokaryotic MAGs from short-read, host-associated metagenomic datasets. The data analyzed is specificed using two files, which detail the co-assembly samples and single runs that are to be analyzed. The file runs.txt has the SRA accession for the single runs with a different accession in each line. The file coassembly_runs.txt specifies the co-assembly samples. This file is in tabular format with three columns, the first column specifies the name of the resulting co-assembly sample. The r1 and r2 columns specify the path of the forward and reverse reads constituting each co-assembly sample, with each read path separated by a comma. In this pipeline, we used a small subset of gut dataset previously analyzed by Almeida et al. There are 40 single runs and 2 co-assembly samples. The co-assembly samples are named based on the metadata that was chosen to perform co-assembly. If the runs are not publicly available, they should be placed in the subdirectory data/raw with respect to the Snakefile. The forward and reverse reads must have the extensions _1.fastq and _2.fastq respectively. Note that the version of kneaddata used in this pipeline still requires the read suffixes /1 and /2, so the headers must be formatted accordingly. If the runs are publicly available, specify them in the runs.txt file and the sra_download module will download the runs from the SRA using their SRA accession to the directory data/raw.

System Requirements

Hardware Requirements

HPC with at least 500 gigabytes of memory

The CPU times below are generated using the cluster config file included in the repo

Software Requirements

MAG Snakemake pipeline (https://github.com/Finn-Lab/MAG_Snakemake_wf)

Singularity 3.5.0 (https://github.com/hpcng/singularity)

Snakemake (version 5.18) (https://github.com/snakemake/snakemake)

Running the MAG Snakemake pipeline will automatically download the sequencing data from the SRA. It will also download the relevant singularity containers so the relevant software needed for our pipeline can be used. Alternatively, the tools can be manually downloaded from:

ncbi-genome-download (version 0.3.0) (https://github.com/kblin/ncbi-genome-download)

mash (version 2.2.1) (https://github.com/marbl/Mash)

parallel-fastq-dump (version 0.6.6) & fastq-dump (version 2.8.0) (https://github.com/rvalieris/parallel-fastq-dump)

fastqc (version 0.11.7) (https://github.com/s-andrews/FastQC)

multiqc (version 1.3) (https://github.com/ewels/MultiQC)

kneaddata (version 0.7.4) with Trimmomatic (version 0.39) & Bowtie (version 2.4.2) (https://github.com/biobakery/kneaddata)

metaSPAdes (version 3.14.0) (https://github.com/ablab/spades)

metaWRAP (version 1.2.2) (https://github.com/bxlab/metaWRAP)

CheckM (version 1.0.12) (https://github.com/Ecogenomics/CheckM)

Bowtie (version 2.4.1) (https://github.com/BenLangmead/bowtie2)

Prokka (version 1.14.5) (https://github.com/tseemann/prokka)

CMSeq (version 1.0) (https://bitbucket.org/CibioCM/cmseq)

mummer (version 3.23) (https://github.com/mummer4/mummer)

dRep (version 2.3.2) (https://github.com/MrOlm/drep)

GTDB_Tk (version 1.2.0) (https://github.com/Ecogenomics/GTDBTk)

bwa (version 0.7.17) (https://github.com/lh3/bwa)

samtools (version 1.9) (https://github.com/samtools/samtools)

Other

RefSeq complete bacterial genomes (downloaded May 2020) (https://www.ncbi.nlm.nih.gov/refseq/)

GTDB database (release 95) (https://data.ace.uq.edu.au/public/gtdb/data/releases/)

Workflow setup

Download the GTDB database using:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz

Download all RefSeq bacterial genomes using:

ncbi-genome-download bacteria --formats fasta --section refseq --assembly-levels complete

Next generate a Mash sketch of the database with default k-mer and sketch size from the main directory using:

mash sketch -o refseq.msh /path/to/RefSeq/*fasta

Download the code for the pipeline from (https://github.com/Finn-Lab/MAG_Snakemake_wf) in a location that has at least 1.5 TB of disk space. Change directory to this folder. Move the GTDB database and the RefSeq Mash sketch to the subfolder /data/databases using:

cd /path/to/MAG_Snakemake_wf/
mkdir -p data/databases
mv /path/to/refseq.msh data/databases
mv /path/to/gtdbtk_r95_data.tar.gz data/databases
tar -xvzf data/databases/gtdbtk_r95_data.tar.gz

Install snakemake into an environment using:

conda create -c conda-forge -c bioconda -n snakemake snakemake=5.18

Then activate the environment before using snakemake:

conda activate snakemake

Running pipeline

Submitting jobs

To run pipeline on the small gut dataset specified in runs.txt and coassembly_runs.txt, submit jobs with SLURM scheduler:

snakemake --use-singularity --restart-times 3 -k -j 50 --cluster-config clusterconfig.yaml --cluster "sbatch -n {cluster.nCPU} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -t {cluster.time}"

Submit jobs with LSF scheduler:

snakemake --use-singularity --restart-times 3 -k --jobs 50 --cluster-config clusterconfig.yaml --cluster "bsub -n {cluster.nCPU} -M {cluster.mem} -e {cluster.error} -o {cluster.output}"

CPU-time

The CPU time for the demo dataset and the cluster configuration file provided is as follows:

Data Download: 16 hours

Preprocessing: 48 hours

Assembly/Co-assembly: 2600 hours

Binning: 100 hours

Quality Assessment & Bin Refinement; Estimate completeness and contamination of MAGs: 25 hours

Quality Assessment & Bin Refinement; Estimate strain heterogeneity of MAGs: 700 hours

Quality Assessment & Bin Refinement; Compare MAGs to RefSeq genomes: 1 hour

Quality Assessment & Bin Refinement; Bin refinement: 260 hours

Dereplicate MAGs: 4 hours

Taxonomic Classification: 3 hours

Evaluate Bottlenecks: 120 hours

Citation

For a walk-through of this pipeline, please visit: Saheb Kashaf, S., Almeida, A., Segre, J.A. et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat Protoc (2021). https://doi.org/10.1038/s41596-021-00508-2

To generate results from the paper associated with this pipeline, please select a figure to reproduce in the Snakefile.

Code Snippets

34
35
36
37
38
39
40
shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """
58
59
60
61
62
63
64
shell:
    """
    rm -rf {params.outdir}
    dmesg -T
    spades.py --meta -1 {input.fwd} -2 {input.rev} \
    -o {params.outdir} -t {threads} -m {resources.mem}
    """
69
70
71
72
73
74
75
76
shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem}\
    -a {input.fasta} --maxbin2 --metabat2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {input.fwd} {input.rev}
    touch {output.outfile}
    """
 97
 98
 99
100
101
102
103
104
shell:
    """
    rm -rf {params.outdir}
    metawrap binning -t {threads} -m {resources.mem} \
    -a {params.assembly} --metabat2 --maxbin2 --concoct \
    -l {params.mincontiglength} -o {params.outdir} {params.reads}
    touch {output}
    """
40
41
42
43
shell:
    """
    scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output}
    """
64
65
66
67
68
shell:
    """
    prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \
    --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads}
    """
82
83
84
85
shell:
    """
    scripts/rename_multifasta_prefix.py -f {input.MAG} -p {params.name} > {output}
    """
106
107
108
109
110
shell:
    """
    prokka {input} --kingdom Bacteria --outdir {params.out_prokka} \
    --prefix {params.prefix} --force --locustag {params.prefix} --cpus {threads}
    """
135
136
137
138
shell:
    """
    scripts/cmseq.sh -t {threads} -i {input.r1} -n {input.r2} -r {input.MAG} -g {input.prokka} -o {params.name}
    """
172
173
174
175
176
177
178
179
180
181
shell:
    """
    rm -f {output.cmseq}
    rm -f {output.done}
    for i in {params.r1}; do run=$(basename ${{i}} _1.fastq); scripts/cmseq.sh\
    -t {threads} -i ${{i}} -n ${{i%%_1.fastq}}_2.fastq -r {input.MAG} -g {input.prokka}\
    -o {params.name}_${{run}}; awk 'NR==2' {params.name}_${{run}}.cmseq.csv >> {output.cmseq};done

    touch {output.done}
    """
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
run:
    if os.path.exists(str(output)):
        os.remove(str(output))
    else:
        print("Aggregating single run strain heterogeneity results")
    path = str(params)
    file1 = open(str(output), "w")
    for filename in glob.glob(os.path.join(path, "*.csv")):
        with open(filename, "r") as f:
            lines = f.readlines()
            if len(lines) > 1:
                lines_sub = lines[1].strip()
                L = filename + "\t" + lines_sub + "\n"
                file1.writelines(L)
    file1.close()
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
run:
    if os.path.exists(str(output)):
        os.remove(str(output))
    else:
        print("Aggregating coassembly strain heterogeneity results")
    path = str(params)
    file1 = open(str(output), "w")
    for filename in glob.glob(os.path.join(path, "*SRR*.csv")):
        with open(filename, "r") as f:
            lines = f.readlines()
            if len(lines) > 1:
                lines_sub = lines[1].strip()
                L = filename + "\t" + lines_sub + "\n"
                file1.writelines(L)
    file1.close()
239
240
241
242
shell:
    """
    cat {input}>{output}         
    """
250
251
252
253
shell:
    """
    cat {input}>{output}
    """
264
265
266
267
shell:
    """
    Rscript scripts/plotting/plot_cmseq.R {input.sr} {input.coas}
    """
35
36
37
38
39
shell:
    """
    cat {input.r1} > {output.r1}
    cat {input.r2} > {output.r2}
    """
51
52
53
54
55
56
shell:
    """
    repair.sh in={input.fwd} in2={input.rev} out={output.fwd} out2={output.rev} repair
    rm -f {input.fwd}
    rm -f  {input.rev}
    """
69
70
71
72
73
74
75
76
77
78
79
shell:
    """
    rm -f {params.fwd}
    rm -f {params.rev}
    rm -f {output.fwd}
    rm -f {output.rev}
    gzip {input.fwd}
    gzip {input.rev}
    mv {params.fwd} {output.fwd}
    mv {params.rev} {output.rev} 
    """
89
90
91
92
93
94
95
96
97
98
run:
    outfile = str(output)
    if os.path.exists(outfile):
        os.remove(outfile)
    with open(outfile, "w") as outf:
        outf.writelines("Run\tReadcount\n")
        for run in input:
            readcount = int(linecount(run))
            line = run + "\t" + str(readcount)
            outf.writelines(line + "\n")
29
30
shell:
    "mash dist -p {threads} {input.db} {input.bins} > {output}"
43
44
45
46
shell:
    """
    sort -gk3 {input.mashdist}|sed -n 1p >{output}
    """
62
63
64
65
66
67
68
69
shell:
    """
    while read col1 col2 rem
      do
        echo 'dnadiff ${{col1}} ${{col2}} -p ${{col1%%.fasta}}_${{col2%%.fa}}_'
        dnadiff ${{col1}} ${{col2}} -p {params.outdir}/{params.bins}
      done < {input.bestmash}
    """
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
run:
    outfile = str(output)
    f = open(input.dnadiff)
    data = f.read()
    first_line = data.split("\n", 1)[0]
    a = first_line.split(" ")
    ref = a[0]
    quer = a[1]
    with open(outfile, "w") as outf:
        path_dna = input.dnadiff
        base = os.path.basename(path_dna)
        base = base.split(".report")[0]
        with open(path_dna) as f:
            for line in f:
                if "TotalBases" in line:
                    cols = line.split()
                    lenref = int(cols[1])
                    lenquer = int(cols[2])
                if "AlignedBases" in line:
                    cols = line.split()
                    aliref = cols[1].split("(")[-1].split("%")[0]
                    alique = cols[2].split("(")[-1].split("%")[0]
                if "AvgIdentity" in line:
                    cols = line.split()
                    ident = float(cols[1])
        line = "%s\t%s\t%i\t%.2f\t%i\t%.2f\t%.2f" % (ref, quer, lenref, float(aliref), lenquer, float(alique), float(ident))
        outf.writelines(line + "\n")
118
119
120
121
shell:
    """
    cat {input}>{output}
    """
129
130
131
132
shell:
    """
    cat {input}>{output}
    """
147
148
149
150
shell:
    """
    cat {input}>{output}
    """
162
163
164
165
shell:
    """
    Rscript scripts/plotting/dnadiff_plot.R {input.checkm_sr} {input.checkm_coas} {input.summ}
    """
31
32
33
34
35
36
shell:
    """
     rm -rf {params.outdir}
     dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """
47
48
49
50
51
52
shell:
    """
    awk 'FNR!=1' {input.checkm} {input.checkm_coas}>{params.checkm}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm} > {output}
    rm {params.checkm}
    """
70
71
72
73
74
75
76
77
78
shell:
    """
    rm -rf {params.outdir}
    mkdir -p {params.indir}
    rsync {params.singlerun_dRep}/* {params.indir}
    rsync {params.all_coas}/* {params.indir}
    dRep dereplicate -p {threads} {params.outdir} -g {params.indir}/*.fa \
    -pa 0.9 -sa 0.95 -nc 0.30 -cm larger --genomeInfo {input.checkm} -comp 50 -con 5
    """
 94
 95
 96
 97
 98
 99
100
shell:
    """
    real=$(realpath {input.gtdbrelease})
    rm -rf {params.outdir}
    export GTDBTK_DATA_PATH=${{real}}
    gtdbtk classify_wf --cpus {threads} --genome_dir {params.indir} --out_dir {params.outdir} -x {params.ext}
    """
110
111
112
113
shell:
    """
    Rscript scripts/plotting/plot_gtdb.R {input}
    """
33
34
35
36
37
38
39
40
41
42
43
44
shell:
    """
    rm -rf {params.dir}
    mkdir -p {params.dir}
    scp {input.scaffold} {params.scaffold}
    bwa index {params.scaffold}
    bwa mem -t {threads} {params.scaffold} {input.fwd} {input.rev} | samtools view -bS - | \
    samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """
54
55
56
57
shell:
    """
    crimson flagstat {input}>{output}
    """
65
66
67
68
69
70
71
72
73
74
75
76
77
run:
    with open(str(output), "w") as outf:
        for i in input:
            run = str(i).split("singlerun/")[1]
            run = run.split("/mapreads/")[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()
94
95
96
97
98
shell:
    """
    cat {params.indir}/*fa>{output}
    bwa index {output}
    """
110
111
112
113
114
shell:
    """
    cat {params.indir}/*.fa>{output}
    bwa index {output}
    """
128
129
130
131
132
133
134
135
shell:
    """
    bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev} \
    | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """
149
150
151
152
153
154
155
156
shell:
    """
    bwa mem -t {threads} {input.catalogue} {input.fwd} {input.rev}\
    | samtools view -bS - | samtools sort -@ {threads} -o {params.alignedsorted} -
    samtools index {params.alignedsorted}
    samtools flagstat {params.alignedsorted} > {output.flagstat}
    rm {params.alignedsorted}
    """
166
167
168
169
shell:
    """
    crimson flagstat {input}>{output}
    """
179
180
181
182
shell:
    """
    crimson flagstat {input}>{output}
    """
190
191
192
193
194
195
196
197
198
199
200
201
202
run:
    with open(str(output), "w") as outf:
        for i in input:
            base = os.path.basename(str(i))
            run = os.path.splitext(base)[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()
211
212
213
214
215
216
217
218
219
220
221
222
223
run:
    with open(str(output), "w") as outf:
        for i in input:
            base = os.path.basename(str(i))
            run = os.path.splitext(base)[0]
            dict = eval(open(str(i), "r").read())
            supplementary = dict["pass_qc"]["supplementary"]
            secondary = dict["pass_qc"]["secondary"]
            mapped = dict["pass_qc"]["mapped"]
            total = dict["pass_qc"]["total"]
            perc = (mapped - supplementary - secondary) 
            outf.write(str(run) + "\t" + str(perc) + "\n")
    outf.close()
241
242
243
244
shell:
    """
    Rscript scripts/plotting/plot_framework.R {input.readcounts} {input.flagstat} {input.mapreads_sr} {input.mapreads_coas} {params.summary}
    """
30
31
32
33
shell:
    """
    fastqc {input.fwd} --outdir {params.outdir}
    """
47
48
49
50
shell:
    """
    fastqc {input.rev} --outdir {params.outdir}
    """
65
66
67
68
69
70
shell:
    """
    rm -f {params.outfile}
    multiqc {params.indir} -o {params.outdir}
    mv {params.outfile} {output}
    """
85
86
87
88
shell:
    """
    kneaddata_database --download human_genome bowtie2 {params.outdir}
    """
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
shell:
    """
    #this version of kneaddata requires read identifiers (/1, /2). Given we did not download the file from the sra using the option --readids, we need to remove spaces in the headers
    # and then add /1 and /2 using reformat.sh. Kneaddata by default does not sort the reads so Next we sort the reads. 
    # Alternatively, one could sort using kneaddata with the option --reorder
    rm -f {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2} {params.fwd} {params.rev} {output.fwd} {output.rev}
    mkdir -p {params.outdir}
    seqtk seq -C {input.fwd} > {params.tmp_fwd}
    seqtk seq -C {input.rev} > {params.tmp_rev}
    reformat.sh in={params.tmp_fwd} in2={params.tmp_rev} out1={params.tmp_fwd2} out2={params.tmp_rev2} addslash spaceslash=f
    kneaddata --remove-intermediate-output --threads {threads} \
    --input {params.tmp_fwd2} --input {params.tmp_rev2}\
    --output {params.outdir} \
    --reference-db {params.indx} \
    --trimmomatic-options "ILLUMINACLIP:/data/adapters/TruSeq3-PE.fa:2:30:10: SLIDINGWINDOW:4:20 MINLEN:50" --trimmomatic /data/\
    --bowtie2-options "--very-sensitive --dovetail" 
    repair.sh in={params.fwd} in2={params.rev} out={output.fwd} out2={output.rev} repair
    rm {params.tmp_fwd} {params.tmp_rev} {params.tmp_fwd2} {params.tmp_rev2}
    """
147
148
149
150
shell:
    """
    fastqc {input.fwd} --outdir {params.outdir}
    """
163
164
165
166
shell:
    """
    fastqc {input.rev} --outdir {params.outdir}
    """
181
182
183
184
185
shell:
    """
    multiqc --force {params.indir} -o {params.outdir}
    mv {params.outfile} {output}
    """
198
199
200
201
202
203
204
205
206
207
run:
    outfile = str(output)
    if os.path.exists(outfile):
        os.remove(outfile)
    with open(outfile, "w") as outf:
        outf.writelines("Run\tReadcount\n")
        for run in input:
            readcount = int(linecount(run))
            line = run + "\t" + str(readcount)
            outf.writelines(line + "\n")
32
33
34
35
36
37
38
39
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    rm -rf {output}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
     touch {output}
    """
50
51
52
53
54
55
shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """
74
75
76
77
78
79
80
shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """
88
89
90
91
shell:
    """
    touch {output}
    """
105
106
107
108
109
110
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """
121
122
123
124
125
126
127
128
129
shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """
32
33
34
35
36
37
38
shell:
    """
    checkm data setRoot ~/checkm_database/ 
    rm -rf {params.outdir}
    metawrap bin_refinement -o {params.outdir} -t {threads} -A {params.metabat} -B {params.maxbin} -C {params.concoct} -c 50 -x 10
    touch {output}
    """
49
50
51
52
53
54
shell:
    """
    rm -rf {params.folder}
    mkdir -p {params.folder}
    cp {params.refined_folder}/*.fa {params.folder}
    """
74
75
76
77
78
79
80
shell:
    """
    mkdir -p {params.dir1}
    mkdir -p {params.dir2}
    cp {input} {params.out1}
    cp {input} {params.out2}
    """
88
89
90
91
shell:
    """
    touch {output}
    """
105
106
107
108
109
110
shell:
    """
    checkm data setRoot ~/checkm_database/
    rm -rf {params.outdir}
    checkm lineage_wf -t {threads} -x {params.ext} --tab_table -f {output} {params.indir} {params.outdir}
    """
121
122
123
124
125
126
127
128
129
shell:
    """
    sed -i '1d' {input}
    cut -f1,12,13,14 {input} | tr '\t' ','>{params.checkm}
    sed 's/,/.fa,/' {params.checkm}>{params.checkm2}
    echo -e "genome,completeness,contamination,strain_heterogeneity" | cat - {params.checkm2} > {output}
    rm {params.checkm}
    rm {params.checkm2}
    """
141
142
143
144
shell:
    """
    Rscript scripts/plotting/plot_checkm_mags.R {input.sr} {input.coas}
    """
32
33
34
35
36
37
shell:
    """
    echo {threads}
    prefetch {params.run} && vdb-validate {params.run} && parallel-fastq-dump --threads {threads} \
    --outdir {params.outdir} --skip-technical --split-3 --sra-id {params.run} --gzip
    """
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
usage()
{
cat << EOF
usage: $0 options

CMseq workflow to infer strain heterogeneity of MAG.

'bsub' with '-M 5000'

OPTIONS:
   -t      Number of threads [REQUIRED]
   -i      Input first or only read (.fastq or .fastq.gz format) [REQUIRED]
   -n      Input reverse read (.fastq or fastq.gz [OPTIONAL]
   -r      MAG genome file (.fa or .fasta) [REQUIRED]
   -g      MAG prokka output (.gff)
   -o      Output files prefix (include path) [REQUIRED]
EOF
}

# variables
threads=
ref=
reads=
reads2=
outprefix=
gff=

while getopts "t:m:i:n:r:o:g:" OPTION
do
     case ${OPTION} in
         t)
             threads=${OPTARG}
             ;;
         i)
             reads=${OPTARG}
             ;;
         n)
             reads2=${OPTARG}
             ;;
         r)
             ref=${OPTARG}
             ;;
         o)
             outprefix=${OPTARG}
             ;;
         g)
             gff=${OPTARG}
             ;;
         ?)
             usage
             exit
             ;;
     esac
done

# check arguments
if [[ -z ${threads} ]] || [[ -z ${reads} ]] || [[ -z ${ref} ]] || [[ -z ${outprefix} ]] || [[ -z ${gff} ]]
then
     echo "ERROR : Please supply correct arguments"
     usage
     exit 1
fi


timestamp() {
  date +"%H:%M:%S"
}

readname=$(basename ${outprefix})
refix=$(basename ${ref%%.fa*})
if [ ${threads} -eq 1 ]
then
    threads_sam=1
else
    threads_sam=$((${threads}-1))
fi

# index reference file
if [[ ! -s ${outprefix}.1.bt2 ]]
then
    echo "$(timestamp) [ CMseq ] Indexing MAG FASTA file"
    bowtie2-build ${ref} ${outprefix}
else
    echo "$(timestamp) [ CMseq ] Index found, skipping bowtie2-build"
fi


# initial mapping and sorting
echo "$(timestamp) [ CMseq ] Mapping reads against MAG ..."
if [[ -z ${reads2} ]]
then
    bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -U ${reads} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam
    samtools index -@ ${threads_sam} ${outprefix}.bam
else
    bowtie2 --very-sensitive-local -x ${outprefix} -p ${threads} -1 ${reads} -2 ${reads2} | samtools view -@ ${threads_sam} -uS - | samtools sort -@ ${threads_sam} - -o ${outprefix}.bam
    samtools index -@ ${threads_sam} ${outprefix}.bam
fi

# calculate polymorphic sites
echo "$(timestamp) [ CMseq ] Checking polymorphic sites ..."
scripts/polymut.py -c ${ref} ${outprefix}.bam --gff_file ${gff} --mincov 10 --minqual 30 --dominant_frq_thrsh 0.8 > ${outprefix}.cmseq.csv
rm -f ${outprefix}.bam ${outprefix}.bai
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
library(ggplot2)
library(gridExtra)

args <- commandArgs(trailingOnly = TRUE)

checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE)
checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE)
dnadiff=read.table(args[3],header=TRUE,stringsAsFactor=FALSE)


colnames(dnadiff)=c("Ref","genome","Ref_length", "Refcovered","Query_length","Queryaligned", "ANI")
dnadiff$genome= gsub(".*/","",dnadiff$genome)
dnadiff$genome=gsub(".fa","",dnadiff$genome)

checkm_sr$`Assembly approach`="Single run"
checkm_coas$`Assembly approach`="Co-assembly"
checkm=rbind.data.frame(checkm_sr,checkm_coas)

checkm$Quality="Medium quality"
checkm$Quality[checkm$completeness>=90&checkm$contamination<=5]="High quality"
checkm$genome=gsub(".fa","",checkm$genome)

comb=merge(dnadiff,checkm,by="genome")
left<-ggplot(comb, aes(x = Queryaligned, y = Refcovered, color=`Assembly approach`, shape=Quality)) +
         geom_point(stroke = 1,size=1)+xlab("MAG Aligned (%)")+ylab("Reference Aligned (%)") + theme_classic() +scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+xlim(0,100)+ ylim(0,100)+scale_shape_manual(values = c(1,2))+guides(color=guide_legend(title="Approach"))
right<-
ggplot(comb, aes(x=ANI,color=`Assembly approach`)) + geom_density()+theme_classic()+xlab("ANI") +xlim(0,100)+ylab("Density")+scale_color_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))

p<-grid.arrange(left, right, nrow = 1)
ggsave("data/figures/dnadiff.png",p, width=10,height=5)
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
library(ggplot2)


args <- commandArgs(trailingOnly = TRUE)

checkm_sr=read.csv(args[1],header=TRUE,stringsAsFactor=FALSE)
checkm_coas=read.csv(args[2],header=TRUE,stringsAsFactor=FALSE)

checkm_sr$Status2="Single run"
checkm_coas$Status2="Co-assembly"

checkm=rbind.data.frame(checkm_sr,checkm_coas)
checkm$Status=factor(checkm$Status, levels=c("Single run", "Co-assembly"))

setwd("data/figures/")
ggplot(checkm, aes(x=Status, y=completeness, fill=Status)) +
  geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Completeness") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+theme(legend.position="none")
ggsave("checkm_completeness.png",plot = last_plot())

ggplot(checkm, aes(x=Status, y=contamination, fill=Status))+geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Contamination") + xlab("Assembly approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+theme(legend.position="none")+ylim(0,10)
ggsave("checkm_contam.png",width=5,height=5)
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)

singlerun=read.table(args[1],header=FALSE,stringsAsFactor=FALSE)
coassembly=read.table(args[2],header=FALSE,fill=TRUE,stringsAsFactor=FALSE)
colnames(singlerun)=c("path","strain")
colnames(coassembly)=c("path","strain")

coassembly$status="Co-assembly"
singlerun$status="Single run"
coassembly$strain=as.numeric(as.vector(coassembly$strain))
singlerun$strain=as.numeric(as.vector(singlerun$strain))
coassembly=coassembly[complete.cases(coassembly$strain),]
singlerun=singlerun[complete.cases(singlerun$strain),]

all=rbind(coassembly,singlerun)
all$status=factor(all$status, levels=c("Single run", "Co-assembly"))
ggplot(all, aes(x=status, y=strain, fill=status)) +geom_boxplot(outlier.shape=NA)+theme_classic()+ylab("Strain Heterogeneity") + xlab("Approach")+scale_fill_manual(breaks=c("Single run","Co-assembly"), values=c("#BBBBBB","#4477AA"))+guides(color=guide_legend(title="Approach"))+ylim(0,12)+theme(legend.position="none")
ggsave("data/figures/cmseq_plot.png",width=5,height=5)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
library(ggplot2)
library(gridExtra)

args <- commandArgs(trailingOnly = TRUE)

readcounts=read.table(args[1],header=TRUE,stringsAsFactor=FALSE)
df_flag=read.table(args[2],header=FALSE,stringsAsFactor=FALSE)
bwa.counts_sr=read.table(args[3],header=FALSE,stringsAsFactor=FALSE)
bwa.counts_coas=read.table(args[4],header=FALSE,stringsAsFactor=FALSE)
summary_out=args[5]

readcounts$Run=gsub("data/00_preprocessing/processed/singlerun/","",readcounts$Run)
readcounts=readcounts[!grepl("_2.fastq", readcounts$Run),]
readcounts$Run=gsub("_1.fastq","",readcounts$Run)
readcounts$Readcount=2*readcounts$Readcount

colnames(df_flag)=c("Run","Assembly")
df_comb=merge(df_flag,readcounts,by="Run")
df_comb$percassemb=df_comb$Assembly/df_comb$Readcount*100

colnames(bwa.counts_sr)=c("Run","Catalogue")
bwa.counts_sr$`Assembly Approach`="Single run"

colnames(bwa.counts_coas)=c("Run","Catalogue")
bwa.counts_coas$`Assembly Approach`="Coassembly"

bwa.counts=rbind.data.frame(bwa.counts_sr,bwa.counts_coas)
merged=merge(df_comb,bwa.counts, by="Run")
merged$percmags=merged$Catalogue/merged$Readcount*100
merged$percassemb=merged$Assembly/merged$Readcount*100

write.csv(merged,summary_out, quote=FALSE,row.names=FALSE)

ggplot(merged, aes(x=percmags, y=percassemb, color=`Assembly Approach`)) + geom_point() +xlab("Reads mapping to MAGs (%)") + ylab("Reads mapping to assembly (%)")+theme_classic()+scale_color_manual(breaks=c("Coassembly","Single run"), values=c("#a8ddb5","#c994c7"))+xlim(0,100)+ylim(0,100)
ggsave("data/figures/perassemb_perref.png",width=5,height=5)
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
library(RColorBrewer)
library(tidyr)
library(ggplot2)

args <- commandArgs(trailingOnly = TRUE)
gtdb.taxa=read.delim(args[1],header=TRUE,stringsAsFactor=FALSE)

ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
gtdb.taxa = separate(data = gtdb.taxa, col = classification, sep = ";", into = ranks)
gtdb.bac = gtdb.taxa[which(gtdb.taxa$Domain == "d__Bacteria"),]


# calculate bacteria proportions
if(exists("gtdb.fi.bac")){
  rm(gtdb.fi.bac)
}
for (r in ranks[2:(length(ranks)-1)]){
  total = nrow(gtdb.bac)   #total number of genomes
  rank.present = table(gtdb.bac[!grepl("__$", gtdb.bac[,r]),r])   #summary of genus
  total.rank = sum(rank.present) #sum of genus counts 
  gtdb.ranks = data.frame(sort(rank.present)[(length(rank.present)-6):length(rank.present)])
  gtdb.ranks$Prop = gtdb.ranks$Freq/total*100
  other.name = paste(tolower(substr(r, 1, 1)), "__Other", sep="")
  novel.name = paste(tolower(substr(r, 1, 1)), "__Novel", sep="")

  other.freq = total.rank-sum(gtdb.ranks$Freq)
  novel.freq=total-total.rank  #novel genomes

  other.prop = other.freq/total*100
  novel.prop = novel.freq/total*100

  other.ranks = data.frame(Var1=other.name, Freq=other.freq, Prop=other.prop)
  novel.ranks = data.frame(Var1=novel.name, Freq=novel.freq, Prop=novel.prop)

  ranks.fi = rbind(other.ranks, gtdb.ranks)
  ranks.fi = rbind(novel.ranks, ranks.fi)

  ranks.fi$Rank = r
  ranks.fi$Colour=c("#999999","#CAB2D6","#A6CEE3","#FF7F00","#FB9A99","#E31A1C","#B2DF8A","#33A02C","#1F78B4")

  if(exists("gtdb.fi.bac")){
    gtdb.fi.bac = rbind(gtdb.fi.bac, ranks.fi)
  } else {
    gtdb.fi.bac = ranks.fi
  }
}

gtdb.fi.bac$Rank=factor(gtdb.fi.bac$Rank,levels=c("Phylum","Class","Order","Family","Genus"))
# plot stacked plot taxa counts
p<-print(ggplot(gtdb.fi.bac, aes(x=Rank, y=Prop, fill=Var1)) 
      + geom_bar(stat="identity", colour="darkgrey", alpha=0.5, size=0.2, width=0.7)
      + theme_bw()
      + ylab("Proportion of species (%)")
      + coord_flip()
      + scale_fill_manual(values=as.vector(gtdb.fi.bac$Colour), name="Taxa")
      #+ guides(fill=FALSE)
      + scale_x_discrete(limits=ranks[(length(ranks)-1):2])
      + theme(legend.position="right")
      + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
      + theme(axis.title.x = element_text(size=14))
      + theme(axis.text.y = element_text(size=12))
      + theme(axis.title.y = element_blank())
      + theme(axis.text.x = element_text(size=12)))



ggsave("data/figures/gtdb_bacteria.png",p,width=15)
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
import argparse
import sys

def ren_fasta(args):
	n = 0
	for line in open(args.fasta_file, "rU"):
		if line[0] == ">":
			name = line.strip("\n").replace(">","")
			n += 1
			print ">%s_%i\t%s" % (args.prefix, n, name)
		else:
			print line.strip("\n")


if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Rename multifasta file')
        parser.add_argument('-f', dest='fasta_file', help='Input FASTA file')
	parser.add_argument('-p', dest='prefix', help='Header prefix')
        if len(sys.argv) == 1:
                parser.print_help()
                sys.exit()
        else:
                args = parser.parse_args()
                ren_fasta(args)	
ShowHide 59 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Finn-Lab/MAG_Snakemake_wf
Name: mag_snakemake_wf
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...