Snakemake workflow for bacterial genome assembly + polishing for Oxford Nanopore (ONT) sequencing using multiple tools


A Snakemake wrapper for easily creating de novo bacterial genome assemblies from Oxford Nanopore (ONT) sequencing data, and optionally Illumina data, using any combination of read filtering, assembly, long- and short-read polishing, and reference-based polishing.

See the preprint here: Snakemake Workflows for Long-read Bacterial Genome Assembly and Evaluation, Preprints.org 2022

Included programs

read filtering: Filtlong, Rasusa
assembly: Flye, raven, miniasm, Unicycler, Canu
long read polishing: racon, medaka
short read polishing: pilon, Polypolish, POLCA
reference-based polishing: Homopolish, proovframe

Quick start

# Install
git clone https://github.com/pmenzel/ont-assembly-snake.git
conda config --add channels bioconda
conda env create -n ont-assembly-snake --file ont-assembly-snake/env/conda-main.yaml
conda activate ont-assembly-snake
# Prepare ONT reads, one file per sample
mkdir fastq-ont
cp /path/to/my/data/my_sample/ont_reads.fastq.gz fastq-ont/mysample.fastq.gz
# optionally: add Illumina paired-end reads
mkdir fastq-illumina
cp /path/to/my/data/my_sample/illumina_reads_R1.fastq.gz fastq-illumina/mysample_R1.fastq.gz
cp /path/to/my/data/my_sample/illumina_reads_R2.fastq.gz fastq-illumina/mysample_R2.fastq.gz
# Declare desired combination of read filtering, assembly and polishing
mkdir assemblies
mkdir assemblies/mysample_flye+medaka
mkdir assemblies/mysample+filtlongMB500_flye+racon2+medaka
mkdir assemblies/mysample_raven2+medaka+pilon
[...]
# Run workflow
snakemake -s ont-assembly-snake/Snakefile --use-conda --cores 20

Setup

Clone the repository, for example into the existing folder /opt/software/:

git clone https://github.com/pmenzel/ont-assembly-snake.git /opt/software/ont-assembly-snake

Install conda and create a new environment called ont-assembly-snake:

conda config --add channels bioconda
conda env create -n ont-assembly-snake --file /opt/software/ont-assembly-snake/env/conda-main.yaml

Activate the environment:

conda activate ont-assembly-snake

Usage

First, prepare a folder called fastq-ont/ containing the ONT sequencing reads as one .fastq or .fastq.gz file per sample, e.g. fastq-ont/sample1.fastq.gz.

Unicycler and some polishing tools additionally use paired-end Illumina reads, which need to be placed in the folder fastq-illumina/ using _R[12].fastq or _R[12].fastq.gz suffixes, e.g. fastq-illumina/sample1_R1.fastq.gz and fastq-illumina/sample1_R2.fastq.gz.

Next, create a folder assemblies/ and, inside it, create empty folders specifying the desired combinations of read filtering, assembly, and polishing steps by using specific keywords for each program, see below.

The first part of a folder name is the sample name, which must match the filenames in fastq-ont/ and, optionally, fastq-illumina/.

The next part can be a keyword for read filtering with Filtlong (see below), which is separated from the sample name by +.

Then follows, separated by an underscore, the keyword for the assembler.
NB: This also means that sample names must not contain underscores.

After the keyword for the assembler follow the keywords for one or more polishing steps, all separated by +.
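
For example, the folder names below decompose into sample name, optional read-filtering keyword, assembler keyword, and polishing keywords (the sample name abc is hypothetical):

mkdir assemblies/abc_flye                          # sample abc, Flye assembly only
mkdir assemblies/abc+filtlong_flye+racon2+medaka   # Filtlong filtering, Flye assembly, 2x racon, then medaka
mkdir assemblies/abc_raven+medaka+polypolish       # raven assembly, medaka (ONT) and Polypolish (Illumina) polishing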

After making the desired subfolders in assemblies/, run the workflow, e.g. with 20 threads:

snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20

The assembly created in each step is contained in the file output.fa within the respective folder and is also symlinked as a .fa file directly in the assemblies/ folder, see the example below.

Included Programs

Filtlong

The ONT reads can be filtered by length and quality using Filtlong prior to the assembly.

The available keywords are:

filtlong:
This will filter the ONT reads in fastq-ont/mysample.fastq and keep only reads longer than 1000 bases, using the Filtlong option --min_length. The filtered read set is written to fastq-ont/mysample+filtlong.fastq. The length threshold can be changed using the Snakemake configuration option filtlong_min_read_length.

filtlongPC<p>:
This will keep only the best p percent of bases, preferring reads with the highest average quality, using the Filtlong option --keep_percent. Further, reads are filtered by their length as above. The output is written to fastq-ont/mysample+filtlongPC<p>.fastq.

filtlongMB<m>:
This will keep only the reads with the highest average quality up to a total length of m megabases. Further, reads are filtered by their length as above. The output is written to fastq-ont/mysample+filtlongMB<m>.fastq.

filtlongMB<m>,<q>,<l>:
This will keep only reads up to a total length of m megabases, selected by both quality and length, where q and l set the priority of each criterion using the Filtlong options --mean_q_weight and --length_weight, respectively. See also the corresponding section in the Filtlong docs. Further, reads are filtered by their length as above. The output is written to fastq-ont/mysample+filtlongMB<m>,<q>,<l>.fastq.

filtlongMB<m>,<q>,<l>,<n>:
As above, but the minimum read length is explicitly specified by n instead of the global option filtlong_min_read_length. The output is written to fastq-ont/mysample+filtlongMB<m>,<q>,<l>,<n>.fastq.

When using any of the Filtlong keywords in a folder name, it must be followed by an underscore and then the keyword for the assembler; a rough command sketch is shown below.
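
As a rough sketch, assuming the default minimum read length of 1000 bases and a hypothetical sample mysample, the keywords translate approximately to the following Filtlong calls (the exact commands are listed in the Code Snippets section below):

# filtlong
filtlong --min_length 1000 fastq-ont/mysample.fastq > fastq-ont/mysample+filtlong.fastq
# filtlongPC80
filtlong --min_length 1000 --keep_percent 80 fastq-ont/mysample.fastq > fastq-ont/mysample+filtlongPC80.fastq
# filtlongMB500
filtlong --min_length 1000 -t 500000000 fastq-ont/mysample.fastq > fastq-ont/mysample+filtlongMB500.fastq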

Rasusa

The ONT reads can be randomly subsampled prior to the assembly.

The available keywords are:

rasusaMB<m>:
This will randomly subsample the ONT reads to a total of m megabases.
The output is written to fastq-ont/mysample+rasusaMB<m>.fastq.

When using the Rasusa keyword in a folder name, it must be followed by an underscore and then the keyword for the assembler.
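
For example, rasusaMB500 corresponds roughly to the following call for a hypothetical sample mysample (cf. the Code Snippets section):

rasusa --bases 500m -i fastq-ont/mysample.fastq -o fastq-ont/mysample+rasusaMB500.fastq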

Flye

The following keywords can be used to run the assembly with Flye:

flye
Default assembly, which includes one round of internal polishing of the assembly with the ONT reads.

flyeX
Assembly with X rounds of internal polishing. Setting X to 0 disables polishing altogether.

flyehq
Assembly for high-quality ONT reads using Flye option --nano-hq for ONT Guppy5+ in SUP mode, with one round of internal polishing.

flyehqX
High-quality assembly, with X rounds of internal polishing. Setting X to 0 disables polishing altogether.
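
Roughly, these keywords map to the following Flye invocations for a hypothetical sample mysample (output paths simplified; the exact commands are in the Code Snippets section):

flye --nano-raw fastq-ont/mysample.fastq -o assemblies/mysample_flye/ -t 20          # flye
flye --nano-raw fastq-ont/mysample.fastq -o assemblies/mysample_flye2/ -t 20 -i 2    # flye2
flye --nano-hq fastq-ont/mysample.fastq -o assemblies/mysample_flyehq0/ -t 20 -i 0   # flyehq0, no internal polishing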

Raven

The following keywords can be used to run the assembly with raven:

raven
Default assembly, which includes two rounds of internal polishing with racon using the ONT reads.

ravenX
Assembly with X rounds of internal polishing with racon. Setting X to 0 disables polishing altogether.
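
For example, raven and raven4 correspond approximately to the following calls for a hypothetical sample mysample, where -p sets the number of racon polishing rounds (cf. the Code Snippets section):

raven --disable-checkpoints -t 20 fastq-ont/mysample.fastq > assemblies/mysample_raven/output.fa         # raven (default: 2 rounds)
raven --disable-checkpoints -p 4 -t 20 fastq-ont/mysample.fastq > assemblies/mysample_raven4/output.fa   # raven4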

Miniasm

The following keywords can be used to run the assembly with miniasm:

miniasm
Default assembly. Miniasm does not do any polishing by itself.

Unicycler

unicycler
Unicycler does a hybrid assembly, i.e., both ONT and Illumina reads must be present in fastq-ont/ and fastq-illumina/, respectively.
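
For example, a hybrid assembly of a hypothetical sample mysample is requested with:

mkdir assemblies/mysample_unicycler   # requires fastq-ont/mysample.fastq.gz and fastq-illumina/mysample_R[12].fastq.gz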

Canu

canu
The Canu assembler requires the approximate genome size (in megabases) to be known beforehand; set it with the Snakemake option --config genome_size=5.2 (e.g. for 5.2 Mb).
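
For example, for a hypothetical sample with an expected genome size of 5.2 Mb:

mkdir assemblies/mysample_canu
snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20 --config genome_size=5.2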

racon

The following keywords can be used to polish an assembly with racon using the ONT reads:

racon
Polishes the assembly once.

raconX
Runs racon polishing iteratively X times.
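
For example, to polish a Flye assembly of a hypothetical sample with three rounds of racon (the underlying minimap2/racon iterations are run by the workflow, see the Code Snippets section):

mkdir assemblies/mysample_flye+racon3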

medaka

medaka
Medaka polishes the assembly using the ONT reads, but also requires the name of the Medaka model to be used, which depends on the flow cell and basecalling that were used for creating the reads.

The model name can either be set globally for all samples using the Snakemake configuration option medaka_model, or by supplying a tab-separated file with two columns that maps sample names to medaka models via the Snakemake configuration option map_medaka_model.

Options are specified using snakemake's --config parameter, for example:

snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20 --config map_medaka_model=map_medaka.tsv

where map_medaka.tsv contains, for example, the two columns:

sample1 r941_min_high_g330
sample2 r941_min_high_g351
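
Such a tab-separated file can be created, for example, with:

printf 'sample1\tr941_min_high_g330\nsample2\tr941_min_high_g351\n' > map_medaka.tsv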

pilon

pilon
Pilon polishes an assembly using Illumina reads, which must be located in the fastq-illumina folder.

Polypolish

polypolish
Polypolish polishes an assembly using Illumina reads, which must be located in the fastq-illumina folder.

POLCA

polca
POLCA polishes an assembly using Illumina reads, which must be located in the fastq-illumina folder.
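
All three short-read polishers are requested in the same way; for a hypothetical sample with Illumina reads in fastq-illumina/mysample_R[12].fastq.gz, for example:

mkdir assemblies/mysample_flye+medaka+pilon
mkdir assemblies/mysample_flye+medaka+polypolish
mkdir assemblies/mysample_flye+medaka+polca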

Homopolish

homopolish
Homopolish does reference-based polishing based on one or more provided reference genomes in fasta format located in references/NAME1.fa, references/NAME2.fa, etc., where NAME1 and NAME2 can be any string. Snakemake will create the output files ...+homopolish/output_NAME1.fa, ...+homopolish/output_NAME2.fa, etc., containing the polished assemblies.

When using homopolish, it must be the last keyword in the folder name.
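
For example, with two reference genomes (the reference names are arbitrary, the source paths hypothetical):

mkdir references
cp /path/to/Ecoli_K12_genome.fasta references/EcoliK12.fa
cp /path/to/Ecoli_O157_genome.fasta references/EcoliO157.fa
mkdir assemblies/mysample_flye+medaka+homopolish
# produces output_EcoliK12.fa and output_EcoliO157.fa in that folder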

proovframe

proovframe
Proovframe does reference-based polishing based on one or more provided reference proteomes in fasta format (amino acid sequences) located in references-protein/NAME1.faa, references-protein/NAME2.faa, etc., where NAME1 and NAME2 can be any string. Snakemake will create the output files ...+proovframe/output_NAME1.fa, ...+proovframe/output_NAME2.fa, etc., containing the polished assemblies.

When using proovframe, it must be the last keyword in the folder name.
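
Analogously, for proovframe the reference proteomes (amino acid fasta files) go into references-protein/ (source path hypothetical):

mkdir references-protein
cp /path/to/Ecoli_proteins.faa references-protein/Ecoli.faa
mkdir assemblies/mysample_flye+medaka+proovframe
# produces output_Ecoli.fa in that folder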

Example

This example contains two samples with ONT sequencing reads and Illumina reads for sample 2 only.

For sample 1, the assembly should be done with flye (including the default single round of polishing), followed by polishing the assembly with racon (twice), medaka, and finally homopolish, which will use the E. coli genome in the file references/Ecoli.fa.
In a second assembly, we also want to filter the ONT reads of sample 1 to only include the highest-quality reads up to a total of 500 Mb using Filtlong, and then apply the same assembly and polishing protocol.

Sample 2 should be assembled by raven including two internal polishing rounds, followed by medaka and pilon polishing using the Illumina reads, and finally reference-based polishing with E. coli proteins using proovframe, which requires the file references-protein/Ecoli.faa .

We therefore create the folders and files as follows:

.
├── assemblies
│   ├── sample1+filtlongMB500_flye+racon2+medaka+homopolish
│   ├── sample1_flye+racon2+medaka+homopolish
│   └── sample2_raven2+medaka+pilon+proovframe
├── fastq-illumina
│   ├── sample2_R1.fastq
│   └── sample2_R2.fastq
├── fastq-ont
│   ├── sample1.fastq
│   └── sample2.fastq
├── references
│   └── Ecoli.fa
└── references-protein
    └── Ecoli.faa

We also want to set the minimum read length threshold for Filtlong to 500nt and use the medaka model r941_min_high_g351 for both samples.

Therefore, we run the workflow with:

snakemake -s /opt/software/ont-assembly-snake/Snakefile --use-conda --cores 20 --config medaka_model=r941_min_high_g351 filtlong_min_read_length=500

Snakemake will recursively handle the dependencies for each assembly and create folders for all intermediate steps automatically. Additionally, a symlink is created for each output assembly in the assemblies/ folder, so they can easily be used as input for score-assemblies.

For the above example, the folders will look like this after running the workflow:

.
├── assemblies
│   ├── sample1+filtlongMB500_flye
│   ├── sample1+filtlongMB500_flye.fa -> sample1+filtlongMB500_flye/output.fa
│   ├── sample1+filtlongMB500_flye+racon2
│   ├── sample1+filtlongMB500_flye+racon2.fa -> sample1+filtlongMB500_flye+racon2/output.fa
│   ├── sample1+filtlongMB500_flye+racon2+medaka
│   ├── sample1+filtlongMB500_flye+racon2+medaka.fa -> sample1+filtlongMB500_flye+racon2+medaka/output.fa
│   ├── sample1+filtlongMB500_flye+racon2+medaka+homopolish
│   ├── sample1+filtlongMB500_flye+racon2+medaka+homopolishEcoli.fa -> sample1+filtlongMB500_flye+racon2+medaka+homopolish/output_Ecoli.fa
│   ├── sample1_flye
│   ├── sample1_flye.fa -> sample1_flye/output.fa
│   ├── sample1_flye+racon2
│   ├── sample1_flye+racon2.fa -> sample1_flye+racon2/output.fa
│   ├── sample1_flye+racon2+medaka
│   ├── sample1_flye+racon2+medaka.fa -> sample1_flye+racon2+medaka/output.fa
│   ├── sample1_flye+racon2+medaka+homopolish
│   ├── sample1_flye+racon2+medaka+homopolishEcoli.fa -> sample1_flye+racon2+medaka+homopolish/output_Ecoli.fa
│   ├── sample2_raven2
│   ├── sample2_raven2.fa -> sample2_raven2/output.fa
│   ├── sample2_raven2+medaka
│   ├── sample2_raven2+medaka.fa -> sample2_raven2+medaka/output.fa
│   ├── sample2_raven2+medaka+pilon
│   ├── sample2_raven2+medaka+pilon.fa -> sample2_raven2+medaka+pilon/output.fa
│   ├── sample2_raven2+medaka+pilon+proovframe
│   └── sample2_raven2+medaka+pilon+proovframeEcoli.fa -> sample2_raven2+medaka+pilon+proovframe/output_Ecoli.fa
├── fastq-illumina
│   ├── sample2_R1.fastq
│   └── sample2_R2.fastq
├── fastq-ont
│   ├── sample1.fastq
│   ├── sample1+filtlongMB500.fastq
│   └── sample2.fastq
├── references
│   └── Ecoli.fa
└── references-protein
    └── Ecoli.faa

(Not shown is the content of each subfolder in assemblies/ and some auxiliary files.)

Code Snippets

Snakefile lines 160-163 (Filtlong, minimum read length filter):
shell:
    """
    filtlong --min_length {filtlong_min_read_length} {input} > {output} 2>{log}
    """
Snakefile lines 176-179 (Rasusa subsampling):
shell:
    """
     rasusa --bases {wildcards.num}m -i {input} -o {output} 2>{log}
    """
Snakefile lines 190-193 (Filtlong, target megabases):
shell:
    """
    filtlong --min_length {filtlong_min_read_length} -t {wildcards.num}000000 {input} > {output} 2>{log}
    """
Snakefile lines 205-208 (Filtlong, keep percent):
shell:
    """
    filtlong --min_length {filtlong_min_read_length} --keep_percent {wildcards.num} {input} > {output} 2>{log}
    """
Snakefile lines 223-226 (Filtlong, target megabases with quality/length weights):
shell:
    """
    filtlong --min_length {filtlong_min_read_length} --mean_q_weight {wildcards.qweight} --length_weight {wildcards.lweight}  -t {wildcards.mb}000000 {input} > {output} 2>{log}
    """
Snakefile lines 242-245 (Filtlong, target megabases with weights and explicit minimum read length):
shell:
    """
    filtlong --min_length {wildcards.readlen} --mean_q_weight {wildcards.qweight} --length_weight {wildcards.lweight}  -t {wildcards.mb}000000 {input} > {output} 2>{log}
    """
Snakefile lines 259-265 (miniasm assembly from minimap2 all-vs-all overlaps):
shell:
    """
    minimap2 -x ava-ont -t {threads} {input} {input} > assemblies/{wildcards.sample}_miniasm/minimap2_overlap.paf 2>{log}
    miniasm -f {input} assemblies/{wildcards.sample}_miniasm/minimap2_overlap.paf > assemblies/{wildcards.sample}_miniasm/minimap2_miniasm.gfa 2>>{log}
    perl -lsane 'print ">$F[1]\n$F[2]" if $F[0] =~ /S/;' assemblies/{wildcards.sample}_miniasm/minimap2_miniasm.gfa > {output.fa} 2>>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 281-288 (Unicycler hybrid assembly):
shell:
    """
    # del spades folder if already exists (e.g. when workflow was canceled), so that it does not warn about it upon restart
    [ -d "assemblies/{wildcards.sample}_unicycler/spades_assembly" ] && rm -r "assemblies/{wildcards.sample}_unicycler/spades_assembly" >{log}
    unicycler -1 {input.fq1} -2 {input.fq2} -l {input.fqont} -t {threads} --keep 0 -o assemblies/{wildcards.sample}_unicycler/ >>{log} 2>&1
    cp assemblies/{wildcards.sample}_unicycler/assembly.fasta {output.fa} 2>>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 303-308 (Flye assembly):
shell:
    """
    flye --nano-raw {input.fq} -o assemblies/{wildcards.sample}_flye/ -t {threads} 2>{log}
    mv assemblies/{wildcards.sample}_flye/assembly.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 322-327 (Flye assembly with X polishing rounds):
shell:
    """
    flye --nano-raw {input.fq} -o assemblies/{wildcards.sample}_flye{wildcards.num}/ -t {threads} -i {wildcards.num} 2>{log}
    mv assemblies/{wildcards.sample}_flye{wildcards.num}/assembly.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 342-347 (Flye --nano-hq assembly):
shell:
    """
    flye --nano-hq {input.fq} -o assemblies/{wildcards.sample}_flyehq/ -t {threads} 2>{log}
    mv assemblies/{wildcards.sample}_flyehq/assembly.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 361-366 (Flye --nano-hq assembly with X polishing rounds):
shell:
    """
    flye --nano-hq {input.fq} -o assemblies/{wildcards.sample}_flyehq{wildcards.num}/ -t {threads} -i {wildcards.num} 2>{log}
    mv assemblies/{wildcards.sample}_flyehq{wildcards.num}/assembly.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 381-385 (raven assembly):
shell:
    """
    raven --disable-checkpoints -t {threads} {input.fq} >{output.fa} 2>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 400-404 (raven assembly with X polishing rounds):
shell:
    """
    raven --disable-checkpoints -p {wildcards.num} -t {threads} {input.fq} >{output.fa} 2>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 418-423 (Canu assembly):
shell:
    """
    canu -nanopore -d assemblies/{wildcards.sample}_canu/ -p output useGrid=false maxThreads={threads} genomeSize={target_genome_size}m {input.fqont} >>{log} 2>&1
    cp assemblies/{wildcards.sample}_canu/output.contigs.fasta {output.fa} 2>>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 440-445 (racon polishing, single round):
shell:
    """
    minimap2 -ax map-ont -t {threads} {input.prev_fa} {input.fq} > {output.sam} 2>{log}
    racon --threads {threads} --include-unpolished {input.fq} {output.sam} {input.prev_fa} > {output.fa} 2>>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 461-478 (racon polishing, X rounds):
shell:
    """
    DIR_temp=$(mktemp -d --suffix=.raconX)
    trap "rm -r $DIR_temp" EXIT

    cp {input.prev_fa} $DIR_temp/prev.fa

    for i in `seq 1 {wildcards.num}`
    do
        echo "Polishing round $i / {wildcards.num}" >>{log}
        minimap2 -ax map-ont -t {threads} $DIR_temp/prev.fa {input.fq} > $DIR_temp/map.sam 2>>{log}
        racon --threads {threads} --include-unpolished {input.fq} $DIR_temp/map.sam $DIR_temp/prev.fa > $DIR_temp/polished.fa 2>>{log}
        mv $DIR_temp/polished.fa $DIR_temp/prev.fa
    done

    mv $DIR_temp/prev.fa {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 509-514 (medaka polishing):
shell:
    """
    medaka_consensus -f -i {input.fq} -d {input.prev_fa} -o assemblies/{wildcards.sample}_{wildcards.assembly}+medaka -t {threads} {params.model} >{log} 2>&1
    mv assemblies/{wildcards.sample}_{wildcards.assembly}+medaka/consensus.fasta assemblies/{wildcards.sample}_{wildcards.assembly}+medaka/output.fa
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 531-539 (pilon polishing):
shell:
    """
    bwa index -p assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/bwa_index {input.prev_fa} >{log} 2>&1
    bwa mem -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/bwa_index {input.fq1} {input.fq2} 2>>{log} | samtools sort -o {output.bam} - 2>>{log}
    samtools index {output.bam} 2>>{log}
    pilon -Xmx60G --genome {input.prev_fa} --frags {output.bam} --outdir assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/ --output pilon --changes --vcf >>{log} 2>&1
    mv assemblies/{wildcards.sample}_{wildcards.assembly}+pilon/pilon.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 557-562 (POLCA polishing):
shell:
    """
    polca.sh -t {threads} -a {input.prev_fa} -r '{input.fq1} {input.fq2}' >{log} 2>&1
    mv output.fa.PolcaCorrected.fa {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 578-586 (Polypolish polishing):
shell:
    """
    bwa index -p assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.prev_fa} >{log} 2>&1
    bwa mem -a -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.fq1} > assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R1.sam 2>>{log}
    bwa mem -a -t {threads} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/bwa_index {input.fq2} > assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R2.sam 2>>{log}
    polypolish_insert_filter.py --in1 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R1.sam --in2 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/alignments_R2.sam --out1 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R1.sam --out2 assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R2.sam >>{log} 2>&1
    polypolish {input.prev_fa} assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R1.sam assemblies/{wildcards.sample}_{wildcards.assembly}+polypolish/filtered_R2.sam > {output.fa} 2>>{log}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 601-608 (Homopolish polishing):
shell:
    """
    DIR_temp=$(mktemp -d --suffix=.raconX)
    trap "rm -r $DIR_temp" EXIT
    homopolish polish -a {input.prev_fa} -m R9.4.pkl -o $DIR_temp -l {input.ref} >{log} 2>&1
    cp $DIR_temp/*_homopolished.fasta {output.fa}
    ln -sr {output.fa} {output.link}
    """
Snakefile lines 621-624 (diamond database from the reference proteins, used by proovframe):
shell:
    """
    diamond makedb -p {threads} --in {input} --db {output} >{log} 2>&1
    """
Snakefile lines 640-645 (proovframe polishing):
shell:
    """
    proovframe map -d {input.ref} -o {output.tsv} {input.prev_fa} >{log} 2>&1
    proovframe fix -o {output.fa} {input.prev_fa} {output.tsv} >>{log} 2>&1
    ln -sr {output.fa} {output.link}
    """

URL: https://github.com/pmenzel/ont-assembly-snake
Name: ont-assembly-snake
Version: v1.2
License: MIT License