Snakemake-based workflow for the assembly of chloroplast genomes

public public 1yr ago 0 bookmarks

CLAW

About CLAW

CLAW (Chloroplast Long-read Assembly Workflow) is an mostly-automated Snakemake-based workflow for the assembly of chloroplast genomes. CLAW uses chloroplast long-reads, which are baited out of larger read libraries (e.g., an Oxford Nanopore Technologies MinION read library derived from photosynthetic tissue), for assembly with Flye and/or Unicycler. CLAW was designed with the novice bioinformatician in mind - it is easy to install and easy to use, requiring only minimal user input.

Download and Install

  1. Clone Git repository:

    git clone https://github.com/aaronphillips7493/CLAW

  2. Move into the directory:

     cd long-read-chloroplast-assembly
    
  3. Install conda (if not already present):

https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

  1. Make sure you have snakemake and Biopython installed. We use Mamba for increased speed:

    a) Create a conda environment:

     conda create --name snakemake
    

    b) Activate the conda environment:

     conda activate snakemake
    

    c) Install mamba in the environment:

     conda install mamba
    

    d) Install snakemake and biopython in the environment:

     mamba install snakemake biopython
    

    e) OPTIONAL: create all environments needed for CLAW:

     snakemake -j 1 --conda-create-envs-only --use-conda
    

Steps

  1. Test CLAW. We provide a test read file containing ONT reads from Oryza sativa ("chloro_assembly/reads/DRR196880_subset.fastq") and the reference Oryza sativa chloroplast genome ("chloro_assembly/reference/NC_008155.1_single.fasta").

     snakemake --profile profiles/slurm --use-conda --keep-going
    

    #note: profiles/slurm may not be appropriate for your PC

This test should complete with no errors, and should generate a rotated choloroplast fasta file ("chloro_assembly/{sample}~{assembler}_chloroplast.fasta") derived from Flye and/or Unicycler. If outside network access is problematic, run downloadReference.sh to download the reference genome. Alternatively, download your reference genome of interest manually and save it into the directory

"chloro_assembly/reference/{file_name}\_single.fasta"
  1. Run your samples through CLAW.

    a) modify "config.yml":

     i) ncbi_reference_accession = change this to the NCBI accession number for the reference chloroplast genome of interest. Default = NC_008155.1 (O. sativa). If outside network access is problematic, run downloadReference.sh to download the reference genome. Alternatively, download your reference genome of interest manually and save it into the directory "chloro_assembly/reference/{file_name}\_single.fasta". Make sure to change this parameter even if you do not use CLAW to download the reference genome.
     ii) my_Email = provide an email address. Neccessary for the automated download of genomes from NCBI - prevents system abuse.
     iii) fast_file = declare whether your read file is in FASTA ("fasta"/"fasta.gz") or FASTQ ("fastq"/"fastq.gz") format.
     iv) randseed = you can supply a seed (integer) here if you wish to re-run read extraction the same way with each redo, or leave this blank to let CLAW randomly generate a seed each time it is run. Change this parameter if assembly fails, and re-run random read subsampling.
     v) numberReads = declare the max. number of reads CLAW will subsample. Change value if assembly fails or to increase or decrease genome coverage. Default = 3000.
     vi) readMinLength = declare the smallest read to be used in the assembly. Default = 5000 bp.
     vii) flyeParameter = tell Flye what kind of reads you are using as input (raw, corrected or HQ ONT long reads, or raw, corrected, or HIFI PacBio long reads). Default = --nano-raw.
     viii) minimap2Parameter = tell minimap2 what kinds of reads you are using (ONT, PacBio, or HIFI). Default = map-ont.
     iX) chloroplastSize = the expected size of the chloroplast genome to be assembled. This is usually set as the size of the reference chloroplast genome. Default = 135000.
     X) cpus = declare the number of CPUs to use. Default = 4.
    

    b) make sure you have saved your reads in the directory:

     chloro_assembly/reads
    

    c) make sure you have saved your reference genome in the directory:

     chloro_assembly/reference
    

    d) run {The Workflow}:

     snakemake --profile profiles/slurm --use-conda --keep-going
    
  2. If {The Workflow} fails, try modifying "randSeed" and/or "numberReads" in config.yml. You will need to delete the files in the following directories to re-run {The Workflow}:

     chloro_assembly/assemblies chloro_assembly/subReads chloro_assembly/alignments chloro_assembly/dotPlots
    

    You may also need to delete the file:

    chloro_assembly/{sample}~{assembler}_chloroplast.fasta

Note

While the read library used for chloroplast genome assembly is enriched for chloroplast reads, there is a possibility that the library will also contain mitochondrial reads because of high sequence similarity between mitochondrial and chloroplast genomes. Thus, it is possible for CLAW to assemble mitochondrial (likely fragmented) contigs too. CLAW makes no attempt to resolve this, so users will have to manually check the assembled contigs for similarity to chloroplast and/or mitochondrial sequences.

While we report on the successful runs of CLAW here, there were several instances in which CLAW failed to complete or when chloroplast genome assembly failed. Typically, CLAW failed to complete due to insufficient memory allocation for certain steps of CLAW. These steps were commonly the read mapping and genome assembly steps. In this case, we were able to overcome this issue by modifying the memory allocation in the “cluster-configs/default.yaml” file. Future releases should aim to integrate dynamic memory allocations to further minimise user input for the successful completion of CLAW. When genome assembly failed, we found that re-running CLAW with a different seed (randomly generated as part of the CLAW, unless specified by the user in “config.yaml”) resulted in successful completion of genome assembly. We also found it necessary to, on occasion, adjust the number of chloroplast reads used for genome assembly as too many reads (and therefore too high genome coverage) could cause the assemblers to fail. Thus, by modifying the random seed and the number of reads used for assembly we were able to generate assemblies for all 19 test datasets.

Code Snippets

 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
shell:
    """
    if [ -s {input.genome} ]
    then
        mkdir -p {params.dir}
        for contig in `grep '^>' {input.genome} | sed -e 's/>//g'`
        do
            echo $contig > {params.dir}/tmp
            seqtk subseq {input.genome} {params.dir}/tmp > {params.dir}/$contig.fasta

            nucmer --maxmatch {input.index} {params.dir}/$contig.fasta -p {params.dir}/out
            show-coords -THrd {params.dir}/out.delta > {params.dir}/out.coords
            start=`sort -k6,6hr {params.dir}/out.coords | head -n 1| cut -f3`
            echo ">$contig" >> {output}
            echo "$start XXX"
            if [ $start == 1 ]
            then
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' >> {output}
                echo "" >> {output} 
            elif [ ! -z $start ]
            then
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' > {params.dir}/temp.fasta
                cut -c ${{start}}- {params.dir}/temp.fasta > {params.dir}/start.fasta
                cut -c -$[start-1] {params.dir}/temp.fasta > {params.dir}/end.fasta
                cat {params.dir}/start.fasta {params.dir}/end.fasta | tr -d '\n' >> {output}
                echo "" >> {output}
            else
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' >> {output}
                echo "" >> {output}
            fi
            rm -rf {params.dir}/*
        done
        rm -rf {params.dir}
    else
        touch {output}
    fi
    """
143
144
145
146
147
148
149
shell:
    """
    nucmer {input.reference} {input.query} -p {params}
    # delta-filter -1 -i 50 {params}.delta > {params}.1delta
    mummerplot --fat --png --large {params}.delta -p {params}
    rm {params}.filter {params}.fplot {params}.gp {params}.rplot
    """
166
167
168
169
shell:
    """
    bamCoverage -b {input} -o {output.bigwig}
    """
182
183
184
185
186
shell:
    """
    samtools sort -o {output.bam} {input}
    samtools index {output.bam}
    """
202
203
204
205
206
207
shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.assembly} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        > {output.bam}
    """
229
230
231
232
233
shell:
    """
    mkdir -p chloro_assembly/assemblies/
    unicycler -l {input} -o {params.assembly}
    """
250
251
252
253
254
shell:
    """
    mkdir -p chloro_assembly/assemblies/
    flye --threads {threads} --genome-size {config[chloroplast_size]} -o {params.flye_assembly} {config[flye_parameter]} {input}
    """
271
272
273
274
shell:
    """
    bamCoverage -b {input} -o {output.bigwig}
    """
287
288
289
290
291
shell:
    """
    samtools sort -o {output.bam} {input}
    samtools index {output.bam}
    """
307
308
309
310
311
312
shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.reference} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        > {output.bam}
    """
332
333
334
335
336
shell:
    """
    echo {params.random_seed}
    seqtk sample -s {params.random_seed} {input} {config[number_reads]} > {output}
    """
355
356
357
358
359
360
361
362
shell:
    """
    samtools view {input.bam} | cut -f1 | sort | uniq > {output.list}
    seqtk subseq {input.fastFile} {output.list} \
        | bioawk -c fastx \
            'length($seq) > {config[read_min_length]} && length($seq) < {config[chloroplast_size]} \
            {{print \">\"$name\"\\n\"$seq}}' > {output.fastFile}
    """
382
383
384
385
386
387
shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.reference} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        | samtools sort -o {output.bam}
    """
405
406
407
408
409
410
shell:
    """
    awk 'NR == 1 {{print substr($1,2,length($1)), \"0\", \"10000\"}}' {input} > chloro_assembly/reference/index.bed
    seqtk subseq {input} chloro_assembly/reference/index.bed > {output}
    rm chloro_assembly/reference/index.bed
    """
426
427
428
429
430
431
shell:
    """
    head -n 1 {input} > {output}
    tail -n +2 {input} | tr -d '\n' >> {output}
    tail -n +2 {input} | tr -d '\n' >> {output}
    """
SnakeMake From line 426 of main/Snakefile
453
454
455
456
shell:
    """
    mv {input} {output}
    """
SnakeMake From line 453 of main/Snakefile
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/aaronphillips7493/CLAW
Name: claw
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Accessed: 11
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...