Bacterial Genome Assembly with Snakemake Workflow


This Snakemake workflow assembles and annotates bacterial genomes from paired-end Illumina MiSeq reads. Reads are quality- and adapter-trimmed with Trim Galore, then assembled independently with SPAdes and SKESA. The two assemblies are compared with QUAST, the better one is selected by a simple metric-based score, and the winner is annotated with Prokka and assessed for completeness with BUSCO. Code lives in the respective folders, i.e. scripts, rules, and envs; the entry point of the workflow is defined in the Snakefile and the main configuration in the config.yaml file.

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI.

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
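The exact columns of samples.tsv depend on this workflow's rules; below is a minimal sketch of the common Snakemake pattern for loading such a sample sheet with pandas. The column names (sample, fq1, fq2) and file paths are illustrative assumptions, not taken from this repository.

```python
import io
import pandas as pd

# Hypothetical sample sheet; column names and paths are invented for illustration.
samples_tsv = io.StringIO(
    "sample\tfq1\tfq2\n"
    "isolate_A\treads/A_R1.fastq.gz\treads/A_R2.fastq.gz\n"
    "isolate_B\treads/B_R1.fastq.gz\treads/B_R2.fastq.gz\n"
)

# Typical pattern inside a Snakefile: index by sample name so rules can look
# up FASTQ paths from wildcards, e.g. samples.loc[wildcards.sample, "fq1"].
samples = pd.read_csv(samples_tsv, sep="\t").set_index("sample", drop=False)

print(list(samples.index))               # sample names driving the workflow
print(samples.loc["isolate_A", "fq1"])   # path resolved for one sample
```

In a real run the StringIO stand-in would be replaced by the path configured in config.yaml.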

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, for example, be forwarded to your collaborators.

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Code Snippets

Annotation with Prokka:

shell:
    "prokka {params.prokka} --cpus {threads} --outdir {output} --prefix {wildcards.sample} {input} 2> {log}"
Assembly with SPAdes, followed by post-processing of the scaffolds with seqkit:

shell:
    """
    spades.py {params.spades} --threads {threads} -1 {input.fq1} -2 {input.fq2} -o {output[0]} 2> {log}
    seqkit seq {params.seqkit} {output[0]}/scaffolds.fasta > {output[1]}
    """
Assembly with SKESA, followed by the same seqkit post-processing:

shell:
    """
    skesa {params.skesa} --cores {threads} --reads {input.fq1},{input.fq2} --contigs_out {output[0]} 2> {log}
    seqkit seq {params.seqkit} {output[0]} > {output[1]}
    """
Comparison of the two assemblies with QUAST:

shell:
    "quast.py {params.quast} --threads {threads} -l spades,skesa -o {output} {input[0]} {input[1]} 2> {log}"
Selection of the better assembly from the QUAST report:

run:
    import pandas as pd
    from shutil import copy

    # QUAST's report.tsv has one row per metric and one column per assembly
    # (labelled spades and skesa via the -l option above).
    quast = pd.read_csv(f"{input.quast}/report.tsv", sep="\t", header=0).set_index("Assembly", drop=False)
    quast.drop('Assembly', axis='columns', inplace=True)

    score = { i : 0 for i in quast.columns.to_list() }
    # Cast to float: the mixed-type TSV columns are read as strings, and
    # string comparison would rank "9" above "10".
    number_contigs = quast.loc['# contigs'].astype(float).to_dict()
    largest_contig = quast.loc['Largest contig'].astype(float).to_dict()
    total_length = quast.loc['Total length'].astype(float).to_dict()
    n50 = quast.loc['N50'].astype(float).to_dict()
    n75 = quast.loc['N75'].astype(float).to_dict()
    predict_genes = quast.loc['# predicted genes (unique)'].astype(float).to_dict()

    # One point per winning metric: fewest contigs wins, every other metric
    # is highest-wins, and unique predicted genes carries triple weight.
    score[min(number_contigs, key=number_contigs.get)] += 1
    score[max(largest_contig, key=largest_contig.get)] += 1
    score[max(total_length, key=total_length.get)] += 1
    score[max(n50, key=n50.get)] += 1
    score[max(n75, key=n75.get)] += 1
    score[max(predict_genes, key=predict_genes.get)] += 3

    assembly = max(score, key=score.get)

    print(score)

    if assembly == 'spades':
        copy(f'{input.spades}', f'{output[0]}')
    elif assembly == 'skesa':
        copy(f'{input.skesa}', f'{output[0]}')
Completeness assessment with BUSCO:

shell:
    "busco {params.busco} --cpu {threads} -i {input} --out_path $(dirname {output}) -o $(basename {output}) 2> {log}"
Read trimming with Trim Galore:

shell:
    "trim_galore {params.trim} --basename {wildcards.sample} --cores {threads} --output_dir {output.out_dir} {input} 2> {log}"

URL: https://github.com/osvaldoreisss/miseq_bac_assembly_annot_workflow
Name: miseq_bac_assembly_annot_workflow
Version: 1
License: MIT License