NanoPlasm

public public 1yr ago 0 bookmarks

v1.0

NanoPlasm is a specialized workflow designed for the purpose of assembling, typing, and analyzing plasmid sequences using long reads such as Nanopore or PacBio, as well as a combination of long and short reads. To ensure reliability, reproducibility, and optimization in the analysis process, NanoPlasm harnesses the power of the Snakemake workflow framework.

The tool is still under development.

Workflow overview

Nanoplasm_worflow

Install

Currently, the only method of installing and utilizing NanoPlasm consists in cloning this repository and subsequently installing the various dependencies and databases (as outlined below).

# Clone NanoPlasm repository
git clone https://github.com/Flomauf/NanoPlasm.git NanoPlasm

Dependencies

While NanoPlasm relies on Docker images for the majority of its tools, certain dependencies must be installed beforehand to ensure seamless execution. It is strongly recommended to set up a dedicated environment specifically for this purpose. The subsequent instructions illustrate how to achieve this using Conda:

# Create the environment
conda create -n nanoplasm python=3.10.0
# Install dependencies
conda install -c bioconda -c conda-forge snakemake bwa biopython mge-cluster
# Activate the environment
conda activate nanoplasm

Database

In order for Homopolish to function properly, a dedicated database is necessary. This database should be installed within the NanoPlasm directory. To streamline this process, a script has been developed to simplify the installation.

# Go into NanoPlasm directory
cd NanoPlasm
# Install database
sh install_db.sh

NanoPlasm is now ready to use.

Usage

To streamline the usage of NanoPlasm, a wrapper script has been implemented to facilitate the invocation of the Snakemake pipeline. However, if needed, arguments can still be passed to Snakemake using the --options argument, allowing for additional customization and flexibility.

To maintain consistency and organization, it is essential to adhere to the following file naming conventions when working with NanoPlasm:

For long reads:

  • Each long read file should be named as "ID.fastq", where "ID" represents a unique identifier associated with the sample.

For short reads:

  • Paired-end short read files should be stored in a separate folder.

  • The forward read file should be named as "ID_R1.fastq", where "ID" corresponds to the sample identifier.

  • The reverse read file should be named as "ID_R2.fastq", following the same sample identifier pattern.

Adhering to these file naming conventions ensures clarity and facilitates the organization of data within the NanoPlasm workflow.

Creation of the analysis folder

The initial step entails creating a designated folder where the analysis will take place. Within this folder, an automatically generated configuration file will be created, encompassing all the parameters required for the analysis. It is imperative to edit this file in accordance with your specific analysis requirements and preferences.

usage: nanoplasm init [-h] -i dir [-I dir] -o dir
Generate the folder and configuration file for NanoPlasm analysis.
options:
-h, --help show this help message and exit
-i dir, --input dir directory with Nanopore fastq files. Files must be in the following format: ID.fastq
-I dir, --INPUT dir directory with Illumina fastq files. Files must be in the following format: ID_R#.fastq
-o dir, --output dir directory to create for the run

Starting NanoPlasm run

After successfully creating the analysis folder, you can initiate a NanoPlasm run. The workflow offers two distinct assembly modes: "long" and "hybrid."

  • "long" mode: This mode is specifically designed for long-read sequencing data, such as Nanopore or PacBio reads. It focuses on leveraging the unique characteristics of long reads to perform the plasmid assembly, typing, and analysis.

  • "hybrid" mode: In this mode, a combination of long and short reads is utilized. The workflow harnesses the strengths of both long and short reads to achieve more comprehensive and accurate results in plasmid assembly, typing, and analysis.

You can select the appropriate assembly mode based on the sequencing data available for your analysis.

usage: nanoplasm typing [-h] [-m dir] -d dir [-t int] [--quality_check] [--options str]
Start Nanoplasm with long|hybrid mode.
options:
-h, --help show this help message and exit
-m dir, --mode dir mode for assembling genomes [long]
-d dir, --dir dir directory for the run (created with nanoplasm init)
-t int, --thread int number of threads [1]
--quality_check enable reads quality check [disabled]
--options str Snakemake options. Use as --options= and snakemake arguments between brackets

Example

# Generate the analysis folder and configuration file
path/to/nanoplasm init -i fastq_nanopore -o analysis_folder
# Edit the configuration file
nano analysis_folder/config.yaml
# Start NanoPlasm run
path/to/nanoplasm typing -d analysis_folder -t 24 --quality_check

Future features

  • Module for discarding reads when in excess. Samples with reads in excess (very high coverage) can lead to errors during Flye assembly.

  • Merge functions for combining different analysis folders.

Tips

  • The pipeline might crash because a sample assembly failed.
    This can be due to a very coverage for this sample leading Flye to fail disjointing assembly. To avoid this error, you can use the --asm-coverage and --genome-size options in Flye to limit the coverage in your assemblies. Add these options in the config.yaml file of the run.

  • I need to analyse groups of sample with different parameters but I want to compare them all together ultimately.
    Run the analyses in different folders with appropriate parameters for each analysis and, once all analyses are completed, merge all folders using the merge function.

Code Snippets

 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
from Bio import SeqIO
import os

min_size = snakemake.config["classify_contigs"]["min_size"]
max_size = snakemake.config["classify_contigs"]["max_size"]
assemblies = snakemake.input
output = snakemake.output["class_file"]
plasmids_files_path = snakemake.output["plasmids_list"]

# If only one sample, must converted into list
if len(assemblies) < 2:
    assemblies = [assemblies]

# Create folders
if not os.path.exists("Sequences/plasmids"):
    os.mkdir("Sequences/plasmids")
if not os.path.exists("Sequences/chromosomes"):
    os.mkdir("Sequences/chromosomes")

with open(output, "w") as report, open(plasmids_files_path, "w") as plasmids_list:
    # Write report first line
    report.write("Sample\tContig ID\tType\tLength\n")

    for file in assemblies:
        sample = file.split("/")[1] # sample ID (second term in path)
        assembly = SeqIO.parse(file, "fasta")
        for seq in assembly:
            # If plasmid
            if min_size <= len(seq.seq) <= max_size:
                with open(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tPlasmid\t{len(seq.seq)}\n")
                plasmids_list.write(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta\n")

            # If chromosome
            elif max_size < len(seq.seq):
                with open(f"Sequences/chromosomes/{sample}_chromosome_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tChromosome\t{len(seq.seq)}\n")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
import pandas as pd

""" Gather results from Resfinder and mobsuite """

def add_resfinder(master_df, res_files):
    " Add resfinder results to the classification file "

    # Creation of dataframe for resfinder results
    res_df = pd.DataFrame({"Sample":[], "Contig ID":[], "ResFinder Resistance genes":[], "ResFinder phenotype":[]})

    # Parsing of all Resfinder files
    for f in res_files:
        sample = f.split("/")[1] # extracted from the path
        results = {} # stores the results per contig
        text = open(f, "r").readlines()[1:] # open and skip header
        for line in text:
            gene = line.split("\t")[0]
            contig = line.split("\t")[5]
            phenotype = line.split("\t")[7].split(", ") # list of phenotypes
            if contig not in results:
                results[contig] = [gene, phenotype] # list of genes in pos 0 and list of phenotype in pos 1
            else:
                results[contig][0] += f", {gene}" # add the new gene to the string in position 0 (list of genes)
                results[contig][1].extend(x for x in phenotype if x not in results[contig][1]) # add phenotype if not already present

        # Looping over results dict to complete dataframe. The label used is the length of the dataframe, creating a simple index.
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], ", ".join(results[elt][1])]

    # Merging the classification dataframe with the resfinder dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])

    return new_df

def add_mobsuite(master_df, mob_files):
    " Add mobsuite results to the master file "

    # Creation of dataframe for mobsuite results
    res_df = pd.DataFrame({"Sample":[], "Contig ID":[], "Mob-suite replicon types":[], "Mob-suite relaxase types":[]})

    # Parsing of all mobsuite files
    for f in mob_files:
        sample = f.split("/")[1].split("_")[0] # extracted from the path
        results = {} # stores the results per contig
        text = open(f, "r").readlines()[1:] # open and skip header
        for line in text:
            contig = line.split("\t")[0]
            replicon = line.split("\t")[5]
            relaxase = line.split("\t")[7]
            results[contig] = [replicon, relaxase]

        # Looping over results dict to complete dataframe. The label used is the length of the dataframe, creating a simple index.
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], results[elt][1]]

    # Merging the classification dataframe with the mobsuite dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])

    return new_df


if __name__ == "__main__":

    # Load dataframe and list of files from the Snakefile
    class_file = pd.read_csv(snakemake.input["classification"], sep='\t', header=0)
    class_file = class_file.astype({"Sample": str}) # Convert first column to string in case samples ID are just numbers
    res_files = snakemake.input["resfinder"]
    mob_files = snakemake.input["mobsuite"]

    # If only one sample, must converted into list
    if len(res_files) > 1:
        res_list = res_files
    else:
        res_list = [res_files[0]]

    if len(mob_files) > 1:
        mob_list = mob_files
    else:
        mob_list = [mob_files[0]]

    # Add information for each analysis
    infos_resfinder = add_resfinder(class_file, res_list)
    infos_mobsuite = add_mobsuite(infos_resfinder, mob_list)

    # Export the final dataframe
    final_df = infos_mobsuite.sort_values(by=["Sample", "Type", 'Length'])
    final_df.to_csv(snakemake.output[0], sep="\t")
35
36
shell:
    "NanoFilt -q {params.quality} -l {params.length} {input} --logfile {log}> {output}"
50
51
shell:
    "medaka_consensus -i {input.reads} -d {input.assembly} -t {threads} -m {params.model} -o 04-Medaka/{wildcards.sample} > {log} 2>&1"
62
63
shell:
    "nanoq -i {input} -s -vvv 2> {output}"
SnakeMake From line 62 of main/Snakefile
75
76
shell:
    "flye --nano-hq {input} -t {threads} -o 03-assembly/{wildcards.sample}_long {params.others} > {log} 2> /dev/null"
89
90
shell:
    "homopolish polish -a {input} -s {workflow.basedir}/{params.db} -m R10.3.pkl -o 05-Homopolish/{wildcards.sample} > {log} 2>&1"
SnakeMake From line 89 of main/Snakefile
115
116
117
shell:
    "fastp -i {input.R1} -I {input.R2} -o {output.R1} -O {output.R2} -q {params.qual} "
    "-u {params.unqual} -e {params.ave_qual} -l {params.length} -h {output.html} -j {output.json} 2> {log}"
128
129
130
shell:
    "mkdir QC/{wildcards.sample};"
    "fastqc {input} -o QC > /dev/null 2>&1"
144
145
146
shell:
    "unicycler -1 {input.short_1} -2 {input.short_2} -l {input.long} -o 03-assembly/{wildcards.sample}_hybrid -t {threads} --mode {params.mode} "
    "--verbosity 0 --keep 0"
160
161
162
163
shell:
    "bwa index {input.draft} > /dev/null 2>&1;"
    "bwa mem -t {threads} -a {input.draft} {input.short_1} > {output.alignments_1} 2> {log};"
    "bwa mem -t {threads} -a {input.draft} {input.short_1} > {output.alignments_2} 2> {log}"
179
180
181
shell:
    "polypolish_insert_filter.py --in1 {input.alignments_1} --in2 {input.alignments_2} --out1 {output.alignments_filt_1} --out2 {output.alignments_filt_2} > /dev/null 2>&1;"
    "polypolish {input.draft} {output.alignments_filt_1} {output.alignments_filt_2} > {output.assembly} 2> {log}"
SnakeMake From line 179 of main/Snakefile
196
197
shell:
    "python -m resfinder -ifa {input} --nanopore -acq -o 06-resfinder/{wildcards.sample} -l {params.coverage_threshold} -t {params.id_threshold} > /dev/null 2>&1"
SnakeMake From line 196 of main/Snakefile
207
208
209
210
shell:
    """java -cp {workflow.basedir}/KARGA KARGA {input} d:{workflow.basedir}/{params.db} -Xmx16GB;
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedReads.csv 09-karga;
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedGenes.csv 09-karga"""
SnakeMake From line 207 of main/Snakefile
221
222
shell:
    "mob_typer -i {input} -o {output} -n {threads} --multi > {log} 2>&1"
SnakeMake From line 221 of main/Snakefile
232
233
script:
    "bin/classify_contigs.py"
SnakeMake From line 232 of main/Snakefile
245
246
shell:
    "mge_cluster --create --input {input} --outdir 08-mge-cluster --perplexity {params.perplexity} --min_cluster {params.min_cluster} --kmer {params.kmer} --threads {threads}"
SnakeMake From line 245 of main/Snakefile
255
256
script:
    "bin/gather_typing_results.py"
SnakeMake From line 255 of main/Snakefile
ShowHide 9 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Flomauf/NanoPlasm
Name: nanoplasm
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...