NanoPlasm

public 1yr ago 0 bookmarks

View Workflow

v1.0

NanoPlasm is a specialized workflow designed for the purpose of assembling, typing, and analyzing plasmid sequences using long reads such as Nanopore or PacBio, as well as a combination of long and short reads. To ensure reliability, reproducibility, and optimization in the analysis process, NanoPlasm harnesses the power of the Snakemake workflow framework.

The tool is still under development.

Workflow overview

Nanoplasm_worflow

Install

Currently, the only method of installing and utilizing NanoPlasm consists in cloning this repository and subsequently installing the various dependencies and databases (as outlined below).

# Clone NanoPlasm repository
git clone https://github.com/Flomauf/NanoPlasm.git NanoPlasm

Dependencies

While NanoPlasm relies on Docker images for the majority of its tools, certain dependencies must be installed beforehand to ensure seamless execution. It is strongly recommended to set up a dedicated environment specifically for this purpose. The subsequent instructions illustrate how to achieve this using Conda:

# Create the environment
conda create -n nanoplasm python=3.10.0
# Install dependencies
conda install -c bioconda -c conda-forge snakemake bwa biopython mge-cluster
# Activate the environment
conda activate nanoplasm

Database

In order for Homopolish to function properly, a dedicated database is necessary. This database should be installed within the NanoPlasm directory. To streamline this process, a script has been developed to simplify the installation.

# Go into NanoPlasm directory
cd NanoPlasm
# Install database
sh install_db.sh

NanoPlasm is now ready to use.

Usage

To streamline the usage of NanoPlasm, a wrapper script has been implemented to facilitate the invocation of the Snakemake pipeline. However, if needed, arguments can still be passed to Snakemake using the --options argument, allowing for additional customization and flexibility.

To maintain consistency and organization, it is essential to adhere to the following file naming conventions when working with NanoPlasm:

For long reads:

Each long read file should be named as "ID.fastq", where "ID" represents a unique identifier associated with the sample.

For short reads:

Paired-end short read files should be stored in a separate folder.
The forward read file should be named as "ID_R1.fastq", where "ID" corresponds to the sample identifier.
The reverse read file should be named as "ID_R2.fastq", following the same sample identifier pattern.

Adhering to these file naming conventions ensures clarity and facilitates the organization of data within the NanoPlasm workflow.

Creation of the analysis folder

The initial step entails creating a designated folder where the analysis will take place. Within this folder, an automatically generated configuration file will be created, encompassing all the parameters required for the analysis. It is imperative to edit this file in accordance with your specific analysis requirements and preferences.

usage: nanoplasm init [-h] -i dir [-I dir] -o dir
Generate the folder and configuration file for NanoPlasm analysis.
options:
-h, --help show this help message and exit
-i dir, --input dir directory with Nanopore fastq files. Files must be in the following format: ID.fastq
-I dir, --INPUT dir directory with Illumina fastq files. Files must be in the following format: ID_R#.fastq
-o dir, --output dir directory to create for the run

Starting NanoPlasm run

After successfully creating the analysis folder, you can initiate a NanoPlasm run. The workflow offers two distinct assembly modes: "long" and "hybrid."

"long" mode: This mode is specifically designed for long-read sequencing data, such as Nanopore or PacBio reads. It focuses on leveraging the unique characteristics of long reads to perform the plasmid assembly, typing, and analysis.
"hybrid" mode: In this mode, a combination of long and short reads is utilized. The workflow harnesses the strengths of both long and short reads to achieve more comprehensive and accurate results in plasmid assembly, typing, and analysis.

You can select the appropriate assembly mode based on the sequencing data available for your analysis.

usage: nanoplasm typing [-h] [-m dir] -d dir [-t int] [--quality_check] [--options str]
Start Nanoplasm with long|hybrid mode.
options:
-h, --help show this help message and exit
-m dir, --mode dir mode for assembling genomes [long]
-d dir, --dir dir directory for the run (created with nanoplasm init)
-t int, --thread int number of threads [1]
--quality_check enable reads quality check [disabled]
--options str Snakemake options. Use as --options= and snakemake arguments between brackets

Example

# Generate the analysis folder and configuration file
path/to/nanoplasm init -i fastq_nanopore -o analysis_folder
# Edit the configuration file
nano analysis_folder/config.yaml
# Start NanoPlasm run
path/to/nanoplasm typing -d analysis_folder -t 24 --quality_check

Future features

Module for discarding reads when in excess. Samples with reads in excess (very high coverage) can lead to errors during Flye assembly.
Merge functions for combining different analysis folders.

Tips

The pipeline might crash because a sample assembly failed.
This can be due to a very coverage for this sample leading Flye to fail disjointing assembly. To avoid this error, you can use the --asm-coverage and --genome-size options in Flye to limit the coverage in your assemblies. Add these options in the config.yaml file of the run.
I need to analyse groups of sample with different parameters but I want to compare them all together ultimately.
Run the analyses in different folders with appropriate parameters for each analysis and, once all analyses are completed, merge all folders using the merge function.

Code Snippets

from Bio import SeqIO
import os

min_size = snakemake.config["classify_contigs"]["min_size"]
max_size = snakemake.config["classify_contigs"]["max_size"]
assemblies = snakemake.input
output = snakemake.output["class_file"]
plasmids_files_path = snakemake.output["plasmids_list"]

# If only one sample, must converted into list
if len(assemblies) < 2:
    assemblies = [assemblies]

# Create folders
if not os.path.exists("Sequences/plasmids"):
    os.mkdir("Sequences/plasmids")
if not os.path.exists("Sequences/chromosomes"):
    os.mkdir("Sequences/chromosomes")

with open(output, "w") as report, open(plasmids_files_path, "w") as plasmids_list:
    # Write report first line
    report.write("Sample\tContig ID\tType\tLength\n")

    for file in assemblies:
        sample = file.split("/")[1] # sample ID (second term in path)
        assembly = SeqIO.parse(file, "fasta")
        for seq in assembly:
            # If plasmid
            if min_size <= len(seq.seq) <= max_size:
                with open(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tPlasmid\t{len(seq.seq)}\n")
                plasmids_list.write(f"Sequences/plasmids/{sample}_plasmid_{seq.id}.fasta\n")

            # If chromosome
            elif max_size < len(seq.seq):
                with open(f"Sequences/chromosomes/{sample}_chromosome_{seq.id}.fasta", "w") as out:
                    SeqIO.write([seq], out, "fasta")
                report.write(f"{sample}\t{seq.id}\tChromosome\t{len(seq.seq)}\n")

Python Biopython From line 5 of bin/classify_contigs.py

import pandas as pd

""" Gather results from Resfinder and mobsuite """

def add_resfinder(master_df, res_files):
    " Add resfinder results to the classification file "

    # Creation of dataframe for resfinder results
    res_df = pd.DataFrame({"Sample":[], "Contig ID":[], "ResFinder Resistance genes":[], "ResFinder phenotype":[]})

    # Parsing of all Resfinder files
    for f in res_files:
        sample = f.split("/")[1] # extracted from the path
        results = {} # stores the results per contig
        text = open(f, "r").readlines()[1:] # open and skip header
        for line in text:
            gene = line.split("\t")[0]
            contig = line.split("\t")[5]
            phenotype = line.split("\t")[7].split(", ") # list of phenotypes
            if contig not in results:
                results[contig] = [gene, phenotype] # list of genes in pos 0 and list of phenotype in pos 1
            else:
                results[contig][0] += f", {gene}" # add the new gene to the string in position 0 (list of genes)
                results[contig][1].extend(x for x in phenotype if x not in results[contig][1]) # add phenotype if not already present

        # Looping over results dict to complete dataframe. The label used is the length of the dataframe, creating a simple index.
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], ", ".join(results[elt][1])]

    # Merging the classification dataframe with the resfinder dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])

    return new_df

def add_mobsuite(master_df, mob_files):
    " Add mobsuite results to the master file "

    # Creation of dataframe for mobsuite results
    res_df = pd.DataFrame({"Sample":[], "Contig ID":[], "Mob-suite replicon types":[], "Mob-suite relaxase types":[]})

    # Parsing of all mobsuite files
    for f in mob_files:
        sample = f.split("/")[1].split("_")[0] # extracted from the path
        results = {} # stores the results per contig
        text = open(f, "r").readlines()[1:] # open and skip header
        for line in text:
            contig = line.split("\t")[0]
            replicon = line.split("\t")[5]
            relaxase = line.split("\t")[7]
            results[contig] = [replicon, relaxase]

        # Looping over results dict to complete dataframe. The label used is the length of the dataframe, creating a simple index.
        for elt in results:
            res_df.loc[len(res_df)] = [sample, elt, results[elt][0], results[elt][1]]

    # Merging the classification dataframe with the mobsuite dataframe
    new_df = master_df.merge(res_df, how="outer", on=["Sample", "Contig ID"])

    return new_df


if __name__ == "__main__":

    # Load dataframe and list of files from the Snakefile
    class_file = pd.read_csv(snakemake.input["classification"], sep='\t', header=0)
    class_file = class_file.astype({"Sample": str}) # Convert first column to string in case samples ID are just numbers
    res_files = snakemake.input["resfinder"]
    mob_files = snakemake.input["mobsuite"]

    # If only one sample, must converted into list
    if len(res_files) > 1:
        res_list = res_files
    else:
        res_list = [res_files[0]]

    if len(mob_files) > 1:
        mob_list = mob_files
    else:
        mob_list = [mob_files[0]]

    # Add information for each analysis
    infos_resfinder = add_resfinder(class_file, res_list)
    infos_mobsuite = add_mobsuite(infos_resfinder, mob_list)

    # Export the final dataframe
    final_df = infos_mobsuite.sort_values(by=["Sample", "Type", 'Length'])
    final_df.to_csv(snakemake.output[0], sep="\t")

Python Pandas From line 3 of bin/gather_typing_results.py

shell:
    "NanoFilt -q {params.quality} -l {params.length} {input} --logfile {log}> {output}"

SnakeMake nanofilt From line 35 of main/Snakefile

shell:
    "medaka_consensus -i {input.reads} -d {input.assembly} -t {threads} -m {params.model} -o 04-Medaka/{wildcards.sample} > {log} 2>&1"

SnakeMake Medaka From line 50 of main/Snakefile

shell:
    "nanoq -i {input} -s -vvv 2> {output}"

SnakeMake From line 62 of main/Snakefile

shell:
    "flye --nano-hq {input} -t {threads} -o 03-assembly/{wildcards.sample}_long {params.others} > {log} 2> /dev/null"

SnakeMake Flye From line 75 of main/Snakefile

shell:
    "homopolish polish -a {input} -s {workflow.basedir}/{params.db} -m R10.3.pkl -o 05-Homopolish/{wildcards.sample} > {log} 2>&1"

SnakeMake From line 89 of main/Snakefile

shell:
    "fastp -i {input.R1} -I {input.R2} -o {output.R1} -O {output.R2} -q {params.qual} "
    "-u {params.unqual} -e {params.ave_qual} -l {params.length} -h {output.html} -j {output.json} 2> {log}"

SnakeMake fastp From line 115 of main/Snakefile

shell:
    "mkdir QC/{wildcards.sample};"
    "fastqc {input} -o QC > /dev/null 2>&1"

SnakeMake FastQC From line 128 of main/Snakefile

shell:
    "unicycler -1 {input.short_1} -2 {input.short_2} -l {input.long} -o 03-assembly/{wildcards.sample}_hybrid -t {threads} --mode {params.mode} "
    "--verbosity 0 --keep 0"

SnakeMake Unicycler From line 144 of main/Snakefile

shell:
    "bwa index {input.draft} > /dev/null 2>&1;"
    "bwa mem -t {threads} -a {input.draft} {input.short_1} > {output.alignments_1} 2> {log};"
    "bwa mem -t {threads} -a {input.draft} {input.short_1} > {output.alignments_2} 2> {log}"

SnakeMake BWA From line 160 of main/Snakefile

shell:
    "polypolish_insert_filter.py --in1 {input.alignments_1} --in2 {input.alignments_2} --out1 {output.alignments_filt_1} --out2 {output.alignments_filt_2} > /dev/null 2>&1;"
    "polypolish {input.draft} {output.alignments_filt_1} {output.alignments_filt_2} > {output.assembly} 2> {log}"

SnakeMake From line 179 of main/Snakefile

shell:
    "python -m resfinder -ifa {input} --nanopore -acq -o 06-resfinder/{wildcards.sample} -l {params.coverage_threshold} -t {params.id_threshold} > /dev/null 2>&1"

SnakeMake From line 196 of main/Snakefile

shell:
    """java -cp {workflow.basedir}/KARGA KARGA {input} d:{workflow.basedir}/{params.db} -Xmx16GB;
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedReads.csv 09-karga;
    mv 01-NanoFilt/{wildcards.sample}_nanofilt_KARGA_mappedGenes.csv 09-karga"""

SnakeMake From line 207 of main/Snakefile

shell:
    "mob_typer -i {input} -o {output} -n {threads} --multi > {log} 2>&1"

SnakeMake From line 221 of main/Snakefile

script:
    "bin/classify_contigs.py"

SnakeMake From line 232 of main/Snakefile

shell:
    "mge_cluster --create --input {input} --outdir 08-mge-cluster --perplexity {params.perplexity} --min_cluster {params.min_cluster} --kmer {params.kmer} --threads {threads}"

SnakeMake From line 245 of main/Snakefile

script:
    "bin/gather_typing_results.py"

SnakeMake From line 255 of main/Snakefile

ShowHide 9 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/Flomauf/NanoPlasm

Name: nanoplasm

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Keywords:

raw sequence reads Genome assembly Plot nanofilt Biopython BWA fastp FastQC Flye Medaka Pandas Snakemake Unicycler Genomics

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free