Bulk Typing of Bacterial Species down to Strain Level


ON-rep-seq analysis toolbox

ON-rep-seq is a molecular method in which selective intragenomic fragments of bacterial (or yeast) genomes, generated with Rep-PCR, are sequenced using Oxford Nanopore Technologies. This approach allows species- and sub-species-level identification, and often strain-level discrimination, of bacterial and yeast isolates at very low cost. The current version of ON-rep-seq allows analysis of up to 192 isolates on one R9 flow cell, but gives the most cost-effective results on the Flongle <https://nanoporetech.com/products/flongle>_, for which it was initially designed.

Requirements

  • Anaconda

You can follow the installation guide <https://docs.anaconda.com/anaconda/install/>_.

Installation

Clone the GitHub repo and enter the directory::

git clone https://github.com/lauramilena3/On-rep-seq
cd On-rep-seq

Create the On-rep-seq virtual environment and activate it::

conda env create -n On-rep-seq -f On-rep-seq.yaml
source activate On-rep-seq

Go into the On-rep-seq directory and set variables pointing to your basecalled data and to the results directory of your choice::

fastqDir="/path/to/your/basecalled/data"
reusultDir="/path/to/your/desired/results/dir"

Note to macOS users (Canu)

If you are using macOS, you need to edit the config file to point Canu at the Darwin build::

sed -i'.bak' -e 's/Linux-amd64/Darwin-amd64/g' config.yaml

Download kraken database

View the number of available cores on your machine and set a number::

nproc
nCores="n"

If you are running on a laptop, we suggest leaving two cores free for other system tasks.
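The rule of thumb above can be scripted; a minimal sketch for Linux, where `nproc` is available (the resulting `nCores` matches the variable used in the commands below):

```shell
# Reserve two cores for the system, but never go below one.
total=$(nproc)
if [ "$total" -gt 2 ]; then
    nCores=$((total - 2))
else
    nCores=1
fi
echo "Using $nCores of $total cores"
```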

Download the Kraken database. Note that this step can take up to 48 hours (it needs to be done only once)::

kraken2-build --download-taxonomy --db db/NCBI-bacteria --threads $nCores   # ~4 h
kraken2-build --download-library bacteria --db db/NCBI-bacteria --threads $nCores   # ~33 h
kraken2-build --build --db db/NCBI-bacteria --threads $nCores   # ~4 h
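Before launching the pipeline, it can be worth verifying that the build produced the three core Kraken2 database files (`hash.k2d`, `opts.k2d`, `taxo.k2d`). A small sketch (the `check_kraken_db` helper is hypothetical, not part of ON-rep-seq):

```shell
# Hypothetical helper: succeed only if the core Kraken2 database
# files are present in the given directory.
check_kraken_db() {
    local db="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -f "$db/$f" ]; then
            echo "missing: $db/$f" >&2
            return 1
        fi
    done
    echo "database OK: $db"
}

# Usage: check_kraken_db db/NCBI-bacteria
```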

Running On-rep-seq analysis

Note to all users

ON-rep-seq is updated regularly. For best results, please keep your local installation up to date::

cd On-rep-seq
git pull

Input data

The input data are basecalled fastq files; see the Guppy basecaller <https://community.nanoporetech.com/downloads>_. For best performance we strongly recommend basecalling on a GPU (tested on a GTX 1080 Ti and an RTX 2080).
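As a quick sanity check of the input, you can count reads per fastq file before running the pipeline (a fastq record spans four lines; `count_reads` is a hypothetical helper, and `$fastqDir` is the variable set during installation):

```shell
# Hypothetical helper: print "<file> <n_reads>" for every .fastq in a directory.
count_reads() {
    local dir="$1" f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue
        echo "$f $(( $(wc -l < "$f") / 4 ))"
    done
}

# Usage: count_reads "$fastqDir"
```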

Running

Run the snakemake pipeline with the desired number of cores::

snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir

Limiting memory

You can limit the memory resources (in Megabytes) used per core by using the resources directive as follows::

snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir --resources mem_mb=$max_mem
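One way to pick `$max_mem` is from the memory currently available on the machine; a Linux-only sketch reading `/proc/meminfo` (takes roughly 80% of available RAM, converted to megabytes):

```shell
# Linux-only: MemAvailable is reported in kB in /proc/meminfo.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
max_mem=$(( avail_kb * 8 / 10 / 1024 ))   # ~80% of available RAM, in MB
echo "max_mem=${max_mem}"
```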

View the DAG of jobs to visualize the workflow

To view the DAG, run::

snakemake --dag | dot -Tpdf > dag.pdf

Results structure

All results are stored in the Results folder as follows::

Results
├── 01_porechopped_data
│   └── {barcode}_demultiplexed.fastq       # Demultiplexed fastq per barcode
├── 02_LCPs
│   ├── LCP_clustering_heatmaps.ipynb       # Clustering Jupyter notebook
│   ├── LCP_plots.pdf                       # Plots
│   ├── {barcode}.txt                       # All LCPs
│   └── LCPsClusteringData
│       └── {barcode}.txt                   # LCPs used for clustering
├── 03_LCPs_peaks
│   ├── 00_peak_consensus
│   │   └── fixed_{barcode}_{peak}.fasta    # Corrected consensus fasta of peaks
│   ├── 01_taxonomic_assignments
│   │   ├── taxonomy_assignments.txt        # Taxonomy of all barcodes
│   │   └── taxonomy_{barcode}.txt          # Taxonomy per barcode
│   └── peaks_{barcode}.txt                 # Peaks of each barcode
└── check.txt                               # Final file: "On-rep-seq successfully executed"
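Once the run finishes, the per-read assignments in `taxonomy_assignments.txt` follow the standard kraken2 output layout (tab-separated; with `--use-names` the third column holds the taxon name). A sketch for tallying classified taxa (`summarize_taxa` is a hypothetical helper):

```shell
# Hypothetical helper: count classified ("C") reads per taxon name
# in a kraken2 --use-names output file.
summarize_taxa() {
    awk -F'\t' '$1 == "C" {print $3}' "$1" | sort | uniq -c | sort -rn
}

# Usage:
# summarize_taxa Results/03_LCPs_peaks/01_taxonomic_assignments/taxonomy_assignments.txt
```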

Publications & citing

bioRxiv <https://www.biorxiv.org/content/10.1101/402156v1> _

Code Snippets

shell:
    """
    echo {wildcards.sample}
    porechop -i {input}/{wildcards.sample}.fastq -b {params} -t {threads} --discard_unassigned --verbosity 2 > /dev/null 2>&1
    line=$(echo {BARCODES})
    for barcode in $line
    do
        touch {params}/$barcode.fastq
    done
    """
shell:
    """
    cat {params}/*/{wildcards.barcode}.fastq > {output}
    echo "{params}/*/{wildcards.barcode}.fastq > {output}"
    """
shell:
    """
    if [ -s {input} ]
    then
        porechop -i {input} -o {output} --fp2ndrun --verbosity 0
        reads=$(grep -c "^@" {output})
        if (( $reads < 2000 ))
        then
            mv {output} {params}
            touch {output}
        fi
    else
        touch {output}
    fi
    """
shell:
	"""
	cat {input} | awk '{{if(NR%4==2) print length($1)+0}}' | sort -n | uniq -c | sed "s/   //g" |  sed "s/  //g" | sed "s/^ *//" > {output}
	"""
run:
	#import libraries
	import os
	import math
	import matplotlib.pyplot as plt
	import numpy as np

	#set subplot features	
	filelist=sorted(input, key=lambda x: int(x.split('BC')[1].split(".")[0]))
	nro=math.ceil(len(filelist)/3)
	fig, axes = plt.subplots(nrows=nro, ncols=3, figsize=(12, 50), 
		sharex=True, sharey=True)
	plt.xlim(0,3500)

	#plot each barcode
	i = 0
	for row in axes:
		for ax in row:
			if i < len(filelist):
				if os.path.getsize(filelist[i]) > 10:
					data=np.loadtxt(filelist[i])
					X=data[:,0]
					Y=data[:,1]
				else:
					X=0
					Y=0

				ax.plot(Y, X)
				#add label to barcode subplot
				ax.text(0.9, 0.5, filelist[i].split("/")[-1].split(".")[0],
					transform=ax.transAxes, ha="right")

				i += 1
	#save figure to pdf				
	fig.savefig(OUTPUT_DIR + "/02_LCPs/LCP_plots.pdf", bbox_inches='tight')
shell:
	"""
	Rscript --vanilla scripts/peakpicker.R -f {input} -o {output.txt} -v TRUE || true 
	touch {output.txt}
	touch {output.pdf}
	"""
Snakemake rule, from line 60 of rules/LCPs.smk
	shell:
		"""
		mkdir -p "{output.directory}"
		cp "{params.work_directory}"/*.txt "{output.directory}"
		find "{output.directory}" -size -{params.min_size}c -delete
		Rscript -e "IRkernel::installspec()"
		CLUSTSCRIPT="$(realpath ./scripts/LCpCluster.R)"
		( cd "{params.work_directory}"; "$CLUSTSCRIPT" LCPsClusteringData/ "{params.ipynb}" )
		mv "{params.work_directory}/{params.ipynb}" "{output.ipynb}"
		mv "{params.work_directory}/{params.png1}" "{output.png1}"
		mv "{params.work_directory}/{params.png2}" "{output.png2}"
		mv "{params.work_directory}/{params.fl_pdf}" "{output.fl_pdf}"
		jupyter-nbconvert --to html --template full "{output.ipynb}" 
		"""
Snakemake rule, from line 86 of rules/LCPs.smk
shell:
	"""
	sed 1d {input} | while read line
	do
		P1=$(echo $line | cut -d',' -f 5 )
		P2=$(echo $line | cut -d',' -f 6)
		if [ "$P2" -gt 300 ]
		then
			name=$(echo $line | cut -d',' -f 3)
			cutadapt -m $P1 {params.porechopped}/{wildcards.barcode}_demultiplexed.fastq -o {params.peaks}/{wildcards.barcode}_short_$name.fastq 
			cutadapt -M $P2 {params.peaks}/{wildcards.barcode}_short_$name.fastq -o {params.peaks}/{wildcards.barcode}_$name.fastq
			echo "{wildcards.barcode}_$name" >> {output}
		fi
	done
	touch {output}
	"""
	shell:
		"""
		cat {input} | while read line
		do
			echo $line
			./{config[canu_dir]}/canu -correct -p peak -d {params}/fixed_$line genomeSize=5k -nanopore-raw {params}/$line.fastq \
			minReadLength=300 correctedErrorRate=0.01 corOutCoverage=5000 corMinCoverage=2 minOverlapLength=300 cnsErrorRate=0.1 \
			cnsMaxCoverage=5000 useGrid=false || true
			if [ -s {params}/fixed_$line/peak.correctedReads.fasta.gz ]
			then
				gunzip -c {params}/fixed_$line/peak.correctedReads.fasta.gz > {params}/fixed_$line.fastq
				echo "fixed_$line" >> {output}
			fi
		done
		touch {output}
		"""
shell:
	"""
	mkdir -p {params.consensus}
	cat {input} | while read line
	do
		count=$(grep -c ">" {params.LCPs}/$line.fastq )
		min=$(echo "scale=0 ; $count / 5" | bc )
		echo "$line" >> {output}
		vsearch --sortbylength {params.LCPs}/$line.fastq --output {params.LCPs}/sorted_$line.fasta
		vsearch --cluster_fast {params.LCPs}/sorted_$line.fasta -id 0.9  --consout {params.LCPs}/consensus_$line.fasta -strand both -minsl 0.80 -sizeout -minuniquesize $min
		vsearch --sortbysize {params.LCPs}/consensus_$line.fasta --output {params.consensus}/$line.fasta --minsize 50
	done
	touch {output}
	"""
shell:
	"""
	mkdir -p {params.taxonomy}
	cat {input} | while read line
	do
		echo "{params.consensus}/$line.fasta"
		if [ -s {params.consensus}/$line.fasta ]
		then
			cat {params.consensus}/$line.fasta >> {output.merged}
		fi
	done
	kraken2 --db {config[kraken_db]} {output.merged} --use-names > {output.taxonomy} || true 
	touch {output.taxonomy}
	touch {output.merged}
	awk -F '\t' '{{print FILENAME " " $3}}' {output.taxonomy} | sort | uniq -c | sort -nr >> {params.taxonomy_final} 
	"""
shell:
	"""
	rm {params.peaks}/*fastq {params.peaks}/*fasta 
	echo "On-rep-seq successfully executed" >> {output}
	"""


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/lauramilena3/On-rep-seq
Name: on-rep-seq
Version: 1
Copyright: Public Domain
License: MIT License