Bulk Typing of Bacterial Species down to Strain Level


ON-rep-seq analysis toolbox

ON-rep-seq is a molecular method in which selective intragenomic fragments of bacterial (or yeast) genomes, generated with Rep-PCR, are sequenced using Oxford Nanopore Technologies. This approach allows species- and sub-species-level identification, and often strain-level discrimination, of bacterial and yeast isolates at very low cost. The current version of ON-rep-seq allows analysis of up to 192 isolates on one R9 flow cell, but gives the most cost-effective results on the Flongle <https://nanoporetech.com/products/flongle>_, for which it was initially designed.

Requirements

  • Anaconda

You can follow the installation guide <https://docs.anaconda.com/anaconda/install/>_.

Installation

Clone the GitHub repo and enter the directory::

git clone https://github.com/lauramilena3/On-rep-seq
cd On-rep-seq

Create the On-rep-seq virtual environment and activate it::

conda env create -n On-rep-seq -f On-rep-seq.yaml
source activate On-rep-seq

Go into the On-rep-seq directory and set variables pointing to your basecalled data and to the results directory of your choice::

fastqDir="/path/to/your/basecalled/data"
reusultDir="/path/to/your/desired/results/dir"

Note to macOS users (Canu)

If you are using macOS, you need to edit the config file to point Canu at the Darwin build::

sed -i'.bak' -e 's/Linux-amd64/Darwin-amd64/g' config.yaml

Download kraken database

View the number of available cores on your machine and set a number::

nproc
nCores="n"

If you are running on a laptop, we suggest leaving two cores free for other system tasks.
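The rule of thumb above can be scripted; a minimal sketch for Linux, where `nproc` is available (the resulting `nCores` matches the variable used in the commands below):

```shell
# Reserve two cores for the system, but never go below one.
total=$(nproc)
if [ "$total" -gt 2 ]; then
    nCores=$((total - 2))
else
    nCores=1
fi
echo "Using $nCores of $total cores"
```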

Download the Kraken database. Note that this step can take up to 48 hours (it needs to be done only once)::

kraken2-build --download-taxonomy --db db/NCBI-bacteria --threads $nCores   # ~4 h
kraken2-build --download-library bacteria --db db/NCBI-bacteria --threads $nCores   # ~33 h
kraken2-build --build --db db/NCBI-bacteria --threads $nCores   # ~4 h
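Before launching the pipeline, it can be worth verifying that the build produced the three core Kraken2 database files (`hash.k2d`, `opts.k2d`, `taxo.k2d`). A small sketch (the `check_kraken_db` helper is hypothetical, not part of ON-rep-seq):

```shell
# Hypothetical helper: succeed only if the core Kraken2 database
# files are present in the given directory.
check_kraken_db() {
    local db="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -f "$db/$f" ]; then
            echo "missing: $db/$f" >&2
            return 1
        fi
    done
    echo "database OK: $db"
}

# Usage: check_kraken_db db/NCBI-bacteria
```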

Running On-rep-seq analysis

Note to all users

ON-rep-seq is updated regularly. For best results, please keep your local installation up to date::

cd On-rep-seq
git pull

Input data

The input data are basecalled fastq files; see the Guppy basecaller <https://community.nanoporetech.com/downloads>_. For best performance we strongly recommend basecalling on a GPU (tested on a GTX 1080 Ti and an RTX 2080).
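As a quick sanity check of the input, you can count reads per fastq file before running the pipeline (a fastq record spans four lines; `count_reads` is a hypothetical helper, and `$fastqDir` is the variable set during installation):

```shell
# Hypothetical helper: print "<file> <n_reads>" for every .fastq in a directory.
count_reads() {
    local dir="$1" f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue
        echo "$f $(( $(wc -l < "$f") / 4 ))"
    done
}

# Usage: count_reads "$fastqDir"
```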

Running

Run the snakemake pipeline with the desired number of cores::

snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir

Limiting memory

You can limit the memory resources (in Megabytes) used per core by using the resources directive as follows::

snakemake -j $nCores --use-conda --config basecalled_dir=$fastqDir results_dir=$reusultDir --resources mem_mb=$max_mem
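One way to pick `$max_mem` is from the memory currently available on the machine; a Linux-only sketch reading `/proc/meminfo` (takes roughly 80% of available RAM, converted to megabytes):

```shell
# Linux-only: MemAvailable is reported in kB in /proc/meminfo.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
max_mem=$(( avail_kb * 8 / 10 / 1024 ))   # ~80% of available RAM, in MB
echo "max_mem=${max_mem}"
```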

View the DAG of jobs to visualize the workflow

To view the DAG, run::

snakemake --dag | dot -Tpdf > dag.pdf

Results structure

All results are stored in the Results folder as follows::

Results
├── 01_porechopped_data
│   └── {barcode}_demultiplexed.fastq       # Demultiplexed fastq per barcode
├── 02_LCPs
│   ├── LCP_clustering_heatmaps.ipynb       # Clustering Jupyter notebook
│   ├── LCP_plots.pdf                       # Plots
│   ├── {barcode}.txt                       # All LCPs
│   └── LCPsClusteringData
│       └── {barcode}.txt                   # LCPs used for clustering
├── 03_LCPs_peaks
│   ├── 00_peak_consensus
│   │   └── fixed_{barcode}_{peak}.fasta    # Corrected consensus fasta of peaks
│   ├── 01_taxonomic_assignments
│   │   ├── taxonomy_assignments.txt        # Taxonomy of all barcodes
│   │   └── taxonomy_{barcode}.txt          # Taxonomy per barcode
│   └── peaks_{barcode}.txt                 # Peaks of each barcode
└── check.txt                               # Final file: "On-rep-seq successfully executed"
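Once the run finishes, the per-read assignments in `taxonomy_assignments.txt` follow the standard kraken2 output layout (tab-separated; with `--use-names` the third column holds the taxon name). A sketch for tallying classified taxa (`summarize_taxa` is a hypothetical helper):

```shell
# Hypothetical helper: count classified ("C") reads per taxon name
# in a kraken2 --use-names output file.
summarize_taxa() {
    awk -F'\t' '$1 == "C" {print $3}' "$1" | sort | uniq -c | sort -rn
}

# Usage:
# summarize_taxa Results/03_LCPs_peaks/01_taxonomic_assignments/taxonomy_assignments.txt
```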

Publications & citing

bioRxiv <https://www.biorxiv.org/content/10.1101/402156v1> _

Code Snippets

shell:
    """
    echo {wildcards.sample}
    porechop -i {input}/{wildcards.sample}.fastq -b {params} -t {threads} --discard_unassigned --verbosity 2 > /dev/null 2>&1
    line=$(echo {BARCODES})
    for barcode in $line
    do
        touch {params}/$barcode.fastq
    done
    """
shell:
    """
    cat {params}/*/{wildcards.barcode}.fastq > {output}
    echo "{params}/*/{wildcards.barcode}.fastq > {output}"
    """
shell:
    """
    if [ -s {input} ]
    then
        porechop -i {input} -o {output} --fp2ndrun --verbosity 0
        reads=$(grep -c "^@" {output})
        if (( $reads < 2000 ))
        then
            mv {output} {params}
            touch {output}
        fi
    else
        touch {output}
    fi
    """
shell:
	"""
	cat {input} | awk '{{if(NR%4==2) print length($1)+0}}' | sort -n | uniq -c | sed "s/   //g" |  sed "s/  //g" | sed "s/^ *//" > {output}
	"""
run:
	#import libraries
	import os
	import math
	import matplotlib.pyplot as plt
	import numpy as np

	#set subplot features	
	filelist=sorted(input, key=lambda x: int(x.split('BC')[1].split(".")[0]))
	nro=math.ceil(len(filelist)/3)
	fig, axes = plt.subplots(nrows=nro, ncols=3, figsize=(12, 50), 
		sharex=True, sharey=True)
	plt.xlim(0,3500)

	#plot each barcode
	i = 0
	for row in axes:
		for ax in row:
			if i < len(filelist):
				if os.path.getsize(filelist[i]) > 10:
					data=np.loadtxt(filelist[i])
					X=data[:,0]
					Y=data[:,1]
				else:
					X=0
					Y=0

				ax.plot(Y, X)
				#add label to barcode subplot
				ax.text(0.9, 0.5, filelist[i].split("/")[-1].split(".")[0],
					transform=ax.transAxes, ha="right")

				i += 1
	#save figure to pdf				
	fig.savefig(OUTPUT_DIR + "/02_LCPs/LCP_plots.pdf", bbox_inches='tight')
shell:
	"""
	Rscript --vanilla scripts/peakpicker.R -f {input} -o {output.txt} -v TRUE || true 
	touch {output.txt}
	touch {output.pdf}
	"""
Snakemake rule, from line 60 of rules/LCPs.smk
	shell:
		"""
		mkdir -p "{output.directory}"
		cp "{params.work_directory}"/*.txt "{output.directory}"
		find "{output.directory}" -size -{params.min_size}c -delete
		Rscript -e "IRkernel::installspec()"
		CLUSTSCRIPT="$(realpath ./scripts/LCpCluster.R)"
		( cd "{params.work_directory}"; "$CLUSTSCRIPT" LCPsClusteringData/ "{params.ipynb}" )
		mv "{params.work_directory}/{params.ipynb}" "{output.ipynb}"
		mv "{params.work_directory}/{params.png1}" "{output.png1}"
		mv "{params.work_directory}/{params.png2}" "{output.png2}"
		mv "{params.work_directory}/{params.fl_pdf}" "{output.fl_pdf}"
		jupyter-nbconvert --to html --template full "{output.ipynb}" 
		"""
Snakemake rule, from line 86 of rules/LCPs.smk
shell:
	"""
	sed 1d {input} | while read line
	do
		P1=$(echo $line | cut -d',' -f 5 )
		P2=$(echo $line | cut -d',' -f 6)
		if [ "$P2" -gt 300 ]
		then
			name=$(echo $line | cut -d',' -f 3)
			cutadapt -m $P1 {params.porechopped}/{wildcards.barcode}_demultiplexed.fastq -o {params.peaks}/{wildcards.barcode}_short_$name.fastq 
			cutadapt -M $P2 {params.peaks}/{wildcards.barcode}_short_$name.fastq -o {params.peaks}/{wildcards.barcode}_$name.fastq
			echo "{wildcards.barcode}_$name" >> {output}
		fi
	done
	touch {output}
	"""
	shell:
		"""
		cat {input} | while read line
		do
			echo $line
			./{config[canu_dir]}/canu -correct -p peak -d {params}/fixed_$line genomeSize=5k -nanopore-raw {params}/$line.fastq \
			minReadLength=300 correctedErrorRate=0.01 corOutCoverage=5000 corMinCoverage=2 minOverlapLength=300 cnsErrorRate=0.1 \
			cnsMaxCoverage=5000 useGrid=false || true
			if [ -s {params}/fixed_$line/peak.correctedReads.fasta.gz ]
			then
				gunzip -c {params}/fixed_$line/peak.correctedReads.fasta.gz > {params}/fixed_$line.fastq
				echo "fixed_$line" >> {output}
			fi
		done
		touch {output}
		"""
shell:
	"""
	mkdir -p {params.consensus}
	cat {input} | while read line
	do
		count=$(grep -c ">" {params.LCPs}/$line.fastq )
		min=$(echo "scale=0 ; $count / 5" | bc )
		echo "$line" >> {output}
		vsearch --sortbylength {params.LCPs}/$line.fastq --output {params.LCPs}/sorted_$line.fasta
		vsearch --cluster_fast {params.LCPs}/sorted_$line.fasta -id 0.9  --consout {params.LCPs}/consensus_$line.fasta -strand both -minsl 0.80 -sizeout -minuniquesize $min
		vsearch --sortbysize {params.LCPs}/consensus_$line.fasta --output {params.consensus}/$line.fasta --minsize 50
	done
	touch {output}
	"""
shell:
	"""
	mkdir -p {params.taxonomy}
	cat {input} | while read line
	do
		echo "{params.consensus}/$line.fasta"
		if [ -s {params.consensus}/$line.fasta ]
		then
			cat {params.consensus}/$line.fasta >> {output.merged}
		fi
	done
	kraken2 --db {config[kraken_db]} {output.merged} --use-names > {output.taxonomy} || true 
	touch {output.taxonomy}
	touch {output.merged}
	awk -F '\t' '{{print FILENAME " " $3}}' {output.taxonomy} | sort | uniq -c | sort -nr >> {params.taxonomy_final} 
	"""
shell:
	"""
	rm {params.peaks}/*fastq {params.peaks}/*fasta 
	echo "On-rep-seq successfully executed" >> {output}
	"""


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/lauramilena3/On-rep-seq
Name: on-rep-seq
Version: 1
Copyright: Public Domain
License: MIT License