A tool for generating bacterial genomes from metagenomes with nanopore long read sequencing

public public 1yr ago 0 bookmarks

Lathe

A tool for generating bacterial genomes from metagenomes with Nanopore long read sequencing

Installation

First, install miniconda3

Then install snakemake. This can be done with the following.

conda install snakemake
snakemake --version #please ensure this is >=5.4.3

Next, clone this github directory to some location where it can be stored permanently. Remember to keep it updated with git pull .

git clone https://github.com/elimoss/lathe.git

Instructions to enable cluster execution with SLURM can be found at https://github.com/bhattlab/slurm

Inputs

Alter config.yaml to provide the following:

  • sample_name : Name of sample and output directory

  • fast5_dirs_list : text file containing a list of absolute paths to run/fast5/* subfolders containing .fast5 files. A good way to generate this is with find -maxdepth 2 -mindepth 2 fast5_parent_dir > fodn.txt

  • flowcell : flowcell code, e.g. FLO-MIN106, passed to basecaller

  • kit : kit code, e.g. SQK-LSK109, passed to basecaller

  • genome_size : Estimated genome size, e.g. 50m, passed to Canu.

  • singularity : location (including on the internet) of a singularity image to be used for the workflow. Don't change this.

  • short_reads : location of short reads to be used for Pilon polishing, or empty quotes for long-read polishing.

  • use_grid : should Canu execute in distributed mode on a cluster?

  • grid_options : Extra options for execution on a cluster

  • canu_args : Extra options for Canu

  • skip_circularization : Should circularization be omitted from the workflow?

Lathe uses the Flye assembler by default. For Canu, please specify 'canu' for the assembler parameter in the config. For cluster Canu execution, please note: if set to True, you will need to install Canu, e.g. conda install -c conda-forge -c bioconda Canu=1.8 as well as provide any additional required parameters for your job scheduler in the config.yaml file. Please see the example config file. When executing on a cluster, Canu will appear to fail, as the first process does not produce an assembly and instead spawns subsequent jobs on the cluster. Don't worry, just re-run Lathe when the assembly completes.

To execute please run the following. Please note, you must substitute a parent directory containing all of your data and working directories for /labs/ .

snakemake --use-singularity --singularity-args '--bind /labs/ --bind /scratch/ --bind /scg/ ' -s /path/to/lathe/Snakefile \
--configfile path/to/modified_config.yaml --restart-times 0 --keep-going --latency-wait 30
# --profile scg #enable cluster support, highly recommended. See above.

Outputs

The outputs generated by this workflow will look like the following:

samplename/
├── 0.basecall ├── samplename.fq └── nanoplots
├── 1.assemble ├── samplename_merged.fasta ├── samplename_raw_assembly.fa ├── samplename_raw_assembly.fa.amb ├── samplename_raw_assembly.fa.ann ├── samplename_raw_assembly.fa.bwt ├── samplename_raw_assembly.fa.fai ├── samplename_raw_assembly.fa.pac ├── samplename_raw_assembly.fa.paf ├── samplename_raw_assembly.fa.sa ├── assemble_100m (if specified) └── assemble_250m (if specified)
├── 2.polish ├── samplename_polished.corrected.fasta ├── samplename_polished.fasta ├── samplename_polished.fasta.bam ├── samplename_polished.fasta.bam.bai ├── samplename_polished.fasta.fai ├── samplename_polished.fasta.misassemblies.tsv ├── medaka (if specified) ├── pilon (if specified) └── racon (if specified)
├── 3.circularization ├── 1.candidate_genomes ├── 2.circularization ├── 3.circular_sequences #circularized genomes ├── 4.samplename_circularized.corrected.fasta ├── 4.samplename_circularized.fasta ├── 4.samplename_circularized.fasta.bam ├── 4.samplename_circularized.fasta.bam.bai ├── 4.samplename_circularized.fasta.fai └── 4.samplename_circularized.fasta.misassemblies.tsv
└── 5.final
 ├── samplename_final.fa
 └── samplename_final.fa.fai

Code Snippets

 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import sys
import os
import argparse
import pysam

fastaf = snakemake.input[0]
out = open(snakemake.output[0], 'w')

#knobs
smooth_gap_width = 150000
contig_edge_margin = 150000
min_smoothed_aln_len = 10000
min_aln_len = 5000

#align
print("Aligning...".format(t=str(snakemake.threads)))
os.system("nucmer -p {delta} -b 4000 -l 2000 --maxmatch {fa} {fa}".format( #-t {threads} back to mummer3, no more multithreading : (
	threads = str(snakemake.threads),
	delta = snakemake.params['delta'],
	fa = fastaf
))
os.system("show-coords -T {delta}.delta -L 2000 | sed '1,5d' > coords.tsv".format(
	delta = snakemake.params['delta']
)) #the 1,5d gets rid of the header and identity hit as well
print("Trimming genome...")

max_tiglen = 0

with open('coords.tsv') as coords:
	lines = coords.readlines()
	smoothed_lines = []
	if len(lines) > 0:
		aln_start_line = lines[0].strip().split("\t")
		prev_line = lines[0].strip().split("\t")

		#goal: identify corner-cutting parallel off-diagonals.
	#	print(prev_line)

		for l in lines[1:] + [lines[0]]: #just so we don't miss the last item
			s = l.strip().split("\t")
			#print(s)

			tigname = s[-1] #store the name of the tig.  this repeats nugatorily.

			if int(s[1]) > max_tiglen: #keep track of the largest alignment coordinate found, used as the tig length
				max_tiglen = int(s[1])

			if int(s[0]) > int(s[1]): #ignore inversions
	#				print('ignoring')
				continue

			if int(s[1]) - int(s[0]) < min_aln_len:
		#		print('ignoring')
				continue

			if abs(int(s[0]) - int(prev_line[1])) < smooth_gap_width and \
				abs(int(s[2]) - int(prev_line[3])) < smooth_gap_width:
			#	print("joining")
				pass
			else:
	#				print("terminating")
				newline = aln_start_line
				newline[1] = prev_line[1]
				newline[3] = prev_line[3]
				if int(newline[1]) - int(newline[0]) > min_smoothed_aln_len:
	#				print("storing")
					smoothed_lines.append(newline)
				aln_start_line = s
			prev_line = s


#print("output")
#for s in smoothed_lines:
#	print(s)

if len(smoothed_lines) > 0:
	#is it a corner cutting parallel off-diagonal?
	if int(smoothed_lines[0][0]) < contig_edge_margin and int(smoothed_lines[0][3]) > max_tiglen - contig_edge_margin:
		if int(smoothed_lines[-1][2]) < contig_edge_margin and int(smoothed_lines[-1][1]) > max_tiglen - contig_edge_margin:
			#out.write("It's overcircularized!")

			out.write(tigname + ":" + smoothed_lines[0][0] + '-' + smoothed_lines[-1][0] + "\n")

out.write("done\n")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
import fileinput
import os

prev = []
prev_is_terminal = False
verbose = True

margin = snakemake.params['margin']

with open(snakemake.output[0], 'w') as oot:
	for l in open(snakemake.input[0]):
		s = l.strip().split("\t")
		if prev == []:
			next
		if s == ['no circularizations']:
			oot.write("no circularizations\n")
			break

		target_tig = s[9]
		spanner = s[10]

		#test for terminal alignment
		r_start = min([int(i) for i in s[0:2]])
		r_end = max([int(i) for i in s[0:2]])
		q_start = min([int(i) for i in s[2:4]])
		q_end = max([int(i) for i in s[2:4]])

		r_len = int(s[7])
		q_len = int(s[8])

		if r_start < margin or r_end > (r_len - margin): #ref is terminal
			if q_start < margin or q_end > (q_len - margin): #query is terminal
				if prev_is_terminal and spanner == prev[10] and target_tig == prev[9]: #spanned event

					#trim when the center of the spanning contig aligns to both ends of the genome (consistent with overcircularization)
					if q_start < prev_q_end:
						trim_bases = prev_q_end - q_start
			#			oot.write('Need to trim ' + str(trim_bases) + " bases.\n")
						oot.write("{c}:1-{e}\n".format(c = target_tig, e = str(r_len-trim_bases)))

					#extend with spanning contig when it closes a gap
					else:
						insert_bases = [q_start, prev_q_end]
						insert_bases.sort()
						insert_string = spanner + ":" + '-'.join([str(i) for i in insert_bases])
						#print(insert_string)
						oot.write(target_tig + "\n")
						oot.write(insert_string + "\n")

					prev_is_terminal = False
					next

				prev_is_terminal=True
			else:
				prev_is_terminal = False
		else:
			prev_is_terminal = False

		#assign prevs
		prev = s
		prev_r_start = r_start
		prev_r_end   = r_end
		prev_q_start = q_start
		prev_q_end = q_end

		prev_r_len = r_len
		prev_q_len = q_len

	oot.write('done\n')
52
53
shell:
	"ln -s {input} {output}"
67
68
69
70
71
72
73
74
75
76
	shell:
		"guppy_basecaller --cpu_threads_per_caller {threads} -i {params.in_dir} -s {params.out_dir} " +
		"--flowcell {fc} --kit {k}" .format(fc=config['flowcell'], k = config['kit'])

rule basecall_final:
	#Collate called bases into a single fastq file
	input: expand('{{sample}}/0.basecall/raw_calls/{foo}/sequencing_summary.txt', foo = [os.path.basename(f).split('.')[0] for f in fast5_files])
	output: '{sample}/0.basecall/{sample}.fq'
	shell:
		"find {sample}/0.basecall/raw_calls/*/*.fastq | xargs cat > {output}"
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
	shell: "NanoPlot --fastq {input} -t {threads} --color green  -o " + "{s}/0.basecall/nanoplots".format(s = sample)

rule assemble_canu:
	#Run canu. This can be run either in distributed fashion on the cluster or in a single job. If run in distributed fashion (usedgrid is set to 'True')
	#then canu must be installed in the user's environment. In this case, the singualarity image is not used. Canu arguments are passed through from
	#the config within the 'canu_args' key. 'gridOptions' are also passed through from the config and instruct Canu on any additional parameters
	#needed to run on the cluster.
	input: rules.basecall_final.output
	output:
		'{sample}/1.assemble/assemble_{genome_size}/{sample}_{genome_size}.contigs.fasta',
		#'{sample}/1.assemble/assemble_{genome_size}/{sample}_{genome_size}.correctedReads.fasta.gz'
	threads: 1
	resources:
		mem=100,
		time=80
	singularity: singularity_image if config['usegrid'] != 'True' and config['usegrid'] != True else '' #switch between image and local environment for canu depending on whether cluster execution is required
	shell:
		"canu -p {sample}_{wildcards.genome_size} -d {sample}/1.assemble/assemble_{wildcards.genome_size}/ -nanopore-raw {input} " +
		config['canu_args'] +
		" stopOnReadQuality=false genomeSize={wildcards.genome_size} " +
		"useGrid={grid} gridOptions='{opts}'".format(
			grid = config['usegrid'],
			opts = config['grid_options']
		)

rule assemble_flye:
	input: rules.basecall_final.output
	output:
		"{sample}/1.assemble/assemble_{genome_size}/assembly.fasta"
121
122
shell:
	"flye -t {threads} --nano-raw {input} --meta -o {wildcards.sample}/1.assemble/assemble_{wildcards.genome_size}/ -g {wildcards.genome_size}"
143
144
145
146
147
148
149
150
151
152
shell:
	"""
    bedtools makewindows -g {input[1]} -w {params.window_width} | join - {input[1]} | tr ' ' '\t' | \
    cut -f1-4 | awk '{{if ($2 > {params.window_width} && $3 < $4 - {params.window_width} && $4 > {params.min_tig_size}) print $0}}' | \
    xargs -P 16 -l bash -c '
     htsbox samview {input[2]} $0:$1-$1 -p | \
     cut -f8,9 | awk "{{if (\$1 < $1 - ({params.window_width}/2) && \$2 > $1 + ({params.window_width}/2)) print \$0}}" | wc -l | \
     paste <(echo $0) <(echo $1) - ' | awk '{{if ($3 < 2) print $0}}
    ' > {output}
	"""
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
shell:
	"""
	if [ -s {input[0]} ] #if the input is nonempty...
	then
		cat {input[0]} | grep -v ^# | sort -k1,1 -k2,2g | join - <(cat {input[1]} | sort -k1,1 -k2,2g) | \
		awk '{{
			if ($1 == prev_tig){{
			 print($1,prev_coord,$2)
			 }}
			else{{
				if (prev_len > 0){{
					print(prev_tig,prev_coord,prev_len)
				}}
				print($1,"1",$2)
				}}
			prev_tig = $1
			prev_coord = $2
			prev_len = $4
			}}
			END {{ print(prev_tig,prev_coord,prev_len)
		}}' | sed "s/\(.*\) \(.*\)\ \(.*\)/\\1:\\2-\\3/g" |  xargs samtools faidx {input[2]} \
		| cut -f1 -d ':' | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output[0]}

		cut -f1 {input[0]} > {sample}/{wildcards.sequence}.tigs.toremove
		grep -vf {sample}/{wildcards.sequence}.tigs.toremove {input[1]} | cut -f1 | xargs samtools faidx {input[2]} >> {output[0]}
		rm {sample}/{wildcards.sequence}.tigs.toremove
	else
		cp {input[2]} {output}
	fi
	"""
210
211
shell:
	"merge_wrapper.py {input} -ml 10000 -c 5 -hco 10; mv merged_out.fasta {output}"
SnakeMake From line 210 of master/Snakefile
216
217
shell:
	"ln -s ../../{input} {output}"
SnakeMake From line 216 of master/Snakefile
232
shell: "sort -k2,2gr {input[1]} | awk '{{if ($2 > {wildcards.contig_cutoff}) print $1}}' | xargs samtools faidx {input[0]} > {output}"
247
248
shell:
	"cat {input} | cut -f1 -d '_' | fold -w 120 | awk '(/^>/ && s[$0]++){{$0=$0\"_\"s[$0]}}1;' > {output}"
SnakeMake From line 247 of master/Snakefile
294
295
296
297
shell:
	"""
	racon -m 8 -x -6 -g -8 -w 500 -t {threads} {input} > {output}
	"""
307
308
309
310
311
shell:
	"""
	mkdir -p {output}
	cut -f1 {input[1]} | xargs -n 1 -I foo sh -c 'touch {output[0]}/foo; samtools faidx {input[0]} foo > {output[1]}/foo.fa'
	"""
325
326
327
328
shell:
	"""
	medaka consensus {input[1]} {output} --model r941_flip213 --threads {threads} --regions {wildcards.range}
	"""
342
343
344
345
shell:
	"""
	medaka stitch {input[0]} {input[1]} {output}
	"""
SnakeMake From line 342 of master/Snakefile
366
367
368
369
shell:
	"""
	cat {params.indir} | cut -f1 -d ':' > {output}
	"""
SnakeMake From line 366 of master/Snakefile
385
386
shell:
	"bwa index {input[0]}; bwa mem -t {threads} {input} | samtools sort --threads {threads} > {output}"
397
398
399
400
401
402
shell:
	"""
	mkdir {output}
	bedtools makewindows -w 100000 -g {input[1]} | awk '{{print $1,\":\", $2+ 1, \"-\", $3}}'  | tr -d ' ' |
	xargs -n 1 -I foo touch {output}/foo
	"""
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
	shell:
		"""
		# set env var $i to be the smallest read subset decimal (in increments of 0.1, with a couple very low values thrown in, too)
    	# sufficient to generate at least 40x coverage depth of the target sequence, or
		# 1 if 40x coverage cannot be achieved with the available read data

		for i in 0.01 0.05 $(seq 0.1 0.1 1);
		do
		   cov=$(samtools view {input[1]} -s $i -h {wildcards.range} | samtools depth - | cut -f3 | awk '{{sum+=$1}}END{{print sum/(NR+1)}}')
		   if [ $(echo $cov'>'{params.target_coverage}|bc) -eq 1 ]
		   then
		       break
		   fi
		done
		echo Using $i x of total coverage;

		samtools view -h -O BAM -s $i {input[1]} {wildcards.range} > {params.bam}
		samtools index {params.bam}
		samtools faidx {input[0]} $(echo {wildcards.range}| cut -f1 -d ':') | cut -f1 -d ':' > {params.fa}
		java -Xmx{params.java_mem}G -jar $(which pilon | sed 's/\/pilon//g')/../share/pilon*/pilon*.jar \
			--genome {params.fa} --targets {wildcards.range} \
			--unpaired {params.bam} --output {sample}_{wildcards.range} --outdir {params.subrun_folder} \
			--vcf --nostrays --mindepth 1
		bgzip {params.subrun_folder}/{sample}_{wildcards.range}.vcf
		tabix -fp vcf {params.subrun_folder}/{sample}_{wildcards.range}.vcf.gz
		"""
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
shell:
	"""
	#workaround!  Snakemake was causing the vcf's to appear newer than the indices, which tabix didn't like
	touch {sample}/2.polish/pilon/sub_runs/*/*.vcf.gz.tbi

	#get properly sorted intervals
	ls {sample}/2.polish/pilon/ranges/ | tr ':-' '\t' | sort -k1,1 -k2,2g | awk '{{print $1,":",$2,"-",$3}}' | tr -d ' ' > sorted_ranges.tmp

	#get header
	(zcat {sample}/2.polish/pilon/sub_runs/*/*.vcf.gz | head -1000 | grep ^#

	#get corrections within each range (omitting DUP records, which bcftools can't understand)
	cat sorted_ranges.tmp | xargs -n 1 -I foo sh -c "
	tabix {sample}/2.polish/pilon/sub_runs/foo/{sample}_foo.vcf.gz foo | grep -v '0/0' | grep -v DUP" |

	#sort by position
	sort -k1,1 -k2,2g) |

	#compress and store
	bgzip > {output} || true

	#index
	tabix -p vcf {output}
	"""
508
509
510
511
shell:
	"""
	bcftools consensus -f {input} -o {output}
	"""
518
519
520
521
shell:
	"""
	cut -f1 -d ':' {input} > {output}
	"""
SnakeMake From line 518 of master/Snakefile
535
536
537
538
539
540
541
shell:
	"""
	#mkdir {output}
	cat {input[1]} | awk '{{if ($2 > {params.min_size}) print $1}}' | xargs -n 1 -I foo sh -c "
		samtools faidx {input[0]} foo > {sample}/3.circularization/1.candidate_genomes/foo.fa
	"
	"""
552
553
554
555
556
557
shell:
	"""
	(samtools idxstats {input[0]} | grep {wildcards.tig} | awk '{{if ($2 > 50000) print $1, ":", $2-50000, "-", $2; else print $1, ":", 1, "-", $2 }}' | tr -d ' ';
	 samtools idxstats {input[0]} | grep {wildcards.tig} | awk '{{if ($2 > 50000) print $1, ":", 1, "-", 50000; else print $1, ":", 1, "-", $2 }}' | tr -d ' ') |
	xargs -I foo sh -c 'samtools view -h {input[0]} foo | samtools fastq - || true' | paste - - - - | sort | uniq | tr '\t' '\n' | bgzip > {output}
	"""
572
573
574
575
shell:
	"""
	flye -t {threads} --nano-raw {input} -o {params.directory} -g 1m
	"""
600
601
602
603
604
605
606
607
608
609
610
shell:
	"""
	nucmer -b 5000 {input[0]} {input[1]} -p {params.directory}/{params.prefix} #-t {threads} #reverted nucmer back down to 3, no more multithreading :(

	delta-filter -q {params.directory}/{params.prefix}.delta > {params.directory}/{params.prefix}.filt.delta

	show-coords -Tq {params.directory}/{params.prefix}.filt.delta | cut -f8,9 | sed '1,3d' | sort | \
	uniq -c | tr -s ' ' '\\t' | cut -f2-99 | grep -v ^1 | cut -f2,3 > {params.directory}/potential_circularizations.tsv || true

	show-coords -Tql {params.directory}/{params.prefix}.filt.delta | grep -f {params.directory}/potential_circularizations.tsv | cat > {output} || true
	"""
619
620
script:
	"scripts/spancircle.py"
SnakeMake From line 619 of master/Snakefile
634
635
636
637
638
639
640
641
642
643
644
645
646
run:
	span_out = open(input[0], 'r').readlines()
	cmd = ''
	if span_out == ['done\n'] or span_out[0].strip() == 'no circularizations': #no circularization occurred
		print('Nothing to do')
	else:
		trim = span_out[0].strip()
		trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n"
		cmd += trim_cmd

		if len(span_out) == 3:
			extend = span_out[1].strip()
			extend_cmd = 'samtools faidx ' + input[3] + ' ' + extend + " | grep -v '>'" + " >> " + params.outfa + "\n"
671
672
script:
	"scripts/encircle.py"
SnakeMake From line 671 of master/Snakefile
686
687
688
689
690
691
692
693
694
695
696
run:
	span_out = open(input[0], 'r').readlines()
	cmd = ''
	if span_out == ['done\n']: #no circularization occurred
		print('No over-circularization')
	else:
		trim = span_out[0].strip()
		trim_cmd = 'samtools faidx ' + input[1] + ' ' + trim + " > " + params.outfa + "\n"
		cmd += trim_cmd

	open(output[0], 'w').write(cmd + '\n')
719
720
721
722
723
724
725
726
727
shell:
	"""
	find {sample}/3.circularization/3.circular_sequences/sh/ -type f | xargs cat | bash
	ls {sample}/3.circularization/3.circular_sequences/ | grep .fa$ | cut -f1-2 -d '_' > circs.tmp || true
	(cat {input[1]} | grep -vf circs.tmp |
	cut -f1 | xargs samtools faidx {input[0]}; ls {sample}/3.circularization/3.circular_sequences/* | grep .fa$ | xargs cat) |
	tr -d '\\n' | sed 's/\\(>[contigscaffold_]*[0-9]*\\)/\\n\\1\\n/g' | fold -w 120 | cut -f1 -d ':' | grep -v '^$' > {output} || true
	rm circs.tmp
	"""
743
shell: "cp {input} {output}"
SnakeMake From line 743 of master/Snakefile
757
758
shell:
	"minimap2 -t {threads} -x map-ont {input} > {output}"
772
773
shell:
	"minimap2 -t {threads} -ax map-ont {input} | samtools sort --threads {threads} > {output}"
782
783
shell:
	"samtools index {input}"
790
shell: "samtools faidx {input}"
ShowHide 25 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/elimoss/lathe
Name: lathe
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...