Structural variant filtering and anlaysis of Nanopore human WGS data.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Structural variant filtering and analysis of Nanopore human WGS data.
Installation instructions
Download the latest code from GitHub:
git clone https://github.com/mike-molnar/nanopore-SV-analysis.git
Dependencies
There are many dependencies so it is best to create a new Conda environment using the provided YAML file:
conda env create -n nanopore-SV-analysis -f nanopore-SV-analysis.yml
conda activate nanopore-SV-analysis
Below is a list of the Conda dependencies:
-
BCFtools
-
bedtools
-
karyoploteR
-
cuteSV
-
Longshot
-
NanoFilt v2.8.0
-
NanoPlot v1.20.0
-
pybedtools
-
PySAM
-
PyVCF
-
seaborn v0.10.0
-
Snakemake
-
Sniffles2
-
SURVIVOR
-
SVIM
-
WhatsHap
-
Winnowmap2
Reference genome
You will need to download the reference genome manually before running the workflow. I have not included the download as part of the workflow because it is designed to run on a cluster that may not have internet access. You can use a local copy of GRCh38 if you have one, but the reference can only contain the autosomes and sex chromosomes, and the chromosomes must be named
chr1, chr2, ...,chrX, chrY
. To download the reference genome and index it, change to the reference directory of the workflow and run the
download_reference.sh
script:
cd /path/to/nanopore-SV-analysis/reference
chmod u+x download_reference.sh
./download_reference.sh
To run on a grid engine
Copy the
Snakefile
and
config.yaml
files to the directory that you want to run the workflow. Modify the information in
config.yaml
for your sample names and FASTQ locations. There are a few different grid engines, so the exact format to run the workflow may be different for your particular grid engine:
snakemake --jobs 500 --rerun-incomplete --keep-going --latency-wait 60 --cluster "qsub -cwd -V -o snakemake.output.log -e snakemake.error.log -q queue_name -P project_name -pe smp {threads} -l h_vmem={params.memory_per_thread} -l h_rt={params.run_time} -b y"
You will have to replace
queue_name
and
project_name
with the necessary values to run on your cluster.
Code Snippets
17 18 19 | shell: "{conda_dir}/NanoPlot {params.scale} -t {threads} -p {wildcards.sample} \ --title {wildcards.sample} --fastq {input} -o {wildcards.sample}/analysis/nanoplot &> {log}" |
37 38 | shell: "{conda_dir}/samtools depth -r {wildcards.chromosomes} {input.bam} > {output} 2> {log}" |
57 58 59 | shell: "{conda_dir}/python {scripts_dir}/coverage_by_window.py -i {input.depth} \ -o {output} -w {params.window} -c {params.coverage} &> {log}" |
74 75 | shell: "cat {input} > {output} 2> {log}" |
94 95 96 | shell: "{conda_dir}/python {scripts_dir}/coverage_by_window.py -i {input.depth} \ -o {output} -w {params.window} -c {params.coverage} &> {log}" |
111 112 | shell: "cat {input} > {output} 2> {log}" |
131 132 133 | shell: "{conda_dir}/python {scripts_dir}/coverage_by_window.py -i {input.depth} \ -o {output} -w {params.window} -c {params.coverage} &> {log}" |
148 149 | shell: "cat {input} > {output} 2> {log}" |
168 169 170 | shell: "{conda_dir}/python {scripts_dir}/coverage_by_window.py -i {input.depth} \ -o {output} -w {params.window} -c {params.coverage} &> {log}" |
185 186 | shell: "cat {input} > {output} 2> {log}" |
205 206 207 | shell: "{conda_dir}/python {scripts_dir}/coverage_by_window.py -i {input.depth} \ -o {output} -w {params.window} -c {params.coverage} &> {log}" |
222 223 | shell: "cat {input} > {output} 2> {log}" |
240 241 242 | shell: "{conda_dir}/Rscript {scripts_dir}/karyoploter_coverage.R {input.normal} {input.sample} \ {output} {params.genome} {wildcards.normal} {wildcards.sample} &> {log}" |
259 260 261 | shell: "{conda_dir}/Rscript {scripts_dir}/karyoploter_coverage_small_chr.R {input.normal} {input.sample} \ {output} {params.genome} {wildcards.normal} {wildcards.sample} &> {log}" |
278 279 280 | shell: "{conda_dir}/Rscript {scripts_dir}/karyoploter_coverage_large_chr.R {input.normal} {input.sample} \ {output} {params.genome} {wildcards.normal} {wildcards.sample} &> {log}" |
299 300 301 302 | shell: "{conda_dir}/Rscript {scripts_dir}/karyoploter_coverage_chr.R {input.normal_cov} {input.sample_cov} \ {input.normal_baf} {input.sample_baf} {wildcards.normal} {wildcards.sample} {params.genome} \ {wildcards.chromosomes} {output} &> {log}" |
331 332 333 334 335 336 | shell: "{conda_dir}/python {scripts_dir}/filter_SVs.py -ins {input.INS} -del {input.DEL} -dup {input.DUP} \ -inv {input.INV} -trans {input.trans} -depth {input.depth} -split {input.split_alignments} \ -low_map {input.low_map} -gaps {genome_gaps} -ins_out {output.INS} -del_out \ {output.DEL} -dup_out {output.DUP} -inv_out {output.INV} -trans_out {output.trans} -cnv_out \ {output.CNVs} -cov {params.coverage} &> {log}" |
17 18 19 | shell: "{conda_dir}/NanoFilt --logfile {log} -q {params.quality} -l {params.min_length} \ {input} | {conda_dir}/bgzip -c > {output} 2>> {log}" |
35 36 | shell: "{conda_dir}/bgzip -r {input} 2> {log}" |
55 56 57 58 | shell: "{conda_dir}/winnowmap {params.preset_options} {params.include_MD_tag} \ -t {threads} -W {high_frequency_kmers} {reference} {input.fastq} 2> {log} | \ {conda_dir}/samtools sort -o {output} &>> {log}" |
74 75 | shell: "{conda_dir}/samtools index {input} &> {log}" |
93 94 95 | shell: "{conda_dir}/longshot {params.thresholds} -r {wildcards.regions} --bam {input.bam} \ --ref {reference} --out {output} &> {log}" |
111 112 | shell: "{conda_dir}/bgzip --threads {threads} -c {input} > {output} 2> {log}" |
129 130 | shell: "{conda_dir}/tabix -p {params.file_type} {input} &> {log}" |
149 150 151 | shell: "{conda_dir}/bcftools concat {params.concat} {input.zip} 2> {log} | \ {conda_dir}/bcftools sort {params.sort} -o {output} &>> {log}" |
168 169 | shell: "{conda_dir}/tabix -p {params.file_type} {input} &> {log}" |
190 191 192 | shell: "{conda_dir}/whatshap haplotag {params.RG} {params.noSNVs} --regions {wildcards.regions} \ --reference {reference} {input.vcf} {input.bam} 2> {log} | {conda_dir}/samtools sort -o {output} &>> {log}" |
208 209 | shell: "{conda_dir}/samtools merge {output} {input} &> {log}" |
225 226 | shell: "{conda_dir}/samtools index {input} &> {log}" |
245 246 247 | shell: "zcat {input.zip} 2> {log} | {conda_dir}/bcftools view -f {params.view_param} \ 2>> {log} | {conda_dir}/bcftools query -f {params.query_param} > {output} 2>> {log}" |
263 264 | shell: "{conda_dir}/python {scripts_dir}/calculate_b_allele_frequency.py -i {input} -o {output} 2> {log}" |
282 283 284 | shell: "{conda_dir}/samtools view -h {input.bam} 2> {log} | \ awk '$5<{params.quality} {{print $0}}' > {output} 2>> {log}" |
300 301 302 | shell: "{conda_dir}/samtools view -S -b -h {input} 2> {log} | \ {conda_dir}/samtools depth - > {output} 2>> {log}" |
322 323 324 325 | shell: "{conda_dir}/SURVIVOR bincov {input} {params.distance} {params.min_coverage} 2> {log} | \ {conda_dir}/bedtools slop -i stdin -g {chromosome_sizes} -b {params.slop} 2>> {log} | \ {conda_dir}/bedtools merge -i stdin -d {params.merge_length} > {output} 2>> {log}" |
21 22 23 24 25 | shell: "{conda_dir}/cuteSV --diff_ratio_merging_DEL {params.diff_ratio_merging_DEL} \ --min_support {params.min_support} --max_split_parts {params.max_splits} \ --threads {threads} --max_cluster_bias_DEL {params.max_cluster_bias_DEL} {input.bam} \ {reference} {output} {params.working_dir} &> {log}" |
41 42 | shell: "{conda_dir}/python {scripts_dir}/vcf2bedpe.py -i {input} -o {output} &> {log}" |
62 63 64 | shell: "{conda_dir}/sniffles --input {input.bam} --snf {output.snf} --vcf {output.vcf} --non-germline \ --minsupport {params.min_reads} --max-splits-base {params.max_splits} -t {threads} &> {log}" |
80 81 | shell: "{conda_dir}/python {scripts_dir}/vcf2bedpe.py -i {input} -o {output} &> {log}" |
99 100 | shell: "{conda_dir}/svim alignment {params.working_dir} {input.bam} {reference} &> {log}" |
116 117 | shell: "{conda_dir}/python {scripts_dir}/vcf2bedpe.py -i {input} -o {output} &> {log}" |
136 137 138 | shell: "awk {params.awk} {input} 2> {log} | awk {params.fix_columns} 2>> {log} | \ awk {params.remove_duplicates} > {output} 2>> {log}" |
165 166 167 168 | shell: "awk {params.awk_INS} {input.cuteSV} {input.sniffles} {input.svim} 2> {log} | \ sort {params.sort} 2>> {log} | {conda_dir}/bedtools merge -i stdin -d {params.merge_length} \ {params.columns} {params.output_type} > {output} 2>> {log}" |
191 192 193 194 | shell: "awk {params.awk_DEL} {input.cuteSV} {input.sniffles} {input.svim} 2> {log} | \ sort {params.sort} 2>> {log} | {conda_dir}/bedtools merge -i stdin -d {params.merge_length} \ {params.columns} {params.output_type} > {output} 2>> {log}" |
218 219 220 221 222 | shell: "grep {params.grep_DUP} {input.cuteSV} {input.sniffles} {input.svim} 2> {log} | \ cut {params.select_columns} 2>> {log} | sort {params.sort} 2>> {log} | \ {conda_dir}/bedtools merge -i stdin -d {params.merge_length} {params.columns} \ {params.output_type} > {output} 2>> {log}" |
246 247 248 249 250 | shell: "grep {params.grep_INV} {input.cuteSV} {input.sniffles} {input.svim} 2> {log} | \ cut {params.select_columns} 2>> {log} | sort {params.sort} 2>> {log} | \ {conda_dir}/bedtools merge -i stdin -d {params.merge_length} {params.columns} \ {params.output_type} > {output} 2>> {log}" |
272 273 274 275 | shell: "awk {params.awk_BND_TRA} {input.cuteSV} {input.sniffles} {input.svim} 2> {log} | \ awk {params.fix_columns} 2>> {log} | cut {params.select_columns} 2>> {log} | \ sort {params.sort} > {output} 2>> {log}" |
292 293 294 | shell: "{conda_dir}/python {scripts_dir}/merge_split_alignments.py \ --read-to-reference-bam {input.bam} --output-bedpe {output} &> {log}" |
314 315 316 | shell: "sort {params.sort} {input} 2> {log} | {conda_dir}/bedtools merge -i stdin \ -d {params.merge_length} -c {params.columns} -o {params.output_type} > {output} 2>> {log}" |
Support
- Future updates