Nanopore long-read DNA sequencing data analysis pipeline

public public 1yr ago 0 bookmarks

Description

This snakemake pipeline is for Nanopore long-read DNA sequencing data analysis. The pipeline may be run on an HPC or in a local environment.

Major steps in this workflow include:

  1. Modified basing calling using Megalodon

  2. Structural variant calling using Nanopore pipeline-structural-variation

  3. Haplotype-aware variant calling using PEPPER-Margin-DeepVariant

Software Requirements

User's guide

I. Input requirements

  • Edited config/config.yaml

  • Demultiplexed Nanopore fast5 data

  • Reference genome file

II. Editing the config.yaml

Basic parameters:

  • raw: Path to the directory where demultiplexed fast5 data is stored. The name of sub-directory for each sample should be in this format: {sampleID}_fast5

  • out: Output directory, default without input: "output" in working directory

  • reference: Path to primary reference genome fasta file

  • pepper: Path to the directory where PEPPER-Margin-DeepVariant singularity is downloaded

  • bind: Singularity bind mounting

  • snpeff: Path to SNPEFF installed location

Parameters for modified base calling:

  • outputs: Desired outputs

  • base_conv: modified base conversion

  • motif: Restrict modified base results to the specified motifs

  • thresh: Hard threshold for modified base aggregation (probability of modified/canonical base), default 0.75

  • config: Guppy model config file

Parameters for SV calling:

  • min_sv_length: Minimum SV length

  • max_sv_length: Maximum SV length

  • min_read_length: Minimum read length

  • min_read_mapping_quality: Minimum mapping quality

  • min_read_support: Minimum read support required to call a SV (auto for auto-detect)

  • snpsift: Run snpsift filtering or not

  • snpsift_filter: SnpSift filtering parameters if the input of the above parameter is "yes"

  • target_bed: (Optional) BED file containing targeted regions where SVs will only be called

Parameters for Haplotype-aware variant calling:

  • snpsift2: Run snpsift filtering or not

  • snpsift_filter2: SnpSift filtering parameters if the input of the above parameter is "yes"

Optional parameters to run the full analysis on HPV genomes:

(This function was originally designed for the HPV project, however, it can be applied to any secondary genome analysis of any species with references.)

  • hpv: Run the analysis on HPV genomes or not

  • hpv_ref: Path to HPV reference genome fasta file

  • snpeff_hpv: Name of the pre-build HPV snpEff annotation

  • snpeff_config: Path to the HPV snpEff config file

III. To run

  • Clone the repository to your working directory

    git clone https://github.com/NCI-CGR/Nanopore_DNA-seq.git
    
  • Install required software

  • Edit and save config/config.yaml

  • To run on an HPC using slurm job scheduler like NIH Biowulf (Pascal GPU or higher architecture required):

    Edit config/cluster_config.yaml according to your HPC information and adjust the computing resources as needed. Run sbatch.sh to initiate pipeline running.

    bash sbatch.sh
    
  • To run in a local environment:

    snakemake -p --cores 16 --keep-going --rerun-incomplete --jobs 300 --latency-wait 120 all
    
  • Look in log directory for logs for each rule

  • To view the snakemkae rule graph:

    snakemake --rulegraph | dot -T png > nanopore.png
    

    dag

IV Output directory structure

{output directory}
├── mbc # Modified base calling on primary (eg human) genome ├── basic # using the default model  └── {sampleID} └── model_based # using a specific model └── {sampleID}
├── mbc_hpv # Modified base calling on secondary (eg HPV) genome ├── basic # using the default model  └── {sampleID} └── model_based # using a specific model └── {sampleID}
├── sv # SV calling on primary genome └── {sampleID}
├── sv_hpv # SV calling on secondary genome └── {sampleID}
├── vc # Haplotype-aware variant calling on primary genome └── {smapleID}
└── vc_hpv # Haplotype-aware variant calling on secondary genome
 └── {smapleID}

Code Snippets

15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
shell:
     """
     megalodon {input[0]} \
               --guppy-server-path ${{MEGALODON_GUPPY_PATH}} \
               --outputs {params.outputs} \
               --output-directory {out}mbc/basic/{wildcards.sample} \
               --overwrite \
               --reference {input[1]} \
               --mod-map-emulate-bisulfite \
               --mod-map-base-conv {params.base_conv} \
               --mod-map-base-conv Z C \
               --mod-motif m CG 0 \
               --mod-binary-threshold {params.thresh} \
               --devices ${{CUDA_VISIBLE_DEVICES}} \
               --processes ${{SLURM_CPUS_PER_TASK}} 2>log/{wildcards.sample}_basic.err
     touch {output}
     """
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
shell:
      """
      megalodon {input[0]} \
                --guppy-server-path ${{MEGALODON_GUPPY_PATH}} \
                --guppy-config {params.config} \
                --outputs {params.outputs} \
                --overwrite \
                --output-directory {out}mbc/model_based/{wildcards.sample} \
                --reference {input[1]} \
                --mod-map-emulate-bisulfite \
                --mod-map-base-conv {params.base_conv} \
                --mod-map-base-conv Z C \
                --mod-motif m CG 0 \
                --mod-binary-threshold {params.thresh} \
                --devices ${{CUDA_VISIBLE_DEVICES}} \
                --processes ${{SLURM_CPUS_PER_TASK}} 2>log/{wildcards.sample}_model.err
      touch {output}
      """
72
73
74
75
76
77
78
shell:
      """
      samtools sort {out}mbc/basic/{wildcards.sample}/mappings.bam > {out}mbc/basic/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_basic_bam_sort1.err
      samtools index {out}mbc/basic/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_basic_bam_index1.err
      samtools sort {out}mbc/basic/{wildcards.sample}/mod_mappings.5mC.bam > {out}mbc/basic/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_basic_bam_sort3.err
      samtools index {out}mbc/basic/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_basic_bam_index3.err
      """
88
89
90
91
92
93
94
shell:
      """
      samtools sort {out}mbc/model_based/{wildcards.sample}/mappings.bam > {out}mbc/model_based/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_model_based_bam_sort1.err
      samtools index {out}mbc/model_based/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_model_based_bam_index1.err
      samtools sort {out}mbc/model_based/{wildcards.sample}/mod_mappings.5mC.bam > {out}mbc/model_based/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_model_based_bam_sort3.err
      samtools index {out}mbc/model_based/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_model_based_bam_index3.err
      """
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
shell:
     """
     megalodon {input[0]} \
               --guppy-server-path ${{MEGALODON_GUPPY_PATH}} \
               --outputs {params.outputs} \
               --output-directory {out}mbc_hpv/basic/{wildcards.sample} \
               --overwrite \
               --reference {input[1]} \
               --mod-map-emulate-bisulfite \
               --mod-map-base-conv {params.base_conv} \
               --mod-map-base-conv Z C \
               --mod-motif m CG 0 \
               --mod-binary-threshold {params.thresh} \
               --devices ${{CUDA_VISIBLE_DEVICES}} \
               --processes ${{SLURM_CPUS_PER_TASK}} 2>log/{wildcards.sample}_basic_hpv.err
     touch {output}
     """
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
shell:
      """
      megalodon {input[0]} \
                --guppy-server-path ${{MEGALODON_GUPPY_PATH}} \
                --guppy-config {params.config} \
                --outputs {params.outputs} \
                --overwrite \
                --output-directory {out}mbc_hpv/model_based/{wildcards.sample} \
                --reference {input[1]} \
                --mod-map-emulate-bisulfite \
                --mod-map-base-conv {params.base_conv} \
                --mod-map-base-conv Z C \
                --mod-motif m CG 0 \
                --mod-binary-threshold {params.thresh} \
                --devices ${{CUDA_VISIBLE_DEVICES}} \
                --processes ${{SLURM_CPUS_PER_TASK}} 2>log/{wildcards.sample}_model_hpv.err
      touch {output}
      """
72
73
74
75
76
77
78
shell:
      """
      samtools sort {out}mbc_hpv/basic/{wildcards.sample}/mappings.bam > {out}mbc_hpv/basic/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_basic_bam_sort1_hpv.err
      samtools index {out}mbc_hpv/basic/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_basic_bam_index1_hpv.err
      samtools sort {out}mbc_hpv/basic/{wildcards.sample}/mod_mappings.5mC.bam > {out}mbc_hpv/basic/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_basic_bam_sort3_hpv.err
      samtools index {out}mbc_hpv/basic/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_basic_bam_index3_hpv.err
      """
88
89
90
91
92
93
94
shell:
      """
      samtools sort {out}mbc_hpv/model_based/{wildcards.sample}/mappings.bam > {out}mbc_hpv/model_based/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_model_based_bam_sort1_hpv.err
      samtools index {out}mbc_hpv/model_based/{wildcards.sample}/mappings_sorted.bam 2>log/{wildcards.sample}_model_based_bam_index1_hpv.err
      samtools sort {out}mbc_hpv/model_based/{wildcards.sample}/mod_mappings.5mC.bam > {out}mbc_hpv/model_based/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_model_based_bam_sort3_hpv.err
      samtools index {out}mbc_hpv/model_based/{wildcards.sample}/mod_mappings.5mC_sorted.bam 2>log/{wildcards.sample}_model_based_bam_index3_hpv.err
      """
24
25
26
27
shell:
      """
      lra index -ONT {input} 2>log/index_lra.err
      """
40
41
42
43
shell:
      """
      catfishq -r {out}mbc/basic/{wildcards.sample}/ | seqtk seq -A - | lra align -ONT -t 16 {input.REF} - -p s | samtools addreplacerg -r \"@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}\" - | samtools sort -@ 16 -T {wildcards.sample} -O BAM -o {output.BAM} - && samtools index -@ 16 {output.BAM} 2>log/{wildcards.sample}_lra_map.err
      """
60
61
62
63
64
65
66
67
68
69
70
71
shell:
      """
      cuteSV -t 16 \
             --min_size {params.min_size} \
             --max_size  {params.max_size} \
             -S {wildcards.sample} \
             --retain_work_dir \
             --report_readid \
             --min_support {params.min_read_support} \
             --genotype {input.BAM} {input.REF} {output.VCF} \
             {out}sv/{wildcards.sample}/sv_calls/ 2>log/{wildcards.sample}_sv_call.err
      """
82
83
84
85
86
shell:
      """
      mkdir -p {output.DIR}
      mosdepth -x -t 16 -n -b {params.BED} {output.DIR}/{wildcards.sample} {input.BAM} 2>log/{wildcards.sample}_depth.err
      """
111
112
113
114
shell:
      """
      vcfsort {input.VCF} > {output.VCF} 2>log/{wildcards.sample}_sort_vcf.err
      """
123
124
125
126
shell:
      """
      cat {input.VCF} | bgziptabix {output.VCF} 2>log/{wildcards.sample}_index_vcf.err
      """
135
136
137
138
139
140
141
142
143
144
shell:
      """
      NanoPlot -t 16 \
               --bam {input.BAM} \
               --raw -o {output.DIR} \
               -p {wildcards.sample}_ \
               --N50 \
               --title {wildcards.sample} \
               --downsample 100000 2>log/{wildcards.sample}_qc.err
      """
152
153
154
155
156
shell:
      """
      cd {out}sv/{wildcards.sample}/sv_calls/ 
      java -Xmx16g -jar $SNPEFF_JAR -v hg38 {input.VCF} > {output.VCF} 2>{run}log/{wildcards.sample}_snpeff.err
      """
166
167
168
169
170
run:
      if config["snpsift"] == "yes":
         shell("""cat {input.VCF} | java -jar $SNPSIFT_JAR filter "{params.filter}" > {output.VCF} 2>log/{wildcards.sample}_snpsift.err""")
      else:
           shell("""touch {output.VCF}""")
28
29
30
31
shell:
      """
      lra index -ONT {input} 2>log/index_lra_hpv.err
      """
44
45
46
47
shell:
      """
      catfishq -r {out}mbc_hpv/basic/{wildcards.sample}/ | seqtk seq -A - | lra align -ONT -t 16 {input.REF} - -p s | samtools addreplacerg -r \"@RG\tID:{wildcards.sample}\tSM:{wildcards.sample}\" - | samtools sort -@ 16 -T {wildcards.sample} -O BAM -o {output.BAM} - && samtools index -@ 16 {output.BAM} 2>log/{wildcards.sample}_lra_map_hpv.err
      """
64
65
66
67
68
69
70
71
72
73
74
75
shell:
      """
      cuteSV -t 16 \
             --min_size {params.min_size} \
             --max_size  {params.max_size} \
             -S {wildcards.sample} \
             --retain_work_dir \
             --report_readid \
             --min_support {params.min_read_support} \
             --genotype {input.BAM} {input.REF} {output.VCF} \
             {out}sv_hpv/{wildcards.sample}/sv_calls/ 2>log/{wildcards.sample}_sv_call_hpv.err
      """
86
87
88
89
90
shell:
      """
      mkdir -p {output.DIR}
      mosdepth -x -t 16 -n -b {params.BED} {output.DIR}/{wildcards.sample} {input.BAM} 2>log/{wildcards.sample}_depth_hpv.err
      """
115
116
117
118
shell:
      """
      vcfsort {input.VCF} > {output.VCF} 2>log/{wildcards.sample}_sort_vcf_hpv.err
      """
127
128
129
130
shell:
      """
      cat {input.VCF} | bgziptabix {output.VCF} 2>log/{wildcards.sample}_index_vcf_hpv.err
      """
138
139
140
141
142
143
144
145
146
147
shell:
      """
      NanoPlot -t 16 \
               --bam {input.BAM} \
               --raw -o {output.DIR} \
               -p {wildcards.sample}_ \
               --N50 \
               --title {wildcards.sample} \
               --downsample 100000 2>log/{wildcards.sample}_qc_hpv.err
      """
155
156
157
158
159
shell:
      """
      cd {out}sv_hpv/{wildcards.sample}/sv_calls/
      java -Xmx16g -jar {snpeff} -c {snpeff_config} -v {snpeff_hpv} {input.VCF} > {output.VCF} 2>{run}log/{wildcards.sample}_snpeff_hpv.err
      """
12
13
14
15
shell:
      """
      minimap2 -ax map-ont {input[1]} {out}mbc/basic/{wildcards.sample}/basecalls.fastq > {output} 2> log/{wildcards.sample}_minimap.err
      """
24
25
26
27
28
shell:
      """
      samtools view -b {input} | samtools sort > {output[0]} 2>log/{wildcards.sample}_minimap_sort.err
      samtools index {output} 2>log/{wildcards.sample}_minimap_index.err
      """
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
shell:
     """
     {bind}
     singularity exec \
     {pepper}/pepper_deepvariant_r0.4.sif \
     run_pepper_margin_deepvariant call_variant \
               -b {input[0]} \
               -f {input[1]} \
               -p {wildcards.sample}_PEPPER_Margin_DeepVariant \
               -o {out}vc/{wildcards.sample} \
               -t 30 \
               --ont \
               --phased_output 2>log/{wildcards.sample}_pmd.err
     touch {output[1]}
     """
64
65
66
67
68
shell:
      """
      cd {out}vc/{wildcards.sample}/
      java -Xmx16g -jar $SNPEFF_JAR -v hg38 {input} > {output} 2>{run}log/{wildcards.sample}_snpeff2.err
      """
78
79
80
81
82
run:
      if config["snpsift2"] == "yes":
         shell("""cat {input} | java -jar $SNPSIFT_JAR filter "{params.filter}" > {output} 2>log/{wildcards.sample}_snpsift2.err""")
      else:
           shell("""touch {output}""")
90
91
92
93
94
shell:
      """
      cd {out}vc/{wildcards.sample}/filter/
      java -Xmx16g -jar $SNPEFF_JAR -v hg38 {input} > {output} 2>{run}log/{wildcards.sample}_snpeff3.err
      """
16
17
18
19
shell:
      """
      minimap2 -ax map-ont {input[1]} {out}mbc_hpv/basic/{wildcards.sample}/basecalls.fastq > {output} 2> log/{wildcards.sample}_minimap_hpv.err
      """
28
29
30
31
32
shell:
      """
      samtools view -b {input} | samtools sort > {output[0]} 2>log/{wildcards.sample}_minimap_sort_hpv.err
      samtools index {output} 2>log/{wildcards.sample}_minimap_index_hpv.err
      """
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
shell:
     """
     {bind}
     singularity exec \
     {pepper}/pepper_deepvariant_r0.4.sif \
     run_pepper_margin_deepvariant call_variant \
               -b {input[0]} \
               -f {input[1]} \
               -p {wildcards.sample}_PEPPER_Margin_DeepVariant \
               -o {out}vc_hpv/{wildcards.sample} \
               -t 16 \
               --ont \
               --phased_output 2>log/{wildcards.sample}_pmd_hpv.err
     touch {output}
     """
65
66
67
68
69
70
shell:
      """
      cd {out}vc_hpv/{wildcards.sample}/

      java -Xmx16g -jar {snpeff} -c {snpeff_config} -v {snpeff_hpv} {params.vcf} > {output} 2>{run}log/{wildcards.sample}_snpeff2_hpv.err
      """
ShowHide 28 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/NCI-CGR/Nanopore_DNA-seq
Name: nanopore_dna-seq
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...