Snakemake workflow for the vg toolkit.


This repository contains a Snakemake workflow for the vg toolkit. At this stage it simply hosts a usable Snakefile and config.yaml. With it, you can build a variation graph from a reference FASTA file and variants in a VCF file, map reads to the graph, and genotype those variants.

For now, this workflow has been used locally or on a large AWS/Kubernetes instance. We will soon test it on an HPC system and update the examples below.

This workflow was tested on vg 1.21.0.
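
If you need a matching vg build, one option is Bioconda (a sketch; any installation method that puts vg on your PATH works):

conda install -c bioconda vg=1.21.0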

Input files

To use the workflow, first copy the Snakefile and config.yaml files into your working directory.
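
For example, starting from a clone of the repository (a minimal sketch):

git clone https://github.com/vgteam/vg_snakemake.git
cp vg_snakemake/Snakefile vg_snakemake/config.yaml .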

Graph construction

For construction, two files are necessary in the working directory:

  1. a reference FASTA file named {GENOME}.fa

  2. a VCF file called {VARIANTS}.vcf.gz (sorted and indexed)

The {GENOME} and {VARIANTS} labels should match the graph label in the configuration (the config.yaml file or the command line). The corresponding graph label is {GENOME}-{VARIANTS}. These labels must not contain the character -.
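
If your VCF is not yet sorted and indexed, one way to prepare it is with bcftools and tabix (a sketch, assuming an unsorted input.vcf and the hgsvc label used below):

bcftools sort -Oz -o hgsvc.vcf.gz input.vcf   # sort and bgzip-compress
tabix -p vcf hgsvc.vcf.gz                     # create the .tbi index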

Mapping and genotyping samples

The reads for each sample should be in a folder named after the sample. In the {sample} folder, FASTQ files should be named {sample}_1.fastq.gz and {sample}_2.fastq.gz. Sample names must not contain the character -.
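
For example, a sample named SAMP1 would be laid out as:

SAMP1/
├── SAMP1_1.fastq.gz
└── SAMP1_2.fastq.gz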

Examples

The two rules used most in practice are:

  1. construct_all to build the graph and all the necessary indexes for mapping and variant calling.

  2. genotype to map reads and genotype samples.

See an example on a small dataset in the testdata folder.

Construct a variation graph

To build a graph from hg38.fa and hgsvc.vcf.gz, you would set graph: "hg38-hgsvc" in config.yaml and run:

snakemake --configfile config.yaml --resources mem_mb=200000 --cores 16 construct_all

Or specify the graph label on the command line:

snakemake --configfile config.yaml --config graph="hg38-hgsvc" --resources mem_mb=200000 --cores 16 construct_all

Note the --resources mem_mb=200000 --cores 16 arguments, which specify the maximum amount of memory and number of cores to use.
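
Before a long run, you can preview the jobs that would be executed using Snakemake's standard dry-run flag:

snakemake --configfile config.yaml --cores 16 -n construct_all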

Map reads and genotype variants

Let's assume the graph indexes have been constructed (see above) and you want to analyze reads for two samples in SAMP1/SAMP1_*.fastq.gz and SAMP2/SAMP2_*.fastq.gz.

snakemake --configfile config.yaml --config samples="SAMP1 SAMP2" --resources mem_mb=100000 --cores 16 genotype

(This assumes that the graph label, e.g. hg38-hgsvc, is defined in the config.yaml file.)
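
Both overrides can also be combined on the command line (a sketch merging the two examples above):

snakemake --configfile config.yaml --config graph="hg38-hgsvc" samples="SAMP1 SAMP2" --resources mem_mb=100000 --cores 16 genotype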

Overview of the workflow

We can easily visualize the workflow with Snakemake. Here we use the small test data with two chromosomes, one sample, and reads split into two chunks (more info in the testdata folder).
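
The figures below can be regenerated with Snakemake's built-in graph outputs piped through Graphviz, e.g.:

snakemake --configfile config.yaml --rulegraph genotype | dot -Tpng > rulegraph.png
snakemake --configfile config.yaml --dag genotype | dot -Tpng > dag.png
snakemake --configfile config.yaml --filegraph genotype | dot -Tpng > filegraph.png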

Rule graph

DAG of jobs

File graph

Overview of the workflow when using the Giraffe mapper

The newer and faster mapper, Giraffe, uses different graph indexes.
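
A sketch of a Giraffe run; the mapper key below is hypothetical, so check config.yaml for the actual option name:

# 'mapper=gaffe' is a hypothetical config override; see config.yaml for the real key
snakemake --configfile config.yaml --config mapper=gaffe samples="SAMP1" --resources mem_mb=100000 --cores 16 genotype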

Rule graph

DAG of jobs

File graph

FAQ

How do I specify the resources?

When running locally or on an instance, we use --resources mem_mb=200000 --cores 16 to specify the maximum memory and cores to use.

To run on an HPC, you would need to add the HPC-specific configuration (more details soon).

The current resources listed in config.yaml were tuned for a human genome.

Where can I run this?

You can run this on any resource supported by Snakemake, for example locally or on an HPC.

Can I use the workflow for non-human organisms?

Yes, just remember to update the chromosome names in config.yaml.
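
A sketch of such an edit, assuming the chromosome list sits under a key like chrs (the key name is hypothetical; check config.yaml for the actual one):

# hypothetical key name and example (Drosophila) chromosomes; adapt to your config.yaml
chrs: "2L 2R 3L 3R 4 X"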

Why are there AWS/S3 commands in the workflow?

We sometimes run the workflow on an AWS instance and use these commands to save the outputs in an S3 bucket. If you don't want to do this, use the default config value s3save: False.
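
For reference, the relevant config.yaml entry (the s3save name is from the text; the comment describes the behavior visible in the snippets below):

# when True, outputs are copied to the S3 root (SROOT) after each rule
s3save: False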

Code Snippets

# clean-up rule body: remove the file or folder given in params.ff
shell:
    "rm -rf {params.ff}"
From line 59 of master/Snakefile
run:
    # build the graph for one chromosome from the reference FASTA and VCF,
    # then sort the node IDs
    shell('(vg construct -t {threads} -r {input.ref} -v {input.vcf} -a -f -S -R {wildcards.chr} -C | vg ids --sort - > {output}) 2> {log}')
    if config['s3save']:
        # optionally copy the output to the S3 root
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 76 of master/Snakefile
run:
    # join the node ID spaces of the per-chromosome graphs, saving the ID mapping
    shell('vg ids --join --mapping {output.mapping} {input} 2> {log}')
    if config['s3save']:
        shell("for ff in {input}; do aws s3 cp --quiet $ff {SROOT}/; done")
        shell("aws s3 cp --quiet {output.mapping} {SROOT}/")
From line 89 of master/Snakefile
run:
    # build the xg index from the joined graphs
    shell('vg index -L -t {threads} -x {output} {input.vg} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 106 of master/Snakefile
run:
    # compute the snarls of the graph
    shell('vg snarls -t {threads} {input} > {output} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 120 of master/Snakefile
run:
    # prune complex regions before GCSA indexing, restoring the reference paths
    shell('vg prune -t {threads} -M 32 --restore-paths {input.vg} > {output} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 134 of master/Snakefile
run:
    # build the GCSA index (and its LCP companion) using a temporary directory
    shell('mkdir -p {params.tmp_dir}')
    shell('vg index --temp-dir {params.tmp_dir} -p -t {threads} -g {output.gcsa} {input} 2> {log}')
    shell('rm -r {params.tmp_dir}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.gcsa} {SROOT}/")
        shell("aws s3 cp --quiet {output.gcsalcp} {SROOT}/")
From line 152 of master/Snakefile
run:
    # build a GBWT and the matching GBWT graph (.gg) from the xg
    shell('vg gbwt -n {wildcards.n} -g {output.gg} -o {output.gbwt} -x {input} -P 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.gbwt} {SROOT}/")
        shell("aws s3 cp --quiet {output.gg} {SROOT}/")
From line 171 of master/Snakefile
run:
    # build the minimizer index from the GBWT and xg
    shell('vg minimizer -k {wildcards.k} -w {wildcards.w} -t {threads} -i {output} -g {input.gbwt} {input.xg} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 187 of master/Snakefile
run:
    # compute snarls, including trivial ones
    shell('vg snarls -t {threads} --include-trivial {input} > {output} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 200 of master/Snakefile
run:
    # build the distance index from the xg and snarls
    shell('vg index -t {threads} -j {output} -x {input.xg} -s {input.snarls} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 215 of master/Snakefile
run:
    # split the first-mate FASTQ into chunks of nb_split_reads reads each
    NLINES = config['nb_split_reads'] * 4  # 4 FASTQ lines per read
    shell('mkdir -p {output.dir}')
    shell("gzip -cd {input} | split -d -l {NLINES} --filter='pigz -p {threads} > ${{FILE}}.fastq.gz' - \"{output.dir}/{wildcards.sample}_1.part\"")
From line 233 of master/Snakefile
run:
    # same chunking for the second-mate FASTQ
    NLINES = config['nb_split_reads'] * 4
    shell('mkdir -p {output.dir}')
    shell("gzip -cd {input} | split -d -l {NLINES} --filter='pigz -p {threads} > ${{FILE}}.fastq.gz' - \"{output.dir}/{wildcards.sample}_2.part\"")
From line 243 of master/Snakefile
run:
    # map paired-end reads with vg map
    shell("vg map -t {threads} -x {input.xg} -g {input.gcsa} -f {input.r1} -f {input.r2} > {output} 2> {log}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{output}")
From line 273 of master/Snakefile
run:
    # map paired-end reads with vg mpmap, writing single-path alignments (-S)
    shell("vg mpmap -S -t {threads} -x {input.xg} -g {input.gcsa} -f {input.r1} -f {input.r2} > {output} 2> {log}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{output}")
From line 292 of master/Snakefile
run:
    # map paired-end reads with Giraffe (vg gaffe), using the minimizer,
    # distance, and GBWT indexes
    shell("vg gaffe -p -t {threads} -m {input.min} -d {input.dist} --gbwt-name {input.gbwt} -x {input.xg} -N {wildcards.sample} -f {input.r1} -f {input.r2} > {output} 2> {log}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{output}")
From line 312 of master/Snakefile
run:
    # concatenate the per-chunk GAM files into one alignment file per sample
    shell("cat {input} > {output}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
From line 331 of master/Snakefile
run:
    # compute read support (pack), keeping alignments above the minq quality cutoff
    shell("vg pack -x {input.xg} -g {input.gam} -Q {wildcards.minq} -t {threads} -o {output} 2> {log}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
From line 348 of master/Snakefile
run:
    # call variants from the pack support, then sort, compress, and index the VCF
    shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
    shell("bcftools sort {params.tmp_raw_vcf} | bgzip > {output.vcf}")
    shell("tabix -f -p vcf {output.vcf}")
    shell('rm {params.tmp_raw_vcf}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
        shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")
From line 369 of master/Snakefile
run:
    # genotype the variants of the input VCF (-v), dropping hom-ref and missing genotypes
    shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} -v {input.vcf} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
    shell("bcftools view -e 'GT=\"0/0\" || GT=\"./.\"' {params.tmp_raw_vcf} | bcftools sort | bgzip > {output.vcf}")
    shell("tabix -f -p vcf {output.vcf}")
    shell('rm {params.tmp_raw_vcf}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
        shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")
From line 396 of master/Snakefile
run:
    # convert the graph to the PackedGraph format (-p)
    shell('vg convert {input} -p > {output} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/")
From line 418 of master/Snakefile
run:
    # augment the graph with the variation observed in the reads,
    # writing the updated alignments (-A) alongside the augmented graph
    shell('vg augment {input.pg} {input.gam} -t {threads} -m 4 -q 5 -Q 5 -A {output.gam} > {output.pg} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.gam} {SROOT}/{wildcards.sample}/")
        shell("aws s3 cp --quiet {output.pg} {SROOT}/{wildcards.sample}/")
From line 436 of master/Snakefile
run:
    # compute snarls on the augmented, sample-specific graph
    shell('vg snarls -t {threads} {input} > {output} 2> {log}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
From line 451 of master/Snakefile
run:
    # compute read support on the augmented graph
    shell("vg pack -x {input.pg} -g {input.gam} -Q {wildcards.minq} -t {threads} -o {output} 2> {log}")
    if config['s3save']:
        shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
From line 467 of master/Snakefile
run:
    # call variants on the augmented graph, then sort, compress, and index the VCF
    shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
    shell("bcftools sort {params.tmp_raw_vcf} | bgzip > {output.vcf}")
    shell("tabix -f -p vcf {output.vcf}")
    shell('rm {params.tmp_raw_vcf}')
    if config['s3save']:
        shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
        shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")
From line 488 of master/Snakefile


URL: https://github.com/vgteam/vg_snakemake
Name: vg_snakemake
License: GNU General Public License v3.0