This repository contains a Snakemake workflow for the vg toolkit. At this stage it is simply a place to host usable Snakefile and config.yaml files. With it, you can build a variation graph from a reference FASTA file and variants in a VCF file, map reads to the graph, and genotype those variants.

For now, this workflow has been used locally or on a large AWS/Kubernetes instance. We will soon test it on an HPC system and update the examples below.

This workflow was tested on vg 1.21.0.
Input files
To use the workflow, first copy the Snakefile and config.yaml files into the working directory.
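For example (the clone path below is only a placeholder for wherever you cloned this repository):

    # copy the workflow files into the analysis directory
    # (path/to/vg-snakemake-workflow is a placeholder for your local clone)
    cp path/to/vg-snakemake-workflow/Snakefile path/to/vg-snakemake-workflow/config.yaml .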
Graph construction
For construction, two files are necessary in the working directory:
- a reference file named {GENOME}.fa
- a VCF file called {VARIANTS}.vcf.gz (sorted and indexed)

The {GENOME} and {VARIANTS} labels should match the graph label in the configuration (config.yaml file or the command line). The corresponding graph label is {GENOME}-{VARIANTS}. These labels shouldn't contain the character -.
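As a concrete sketch, using the hg38/hgsvc names from the example further below (the starting file variants.vcf is hypothetical, and the sort/compress/index steps are only needed if your VCF isn't already prepared):

    # reference FASTA, matching the {GENOME} part of the graph label "hg38-hgsvc"
    ls hg38.fa
    # the VCF must be sorted, bgzip-compressed and tabix-indexed,
    # and named to match the {VARIANTS} part of the label
    bcftools sort variants.vcf | bgzip > hgsvc.vcf.gz
    tabix -p vcf hgsvc.vcf.gz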
Mapping and genotyping samples
The reads for each sample should be in a folder named with the sample name. In the {sample} folder, FASTQ files should be named {sample}_1.fastq.gz and {sample}_2.fastq.gz. The sample names shouldn't contain the character -.
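For example, with the two samples SAMP1 and SAMP2 used further below (and the hg38/hgsvc inputs from the construction example), the working directory would look roughly like this:

    Snakefile
    config.yaml
    hg38.fa
    hgsvc.vcf.gz
    hgsvc.vcf.gz.tbi
    SAMP1/SAMP1_1.fastq.gz
    SAMP1/SAMP1_2.fastq.gz
    SAMP2/SAMP2_1.fastq.gz
    SAMP2/SAMP2_2.fastq.gz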
Examples
The two rules used most in practice are:
- construct_all to build the graph and all the indexes necessary for mapping and variant calling.
- genotype to map reads and genotype samples.

See an example on a small dataset in the testdata folder.
Construct a variation graph
To build a graph from hg38.fa and hgsvc.vcf.gz, you would use graph: "hg38-hgsvc" in config.yaml and run:

snakemake --configfile config.yaml --resources mem_mb=200000 --cores 16 construct_all

Or specify the graph label on the command line:

snakemake --configfile config.yaml --config graph="hg38-hgsvc" --resources mem_mb=200000 --cores 16 construct_all

Note the --resources mem_mb=200000 --cores 16 options, which specify the maximum amount of memory and the number of cores to use.
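Before launching a full run, a Snakemake dry run can be used to preview the jobs that would be executed (a generic Snakemake feature, not specific to this workflow):

    # list the jobs that construct_all would trigger, without running anything
    snakemake --configfile config.yaml -n construct_all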
Map reads and genotype variants
Let's assume the graph and its indexes have been constructed (see above) and you want to analyze reads for two samples in SAMP1/SAMP1_*.fastq.gz and SAMP2/SAMP2_*.fastq.gz:

snakemake --configfile config.yaml --config samples="SAMP1 SAMP2" --resources mem_mb=100000 --cores 16 genotype

(This assumes that the graph label, e.g. hg38-hgsvc, is defined in the config.yaml file.)
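If the graph label is not set in config.yaml, it can be passed on the command line together with the samples (same run as above, assuming the hg38-hgsvc graph):

    snakemake --configfile config.yaml --config graph="hg38-hgsvc" samples="SAMP1 SAMP2" --resources mem_mb=100000 --cores 16 genotype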
Overview of the workflow
We can easily visualize the workflow with Snakemake. Here we use the small test data with two chromosomes, one sample, and reads split into two chunks (more info in the testdata folder).
Rule graph
DAG of jobs
File graph
Overview of the workflow when using the Giraffe mapper
The newer and faster mapper, giraffe, uses different graph indexes.
Rule graph
DAG of jobs
File graph
FAQ
How do I specify the resources?
When running locally or on an instance, we use --resources mem_mb=200000 --cores 16 to specify the maximum memory and number of cores to use. To run on an HPC, you would need to add the HPC-specific configuration (more details soon). The current resources listed in config.yaml were tuned for a human genome.
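Until those HPC examples are added, Snakemake's generic profile mechanism is one option; the sketch below assumes a cluster profile named slurm that you have configured separately, which is not part of this workflow:

    # submit jobs through a pre-configured Snakemake profile (hypothetical "slurm" profile)
    snakemake --configfile config.yaml --profile slurm construct_all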
Where can I run this?
You can run this on any resource supported by Snakemake, for example locally or on an HPC.
Can I use the workflow for non-human organisms?
Yes, just remember to update the chromosome names in config.yaml.
Why are there AWS/S3 commands in the workflow?
We sometimes run the workflow on an AWS instance and use these commands to save the outputs in an S3 bucket. If you don't want to do this, use the default config value s3save: False.
Code Snippets
    shell:
        # remove the files/directories in params.ff
        "rm -rf {params.ff}"
    run:
        # construct the graph for one chromosome from the reference and VCF, then sort node IDs
        shell('(vg construct -t {threads} -r {input.ref} -v {input.vcf} -a -f -S -R {wildcards.chr} -C | vg ids --sort - > {output}) 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # join the node ID spaces of the per-chromosome graphs and save the node mapping
        shell('vg ids --join --mapping {output.mapping} {input} 2> {log}')
        if config['s3save']:
            shell("for ff in {input}; do aws s3 cp --quiet $ff {SROOT}/; done")
            shell("aws s3 cp --quiet {output.mapping} {SROOT}/")
    run:
        # build the XG index
        shell('vg index -L -t {threads} -x {output} {input.vg} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # compute the snarls of the graph
        shell('vg snarls -t {threads} {input} > {output} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # prune complex regions from the graph, restoring paths
        shell('vg prune -t {threads} -M 32 --restore-paths {input.vg} > {output} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # build the GCSA index (and LCP), using a dedicated temporary directory
        shell('mkdir -p {params.tmp_dir}')
        shell('vg index --temp-dir {params.tmp_dir} -p -t {threads} -g {output.gcsa} {input} 2> {log}')
        shell('rm -r {params.tmp_dir}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.gcsa} {SROOT}/")
            shell("aws s3 cp --quiet {output.gcsalcp} {SROOT}/")
    run:
        # build the GBWT index and GBWTGraph (gg)
        shell('vg gbwt -n {wildcards.n} -g {output.gg} -o {output.gbwt} -x {input} -P 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.gbwt} {SROOT}/")
            shell("aws s3 cp --quiet {output.gg} {SROOT}/")
    run:
        # build the minimizer index used by the giraffe mapper
        shell('vg minimizer -k {wildcards.k} -w {wildcards.w} -t {threads} -i {output} -g {input.gbwt} {input.xg} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # compute snarls, including trivial ones
        shell('vg snarls -t {threads} --include-trivial {input} > {output} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # build the distance index used by the giraffe mapper
        shell('vg index -t {threads} -j {output} -x {input.xg} -s {input.snarls} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # split the first-mate FASTQ into chunks of nb_split_reads reads (4 lines per read)
        NLINES=config['nb_split_reads'] * 4
        shell('mkdir -p {output.dir}')
        shell("gzip -cd {input} | split -d -l {NLINES} --filter='pigz -p {threads} > ${{FILE}}.fastq.gz' - \"{output.dir}/{wildcards.sample}_1.part\"")
    run:
        # split the second-mate FASTQ into chunks of nb_split_reads reads (4 lines per read)
        NLINES=config['nb_split_reads'] * 4
        shell('mkdir -p {output.dir}')
        shell("gzip -cd {input} | split -d -l {NLINES} --filter='pigz -p {threads} > ${{FILE}}.fastq.gz' - \"{output.dir}/{wildcards.sample}_2.part\"")
    run:
        # map the paired reads to the graph with vg map
        shell("vg map -t {threads} -x {input.xg} -g {input.gcsa} -f {input.r1} -f {input.r2} > {output} 2> {log}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{output}")
    run:
        # map the paired reads with vg mpmap
        shell("vg mpmap -S -t {threads} -x {input.xg} -g {input.gcsa} -f {input.r1} -f {input.r2} > {output} 2> {log}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{output}")
    run:
        # map the paired reads with the giraffe mapper (vg gaffe)
        shell("vg gaffe -p -t {threads} -m {input.min} -d {input.dist} --gbwt-name {input.gbwt} -x {input.xg} -N {wildcards.sample} -f {input.r1} -f {input.r2} > {output} 2> {log}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{output}")
    run:
        # concatenate the per-chunk alignments for the sample
        shell("cat {input} > {output}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
    run:
        # compute read support (pack) from the alignments
        shell("vg pack -x {input.xg} -g {input.gam} -Q {wildcards.minq} -t {threads} -o {output} 2> {log}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
    run:
        # call variants from the read support, then sort, compress and index the VCF
        shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
        shell("bcftools sort {params.tmp_raw_vcf} | bgzip > {output.vcf}")
        shell("tabix -f -p vcf {output.vcf}")
        shell('rm {params.tmp_raw_vcf}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
            shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")
    run:
        # genotype the variants from the input VCF, then drop reference and missing genotypes
        shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} -v {input.vcf} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
        shell("bcftools view -e 'GT=\"0/0\" || GT=\"./.\"' {params.tmp_raw_vcf} | bcftools sort | bgzip > {output.vcf}")
        shell("tabix -f -p vcf {output.vcf}")
        shell('rm {params.tmp_raw_vcf}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
            shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")
    run:
        # convert the graph to PackedGraph format
        shell('vg convert {input} -p > {output} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/")
    run:
        # augment the graph with variation from the read alignments
        shell('vg augment {input.pg} {input.gam} -t {threads} -m 4 -q 5 -Q 5 -A {output.gam} > {output.pg} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.gam} {SROOT}/{wildcards.sample}/")
            shell("aws s3 cp --quiet {output.pg} {SROOT}/{wildcards.sample}/")
    run:
        # compute the snarls of the graph
        shell('vg snarls -t {threads} {input} > {output} 2> {log}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
    run:
        # compute read support (pack) from the alignments
        shell("vg pack -x {input.pg} -g {input.gam} -Q {wildcards.minq} -t {threads} -o {output} 2> {log}")
        if config['s3save']:
            shell("aws s3 cp --quiet {output} {SROOT}/{wildcards.sample}/")
    run:
        # call variants from the read support, then sort, compress and index the VCF
        shell("vg call -k {input.pack} -t {threads} -s {wildcards.sample} --snarls {input.snarls} {input.xg} > {params.tmp_raw_vcf} 2> {log}")
        shell("bcftools sort {params.tmp_raw_vcf} | bgzip > {output.vcf}")
        shell("tabix -f -p vcf {output.vcf}")
        shell('rm {params.tmp_raw_vcf}')
        if config['s3save']:
            shell("aws s3 cp --quiet {output.vcf} {SROOT}/{wildcards.sample}/")
            shell("aws s3 cp --quiet {output.idx} {SROOT}/{wildcards.sample}/")