Pipeline to build a large human reference panel using publicly available genomes


A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human genomes, aligned to GRCh38.

This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper, High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios; it is implemented here, with improvements, using snakemake and conda for full reproducibility.

:warning: This pipeline is in active development and subject to ongoing improvements.

Installation

Download the refpanel source code:

git clone git@github.com:ekirving/refpanel.git && cd refpanel

This pipeline uses the conda package manager (or the faster mamba front-end) to handle installation of all software dependencies. If you do not already have conda or mamba installed, please install one of them first.

Once conda is set up, build and activate a new environment for the refpanel pipeline:

conda env create --name refpanel --file environment.yaml
conda activate refpanel
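
If you are using mamba instead, the same syntax applies (mamba mirrors the conda CLI) and environment creation is typically much faster:

mamba env create --name refpanel --file environment.yaml
conda activate refpanel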

Data sources

This pipeline comes preconfigured to build a joint-callset, called refpanel-v2 (n=5,100), involving all publicly available samples from the 1000 Genomes Project (1000G), the Human Genome Diversity Project (HGDP), the Simons Genome Diversity Project (SGDP) and the Gambian Genome Variation Project (GGVP), plus additional public genomes from other published studies.

The data from these projects is hosted by the International Genome Sample Resource (IGSR) database (doi:10.1093/nar/gkw829) and the European Nucleotide Archive (ENA) (doi:10.1093/nar/gkq967).

If there are publicly available whole-genome sequencing data that you would like incorporated into refpanel-v3, please raise an issue on GitHub with the details of the publication, and they will be considered for inclusion in a future release.

If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs.

Downloading data

To ensure all data is processed consistently, refpanel downloads gVCF files for 1000G; CRAM files for 1000G, HGDP, SGDP and GGVP; and fastq files for all other data sources.

To (optionally) pre-fetch all the data dependencies, run:

./refpanel download_data &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
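
Since the download job runs in the background, you can follow its progress by tailing the newest log file (a generic shell one-liner, not part of the pipeline itself):

tail -f "$(ls -t refpanel-*.log | head -n 1)"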

:warning: These files are very large: please make sure you have sufficient disk space to store them!

Ancestry composition

Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata; the sample counts below sum to the full callset of n=5,100.

Superpopulation                       Code   Samples
African Ancestry                      AFR      1,460
American Ancestry                     AMR        589
Central Asian and Siberian Ancestry   CAS         66
Central and South Asian Ancestry      CSA        199
East Asian Ancestry                   EAS        826
European Ancestry                     EUR        790
Middle Eastern Ancestry               MEA        407
Oceanian Ancestry                     OCE         38
South Asian Ancestry                  SAS        678
West Eurasian Ancestry                WEA         47

Joint-calling pipeline

In brief, refpanel produces a jointly-called and phased callset via the following steps (each is reflected in the code snippets below):

1. Download the GRCh38 reference genome, GATK resource bundles, genetic maps, and raw sequence data
2. Trim adapters with fastp
3. Align reads to GRCh38 with bwa mem
4. Fix mate information, merge read-group BAMs, coordinate-sort, and mark duplicates with Picard
5. Recalibrate base quality scores (BQSR) with GATK3
6. Convert the recalibrated BAMs to CRAM
7. Call per-sample gVCFs with GATK3 HaplotypeCaller
8. Joint-genotype all samples with CombineGVCFs and GenotypeGVCFs
9. Recalibrate variant quality scores (VQSR) for SNPs and INDELs
10. Annotate sites and filter on missingness, Hardy-Weinberg equilibrium, and Mendelian errors
11. Phase with WhatsHap (read-backed and pedigree-based), then statistically with SHAPEIT4
12. Annotate with dbSNP IDs and VEP, and evaluate against GIAB truth sets with hap.py

For more information, refer to the DAG of the rule graph or the code itself.
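
If you have snakemake and graphviz available, the rule graph can be regenerated from the repository root using Snakemake's built-in option (a standard Snakemake idiom, not a refpanel-specific command):

snakemake --rulegraph | dot -Tpng > rulegraph.png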

Running the pipeline

To execute the full pipeline, end-to-end, run:

./refpanel &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
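
If you launch this over SSH, it may be worth protecting the backgrounded job from hang-ups when your session closes (a generic shell pattern, not specific to refpanel):

nohup ./refpanel &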

:warning: This will take a long time: please make sure you run this on a server with as many CPUs, and as much RAM, as possible (e.g., this pipeline was developed and run on a cluster of nodes, each with 96 cores and 755 GB of RAM).

The pipeline can also be broken down into separate steps, for distribution across multiple nodes in a cluster, as sketched below.
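
Because refpanel is a Snakemake workflow, individual rules or output files can also be requested directly via the standard Snakemake CLI. A minimal sketch, in which the target rule name is hypothetical (list the real rule names with snakemake --list):

# run a single (hypothetical) rule with conda-managed environments on 96 cores
snakemake --use-conda --cores 96 name_of_rule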

Code Snippets

The snippets below are the shell directives from the pipeline's Snakemake rules, shown in approximate execution order: reference and resource downloads, read trimming and alignment, BAM processing and BQSR, variant calling and VQSR, phasing, annotation, and evaluation.

shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=5"
    " --directory-prefix=data/reference/GRCh38/"
    " -o /dev/null"
    " ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/"
shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"
shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"
shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"
shell:
    "gunzip -c {input.vcf} | "
    " sed 's/POS=POS-1/POS_POS-1/g' | "
    " sed '/^#/!s/ /_/g' | "
    " bgzip > {output.vcf} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "bedtools subtract -nonamecheck -a {input.bed} -b {input.hap} > {output.bed}"
shell:
    "grep chrM {input.bed} > {output.bed}"
shell:
    "grep -vP 'chrY|chrM' {input.bed} > {output.bed}"
shell:
    "wget --quiet -O {output.map} -o /dev/null https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz"
shell:
    "gunzip -c {input.map} | awk 'NR==1 || $1==\"{params.chr}\" {{ print $2,$3,$4 }}' > {output.map}"
shell:
    "wget --quiet -O {output.tar} -o /dev/null https://github.com/odelaneau/shapeit4/raw/master/maps/genetic_maps.b38.tar.gz && "
    "mkdir -p {params.path} && tar -xzf {output.tar} -C {params.path}"
shell:
    "wget --quiet -O {output.txt} -o /dev/null ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt"
shell:
    "wget --quiet -O {output.vcf} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz.tbi"
shell:
    "fastp "
    " --in1 {input.fastq}"
    " --out1 {output.fastq}"
    " --thread {threads}"
    " --json {output.json}"
    " --html {output.html} 2> {log}"
shell:
    "fastp "
    " --in1 {input.fastq_r1}"
    " --in2 {input.fastq_r2}"
    " --out1 {output.fastq_r1}"
    " --out2 {output.fastq_r2}"
    " --thread {threads}"
    " --json {output.json}"
    " --html {output.html} 2> {log}"
shell:
    "( bwa mem -Y "
    "   -K 100000000 "
    "   -t {threads} "
    "   -R '{params.rg}' "
    "   {input.ref} "
    "   {input.fastq} | "
    "  samtools view -Shb -o {output.bam} - "
    ") 2> {log}"
shell:
    "( bwa mem -Y "
    "   -K 100000000 "
    "   -t {threads} "
    "   -R '{params.rg}' "
    "   {input.ref} "
    "   {input.fastq_r1} "
    "   {input.fastq_r1} | "
    "  samtools view -Shb -o {output.bam} - "
    ") 2> {log}"
shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " FixMateInformation"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " ADD_MATE_CIGAR=True"
    " ASSUME_SORTED=true"
    " I={input.bam}"
    " O={output.bam} 2> {log}"
shell:
    "picard"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " MergeSamFiles"
    " USE_THREADING=true"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " SORT_ORDER=queryname"
    " {params.bams}"
    " OUTPUT={output.bam} 2> {log}"
shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " SortSam"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " SORT_ORDER=coordinate"
    " CREATE_INDEX=true"
    " I={input.bam}"
    " O={output.bam} 2> {log}"
shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " MarkDuplicates"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " METRICS_FILE={output.met}"
    " INPUT={input.bam}"
    " OUTPUT={output.bam} 2> {log} && "
    "picard BuildBamIndex I={output.bam} O={output.bai} 2> /dev/null"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T BaseRecalibrator"
    " --downsample_to_fraction 0.1"
    " --num_cpu_threads_per_data_thread {threads}"
    " --preserve_qscores_less_than 6"
    " -R {input.ref}"
    " -o {output.tbl}"
    " -I {input.bam}"
    " -knownSites {input.snps}"
    " -knownSites {input.indels}"
    " -knownSites {input.mills} 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T PrintReads"
    " --num_cpu_threads_per_data_thread {threads}"
    " --disable_indel_quals"
    " --preserve_qscores_less_than 6"
    " {params.flags}"
    " -SQQ 10"
    " -SQQ 20"
    " -SQQ 30"
    " -rf BadCigar"
    " -R {input.ref}"
    " -o {output.bam}"
    " -I {input.bam}"
    " -BQSR {input.tbl} 2> {log}"
shell:
    "samtools view"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.bam}"
shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " CollectMultipleMetrics "
    " R={input.ref}"
    " I={input.cram}"
    " O={params.prefix} 2> {log}"
shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " CollectWgsMetrics "
    " R={input.ref}"
    " I={input.cram}"
    " O={output.txt} 2> {log}"
shell:
    "samtools idxstats {input.cram} > {output.idx}"
shell:
    "awk '"
    ' $1=="chrX" {{ xcov=$3/$2 }}; '
    ' $1=="chrY" {{ ycov=$3/$2 }}; '
    " END {{ print xcov/ycov }}' {input.idx} > {output.xyr}"
shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -T HaplotypeCaller"
    " --genotyping_mode DISCOVERY"
    " -A AlleleBalanceBySample"
    " -A DepthPerAlleleBySample"
    " -A DepthPerSampleHC"
    " -A InbreedingCoeff"
    " -A MappingQualityZeroBySample"
    " -A StrandBiasBySample"
    " -A Coverage"
    " -A FisherStrand"
    " -A HaplotypeScore"
    " -A MappingQualityRankSumTest"
    " -A MappingQualityZero"
    " -A QualByDepth"
    " -A RMSMappingQuality"
    " -A ReadPosRankSumTest"
    " -A VariantType"
    " --logging_level INFO"
    " --emitRefConfidence GVCF"
    " -rf BadCigar"
    " --variant_index_parameter 128000"
    " --variant_index_type LINEAR"
    " --sample_ploidy {wildcards.ploidy}"
    " --intervals {input.region}"
    " -R {input.ref}"
    " --num_cpu_threads_per_data_thread 1"
    " -I {input.cram}"
    " -o {output.vcf} 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " --variant {input.vcf1}"
    " --variant {input.vcf2}"
    " -o {output.vcf} 2> {log}"
shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xms{resources.mem_mb}m"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " {params.gvcfs}"
    " -o {output.vcf} &> {log}"
shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xms{resources.mem_mb}m"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " {params.gvcfs}"
    " -o {output.vcf} &> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T GenotypeGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " --disable_auto_index_creation_and_locking_when_reading_rods"
    " {params.gvcfs}"
    " -o {output.vcf} 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantRecalibrator"
    " -R {input.ref}"
    " --num_threads {threads}"
    " {params.vcfs}"
    " -mode SNP"
    " -recalFile {output.recal}"
    " -tranchesFile {output.tranche}"
    " -rscriptFile {output.plot}"
    " -resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hap}"
    " -resource:omni,known=false,training=true,truth=true,prior=12.0 {input.omni}"
    " -resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.snps}"
    " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}"
    " -an QD"
    " -an MQ"
    " -an FS"
    " -an MQRankSum"
    " -an ReadPosRankSum"
    " -an SOR"
    " -an DP"
    " -tranche 100.0"
    " -tranche 99.8"
    " -tranche 99.6"
    " -tranche 99.4"
    " -tranche 99.2"
    " -tranche 99.0"
    " -tranche 95.0"
    " -tranche 90.0 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T ApplyRecalibration"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " -input {input.vcf}"
    " -mode SNP"
    " --ts_filter_level 99.80"
    " -recalFile {input.recal}"
    " -tranchesFile {input.tranche}"
    " -o {output.vcf} 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantRecalibrator"
    " -R {input.ref}"
    " --num_threads {threads}"
    " {params.vcfs}"
    " -mode INDEL"
    " -recalFile {output.recal}"
    " -tranchesFile {output.tranche}"
    " -rscriptFile {output.plot}"
    " -resource:mills,known=true,training=true,truth=true,prior=12.0 {input.mills}"
    " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}"
    " -an QD"
    " -an FS"
    " -an ReadPosRankSum"
    " -an MQRankSum"
    " -an SOR"
    " -an DP"
    " -tranche 100.0"
    " -tranche 99.0"
    " -tranche 95.0"
    " -tranche 92.0"
    " -tranche 90.0"
    " --maxGaussians 4 2> {log}"
shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T ApplyRecalibration"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " -input {input.vcf}"
    " -mode INDEL"
    " --ts_filter_level 99.0"
    " -recalFile {input.recal}"
    " -tranchesFile {input.tranche}"
    " -o {output.vcf} 2> {log}"
shell:
    "( bcftools norm"
    "   --fasta-ref {input.ref}"
    "   --atomize"
    "   --multiallelics +snps"
    "   -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ")2> {log}"
shell:
    """awk '{{ print $4","$3","$2 }}' {input.ped} > {output.csv}"""
shell:
    "( bcftools annotate -a {input.dbsnp} -c ID -Ou {input.vcf} | "
    "  bcftools +fill-tags -- --tags all,F_MISSING --samples-file {input.super} | "
    "  bcftools +mendelian --mode a --trio-file {input.trios} -Oz -o {output.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "
shell:
    "( bcftools view --include 'FILTER=\"PASS\" & F_MISSING<0.05 & ({params.hwe}) & MERR<{params.max_merr} & MAC>=2' -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "
shell:
    "bcftools view --samples '{wildcards.sample}' -Oz -o {output.vcf} {input.vcf} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed1} -Oz -o {output.vcf1} {input.vcf} && bcftools index --tbi {output.vcf1} && "
    "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed2} -Oz -o {output.vcf2} {input.vcf} && bcftools index --tbi {output.vcf2}"
shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --tag PS"
    " --indels"
    " --ignore-read-groups"
    " --output {output.vcf}"
    " {input.vcf}"
    " {input.cram} 2> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --tag PS"
    " --indels"
    " --ignore-read-groups"
    " --output {output.vcf}"
    " {input.vcf}"
    " {input.cram} {input.phase10x} 2> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "bcftools concat --allow-overlaps  -Oz -o {output.vcf} {input.vcf1} {input.vcf2} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "split -d --number=l/{threads} {input.list} {params.prefix}- && "
    "parallel -j {threads} 'bcftools merge --file-list {{}} -Ob -o {{}}.bcf' ::: {params.prefix}-* && "
    "bcftools merge --threads 8 -Oz -o {output.vcf} {params.prefix}-*.bcf && "
    "bcftools index --tbi {output.vcf} && "
    "rm {params.prefix}-*"
shell:
    """awk '$1=="{wildcards.family}"' {input.ped} > {output.ped}"""
shell:
    "bcftools view --samples '{params.samples}' -Oz -o {output.vcf} {input.vcf} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --chromosome {wildcards.chr}"
    " --genmap {input.map}"
    " --ped {input.ped}"
    " --use-ped-samples"
    " --tag PS"
    " --indels"
    " --output {output.vcf}"
    " {input.vcf} 2> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "ulimit -n {params.limit} && "
    "bcftools merge --file-list {input.list} -Oz -o {output.vcf} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "shapeit4"
    " --thread {threads}"
    " --input {input.vcf}"
    " --map {input.map}"
    " --region {wildcards.chr}"
    " --sequencing"
    " --output {output.vcf}"
    " &> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "shapeit4"
    " --thread {threads}"
    " --input {input.vcf1}"
    " --scaffold {input.vcf2}"
    " --map {input.map}"
    " --region {wildcards.chr}"
    " --sequencing"
    " --output {output.vcf}"
    " &> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "bcftools annotate"
    " --annotations {input.vcf_annot}"
    " --columns '^INFO/AF'"
    " --output-type z"
    " --output {output.vcf} {input.vcf_input} 2> {log} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "vep_install"
    " --AUTO c"
    " --SPECIES homo_sapiens"
    " --ASSEMBLY GRCh38"
    " --CACHEDIR {output.dir}"
    " --NO_UPDATE &> {log}"
shell:
    "vep"
    " --offline"
    " --everything"
    " --species homo_sapiens"
    " --assembly GRCh38"
    " --dir_cache {input.dir}"
    " --vcf"
    " --fork {threads}"
    " --input_file {input.vcf}"
    " --fasta {input.ref}"
    " --stats_file {output.htm}"
    " --compress_output bgzip"
    " --output_file {output.vcf} &> {log} && "
    "tabix -p vcf {output.vcf}"
shell:
    "( bcftools norm --atomize -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} ) 2> {log}"
shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantEval"
    " -R {input.ref}"
    " --eval {input.vcf}"
    " --dbsnp {input.dbsnp}"
    " --comp {input.comp}"
    " --known_names '1000G'"
    " --out {output.evl} 2> {log}"
shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=6"
    " --directory-prefix=data/reference/GRCh38/giab-stratifications/v3.0/"
    " -o /dev/null"
    " https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/genome-stratifications/v3.0/GRCh38/"
shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=6"
    " --directory-prefix=data/evaluation/GIAB/NA12878_HG001/"
    " -o /dev/null"
    " ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv4.2.1/GRCh38/"
shell:
    "( bcftools view -Oz -o {output.vcf} {input.vcf} {wildcards.chr} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "
shell:
    "( bcftools view --samples 'NA12878' -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "
shell:
    "hap.py"
    " --reference {input.ref}"
    " --report-prefix {params.prefix}"
    " {input.vcf_truth}"
    " {input.vcf_input}"
shell:
    """grep -P '{params.rgx}' {input.man} | awk '{{ print $3" {params.file}" }}' > {output.md5}"""
From line 94 of data/1000g.smk
shell:
    "wget --quiet -O {output.vcf} -o /dev/null {params.ftp}/{wildcards.sample}.haplotypeCalls.er.raw.{wildcards.ext} && "
    "md5sum --status --check {input.md5}"
From line 112 of data/1000g.smk
shell:
    "awk 'NR>1 && $3!=0 && $4!=0' {input.tsv} | "
    "sed -E 's/SH074|SH089/CHS2/' | "
    "sed -E 's/Y028|Y117/YRI1/' | "
    "sed -E 's/m004|m009/MXL1/' | "
    "sed -E 's/m008|m011/MXL2/' | "
    "sed -E 's/2467|2469/ASW4/' | "
    "grep -v 'HG02567' > {output.ped}"
From line 160 of data/1000g.smk
shell:
    "samtools view"
    " --expr 'seq=~\"^[ACGTN]+$\"'"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.cram}"
shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"
shell:
    "samtools view"
    " --expr 'seq=~\"^[ACGTN]+$\"'"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.cram}"
shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"
shell:
    "echo '{wildcards.sample}' > {output.sample} && "
    "bcftools reheader --samples {output.sample} {input.vcf} > {output.vcf} && "
    "bcftools index --tbi {output.vcf}"
shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"