Pipeline to build a large human reference panel, using publicly available genomes
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human
genomes, aligned to
GRCh38
.
This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios ; implemented here, with improvements, using snakemake and conda for full reproducibility.
:warning: This pipeline is in active development and subject to ongoing improvements.
Installation
Download the
refpanel
source code
git clone git@github.com:ekirving/refpanel.git && cd refpanel
This pipeline uses the
conda package manager
(or the
faster
mamba
front-end) to handle installation of all software
dependencies. If you do not already have
conda
or
mamba
installed, then please install one first.
Once
conda
is setup, build and activate a new environment for the
refpanel
pipeline
conda env create --name refpanel --file environment.yaml
conda activate refpanel
Data sources
This pipeline comes preconfigured to build a joint-callset, called
refpanel-v2
(n=5,100), involving all publicly
available samples from:
-
1000 Genomes Project (1000G) , 30x NYGC version (n=3,202; inc. 2,504 phase 3 samples + 698 additional related samples);
The 1000 Genomes Project Consortium. (2015) Nature doi:10.1038/nature15393
Byrska-Bishop et al. (2022) Cell doi:10.1016/j.cell.2022.08.004 -
Simons Genome Diversity Project (SGDP) (n=256; excluding overlap with 1000G);
Mallick et al. (2016) Nature doi:10.1038/nature18964 -
Gambian Genome Variation Project (GGVP) (n=400; excluding overlap with 1000G);
Band et al. (2019) Nature Communications doi:10.1038/s41467-019-13480-z -
Human Genome Diversity Project (HGDP) (n=816; excluding overlap with SGDP);
Bergström et al. (2020) Science doi:10.1126/science.aay5012 -
Arabian Peninsula Population Genomic Study (APPG) (n=137);
Almarri et al. (2021) Cell doi:10.1016/j.cell.2021.07.013
Plus additional public genomes from:
-
Meyer et al. (2012) Science (n=8); doi:10.1126/science.1224344
-
Mondal et al. (2016) Nature Genetics (n=60); doi:10.1038/ng.3621
-
Rodriguez-Flores et al. (2016) Genome Research (n=108); doi:10.1101/gr.191478.115
-
de Barros Damgaard et al. (2018) Science (n=41); 10.1126/science.aar7711
-
McColl et al. (2018) Science (n=2); doi:10.1126/science.aat3628
-
Lindo et al. (2018) Science Advances (n=25); 10.1126/sciadv.aau4921
-
Gelabert et al. (2019) BMC Genomics (n=12); doi:10.1186/s12864-019-5529-0
-
Serra-Vidal et al. (2019) Current Biology (n=21); doi:10.1016/j.cub.2019.09.050
-
Lorente-Galdos et al. (2019) Genome Biology (n=9); doi:10.1186/s13059-019-1684-5
-
Crooks et al. (2020) BMC Genetics (n=3); doi:10.1186/s12863-020-00917-4
The data from these projects is hosted by the International Genome Sample Resource (IGSR) database ( doi:10.1093/nar/gkw829 ) and the European Nucleotide Archive (ENA) ( doi:10.1093/nar/gkq967 ).
If there are publicly available whole-genome sequencing data that you would like incorporated into
refpanel-v3
please
raise an issue on GitHub with the details of the publication and they will be considered for inclusion in future releases.
If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs .
Downloading data
To ensure all data is processed consistently,
refpanel
downloads
gVCF
files for 1000G;
CRAM
files for 1000G, HGDP,
SGDP and GGVP; and
fastq
files for all other data sources.
To (optionally) pre-fetch all the data dependencies, run:
./refpanel download_data &
All output will be automatically written to a log file
refpanel-<YYYY-MM-DD-HHMM.SS>.log
:warning: These files are very large : Please make sure you have sufficient disk space to store them!
Ancestry composition
Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata.
Superpopulation | Code | Samples |
---|---|---|
African Ancestry | AFR | 1,460 |
American Ancestry | AMR | 589 |
Central Asian and Siberian Ancestry | CAS | 66 |
Central and South Asian Ancestry | CSA | 199 |
East Asian Ancestry | EAS | 826 |
European Ancestry | EUR | 790 |
Middle Eastern Ancestry | MEA | 407 |
Oceanian Ancestry | OCE | 38 |
South Asian Ancestry | SAS | 678 |
West Eurasian Ancestry | WEA | 47 |
Joint-calling pipeline
In brief,
refpanel
produces a jointly-called and phased callset via the following steps:
-
Adapter trimming with
fastp
(v.0.23.2) -
Alignment to
GRCh38
withbwa mem
(v0.7.15) -
Fix-mate , merge , sort , and mark duplicates with
picard
(v2.5.0) -
Base recalibration with
gatk BaseRecalibrator
(v3.5) -
Conversion to
cram
withsamtools
(v1.14) -
Per-sample calling of
gVCFs
withgatk HaplotypeCaller
(with sex-dependent ploidy ) -
Merging samples with
gatk CombineGVCFs
-
Joint-calling of all samples with
gatk GenotypeGVCFs
-
Variant quality score recalibration with
gatk VariantRecalibrator
-
Annotation with dbSNP build 155 with
bcftools
(v1.14) -
Hard-filtering of SNPs and INDELs with
bcftools
:-
VQSR PASS;
-
GT missingness < 5%;
-
HWE p-value > 1e-10 in at least one super-population;
-
Mendelian error rate < 5%, using trios from 1000G (n=602) and GGVP (n=133);
-
MAC ≥ 2 (i.e., no singletons);
-
-
Read-based phasing with
whatshap
(v1.2.1) using:-
Illumina paired-end reads from all projects;
-
10x Genomics linked-read sequencing from HGDP (n=26) and APPG (n=137);
-
-
Pedigree phasing with
whatshap
using:- Trios from 1000G (n=602) and GGVP (n=130);
-
Statistical phasing with
shapeit4
(v4.2.2) -
Variant effect prediction with
ensembl-vep
(v105.0)
For more information, refer to the DAG of the rule graph or the code itself.
Running the pipeline
To execute the full pipeline, end-to-end, run:
./refpanel &
All output will be automatically written to a log file
refpanel-<YYYY-MM-DD-HHMM.SS>.log
:warning: This will take a long time : Please make sure you run this on a server with as many CPUs, and as much RAM, as possible (e.g., this pipeline was developed and run on a cluster of nodes, each with 96 cores and 755Gb of RAM each).
The pipeline can also be broken down into separate steps , for distribution across multiple nodes in a cluster.
Code Snippets
38 39 40 41 42 43 44 45 46 | shell: "wget " " --mirror" " --quiet" " --no-host-directories" " --cut-dirs=5" " --directory-prefix=data/reference/GRCh38/" " -o /dev/null" " ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/" |
56 57 58 59 | shell: "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz && " "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi && " "gunzip --test {output.vcf}" |
69 70 71 72 | shell: "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz && " "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi && " "gunzip --test {output.vcf}" |
82 83 84 85 | shell: "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz && " "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi && " "gunzip --test {output.vcf}" |
104 105 106 107 108 109 | shell: "gunzip -c {input.vcf} | " " sed 's/POS=POS-1/POS_POS-1/g' | " " sed '/^#/!s/ /_/g' | " " bgzip > {output.vcf} && " "bcftools index --tbi {output.vcf}" |
160 161 | shell: "bedtools subtract -nonamecheck -a {input.bed} -b {input.hap} > {output.bed}" |
174 175 | shell: "grep chrM {input.bed} > {output.bed}" |
188 189 | shell: "grep -vP 'chrY|chrM' {input.bed} > {output.bed}" |
228 229 | shell: "wget --quiet -O {output.map} -o /dev/null https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz" |
249 250 | shell: "gunzip -c {input.map} | awk 'NR==1 || $1==\"{params.chr}\" {{ print $2,$3,$4 }}' > {output.map}" |
267 268 269 | shell: "wget --quiet -O {output.tar} -o /dev/null https://github.com/odelaneau/shapeit4/raw/master/maps/genetic_maps.b38.tar.gz && " "mkdir -p {params.path} && tar -xzf {output.tar} -C {params.path}" |
280 281 | shell: "wget --quiet -O {output.txt} -o /dev/null ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt" |
293 294 295 | shell: "wget --quiet -O {output.vcf} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz && " "wget --quiet -O {output.tbi} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz.tbi" |
52 53 54 55 56 57 58 | shell: "fastp " " --in1 {input.fastq}" " --out1 {output.fastq}" " --thread {threads}" " --json {output.json}" " --html {output.html} 2> {log}" |
82 83 84 85 86 87 88 89 90 | shell: "fastp " " --in1 {input.fastq_r1}" " --in2 {input.fastq_r2}" " --out1 {output.fastq_r1}" " --out2 {output.fastq_r2}" " --thread {threads}" " --json {output.json}" " --html {output.html} 2> {log}" |
112 113 114 115 116 117 118 119 120 | shell: "( bwa mem -Y " " -K 100000000 " " -t {threads} " " -R '{params.rg}' " " {input.ref} " " {input.fastq} | " " samtools view -Shb -o {output.bam} - " ") 2> {log}" |
143 144 145 146 147 148 149 150 151 152 | shell: "( bwa mem -Y " " -K 100000000 " " -t {threads} " " -R '{params.rg}' " " {input.ref} " " {input.fastq_r1} " " {input.fastq_r1} | " " samtools view -Shb -o {output.bam} - " ") 2> {log}" |
171 172 173 174 175 176 177 178 179 180 | shell: "picard" " -Xmx{resources.mem_mb}m" " FixMateInformation" " MAX_RECORDS_IN_RAM=2000000" " VALIDATION_STRINGENCY=SILENT" " ADD_MATE_CIGAR=True" " ASSUME_SORTED=true" " I={input.bam}" " O={output.bam} 2> {log}" |
219 220 221 222 223 224 225 226 227 228 229 230 | shell: "picard" " -XX:ConcGCThreads=1" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " MergeSamFiles" " USE_THREADING=true" " MAX_RECORDS_IN_RAM=2000000" " VALIDATION_STRINGENCY=SILENT" " SORT_ORDER=queryname" " {params.bams}" " OUTPUT={output.bam} 2> {log}" |
251 252 253 254 255 256 257 258 259 260 261 | shell: "picard" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " SortSam" " MAX_RECORDS_IN_RAM=2000000" " VALIDATION_STRINGENCY=SILENT" " SORT_ORDER=coordinate" " CREATE_INDEX=true" " I={input.bam}" " O={output.bam} 2> {log}" |
285 286 287 288 289 290 291 292 293 294 295 | shell: "picard" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " MarkDuplicates" " MAX_RECORDS_IN_RAM=2000000" " VALIDATION_STRINGENCY=SILENT" " METRICS_FILE={output.met}" " INPUT={input.bam}" " OUTPUT={output.bam} 2> {log} && " "picard BuildBamIndex I={output.bam} O={output.bai} 2> /dev/null" |
325 326 327 328 329 330 331 332 333 334 335 336 337 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -T BaseRecalibrator" " --downsample_to_fraction 0.1" " --num_cpu_threads_per_data_thread {threads}" " --preserve_qscores_less_than 6" " -R {input.ref}" " -o {output.tbl}" " -I {input.bam}" " -knownSites {input.snps}" " -knownSites {input.indels}" " -knownSites {input.mills} 2> {log}" |
363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -T PrintReads" " --num_cpu_threads_per_data_thread {threads}" " --disable_indel_quals" " --preserve_qscores_less_than 6" " {params.flags}" " -SQQ 10" " -SQQ 20" " -SQQ 30" " -rf BadCigar" " -R {input.ref}" " -o {output.bam}" " -I {input.bam}" " -BQSR {input.tbl} 2> {log}" |
395 396 397 398 399 400 401 | shell: "samtools view" " --cram" " --reference {input.ref}" " --write-index" " --output {output.cram}" " {input.bam}" |
53 54 55 56 57 58 59 | shell: "picard" " -Xmx{resources.mem_mb}m" " CollectMultipleMetrics " " R={input.ref}" " I={input.cram}" " O={params.prefix} 2> {log}" |
81 82 83 84 85 86 87 | shell: "picard" " -Xmx{resources.mem_mb}m" " CollectWgsMetrics " " R={input.ref}" " I={input.cram}" " O={output.txt} 2> {log}" |
102 103 | shell: "samtools idxstats {input.cram} > {output.idx}" |
116 117 118 119 120 | shell: "awk '" ' $1=="chrX" {{ xcov=$3/$2 }}; ' ' $1=="chrY" {{ ycov=$3/$2 }}; ' " END {{ print xcov/ycov }}' {input.idx} > {output.xyr}" |
42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | shell: "gatk3" " -XX:ConcGCThreads=1" " -Xmx{resources.mem_mb}m" " -T HaplotypeCaller" " --genotyping_mode DISCOVERY" " -A AlleleBalanceBySample" " -A DepthPerAlleleBySample" " -A DepthPerSampleHC" " -A InbreedingCoeff" " -A MappingQualityZeroBySample" " -A StrandBiasBySample" " -A Coverage" " -A FisherStrand" " -A HaplotypeScore" " -A MappingQualityRankSumTest" " -A MappingQualityZero" " -A QualByDepth" " -A RMSMappingQuality" " -A ReadPosRankSumTest" " -A VariantType" " --logging_level INFO" " --emitRefConfidence GVCF" " -rf BadCigar" " --variant_index_parameter 128000" " --variant_index_type LINEAR" " --sample_ploidy {wildcards.ploidy}" " --intervals {input.region}" " -R {input.ref}" " --num_cpu_threads_per_data_thread 1" " -I {input.cram}" " -o {output.vcf} 2> {log}" |
109 110 111 112 113 114 115 116 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -T CombineGVCFs" " -R {input.ref}" " --variant {input.vcf1}" " --variant {input.vcf2}" " -o {output.vcf} 2> {log}" |
62 63 64 65 66 67 68 69 70 71 72 | shell: "gatk3" " -XX:ConcGCThreads=1" " -Xms{resources.mem_mb}m" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T CombineGVCFs" " -R {input.ref}" " -L {input.chr}" " {params.gvcfs}" " -o {output.vcf} &> {log}" |
112 113 114 115 116 117 118 119 120 121 122 | shell: "gatk3" " -XX:ConcGCThreads=1" " -Xms{resources.mem_mb}m" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T CombineGVCFs" " -R {input.ref}" " -L {input.chr}" " {params.gvcfs}" " -o {output.vcf} &> {log}" |
60 61 62 63 64 65 66 67 68 69 70 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T GenotypeGVCFs" " -R {input.ref}" " -L {input.chr}" " --num_threads {threads}" " --disable_auto_index_creation_and_locking_when_reading_rods" " {params.gvcfs}" " -o {output.vcf} 2> {log}" |
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T VariantRecalibrator" " -R {input.ref}" " --num_threads {threads}" " {params.vcfs}" " -mode SNP" " -recalFile {output.recal}" " -tranchesFile {output.tranche}" " -rscriptFile {output.plot}" " -resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hap}" " -resource:omni,known=false,training=true,truth=true,prior=12.0 {input.omni}" " -resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.snps}" " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}" " -an QD" " -an MQ" " -an FS" " -an MQRankSum" " -an ReadPosRankSum" " -an SOR" " -an DP" " -tranche 100.0" " -tranche 99.8" " -tranche 99.6" " -tranche 99.4" " -tranche 99.2" " -tranche 99.0" " -tranche 95.0" " -tranche 90.0 2> {log}" |
163 164 165 166 167 168 169 170 171 172 173 174 175 176 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T ApplyRecalibration" " -R {input.ref}" " -L {input.chr}" " --num_threads {threads}" " -input {input.vcf}" " -mode SNP" " --ts_filter_level 99.80" " -recalFile {input.recal}" " -tranchesFile {input.tranche}" " -o {output.vcf} 2> {log}" |
205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T VariantRecalibrator" " -R {input.ref}" " --num_threads {threads}" " {params.vcfs}" " -mode INDEL" " -recalFile {output.recal}" " -tranchesFile {output.tranche}" " -rscriptFile {output.plot}" " -resource:mills,known=true,training=true,truth=true,prior=12.0 {input.mills}" " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}" " -an QD" " -an FS" " -an ReadPosRankSum" " -an MQRankSum" " -an SOR" " -an DP" " -tranche 100.0" " -tranche 99.0" " -tranche 95.0" " -tranche 92.0" " -tranche 90.0" " --maxGaussians 4 2> {log}" |
257 258 259 260 261 262 263 264 265 266 267 268 269 270 | shell: "gatk3" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T ApplyRecalibration" " -R {input.ref}" " -L {input.chr}" " --num_threads {threads}" " -input {input.vcf}" " -mode INDEL" " --ts_filter_level 99.0" " -recalFile {input.recal}" " -tranchesFile {input.tranche}" " -o {output.vcf} 2> {log}" |
292 293 294 295 296 297 298 299 | shell: "( bcftools norm" " --fasta-ref {input.ref}" " --atomize" " --multiallelics +snps" " -Oz -o {output.vcf} {input.vcf} && " " bcftools index --tbi {output.vcf} " ")2> {log}" |
336 337 | shell: """awk '{{ print $4","$3","$2 }}' {input.ped} > {output.csv}""" |
362 363 364 365 366 367 | shell: "( bcftools annotate -a {input.dbsnp} -c ID -Ou {input.vcf} | " " bcftools +fill-tags -- --tags all,F_MISSING --samples-file {input.super} | " " bcftools +mendelian --mode a --trio-file {input.trios} -Oz -o {output.vcf} && " " bcftools index --tbi {output.vcf} " ") 2> {log} " |
402 403 404 405 | shell: "( bcftools view --include 'FILTER=\"PASS\" & F_MISSING<0.05 & ({params.hwe}) & MERR<{params.max_merr} & MAC>=2' -Oz -o {output.vcf} {input.vcf} && " " bcftools index --tbi {output.vcf} " ") 2> {log} " |
43 44 45 | shell: "bcftools view --samples '{wildcards.sample}' -Oz -o {output.vcf} {input.vcf} && " "bcftools index --tbi {output.vcf}" |
66 67 68 | shell: "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed1} -Oz -o {output.vcf1} {input.vcf} && bcftools index --tbi {output.vcf1} && " "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed2} -Oz -o {output.vcf2} {input.vcf} && bcftools index --tbi {output.vcf2}" |
94 95 96 97 98 99 100 101 102 103 | shell: "whatshap phase" " --reference {input.ref}" " --tag PS" " --indels" " --ignore-read-groups" " --output {output.vcf}" " {input.vcf}" " {input.cram} 2> {log} && " "bcftools index --tbi {output.vcf}" |
138 139 140 141 142 143 144 145 146 147 | shell: "whatshap phase" " --reference {input.ref}" " --tag PS" " --indels" " --ignore-read-groups" " --output {output.vcf}" " {input.vcf}" " {input.cram} {input.phase10x} 2> {log} && " "bcftools index --tbi {output.vcf}" |
171 172 173 | shell: "bcftools concat --allow-overlaps -Oz -o {output.vcf} {input.vcf1} {input.vcf2} && " "bcftools index --tbi {output.vcf}" |
228 229 230 231 232 233 | shell: "split -d --number=l/{threads} {input.list} {params.prefix}- && " "parallel -j {threads} 'bcftools merge --file-list {{}} -Ob -o {{}}.bcf' ::: {params.prefix}-* && " "bcftools merge --threads 8 -Oz -o {output.vcf} {params.prefix}-*.bcf && " "bcftools index --tbi {output.vcf} && " "rm {params.prefix}-*" |
35 36 | shell: """awk '$1=="{wildcards.family}"' {input.ped} > {output.ped}""" |
58 59 60 | shell: "bcftools view --samples '{params.samples}' -Oz -o {output.vcf} {input.vcf} && " "bcftools index --tbi {output.vcf}" |
91 92 93 94 95 96 97 98 99 100 101 102 | shell: "whatshap phase" " --reference {input.ref}" " --chromosome {wildcards.chr}" " --genmap {input.map}" " --ped {input.ped}" " --use-ped-samples" " --tag PS" " --indels" " --output {output.vcf}" " {input.vcf} 2> {log} && " "bcftools index --tbi {output.vcf}" |
149 150 151 152 | shell: "ulimit -n {params.limit} && " "bcftools merge --file-list {input.list} -Oz -o {output.vcf} && " "bcftools index --tbi {output.vcf}" |
45 46 47 48 49 50 51 52 53 54 | shell: "shapeit4" " --thread {threads}" " --input {input.vcf}" " --map {input.map}" " --region {wildcards.chr}" " --sequencing" " --output {output.vcf}" " &> {log} && " "bcftools index --tbi {output.vcf}" |
84 85 86 87 88 89 90 91 92 93 94 | shell: "shapeit4" " --thread {threads}" " --input {input.vcf1}" " --scaffold {input.vcf2}" " --map {input.map}" " --region {wildcards.chr}" " --sequencing" " --output {output.vcf}" " &> {log} && " "bcftools index --tbi {output.vcf}" |
117 118 119 120 121 122 123 | shell: "bcftools annotate" " --annotations {input.vcf_annot}" " --columns '^INFO/AF'" " --output-type z" " --output {output.vcf} {input.vcf_input} 2> {log} && " "bcftools index --tbi {output.vcf}" |
32 33 34 35 36 37 38 | shell: "vep_install" " --AUTO c" " --SPECIES homo_sapiens" " --ASSEMBLY GRCh38" " --CACHEDIR {output.dir}" " --NO_UPDATE &> {log}" |
66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | shell: "vep" " --offline" " --everything" " --species homo_sapiens" " --assembly GRCh38" " --dir_cache {input.dir}" " --vcf" " --fork {threads}" " --input_file {input.vcf}" " --fasta {input.ref}" " --stats_file {output.htm}" " --compress_output bgzip" " --output_file {output.vcf} &> {log} && " "tabix -p vcf {output.vcf}" |
34 35 36 | shell: "( bcftools norm --atomize -Oz -o {output.vcf} {input.vcf} && " " bcftools index --tbi {output.vcf} ) 2> {log}" |
62 63 64 65 66 67 68 69 70 71 72 73 | shell: "gatk3" " -XX:ConcGCThreads=1" " -Xmx{resources.mem_mb}m" " -Djava.io.tmpdir='{resources.tmpdir}'" " -T VariantEval" " -R {input.ref}" " --eval {input.vcf}" " --dbsnp {input.dbsnp}" " --comp {input.comp}" " --known_names '1000G'" " --out {output.evl} 2> {log}" |
88 89 90 91 92 93 94 95 96 | shell: "wget " " --mirror" " --quiet" " --no-host-directories" " --cut-dirs=6" " --directory-prefix=data/reference/GRCh38/giab-stratifications/v3.0/" " -o /dev/null" " https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/genome-stratifications/v3.0/GRCh38/" |
109 110 111 112 113 114 115 116 117 | shell: "wget " " --mirror" " --quiet" " --no-host-directories" " --cut-dirs=6" " --directory-prefix=data/evaluation/GIAB/NA12878_HG001/" " -o /dev/null" " ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv4.2.1/GRCh38/" |
134 135 136 137 | shell: "( bcftools view -Oz -o {output.vcf} {input.vcf} {wildcards.chr} && " " bcftools index --tbi {output.vcf} " ") 2> {log} " |
152 153 154 155 | shell: "( bcftools view --samples 'NA12878' -Oz -o {output.vcf} {input.vcf} && " " bcftools index --tbi {output.vcf} " ") 2> {log} " |
183 184 185 186 187 188 | shell: "hap.py" " --reference {input.ref}" " --report-prefix {params.prefix}" " {input.vcf_truth}" " {input.vcf_input}" |
94 95 | shell: """grep -P '{params.rgx}' {input.man} | awk '{{ print $3" {params.file}" }}' > {output.md5}""" |
112 113 114 | shell: "wget --quiet -O {output.vcf} -o /dev/null {params.ftp}/{wildcards.sample}.haplotypeCalls.er.raw.{wildcards.ext} && " "md5sum --status --check {input.md5}" |
160 161 162 163 164 165 166 167 | shell: "awk 'NR>1 && $3!=0 && $4!=0' {input.tsv} | " "sed -E 's/SH074|SH089/CHS2/' | " "sed -E 's/Y028|Y117/YRI1/' | " "sed -E 's/m004|m009/MXL1/' | " "sed -E 's/m008|m011/MXL2/' | " "sed -E 's/2467|2469/ASW4/' | " "grep -v 'HG02567' > {output.ped}" |
91 92 93 94 95 96 97 98 | shell: "samtools view" " --expr 'seq=~\"^[ACGTN]+$\"'" " --cram" " --reference {input.ref}" " --write-index" " --output {output.cram}" " {input.cram}" |
117 118 119 | shell: "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && " "samtools index {output.cram}" |
65 66 67 68 69 70 71 72 | shell: "samtools view" " --expr 'seq=~\"^[ACGTN]+$\"'" " --cram" " --reference {input.ref}" " --write-index" " --output {output.cram}" " {input.cram}" |
91 92 93 | shell: "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && " "samtools index {output.cram}" |
60 61 62 63 | shell: "echo '{wildcards.sample}' > {output.sample} && " "bcftools reheader --samples {output.sample} {input.vcf} > {output.vcf} && " "bcftools index --tbi {output.vcf}" |
78 79 80 | shell: "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && " "samtools index {output.cram}" |
Support
- Future updates
Related Workflows





