Pipeline to build a large human reference panel, using publicly available genomes

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human genomes, aligned to GRCh38 .

This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios ; implemented here, with improvements, using snakemake and conda for full reproducibility.

:warning: This pipeline is in active development and subject to ongoing improvements.

Installation

Download the refpanel source code

git clone git@github.com:ekirving/refpanel.git && cd refpanel

This pipeline uses the conda package manager (or the faster mamba front-end) to handle installation of all software dependencies. If you do not already have conda or mamba installed, then please install one first.

Once conda is setup, build and activate a new environment for the refpanel pipeline

conda env create --name refpanel --file environment.yaml

conda activate refpanel

Data sources

This pipeline comes preconfigured to build a joint-callset, called refpanel-v2 (n=5,100), involving all publicly available samples from:

1000 Genomes Project (1000G) , 30x NYGC version (n=3,202; inc. 2,504 phase 3 samples + 698 additional related samples);
The 1000 Genomes Project Consortium. (2015) Nature doi:10.1038/nature15393
Byrska-Bishop et al. (2022) Cell doi:10.1016/j.cell.2022.08.004
Simons Genome Diversity Project (SGDP) (n=256; excluding overlap with 1000G);
Mallick et al. (2016) Nature doi:10.1038/nature18964
Gambian Genome Variation Project (GGVP) (n=400; excluding overlap with 1000G);
Band et al. (2019) Nature Communications doi:10.1038/s41467-019-13480-z
Human Genome Diversity Project (HGDP) (n=816; excluding overlap with SGDP);
Bergström et al. (2020) Science doi:10.1126/science.aay5012
Arabian Peninsula Population Genomic Study (APPG) (n=137);
Almarri et al. (2021) Cell doi:10.1016/j.cell.2021.07.013

Plus additional public genomes from:

Meyer et al. (2012) Science (n=8); doi:10.1126/science.1224344
Mondal et al. (2016) Nature Genetics (n=60); doi:10.1038/ng.3621
Rodriguez-Flores et al. (2016) Genome Research (n=108); doi:10.1101/gr.191478.115
de Barros Damgaard et al. (2018) Science (n=41); 10.1126/science.aar7711
McColl et al. (2018) Science (n=2); doi:10.1126/science.aat3628
Lindo et al. (2018) Science Advances (n=25); 10.1126/sciadv.aau4921
Gelabert et al. (2019) BMC Genomics (n=12); doi:10.1186/s12864-019-5529-0
Serra-Vidal et al. (2019) Current Biology (n=21); doi:10.1016/j.cub.2019.09.050
Lorente-Galdos et al. (2019) Genome Biology (n=9); doi:10.1186/s13059-019-1684-5
Crooks et al. (2020) BMC Genetics (n=3); doi:10.1186/s12863-020-00917-4

The data from these projects is hosted by the International Genome Sample Resource (IGSR) database ( doi:10.1093/nar/gkw829 ) and the European Nucleotide Archive (ENA) ( doi:10.1093/nar/gkq967 ).

If there are publicly available whole-genome sequencing data that you would like incorporated into refpanel-v3 please raise an issue on GitHub with the details of the publication and they will be considered for inclusion in future releases.

If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs .

Downloading data

To ensure all data is processed consistently, refpanel downloads gVCF files for 1000G; CRAM files for 1000G, HGDP, SGDP and GGVP; and fastq files for all other data sources.

To (optionally) pre-fetch all the data dependencies, run:

./refpanel download_data &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log

:warning: These files are very large : Please make sure you have sufficient disk space to store them!

Ancestry composition

Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata.

Superpopulation	Code	Samples
African Ancestry	AFR	1,460
American Ancestry	AMR	589
Central Asian and Siberian Ancestry	CAS	66
Central and South Asian Ancestry	CSA	199
East Asian Ancestry	EAS	826
European Ancestry	EUR	790
Middle Eastern Ancestry	MEA	407
Oceanian Ancestry	OCE	38
South Asian Ancestry	SAS	678
West Eurasian Ancestry	WEA	47

Joint-calling pipeline

In brief, refpanel produces a jointly-called and phased callset via the following steps:

Adapter trimming with fastp (v.0.23.2)
Alignment to GRCh38 with bwa mem (v0.7.15)
Fix-mate , merge , sort , and mark duplicates with picard (v2.5.0)
Base recalibration with gatk BaseRecalibrator (v3.5)
Conversion to cram with samtools (v1.14)
Per-sample calling of gVCFs with gatk HaplotypeCaller (with sex-dependent ploidy )
Merging samples with gatk CombineGVCFs
Joint-calling of all samples with gatk GenotypeGVCFs
Variant quality score recalibration with gatk VariantRecalibrator
Annotation with dbSNP build 155 with bcftools (v1.14)
Hard-filtering of SNPs and INDELs with bcftools :
1. VQSR PASS;
2. GT missingness < 5%;
3. HWE p-value > 1e-10 in at least one super-population;
4. Mendelian error rate < 5%, using trios from 1000G (n=602) and GGVP (n=133);
5. MAC ≥ 2 (i.e., no singletons);
Read-based phasing with whatshap (v1.2.1) using:
- Illumina paired-end reads from all projects;
- 10x Genomics linked-read sequencing from HGDP (n=26) and APPG (n=137);
Pedigree phasing with whatshap using:
- Trios from 1000G (n=602) and GGVP (n=130);
Statistical phasing with shapeit4 (v4.2.2)
Variant effect prediction with ensembl-vep (v105.0)

For more information, refer to the DAG of the rule graph or the code itself.

Running the pipeline

To execute the full pipeline, end-to-end, run:

./refpanel &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log

:warning: This will take a long time : Please make sure you run this on a server with as many CPUs, and as much RAM, as possible (e.g., this pipeline was developed and run on a cluster of nodes, each with 96 cores and 755Gb of RAM each).

The pipeline can also be broken down into separate steps , for distribution across multiple nodes in a cluster.

Code Snippets

shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=5"
    " --directory-prefix=data/reference/GRCh38/"
    " -o /dev/null"
    " ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/"

SnakeMake From line 38 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"

SnakeMake From line 56 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"

SnakeMake From line 69 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.vcf} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi && "
    "gunzip --test {output.vcf}"

SnakeMake From line 82 of rules/01-reference.smk

shell:
    "gunzip -c {input.vcf} | "
    " sed 's/POS=POS-1/POS_POS-1/g' | "
    " sed '/^#/!s/ /_/g' | "
    " bgzip > {output.vcf} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 104 of rules/01-reference.smk

shell:
    "bedtools subtract -nonamecheck -a {input.bed} -b {input.hap} > {output.bed}"

SnakeMake BEDTools From line 160 of rules/01-reference.smk

shell:
    "grep chrM {input.bed} > {output.bed}"

SnakeMake From line 174 of rules/01-reference.smk

shell:
    "grep -vP 'chrY|chrM' {input.bed} > {output.bed}"

SnakeMake From line 188 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.map} -o /dev/null https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz"

SnakeMake From line 228 of rules/01-reference.smk

shell:
    "gunzip -c {input.map} | awk 'NR==1 || $1==\"{params.chr}\" {{ print $2,$3,$4 }}' > {output.map}"

SnakeMake From line 249 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.tar} -o /dev/null https://github.com/odelaneau/shapeit4/raw/master/maps/genetic_maps.b38.tar.gz && "
    "mkdir -p {params.path} && tar -xzf {output.tar} -C {params.path}"

SnakeMake From line 267 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.txt} -o /dev/null ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt"

SnakeMake From line 280 of rules/01-reference.smk

shell:
    "wget --quiet -O {output.vcf} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz && "
    "wget --quiet -O {output.tbi} -o /dev/null ftp://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz.tbi"

SnakeMake From line 293 of rules/01-reference.smk

shell:
    "fastp "
    " --in1 {input.fastq}"
    " --out1 {output.fastq}"
    " --thread {threads}"
    " --json {output.json}"
    " --html {output.html} 2> {log}"

SnakeMake fastp From line 52 of rules/02-align.smk

shell:
    "fastp "
    " --in1 {input.fastq_r1}"
    " --in2 {input.fastq_r2}"
    " --out1 {output.fastq_r1}"
    " --out2 {output.fastq_r2}"
    " --thread {threads}"
    " --json {output.json}"
    " --html {output.html} 2> {log}"

SnakeMake fastp From line 82 of rules/02-align.smk

shell:
    "( bwa mem -Y "
    "   -K 100000000 "
    "   -t {threads} "
    "   -R '{params.rg}' "
    "   {input.ref} "
    "   {input.fastq} | "
    "  samtools view -Shb -o {output.bam} - "
    ") 2> {log}"

SnakeMake SAMtools BWA From line 112 of rules/02-align.smk

shell:
    "( bwa mem -Y "
    "   -K 100000000 "
    "   -t {threads} "
    "   -R '{params.rg}' "
    "   {input.ref} "
    "   {input.fastq_r1} "
    "   {input.fastq_r1} | "
    "  samtools view -Shb -o {output.bam} - "
    ") 2> {log}"

SnakeMake SAMtools BWA From line 143 of rules/02-align.smk

shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " FixMateInformation"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " ADD_MATE_CIGAR=True"
    " ASSUME_SORTED=true"
    " I={input.bam}"
    " O={output.bam} 2> {log}"

SnakeMake Picard From line 171 of rules/02-align.smk

shell:
    "picard"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " MergeSamFiles"
    " USE_THREADING=true"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " SORT_ORDER=queryname"
    " {params.bams}"
    " OUTPUT={output.bam} 2> {log}"

SnakeMake Picard From line 219 of rules/02-align.smk

shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " SortSam"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " SORT_ORDER=coordinate"
    " CREATE_INDEX=true"
    " I={input.bam}"
    " O={output.bam} 2> {log}"

SnakeMake Picard From line 251 of rules/02-align.smk

shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " MarkDuplicates"
    " MAX_RECORDS_IN_RAM=2000000"
    " VALIDATION_STRINGENCY=SILENT"
    " METRICS_FILE={output.met}"
    " INPUT={input.bam}"
    " OUTPUT={output.bam} 2> {log} && "
    "picard BuildBamIndex I={output.bam} O={output.bai} 2> /dev/null"

SnakeMake Picard From line 285 of rules/02-align.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T BaseRecalibrator"
    " --downsample_to_fraction 0.1"
    " --num_cpu_threads_per_data_thread {threads}"
    " --preserve_qscores_less_than 6"
    " -R {input.ref}"
    " -o {output.tbl}"
    " -I {input.bam}"
    " -knownSites {input.snps}"
    " -knownSites {input.indels}"
    " -knownSites {input.mills} 2> {log}"

SnakeMake gatk From line 325 of rules/02-align.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T PrintReads"
    " --num_cpu_threads_per_data_thread {threads}"
    " --disable_indel_quals"
    " --preserve_qscores_less_than 6"
    " {params.flags}"
    " -SQQ 10"
    " -SQQ 20"
    " -SQQ 30"
    " -rf BadCigar"
    " -R {input.ref}"
    " -o {output.bam}"
    " -I {input.bam}"
    " -BQSR {input.tbl} 2> {log}"

SnakeMake gatk From line 363 of rules/02-align.smk

shell:
    "samtools view"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.bam}"

SnakeMake SAMtools From line 395 of rules/02-align.smk

shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " CollectMultipleMetrics "
    " R={input.ref}"
    " I={input.cram}"
    " O={params.prefix} 2> {log}"

SnakeMake Picard From line 53 of rules/03-qc-metrics.smk

shell:
    "picard"
    " -Xmx{resources.mem_mb}m"
    " CollectWgsMetrics "
    " R={input.ref}"
    " I={input.cram}"
    " O={output.txt} 2> {log}"

SnakeMake Picard From line 81 of rules/03-qc-metrics.smk

shell:
    "samtools idxstats {input.cram} > {output.idx}"

SnakeMake SAMtools From line 102 of rules/03-qc-metrics.smk

shell:
    "awk '"
    ' $1=="chrX" {{ xcov=$3/$2 }}; '
    ' $1=="chrY" {{ ycov=$3/$2 }}; '
    " END {{ print xcov/ycov }}' {input.idx} > {output.xyr}"

SnakeMake From line 116 of rules/03-qc-metrics.smk

shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -T HaplotypeCaller"
    " --genotyping_mode DISCOVERY"
    " -A AlleleBalanceBySample"
    " -A DepthPerAlleleBySample"
    " -A DepthPerSampleHC"
    " -A InbreedingCoeff"
    " -A MappingQualityZeroBySample"
    " -A StrandBiasBySample"
    " -A Coverage"
    " -A FisherStrand"
    " -A HaplotypeScore"
    " -A MappingQualityRankSumTest"
    " -A MappingQualityZero"
    " -A QualByDepth"
    " -A RMSMappingQuality"
    " -A ReadPosRankSumTest"
    " -A VariantType"
    " --logging_level INFO"
    " --emitRefConfidence GVCF"
    " -rf BadCigar"
    " --variant_index_parameter 128000"
    " --variant_index_type LINEAR"
    " --sample_ploidy {wildcards.ploidy}"
    " --intervals {input.region}"
    " -R {input.ref}"
    " --num_cpu_threads_per_data_thread 1"
    " -I {input.cram}"
    " -o {output.vcf} 2> {log}"

SnakeMake gatk From line 42 of rules/04-call.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " --variant {input.vcf1}"
    " --variant {input.vcf2}"
    " -o {output.vcf} 2> {log}"

SnakeMake gatk From line 109 of rules/04-call.smk

shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xms{resources.mem_mb}m"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " {params.gvcfs}"
    " -o {output.vcf} &> {log}"

SnakeMake gatk From line 62 of rules/05-merge-samples.smk

shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xms{resources.mem_mb}m"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T CombineGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " {params.gvcfs}"
    " -o {output.vcf} &> {log}"

SnakeMake gatk From line 112 of rules/05-merge-samples.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T GenotypeGVCFs"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " --disable_auto_index_creation_and_locking_when_reading_rods"
    " {params.gvcfs}"
    " -o {output.vcf} 2> {log}"

SnakeMake gatk From line 60 of rules/06-joint-call.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantRecalibrator"
    " -R {input.ref}"
    " --num_threads {threads}"
    " {params.vcfs}"
    " -mode SNP"
    " -recalFile {output.recal}"
    " -tranchesFile {output.tranche}"
    " -rscriptFile {output.plot}"
    " -resource:hapmap,known=false,training=true,truth=true,prior=15.0 {input.hap}"
    " -resource:omni,known=false,training=true,truth=true,prior=12.0 {input.omni}"
    " -resource:1000G,known=false,training=true,truth=false,prior=10.0 {input.snps}"
    " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}"
    " -an QD"
    " -an MQ"
    " -an FS"
    " -an MQRankSum"
    " -an ReadPosRankSum"
    " -an SOR"
    " -an DP"
    " -tranche 100.0"
    " -tranche 99.8"
    " -tranche 99.6"
    " -tranche 99.4"
    " -tranche 99.2"
    " -tranche 99.0"
    " -tranche 95.0"
    " -tranche 90.0 2> {log}"

SnakeMake gatk From line 106 of rules/06-joint-call.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T ApplyRecalibration"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " -input {input.vcf}"
    " -mode SNP"
    " --ts_filter_level 99.80"
    " -recalFile {input.recal}"
    " -tranchesFile {input.tranche}"
    " -o {output.vcf} 2> {log}"

SnakeMake gatk From line 163 of rules/06-joint-call.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantRecalibrator"
    " -R {input.ref}"
    " --num_threads {threads}"
    " {params.vcfs}"
    " -mode INDEL"
    " -recalFile {output.recal}"
    " -tranchesFile {output.tranche}"
    " -rscriptFile {output.plot}"
    " -resource:mills,known=true,training=true,truth=true,prior=12.0 {input.mills}"
    " -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {input.dbsnp}"
    " -an QD"
    " -an FS"
    " -an ReadPosRankSum"
    " -an MQRankSum"
    " -an SOR"
    " -an DP"
    " -tranche 100.0"
    " -tranche 99.0"
    " -tranche 95.0"
    " -tranche 92.0"
    " -tranche 90.0"
    " --maxGaussians 4 2> {log}"

SnakeMake gatk From line 205 of rules/06-joint-call.smk

shell:
    "gatk3"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T ApplyRecalibration"
    " -R {input.ref}"
    " -L {input.chr}"
    " --num_threads {threads}"
    " -input {input.vcf}"
    " -mode INDEL"
    " --ts_filter_level 99.0"
    " -recalFile {input.recal}"
    " -tranchesFile {input.tranche}"
    " -o {output.vcf} 2> {log}"

SnakeMake gatk From line 257 of rules/06-joint-call.smk

shell:
    "( bcftools norm"
    "   --fasta-ref {input.ref}"
    "   --atomize"
    "   --multiallelics +snps"
    "   -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ")2> {log}"

SnakeMake BCFtools From line 292 of rules/06-joint-call.smk

shell:
    """awk '{{ print $4","$3","$2 }}' {input.ped} > {output.csv}"""

SnakeMake From line 336 of rules/06-joint-call.smk

shell:
    "( bcftools annotate -a {input.dbsnp} -c ID -Ou {input.vcf} | "
    "  bcftools +fill-tags -- --tags all,F_MISSING --samples-file {input.super} | "
    "  bcftools +mendelian --mode a --trio-file {input.trios} -Oz -o {output.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "

SnakeMake BCFtools From line 362 of rules/06-joint-call.smk

shell:
    "( bcftools view --include 'FILTER=\"PASS\" & F_MISSING<0.05 & ({params.hwe}) & MERR<{params.max_merr} & MAC>=2' -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "

SnakeMake BCFtools From line 402 of rules/06-joint-call.smk

shell:
    "bcftools view --samples '{wildcards.sample}' -Oz -o {output.vcf} {input.vcf} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 43 of rules/07-phase-reads.smk

shell:
    "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed1} -Oz -o {output.vcf1} {input.vcf} && bcftools index --tbi {output.vcf1} && "
    "bcftools view --samples '{wildcards.sample}' --regions-file {input.bed2} -Oz -o {output.vcf2} {input.vcf} && bcftools index --tbi {output.vcf2}"

SnakeMake BCFtools From line 66 of rules/07-phase-reads.smk

shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --tag PS"
    " --indels"
    " --ignore-read-groups"
    " --output {output.vcf}"
    " {input.vcf}"
    " {input.cram} 2> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools WhatsHap From line 94 of rules/07-phase-reads.smk

shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --tag PS"
    " --indels"
    " --ignore-read-groups"
    " --output {output.vcf}"
    " {input.vcf}"
    " {input.cram} {input.phase10x} 2> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools WhatsHap From line 138 of rules/07-phase-reads.smk

shell:
    "bcftools concat --allow-overlaps  -Oz -o {output.vcf} {input.vcf1} {input.vcf2} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 171 of rules/07-phase-reads.smk

shell:
    "split -d --number=l/{threads} {input.list} {params.prefix}- && "
    "parallel -j {threads} 'bcftools merge --file-list {{}} -Ob -o {{}}.bcf' ::: {params.prefix}-* && "
    "bcftools merge --threads 8 -Oz -o {output.vcf} {params.prefix}-*.bcf && "
    "bcftools index --tbi {output.vcf} && "
    "rm {params.prefix}-*"

SnakeMake BCFtools From line 228 of rules/07-phase-reads.smk

shell:
    """awk '$1=="{wildcards.family}"' {input.ped} > {output.ped}"""

SnakeMake From line 35 of rules/08-phase-trios.smk

shell:
    "bcftools view --samples '{params.samples}' -Oz -o {output.vcf} {input.vcf} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 58 of rules/08-phase-trios.smk

shell:
    "whatshap phase"
    " --reference {input.ref}"
    " --chromosome {wildcards.chr}"
    " --genmap {input.map}"
    " --ped {input.ped}"
    " --use-ped-samples"
    " --tag PS"
    " --indels"
    " --output {output.vcf}"
    " {input.vcf} 2> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools WhatsHap From line 91 of rules/08-phase-trios.smk

shell:
    "ulimit -n {params.limit} && "
    "bcftools merge --file-list {input.list} -Oz -o {output.vcf} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 149 of rules/08-phase-trios.smk

shell:
    "shapeit4"
    " --thread {threads}"
    " --input {input.vcf}"
    " --map {input.map}"
    " --region {wildcards.chr}"
    " --sequencing"
    " --output {output.vcf}"
    " &> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools shapeit4 From line 45 of rules/09-phase-stat.smk

shell:
    "shapeit4"
    " --thread {threads}"
    " --input {input.vcf1}"
    " --scaffold {input.vcf2}"
    " --map {input.map}"
    " --region {wildcards.chr}"
    " --sequencing"
    " --output {output.vcf}"
    " &> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools shapeit4 From line 84 of rules/09-phase-stat.smk

shell:
    "bcftools annotate"
    " --annotations {input.vcf_annot}"
    " --columns '^INFO/AF'"
    " --output-type z"
    " --output {output.vcf} {input.vcf_input} 2> {log} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 117 of rules/09-phase-stat.smk

shell:
    "vep_install"
    " --AUTO c"
    " --SPECIES homo_sapiens"
    " --ASSEMBLY GRCh38"
    " --CACHEDIR {output.dir}"
    " --NO_UPDATE &> {log}"

SnakeMake From line 32 of rules/10-ensembl-vep.smk

shell:
    "vep"
    " --offline"
    " --everything"
    " --species homo_sapiens"
    " --assembly GRCh38"
    " --dir_cache {input.dir}"
    " --vcf"
    " --fork {threads}"
    " --input_file {input.vcf}"
    " --fasta {input.ref}"
    " --stats_file {output.htm}"
    " --compress_output bgzip"
    " --output_file {output.vcf} &> {log} && "
    "tabix -p vcf {output.vcf}"

SnakeMake tabix Variant Effect Predictor (VEP) From line 66 of rules/10-ensembl-vep.smk

shell:
    "( bcftools norm --atomize -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} ) 2> {log}"

SnakeMake BCFtools From line 34 of rules/11-variant-evaluation.smk

shell:
    "gatk3"
    " -XX:ConcGCThreads=1"
    " -Xmx{resources.mem_mb}m"
    " -Djava.io.tmpdir='{resources.tmpdir}'"
    " -T VariantEval"
    " -R {input.ref}"
    " --eval {input.vcf}"
    " --dbsnp {input.dbsnp}"
    " --comp {input.comp}"
    " --known_names '1000G'"
    " --out {output.evl} 2> {log}"

SnakeMake gatk From line 62 of rules/11-variant-evaluation.smk

shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=6"
    " --directory-prefix=data/reference/GRCh38/giab-stratifications/v3.0/"
    " -o /dev/null"
    " https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/genome-stratifications/v3.0/GRCh38/"

SnakeMake From line 88 of rules/11-variant-evaluation.smk

shell:
    "wget "
    " --mirror"
    " --quiet"
    " --no-host-directories"
    " --cut-dirs=6"
    " --directory-prefix=data/evaluation/GIAB/NA12878_HG001/"
    " -o /dev/null"
    " ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv4.2.1/GRCh38/"

SnakeMake From line 109 of rules/11-variant-evaluation.smk

shell:
    "( bcftools view -Oz -o {output.vcf} {input.vcf} {wildcards.chr} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "

SnakeMake BCFtools From line 134 of rules/11-variant-evaluation.smk

shell:
    "( bcftools view --samples 'NA12878' -Oz -o {output.vcf} {input.vcf} && "
    "  bcftools index --tbi {output.vcf} "
    ") 2> {log} "

SnakeMake BCFtools From line 152 of rules/11-variant-evaluation.smk

shell:
    "hap.py"
    " --reference {input.ref}"
    " --report-prefix {params.prefix}"
    " {input.vcf_truth}"
    " {input.vcf_input}"

SnakeMake From line 183 of rules/11-variant-evaluation.smk

shell:
    """grep -P '{params.rgx}' {input.man} | awk '{{ print $3" {params.file}" }}' > {output.md5}"""

SnakeMake From line 94 of data/1000g.smk

shell:
    "wget --quiet -O {output.vcf} -o /dev/null {params.ftp}/{wildcards.sample}.haplotypeCalls.er.raw.{wildcards.ext} && "
    "md5sum --status --check {input.md5}"

SnakeMake From line 112 of data/1000g.smk

shell:
    "awk 'NR>1 && $3!=0 && $4!=0' {input.tsv} | "
    "sed -E 's/SH074|SH089/CHS2/' | "
    "sed -E 's/Y028|Y117/YRI1/' | "
    "sed -E 's/m004|m009/MXL1/' | "
    "sed -E 's/m008|m011/MXL2/' | "
    "sed -E 's/2467|2469/ASW4/' | "
    "grep -v 'HG02567' > {output.ped}"

SnakeMake From line 160 of data/1000g.smk

shell:
    "samtools view"
    " --expr 'seq=~\"^[ACGTN]+$\"'"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.cram}"

SnakeMake SAMtools From line 91 of data/ggvp.smk

shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"

SnakeMake SAMtools From line 117 of data/ggvp.smk

shell:
    "samtools view"
    " --expr 'seq=~\"^[ACGTN]+$\"'"
    " --cram"
    " --reference {input.ref}"
    " --write-index"
    " --output {output.cram}"
    " {input.cram}"

SnakeMake SAMtools From line 65 of data/hgdp.smk

shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"

SnakeMake SAMtools From line 91 of data/hgdp.smk

shell:
    "echo '{wildcards.sample}' > {output.sample} && "
    "bcftools reheader --samples {output.sample} {input.vcf} > {output.vcf} && "
    "bcftools index --tbi {output.vcf}"

SnakeMake BCFtools From line 60 of data/phase10x.smk

shell:
    "samtools reheader --command \"sed 's/SM:[^\\t]*/SM:{wildcards.sample}/g'\" {input.cram} > {output.cram} && "
    "samtools index {output.cram}"