Perform error-prone data preparation for whole genome regression modeling

public public 1yr ago 0 bookmarks

istari 🔬

An awesome whole genome regression modeling pipeline


This is the home of the pipeline, istari. Its long-term goals: to perform error-prone data preparation for whole genome regression modeling using regenie like no pipeline before!

Overview

Welcome to istari! Before getting started, we highly recommend reading through istari's documentation .

The ./istari pipeline is composed several inter-related sub commands to setup and run the pipeline across different systems. Each of the available sub commands perform different functions:

istari is a comprehensive pipeline that performs error-prone data preparation steps for genome-wide association studies (GWAS) optimized for WES and WGS. It relies on technologies like Singularity1 to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2 , a flexible and scalable workflow management system, to submit jobs to a cluster.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. This pipeline prepares data for GWAS by converting file formats, performing QC filtering, preparing phenotypes, covariates, and files for regenie3 . As input, it accepts a gVCF file (note: file must have corresponding index file), phenotype file, and covariates file. The pipeline validates phenotype and covariate files. If sex and/or ancestry information is not present in the covariates file, the pipeline runs Somalier4 to estimate sex and ancestry and adds this information to the covariates file. The pipeline then uses Ensembl's Variant Effect Predictor5 (VEP) to annotate variants. The pipeline then filters variants using Slivar , which utilizes the Genome Aggregation 190 Database6 (gnomAD) popMax AF. By default, this pipeline include variants with a population allele frequency ≤ 0.01 and ≤ 0.05 (1% or 5% in gnomAD) and predicted to be functionally impactful by Slivar, but it is set up so these thresholds can be adjusted. Then, the pipeline performs pre-association test QC filtering for regenie step 1 based on minor allele frequency (MAF), minor allele count (MAC), linkage disequilibrium (LD), and genotype and sample missingness using GATK and plink7,8 . By default, certain filtering thresholds are set, but these thresholds and pruning settings can be adjusted. Note that the regenie developers do not recommend using >1M SNPs for step 1. The pipeline then creates the annotation file, mask file, and list file, which are used as input for gene-based tests in regenie step 2 gene-based tests. The pipeline can submit jobs to a cluster using a job scheduler like SLURM (more coming soon!). A hybrid approach ensures the pipeline is accessible to all users.

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or trouble-shooting a problem, please checkout our FAQ prior to opening an issue on Github .

Dependencies

Requires: singularity=3.10.5 snakemake=7.19.1

At the current moment, the pipeline uses a mixture of environment modules and docker images; however, this will be changing soon! In the very near future, the pipeline will only use docker images. With that being said, snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step of the pipeline will rely on versioned images from DockerHub . Snakemake uses singularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity will be the only two dependencies in the future.

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/OpenOmics/istari.git
# Change your working directory
cd istari/
# Add dependencies to $PATH
# Biowulf users should run
module load snakemake singularity
# Get usage information
./istari -h

Contribute

This site is a living document, created for and by members like you. istari is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository .

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.
3. Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., et al. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53 (7), 1097–1103.
4. Pedersen, B.S., Bhetariya, P.J., Brown, J., Kravitz, S.N., Marth, G., Jensen, R.L., Bronner, M.P., Underhill, H.R., and Quinlan, A.R. (2020). Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches.Genome Med. 12, 62.
5. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. (2016). The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122. 6. Karczewski, K.J., Francioli, L.C., Tiao, G. et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly MJ, Sham PC (2007) PLINK: A Tool Set for Whole-Genome and Population-Based Linkage Analyses. American Journal of Human Genetics, 81.
8. Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4.

Code Snippets

22
23
24
25
26
27
28
29
30
shell:
    """
    set +u
    module load GATK
    gatk SelectVariants -V {input.vcf} -O {params.unzipped_vcf} -R {params.ref} --max-nocall-fraction {params.filt_nocall}
    module load bcftools
    bgzip {params.unzipped_vcf}
    tabix -p vcf {output.cleaned_vcf}
    """
72
73
74
75
76
77
78
79
80
81
82
shell:
    """
    ml plink/1
    plink --vcf {input.cleaned_vcf} --vcf-half-call m --set-missing-var-ids @:# --not-chr M --geno {params.geno} --maf {params.maf} --double-id --set-hh-missing --update-sex {params.sex} --make-bed --out {params.bed1}
    ml plink/2
    plink2 --bfile {params.prefix} --indep-pairwise {params.ld_window} {params.ld_step} {params.ld_thresh} --make-bed --out {params.ld_bed}
    plink2 --bfile {params.ld_bed} --exclude {params.ld_bed}.prune.out --mac {params.mac} --make-bed --out {params.bed}
    ml plink/1
    plink --bfile {params.bed} --bmerge {params.bed_01}  --make-bed --out {params.step2_01} --update-sex {params.sex}
    plink --bfile {params.bed} --bmerge {params.bed_05}  --make-bed --out {params.step2_05} --update-sex {params.sex}
    """
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
shell:
    """
    regenie \
    --step 1 \
    --bed {params.step1} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --bsize {params.step1_bsize} \
    --threads {threads} \
    --lowmem \
    --gz \
    --nauto 23 \
    --lowmem-prefix tmp_rg \
    --out {params.step1} \

    regenie \
    --step 2 \
    --bed {params.bed_01} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --write-mask-snplist \
    --pred {params.step1}_pred.list \
    --anno-file {input.annot_01} \
    --mask-def {input.mask_01} \
    --vc-tests skat,skato \
    --debug \
    --nauto 23 \
    --bsize {params.step2_bsize} \
    --build-mask sum \
    --check-burden-files \
    --set-list {input.list_01} \
    --strict-check-burden \
    --aaf-bins 0.01 \
    --minMAC 1 \
    --write-samples \
    --out {params.step201} \

    regenie \
    --step 2 \
    --bed {params.bed_05} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --write-mask-snplist \
    --pred {params.step1}_pred.list \
    --anno-file {input.annot_05} \
    --mask-def {input.mask_05} \
    --vc-tests skat,skato \
    --debug \
    --nauto 23 \
    --bsize {params.step2_bsize} \
    --build-mask sum \
    --check-burden-files \
    --set-list {input.list_05} \
    --strict-check-burden \
    --aaf-bins 0.05 \
    --minMAC 1 \
    --write-samples \
    --out {params.step205} \
    """
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
shell:
    """
    echo "Extracting sites to estimate ancestry."
    {params.exe} extract -d {params.outdir}/ --sites {params.sites} -f {params.ref} {input.vcf}
    echo "Estimating relatedness."
    {params.exe} relate --infer -i -o {params.outdir}/relatedness {output.somalier}
    echo "Estimating ancestry."
    {params.exe} ancestry --n-pcs=10 -o {params.outdir}/ancestry --labels {params.ancestry_data}/ancestry-labels-1kg.tsv {params.ancestry_data}/*.somalier ++ {output.somalier} || {{
# Somalier ancestry error,
# usually due to not finding
# any sites compared to the
# its references, expected
# with sub-sampled datasets
echo "WARNING: Somalier ancestry failed..." 1>&2
touch {output.ancestry}
}}
    """
65
66
67
68
69
70
71
72
73
shell:
    """
    mkdir -p {params.outdir_regenie}
    mkdir -p {params.outdir_QC}
    paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) > {output.sex_file}
    head -n 1 {input.ancestry} > {output.subset}
    awk '/P00/' {input.ancestry} >> {output.subset}
    paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) <(cut -f 9- {output.subset}) > {output.reg_covariates}
    """
19
20
21
22
23
24
25
26
27
28
29
shell:
    """
    module load bcftools
    bcftools view --max-alleles 2 {input.vcf} -O z -o {output.sub}
    mkdir -p {params.outdir}
    set +u
    module load VEP
    vep -i {output.sub} -o {output.vep_vcf} --force_overwrite --fork 12 --fasta {params.ref} --species human --assembly {params.vep_assembly} --cache --dir_cache $VEP_CACHEDIR --offline --format vcf --compress_output bgzip --everything --pick --vcf
    module load bcftools
    tabix -p vcf {output.vep_vcf}
    """
56
57
58
59
60
61
62
63
64
65
66
67
68
shell:
    """
    mkdir -p {params.outdir}
    set +u
    SLIVAR_IMPACTFUL_ORDER={params.slivar_order}
    {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf}
    module load bcftools
    bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot}
    module load plink/2
    plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m
    bgzip -c {output.unzipped_vcf} > {output.slivar_vcf}
    tabix -p vcf {output.slivar_vcf}
    """
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
shell:
    """
    mkdir -p {params.outdir}
    set +u
    SLIVAR_IMPACTFUL_ORDER={params.slivar_order}
    {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf}
    module load bcftools
    bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot}
    module load plink/2
    plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m
    bgzip -c {output.unzipped_vcf} > {output.slivar_vcf}
    tabix -p vcf {output.slivar_vcf}
    """
128
129
130
131
132
shell:
    """
    module load R
    Rscript {params.script} {params.wdir} {input.in1} {output.annot_out1} {output.list_out1} {output.mask_out1} {input.in2} {output.annot_out2} {output.list_out2} {output.mask_out2}
    """
ShowHide 5 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://openomics.github.io/istari/
Name: istari
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...