Perform error-prone data preparation for whole genome regression modeling

public 1yr ago 0 bookmarks

View Workflow

istari 🔬

An awesome whole genome regression modeling pipeline

This is the home of the pipeline, istari. Its long-term goals: to perform error-prone data preparation for whole genome regression modeling using regenie like no pipeline before!

Overview

Welcome to istari! Before getting started, we highly recommend reading through istari's documentation .

The ./istari pipeline is composed several inter-related sub commands to setup and run the pipeline across different systems. Each of the available sub commands perform different functions:

istari run : Run the istari pipeline with your input files.
istari unlock : Unlocks a previous runs output directory.
istari cache : Cache remote resources locally, coming soon!

istari is a comprehensive pipeline that performs error-prone data preparation steps for genome-wide association studies (GWAS) optimized for WES and WGS. It relies on technologies like Singularity1 to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2 , a flexible and scalable workflow management system, to submit jobs to a cluster.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. This pipeline prepares data for GWAS by converting file formats, performing QC filtering, preparing phenotypes, covariates, and files for regenie3 . As input, it accepts a gVCF file (note: file must have corresponding index file), phenotype file, and covariates file. The pipeline validates phenotype and covariate files. If sex and/or ancestry information is not present in the covariates file, the pipeline runs Somalier4 to estimate sex and ancestry and adds this information to the covariates file. The pipeline then uses Ensembl's Variant Effect Predictor5 (VEP) to annotate variants. The pipeline then filters variants using Slivar , which utilizes the Genome Aggregation 190 Database6 (gnomAD) popMax AF. By default, this pipeline include variants with a population allele frequency ≤ 0.01 and ≤ 0.05 (1% or 5% in gnomAD) and predicted to be functionally impactful by Slivar, but it is set up so these thresholds can be adjusted. Then, the pipeline performs pre-association test QC filtering for regenie step 1 based on minor allele frequency (MAF), minor allele count (MAC), linkage disequilibrium (LD), and genotype and sample missingness using GATK and plink7,8 . By default, certain filtering thresholds are set, but these thresholds and pruning settings can be adjusted. Note that the regenie developers do not recommend using >1M SNPs for step 1. The pipeline then creates the annotation file, mask file, and list file, which are used as input for gene-based tests in regenie step 2 gene-based tests. The pipeline can submit jobs to a cluster using a job scheduler like SLURM (more coming soon!). A hybrid approach ensures the pipeline is accessible to all users.

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or trouble-shooting a problem, please checkout our FAQ prior to opening an issue on Github .

Dependencies

Requires: singularity=3.10.5 snakemake=7.19.1

At the current moment, the pipeline uses a mixture of environment modules and docker images; however, this will be changing soon! In the very near future, the pipeline will only use docker images. With that being said, snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step of the pipeline will rely on versioned images from DockerHub . Snakemake uses singularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity will be the only two dependencies in the future.

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/OpenOmics/istari.git
# Change your working directory
cd istari/
# Add dependencies to $PATH
# Biowulf users should run
module load snakemake singularity
# Get usage information
./istari -h

Contribute

This site is a living document, created for and by members like you. istari is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository .

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.
3. Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., et al. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53 (7), 1097–1103.
4. Pedersen, B.S., Bhetariya, P.J., Brown, J., Kravitz, S.N., Marth, G., Jensen, R.L., Bronner, M.P., Underhill, H.R., and Quinlan, A.R. (2020). Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches.Genome Med. 12, 62.
5. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. (2016). The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122. 6. Karczewski, K.J., Francioli, L.C., Tiao, G. et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly MJ, Sham PC (2007) PLINK: A Tool Set for Whole-Genome and Population-Based Linkage Analyses. American Journal of Human Genetics, 81.
8. Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4.

Code Snippets

shell:
    """
    set +u
    module load GATK
    gatk SelectVariants -V {input.vcf} -O {params.unzipped_vcf} -R {params.ref} --max-nocall-fraction {params.filt_nocall}
    module load bcftools
    bgzip {params.unzipped_vcf}
    tabix -p vcf {output.cleaned_vcf}
    """

SnakeMake gatk BCFtools tabix From line 22 of rules/filtering.smk

shell:
    """
    ml plink/1
    plink --vcf {input.cleaned_vcf} --vcf-half-call m --set-missing-var-ids @:# --not-chr M --geno {params.geno} --maf {params.maf} --double-id --set-hh-missing --update-sex {params.sex} --make-bed --out {params.bed1}
    ml plink/2
    plink2 --bfile {params.prefix} --indep-pairwise {params.ld_window} {params.ld_step} {params.ld_thresh} --make-bed --out {params.ld_bed}
    plink2 --bfile {params.ld_bed} --exclude {params.ld_bed}.prune.out --mac {params.mac} --make-bed --out {params.bed}
    ml plink/1
    plink --bfile {params.bed} --bmerge {params.bed_01}  --make-bed --out {params.step2_01} --update-sex {params.sex}
    plink --bfile {params.bed} --bmerge {params.bed_05}  --make-bed --out {params.step2_05} --update-sex {params.sex}
    """

SnakeMake pLink plink2 From line 72 of rules/filtering.smk

shell:
    """
    regenie \
    --step 1 \
    --bed {params.step1} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --bsize {params.step1_bsize} \
    --threads {threads} \
    --lowmem \
    --gz \
    --nauto 23 \
    --lowmem-prefix tmp_rg \
    --out {params.step1} \

    regenie \
    --step 2 \
    --bed {params.bed_01} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --write-mask-snplist \
    --pred {params.step1}_pred.list \
    --anno-file {input.annot_01} \
    --mask-def {input.mask_01} \
    --vc-tests skat,skato \
    --debug \
    --nauto 23 \
    --bsize {params.step2_bsize} \
    --build-mask sum \
    --check-burden-files \
    --set-list {input.list_01} \
    --strict-check-burden \
    --aaf-bins 0.01 \
    --minMAC 1 \
    --write-samples \
    --out {params.step201} \

    regenie \
    --step 2 \
    --bed {params.bed_05} \
    --covarFile {params.covariates} \
    --covarColList {params.covarColList} \
    --phenoFile {params.phenoFile} \
    --phenoColList {params.phenoColList} \
    --write-mask-snplist \
    --pred {params.step1}_pred.list \
    --anno-file {input.annot_05} \
    --mask-def {input.mask_05} \
    --vc-tests skat,skato \
    --debug \
    --nauto 23 \
    --bsize {params.step2_bsize} \
    --build-mask sum \
    --check-burden-files \
    --set-list {input.list_05} \
    --strict-check-burden \
    --aaf-bins 0.05 \
    --minMAC 1 \
    --write-samples \
    --out {params.step205} \
    """

SnakeMake regenie From line 34 of rules/regenie.smk

shell:
    """
    echo "Extracting sites to estimate ancestry."
    {params.exe} extract -d {params.outdir}/ --sites {params.sites} -f {params.ref} {input.vcf}
    echo "Estimating relatedness."
    {params.exe} relate --infer -i -o {params.outdir}/relatedness {output.somalier}
    echo "Estimating ancestry."
    {params.exe} ancestry --n-pcs=10 -o {params.outdir}/ancestry --labels {params.ancestry_data}/ancestry-labels-1kg.tsv {params.ancestry_data}/*.somalier ++ {output.somalier} || {{
# Somalier ancestry error,
# usually due to not finding
# any sites compared to the
# its references, expected
# with sub-sampled datasets
echo "WARNING: Somalier ancestry failed..." 1>&2
touch {output.ancestry}
}}
    """

SnakeMake From line 27 of rules/somalier.smk

shell:
    """
    mkdir -p {params.outdir_regenie}
    mkdir -p {params.outdir_QC}
    paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) > {output.sex_file}
    head -n 1 {input.ancestry} > {output.subset}
    awk '/P00/' {input.ancestry} >> {output.subset}
    paste -d '\t' <(cut -f 1,2 {input.covariates}) <(cut -f 5 {input.sex}) <(cut -f 9- {output.subset}) > {output.reg_covariates}
    """

SnakeMake From line 65 of rules/somalier.smk

shell:
    """
    module load bcftools
    bcftools view --max-alleles 2 {input.vcf} -O z -o {output.sub}
    mkdir -p {params.outdir}
    set +u
    module load VEP
    vep -i {output.sub} -o {output.vep_vcf} --force_overwrite --fork 12 --fasta {params.ref} --species human --assembly {params.vep_assembly} --cache --dir_cache $VEP_CACHEDIR --offline --format vcf --compress_output bgzip --everything --pick --vcf
    module load bcftools
    tabix -p vcf {output.vep_vcf}
    """

SnakeMake BCFtools tabix Variant Effect Predictor (VEP) From line 19 of rules/vep_slivar.smk

shell:
    """
    mkdir -p {params.outdir}
    set +u
    SLIVAR_IMPACTFUL_ORDER={params.slivar_order}
    {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf}
    module load bcftools
    bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot}
    module load plink/2
    plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m
    bgzip -c {output.unzipped_vcf} > {output.slivar_vcf}
    tabix -p vcf {output.slivar_vcf}
    """

SnakeMake BCFtools tabix plink2 From line 56 of rules/vep_slivar.smk

shell:
    """
    mkdir -p {params.outdir}
    set +u
    SLIVAR_IMPACTFUL_ORDER={params.slivar_order}
    {params.slivar_tool} expr --js {params.slivar_js} -g {params.gnomad_db} --vcf {input.vep_vcf} --info "INFO.gnomad_popmax_af < {params.gnomad_af} && INFO.impactful" --pass-only > {output.unzipped_vcf}
    module load bcftools
    bcftools +split-vep {output.unzipped_vcf} -f '%CHROM:%POS:%REF:%ALT\t%SYMBOL\t%Consequence\n' > {output.slivar_annot}
    module load plink/2
    plink2 --make-bed --max-alleles 2 --double-id --out {params.bed} --set-missing-var-ids @:# --vcf {output.unzipped_vcf} --vcf-half-call m
    bgzip -c {output.unzipped_vcf} > {output.slivar_vcf}
    tabix -p vcf {output.slivar_vcf}
    """

SnakeMake BCFtools tabix plink2 From line 95 of rules/vep_slivar.smk

shell:
    """
    module load R
    Rscript {params.script} {params.wdir} {input.in1} {output.annot_out1} {output.list_out1} {output.mask_out1} {input.in2} {output.annot_out2} {output.list_out2} {output.mask_out2}
    """