Snakemake Workflow for SNP Imputation and Annotation


A Snakemake workflow to prepare, annotate and impute SNP files from the 23andMe genotyping platform.

Main features

  • Imputation

  • Functional annotation

  • Correlation with somatic/disease databases

Imputation

The genotyping information in the file usually covers only part of the genome and is therefore incomplete. Many genetic markers are not included as probes on genotyping arrays, so in the end fewer than 1M positions in the genome are covered.
Thanks to population genomics studies spanning several hundred human genomes, we know how much genetic information tends to be shared within the same population. We can use that information, up to a certain degree of confidence, to extend the set of genetic markers in a dataset. This process, called imputation, can use information from projects such as HapMap or 1000 Genomes to predict genotypes not included on genotyping arrays.
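
To make the idea concrete, here is a deliberately simplified sketch. It is not the algorithm used by IMPUTE2 or by this workflow: it assumes a made-up toy reference panel and fills an untyped marker by copying it from the single reference haplotype that agrees best with the typed markers. The function and marker names are hypothetical.

```python
# Toy illustration of reference-based imputation (NOT the IMPUTE2 algorithm).
# The reference panel and marker names below are invented for the example.

def impute_from_panel(observed, panel):
    """Fill untyped markers of one haplotype from the best-matching reference.

    observed: dict marker -> allele for the typed markers
    panel:    list of dicts marker -> allele (complete reference haplotypes)
    """
    def agreement(ref):
        # Count typed markers where the reference carries the same allele.
        return sum(1 for marker, allele in observed.items()
                   if ref.get(marker) == allele)

    best = max(panel, key=agreement)
    completed = dict(best)      # start from the closest reference haplotype
    completed.update(observed)  # typed markers keep their observed allele
    return completed


panel = [
    {"m1": "A", "m2": "G", "m3": "C", "m4": "T"},
    {"m1": "A", "m2": "A", "m3": "T", "m4": "T"},
    {"m1": "G", "m2": "A", "m3": "T", "m4": "C"},
]
observed = {"m1": "A", "m2": "A", "m4": "T"}  # m3 was not on the array
print(impute_from_panel(observed, panel)["m3"])
```

Here the second reference haplotype matches all three typed markers, so the untyped marker m3 is filled with its allele, T. Real imputation tools weigh many reference haplotypes probabilistically instead of picking one.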

More details about the imputation step can be found in the wiki.

SNP effects

Genetic variations in human genomes, especially the very common single nucleotide polymorphisms (SNPs), are correlated with many diseases, associated with individual traits, and relevant to many other fields, such as nutrition. Genome-wide association studies (GWAS) are conducted to identify strong associations between genetic markers and their phenotypes. These studies, however, do not always establish the strength of an effect reliably, and sometimes the confidence is not high enough. Reference catalogs such as dbSNP, ClinVar and ExAC, among others, try to annotate all information relevant to the variants and their associated phenotypes. Typically these variants are linked to publications in which GWAS and other analyses have been carried out to corroborate the associations.
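
As a rough illustration of what annotating against such a catalog can look like, the sketch below builds an rsID-to-significance lookup from ClinVar-style VCF lines. It is not the workflow's own annotation script. It assumes the INFO column carries `RS` and `CLNSIG` keys, and the records, rsIDs and function name are made up for the example.

```python
def clinvar_lookup(vcf_lines):
    """Map 'rs<ID>' -> CLNSIG for VCF records that carry both INFO keys."""
    lookup = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        columns = line.rstrip("\n").split("\t")
        # INFO is the 8th VCF column: semicolon-separated KEY=VALUE pairs.
        info = dict(
            item.split("=", 1) for item in columns[7].split(";") if "=" in item
        )
        if "RS" in info and "CLNSIG" in info:
            lookup["rs" + info["RS"]] = info["CLNSIG"]
    return lookup


# Invented records, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t1\tA\tG\t.\t.\tRS=11111;CLNSIG=Benign\n",
    "1\t2000\t2\tC\tT\t.\t.\tCLNSIG=Pathogenic\n",
]
genotypes = {"rs11111": "AG", "rs22222": "CC"}  # hypothetical sample genotypes
lookup = clinvar_lookup(example)
for rsid, genotype in genotypes.items():
    print(rsid, genotype, lookup.get(rsid, "not in catalog"))
```

The second record has no `RS` key, so it cannot be matched by rsID and is left out of the lookup.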

  • The input file is a tab-separated file containing the genotyping information of an individual:
# This data file generated by 23andMe
#
# Below is a text version of your data. Fields are TAB-separated.
# Each line corresponds to a single SNP. For each SNP, we provide its
# identifier, its location on a reference human genome, and the genotype call.
# Human genome reference used: GRCh37/Mito:rCRS
#
# rsid	chromosome	position	genotype
rs3094315	1	752566	AA
rs3131972	1	752721	GG
rs75333668	1	762320	CC
rs11240777	1	798959	GG
rs4970383	1	838555	CC
...

This file contains the genotype at 562,526 positions in an individual’s genome. Files of this kind are produced, for example, by the 23andMe genotyping analysis and include each SNP's accession number, its location (chromosome, position), and the corresponding genotype.
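
The format described above is simple enough to read with a few lines of Python. The following is a minimal sketch, assuming exactly four tab-separated fields per data line as in the excerpt; the `Genotype` type and `parse_23andme` function are hypothetical names, not part of the workflow.

```python
from typing import Iterable, Iterator, NamedTuple


class Genotype(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(lines: Iterable[str]) -> Iterator[Genotype]:
    """Yield one Genotype per data line, skipping '#' comments and blanks."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield Genotype(rsid, chromosome, int(position), genotype)


sample = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs3094315\t1\t752566\tAA\n",
    "rs3131972\t1\t752721\tGG\n",
]
records = list(parse_23andme(sample))
print(records[0])
```

The chromosome is kept as a string because it is not always numeric (for example X or MT).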

Using the Snakemake workflow

We assume that you already have conda and Snakemake installed. If not, you can install them as follows:

To install conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

To install Snakemake via conda: conda install -c conda-forge -c bioconda snakemake snakemake-wrapper-utils mamba

To use this tool, follow these steps:
1. git clone https://github.com/mdelcorvo/Genetic_annotation_challenge.git
2. Download your raw data from the 23andMe site and put it in the data directory
3. Download the 1000 Genomes reference data, which can be found on the IMPUTE2 website here:
https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
4. Extract this data by running:
gunzip ALL_1000G_phase1integrated_v3_impute.tgz
tar xf ALL_1000G_phase1integrated_v3_impute.tar
and put the extracted files in the resources/ReferencePanel directory (or change the default directory in the config file)
5. Download the ClinVar, COSMIC and GWAS databases from the following links:
COSMIC: "http://ftp.ensembl.org/pub/grch37/current/variation/vcf/homo_sapiens/homo_sapiens_somatic.vcf.gz"
ClinVar: "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz"
GWAS: "https://www.ebi.ac.uk/gwas/api/search/downloads/full"
and put them in the resources/database directory (or change the default directory in the config file)
6. Edit the config file, setting the correct paths
7. cd Genetic_annotation_challenge && snakemake --use-conda --cores 4
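
Before launching the workflow it can help to confirm that the downloads from the steps above ended up where they are expected. The sketch below is a hypothetical helper, not part of the repository; it assumes the default directory names mentioned in the steps and would need adjusting if the config file points elsewhere.

```python
from pathlib import Path

# Default locations named in the steps above (an assumption; the config
# file may point to other directories).
EXPECTED_DIRS = ["data", "resources/ReferencePanel", "resources/database"]


def missing_dirs(repo_root):
    """Return the expected directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing


for rel in missing_dirs("Genetic_annotation_challenge"):
    print(f"missing or empty: {rel}")
```

An empty result means all three directories exist and contain at least one entry; it does not check that the files themselves are complete.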

Code Snippets

shell:
    "Rscript --vanilla {input.script} {input.original} {input.database} {output.original}; "
    "Rscript --vanilla {input.script} {input.imputed} {input.database} {output.imputed}; "
    shell:
        "plink --23file {input.raw_data} --list-duplicate-vars;"
        "plink --23file {input.raw_data} {params.name} {params.name} --make-bed --out results/{params.name} --snps-only just-acgt --exclude plink.dupvar;"
        "Rscript --vanilla {input.script} results/{params.name};"
        "plink --bfile results/{params.name} --recode vcf --out results/{params.name}.original;"
    shell:
        "plink --bfile results/{params.name} --out results/{params.name}.{params.chr} --snps-only just-acgt --chr {params.chr} --export oxford; "
    shell:
        "plink --bfile results/{params.name} --out results/{params.name}.23 --snps-only just-acgt --chr 23 --export oxford; "
shell:
    "impute2 -m {input.map} -h {input.hap} -l {input.legend} -g {input.gen} -int 20.4e6 20.5e6 -Ne 20000 -k 100 -iter 100 -o {output.imputed};"
shell:
    "impute2 -chrX -m {input.mapX} -h {input.hapX} -l {input.legendX} -g {input.genX} -int 20.4e6 20.5e6 -Ne 20000 -o {output.imputedx};"
shell:
    "awk '{{$1 = {params.chr}; print}}' {input.imputed} > {output.imputed_fixed};"
    "plink --gen results/{params.name}.{params.chr}.chrfix.impute2 --sample results/{params.name}.{params.chr}.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.{params.chr}; "
shell:
    "awk '{{$1 = \"X\"; print}}' {input.imputed} > {output.imputed_fixed};"
    "plink --gen results/{params.name}.X.chrfix.impute2 --sample results/{params.name}.X.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.X; "
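
The awk commands above overwrite the first column of the IMPUTE2 output with the chromosome label before plink reads the file. The sketch below mimics that transformation in Python. It assumes whitespace-separated fields, and the input line is invented for the example.

```python
def set_first_column(line, chrom):
    """Mimic awk '{$1 = chrom; print}': replace field 1, rejoin with spaces."""
    fields = line.split()
    if not fields:
        return str(chrom)  # awk creates field 1 on an empty record
    fields[0] = str(chrom)
    return " ".join(fields)


# Invented line in the layout: id1 id2 position allele_a allele_b probabilities
line = "--- rs0000001 20400123 A G 0.02 0.95 0.03"
print(set_first_column(line, 20))
```

Like awk, this rebuilds the line with single spaces between fields, so any original tab or multi-space separators are not preserved.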
shell:
    "plink --gen {input.gen} --sample results/{params.name}.{params.chr}.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.{params.chr}; "
shell:
    "plink --gen {input.genx} --sample results/{params.name}.X.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.X; "
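
The `--hard-call-threshold 0.49` option in the plink calls above controls when a triple of genotype probabilities is turned into a definite call. The sketch below shows one reading of that rule. It rests on an assumption about PLINK 1.9 behavior that is not stated in this workflow: a genotype is called when its probability is at least 1 minus the threshold, and set to missing otherwise. The function name is hypothetical.

```python
def hard_call(p_aa, p_ab, p_bb, threshold=0.49):
    """Return 'AA', 'AB', 'BB', or None (missing) for one probability triple.

    Assumption: a call is made when a genotype's probability is at least
    1 - threshold. This is a sketch, not PLINK's actual implementation.
    """
    floor = 1.0 - threshold
    for label, probability in (("AA", p_aa), ("AB", p_ab), ("BB", p_bb)):
        if probability >= floor:
            return label
    return None


print(hard_call(0.05, 0.60, 0.35))                 # lenient threshold
print(hard_call(0.05, 0.60, 0.35, threshold=0.1))  # stricter threshold
```

Under this reading, a threshold of 0.49 accepts any genotype whose probability reaches 0.51, so far fewer imputed genotypes end up missing than with a stricter value, at the cost of keeping less certain calls.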
shell:
    "ls {input.vcf} > {output.list};"
    "bcftools concat --file-list {output.list} -o {output.vcf};"    
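
The `ls` call in the last snippet above lists files in lexicographic order, so with names such as `sample.10.vcf` and `sample.2.vcf` chromosome 10 comes before chromosome 2 in the list passed to `bcftools concat`. If numeric chromosome order matters downstream, the list can be written in that order instead. The sketch below assumes the per-chromosome files are named `<name>.<chr>.vcf`, which is inferred from the `--out` patterns above; the helper names are hypothetical.

```python
import re


def chromosome_key(path):
    """Sort key: numeric chromosomes in numeric order, then X."""
    match = re.search(r"\.(\d+|X)\.vcf$", path)
    if match is None:
        raise ValueError(f"cannot find a chromosome label in {path!r}")
    label = match.group(1)
    return (1, 0) if label == "X" else (0, int(label))


def write_file_list(vcf_paths, list_path):
    """Write the paths to list_path, one per line, in chromosome order."""
    ordered = sorted(vcf_paths, key=chromosome_key)
    with open(list_path, "w") as handle:
        handle.write("\n".join(ordered) + "\n")
    return ordered


paths = ["results/sample.10.vcf", "results/sample.X.vcf",
         "results/sample.2.vcf", "results/sample.1.vcf"]
print(sorted(paths, key=chromosome_key))
```

The resulting file can be passed to `bcftools concat --file-list` in place of the `ls` output.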

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/mdelcorvo/impute23
Name: impute23
Version: 1
Copyright: Public Domain
License: MIT License