Snakemake Workflow for SNP Imputation and Annotation


A Snakemake workflow to prepare, annotate and impute SNP files from the 23andMe genotyping platform.

Main features

Imputation
Functional annotation
Correlation with somatic/disease databases

Imputation

The genotyping information in the file usually covers only part of the genome and is therefore incomplete. Many genetic markers are not included as probes on genotyping arrays, so fewer than 1M positions in the genome are covered.
Thanks to genomic population studies across several hundred human genomes, we know how much genetic information tends to be shared within the same population. We can use that information, up to a certain degree of confidence, to extend the set of genetic markers in a dataset. This process, called imputation, uses information from projects such as HapMap or 1000 Genomes to predict genotypes not included on genotyping arrays.
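As a toy illustration of the idea only (real tools such as IMPUTE2 use probabilistic models, not this lookup), the Python sketch below fills an untyped marker by copying it from the reference haplotype that agrees best with the typed markers. The panel and the alleles are invented for the example.

```python
# Toy illustration only: real imputation tools use probabilistic models,
# not this nearest-haplotype lookup. All data below is invented.

def impute_from_reference(observed, reference_panel):
    """Fill None entries in `observed` from the best-matching reference haplotype.

    observed: list of alleles, with None at untyped positions.
    reference_panel: list of fully typed haplotypes (lists of alleles).
    """
    def matches(hap):
        # Count agreements at positions that were actually typed.
        return sum(1 for o, r in zip(observed, hap) if o is not None and o == r)

    best = max(reference_panel, key=matches)
    # Keep typed alleles, copy the reference allele where nothing was typed.
    return [r if o is None else o for o, r in zip(observed, best)]


panel = [
    ["A", "G", "C", "T"],
    ["A", "A", "C", "C"],
    ["G", "A", "T", "C"],
]
typed = ["A", None, "C", "C"]  # the second marker is not on the array
imputed = impute_from_reference(typed, panel)
print(imputed)
```

The second haplotype agrees with all three typed markers, so its allele is copied into the gap.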

More details about the imputation step can be found in the wiki.

SNP effects

Genetic variations in human genomes, especially the very common single nucleotide polymorphisms (SNPs), are correlated with many diseases, associated with individuality, and relevant to many other fields, such as nutrition. Genome-wide association studies (GWAS) are conducted to identify strong associations between genetic markers and their phenotypes. These studies, however, do not always establish the strength of the effect reliably, and sometimes the confidence is not high enough. Reference catalogs such as dbSNP, ClinVar and ExAC, among others, try to annotate all information relevant to the variants and their associated phenotypes. Typically, those variants are linked to publications in which GWAS and other analyses have been carried out to corroborate the associations.

The input file is a tab-separated file containing the genotyping information of an individual:

# This data file generated by 23andMe
#
# Below is a text version of your data. Fields are TAB-separated.
# Each line corresponds to a single SNP. For each SNP, we provide its
# identifier, its location on a reference human genome, and the genotype call.
# Human genome reference used: GRCh37/Mito:rCRS
#
# rsid	chromosome	position	genotype
rs3094315	1	752566	AA
rs3131972	1	752721	GG
rs75333668	1	762320	CC
rs11240777	1	798959	GG
rs4970383	1	838555	CC
...

This file contains the genotype at 562,526 positions in an individual’s genome. Files of this kind are produced, for example, by the 23andMe genotyping analysis and include the accession number of each SNP, its location (chromosome, position), and the corresponding genotype.
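A minimal Python sketch of reading this format, assuming the four tab-separated columns shown in the header and `#`-prefixed comment lines. The function name `parse_23andme` is made up for this example and is not part of the workflow.

```python
import io


def parse_23andme(handle):
    """Yield (rsid, chromosome, position, genotype) tuples from a 23andMe raw data file."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        rsid, chromosome, position, genotype = line.split("\t")
        # Chromosome stays a string because it can be non-numeric (e.g. X, MT).
        yield rsid, chromosome, int(position), genotype


# Two records taken from the example file above.
example = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs3094315\t1\t752566\tAA\n"
    "rs3131972\t1\t752721\tGG\n"
)
records = list(parse_23andme(example))
print(records)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.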

Using the Snakemake workflow

We assume that you already have conda and Snakemake installed; otherwise, you can install them as follows:

To install conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

To install Snakemake via conda: conda install -c conda-forge -c bioconda snakemake snakemake-wrapper-utils mamba

To use this tool, follow these steps:
1. git clone https://github.com/mdelcorvo/Genetic_annotation_challenge.git
2. Download your ‘raw data’ from the 23andMe site and put it in the data directory.
3. Download the 1000 Genomes reference data, which can be found on the IMPUTE2 website:
https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html
4. Extract this data by running:
gunzip ALL_1000G_phase1integrated_v3_impute.tgz
tar xf ALL_1000G_phase1integrated_v3_impute.tar
and put the extracted files in the resources/ReferencePanel directory (or change the default directory in the config file).
5. Download the ClinVar, COSMIC and GWAS databases from the following links:
COSMIC: "http://ftp.ensembl.org/pub/grch37/current/variation/vcf/homo_sapiens/homo_sapiens_somatic.vcf.gz"
ClinVar: "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz"
GWAS: "https://www.ebi.ac.uk/gwas/api/search/downloads/full"
and put them in the resources/database directory (or change the default directory in the config file).
6. Edit the config file to set the correct paths.
7. cd Genetic_annotation_challenge && snakemake --use-conda --cores 4
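Before launching the workflow, it can help to confirm that the downloads landed where the steps above put them. The Python sketch below is a hypothetical pre-flight check, not part of the workflow. It assumes the default directory names from the steps and only tests that each directory exists and is not empty.

```python
from pathlib import Path

# Directory names follow the default layout described in the steps above;
# adjust them if you changed the paths in the config file.
EXPECTED_DIRS = ["data", "resources/ReferencePanel", "resources/database"]


def missing_inputs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing


# Run from the cloned repository; an empty list means every directory has content.
print(missing_inputs("."))
```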

Code Snippets

shell:
    "Rscript --vanilla {input.script} {input.original} {input.database} {output.original}; "
    "Rscript --vanilla {input.script} {input.imputed} {input.database} {output.imputed}; "

SnakeMake From lines 10, 23 and 36 of rules/annotation.smk

    shell:
        "plink --23file {input.raw_data} --list-duplicate-vars;"
        "plink --23file {input.raw_data} {params.name} {params.name} --make-bed --out results/{params.name} --snps-only just-acgt --exclude plink.dupvar;"
        "Rscript --vanilla {input.script} results/{params.name};"
        "plink --bfile results/{params.name} --recode vcf --out results/{params.name}.original;"

SnakeMake pLink From line 12 of rules/impute.smk

    shell:
        "plink --bfile results/{params.name} --out results/{params.name}.{params.chr} --snps-only just-acgt --chr {params.chr} --export oxford; "

SnakeMake pLink From line 29 of rules/impute.smk

    shell:
        "plink --bfile results/{params.name} --out results/{params.name}.23 --snps-only just-acgt --chr 23 --export oxford; "

SnakeMake pLink From line 43 of rules/impute.smk

shell:
    "impute2 -m {input.map} -h {input.hap} -l {input.legend} -g {input.gen} -int 20.4e6 20.5e6 -Ne 20000 -k 100 -iter 100 -o {output.imputed};"

SnakeMake From line 59 of rules/impute.smk
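The impute2 call above restricts imputation to a single interval through `-int`. Covering a larger region means running it once per window. The Python sketch below shows one way to generate consecutive window bounds. The helper name, the region length and the window size are arbitrary choices for illustration and do not come from the workflow.

```python
def interval_windows(start, end, size):
    """Yield (lower, upper) bounds covering [start, end] in windows of `size` bp."""
    lower = start
    while lower <= end:
        # The last window is clipped so it never runs past `end`.
        upper = min(lower + size - 1, end)
        yield lower, upper
        lower = upper + 1


# Hypothetical 12 Mb region split into 5 Mb windows.
windows = list(interval_windows(1, 12_000_000, 5_000_000))
for lower, upper in windows:
    print(f"-int {lower} {upper}")
```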

shell:
    "impute2 -chrX -m {input.mapX} -h {input.hapX} -l {input.legendX} -g {input.genX} -int 20.4e6 20.5e6 -Ne 20000 -o {output.imputedx};"         

SnakeMake From line 74 of rules/impute.smk

shell:
    "awk '{{$1 = {params.chr}; print}}' {input.imputed} > {output.imputed_fixed};"
    "plink --gen results/{params.name}.{params.chr}.chrfix.impute2 --sample results/{params.name}.{params.chr}.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.{params.chr}; "

SnakeMake pLink From line 86 of rules/impute.smk
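The awk step above overwrites the first column of each imputed record with the chromosome code before plink reads the file. A Python sketch of the same transformation follows. The function name and the sample lines are invented, and unlike awk it skips empty lines.

```python
def set_first_column(lines, chromosome):
    """Replace the first whitespace-separated field of each non-empty line.

    Like awk's `$1 = value; print`, the record is rebuilt with single
    spaces between fields.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # awk would still print the new value here; this sketch skips the line
        fields[0] = str(chromosome)
        yield " ".join(fields)


# Made-up records in the layout of a .gen file.
sample_lines = ["--- rs1 752566 A G 1 0 0", "snp2 rs2 752721 C T 0 1 0"]
fixed = list(set_first_column(sample_lines, 1))
print(fixed)
```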

shell:
    "awk '{{$1 = \"X\"; print}}' {input.imputed} > {output.imputed_fixed};"
    "plink --gen results/{params.name}.X.chrfix.impute2 --sample results/{params.name}.X.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.X; "

SnakeMake pLink From line 101 of rules/impute.smk

shell:
    "plink --gen {input.gen} --sample results/{params.name}.{params.chr}.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.{params.chr}; "

SnakeMake pLink From line 116 of rules/impute.smk

shell:
    "plink --gen {input.genx} --sample results/{params.name}.X.sample --hard-call-threshold 0.49 --keep-allele-order --recode vcf --out results/{params.name}.X; "

SnakeMake pLink From line 130 of rules/impute.smk

shell:
    "ls {input.vcf} > {output.list};"
    "bcftools concat --file-list {output.list} -o {output.vcf};"    

SnakeMake BCFtools From line 143 of rules/impute.smk
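Note that `ls` sorts names lexicographically, so chromosome 10 is listed before chromosome 2, and `bcftools concat` concatenates files in the order given. If numeric chromosome order matters downstream (an assumption; the workflow may not need it), the list could be ordered with a key like the one in this Python sketch. The file names below are hypothetical and follow the `results/{name}.{chr}.vcf` pattern used by the rules above.

```python
import re


def chromosome_key(path):
    """Sort key: numeric chromosomes in numeric order, then other labels such as X."""
    match = re.search(r"\.([^.]+)\.vcf$", path)
    label = match.group(1) if match else path
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


# Hypothetical per-chromosome outputs.
files = [
    "results/sample.10.vcf",
    "results/sample.2.vcf",
    "results/sample.X.vcf",
    "results/sample.1.vcf",
]
ordered = sorted(files, key=chromosome_key)
print("\n".join(ordered))
```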

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://github.com/mdelcorvo/impute23

Name: impute23

Version: 1

License: MIT License

Keywords:

BCFtools pLink Snakemake Sequence analysis
