Detection of Dominant Genetic Associations with Gene Expression Levels

This analysis pipeline detects loci that have non-additive genetic associations with gene expression levels.

Theory

We applied a multiple linear regression model to identify dominant associations between SNPs and gene expression levels.

$G_A$ is the number of non-reference alleles: $G_A$ = 0 if the genotype is reference homozygous, $G_A$ = 1 if the genotype is heterozygous, and $G_A$ = 2 if the genotype is non-reference homozygous. This variable captures additive effects and is the one commonly used in the analysis of expression quantitative trait loci (eQTLs).

$G_D$ captures dominant effects: $G_D$ = 1 if the genotype is heterozygous, and $G_D$ = 0 if the genotype is either reference or non-reference homozygous.
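
The two encodings can be illustrated with a short sketch. The function name `encode_genotype` and the convention of passing the genotype as a count of non-reference alleles are assumptions made for this example, not part of the pipeline's scripts.

```python
def encode_genotype(n_alt):
    """Return (G_A, G_D) for a genotype given as its count of non-reference alleles."""
    if n_alt not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 non-reference alleles, got %r" % (n_alt,))
    g_a = n_alt                    # additive term: number of non-reference alleles
    g_d = 1 if n_alt == 1 else 0   # dominance term: 1 only for heterozygotes
    return g_a, g_d


# Hypothetical genotypes for five samples at one SNP.
genotypes = [0, 1, 2, 1, 0]
print([encode_genotype(g) for g in genotypes])
# [(0, 0), (1, 1), (2, 0), (1, 1), (0, 0)]
```

Note that both homozygous classes share $G_D$ = 0, so $G_D$ only separates heterozygotes from everything else.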

E denotes the observed gene expression levels. Our model assumes that the noise across samples is normally distributed with 0 mean and some variance value.
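
The model equation itself does not appear in this text. The definitions of $G_A$, $G_D$ and E are consistent with a regression of the following form, where the symbols $\beta_0$, $\beta_1$, $\varepsilon$ and $\sigma^2$ are notation assumed here:

$$E = \beta_0 + \beta_1 G_A + \beta_2 G_D + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Under this reading, $\beta_1$ is the additive effect per non-reference allele and $\beta_2$ is the deviation of heterozygotes from the midpoint of the two homozygous classes.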

Our null hypothesis is that there is no dominant association, that is, $\beta_2$ (the regression coefficient of $G_D$) equals zero. An ANOVA model could be used instead of multiple linear regression to detect both additive and dominant effects of the genotype; however, the F-test used in ANOVA tests whether at least one coefficient differs significantly from zero. This does not meet our objective, because we specifically want to test whether $\beta_2$ differs from zero. Instead, a t-test can be used to test the effect sizes of $G_A$ and $G_D$ separately.
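
A minimal sketch of such a t-test for the $G_D$ coefficient, using ordinary least squares with numpy, is shown below. The function name `dominance_t_stat`, the toy data, and the choice of numpy are assumptions for illustration; the pipeline's own test lives in its R scripts and may differ.

```python
import numpy as np


def dominance_t_stat(n_alt, expr):
    """Fit E = b0 + b1*G_A + b2*G_D by least squares.

    Returns (b2, t statistic for b2, residual degrees of freedom).
    """
    n_alt = np.asarray(n_alt, dtype=float)
    expr = np.asarray(expr, dtype=float)
    g_d = (n_alt == 1).astype(float)                  # dominance indicator
    X = np.column_stack([np.ones_like(n_alt), n_alt, g_d])
    df = len(expr) - X.shape[1]
    if df < 1:
        raise ValueError("need more samples than coefficients")
    beta, _, rank, _ = np.linalg.lstsq(X, expr, rcond=None)
    if rank < X.shape[1]:
        raise ValueError("all three genotype classes are needed to estimate b2")
    resid = expr - X.dot(beta)
    sigma2 = resid.dot(resid) / df                    # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T.dot(X))          # covariance of the estimates
    return beta[2], beta[2] / np.sqrt(cov[2, 2]), df


# Toy data (hypothetical): heterozygotes sit well above the homozygote midpoint.
n_alt = [0, 0, 1, 1, 2, 2]
expr = [0.9, 1.1, 3.4, 3.6, 1.9, 2.1]
b2, t_stat, df = dominance_t_stat(n_alt, expr)
print("b2 = %.3f, t = %.2f, df = %d" % (b2, t_stat, df))
```

A two-sided p-value follows from the t distribution with `df` degrees of freedom, for example via `scipy.stats.t.sf(abs(t_stat), df) * 2` if SciPy is available.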

Here is an illustration of what a dominant eQTL looks like.

Workflow

Getting Started

Dependencies

snakemake 3.13.3
python2.7
numpy
R

Snakemake Workflow

Input files: genotype file (genotype.txt) and RNA-Seq counts matrix file (gene_expr.txt), both downloaded from the GTEx project

Run PCA analysis on the genotype matrix

Rscript --vanilla pca_genotype.R genotype genotype.txt genotype_pca.txt

Regress out sex, age, race and hidden covariates from the gene expression matrix using the PEER package

Rscript --vanilla peer_factor.R gene_expr.txt genotype_pca.txt phenotype_info.txt gene_expr_PEER.txt

Split jobs for parallel computation: 1000 genes per job

python split_into_files.py gene_expr_PEER.txt {files}_peer_expr.txt
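
The idea behind the split can be sketched as follows. This is not the code of split_into_files.py; the function name `split_into_chunks` and the placeholder gene IDs are hypothetical, and only the chunk size of 1000 comes from the step description.

```python
def split_into_chunks(items, chunk_size=1000):
    """Yield consecutive slices of `items` holding at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# 2500 placeholder gene IDs end up in three jobs.
gene_ids = ["gene_%d" % i for i in range(2500)]
chunks = list(split_into_chunks(gene_ids))
print([len(c) for c in chunks])   # [1000, 1000, 500]
```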

Merge SNPs that lie within +/- 100 kb of the gene body with the gene expression matrices, column-wise, into one large matrix

python snps_counts_comb_by_chr_no_filter.py snp.h5 index.h5 genotype_pca.txt {files}_peer_expr.txt {files}_snps_counts_comb.txt

Adjust p-values using a beta approximation, with matrix multiplication used to speed up the computation

Rscript --vanilla beta_adjust_P_matrix_multip_corrected.R {files}_snps_counts_comb.txt {files}_output.txt
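
One common form of the beta approximation is sketched here as an assumption about the approach, not as a description of the script's code. For each gene, the smallest nominal p-value is recomputed over a set of permutations of the expression values, a Beta distribution is fitted to those permutation minima, and the adjusted p-value is the fitted distribution's CDF evaluated at the observed minimum $p_{\min}$. With $m$ and $v$ the sample mean and variance of the permutation minima (symbols introduced here), method-of-moments estimates of the two shape parameters are:

$$\hat{a} = m\left(\frac{m(1-m)}{v} - 1\right), \qquad \hat{b} = (1-m)\left(\frac{m(1-m)}{v} - 1\right), \qquad p_{\text{adj}} = I_{p_{\min}}(\hat{a}, \hat{b})$$

where $I_x(a, b)$ is the regularized incomplete beta function, that is, the CDF of a Beta($a$, $b$) distribution at $x$. Moment estimates like these can also serve as starting values for a maximum-likelihood fit.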

Merge output files

cat {files}_output.txt > all_chr_output.txt

Extract SNP-gene pairs that show dominant effects

Rscript --vanilla extract_snp_gene_pair_comb_matrix.R {files}_output.txt {files}_snps_counts_comb.txt dominant_snp_gene_pair.txt

Authors

Jing Gu

Acknowledgements

Dr. Graham McVicker, Salk Institute for Biological Studies
Patrick Fiaux, UC San Diego
Arko Sen, Salk Institute for Biological Studies
Hsiuyi Chen, Salk Institute for Biological Studies
Selene Tyndale, Salk Institute for Biological Studies

Code Snippets

shell:
    "mkdir -p {config[output_dir]}/{wildcards.sample}/filtered_SNPs ;"     
    "{config[py2]} {config[script_dir]}/get_Overlapped_SNPs_GTEx.py "                                                
    "{wildcards.chrom} {input.snp_tab} {input.snp_hapl} {input.genotypes} {input.exp_matrix} {output};"

SnakeMake From line 75 of master/Snakefile

shell:
    "{config[py2]} {config[script_dir]}/merge_genotype_files.py "                                                
    "{output} {input};"

SnakeMake From line 89 of master/Snakefile

shell:
    "mkdir -p {config[output_dir]}/{wildcards.sample}/pca ;"     
    "Rscript --vanilla {config[script_dir]}/pca_genotypes.R {input} {output};"

SnakeMake From line 100 of master/Snakefile

shell:
    "mkdir -p {config[output_dir]}/{wildcards.sample}/PEER ;"     
    "Rscript --vanilla {config[script_dir]}/peer_factor.R {input.exp_matrix} {input.pca_matrix} {input.env_matrix} {output};"

SnakeMake From line 114 of master/Snakefile

shell:
    "mkdir -p {config[output_dir]}/{wildcards.sample}/PEER ;"     
    "{config[py2]} {config[script_dir]}/split_into_files.py "
    "{input.expr_matrix} {input.job_nums} {output};"

SnakeMake From line 128 of master/Snakefile

shell:        
    "mkdir -p {config[output_dir]}/{wildcards.sample}/snps_counts_comb ;"                                                        
    "{config[py2]} {config[script_dir]}/snps_counts_comb_by_chr_no_filter.py "                                                
    "{input.snp_tab} {input.snp_index} {input.filtered_SNPs} {input.counts_matrix} {output};"

SnakeMake From line 144 of master/Snakefile

shell:
     "echo HOSTNAME=$HOSTNAME >&2 ; "
     "mkdir -p {config[output_dir]}/{wildcards.sample}/output ;"                                                                                          
     "Rscript --vanilla {config[script_dir]}/beta_adjust_P_matrix_multip_corrected.R {input} {output};"

SnakeMake From line 157 of master/Snakefile

shell:
    "cat {input} > {output} ;"

SnakeMake From line 170 of master/Snakefile

shell:
    "Rscript --vanilla {config[script_dir]}/prepare_GO_analysis.R {input.merged_output} {input.gene_ref} {output};"

SnakeMake From line 186 of master/Snakefile

shell:
    "{config[py2]} {config[script_dir]}/GO_analysis/go_cat_fisher_test.py -n all {input.fg_files} {input.bg_files} > {output} ;" 

SnakeMake From line 199 of master/Snakefile

shell:
     "Rscript --vanilla {config[script_dir]}/extract_snp_gene_pair_comb_matrix.R {input.summary} {input.f} {wildcards.sample} {output};"

SnakeMake From line 212 of master/Snakefile


Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://github.com/j3gu/Dominant-eQTLs

Name: dominant-eqtls

Version: 1


License: None

Keywords:

Snakemake Gene expression
