Genomic prediction for free amino acid traits in Arabidopsis seeds

Data and scripts necessary to run genomic partitioning and prediction models on free amino acid traits measured in a diverse panel of 313 Arabidopsis lines.

Software requirements

  • R v3.6.0

  • PLINK v1.90b4 64-bit

  • Miniconda3 (includes conda v4.6.7)

  • Snakemake v5.4.2 (install to virtual environment)

To install Snakemake in a virtual environment, run:

conda env create --name multiblup --file environment.yaml

For future use, activate this environment with: source activate multiblup

Data

Edit the config.yaml file to specify the paths for the genotype and phenotype data.

Genotype data

Before running Snakemake, download the Arabidopsis Regional Mapping (RegMap) data (Horton et al. 2012). Note that wget needs the raw file rather than the GitHub blob page:
cd data/external
wget https://github.com/Gregor-Mendel-Institute/atpolydb/raw/master/250k_snp_data/call_method_75.tar.gz
tar -xzvf call_method_75.tar.gz

Phenotype and covariate data

  • data/raw/aa360_raw_ratios.csv
    Raw measurements (nmol/mg seed) of 65 free amino acid traits measured in 313 accessions of Arabidopsis thaliana, as reported by Angelovici et al. 2013

  • data/processed/aa360_BLUEs.txt
    Environment-adjusted best linear unbiased estimates (BLUEs) for 65 free amino acid traits. Calculated using the HAPPI-GWAS pipeline from Slaten et al. 2020. Check out notebooks/01-calculate_BLUEs.Rmd for details.

  • data/processed/aa360_covars.txt
    Principal components from genotype data to model population structure.

Snakemake

The Snakemake workflow includes a Snakefile (which specifies the steps of the workflow and variables) and rule files in rules/ that run specific commands.

Snakefile

This file specifies variable names for use in the Snakemake workflow. There are notes on how to include specific pathways and MapMan bincodes. Other variables include trait names, the number of random SNP sets, and the number of cross-validations to perform.

The rule all: section is a pseudorule that tracks the expected output of each command in the rule files.

To run the workflow, edit the cluster configuration settings in submit.json, then run submit.sh

Rules

rules/common.smk - specifies the location of the config.yaml file

rules/prep_data.smk

  • filters and converts genotype data to PED format

  • exports TAIR10 Ensembl gene ids for SNP data

  • calculates SNP weightings (these are not actually used and can be skipped)

  • runs PCA and exports a covariate file (two PCs are included here; adjust depending on your data, as sketched below)
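
For reference, a minimal R sketch of that last step, assuming PLINK has already written an .eigenvec file (the input path, output path, and choice of two PCs are illustrative):

library(tidyverse)

# classic PLINK .eigenvec output has no header: FID, IID, PC1..PCn
pcs <- read.table("models/pca/plink.eigenvec", header = FALSE)

# keep the first two PCs and write the FID/IID/PC1/PC2 layout
# used by the --covar files elsewhere in this workflow
pcs %>%
  select(FID = V1, IID = V2, PC1 = V3, PC2 = V4) %>%
  write_delim("data/processed/aa360_covars.txt")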

rules/cross_validation.smk

  • create training and testing sets for cross validation

  • Note: this step may need to be run separately on the command line; repeating the command via loops/Snakemake sometimes does not generate unique folds (a manual fallback is sketched below)
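
If that happens, unique folds can be generated manually; a rough R sketch, assuming the filtered PLINK .fam file and the cv{i}.train{j}/cv{i}.test{j} naming used by the rules below (paths are illustrative):

library(tidyverse)

# accession ids from the filtered PLINK .fam file (FID and IID in columns 1-2)
fam <- read.table("data/processed/aa360.fam", header = FALSE) %>%
  select(FID = V1, IID = V2)

set.seed(20)  # use a different seed for each replicate so folds stay unique
fold <- sample(rep(1:5, length.out = nrow(fam)))

# write LDAK-style keep files for one cross-validation replicate
for (k in 1:5) {
  write_delim(fam[fold != k, ], paste0("data/processed/cross_validation/cv1.train", k),
              col_names = FALSE)
  write_delim(fam[fold == k, ], paste0("data/processed/cross_validation/cv1.test", k),
              col_names = FALSE)
}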

rules/gblup.smk

  • export kinship matrix for all SNPs

  • estimate variances and heritability

  • genomic prediction and cross-validation (REML + calculating BLUPs)

  • summarize GBLUP output (reports/gblup.RData)

rules/multiblup.smk

  • filter and export list of pathway SNPs (includes a 2.5 kb buffer before and after each pathway gene)

  • calculate kinship matrices for each pathway (list1) and all remaining SNPs (list2)

  • estimate variances and heritability for each SNP partition

  • genomic prediction and cross-validation with multiple kernels/random effects (REML + calculating BLUPs)

  • summarize MultiBLUP output (reports/multiblup.RData)
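
For context, the model fit by these rules is the standard two-kernel MultiBLUP decomposition; in LaTeX notation:

$$y = X\beta + g_1 + g_2 + \varepsilon, \qquad g_k \sim N(0, \sigma_k^2 K_k), \qquad \varepsilon \sim N(0, \sigma_e^2 I)$$

where $K_1$ is the kinship matrix built from the pathway SNPs (list1), $K_2$ is built from the remaining SNPs (list2), and the share of genetic variance attributed to the pathway is $\sigma_1^2 / (\sigma_1^2 + \sigma_2^2)$.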

rules/null_distribution.smk

  • First option: generate 5000 random gene groups with a uniform distribution of SNP counts. This is useful for examining the influence of partition size on heritability explained/model fit, or for comparing many partitions of varying size against an empirical distribution (output: reports/lr_null_results.csv)

  • Second option: generate 1000 random gene groups for each pathway (excluding pathway SNPs and sampling a similar number of SNPs/genes)

  • calculate kinship matrices for each random group

  • estimate variances and heritability for each random SNP set

  • summarize results across all 1000 gene groups (reports/null_dist_results_pathways/{pathway}_null.csv)

Notebooks

R notebooks to summarize results.

01-calculate_BLUEs.Rmd

Using the raw free amino acid (FAA) trait data, calculate BLUEs for each trait using the HAPPI-GWAS pipeline.
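
The HAPPI-GWAS pipeline handles this step, but conceptually the BLUEs are fixed-effect estimates for each accession from a mixed model; a hedged sketch with lme4 (the data frame and column names here are made up):

library(lme4)

# accession as a fixed effect, environment as a random effect;
# the fixed-effect estimates are then the BLUEs for each accession
fit <- lmer(phenotype ~ 0 + accession + (1 | environment), data = faa_raw)
blues <- data.frame(accession = sub("^accession", "", names(fixef(fit))),
                    BLUE = as.numeric(fixef(fit)))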

02-gblup_results.Rmd

Checks the quality of model output and summarizes prediction results for the GBLUP and MultiBLUP models (e.g. proportion of heritability explained, likelihood ratio, prediction accuracy, reliability, bias, MSE).
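
Given observed and predicted values for a test fold, these summaries reduce to a few lines of R (conventions vary; reliability here is squared accuracy scaled by the GBLUP heritability, and bias is the slope of observed on predicted):

# y = observed phenotypes, yhat = LDAK predictions, h2 = GBLUP heritability
prediction_metrics <- function(y, yhat, h2) {
  r <- cor(y, yhat)  # predictive ability in the test set
  data.frame(accuracy    = r,
             reliability = r^2 / h2,
             bias        = unname(coef(lm(y ~ yhat))[2]),  # 1 = unbiased
             MSE         = mean((y - yhat)^2))
}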

03-process_null.Rmd

Examines properties of the random gene groups (e.g. the distribution of the likelihood ratio) and performs quantile regression to establish 95th percentiles for the proportion of heritability explained and the likelihood ratio (see the nice discussion of this approach in Edwards et al. 2016).
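
A minimal sketch of the quantile-regression step with quantreg, assuming the lr_null_results.csv output described above (with its LR and size columns):

library(quantreg)
library(readr)

null_dist <- read_csv("reports/lr_null_results.csv")

# 95th percentile of the likelihood ratio as a function of partition size
fit95 <- rq(LR ~ size, tau = 0.95, data = null_dist)

# estimated 95% threshold for a hypothetical partition of 1200 SNPs
predict(fit95, newdata = data.frame(size = 1200))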

04-multiblup_results_by_pathway.Rmd

Identifies pathways that pass significance criteria based on comparison to random gene groups with the same number of SNPs (proportion of h2, likelihood ratio, and increase in prediction accuracy).
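
The comparison itself amounts to an empirical quantile against the 1000 matched random groups; a sketch, with a hypothetical pathway name and observed likelihood ratio:

library(tidyverse)

null <- read_csv("reports/null_dist_results_pathways/example_pathway_null.csv")
observed_LR <- 12.4  # would come from the MultiBLUP results for this pathway

# empirical p-value: fraction of random gene groups with an LR at least
# as extreme as the observed pathway LR (with the usual +1 correction)
null %>%
  group_by(trait) %>%
  summarise(p_emp = (sum(LR >= observed_LR) + 1) / (n() + 1))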

05-figures.Rmd

Code to create figures used in the manuscript.

Project Organization (based on Cookiecutter data science)

Project based on the cookiecutter data science project template. #cookiecutterdatascience

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Manuscript documents
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- R notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── environment.yaml   <- The requirements file for reproducing the analysis conda environment
│
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data
│   ├── features       <- Scripts to turn raw data into features for modeling
│   ├── models         <- Scripts to train models and then use trained models to make predictions
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations
│
├── Snakefile          <- Snakemake workflow to execute analyses
│
└── submit.json        <- Configuration settings to run snakemake on a computing cluster

Code Snippets

From lines 23-26 of rules/cross_validation.smk:
run:
    shell("{ldak} --cut-folds {params.folds} \
    --bfile {params.bfile} \
    --num-folds {params.numfolds}")
From lines 18-23 of rules/gblup.smk:
run:
    shell("{ldak} --calc-kins-direct {params.outdir} \
    --bfile {params.bfile} \
    --ignore-weights YES \
    --power 0 \
    --kinship-raw YES")
From lines 40-47 of rules/gblup.smk:
run:
    shell("{ldak} --reml {params.out} \
    --pheno {input.pheno} \
    --pheno-name {params.trait} \
    --grm {params.grm} \
    --covar {input.covar} \
    --constrain YES \
    --reml-iter 500")
From lines 67-77 of rules/gblup.smk:
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --reml {params.out}{i}.{j} \
            --pheno {input.pheno} \
            --pheno-name {params.trait} \
            --grm {params.grm} \
            --keep {params.keep}{i}.train{j} \
            --covar {input.covar} \
            --constrain YES \
            --reml-iter 500")
From lines 98-112 of rules/gblup.smk:
run:
    for i in CV:
        for j in INDEX:
            shell("{ldak} --calc-blups {params.out}{i}.{j} \
            --grm {params.grm} \
            --remlfile {params.reml}{i}.{j}.reml \
            --bfile {params.bfile} \
            --covar {input.covar}")
            shell("{ldak} --calc-scores {params.out}{i}.{j} \
            --bfile {params.bfile} \
            --power 0 \
            --scorefile {params.out}{i}.{j}.blup \
            --keep {params.keep}{i}.test{j} \
            --pheno {input.pheno} \
            --pheno-name {params.trait}")
From lines 120-121 of rules/gblup.smk:
run:
    shell("Rscript src/04_summarize_gblup.R {input.blues}")
From lines 12-13 of rules/multiblup.smk:
run:
    shell("Rscript src/05_select_pathway_snps.R {params.pathway} \"{params.bincode}\"")
From lines 31-49 of rules/multiblup.smk:
run:
    # partition kinship matrix
    # --ignore-weights YES & --power 0 based on LDAK recommendations for prediction
    shell("{ldak} --cut-kins {params.outdir} \
    --bfile {params.bfile} \
    --partition-number 2 \
    --partition-prefix {params.prefix}")
    shell("{ldak} --calc-kins {params.outdir} \
    --bfile {params.bfile} \
    --partition 1 \
    --ignore-weights YES \
    --power 0 \
    --kinship-raw YES")
    shell("{ldak} --calc-kins {params.outdir} \
    --bfile {params.bfile} \
    --partition 2 \
    --ignore-weights YES \
    --power 0 \
    --kinship-raw YES")
From lines 67-75 of rules/multiblup.smk:
run:
    for p in PATHWAYS.keys():
        shell("{ldak} --reml {params.out}{p}/{params.trait}.multiblup \
        --pheno {input.pheno} \
        --pheno-name {params.trait} \
        --mgrm {params.mgrm}{p}/partition.list \
        --covar {input.covar} \
        --constrain YES \
        --reml-iter 500")
From lines 94-105 of rules/multiblup.smk:
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --reml {params.out}{p}/{params.trait}.cv{i}.{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait} \
                --mgrm {params.mgrm}{p}/partition.list \
                --keep {params.keep}{i}.train{j} \
                --covar {input.covar} \
                --constrain YES \
                --reml-iter 100")
From lines 125-140 of rules/multiblup.smk:
run:
    for p in PATHWAYS.keys():
        for i in CV:
            for j in INDEX:
                shell("{ldak} --calc-blups {params.out}{p}/{params.trait}.cv{i}.{j} \
                --mgrm {params.mgrm}{p}/partition.list \
                --remlfile {params.out}{p}/{params.trait}.cv{i}.{j}.reml \
                --bfile {params.bfile} \
                --covar {input.covar}")
                shell("{ldak} --calc-scores {params.out}{p}/{params.trait}.cv{i}.{j} \
                --bfile {params.bfile} \
                --power 0 \
                --scorefile {params.out}{p}/{params.trait}.cv{i}.{j}.blup \
                --keep {params.keep}{i}.test{j} \
                --pheno {input.pheno} \
                --pheno-name {params.trait}")
From lines 148-149 of rules/multiblup.smk:
run:
    shell("Rscript src/06_summarize_multiblup.R {input.blues}")
From lines 10-11 of rules/null_distribution.smk:
shell:
    "Rscript src/07_null_group_sizes.R"
From lines 24-25 of rules/null_distribution.smk:
shell:
    "Rscript src/08_generate_null_gene_groups.R {params.num}"
From lines 45-60 of rules/null_distribution.smk:
run:
    for r in RANDOM:
        shell("{ldak} --cut-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition-number 2 \
        --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition 1 \
        --ignore-weights YES \
        --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition 2 \
        --ignore-weights YES \
        --power 0")
From lines 75-83 of rules/null_distribution.smk:
run:
    for r in RANDOM:
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
        --pheno {input.pheno} \
        --pheno-name {params.trait} \
        --mgrm {params.mgrm}{r}/partition.list \
        --covar {input.covar} \
        --constrain YES \
        --reml-iter 500")
From lines 91-92 of rules/null_distribution.smk:
run:
    shell("Rscript src/09_summarize_null.R")
From lines 106-107 of rules/null_distribution.smk:
shell:
    "Rscript src/08a_generate_null_gene_groups_pathways.R {params.pathway}"
From lines 126-141 of rules/null_distribution.smk:
run:
    for r in range(1,1001):
        shell("{ldak} --cut-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition-number 2 \
        --partition-prefix {params.prefix}{r}/list")
        shell("{ldak} --calc-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition 1 \
        --ignore-weights YES \
        --power 0")
        shell("{ldak} --calc-kins {params.prefix}{r} \
        --bfile {params.bfile} \
        --partition 2 \
        --ignore-weights YES \
        --power 0")
From lines 156-164 of rules/null_distribution.smk:
run:
    for r in range(1,1001):
        shell("{ldak} --reml {params.prefix}{r}/{params.trait}.h2 \
        --pheno {input.pheno} \
        --pheno-name {params.trait} \
        --mgrm {params.mgrm}{r}/partition.list \
        --covar {input.covar} \
        --constrain YES \
        --reml-iter 500")
From lines 174-175 of rules/null_distribution.smk:
run:
    shell("Rscript src/09a_summarize_null_pathways.R {params.pathway}")
From lines 20-28 of rules/prep_data.smk:
run:
    shell("Rscript src/01_convert_to_ped.R")
    shell("plink --file {params.indir} \
    --recode 12 \
    --maf 0.05 \
    --mind 0.1 \
    --geno 0.1 \
    --make-bed \
    --out {params.outdir}")
From lines 38-39 of rules/prep_data.smk:
shell:
    "Rscript src/02_extract_gene_ids.R {input}"
From lines 58-63 of rules/prep_data.smk:
run:
    shell("{ldak} --cut-weights {params.outdir} \
    --bfile {params.bfile} \
    --section-kb 0.25")
    shell("{ldak} --calc-weights-all {params.outdir} \
    --bfile {params.bfile}")
From src/01_convert_to_ped.R (lines 6-50):
library(tidyverse)

snp_data <- read.csv("data/external/call_method_75/call_method_75_TAIR9.csv",
                     header = TRUE, skip = 1)

# create the MAP file
# requires Chromosome and Position info
# columns: Chromosome/rsid/physical_distance/Position

map <- snp_data %>%
  select(Chromosome, Positions) %>%
  mutate(rsid = paste0("S", seq_len(n())),
         distance = 0) %>%
  select(Chromosome, rsid, distance, Positions) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.map", col_names = FALSE)

# create the PED file
# rows are individuals
# columns are family/sample/paternal/maternal/sex/phenotype/marker1/marker2/etc

# individual ids
id <- colnames(snp_data[,-c(1:2)])

# get accession ids with phenotype data
acc360 <- read_delim("data/processed/pheno_file", delim = "\t") 

ped <- snp_data[,-c(1:2)] %>%
  # recode genotypes to be diploid
  mutate_if(is.factor, funs(fct_recode(., GG = "G",
                                       TT = "T",
                                       AA = "A",
                                       CC = "C"))) %>%
  t() %>%
  as.data.frame(row.names = FALSE) %>%
  # add columns expected by plink (family/sample id/paternal/maternal/sex/phenotype)
  add_column(., family = "AtRegMap",
             sample = gsub("X", "", id),
             paternal = 0,
             maternal = 0,
             sex = -9,
             phenotype = -9,
             .before = 1) %>%
  # only include accessions with phenotype data
  filter(sample %in% acc360$IID) %>%
  write_delim(., "data/raw/call_method_75_TAIR9.ped", col_names = FALSE)
From src/04_summarize_gblup.R (lines 6-86):
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait",
         value = "phenotype",
         -FID, -IID)

# heritability
gblup_h2 <- tibble(file = list.files(path = "models/gblup_h2",
                                     pattern = ".reml$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "trait", "method2", "fill"), remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  unnest(data) %>%
  filter(Component %in% "Her_ALL") %>%
  select(trait, Heritability, Her_SD)

head(gblup_h2)

# log likelihood
gblup_llik <- tibble(file = list.files(path = "models/gblup_h2",
                                     pattern = ".reml$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "trait", "metric", "fill", "fill2"), remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(gblup_llik = as.numeric(as.character(X1))) %>%
  select(trait, gblup_llik)

head(gblup_llik)

# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                    pattern = ".test", full.names = TRUE)) %>%
             mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
                    fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
                    fold = gsub("test", "", fold)) %>%
             mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
             unnest(cols = c("file_contents")) %>%
             rename("FID" = "V1",
                    "IID" = "V2")

# predictive accuracy
gblup_pa <- tibble(file = list.files(path = "models/gblup",
                                     pattern = ".pred$",
                                     full.names = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "trait", "cvnum", "fold", "fill"), remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)
  # left_join(., training_coef, by = c("trait", "cvnum", "fold"))

# training_coef <- tibble(file = list.files(path = "models/gblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]", into = c("source", "method", "trait", "cvnum", "fold", "fill"), remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, cvnum, fold, Effect)
#
# head(training_coef)
#
# gblup_pa <- tibble(file = list.files(path = "models/gblup",
#                                      pattern = ".profile$",
#                                      full.names = TRUE)) %>%
#   separate(file, sep = "/|[.]", into = c("source", "method", "trait", "cvnum", "fold", "fill"), remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("trait", "cvnum", "fold"))

head(gblup_pa)

save(gblup_h2, gblup_llik, gblup_pa, file = "reports/gblup.RData")
From src/05_select_pathway_snps.R (lines 7-91):
library(tidyverse)
library(fuzzyjoin)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)
print(args[1])
print(args[2])

# read in gene list and annotations (from biomaRt)
gene_list <- read.table("data/processed/gene_list_tair10.txt", header = TRUE)

################################################################################
## Function to extract genes for protein related pathways
## input: data frame with chromosome name, ensembl gene id, external gene name,
## start/end position, and GO category names (name_1006)
################################################################################

extract_genes <- function(genes, category, go_term, snp_data, filename) {
  tmp <- genes[grep(paste0('^', category), genes[,go_term], perl = TRUE),] %>%
    distinct(ensembl_gene_id, .keep_all = TRUE) %>%
    mutate_at(c("chromosome_name", "ensembl_gene_id"), funs(factor(.))) %>%
    mutate(mn = start_position - 2500,
           mx = end_position + 2500,
           chromosome_name = as.character(chromosome_name))

  snp_data <- snp_data %>% mutate(chromosome = as.character(chromosome))

  snp_tmp <- fuzzy_inner_join(snp_data, tmp, by = c("chromosome" = "chromosome_name",
                                        "position" = "mn",
                                        "position" = "mx"),
                    match_fun = list(`==`, `>=`, `<=`)) %>%
    droplevels() %>%
    dplyr::select(chromosome, snp_id, position, ensembl_gene_id.x,
                  ensembl_gene_id.y, external_gene_name, BINCODE, NAME) %>%
    distinct(., snp_id, .keep_all = TRUE)

  dir.create(paste0("data/processed/pathways/", filename), recursive = TRUE)
  write.table(snp_tmp, paste0("data/processed/pathways/",
                              filename, "/", filename, ".txt"),
              col.names = TRUE, row.names = FALSE)

  return(snp_tmp)
}

# Read in annotations from Mapman

cats <- read.csv("data/external/Ath_AGI_LOCUS_TAIR10_Aug2012.csv", header = TRUE, na.strings = c("", "NA"))
cats$IDENTIFIER <- toupper(cats$IDENTIFIER) # convert locus ids to uppercase
str(cats)
head(cats)

# merge with gene_list
gene_cats <- merge(gene_list, cats, by.x = "ensembl_gene_id", by.y = "IDENTIFIER")

# read in snps and order by Chromosome/Position
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
  mutate_at(c("chromosome"), funs(factor(.))) %>%
  mutate(position = as.integer(position)) %>%
  arrange(chromosome, position)

# create directories for pathways
dir <- paste0("data/processed/pathways/", args[1])
dir.create(dir, recursive = TRUE, showWarnings = FALSE)

# pull out annotation categories based on Mapman categories
pathway <- if(args[2] == 0) {
  read.table(paste0(dir, "/", args[1], ".txt"), header = TRUE)
} else {
  extract_genes(gene_cats, c(args[2]), "BINCODE", snps, args[1])
}

head(pathway)

unique(pathway$BINCODE)
unique(pathway$NAME)

# export snplists for LDAK
pathway %>%
  select(snp_id) %>%
  write.table(., paste0(dir, "/list1"), quote = FALSE, row.names = FALSE, col.names = FALSE)

snps %>%
  select(snp_id) %>%
  filter(!snp_id %in% pathway$snp_id) %>%
  write.table(., paste0(dir, "/list2"), quote = FALSE, row.names = FALSE, col.names = FALSE)
From src/06_summarize_multiblup.R (lines 6-109):
library(tidyverse)
library(broom)

# use command line args
args <- commandArgs(trailingOnly = TRUE)

# phenotypes
pheno <- read.table(paste0(args[1]), header = TRUE) %>%
  gather(key = "trait",
         value = "phenotype",
         -FID, -IID)

# heritability
multiblup_h2 <- tibble(file = list.files(path = "models/multiblup_h2",
                                     pattern = ".share$",
                                     full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "pathway", "trait", "method2", "fill"), remove = FALSE) %>%
  # mutate(data = lapply(file, read.table, header = TRUE, skip = 13)) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  # filter(Component %in% c("Her_K1", "Her_K2")) %>%
  # select(trait, pathway, Component, Heritability, Her_SD)
  select(trait, pathway, Component, Share, SD)

head(multiblup_h2)

# likelihood ratio
multiblup_llik <- tibble(file = list.files(path = "models/multiblup_h2",
                                     pattern = ".reml$",
                                     full.names = TRUE, recursive = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "pathway", "trait", "method2", "fill"), remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(multiblup_llik = as.numeric(as.character(X2))) %>%
  select(trait, pathway, multiblup_llik)

head(multiblup_llik)

# predictive ability

# individuals used for test set
cross_val <- tibble(filename = list.files(path = "data/processed/cross_validation",
                    pattern = ".test", full.names = TRUE)) %>%
             mutate(cvnum = word(filename, start = 4, end = 4, sep = "/|[.]"),
                    fold = word(filename, start = 5, end = 5, sep = "/|[.]"),
                    fold = gsub("test", "", fold)) %>%
             mutate(file_contents = map(filename, ~read.table(., header = FALSE))) %>%
             unnest(cols = c("file_contents")) %>%
             rename("FID" = "V1",
                    "IID" = "V2")

# predictive accuracy
multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
                                     pattern = ".pred$",
                                     full.names = TRUE,
                                     recursive = TRUE)) %>%
  separate(file, sep = "/|[.]", into = c("source", "method", "pathway", "trait", "cvnum", "fold"), remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE)) %>%
  unnest(data) %>%
  left_join(pheno, ., by = c("FID" = "ID1", "IID" = "ID2", "trait")) %>%
  inner_join(., cross_val)

# training_coef <- tibble(file = list.files(path = "models/multiblup",
#                                           pattern = ".coeff$",
#                                           full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]", into = c("source", "method", "pathway", "trait", "cvnum", "fold", "fill"), remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   filter(Component %in% "Intercept") %>%
#   select(trait, pathway, cvnum, fold, Effect)
#
# head(training_coef)
#
# multiblup_pa <- tibble(file = list.files(path = "models/multiblup",
#                                          pattern = ".profile$",
#                                          full.names = TRUE, recursive = TRUE)) %>%
#   separate(file, sep = "/|[.]", into = c("source", "method", "pathway", "trait", "cvnum", "fold"), remove = FALSE) %>%
#   mutate(data = lapply(file, read.table, header = TRUE)) %>%
#   unnest(data) %>%
#   left_join(., training_coef, by = c("pathway", "trait", "cvnum", "fold"))

head(multiblup_pa)

# number of snps/genes in each pathway
files <- list.files(path = "data/processed/pathways", pattern = "*.txt", full.names = TRUE,
                    recursive = TRUE)

pathways <- tibble(file = files) %>%
  separate(file, sep = "/", into = c("dir", "dir2", "dir3", "pathway", "pathway2"),
           remove = FALSE) %>%
  mutate(data = lapply(file, read.table, header = TRUE),
         data = lapply(data, mutate_all, as.character)) %>%
  unnest(data) %>%
  group_by(pathway) %>%
  summarise(size = length(unique(snp_id)),
            n_genes = length(unique(ensembl_gene_id.x))) %>%
  select(pathway, n_genes, size) %>%
  ungroup() %>%
  write_csv(., "reports/pathways_summary.csv")

head(pathways)

save(multiblup_h2, multiblup_llik, multiblup_pa, pathways, file = "reports/multiblup.RData")
From src/07_null_group_sizes.R (lines 7-17):
gene_group_sizes <- round(runif(5000, 1, 50000))
hist(gene_group_sizes)
write.table(gene_group_sizes, "data/interim/null_group_sizes.txt", row.names = FALSE)

# write sampling file (control_sampling.txt) for sbatch
# this can be used if submitting with sbatch to run groups of 50 random sets at a time
sampling <- as.data.frame(seq(1, length(gene_group_sizes), by = 50))
colnames(sampling) <- "start"
sampling$end <- sampling$start + 49

write.table(sampling, "data/interim/control_sampling.txt", row.names = FALSE, col.names = FALSE)
From src/08a_generate_null_gene_groups_pathways.R (lines 11-95):
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# name of pathway
pathway <- args[1]

# pathway SNPs
snp_ids <- read.table(paste0("data/processed/pathways/", pathway, "/list1"))

# number of SNPs to sample
nsnps <- nrow(snp_ids)

# number of random subsets to export
end <- 1000

cat("pathway:", pathway, "- sampling", end, "random subsets of", nsnps, "snps each\n")

# read in SNP and gene information for all SNPs
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
        arrange(chromosome, position) %>%
        filter(!snp_id %in% snp_ids$V1) # remove pathway SNPs

str(snps)

# # read in random uniform distribution for number of SNPs
# snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)
#
# a function to sample genes randomly up to a specified number of SNPs
# n = numer of SNPs to sample
# bprange = number of basepairs to include before/after a gene

sample_genes <- function(data, nsnps, bprange) {
  out <- data.frame()
  while (nrow(out) < nsnps) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id,])
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE) # select a random gene
    snp_sub <- set[set$ensembl_gene_id==gene,] # select all snps within gene
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    positions <- data[data$position > mn & data$position < mx & data$chromosome==chr,] # select SNPs in gene with X kb buffer
    out <- rbind(out, positions) # combine until threshold number of snps are met
    }
    return(out)
  }

# loop through gene sets specified by commandArgs

for (i in 1:end) {
  if(file.exists(paste0("data/processed/random_sets_pathways/", pathway, "/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({

  cat("exporting random gene group", i, "of", end, "\n")

  random <- sample_genes(snps, nsnps = nsnps, bprange = 2500) %>%
  arrange(chromosome, position)

  dir.create(paste0("data/processed/random_sets_pathways/", pathway), showWarnings = FALSE, recursive = TRUE)

  write.table(random, paste0("data/processed/random_sets_pathways/", pathway, "/null_", i, ".txt"))

  # export snplists for LDAK
  dir <- paste0("data/processed/random_sets_pathways/", pathway, "/c_", i)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # SNPs in random gene group
  random %>%
    select(snp_id) %>%
    unique() %>%
    write.table(., paste0(dir, "/list1"), quote = FALSE, row.names = FALSE, col.names = FALSE)

  # remaining SNPs
  snps %>%
    select(snp_id) %>%
    filter(!snp_id %in% random$snp_id) %>% # exclude SNPs already in the random group
    write.table(., paste0(dir, "/list2"), quote = FALSE, row.names = FALSE, col.names = FALSE)

  }, error = function(e){cat("ERROR:", conditionMessage(e), "\n")})
}
From src/08_generate_null_gene_groups.R (lines 10-83):
library(tidyverse)

# use command line arguments
args = commandArgs(trailingOnly = TRUE)
# end <- as.numeric(args[1]) + 49
end <- 5000
cat("random subsets from ", args[1], "to", end, "\n")

# read in SNP and gene information
snps <- read.table("data/processed/snp_gene_ids_tair10.txt", header = TRUE) %>%
arrange(chromosome, position)

# read in random uniform distribution for number of SNPs
snp_number <- read.table("data/interim/gene_groups.txt", header = TRUE)

# a function to sample genes randomly up to a specified number of SNPs
# n = numer of SNPs to sample
# bprange = number of basepairs to include before/after a gene

sample_genes <- function(data, n, bprange) {
  out <- data.frame()
  while (nrow(out) < n) {
    # subset snps by genes already selected in output
    set <- na.omit(data[!data$ensembl_gene_id %in% out$ensembl_gene_id,])
    gene <- sample(set$ensembl_gene_id, 1, replace = FALSE) # select a random gene
    snp_sub <- set[set$ensembl_gene_id==gene,] # select all snps within gene
    chr <- snp_sub$chromosome[1]
    mn <- min(snp_sub$position) - bprange
    mx <- max(snp_sub$position) + bprange
    positions <- data[data$position > mn & data$position < mx & data$chromosome==chr,] # select SNPs in gene with X kb buffer
    out <- rbind(out, positions) # combine until threshold number of snps are met
    }
    return(out)
  }

# loop through gene sets specified by commandArgs

for (i in args[1]:end) {
  if(file.exists(paste0("data/processed/random_sets/null_", i, ".txt"))) {
    cat("file already exists for sample", i, "\n")
    next
  }
  tryCatch({
  print(snp_number[i,])
  random <- sample_genes(snps, n = snp_number[i,], bprange = 2500) %>%
  arrange(chromosome, position)

  dir.create("data/processed/random_sets", showWarnings = FALSE)

  write.table(random, paste0("data/processed/random_sets/null_", i, ".txt"))

  # export snplists for LDAK
  dir <- paste0("data/processed/random_sets/c_", i)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  random %>%
    select(snp_id) %>%
    unique() %>%
    write.table(., paste0(dir, "/list1"), quote = FALSE, row.names = FALSE, col.names = FALSE)

  snps %>%
    select(snp_id) %>%
    filter(!snp_id %in% random$snp_id) %>% # exclude SNPs already in the random group
    write.table(., paste0(dir, "/list2"), quote = FALSE, row.names = FALSE, col.names = FALSE)

  }, error = function(e){cat("ERROR:", conditionMessage(e), "\n")})
}


# one-off cleanup: deduplicate SNP ids in previously written list files
# (assumes `files` holds the paths of the list files to clean)
for (f in files) {
  foo <- read.table(paste0(f), header = FALSE)
  print(length(foo$V1))
  foo %>%
    select(V1) %>%
    unique() %>%
    write.table(., paste0(f), row.names = FALSE, col.names = FALSE, quote = FALSE)
}
From src/09a_summarize_null_pathways.R (lines 16-152):
library(tidyverse)
library(data.table)

################################################################################
# Step 1: summarise random gene groups

# use command line arguments
args = commandArgs(trailingOnly = TRUE)

# filepath
path <- paste0("data/processed/random_sets_pathways/", args[1])

# list files for random gene groups
random_files <- data.frame(
  filename = list.files(path = path,
                        pattern = ".txt$",
                        full.names = TRUE,
                        recursive = TRUE)
)

# faster alternative using the OS 'find' command; this overrides the
# list.files() result above (much faster for thousands of files)
random_files <- data.frame(filename =
                         system(paste("find", path, "| grep .txt", sep = " "),
                         intern = TRUE)
                       )

# summarize number of snps and genes in each random group

random_summary <- random_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 4, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 5, sep = "/"),
         # cleaning up a bit.. this is kind of messy
         random_group = gsub(".txt", "", random_group),
         random_group = gsub("null_", "c_", random_group),
         # read in file contents into a list
         file_contents = map(filename,
                             ~ fread(., drop = 1))) %>%
#                             ~ read.table(.))) %>%
  # unnest list into a data frame
  unnest() %>%
  group_by(pathway, random_group) %>%
  # count number of unique genes and snps in each random gene group
  summarize(n_genes = n_distinct(ensembl_gene_id),
            n_snps = n_distinct(snp_id))

head(random_summary)

################################################################################
# Step 2: compute likelihood ratio test statistic

# import log likelihood for GBLUP model
# this is used to compute the likelihood ratio statistic
load("reports/gblup.RData")

# filepath to model output from REML
path <- paste0("models/null_h2_pathways/", args[1])

## list files for reml output of each random gene group

# list files using 'find' OS command
# this option is *much* faster than 'list.files'
lr_files <- data.frame(filename =
                         system(paste("find", path, "| grep .reml", sep = " "),
                         intern = TRUE)
                       )

lr_null <- lr_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename,
                             ~ fread(.,
                                     header = FALSE,
                                     skip = 10,
                                     nrows = 1))) %>%
  # unnest list into a data frame
  unnest() %>%
#   filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(V2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, random_group, null_llik, gblup_llik) %>%
  mutate(LR = 2 * (null_llik - gblup_llik))

head(lr_null)

################################################################################
## Step 3: summarize proportion of variance explained

## list files for proportion of variance explained
# uses same path as LR output

h2_files <- data.frame(filename =
                         system(paste("find", path, "| grep .share", sep = " "),
                         intern = TRUE)
                       )


# h2_files <- data.frame(filename =
#                          list.files(path = path,
#                                     pattern = ".share$",
#                                     full.names = TRUE,
#                                     recursive = TRUE)
# )

h2_null <- h2_files %>%
  mutate(filename = as.character(filename),
         # pathway id
         pathway = word(filename, start = 3, sep = "/"),
         # random group id (1-1000)
         random_group = word(filename, start = 4, sep = "/"),
         # trait id
         trait = word(filename, start = 5, sep = "/|[.]"),
         # read in file contents into a list
         file_contents = map(filename,
                             ~ fread(., header = TRUE))) %>%
  # unnest list into a data frame
  unnest() %>%
  select(trait, pathway, random_group, Component, Share, SD) %>%
  left_join(., random_summary, by = c("pathway", "random_group"))

head(h2_null)

################################################################################
## Step 4: merge and export results!

output <- paste0("reports/null_dist_results_pathways/", args[1], "_null.csv")

null_dist <- left_join(lr_null, h2_null,
                       by = c("trait", "pathway", "random_group")) %>%
  write_csv(., output)
From src/09_summarize_null.R (lines 6-68):
library(tidyverse)
library(data.table)
# library(quantreg)

# summarize random gene groups (number of SNPs & number of genes)

feature_sizes <- read_csv("reports/random_subsets_summary.csv")
head(feature_sizes)

# import log likelihood for GBLUP model
load("reports/gblup.RData")

## list files for reml output of each random gene group
files <- list.files(path = "models/null_h2", pattern = ".reml$",
                    full.names = TRUE, recursive = TRUE)

lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
                      idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]",
           into = c("dir", "source", "pathway", "trait", "metric", "model"), remove = FALSE) %>%
  filter(Num_Kinships %in% "Alt_Likelihood") %>%
  mutate(null_llik = as.numeric(as.character(X2))) %>%
  left_join(., gblup_llik, by = "trait") %>%
  select(trait, pathway, Num_Kinships, null_llik, gblup_llik) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
                          include.lowest = TRUE, dig.lab = 10),
         LR = 2 * (null_llik - gblup_llik))

# lr_null <- rbindlist(lapply(files, read.table, header = TRUE),
#                       idcol = "filename", use.names = TRUE) %>%
#   mutate(filename = files[filename]) %>%
#   separate(filename, sep = "/|[.]",
#            into = c("dir", "source", "pathway", "trait", "metric", "model"), remove = FALSE) %>%
#   filter(Num_Kinships %in% "Alt_Likelihood") %>%
#   # mutate(null_llik = as.numeric(as.character(X2))) %>%
#   # left_join(., gblup_llik, by = "trait") %>%
#   # select(trait, pathway, Num_Kinships, null_llik, gblup_llik) %>%
#   rename("LRT_Stat" = X2) %>%
#   select(trait, pathway, LRT_Stat) %>%
#   left_join(., feature_sizes, by = "pathway") %>%
#   mutate(group_size = cut(size, breaks = seq(0, 50000, by = 10000),
#                           include.lowest = TRUE, dig.lab = 10))

head(lr_null)

## read in variance component estimates
files <- list.files(path = "models/null_h2", pattern = ".share$",
                    full.names = TRUE, recursive = TRUE)

h2_null <- rbindlist(lapply(files, read.table, header = TRUE),
                      idcol = "filename", use.names = TRUE) %>%
  mutate(filename = files[filename]) %>%
  separate(filename, sep = "/|[.]", into = c("dir", "source", "pathway", "trait", "metric", "model"), remove = FALSE) %>%
  left_join(., feature_sizes, by = "pathway") %>%
  select(trait, pathway, size, n_genes, Component, Share, SD)

head(h2_null)

## merge!
null_dist <- left_join(lr_null, h2_null, by = c("trait", "pathway")) %>%
  write_csv(., "reports/lr_null_results.csv")

URL: https://github.com/mishaploid/AA-GenomicPrediction
Version: v1.0
License: MIT