Multi-Step Genomic Analysis Pipeline for DeCOI WGS Cohort


Pipeline for QC, GWAS and gene-based collapsing analyses within the DeCOI WGS cohort (1220 individuals). Rules for ROH analysis and gene-set-based analyses are also included. This repository serves documentation purposes and the pipeline will not run out of the box - e.g. adjustments are needed for the local compute environment, and not all input files (e.g. genotype files, phenotype files, annotation sources) are provided.

Setup

This pipeline was run on a compute cluster running Linux (CentOS Linux 7) and slurm 22.05.6. Miniconda3 was installed manually (https://docs.conda.io/en/latest/miniconda.html). The setup is quite complex due to the use of the slurm scheduler and of Apache Spark. It was done in the following way:

Two conda environments were manually created:

  • an environment running snakemake-7.3.7:
	conda create --name snakemake7 -c bioconda snakemake=7.3.7
	# To let snakemake submit jobs to slurm, a profile file with default settings for snakemake was used.
	# For documentation, you can find the file I used in "config/snakemake_slurm_profile/config.yaml".
	# You would need to modify it to have e.g. the correct default partition and account. Then move it to snakemake's config directory:
	mkdir -p ~/.config/snakemake/slurm
	cp config/snakemake_slurm_profile/config.yaml ~/.config/snakemake/slurm
	# Afterwards you should be able to execute "snakemake --profile slurm"; each rule that is executed is then submitted as an individual slurm job. Resource requirements can be specified in each snakemake rule (see file workflow/Snakefile).
  • an environment for running hail / Apache Spark: Note that this environment is created to set up an Apache Spark cluster on top of slurm, which can then be used by hail. This will probably need some testing. If running hail on a single node is enough, you can simply install hail via conda (e.g. run the first two commands from below and then "conda install -c bioconda hail"; see the sketch after this list). You would then set the variable cluster to "no" in config/config.yaml.
	conda env update --file env/spark_from_history.yaml
	conda activate spark
	mkdir -p ~/scratch/spark
	cd ~/scratch/spark
	wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
	tar zxvf spark-3.1.1-bin-hadoop3.2.tgz
	git clone https://github.com/hail-is/hail.git
	cd hail/hail
	make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.12.13 SPARK_VERSION=3.1.1
	pip install pyspark==3.1.1
	# Now manually modify the file "workflow/scripts/spark_submit_command.sh" as indicated within the file.
	# Also adjust the last 5 variables within config/config.yaml.
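
For reference, the single-node alternative mentioned above boils down to the following (a minimal sketch, assuming the bioconda channel is reachable; "cluster" is the variable in config/config.yaml):
	conda env update --file env/spark_from_history.yaml
	conda activate spark
	conda install -c bioconda hail
	# then, in config/config.yaml, set:
	#   cluster: "no"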

The pipeline could now be tested with a dry run (planned jobs are only printed, not executed):

conda activate snakemake7
snakemake -np

The whole pipeline was executed by using the file "run.sh"; this file would also need adjustments (as indicated in the file):

conda activate snakemake7
sbatch run.sh
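
For orientation, run.sh essentially wraps the main snakemake call in a slurm batch script. A hypothetical minimal version could look like this (the #SBATCH values are placeholders that need local adjustment):

#!/bin/bash
#SBATCH --job-name=snakemake_main
#SBATCH --time=7-00:00:00
snakemake --profile slurm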

Here are the DAGs of some analyses that were conducted - the names of the rules are indicated.

Population-PCA:

PCA Pipeline

GWAS:

GWAS Pipeline

RVAS:

RVAS Pipeline

Inputs

Not all input files that were used for the pipeline are included in this repository, due to privacy issues or file size. The following files are missing:

  • genotype data / cohort bcf, which was produced by glnexus. The path ("input_vcf") can be set in the file config/config.yaml.

  • fasta file of the reference genome (here hg38). The path ("fasta") can be set in the file config/config.yaml.

  • Data of the 1000 Genomes Project; you only need to provide a folder - there is a snakemake rule ("download_1000G_genotypes") to download the data. The path ("location_1000G") can be set in the file config/config.yaml.

  • Phenotype files are not included due to privacy issues; see the resources folder for a short description of those files (for reference only).

  • several annotation sources for VEP are not included; these were placed in the folder indicated by the variable "database_dir" in the config file:
    "gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz" - source: https://gnomad.broadinstitute.org/downloads#v2-liftover ,
    "dbNSFP4.1a_hg38.gz" - source: https://sites.google.com/site/jpopgen/dbNSFP - prepared for the dbNSFP plugin of VEP as described in https://github.com/Ensembl/VEP_plugins/blob/release/102/dbNSFP.pm
    "spliceai_scores_sorted.hg38.vcf.gz" - source: downloaded from https://basespace.illumina.com/s/otSPW8hnhaZR , then sorted by chrom:pos and bgzipped/indexed (see the sketch below)

Code Snippets

library(tidyverse)

cov_path<-"results/var_sets/EURs_unrel/B2/cov.cov"
prs_path<-"results/PRS/scores.profile"
cov_out_path<-"results/var_sets/EURs_unrel/B2/cov.cov"

cov_path<-snakemake@input[[1]]
prs_path<-snakemake@input[[2]]
cov_out_path<-snakemake@output[[1]]

COV<-read_tsv(file=cov_path, col_names=TRUE)
PRS<-read.table(file = prs_path, header = TRUE)

print("Percentage of individuals in the COV file that also got a PRS calculated:")
mean(COV$IID %in% PRS$IID)

PRS_red<-PRS %>%
  select(IID, SCORESUM)

COV<-COV %>% 
  left_join(PRS_red, by="IID")

write_tsv(x=COV, file=cov_out_path, col_names = TRUE)
library(tidyverse)
library(readxl)

args=c("config/post_df3_HGI_sample_QC_summary.xlsx", "data/hail_gather_data/for_sample_QC.tsv", "data/plotPhenoInfo/")

args = commandArgs(trailingOnly=TRUE)
print(args)
PhenoFileLocation=args[1]
qc_file<-args[2]
outFolder=args[3]

samplesheet<-read_excel(PhenoFileLocation, sheet = "qc_table")

qc_table<-read_tsv(qc_file)

dir.create(outFolder,recursive = TRUE, showWarnings = FALSE)
setwd(outFolder)

samplesheet<-samplesheet%>% 
  rename(sex_to_use=`Sex (f=fem, m=m)`)%>%
  mutate(sex_for_fam=ifelse(sex_to_use=="f", 2, ifelse(sex_to_use=="m",1,0)))

sample_sex<-samplesheet %>% select(sex_to_use, s)

qc_table<-qc_table %>%
  left_join(sample_sex, by=(c("s"="s")))

qc_table<-qc_table %>% 
  mutate(Age=as.integer(Age),
         A2=as.integer(A2),
         B2=as.integer(B2),
         C2=as.integer(C2),
         is_case=as.integer(is_case)
         )



# Check Age vs. Sex

SexPlot<-ggplot(qc_table)+
  geom_histogram(aes(x=Age, fill=sex_to_use), bins=20)+
  theme_bw()

print(SexPlot)
ggsave(filename="SexPlot.pdf",
       plot=SexPlot,
       width=7,
       height=3)

SexInfo<-qc_table %>%
  summarise(frac_male=mean(sex_to_use=="m"), num=sum(sex_to_use!="noSex"))
write_tsv(x=SexInfo, file="SexInfo.tsv")

# Check Age vs. Severity
qc_table <- qc_table %>% 
  mutate(is_case_F=as.factor(is_case))

AgePlot<-ggplot(qc_table)+
  geom_histogram(aes(x=Age, fill=is_case_F), bins=20)+
  geom_vline(data=qc_table %>% filter(is_case==0), aes(xintercept = mean(Age)), col="red")+
  geom_vline(data=qc_table %>% filter(is_case==1), aes(xintercept = mean(Age)), col="green")+
  geom_vline(data=qc_table %>% filter(is_case==2), aes(xintercept = mean(Age)), col="blue")+
  theme_bw()

print(AgePlot)
ggsave(filename="AgePlot.pdf",
       plot=AgePlot,
       width=7,
       height=3)

AgeInfo<-qc_table %>% 
  group_by(is_case)%>%
  summarise(MEAN=mean(Age), SD=sd(Age), N=length(Age))%>%
  mutate(SEM=SD/sqrt(N))


write_tsv(x=AgeInfo, file="AgeInfo.tsv")
library(tidyverse)
library(readxl)

# Params:
# 1. input pheno-excel
# 2. input PC file
# 3. output file

args=c("config/post_df3_HGI_sample_QC_summary.xlsx","data/PCAcovar/A2_EUR_PCA.eigenvec")

# first: excel_file, second: fam-file, third: Phenotype column in excel

args = commandArgs(trailingOnly=TRUE)
print(args)

# File formats:
#Line 1 : Header with FID, IID and C covariate names.
#Followed by lines of C+2 values. Space/tab separated.

samplesheet<-read_tsv(args[1])%>%
  mutate(sex_for_fam=ifelse(is_female==TRUE, 2, ifelse(is_female==FALSE,1,0)))




col_names_eigenv<-c("FID","IID", paste0("PC",1:10))

# sex coding: 1=male, 2=female
# pheno coding: 1=control, 2=case
eigenv<-read_delim(args[2], col_names=col_names_eigenv, delim=" ")

cov_info<-eigenv %>% left_join(samplesheet, by=c("IID"="s"))%>%
  mutate(Age=as.integer(Age)) %>%
  mutate(age_sex=Age*sex_for_fam)%>%
  mutate(Age2=Age*Age)%>%
  select(all_of(col_names_eigenv), Age, Age2, age_sex, sex_for_fam)

write_tsv(x = cov_info, file = args[3], col_names = TRUE) # covariate file for regenie
library(tidyverse)
library(readxl)
# Params:
# 1. input pheno file (tsv)
# 2. input fam file
# 3. pheno column name in excel
# 4. Output: fam file (Plink)
# 5. Output: pheno file (Regenie)

#testing
#setwd("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline")
args=c("../../resources/EURs_unrel.tsv","data/anc_vcf/EUR_vcf_not_rel.vcf.bgz.fam", "A1", 
       "data/regenie_pheno/fam_file.fam", "data/regenie_pheno/phenox")
# first: excel_file, second: fam-file, third: Phenotype column in excel

args = commandArgs(trailingOnly=TRUE)
print(args)

# File formats:
# FAM file:
# A text file with no header line, and one line per sample with the following six fields:
# Family ID ('FID')
# Within-family ID ('IID'; cannot be '0')
# Within-family ID of father ('0' if father isn't in dataset)
# Within-family ID of mother ('0' if mother isn't in dataset)
# Sex code ('1' = male, '2' = female, '0' = unknown)
# Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

# Regenie pheno file:
# FID IID Y1 Y2
# 0=control, 1=case, missing values must be coded as NA

samplesheet<-read_tsv(args[1])%>%
  mutate(sex_for_fam=ifelse(is_female==TRUE, 2, ifelse(is_female==FALSE,1,0)))

pheno_col_num<-which(colnames(samplesheet) == args[3])
pheno_PLINK<-as.integer(unlist(samplesheet[,pheno_col_num]))+1
samplesheet$pheno_P=replace_na(pheno_PLINK, -9)
pheno_REGENIE<-as.integer(unlist(samplesheet[,pheno_col_num]))
samplesheet$pheno_R=pheno_REGENIE

# add sex stratification / age strat
samplesheet <- samplesheet %>%
  mutate(pheno_R_female=ifelse(is_female==TRUE, pheno_R, NA), # just keep females
         pheno_R_male=ifelse(is_female==FALSE, pheno_R, NA), # just keep males
         pheno_R_GE60=ifelse(pheno_R==1 & Age < 60, NA, pheno_R),  # set cases 59 or younger to missing
         pheno_R_LT60=ifelse(pheno_R==1 & Age >= 60, NA, pheno_R),
         pheno_R_GE60CC=ifelse(Age < 60, NA, pheno_R),  # set persons 59 or younger to missing
         pheno_R_LT60CC=ifelse(Age >= 60, NA, pheno_R)) # set persons 60 or older to missing 

col_names_fam<-c("FID","IID","father","mother","sex","pheno")

# sex coding: 1=male, 2=female
# pheno coding: 1=control, 2=case
fam<-read_delim(args[2],
delim=" ", 
col_names=col_names_fam,
col_types=c(
  FID = col_character(),
  IID = col_character(),
  father = col_character(),
  mother = col_character(),
  sex = col_integer(),
  pheno = col_integer()
)
)

print("fam:")
head(fam)
print("samplesheet:")
head(samplesheet)

all_info_joined<-fam %>% left_join(samplesheet, by=c("IID"="s"))%>%
  mutate(sex=sex_for_fam)

fam_new<-all_info_joined %>% 
  select(all_of(col_names_fam[1:4]), sex_for_fam, pheno_P)
write_tsv(x = fam_new, file = args[4], col_names = FALSE) # fam_file for plink

pheno_file_regenie <- all_info_joined %>%
  select(FID, IID, pheno_R, pheno_R_female, pheno_R_male, pheno_R_GE60, pheno_R_LT60, pheno_R_GE60CC, pheno_R_LT60CC)

colnames(pheno_file_regenie)<-c("FID","IID",
                                args[3],
                                paste0(args[3],"_female"),
                                paste0(args[3],"_male"),
                                paste0(args[3],"_GE60"),
                                paste0(args[3],"_LT60"),
                                paste0(args[3],"_GE60CC"),
                                paste0(args[3],"_LT60CC"))

print("out:")
head(pheno_file_regenie)

write_tsv(x = pheno_file_regenie, file = args[5], col_names = TRUE) # pheno file for regenie
if (!require("GenomicRanges", quietly = TRUE) & (!require("rtracklayer", quietly = TRUE)) ){
  install.packages("BiocManager", repos='http://cran.us.r-project.org')
  BiocManager::install("GenomicRanges")
  BiocManager::install("rtracklayer")
}

library(tidyverse)
library(data.table)
library(GenomicRanges)
library(rtracklayer)


gene_set_path<-"resources/gene_sets.tsv"
gene_set_path<-snakemake@input[[1]]
bed_path<-"resources/genomic_ranges.bed"
bed_path<-snakemake@input[[2]]
aaf_path<-"results/all_vars_for_RVAS/data.anno.aaf.file.txt"
aaf_path<-snakemake@input[[3]]
anno_path<-"/home/aschmidt/Arbeit_Gen/COVID_WGS/COVID_annotation/results/for_RVAS/all_contigs_anno.file.txt"
anno_path<-snakemake@input[[4]]
bim_path<-"results/all_vars_for_RVAS/eurs.bim"
bim_path<-snakemake@input[[5]]
#order_csq<-"resources/genomic_ranges.bed"
#order_csq<-snakemake@input[[6]]

new_anno_path<-"set.annos.tsv"
new_anno_path<-snakemake@output[[1]]
new_aafs_path<-"aafs.tsv"
new_aafs_path<-snakemake@output[[2]]
new_sets_path<-"sets.tsv"
new_sets_path<-snakemake@output[[3]]
relevant_variants_path<-"relevant_variants.tsv"
relevant_variants_path<-snakemake@output[[4]]
new_bim_path<-"new_bim.bim"
new_bim_path<-snakemake@output[[5]]

aafs<-read_delim(aaf_path, delim=" ", col_names = c("ID", "AAF"))
print("aafs:")
head(aafs)
annos<-read_delim(anno_path, delim=" ", col_names = c("ID", "gene_id", "annot"))
print("annos")
head(annos)

bim<-fread(bim_path, sep ="\t", col.names = c("CHR", "ID", "cM","BP","A1","A2"))

print("bim")
head(bim)

#bim_test<-head(bim, n=5000000)

# remap all variants onto pseudo-chromosome "1" with sequential positions, so that sets can combine variants from different real chromosomes
bim_new_id<-bim %>%
  mutate(IDnew=paste("1", 1:n(), A1, A2, sep=":"))%>%
  mutate(BPnew=1:n())%>%
  mutate(CHRnew=1)

comb_sources<-bim_new_id %>%
  left_join(aafs, by="ID")%>%
  left_join(annos, by="ID")

print("combsource")
head(comb_sources)
head(comb_sources %>% filter(!is.na(annot)))

# make sets based on gene definitions
gene_sets<-read_tsv(gene_set_path)
new_sets<-tibble()
new_annos<-tibble()
new_aaf<-tibble()

score_conseq<-function(conseq){
  conseq_tbl<-c("pLoF",
                "missense.revel07",
                "missense.revel05",
                "missense.revel03",
                "moderate",
                "synonymous",
                "UTR5_CADD",
                "UTR5",
		            "UTR3_CADD",
                "UTR3",
                "promoter_CADD",
                "promoter",
                "enhancer_CADD",
                "enhancer",
                "regionBased")


conseq_rank<-c()
for (single_conseq in conseq){
  single_rank<-which(conseq_tbl %in% single_conseq)
  if (length(single_rank)==0){ single_rank<-99 } # unknown consequence: rank last (which() returns integer(0), not NA, for no match)
  conseq_rank<-c(conseq_rank, single_rank)
}
return(conseq_rank)
}


i=1
for (single_gene_set in unique(gene_sets$set)){
  rel_genes<-gene_sets %>%
    filter(set==single_gene_set)

  tmp_annos<-comb_sources %>%
    filter(gene_id %in% rel_genes$gene_id)%>%
    filter(annot!="not_relevant")%>%
    mutate(gene_id=single_gene_set)
  if (nrow(tmp_annos)==0){
    next()	
  }

  # check for variants that are added to the gene set more than once, take the most severe conseq
  mltpl_occuring_tmp<-tmp_annos %>% 
    add_count(IDnew)%>%
    filter(n>1)%>%
    select(-n)

  if (nrow(mltpl_occuring_tmp)>0){
    tmp_annos<-tmp_annos %>%
      filter(!IDnew %in% mltpl_occuring_tmp$IDnew)

    for (IDnew_tmp in unique(mltpl_occuring_tmp$IDnew)){
      mltpl_occuring_single<-mltpl_occuring_tmp %>%
        filter(IDnew==IDnew_tmp)%>%
        mutate(eff_score=score_conseq(annot))%>%
        arrange(eff_score) 

      print(IDnew_tmp)
      print(mltpl_occuring_single)

      mltpl_occuring_single<-mltpl_occuring_single %>%
        select(-eff_score)

      tmp_annos<-rbind(tmp_annos, mltpl_occuring_single[1,])
    }
  }

  new_annos<-rbind(new_annos, tmp_annos) 
  tmp_IDs<-paste(tmp_annos$IDnew, collapse=",")
  tmp_set<-tibble(gene_id=single_gene_set, chr="1", pos=i, IDs=tmp_IDs)
  new_sets<-rbind(new_sets, tmp_set)
  i=i+1 
}



#### 

# make sets based on bed file

region_sets<-import(bed_path)
head(region_sets)
region_sets_names<-str_split(region_sets$name, ";", simplify=TRUE)
region_sets_names<-as_tibble(region_sets_names)
colnames(region_sets_names)<- c("set", "gene")
set_names<-unique(region_sets_names$set)

sources_w_AF<-comb_sources %>%
  filter(!is.na(AAF))%>%
  mutate(pos_start=BP,
         pos_end=BP+nchar(A1))

head(sources_w_AF)
chroms<-paste0("chr",sources_w_AF$CHR)
chroms<-str_replace(chroms, "chr23", "chrX")

annos_granges <- GRanges(
  seqnames = chroms,
  ranges = IRanges(start=sources_w_AF$pos_start, end = sources_w_AF$pos_end) )
head(annos_granges)
for (set_name in set_names){

  tmp_region<-region_sets[region_sets_names$set==set_name]
  tmp_intersect<-GenomicRanges::intersect(annos_granges, tmp_region)

  tmp_chr_pos_inReg <-sources_w_AF[annos_granges %in% tmp_intersect,]%>%
    distinct(ID, .keep_all=TRUE)%>%
    mutate(gene_id=set_name,
           annot="regionBased")
  if (nrow(tmp_chr_pos_inReg)==0){
    next()
  }

  new_annos<-rbind(new_annos, tmp_chr_pos_inReg, fill=TRUE)

  tmp_IDs<-paste(tmp_chr_pos_inReg$IDnew, collapse=",")
  tmp_set<-tibble(gene_id=set_name, chr="1", pos=i, IDs=tmp_IDs)
  new_sets<-rbind(new_sets, tmp_set)
  i=i+1 
}


####

new_annos_export<-new_annos %>%
 distinct(IDnew, gene_id, annot)

new_aaf<-comb_sources %>% distinct(IDnew, AAF) %>%
  filter(!is.na(AAF))

relevant_variants<-unique(new_annos$ID) # original variant IDs, used downstream for plink --extract (new_annos_export only keeps IDnew)

write_delim(x=new_annos_export,
            file = new_anno_path,
            col_names = FALSE )


write_delim(x=new_aaf,
            file = new_aafs_path,
            col_names = FALSE )

write_delim(x=new_sets,
            file = new_sets_path,
            col_names = FALSE )

write(x = relevant_variants,
      file=relevant_variants_path)

bim_new_id<-bim_new_id%>%
  select(CHRnew,IDnew,cM,BPnew,A1,A2)%>%
  relocate(CHRnew,IDnew,cM,BPnew,A1,A2)

write_tsv(x=bim_new_id,
            file = new_bim_path,
            col_names = FALSE )
library(data.table)
vars<-fread(snakemake@input[[1]], fill=TRUE, header = TRUE, na.strings = ".")
# vars<-fread("results/for_RVAS/annotated_split_vep_1.tsv", fill=TRUE, header = TRUE, na.strings = ".")

vars<-vars[, c("chrom","pos","ref","alt"):=tstrsplit(ID, ":", fixed=TRUE)]

###### MASKS ######
vars<-vars[BIOTYPE == "protein_coding"]

vars<-vars[,REVEL_score:=as.numeric(REVEL_score)]


vars<-vars[, ':=' (revel03=!is.na(REVEL_score) & REVEL_score>0.3, 
                   revel05=!is.na(REVEL_score) & REVEL_score>0.5, 
                   revel07=!is.na(REVEL_score) & REVEL_score>0.7,
                   missense=like(Consequence, "missense_variant"),
                   syn=like(Consequence, "synonymous_variant"),
                   pLOF=(IMPACT=="HIGH"),
                   UTR5=like(Consequence, "5_prime_UTR_variant"),
                   UTR3=like(Consequence, "3_prime_UTR_variant"),
		   promoter=like(Consequence,"upstream_gene_variant") & TSSDistance <= 1000,
                   enhancer=like(Consequence, "upstream_gene_variant")& TSSDistance <= 50000 & !is.na(DHS),
                   CADD10= !(is.na(CADD_PHRED) | CADD_PHRED<10) ) ]

vars<-vars[, ':=' (revel03_mis=(revel03 & missense),
                   revel05_mis=(revel05 & missense), 
                   revel07_mis=(revel07 & missense), 
                   moderate_non_missense=(!missense & (IMPACT=="MODERATE")))]

vars<-vars[, by=c("ID", "Gene"), mask_anno:=ifelse(sum(pLOF)>0,"pLoF", 
                                            ifelse(sum(revel07_mis)>0, "missense.revel07",
                                            ifelse(sum(revel05_mis)>0, "missense.revel05",
                                            ifelse(sum(revel03_mis)>0, "missense.revel03",
                                            ifelse(sum(missense | moderate_non_missense)>0, "moderate",
                                            ifelse(sum(syn)>0, "synonymous", 
                                            ifelse(sum(UTR5)>0 & sum(CADD10)>0, "UTR5_CADD", 
                                            ifelse(sum(UTR5)>0, "UTR5",
                                            ifelse(sum(UTR3)>0 & sum(CADD10)>0, "UTR3_CADD",
                                            ifelse(sum(UTR3)>0, "UTR3", 
                                            ifelse(sum(promoter)>0 & sum(CADD10)>0, "promoter_CADD", 
                                            ifelse(sum(promoter)>0, "promoter", 
                                            ifelse(sum(enhancer)>0 & sum(CADD10)>0, "enhancer_CADD", 
                                            ifelse(sum(enhancer)>0, "enhancer",
                                                   "not_relevant")
                                                                        ) ) ) ) ) ) ) ) ) ) ) ) )
]

anno_list<-rbindlist(list(vars[, c("ID", "Gene", "mask_anno")]), use.names = FALSE)

###### AF ######
vars=vars[, maxAF:=pmax(as.numeric(gnomAD_ex_AF), as.numeric(gnomAD_ge_AF), 0,
                        na.rm = TRUE)]

vars=vars[, minAF:=pmin(as.numeric(gnomAD_ex_AF), as.numeric(gnomAD_ge_AF), 1,
                        na.rm = TRUE)]

vars=vars[, maxAC:=pmax(as.numeric(gnomAD_ex_AC), as.numeric(gnomAD_ge_AC), 0, na.rm = TRUE)]

vars=vars[, maxHom:=pmax(as.numeric(gnomAD_ex_nhomalt), as.numeric(gnomAD_ge_nhomalt), 0, na.rm = TRUE)]


#vars=vars[,AF_category:=ifelse( (maxAF < 0.005) & (maxAC > 9 | maxHom > 0 ), 0.005, maxAF) ]
vars=vars[,AF_category:=maxAF]

###### SET LIST #######
collapsed_vars<-vars[,concats:=paste(ID, collapse=","), by=Gene]
collapsed_vars<-unique(collapsed_vars[,c("Gene","chrom","concats")])
collapsed_vars<-collapsed_vars[,pos:=1:nrow(collapsed_vars)]
collapsed_vars<-collapsed_vars[Gene!="" & !is.na(Gene), ]


fwrite(unique(anno_list), file = snakemake@output[[1]], col.names = FALSE, sep=" ")
fwrite(unique(vars[, c("ID", "AF_category")]), file = snakemake@output[[2]], col.names = FALSE, sep=" ")
fwrite((collapsed_vars[,c("Gene","chrom","pos","concats")]), file = snakemake@output[[3]], col.names = FALSE, sep=" ")

fwrite(vars[, -"concats"], file = snakemake@output[[4]], col.names = TRUE, sep="\t")
library(tidyverse)

#cov<-read_tsv("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline/results/regenie_pheno/cov_A2_EUR")
#pheno<-read_tsv("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline/results/regenie_pheno/pheno_EUR_A2")

cov<-read_tsv(snakemake@input[[2]])%>%
  select(-FID)
pheno<-read_tsv(snakemake@input[[1]])%>%
  select(-FID)

pheno_cov<-pheno %>%
 left_join(cov, by = "IID")

write_tsv(x=pheno_cov,file=snakemake@output[[1]])

males<-pheno_cov %>% # sex_for_fam comes from the covariate file, so filter the joined table
  filter(sex_for_fam==1) %>%
  select(IID)

write_tsv(x=males, file=snakemake@output[[2]], col_names=FALSE)
import pandas as pd
import pandas_plink
from xarray import DataArray
import shutil
import dask.array

# %%
mask_def_file = snakemake.input.regenie_mask_def
aaf_file = snakemake.input.data_anno_aaf_file
anno_file = snakemake.input.regenie_anno_file

af_cutoffs = list(map(float, snakemake.params.af_cutoffs))
plink_in_prefix = snakemake.params.plink_in_prefix

out_prefix = snakemake.params.plink_out_folder
set_file = snakemake.output.setFile_rvtest

debug_joined_out = snakemake.output.debug_joined_out
#%%

# read plink with pandas_plink
(bim, fam, bed) = pandas_plink.read_plink(plink_in_prefix, verbose=False)
bim.cm = bim.cm.astype(int)
#%%
# Read regenie aaf_file, anno_file, mask_def
aafs = pd.read_csv(aaf_file, sep=' ', names=['ID', 'aaf'])

mask_def = pd.read_csv(mask_def_file, sep=' ', names=['mask', "annots"])
mask_def.annots = mask_def.annots.map(lambda a: a.split(','))
mask_def = mask_def.set_index("mask")

anno = pd.read_csv(anno_file, sep=' ', names=['ID', 'gene', 'annot'])
anno = anno.dropna()
# %%
# Create M0-M4 True/False columns in table
for mask, annots in mask_def.annots.items():
    anno[mask] = anno.annot.isin(annots)

# Join AF info into table
anno_aaf = anno.join(aafs.set_index("ID"), on="ID", how="outer")

# %%

# join bim (variant table) with regenie anno
joined = bim.join(anno_aaf.set_index("ID"), on="snp", how="inner")
joined = joined.sort_values(by=['gene', 'pos'])

joined.to_csv(debug_joined_out, sep="\t")

#%%

for mask in mask_def.index.unique():
    for af_cutoff in af_cutoffs:
        mask_out_prefix = f"{out_prefix}{mask}_{af_cutoff}"
        # subset variant table
        subset = joined[joined[mask] & (joined["aaf"] < af_cutoff)]
        # export genotype data (subset["i"] contains positions of variants in table)
        pandas_plink.write_plink1_bin(
            DataArray(bed[subset["i"]], dims=["variant", "sample"]),
            bed=mask_out_prefix+".bed"
            )
        # write variant subset to bim file
        subset[["gene","snp","cm","pos","a0","a1"]].to_csv(mask_out_prefix+".bim", sep=" ", header=False, index=False)
        # reuse original fam file
        shutil.copy(plink_in_prefix + ".fam", mask_out_prefix +".fam")

#%%

# create setfile, each gene has its own "chromosome"/contig
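# the region end uses the global maximum position, which safely covers every gene-contig (each contig only carries its own variants)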
set_str = ""
for gene in joined.gene.unique():
    set_str += f"{gene}\t{gene}:0-{joined.pos.max() + 100}\n"

with open(set_file, "wt") as f:
    f.write(set_str)
library(tidyverse)

check_S_path<-snakemake@input[[1]]
plot_out<-snakemake@output[[1]]


check_S<-read.table(check_S_path, header = T)
F_plot<-ggplot(check_S, aes(x=F, fill=PEDSEX==1 | PEDSEX==0))+
  geom_histogram(bins=100)

ggsave(filename = plot_out, plot = F_plot)
ggsave(filename = paste0(plot_out, "zoomed.jpg"), plot = F_plot + ylim(c(0,100)) )
packs <- c("qqman","optparse","data.table","R.utils")


for (p in packs) {
  if( !require(p, character.only = T)) {
    print(p)
    install.packages( p,  repos = c(CRAN = "http://cran.r-project.org") )
  }
}

option_list = list(
  make_option(c("-f", "--file"), type="character", default=NULL,
              help="dataset file name", metavar="character"),
  make_option(c("-o", "--out"), type="character",
              help="output file name [default= %default]", metavar="character"),
  make_option(c("-c","--chrcol"), type="character", default="CHR",
              help="chromosome column [default= %default]", metavar="character"),
  make_option(c("-p","--pval_col"), type="character", default="P",
              help="pvalue column [default= %default]. This can be a comma separated list and plots will be generated for each of these", metavar="character"),
  make_option(c("-b","--bp_col"), type="character", default="BP",
              help="bp column [default= %default]", metavar="character"),
  make_option(c("-l","--loglog_pval"), type="integer", default=10,
              help="-log10 p-val threshold for using log-log scale in manhattan plot [default= %default]", metavar="integer"),
  make_option(c("-i","--id_col"), type="character", default="ID",
              help="-ID_col [default= %default]", metavar="character"),
  make_option(c("-g","--log"), type="logical", default="FALSE",
              help="-log [default= %default]", metavar="logical"),
  make_option(c("-m","--minrep_col"), type="character",
              help="if given then chr:bp:ref:alt identifier assumed and chr and bp are read from there [default= %default]", metavar="character")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser, positional_arguments=0);

# file="../GWAS/REGENIE_OUT_STEP2_chr1_pheno.regenie"
# bp_col="GENPOS"
# chr_col="CHROM"
# pcols =c("LOG10P")

file <- opt$options$file
print(paste("reading file:", file))

data <- fread(file, header=T)

options(bitmapType='cairo')

print(str(opt))
bp_col <- opt$options$bp_col
chr_col <- opt$options$chrcol
id_col <- opt$options$id_col

print(summary(data))
print( summary( data[[chr_col]] ) )
#colnames(data) <- toupper( colnames(data) )

pcols <- unlist(strsplit(opt$options$pval_col,","))

output_prefix=file

if( !is.null(opt$options$out)) {
  output_prefix=opt$options$out
}


if(! is.null(opt$options$minrep_col ) ) {
    print("getting BP and CHR from minrepid")
    split <- strsplit(as.character(data[[opt$options$minrep_col]]), ":")
    data[[bp_col]] <- unlist( lapply( split, function(x) as.numeric(x[2]) ))
    data[[chr_col]] <- unlist( lapply( split, function(x) x[1] ))
}

print(append(pcols,c(bp_col,chr_col)))

if( any( ! append(pcols,c(bp_col,chr_col)) %in% colnames(data)   )) {
  stop( paste0("Not all required columns exist in the data: ", paste(pcols,sep=",", collapse=""),",", bp_col, ",",chr_col,  collapse="" ))
}


print(summary(as.factor(data[[chr_col]])))

data[[chr_col]] <- gsub("chr","",data[[chr_col]])
data[[chr_col]] <- gsub("X|chrX","23",data[[chr_col]])
data[[chr_col]] <- gsub("Y|chrY","24",data[[chr_col]])
data[[chr_col]] <- gsub("MT|chrMT|M|chrM","25",data[[chr_col]])

data[[chr_col]] <- as.numeric(data[[chr_col]])
data <- data[ !is.na(data[[chr_col]]) ]

quants <- c(0.7,0.5,0.456,0.1,0.01, 0.001)



for( pcol in pcols) {
  if (opt$options$log==TRUE){
	data[[pcol]]<- 10^(-data[[pcol]])
  }
  subdata <- data[ !is.na(data[[pcol]]) & is.numeric( data[[pcol]]  ) ]

  # genomic inflation (lambda) at several quantiles: observed chi-square quantiles / expected chi-square quantiles
  lambda  <- round(  quantile(  (qchisq(1-subdata[[pcol]], 1) ), probs=quants ) / qchisq(quants,1), 3)
  png( paste(output_prefix,"_", pcol ,"_qqplot.png", sep="" ))
  qq(subdata[[pcol]], main=paste("\nlambda ", quants, ": ", lambda, sep="" ) )
  dev.off()
  sink( paste(output_prefix,"_",  pcol ,"_qquantiles.txt", sep="" ) )
  cat( paste( quants, ":", lambda, sep=""))
  sink()

  print("subsetting p-vals < 0.01 for manhattan...")

  subdata <- subdata[ subdata[[pcol]]<0.01 & subdata[[pcol]]>0 ]
  print( paste0("Plotting manhattan with ", nrow(subdata), " variants") )
  print( summary(subdata[[pcol]] ))
  png( paste(output_prefix,"_",pcol,"_manhattan.png", sep=""), width=1000, height=400)
  logs <- -log10(subdata[[pcol]])

  manhattan( subdata , chr=chr_col, bp=bp_col, p=pcol,snp=id_col, ylim=c( 2,max(logs)+1)  )
  dev.off()

  print("!!!!!!!!!!!!!!!!!!!!!!!")

  print("plotting log-log manhattan")
  loglog_p <- opt$options$loglog_pval
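  # above the threshold, compress the y-axis: y' = loglog_p * log10(y) / log10(loglog_p)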
  logs <- ifelse(logs < loglog_p, logs, loglog_p * log10(logs) / log10(loglog_p))
  subdata[["p_scaled"]] <- 10^(-logs)
  tick_pos <- round(seq(1, max(logs), length.out=round(max(logs))))
  tick_lab <- sapply(tick_pos, function(pos) { round(ifelse(pos < loglog_p, pos, loglog_p^(pos/loglog_p))) })
  png( paste(output_prefix,"_",pcol,"_manhattan_loglog.png", sep=""), width=1000, height=400)
  manhattan( subdata, chr=chr_col, bp=bp_col, p="p_scaled", snp=id_col, ylim=c( 2,max(logs)+1), yaxt="n")
  axis(2, at = tick_pos, labels=tick_lab, las=2)
  dev.off()
}
shell:
	"""
	#wget -nc ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	#gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	bcftools filter -r chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX \
	  {input.bcf} -Ou | \
	  bcftools norm --check-ref w -f {input.fasta} -Ou | \
	  bcftools annotate --set-id '%CHROM:%POS:%REF:%FIRST_ALT' -Oz > {output.vcf_gz}
	tabix -p vcf {output.vcf_gz}
	"""
shell:
	"""
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/vcf2hail.py $(pwd)/{input.vcf} \
	$(pwd)/{output} \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{input.individual_list}"

	if [ {config[cluster]} = "yes" ]; then
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi
	"""
shell:
	"""	
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/primaryQC.py \
	$(pwd)/{input.mt} \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{output.sample_qc_file} \
	$(pwd)/{params.out_plink}"

	if [ {config[cluster]} = "yes" ]; then
	queue="medium"
	export TMPDIR=$(pwd)/{params.tmp_dir}
	hours_to_run=12
	DRIVE_MEM=78
	SPARK_LOCAL_DIRS=$(pwd)/{params.tmp_dir}
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi

	rm -rf {params.tmp_dir}*
	"""
shell:
	"""
	echo $(pwd)

	cp workflow/scripts/plot_sampleQC.R {params.path_output}
	cp {input.xl_file} {params.path_output}

	cd {params.path_output}

	Rscript -e 'library(rmarkdown); rmarkdown::render("plot_sampleQC.R", "html_document")'
	rm *.R
	"""
shell:
	"""
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/MakeVarSets.py \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{input.MT} \
	$(pwd)/{input.input_vcf} \
	$(pwd)/{input.individual_list} \
	$(pwd)/{params.out_trunk} \
	$(pwd)/{params.out_Commontrunk}"

	if [ {config[cluster]} = "yes" ]; then
	export TMPDIR=$(pwd)/{params.tmp_dir}
	worker_nodes_excluded={config[worker_nodes_excluded]}
	hours_to_run=12
	DRIVE_MEM=148
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi
	"""
shell:
	"""
	plink2 \
	--bfile {params.in_plink} \
	--maf 0.01 \
	--exclude bed0 {input.bed} \
	--make-bed \
	--out {params.tmp_plink}		

	plink \
	--bfile {params.tmp_plink} \
	--indep-pairwise 1000kb 50 0.2 \
	--out {params.out_plink}

	plink \
	--bfile  {params.in_plink} \
	--extract {params.out_plink}.prune.in \
	--split-x hg38 no-fail \
	--make-bed \
	--out {params.out_plink}

	rm {params.tmp_plink}*
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--filter-females \
	--hwe 1e-3 midp include-nonctrl \
	--make-just-bim \
	--out {params.hwe_vars}

	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--extract {output.hwe_vars} \
	--check-sex 0.5 0.8 \
	--out {params.out_plink}

	"""
script: "scripts/plotFcheckSex.R"
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--chr 1-22 \
	--missing \
	--out {params.out_plink}
	"""
shell:
	"""
	mkdir -p {params.out_folder}
	cd {params.out_folder}
	wget -nc ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

	#reference genome  (GRCh38)
	#wget -nc ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	#gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	"""
shell:
	"""
	mkdir -p {params.folder_cont}
	cd {params.folder_cont}
	prefix="ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.chr"
	suffix=".shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz"
	wget -nc "$prefix""{wildcards.contig}""$suffix" "$prefix""{wildcards.contig}""$suffix".tbi
	"""
shell:
	"""
	if bcftools view -q 0.05:minor {input.vcf} | \
	bcftools norm -m-any --check-ref w -f "{input.fasta}" | \
	bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
	bcftools norm -Ob --rm-dup both \
	> {output.bcf} ; then
	echo "no error"
	fi

	plink --noweb \
	--bcf {output.bcf} \
	--keep-allele-order \
	--vcf-idspace-to _ \
	--allow-extra-chr 0 \
	--split-x b38 no-fail \
	--make-bed \
	--out {params.bed1}

	plink --noweb \
	--bfile {params.bed1} \
	--extract {input.hailbim} \
	--maf 0.10 --indep 50 5 1.5 \
	--make-bed \
	--out {params.bed2}
	"""
shell:
	"""
	echo {input._1000G_data} | tr " " "\\n" | sed 's/\.bed//g' > {output.merge_list}
	plink --merge-list {output.merge_list} --out results/1000G/Merged
	awk '{{ print $2 }}' results/1000G/Merged.bim > results/1000G/MergeVariants.txt

	plink --bfile {params.hailplink} \
	 --extract results/1000G/MergeVariants.txt \
	 --make-bed \
	 --out results/1000G/hail_for_ancestry

	printf "results/1000G/Merged\\nresults/1000G/hail_for_ancestry" > results/1000G/ForMergeFull.list

	mkdir -p results/PCA
	plink --merge-list results/1000G/ForMergeFull.list --out results/PCA/MergeFullForPCA

	awk '{{ print $1,$2 }}' results/1000G/Merged.fam | awk '$(NF+1) = "1000G"' > results/PCA/clusters.txt
	awk '{{ print $1,$2 }}' results/1000G/hail_for_ancestry.fam | awk '$(NF+1) = "Cohort"' >> results/PCA/clusters.txt

	plink --bfile results/PCA/MergeFullForPCA \
	 --pca-cluster-names 1000G \
	 --pca 20 \
	 --out {params.pca_prefix} \
	 --within results/PCA/clusters.txt
	"""
shell:
	"""
	cp -f workflow/scripts/populations_PCA.R results/PCA/populations_PCA.R
	cd results/PCA/
	Rscript -e 'library(rmarkdown); rmarkdown::render("populations_PCA.R", "html_document")'
	"""
shell:
	"""
	king \
	-b {params.plink_in}.bed \
	--kinship \
	--prefix {params.prefix}
	"""
shell:
	"""
	cut -f1 {input.list} > {params.tmp_id}
	cat {input.fam} | grep -f {params.tmp_id} > {params.tmp_fam}

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--keep {params.tmp_fam} \
	--make-bed \
	--out {params.out_prefix}
	"""
shell:
	"""
	mkdir -p {params.path} 
	Rscript workflow/scripts/generate_regenie_pheno.R {input.individual_list} {input.fam} {wildcards.pheno} {output.fam} {output.pheno}

	cp -f {input.bim} {output.bim}
	cp -f {input.bed} {output.bed}
	"""
shell:
	"""
	mkdir -p {params.pathQC}
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--filter-cases \
	--missing \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcCAS

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--filter-controls \
	--missing \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcCON

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--hardy \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcALL

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--hardy \
	--chr 23 \
	--filter-females \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcALL_female

	### R session
	wd=$(pwd)
	cp workflow/scripts/additional_QC_cohort.R {params.pathQC}/
	cd {params.pathQC}
	Rscript -e 'library(rmarkdown); rmarkdown::render("additional_QC_cohort.R", "html_document")'
	cd $wd

	# 1e. final QC (4675116)
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--exclude {params.pathQC}/missingness_hardy_weinberg_filter.txt \
	--make-bed \
	--out {params.out_prefix}

	rm -rf {params.pathQC}
	"""
shell:
	"""
	mkdir -p {params.folderPCA}

	#common
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--chr 1-22 \
	--indep-pairwise 50 5 0.05 \
	--keep-allele-order \
	--maf 0.01 \
	--out "{params.pathinterim}"

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--extract "{params.pathinterim}.prune.in" \
	--pca 10 \
	--out "{params.out_prefix}"

	rm {params.pathinterim}*
	"""
shell:
	"""
	Rscript workflow/scripts/generate_covar_file.R {input.individual_list} {input.PCA_cov} {output}
	"""
shell:
	"Rscript workflow/scripts/compile_phenotype_plots.R {input} {params.out_dir}"
shell:
	"""
	regenie \
	  --step 2 \
	  --minMAC 5 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --bt \
	  --ignore-pred \
	  --write-samples \
	  --bsize 5000 \
	  --out {params.regenie_step2} \
	  --af-cc \
	  --firth \
	  --firth-se \
	  --approx \
	  --gz
	  """
shell:
	"""
	regenie \
	  --step 2 \
	  --minMAC 5 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --condition-list {input.conditional_list} \
	  --bt \
	  --ignore-pred \
	  --write-samples \
	  --bsize 5000 \
	  --out {params.regenie_step2} \
	  --af-cc \
	  --firth \
	  --firth-se \
	  --approx \
	  --gz
	  """
shell:
	"""
	plink2 \
	--bim {input.bim} \
	--bed {input.bed} \
	--fam {input.fam} \
	--glm no-x-sex hide-covar log10 \
	--covar {input.cov} \
	--covar-variance-standardize \
	--out {params.plink_out} \
	--mac 5
	  """
shell:
	"""
	Rscript workflow/scripts/qqplots_GWAS.R -f {input} -o {input} -c CHROM -p LOG10P -b GENPOS --log TRUE -i ID
	"""
script:
	"scripts/add_PRS_to_COV.R"
script:
	"scripts/gene_sets.R"
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--extract {input.relevant_vars} \
	--make-bed \
	--out {params.plink_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--max-maf {wildcards.AF} \
	--make-bed \
	--out {params.plink_out}
	"""
shell:
	"""
	export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 2 \
	  --minMAC 1 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --aaf-file {input.anno_aaf} \
	  --anno-file {input.anno_csq} \
	  --mask-def {input.mask_def} \
	  --set-list {input.var_sets} \
	  --bed {params.plink_file} \
	  --aaf-bins {wildcards.AF} \
	  --bt \
	  --write-samples \
	  --ignore-pred \
	  --out {params.regenie_step2} \
	  {params.test_params} \
	  --af-cc \
	  --maxstep-null 2 \
	  --maxiter-null 100000 \
	  --gz

	  #--check-burden-files \
	  #--strict-check-burden \
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.mask_bim} \
	--bed {input.mask_bed} \
	--model fisher \
	--out {params.plink_out}
	"""
script:
	"scripts/qq_plots_RVAS.R"
shell:
	"""
	plink \
	--fam {input.famCommon} \
	--bim {input.bimCommon} \
	--bed {input.bedCommon} \
	--score {input.PRS} 1 2 3 sum \
	--out {params.out_pref}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--chr X \
	--from-bp 12767072 \
	--to-bp 12967072 \
	--make-bed \
	--out {params.out_plink}
	"""
shell:
	"""
	export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 2 \
	  --minMAC 1 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --bt \
	  --firth \
	  --firth-se \
	  --write-samples \
	  --ignore-pred \
	  --out {params.regenie_step2} \
	  {params.rez} \
	  {params.sex_param} \
	  --af-cc \
	  --bsize 500 \
	  --gz
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--model fisher \
	--out {params.plink_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--filter-cases \
	--freqx \
	--out {params.out_case}

	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--filter-controls \
	--freqx \
	--out {params.out_ctrl}		
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	{params.maf} \
	--homozyg \
	--out {params.ROH_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--het gz \
	--ibc \
	--out {params.ROH_out}
	"""
shell: 
	"""
	cd {params.out_folder}
	wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr{wildcards.contig}.vcf.bgz
	wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr{wildcards.contig}.vcf.bgz.tbi
	"""
shell:
	"""
	cd {params.out_dir}
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz.tbi
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.snv.tsv.gz
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.snv.tsv.gz.tbi
	"""
shell:
	"""
	keep_string="^INFO/AC,INFO/AN,INFO/AF,INFO/popmax,INFO/faf95_popmax,INFO/AC_oth,INFO/AN_oth,INFO/AF_oth,INFO/nhomalt_oth,INFO/AC_ami,INFO/AN_ami,INFO/AF_ami,INFO/nhomalt_ami,INFO/AC_sas,INFO/AN_sas,INFO/AF_sas,INFO/nhomalt_sas,INFO/AC_fin,INFO/AN_fin,INFO/AF_fin,INFO/nhomalt_fin,INFO/AC_eas,INFO/AN_eas,INFO/AF_eas,INFO/nhomalt_eas,INFO/AC_amr,INFO/AN_amr,INFO/AF_amr,INFO/nhomalt_amr,INFO/AC_afr,INFO/AN_afr,INFO/AF_afr,INFO/nhomalt_afr,INFO/AC_raw,INFO/AN_raw,INFO/AF_raw,INFO/nhomalt_raw,INFO/AC_mid,INFO/AN_mid,INFO/AF_mid,INFO/nhomalt_mid,INFO/nhomalt,INFO/AC_asj,INFO/AN_asj,INFO/AF_asj,INFO/nhomalt_asj,INFO/AC_nfe,INFO/AN_nfe,INFO/AF_nfe,INFO/nhomalt_nfe,INFO/AC_popmax,INFO/AN_popmax,INFO/AF_popmax,INFO/nhomalt_popmax"
	bcftools annotate -x "$keep_string" {input} -Oz -o {output}

	tabix -pvcf {output}
	#rm {input}
	"""
shell:
	"""
	bcftools concat -Oz {input} > {output.reduced}
	tabix -pvcf {output.reduced}
	#bcftools view output.reduced -i "INFO/AF>0.01" -Oz -o output.above_one_perc
	#tabix -pvcf output.above_one_perc
	"""
shell:
	"""
	# software required: bedops, bedtools
	mkdir -p {params.out_folder}
	cd {params.out_folder}

	# get bedtools reference file
	wget -Nc https://raw.githubusercontent.com/arq5x/bedtools2/master/genomes/human.hg38.genome -O bedtools_hg38_ref_file.genome

	# GENCODE
	wget -Nc ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gff3.gz

	# exons
	zcat gencode.v35.annotation.gff3.gz | \
	 awk '$3 == "exon"' - | \
	 sort -k1,1 -k4,4n -V | \
	 bedtools merge -i stdin | \
	 cut -f1-3 | gzip > gencode35Exons.bed.gz

	# add padding to gencode file
	bedtools slop -b 20 \
	-i gencode35Exons.bed.gz \
	-g bedtools_hg38_ref_file.genome | \
	gzip > gencode35ExonsPadded.bed.gz

	# promoters
	zcat gencode.v35.annotation.gff3.gz | \
	grep -P "transcript\t" | grep -P "\t\+\t" | \
	 awk '{{OFS="\t"}}{{print $1,$4-1001,$4-1}}' > plus_strand_promotor.bed

	zcat gencode.v35.annotation.gff3.gz | \
	grep -P "transcript\t" | grep -P "\t\-\t" | \
	awk '{{OFS="\t"}}{{print $1,$5-1,$5+999}}' > minus_strand_promotor.bed

	cat plus_strand_promotor.bed minus_strand_promotor.bed | \
	sort -k1,1 -k3,3n -V | \
	awk '{{OFS="\t"}}{{if ($2<0) print $1,"0",$3; else print $0}}' | \
	gzip > promoters.bed.gz


	# ClinVar
	wget -Nc ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
	zcat clinvar.vcf.gz | grep "^#"  > clinvar_pathogenic.vcf
	zcat clinvar.vcf.gz | grep -v "^#" | egrep "pathogenic|Pathogenic" | grep -v "uncertain" >> clinvar_pathogenic.vcf
	bedtools merge -i clinvar_pathogenic.vcf | cut -f1-3 | awk '{{OFS=""}}{{print "chr",$0}}' | gzip > clinvar_pathogenic.bed.gz

	# SpliceAI
	# filter splice AI for scores >0.5 and convert to BED
	zcat {params.anno_sources}'/spliceai_scores_sorted.hg38.vcf.gz' | \
	egrep "0\.[5-9]|1\.0" | \
	awk 'BEGIN {{OFS="\t"}} {{ print $1, $2-1, $2 }}' | \
	gzip > \
	splice_ai_positions.bed.gz


	# merge and sort individual bed files
	zcat gencode35ExonsPadded.bed.gz clinvar_pathogenic.bed.gz splice_ai_positions.bed.gz promoters.bed.gz ../../{input.DHS} | \
	cat - ../../{input.GenRa} | \
	cut -f1-3 | \
	bedtools sort | \
	bedtools merge > "ExonsClinVarSpliceAI.bed"
	"""
shell:
	"""
	cat {input.bedfile} | grep "^chr{wildcards.contig}	" > {output.relevant_bed}

	bcftools view {input} -R {output.relevant_bed} -Ob | \
	bcftools sort -T $TMPDIR | \
	bcftools norm -m-any -Ob | \
	bcftools norm --remove-duplicates -Oz -o {output.vcf_prefiltered}
	"""
shell:
	"""

	work_dir=$(pwd)
	database_dir={params.anno_sources}

	echo $SLURM_NODELIST			
	echo $TMPDIR

	cd $TMPDIR
	#git clone https://github.com/ImperialCardioGenetics/UTRannotator.git
	#export PERL5LIB=$(pwd)/UTRannotator
	cp -v $database_dir'/Homo_sapiens_assembly38.fasta.bgz'* . #/dev/shm/ #1G, store to RAMdisk
	cp -v $database_dir'/gnomad.genomes.v3.1.2.sites.all_red.vcf.gz'* . # 9G
	#cp -v $database_dir'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz'* . # 86 G biggest file, do not copy
	mkdir -p dbNSFP41
	cp -v $database_dir'/dbNSFP41/dbNSFP4.1a_hg38.gz'* dbNSFP41/ #31G
	cp -v $database_dir'/dbNSFP41/dbNSFP4.1a.readme.txt' dbNSFP41/
	cp -v $database_dir'/spliceai_scores_sorted.hg38.vcf.gz'* . #0,6G
	cp -v $database_dir'/cnv_database_171_samples.vcf.gz'* .
	#cp -v $database_dir'/whole_genome_SNVs.tsv.gz'* . # copy CADD 81G
	#cp -v $database_dir'/gnomad.genomes.r3.0.snv.tsv.gz'* . # copy CADD 6G
	cp -rv $database_dir'/homo_sapiens' . # 15G
	wget -N ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz  
	wget -N ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi	

	cd $work_dir

	#export PERL5LIB=$database_dir'/UTRannotator/'
	#wget https://raw.githubusercontent.com/Ensembl/VEP_plugins/postreleasefix/101/TSSDistance.pm
	export PERL5LIB=$database_dir

	vep -v  \
	--cache \
	--offline \
	--force_overwrite \
	--fork 8 \
	--buffer {params.buffer} \
	--hgvs \
	--no_stats \
	--format vcf \
	--numbers \
	--nearest gene \
	--vcf \
	--distance 50000,1000 \
	--gene_phenotype \
	--pick_allele_gene \
	--af \
	--af_esp \
	--check_existing \
	-a GRCh38 \
	--dir_cache $TMPDIR \
	--fasta $TMPDIR'/Homo_sapiens_assembly38.fasta.bgz' \
	--custom $TMPDIR'/clinvar.vcf.gz',ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
	--custom $TMPDIR'/gnomad.genomes.v3.1.2.sites.all_red.vcf.gz',gnomAD_ge,vcf,exact,0,AF,AC,AN,nhomalt,popmax,AF_oth,AF_ami,AF_sas,AF_fin,AF_eas,AF_amr,AF_afr,nhomalt_afr,AF_raw,AF_mid,AF_asj,AF_nfe,AF_popmax \
	--custom $database_dir'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz',gnomAD_ex,vcf,exact,0,AF,AN,AC,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,nhomalt \
	--plugin dbNSFP,$TMPDIR'/dbNSFP41/dbNSFP4.1a_hg38.gz',Ensembl_proteinid,MutationTaster_pred,MutationTaster_score,LRT_pred,Polyphen2_HDIV_pred,Polyphen2_HVAR_pred,SIFT_pred,REVEL_score,REVEL_rankscore,phyloP100way_vertebrate,phyloP100way_vertebrate_rankscore,phastCons100way_vertebrate,phastCons100way_vertebrate_rankscore,BayesDel_noAF_score,BayesDel_noAF_pred \
	--plugin CADD,$database_dir'/whole_genome_SNVs.tsv.gz',$database_dir'/gnomad.genomes.r3.0.snv.tsv.gz' \
	--custom $TMPDIR'/spliceai_scores_sorted.hg38.vcf.gz',SpliceAI,vcf,exact,0,ALLELE,SYMBOL,DS_AG,DS_AL,DS_DG,DS_DL  \
	--custom {input.DHS},DHS,bed,exact,0 \
	--plugin StructuralVariantOverlap,file=$TMPDIR'/cnv_database_171_samples.vcf.gz' \
	--plugin TSSDistance \
	-i {input.vcf} \
	-o stdout | bgzip > {output}

	#--plugin UTRannotator,$TMPDIR'/UTRannotator/uORF_starts_ends_GRCh38_PUBLIC.txt' \
	#--custom $TMPDIR'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz',gnomAD_ex,vcf,exact,0,AF,AN,AC,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,nhomalt \

	#rm /dev/shm/Homo_sapiens_assembly38.fasta.bgz*

	"""
shell:
	"""
	function return_AF_string_w_genomes
	{{
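	# keep variants that are rare in gnomAD genomes (AF < $1 or AF absent), have at least LOW impact or a SpliceAI record, and fewer than 11 homozygotes; ClinVar/CLIN_SIG (likely) pathogenic entries are always kept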
	echo "((gnomAD_ge_AF < "$1" or not gnomAD_ge_AF) and ((IMPACT is HIGH) or (IMPACT is MODERATE) or (IMPACT is LOW) or (SpliceAI_SYMBOL)) and (gnomAD_ge_nhomalt < 11 or not gnomAD_ge_nhomalt)) or (ClinVar_CLNSIG and not (ClinVar_CLNSIG match benign or ClinVar_CLNSIG match Benign)) or (CLIN_SIG match pathogenic or CLIN_SIG match Pathogenic)"

	#echo "((gnomAD_ge_AF_AFR < "$1" or not gnomAD_ge_AF_AFR) and (gnomAD_ge_AF_AMR < "$1" or not gnomAD_ge_AF_AMR) and (gnomAD_ge_AF_EAS < "$1" or not gnomAD_ge_AF_EAS) and (gnomAD_ge_AF_FIN < "$1" or not gnomAD_ge_AF_FIN) and (gnomAD_ge_AF_NFE < "$1" or not gnomAD_ge_AF_NFE) and (gnomAD_ge_AF_SAS < "$1" or not gnomAD_ge_AF_SAS) and (AA_AF < "$1" or not AA_AF) and (EA_AF < "$1" or not EA_AF) and ((IMPACT is HIGH) or (IMPACT is MODERATE) or (IMPACT is LOW) or (SpliceAI_SYMBOL)) and (gnomAD_ex_nhomalt < 5 or not gnomAD_ex_nhomalt)) or (ClinVar_CLNSIG and not (ClinVar_CLNSIG match benign or ClinVar_CLNSIG match Benign)) or (CLIN_SIG match pathogenic or CLIN_SIG match Pathogenic)"
	}}

	filter_vep_str="$(return_AF_string_w_genomes 0.02)"
	echo "bcftools +fill-tags {input} -- -t AN,AC | filter_vep --format vcf --force_overwrite --filter \\""$filter_vep_str"\\" | bgzip > {output}" | bash
	"""
shell:
	"""

	zcat {input} | sed 's:;AN=:;AN_cohort=:g' | \
	sed 's:ID=AN,:ID=AN_cohort,:g' | \
	sed 's:;AC=:;AC_cohort=:g' | \
	sed 's:ID=AC,:ID=AC_cohort,:g' | \
	sed 's:||:|0|:g'| sed 's:||:|0|:g' | bgzip > {params.vep_renamed}

	split_string1='%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%QUAL\\t%AC_cohort\\t%AN_cohort\\t'
	split_string2='%CSQ\\t'
	split_string3='[%GT\\t%TGT\\t%AD\\t%GQ\\t]'

	header1=$(echo $split_string1 | sed 's:\\\\t:	:g')
	header2=$(bcftools +split-vep {params.vep_renamed} -l | cut -f2 | tr "\\n" "\\t")
	header3=$(bcftools query -u -H -f $split_string3'\\n' {params.vep_renamed} | head -n1 | cut -f2-5) ||  echo ""
	echo "sample	${{header1}}	$header2	$header3" | awk -v OFS="\\t" '$1=$1'  > {output}
	for sample in `bcftools query -l {params.vep_renamed}`; do
		bcftools view -c1 -s $sample {params.vep_renamed}  | \
		bcftools +split-vep -f $split_string1$split_string2$split_string3'\\n' -d -A tab | \
		awk -v var="$sample"  -F $'\t' 'BEGIN {{OFS = FS}} {{print var, $0 }}' >> {output}
	done
	#rm -f {params.vep_renamed}
	"""
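
A variant-level annotation table is written by listing the CSQ subfield names as a semicolon-separated header and appending one line per variant and consequence: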
shell:
	"""
	bcftools +split-vep -l {input} | cut -f2 | tr '\\n' ';' | awk 'BEGIN {{ FS = ";"}} ;{{ print "ID;"$0}}' > {output}
	bcftools +split-vep -d -f'%CHROM:%POS:%REF:%ALT;%CSQ\\n' -A ";" {input} >> {output}
	"""
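
The per-variant alternative allele frequencies (AAFs), presumably used for regenie's gene-based masks, are prepared by an R script: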
script:
	"scripts/make_aaf_anno.R"
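
The individual pieces are then concatenated: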
shell:
	"cat {input} > {output}"
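
For comparing cohort and gnomAD frequencies, a table of MAF, AN and gnomAD genome frequencies is extracted for all variants with at least one alternate allele, using the worst consequence per variant: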
shell:
	"""
	bcftools view --min-ac 1 {input} | \
	bcftools +fill-tags | \
	bcftools +split-vep -f "%CHROM\t%ID\t%MAF\t%AN\t%gnomAD_ge_AF\t%gnomAD_ge_AF_nfe\t%gnomAD_ge_AC\n" --select worst:any |
	gzip > {output}
	"""
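
Cohort allele frequencies and counts are computed with plink2; --freq writes an .afreq file and --freq counts an .acount file: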
shell:
        """
        plink2 --bfile {params.plink_in_prefix} --freq --out {params.freq_out_prefix}
        plink2 --bfile {params.plink_in_prefix} --freq counts --out {params.ac_out_prefix}
        """
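
Cohort frequencies and the annotation-derived AAFs are then merged: for every variant the larger of the two values is used, so a variant that is common in either source is treated as common downstream: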
run:
        import pandas as pd
        import numpy as np
        data_af = pd.read_csv(input.data_af_file, sep=r"\s+")
        data_af = data_af[["ID", "ALT_FREQS"]]
        anno_af = pd.read_csv(input.regenie_aaf_file, sep=" ", names=["snp", "aaf"])
        joined = data_af.join(anno_af.set_index("snp"), on="ID", how="inner")
        # use the larger of the cohort and the annotation frequency per variant
        joined["merged_aaf"] = joined[["ALT_FREQS", "aaf"]].max(axis=1)

        # AC-based sanity check: with 1220 samples, ALT_CTS > 2 already implies a
        # cohort AF above 0.001, so this floor never triggers at this sample size
        data_ac = pd.read_csv(input.data_ac_file, sep=r"\s+")
        data_ac = data_ac[["ID", "ALT_CTS"]]
        joined_AC = joined.join(data_ac.set_index("ID"), on="ID", how="inner")
        joined_AC["merged_aaf"] = np.where((joined_AC.ALT_CTS > 2) & (joined_AC.merged_aaf < 0.001), 0.005, joined_AC.merged_aaf)
        joined_AC.merged_aaf = joined_AC.merged_aaf.round(4)
        joined_AC[["ID", "merged_aaf"]].to_csv(output.data_anno_aaf_file, sep=" ", header=False, index=False, float_format='%.5E')
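
A minimal, self-contained sketch of the merging logic on invented toy values (column names as in the plink2 output):

	import pandas as pd

	data_af = pd.DataFrame({"ID": ["v1", "v2"], "ALT_FREQS": [0.0004, 0.0100]})
	anno_af = pd.DataFrame({"snp": ["v1", "v2"], "aaf": [0.0020, 0.0001]})
	joined = data_af.join(anno_af.set_index("snp"), on="ID", how="inner")
	joined["merged_aaf"] = joined[["ALT_FREQS", "aaf"]].max(axis=1)
	print(joined[["ID", "merged_aaf"]])  # v1 -> 0.002 (annotation wins), v2 -> 0.01 (cohort wins)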
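
The plink files are subset for rvtests by a Python script: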
script: "scripts/plink_subset_for_rvtest.py"
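
The subset is exported to a bgzipped, indexed VCF (id-paste=iid keeps only the within-family IDs as sample names):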
shell:
        """
        plink2 --bfile {params.plink_in_prefix} --export vcf bgz id-paste=iid --out {params.vcf_out_prefix} --allow-extra-chr
        bcftools index -f -t {params.vcf_out_prefix}.vcf.gz
        """
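
A phenotype file in fam layout is built for rvtests; because rvtests expects PLINK-style coding (1 = control, 2 = case), 1 is added to the 0/1 phenotype: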
run:
        import pandas as pd
        fam = pd.read_csv(input.fam, names=["fid", "iid", "fatid", "matid", "sex", "empty"], sep=' ')
        pheno = pd.read_csv(input.phenotypes, names=["fid", "iid", "pheno"], sep='\t', header=0).set_index("iid")
        out = fam.join(pheno.pheno, on="iid")
        # recode 0/1 to the PLINK-style 1/2 coding that rvtests expects
        out.pheno = out.pheno + 1
        out[["fid", "iid", "fatid", "matid", "sex", "pheno"]].to_csv(output.rvtest_pheno_file, sep="\t", header=False, index=False)
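
Gene-based association is tested with rvtests; here the SKAT kernel test is run on the variant sets defined in the setFile: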
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --kernel skat \
        --out {params.out_prefix}
        """
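
The same sets are also tested with the exact CMC burden test: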
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --burden exactCMC \
        --out {params.out_prefix}
        """
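
An identical SKAT call appears again in a separate rule, presumably for a different set definition (e.g. the gene-set based analyses); only the inputs and output prefix differ: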
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --kernel skat \
        --out {params.out_prefix}
        """
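
For the GECS analysis, plink files carrying the analysis phenotype in the fam are staged, the 'bfile' and 'outx' placeholders of a parameter template are filled in with sed, and the gecs binary (kept outside the repository under resources/.no_upload) is executed: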
shell:
        """
        cp -f {input.phenofam} {params.out_prefix}G.fam
        cp -f {input.bim} {params.out_prefix}G.bim
        cp -f {input.bed} {params.out_prefix}G.bed

        cp {input.temp} {params.out_prefix}.param

        sed -i "s:bfile:{params.out_prefix}G:g" {params.out_prefix}.param
        sed -i "s:outx:{params.out_prefix}:g" {params.out_prefix}.param

        resources/.no_upload/GECS/gecs {params.out_prefix}.param

        rm -f {params.out_prefix}G.b*
        """
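
ID-to-position lookups are extracted from a gzipped dbSNP table; the full table serves as position file and its first (space-delimited) column as the ID list: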
shell:
	"""
	zcat {input} > {output.pos}
	zcat {input} | cut -f1 -d" " > {output.ids}
	"""
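
The array genotypes are restricted to variants with a dbSNP ID, variant positions are updated from the dbSNP table, and the sample IDs get an 'ARRAY' suffix so they cannot collide with the WGS sample IDs when the two datasets are compared: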
shell:
	"""
	plink \
	 --bfile "{params.ar_pl_in}" \
	 --extract "{input.dbSNP_ids}" \
	 --make-bed \
	 --out {params.ar_pl_out}_tmp1

	plink \
	 --bfile {params.ar_pl_out}_tmp1 \
	 --update-map "{input.dbSNP_pos}" 3 1 \
	 --make-bed \
	 --out {params.ar_pl_out}

	awk '{{print $2"ARRAY",$2"ARRAY",$3,$4,$5,$6}}' {params.ar_pl_out}.fam > {params.ar_pl_out}_tmp.fam

	mv {params.ar_pl_out}_tmp.fam {params.ar_pl_out}.fam
	rm -f {params.ar_pl_out}_tmp*
	"""
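
On the WGS side, chromosomes 18-21 are converted from VCF to plink format, variant names are replaced with dbSNP IDs, the 'chr' prefix is stripped from the bim file, and the data are reduced to the variants present in the array dataset: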
shell:
	"""
	echo "start"

	plink \
	 --vcf {input.vcf} \
	 --chr 18,19,20,21 \
	 --vcf-half-call h \
	 --double-id \
	 --make-bed \
	 --out {params.plink_pref}_tmp

	plink \
	--bfile {params.plink_pref}_tmp \
	--update-name "{input.dbSNP_pos}" 1 4 \
	--make-just-bim \
	--out {params.plink_pref}_tmp2

	cat {params.plink_pref}_tmp2.bim | sed "s/chr//g" > {params.plink_pref}_tmp.bim


	plink \
	--fam {params.plink_pref}_tmp.fam \
	--bed {params.plink_pref}_tmp.bed \
	--bim {params.plink_pref}_tmp.bim \
	--extract {params.array_plink}.bim \
	--make-bed \
	--out {params.plink_pref}

	rm -f {params.plink_pref}_tmp*
	""" 
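
KING is run on the combined WGS and array bed files to flag duplicates and first-degree relatives and, in a second call, to estimate kinship coefficients; matching WGS/array pairs of the same individual should show up as duplicates: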
shell:
	"""
	#module load king

	king \
	-b {params.hl_pl}.bed,{params.ar_pl}.bed \
	--duplicate \
	--related \
	--degree 1 \
	--prefix {params.king_prefix}

	king \
	-b {params.hl_pl}.bed,{params.ar_pl}.bed \
	--kinship \
	--prefix {params.king_prefix}_kinship

	"""
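
regenie step 1 fits the leave-one-out whole-genome ridge regressions for the binary traits on the QC-filtered variant set: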
shell:
	"""
	#export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 1 \
	  --bed {params.plink_file} \
	  --covarFile {input.pathCov} \
	  --phenoFile {input.pathPheno} \
	  --bt \
	  --extract {input.filtered} \
	  --loocv \
	  --bsize 1000 \
	  --out {params.step1}
	#--extract input.pruned \
	 """
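
The phenotype/covariate table is assembled by an R script: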
script: "scripts/make_phenoCov.R"
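
The variant set for regenie step 1 is produced with plink: autosomal, common (MAF >= 1%), well-genotyped (<= 1% missing) ACGT SNVs passing a lenient HWE filter, with long-range LD regions excluded; the variant IDs are written out for --extract: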
shell:
	"""
	# common, high-quality variants for regenie step 1 (long-range LD regions excluded)
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--keep {input.keep_fam} \
	--maf 0.01 \
	--chr 1-22 \
	--geno 0.01 \
	--snps-only just-acgt \
	--hwe 1e-10 \
	--exclude range {input.longrange_LD} \
	--make-bed \
	--out {params.pruned_plink}

	cut -f2 {output.bim} > {output.var_ids}

	"""
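
In parallel, SAIGE is run. step1_fitNULLGLMM.R fits the null logistic mixed model per phenotype, with PC1-PC10, Age, Age2, age_sex and sex_for_fam as covariates and LOCO enabled: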
shell:
	"""
	step1_fitNULLGLMM.R \
	--plinkFile={params.plink_file} \
	--phenoFile={input.phenocov} \
	--phenoCol={wildcards.pheno} \
	--covarColList=PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,Age,Age2,age_sex,sex_for_fam \
	--sampleIDColinphenoFile=IID \
	--traitType=binary \
	--outputPrefix={params.step1} \
	--LOCO=TRUE \
	--IsOverwriteVarianceRatioFile=TRUE \
	--nThreads=4
	"""
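
Genotypes for SAIGE step 2 are exported to bgzipped VCFs, one genome-wide and one restricted to the X chromosome, both indexed with CSI: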
	shell:
		"""
		plink \
		--bfile {params.plink_in} \
		--recode vcf-iid \
		--out {params.vcfprefix}

		cat  {params.vcfprefix}.vcf | bgzip > {output.vcf}
		tabix --csi -pvcf {output.vcf}

		plink \
		--bfile {params.plink_in} \
		--recode vcf-iid \
		--chr X \
		--out {params.vcfprefix_justX}

		cat {params.vcfprefix_justX}.vcf | bgzip > {output.vcf_justX}
		tabix --csi -pvcf {output.vcf_justX}

		"""
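
step2_SPAtests.R then runs the single-variant score tests with saddlepoint approximation, using the null model and variance ratio from step 1 (minimum MAC 3; the LOCO setting is injected via params.loco):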
shell:
	"""

	step2_SPAtests.R \
	--vcfField=GT \
	--GMMATmodelFile={params.saige_step1}.rda \
	--varianceRatioFile={params.saige_step1}.varianceRatio.txt \
	--SAIGEOutputFile={output} \
	--numLinesOutput=1000 \
	--IsOutputNinCaseCtrl=TRUE \
	--IsOutputHetHomCountsinCaseCtrl=TRUE \
	--minMAC=3 \
	--IsOutputAFinCaseCtrl=TRUE \
	--vcfFile={params.in_vcf} {params.loco}	

	"""
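
Per-chunk SAIGE outputs are merged: the header line containing "CHR" is extracted once (the if-wrapper keeps a non-zero grep exit, e.g. from SIGPIPE after head, from aborting the shell), then all data lines are concatenated below it and gzipped: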
shell:
	"""
	if cat {input} | grep "CHR" | head -n1 > {params.header}
	then
	echo "don't worry"
	fi

	cat {input} | grep -v "CHR" | \
	cat {params.header} - |
	gzip > {output}
	"""
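
Relatedness filtering is done in hail: depending on the "cluster" setting in config/config.yaml, the Python script is either submitted to a Spark cluster spun up on top of slurm via spark_submit_command.sh (queue, runtime, driver memory and worker count are set here) or simply run with the local Python interpreter: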
shell:
	"""	
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/relatedness_filter.py $(pwd)/{input.GWAS_MT} $(pwd)/{params.tmp_dir} $(pwd)/{input.pheno_file} $(pwd)/{output.kinship_vals} $(pwd)/{output.to_remove}"

	if [ {config[cluster]} = "yes" ]; then
	queue="long"
	hours_to_run=168
	DRIVE_MEM=78
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=20
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi

	rm -rf {params.tmp_dir}*
	"""
Repository: https://github.com/Ax-Sch/DeCOI_WGS