Multi-Step Genomic Analysis Pipeline for DeCOI WGS Cohort


Pipeline for QC, GWAS and gene-based collapsing analyses within the DeCOI WGS cohort (1220 individuals). Rules for ROH analysis and gene-set-based analyses are also included. This repository serves documentation purposes and the pipeline will not run out of the box - e.g. adjustments are needed for the local compute environment, and not all input files (e.g. genotype files, phenotype files, annotation sources) are provided.

Setup

This pipeline was run on a compute cluster running Linux (CentOS Linux 7) and slurm 22.05.6. Miniconda3 was installed manually (https://docs.conda.io/en/latest/miniconda.html). The setup is quite complex due to the use of the slurm scheduler and of Apache Spark. It was done in the following way:

Two conda environments were manually created:

  • an environment running snakemake-7.3.7:
	conda create --name snakemake7 -c bioconda snakemake=7.3.7
	# To let snakemake submit jobs to slurm, a profile file with default settings for snakemake was used.
	# For documentation, you can find the file I used in "config/snakemake_slurm_profile/config.yaml".
	# You would need to modify it to have e.g. the correct default partition and account. Then move it to snakemake's config directory:
	mkdir -p ~/.config/snakemake/slurm
	cp config/snakemake_slurm_profile/config.yaml ~/.config/snakemake/slurm
	# Afterwards you should be able to execute "snakemake --profile slurm"; each rule that is executed is then submitted as an individual slurm job. Resource requirements can be specified in each snakemake rule (see file workflow/Snakefile).
  • an environment for running hail / Apache Spark: Note that this environment is created to set up an Apache Spark cluster on top of slurm, which can then be used by hail. This will probably need some testing. If running hail on a single node is enough, you can simply install hail via conda (e.g. run the first two commands from below and then "conda install -c bioconda hail"; see the sketch after this list). You would then set the variable cluster to "no" in config/config.yaml.
	conda env update --file env/spark_from_history.yaml
	conda activate spark
	mkdir -p ~/scratch/spark
	cd ~/scratch/spark
	wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
	tar zxvf spark-3.1.1-bin-hadoop3.2.tgz
	git clone https://github.com/hail-is/hail.git
	cd hail/hail
	make install-on-cluster HAIL_COMPILE_NATIVES=1 SCALA_VERSION=2.12.13 SPARK_VERSION=3.1.1
	pip install pyspark==3.1.1
	# Now manually modify the file "workflow/scripts/spark_submit_command.sh" as indicated within the file.
	# Also adjust the last 5 variables within config/config.yaml.
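
For reference, the single-node alternative mentioned above boils down to the following (a minimal sketch, assuming the bioconda channel is reachable; "cluster" is the variable in config/config.yaml):
	conda env update --file env/spark_from_history.yaml
	conda activate spark
	conda install -c bioconda hail
	# then, in config/config.yaml, set:
	#   cluster: "no"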

The pipeline could now be tested with a dry run (planned jobs are only printed, not executed):

conda activate snakemake7
snakemake -np

The whole pipeline was executed by using the file "run.sh"; this file would also need adjustments (as indicated in the file):

conda activate snakemake7
sbatch run.sh
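
For orientation, run.sh essentially wraps the main snakemake call in a slurm batch script. A hypothetical minimal version could look like this (the #SBATCH values are placeholders that need local adjustment):

#!/bin/bash
#SBATCH --job-name=snakemake_main
#SBATCH --time=7-00:00:00
snakemake --profile slurm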

Here are the DAGs of some analyses that were conducted - the names of the rules are indicated.

Population-PCA:

PCA Pipeline

GWAS:

GWAS Pipeline

RVAS:

RVAS Pipeline

Inputs

Not all input files that were used for the pipeline are included in this repository, due to privacy issues or file size. The following files are missing:

  • genotype data / cohort bcf, which was produced by glnexus. The path ("input_vcf") can be set in the file config/config.yaml.

  • fasta file of the reference genome (here hg38). The path ("fasta") can be set in the file config/config.yaml.

  • Data of the 1000 Genomes Project; you only need to provide a folder - there is a snakemake rule ("download_1000G_genotypes") to download the data. The path ("location_1000G") can be set in the file config/config.yaml.

  • Phenotype files are not included due to privacy issues; see the resources folder for a short description of those files (for reference only).

  • several annotation sources for VEP are not included; these were placed in the folder indicated by the variable "database_dir" in the config file:
    "gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz" - source: https://gnomad.broadinstitute.org/downloads#v2-liftover ,
    "dbNSFP4.1a_hg38.gz" - source: https://sites.google.com/site/jpopgen/dbNSFP - prepared for the dbNSFP plugin of VEP as described in https://github.com/Ensembl/VEP_plugins/blob/release/102/dbNSFP.pm
    "spliceai_scores_sorted.hg38.vcf.gz" - source: downloaded from https://basespace.illumina.com/s/otSPW8hnhaZR , then sorted by chrom:pos and bgzipped/indexed (see the sketch below)

Code Snippets

library(tidyverse)

cov_path<-"results/var_sets/EURs_unrel/B2/cov.cov"
prs_path<-"results/PRS/scores.profile"
cov_out_path<-"results/var_sets/EURs_unrel/B2/cov.cov"

cov_path<-snakemake@input[[1]]
prs_path<-snakemake@input[[2]]
cov_out_path<-snakemake@output[[1]]

COV<-read_tsv(file=cov_path, col_names=TRUE)
PRS<-read.table(file = prs_path, header = TRUE)

print("Percentage of individuals in the COV file that also got a PRS calculated:")
mean(COV$IID %in% PRS$IID)

PRS_red<-PRS %>%
  select(IID, SCORESUM)

COV<-COV %>% 
  left_join(PRS_red, by="IID")

write_tsv(x=COV, file=cov_out_path, col_names = TRUE)
library(tidyverse)
library(readxl)

args=c("config/post_df3_HGI_sample_QC_summary.xlsx", "data/hail_gather_data/for_sample_QC.tsv", "data/plotPhenoInfo/")

args = commandArgs(trailingOnly=TRUE)
print(args)
PhenoFileLocation=args[1]
qc_file<-args[2]
outFolder=args[3]

samplesheet<-read_excel(PhenoFileLocation, sheet = "qc_table")

qc_table<-read_tsv(qc_file)

dir.create(outFolder,recursive = TRUE, showWarnings = FALSE)
setwd(outFolder)

samplesheet<-samplesheet%>% 
  rename(sex_to_use=`Sex (f=fem, m=m)`)%>%
  mutate(sex_for_fam=ifelse(sex_to_use=="f", 2, ifelse(sex_to_use=="m",1,0)))

sample_sex<-samplesheet %>% select(sex_to_use, s)

qc_table<-qc_table %>%
  left_join(sample_sex, by=(c("s"="s")))

qc_table<-qc_table %>% 
  mutate(Age=as.integer(Age),
         A2=as.integer(A2),
         B2=as.integer(B2),
         C2=as.integer(C2),
         is_case=as.integer(is_case)
         )



# Check Age vs. Sex

SexPlot<-ggplot(qc_table)+
  geom_histogram(aes(x=Age, fill=sex_to_use), bins=20)+
  theme_bw()

print(SexPlot)
ggsave(filename="SexPlot.pdf",
       plot=SexPlot,
       width=7,
       height=3)

SexInfo<-qc_table %>%
  summarise(frac_male=mean(sex_to_use=="m"), num=sum(sex_to_use!="noSex"))
write_tsv(x=SexInfo, file="SexInfo.tsv")

# Check Age vs. Severity
qc_table <- qc_table %>% 
  mutate(is_case_F=as.factor(is_case))

AgePlot<-ggplot(qc_table)+
  geom_histogram(aes(x=Age, fill=is_case_F), bins=20)+
  geom_vline(data=qc_table %>% filter(is_case==0), aes(xintercept = mean(Age)), col="red")+
  geom_vline(data=qc_table %>% filter(is_case==1), aes(xintercept = mean(Age)), col="green")+
  geom_vline(data=qc_table %>% filter(is_case==2), aes(xintercept = mean(Age)), col="blue")+
  theme_bw()

print(AgePlot)
ggsave(filename="AgePlot.pdf",
       plot=AgePlot,
       width=7,
       height=3)

AgeInfo<-qc_table %>% 
  group_by(is_case)%>%
  summarise(MEAN=mean(Age), SD=sd(Age), N=length(Age))%>%
  mutate(SEM=SD/sqrt(N))


write_tsv(x=AgeInfo, file="AgeInfo.tsv")
library(tidyverse)
library(readxl)

# Params:
# 1. input pheno-excel
# 2. input PC file
# 3. output file

args=c("config/post_df3_HGI_sample_QC_summary.xlsx","data/PCAcovar/A2_EUR_PCA.eigenvec")

# first: excel_file, second: fam-file, third: Phenotype column in excel

args = commandArgs(trailingOnly=TRUE)
print(args)

# File formats:
#Line 1 : Header with FID, IID and C covariate names.
#Followed by lines of C+2 values. Space/tab separated.

samplesheet<-read_tsv(args[1])%>%
  mutate(sex_for_fam=ifelse(is_female==TRUE, 2, ifelse(is_female==FALSE,1,0)))




col_names_eigenv<-c("FID","IID", paste0("PC",1:10))

# sex coding: 1=male, 2=female
# pheno coding: 1=control, 2=case
eigenv<-read_delim(args[2], col_names=col_names_eigenv, delim=" ")

cov_info<-eigenv %>% left_join(samplesheet, by=c("IID"="s"))%>%
  mutate(Age=as.integer(Age)) %>%
  mutate(age_sex=Age*sex_for_fam)%>%
  mutate(Age2=Age*Age)%>%
  select(all_of(col_names_eigenv), Age, Age2, age_sex, sex_for_fam)

write_tsv(x = cov_info, file = args[3], col_names = TRUE) # covariate file for regenie
library(tidyverse)
library(readxl)
# Params:
# 1. input pheno file (tsv)
# 2. input fam file
# 3. pheno column name in excel
# 4. Output: fam file (Plink)
# 5. Output: pheno file (Regenie)

#testing
#setwd("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline")
args=c("../../resources/EURs_unrel.tsv","data/anc_vcf/EUR_vcf_not_rel.vcf.bgz.fam", "A1", 
       "data/regenie_pheno/fam_file.fam", "data/regenie_pheno/phenox")
# first: excel_file, second: fam-file, third: Phenotype column in excel

args = commandArgs(trailingOnly=TRUE)
print(args)

# File formats:
# FAM file:
# A text file with no header line, and one line per sample with the following six fields:
# Family ID ('FID')
# Within-family ID ('IID'; cannot be '0')
# Within-family ID of father ('0' if father isn't in dataset)
# Within-family ID of mother ('0' if mother isn't in dataset)
# Sex code ('1' = male, '2' = female, '0' = unknown)
# Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

# Regenie pheno file:
# FID IID Y1 Y2
# 0=control, 1=case, missing values must be coded as NA

samplesheet<-read_tsv(args[1])%>%
  mutate(sex_for_fam=ifelse(is_female==TRUE, 2, ifelse(is_female==FALSE,1,0)))

pheno_col_num<-which(colnames(samplesheet) == args[3])
pheno_PLINK<-as.integer(unlist(samplesheet[,pheno_col_num]))+1
samplesheet$pheno_P=replace_na(pheno_PLINK, -9)
pheno_REGENIE<-as.integer(unlist(samplesheet[,pheno_col_num]))
samplesheet$pheno_R=pheno_REGENIE

# add sex stratification / age strat
samplesheet <- samplesheet %>%
  mutate(pheno_R_female=ifelse(is_female==TRUE, pheno_R, NA), # just keep females
         pheno_R_male=ifelse(is_female==FALSE, pheno_R, NA), # just keep males
         pheno_R_GE60=ifelse(pheno_R==1 & Age < 60, NA, pheno_R),  # set cases 59 or younger to missing
         pheno_R_LT60=ifelse(pheno_R==1 & Age >= 60, NA, pheno_R),
         pheno_R_GE60CC=ifelse(Age < 60, NA, pheno_R),  # set persons 59 or younger to missing
         pheno_R_LT60CC=ifelse(Age >= 60, NA, pheno_R)) # set persons 60 or older to missing 

col_names_fam<-c("FID","IID","father","mother","sex","pheno")

# sex coding: 1=male, 2=female
# pheno coding: 1=control, 2=case
fam<-read_delim(args[2],
delim=" ", 
col_names=col_names_fam,
col_types=c(
  FID = col_character(),
  IID = col_character(),
  father = col_character(),
  mother = col_character(),
  sex = col_integer(),
  pheno = col_integer()
)
)

print("fam:")
head(fam)
print("samplesheet:")
head(samplesheet)

all_info_joined<-fam %>% left_join(samplesheet, by=c("IID"="s"))%>%
  mutate(sex=sex_for_fam)

fam_new<-all_info_joined %>% 
  select(all_of(col_names_fam[1:4]), sex_for_fam, pheno_P)
write_tsv(x = fam_new, file = args[4], col_names = FALSE) # fam_file for plink

pheno_file_regenie <- all_info_joined %>%
  select(FID, IID, pheno_R, pheno_R_female, pheno_R_male, pheno_R_GE60, pheno_R_LT60, pheno_R_GE60CC, pheno_R_LT60CC)

colnames(pheno_file_regenie)<-c("FID","IID",
                                args[3],
                                paste0(args[3],"_female"),
                                paste0(args[3],"_male"),
                                paste0(args[3],"_GE60"),
                                paste0(args[3],"_LT60"),
                                paste0(args[3],"_GE60CC"),
                                paste0(args[3],"_LT60CC"))

print("out:")
head(pheno_file_regenie)

write_tsv(x = pheno_file_regenie, file = args[5], col_names = TRUE) # pheno file for regenie
if (!require("GenomicRanges", quietly = TRUE) & (!require("rtracklayer", quietly = TRUE)) ){
  install.packages("BiocManager", repos='http://cran.us.r-project.org')
  BiocManager::install("GenomicRanges")
  BiocManager::install("rtracklayer")
}

library(tidyverse)
library(data.table)
library(GenomicRanges)
library(rtracklayer)


gene_set_path<-"resources/gene_sets.tsv"
gene_set_path<-snakemake@input[[1]]
bed_path<-"resources/genomic_ranges.bed"
bed_path<-snakemake@input[[2]]
aaf_path<-"results/all_vars_for_RVAS/data.anno.aaf.file.txt"
aaf_path<-snakemake@input[[3]]
anno_path<-"/home/aschmidt/Arbeit_Gen/COVID_WGS/COVID_annotation/results/for_RVAS/all_contigs_anno.file.txt"
anno_path<-snakemake@input[[4]]
bim_path<-"results/all_vars_for_RVAS/eurs.bim"
bim_path<-snakemake@input[[5]]
#order_csq<-"resources/genomic_ranges.bed"
#order_csq<-snakemake@input[[6]]

new_anno_path<-"set.annos.tsv"
new_anno_path<-snakemake@output[[1]]
new_aafs_path<-"aafs.tsv"
new_aafs_path<-snakemake@output[[2]]
new_sets_path<-"sets.tsv"
new_sets_path<-snakemake@output[[3]]
relevant_variants_path<-"relevant_variants.tsv"
relevant_variants_path<-snakemake@output[[4]]
new_bim_path<-"new_bim.bim"
new_bim_path<-snakemake@output[[5]]

aafs<-read_delim(aaf_path, delim=" ", col_names = c("ID", "AAF"))
print("aafs:")
head(aafs)
annos<-read_delim(anno_path, delim=" ", col_names = c("ID", "gene_id", "annot"))
print("annos")
head(annos)

bim<-fread(bim_path, sep ="\t", col.names = c("CHR", "ID", "cM","BP","A1","A2"))

print("bim")
head(bim)

#bim_test<-head(bim, n=5000000)

# remap all variants onto pseudo-chromosome "1" with sequential positions, so that sets can combine variants from different real chromosomes
bim_new_id<-bim %>%
  mutate(IDnew=paste("1", 1:n(), A1, A2, sep=":"))%>%
  mutate(BPnew=1:n())%>%
  mutate(CHRnew=1)

comb_sources<-bim_new_id %>%
  left_join(aafs, by="ID")%>%
  left_join(annos, by="ID")

print("combsource")
head(comb_sources)
head(comb_sources %>% filter(!is.na(annot)))

# make sets based on gene definitions
gene_sets<-read_tsv(gene_set_path)
new_sets<-tibble()
new_annos<-tibble()
new_aaf<-tibble()

score_conseq<-function(conseq){
  conseq_tbl<-c("pLoF",
                "missense.revel07",
                "missense.revel05",
                "missense.revel03",
                "moderate",
                "synonymous",
                "UTR5_CADD",
                "UTR5",
		            "UTR3_CADD",
                "UTR3",
                "promoter_CADD",
                "promoter",
                "enhancer_CADD",
                "enhancer",
                "regionBased")


conseq_rank<-c()
for (single_conseq in conseq){
  single_rank<-which(conseq_tbl %in% single_conseq)
  if (length(single_rank)==0){ single_rank<-99 } # unknown consequence: rank last (which() returns integer(0), not NA, for no match)
  conseq_rank<-c(conseq_rank, single_rank)
}
return(conseq_rank)
}


i=1
for (single_gene_set in unique(gene_sets$set)){
  rel_genes<-gene_sets %>%
    filter(set==single_gene_set)

  tmp_annos<-comb_sources %>%
    filter(gene_id %in% rel_genes$gene_id)%>%
    filter(annot!="not_relevant")%>%
    mutate(gene_id=single_gene_set)
  if (nrow(tmp_annos)==0){
    next()	
  }

  # check for variants that are added to the gene set more than once, take the most severe conseq
  mltpl_occuring_tmp<-tmp_annos %>% 
    add_count(IDnew)%>%
    filter(n>1)%>%
    select(-n)

  if (nrow(mltpl_occuring_tmp)>0){
    tmp_annos<-tmp_annos %>%
      filter(!IDnew %in% mltpl_occuring_tmp$IDnew)

    for (IDnew_tmp in unique(mltpl_occuring_tmp$IDnew)){
      mltpl_occuring_single<-mltpl_occuring_tmp %>%
        filter(IDnew==IDnew_tmp)%>%
        mutate(eff_score=score_conseq(annot))%>%
        arrange(eff_score) 

      print(IDnew_tmp)
      print(mltpl_occuring_single)

      mltpl_occuring_single<-mltpl_occuring_single %>%
        select(-eff_score)

      tmp_annos<-rbind(tmp_annos, mltpl_occuring_single[1,])
    }
  }

  new_annos<-rbind(new_annos, tmp_annos) 
  tmp_IDs<-paste(tmp_annos$IDnew, collapse=",")
  tmp_set<-tibble(gene_id=single_gene_set, chr="1", pos=i, IDs=tmp_IDs)
  new_sets<-rbind(new_sets, tmp_set)
  i=i+1 
}



#### 

# make sets based on bed file

region_sets<-import(bed_path)
head(region_sets)
region_sets_names<-str_split(region_sets$name, ";", simplify=TRUE)
region_sets_names<-as_tibble(region_sets_names)
colnames(region_sets_names)<- c("set", "gene")
set_names<-unique(region_sets_names$set)

sources_w_AF<-comb_sources %>%
  filter(!is.na(AAF))%>%
  mutate(pos_start=BP,
         pos_end=BP+nchar(A1))

head(sources_w_AF)
chroms<-paste0("chr",sources_w_AF$CHR)
chroms<-str_replace(chroms, "chr23", "chrX")

annos_granges <- GRanges(
  seqnames = chroms,
  ranges = IRanges(start=sources_w_AF$pos_start, end = sources_w_AF$pos_end) )
head(annos_granges)
for (set_name in set_names){

  tmp_region<-region_sets[region_sets_names$set==set_name]
  tmp_intersect<-GenomicRanges::intersect(annos_granges, tmp_region)

  tmp_chr_pos_inReg <-sources_w_AF[annos_granges %in% tmp_intersect,]%>%
    distinct(ID, .keep_all=TRUE)%>%
    mutate(gene_id=set_name,
           annot="regionBased")
  if (nrow(tmp_chr_pos_inReg)==0){
    next()
  }

  new_annos<-rbind(new_annos, tmp_chr_pos_inReg, fill=TRUE)

  tmp_IDs<-paste(tmp_chr_pos_inReg$IDnew, collapse=",")
  tmp_set<-tibble(gene_id=set_name, chr="1", pos=i, IDs=tmp_IDs)
  new_sets<-rbind(new_sets, tmp_set)
  i=i+1 
}


####

new_annos_export<-new_annos %>%
 distinct(IDnew, gene_id, annot)

new_aaf<-comb_sources %>% distinct(IDnew, AAF) %>%
  filter(!is.na(AAF))

relevant_variants<-unique(new_annos$ID) # original variant IDs, used downstream for plink --extract (new_annos_export only keeps IDnew)

write_delim(x=new_annos_export,
            file = new_anno_path,
            col_names = FALSE )


write_delim(x=new_aaf,
            file = new_aafs_path,
            col_names = FALSE )

write_delim(x=new_sets,
            file = new_sets_path,
            col_names = FALSE )

write(x = relevant_variants,
      file=relevant_variants_path)

bim_new_id<-bim_new_id%>%
  select(CHRnew,IDnew,cM,BPnew,A1,A2)%>%
  relocate(CHRnew,IDnew,cM,BPnew,A1,A2)

write_tsv(x=bim_new_id,
            file = new_bim_path,
            col_names = FALSE )
library(data.table)
vars<-fread(snakemake@input[[1]], fill=TRUE, header = TRUE, na.strings = ".")
# vars<-fread("results/for_RVAS/annotated_split_vep_1.tsv", fill=TRUE, header = TRUE, na.strings = ".")

vars<-vars[, c("chrom","pos","ref","alt"):=tstrsplit(ID, ":", fixed=TRUE)]

###### MASKS ######
vars<-vars[BIOTYPE == "protein_coding"]

vars<-vars[,REVEL_score:=as.numeric(REVEL_score)]


vars<-vars[, ':=' (revel03=!is.na(REVEL_score) & REVEL_score>0.3, 
                   revel05=!is.na(REVEL_score) & REVEL_score>0.5, 
                   revel07=!is.na(REVEL_score) & REVEL_score>0.7,
                   missense=like(Consequence, "missense_variant"),
                   syn=like(Consequence, "synonymous_variant"),
                   pLOF=(IMPACT=="HIGH"),
                   UTR5=like(Consequence, "5_prime_UTR_variant"),
                   UTR3=like(Consequence, "3_prime_UTR_variant"),
		   promoter=like(Consequence,"upstream_gene_variant") & TSSDistance <= 1000,
                   enhancer=like(Consequence, "upstream_gene_variant")& TSSDistance <= 50000 & !is.na(DHS),
                   CADD10= !(is.na(CADD_PHRED) | CADD_PHRED<10) ) ]

vars<-vars[, ':=' (revel03_mis=(revel03 & missense),
                   revel05_mis=(revel05 & missense), 
                   revel07_mis=(revel07 & missense), 
                   moderate_non_missense=(!missense & (IMPACT=="MODERATE")))]

vars<-vars[, by=c("ID", "Gene"), mask_anno:=ifelse(sum(pLOF)>0,"pLoF", 
                                            ifelse(sum(revel07_mis)>0, "missense.revel07",
                                            ifelse(sum(revel05_mis)>0, "missense.revel05",
                                            ifelse(sum(revel03_mis)>0, "missense.revel03",
                                            ifelse(sum(missense | moderate_non_missense)>0, "moderate",
                                            ifelse(sum(syn)>0, "synonymous", 
                                            ifelse(sum(UTR5)>0 & sum(CADD10)>0, "UTR5_CADD", 
                                            ifelse(sum(UTR5)>0, "UTR5",
                                            ifelse(sum(UTR3)>0 & sum(CADD10)>0, "UTR3_CADD",
                                            ifelse(sum(UTR3)>0, "UTR3", 
                                            ifelse(sum(promoter)>0 & sum(CADD10)>0, "promoter_CADD", 
                                            ifelse(sum(promoter)>0, "promoter", 
                                            ifelse(sum(enhancer)>0 & sum(CADD10)>0, "enhancer_CADD", 
                                            ifelse(sum(enhancer)>0, "enhancer",
                                                   "not_relevant")
                                                                        ) ) ) ) ) ) ) ) ) ) ) ) )
]

anno_list<-rbindlist(list(vars[, c("ID", "Gene", "mask_anno")]), use.names = FALSE)

###### AF ######
vars=vars[, maxAF:=pmax(as.numeric(gnomAD_ex_AF), as.numeric(gnomAD_ge_AF), 0,
                        na.rm = TRUE)]

vars=vars[, minAF:=pmin(as.numeric(gnomAD_ex_AF), as.numeric(gnomAD_ge_AF), 1,
                        na.rm = TRUE)]

vars=vars[, maxAC:=pmax(as.numeric(gnomAD_ex_AC), as.numeric(gnomAD_ge_AC), 0, na.rm = TRUE)]

vars=vars[, maxHom:=pmax(as.numeric(gnomAD_ex_nhomalt), as.numeric(gnomAD_ge_nhomalt), 0, na.rm = TRUE)]


#vars=vars[,AF_category:=ifelse( (maxAF < 0.005) & (maxAC > 9 | maxHom > 0 ), 0.005, maxAF) ]
vars=vars[,AF_category:=maxAF]

###### SET LIST #######
collapsed_vars<-vars[,concats:=paste(ID, collapse=","), by=Gene]
collapsed_vars<-unique(collapsed_vars[,c("Gene","chrom","concats")])
collapsed_vars<-collapsed_vars[,pos:=1:nrow(collapsed_vars)]
collapsed_vars<-collapsed_vars[Gene!="" & !is.na(Gene), ]


fwrite(unique(anno_list), file = snakemake@output[[1]], col.names = FALSE, sep=" ")
fwrite(unique(vars[, c("ID", "AF_category")]), file = snakemake@output[[2]], col.names = FALSE, sep=" ")
fwrite((collapsed_vars[,c("Gene","chrom","pos","concats")]), file = snakemake@output[[3]], col.names = FALSE, sep=" ")

fwrite(vars[, -"concats"], file = snakemake@output[[4]], col.names = TRUE, sep="\t")
library(tidyverse)

#cov<-read_tsv("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline/results/regenie_pheno/cov_A2_EUR")
#pheno<-read_tsv("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline/results/regenie_pheno/pheno_EUR_A2")

cov<-read_tsv(snakemake@input[[2]])%>%
  select(-FID)
pheno<-read_tsv(snakemake@input[[1]])%>%
  select(-FID)

pheno_cov<-pheno %>%
 left_join(cov, by = "IID")

write_tsv(x=pheno_cov,file=snakemake@output[[1]])

males<-pheno_cov %>% # sex_for_fam comes from the covariate file, so filter the joined table
  filter(sex_for_fam==1) %>%
  select(IID)

write_tsv(x=males, file=snakemake@output[[2]], col_names=FALSE)
import pandas as pd
import pandas_plink
from xarray import DataArray
import shutil
import dask.array

# %%
mask_def_file = snakemake.input.regenie_mask_def
aaf_file = snakemake.input.data_anno_aaf_file
anno_file = snakemake.input.regenie_anno_file

af_cutoffs = list(map(float, snakemake.params.af_cutoffs))
plink_in_prefix = snakemake.params.plink_in_prefix

out_prefix = snakemake.params.plink_out_folder
set_file = snakemake.output.setFile_rvtest

debug_joined_out = snakemake.output.debug_joined_out
#%%

# read plink with pandas_plink
(bim, fam, bed) = pandas_plink.read_plink(plink_in_prefix, verbose=False)
bim.cm = bim.cm.astype(int)
#%%
# Read regenie aaf_file, anno_file, mask_def
aafs = pd.read_csv(aaf_file, sep=' ', names=['ID', 'aaf'])

mask_def = pd.read_csv(mask_def_file, sep=' ', names=['mask', "annots"])
mask_def.annots = mask_def.annots.map(lambda a: a.split(','))
mask_def = mask_def.set_index("mask")

anno = pd.read_csv(anno_file, sep=' ', names=['ID', 'gene', 'annot'])
anno = anno.dropna()
# %%
# Create M0-M4 True/False columns in table
for mask, annots in mask_def.annots.items():
    anno[mask] = anno.annot.isin(annots)

# Join AF info into table
anno_aaf = anno.join(aafs.set_index("ID"), on="ID", how="outer")

# %%

# join bim (variant table) with regenie anno
joined = bim.join(anno_aaf.set_index("ID"), on="snp", how="inner")
joined = joined.sort_values(by=['gene', 'pos'])

joined.to_csv(debug_joined_out, sep="\t")

#%%

for mask in mask_def.index.unique():
    for af_cutoff in af_cutoffs:
        mask_out_prefix = f"{out_prefix}{mask}_{af_cutoff}"
        # subset variant table
        subset = joined[joined[mask] & (joined["aaf"] < af_cutoff)]
        # export genotype data (subset["i"] contains positions of variants in table)
        pandas_plink.write_plink1_bin(
            DataArray(bed[subset["i"]], dims=["variant", "sample"]),
            bed=mask_out_prefix+".bed"
            )
        # write variant subset to bim file
        subset[["gene","snp","cm","pos","a0","a1"]].to_csv(mask_out_prefix+".bim", sep=" ", header=False, index=False)
        # reuse original fam file
        shutil.copy(plink_in_prefix + ".fam", mask_out_prefix +".fam")

#%%

# create setfile, each gene has its own "chromosome"/contig
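# the region end uses the global maximum position, which safely covers every gene-contig (each contig only carries its own variants)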
set_str = ""
for gene in joined.gene.unique():
    set_str += f"{gene}\t{gene}:0-{joined.pos.max() + 100}\n"

with open(set_file, "wt") as f:
    f.write(set_str)
library(tidyverse)

check_S_path<-snakemake@input[[1]]
plot_out<-snakemake@output[[1]]


check_S<-read.table(check_S_path, header = T)
F_plot<-ggplot(check_S, aes(x=F, fill=PEDSEX==1 | PEDSEX==0))+
  geom_histogram(bins=100)

ggsave(filename = plot_out, plot = F_plot)
ggsave(filename = paste0(plot_out, "zoomed.jpg"), plot = F_plot + ylim(c(0,100)) )
packs <- c("qqman","optparse","data.table","R.utils")


for (p in packs) {
  if( !require(p, character.only = T)) {
    print(p)
    install.packages( p,  repos = c(CRAN = "http://cran.r-project.org") )
  }
}

option_list = list(
  make_option(c("-f", "--file"), type="character", default=NULL,
              help="dataset file name", metavar="character"),
  make_option(c("-o", "--out"), type="character",
              help="output file name [default= %default]", metavar="character"),
  make_option(c("-c","--chrcol"), type="character", default="CHR",
              help="chromosome column [default= %default]", metavar="character"),
  make_option(c("-p","--pval_col"), type="character", default="P",
              help="pvalue column [default= %default]. This can be a comma separated list and plots will be generated for each of these", metavar="character"),
  make_option(c("-b","--bp_col"), type="character", default="BP",
              help="bp column [default= %default]", metavar="character"),
  make_option(c("-l","--loglog_pval"), type="integer", default=10,
              help="-log10 p-val threshold for using log-log scale in manhattan plot [default= %default]", metavar="integer"),
  make_option(c("-i","--id_col"), type="character", default="ID",
              help="-ID_col [default= %default]", metavar="character"),
  make_option(c("-g","--log"), type="logical", default="FALSE",
              help="-log [default= %default]", metavar="logical"),
  make_option(c("-m","--minrep_col"), type="character",
              help="if given then chr:bp:ref:alt identifier assumed and chr and bp are read from there [default= %default]", metavar="character")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser, positional_arguments=0);

# file="../GWAS/REGENIE_OUT_STEP2_chr1_pheno.regenie"
# bp_col="GENPOS"
# chr_col="CHROM"
# pcols =c("LOG10P")

file <- opt$options$file
print(paste("reading file:", file))

data <- fread(file, header=T)

options(bitmapType='cairo')

print(str(opt))
bp_col <- opt$options$bp_col
chr_col <- opt$options$chrcol
id_col <- opt$options$id_col

print(summary(data))
print( summary( data[[chr_col]] ) )
#colnames(data) <- toupper( colnames(data) )

pcols <- unlist(strsplit(opt$options$pval_col,","))

output_prefix=file

if( !is.null(opt$options$out)) {
  output_prefix=opt$options$out
}


if(! is.null(opt$options$minrep_col ) ) {
    print("getting BP and CHR from minrepid")
    split <- strsplit(as.character(data[[opt$options$minrep_col]]), ":")
    data[[bp_col]] <- unlist( lapply( split, function(x) as.numeric(x[2]) ))
    data[[chr_col]] <- unlist( lapply( split, function(x) x[1] ))
}

print(append(pcols,c(bp_col,chr_col)))

if( any( ! append(pcols,c(bp_col,chr_col)) %in% colnames(data)   )) {
  stop( paste0("Not all required columns exist in the data: ", paste(pcols,sep=",", collapse=""),",", bp_col, ",",chr_col,  collapse="" ))
}


print(summary(as.factor(data[[chr_col]])))

data[[chr_col]] <- gsub("chr","",data[[chr_col]])
data[[chr_col]] <- gsub("X|chrX","23",data[[chr_col]])
data[[chr_col]] <- gsub("Y|chrY","24",data[[chr_col]])
data[[chr_col]] <- gsub("MT|chrMT|M|chrM","25",data[[chr_col]])

data[[chr_col]] <- as.numeric(data[[chr_col]])
data <- data[ !is.na(data[[chr_col]]) ]

quants <- c(0.7,0.5,0.456,0.1,0.01, 0.001)



for( pcol in pcols) {
  if (opt$options$log==TRUE){
	data[[pcol]]<- 10^(-data[[pcol]])
  }
  subdata <- data[ !is.na(data[[pcol]]) & is.numeric( data[[pcol]]  ) ]

  # genomic inflation (lambda) at several quantiles: observed chi-square quantiles / expected chi-square quantiles
  lambda  <- round(  quantile(  (qchisq(1-subdata[[pcol]], 1) ), probs=quants ) / qchisq(quants,1), 3)
  png( paste(output_prefix,"_", pcol ,"_qqplot.png", sep="" ))
  qq(subdata[[pcol]], main=paste("\nlambda ", quants, ": ", lambda, sep="" ) )
  dev.off()
  sink( paste(output_prefix,"_",  pcol ,"_qquantiles.txt", sep="" ) )
  cat( paste( quants, ":", lambda, sep=""))
  sink()

  print("subsetting p-vals < 0.01 for manhattan...")

  subdata <- subdata[ subdata[[pcol]]<0.01 & subdata[[pcol]]>0 ]
  print( paste0("Plotting manhattan with ", nrow(subdata), " variants") )
  print( summary(subdata[[pcol]] ))
  png( paste(output_prefix,"_",pcol,"_manhattan.png", sep=""), width=1000, height=400)
  logs <- -log10(subdata[[pcol]])

  manhattan( subdata , chr=chr_col, bp=bp_col, p=pcol,snp=id_col, ylim=c( 2,max(logs)+1)  )
  dev.off()

  print("!!!!!!!!!!!!!!!!!!!!!!!")

  print("plotting log-log manhattan")
  loglog_p <- opt$options$loglog_pval
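  # above the threshold, compress the y-axis: y' = loglog_p * log10(y) / log10(loglog_p)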
  logs <- ifelse(logs < loglog_p, logs, loglog_p * log10(logs) / log10(loglog_p))
  subdata[["p_scaled"]] <- 10^(-logs)
  tick_pos <- round(seq(1, max(logs), length.out=round(max(logs))))
  tick_lab <- sapply(tick_pos, function(pos) { round(ifelse(pos < loglog_p, pos, loglog_p^(pos/loglog_p))) })
  png( paste(output_prefix,"_",pcol,"_manhattan_loglog.png", sep=""), width=1000, height=400)
  manhattan( subdata, chr=chr_col, bp=bp_col, p="p_scaled", snp=id_col, ylim=c( 2,max(logs)+1), yaxt="n")
  axis(2, at = tick_pos, labels=tick_lab, las=2)
  dev.off()
}
shell:
	"""
	#wget -nc ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	#gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	bcftools filter -r chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX \
	  {input.bcf} -Ou | \
	  bcftools norm --check-ref w -f {input.fasta} -Ou | \
	  bcftools annotate --set-id '%CHROM:%POS:%REF:%FIRST_ALT' -Oz > {output.vcf_gz}
	tabix -p vcf {output.vcf_gz}
	"""
shell:
	"""
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/vcf2hail.py $(pwd)/{input.vcf} \
	$(pwd)/{output} \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{input.individual_list}"

	if [ {config[cluster]} = "yes" ]; then
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi
	"""
shell:
	"""	
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/primaryQC.py \
	$(pwd)/{input.mt} \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{output.sample_qc_file} \
	$(pwd)/{params.out_plink}"

	if [ {config[cluster]} = "yes" ]; then
	queue="medium"
	export TMPDIR=$(pwd)/{params.tmp_dir}
	hours_to_run=12
	DRIVE_MEM=78
	SPARK_LOCAL_DIRS=$(pwd)/{params.tmp_dir}
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi

	rm -rf {params.tmp_dir}*
	"""
shell:
	"""
	echo $(pwd)

	cp workflow/scripts/plot_sampleQC.R {params.path_output}
	cp {input.xl_file} {params.path_output}

	cd {params.path_output}

	Rscript -e 'library(rmarkdown); rmarkdown::render("plot_sampleQC.R", "html_document")'
	rm *.R
	"""
shell:
	"""
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/MakeVarSets.py \
	$(pwd)/{params.tmp_dir} \
	$(pwd)/{input.MT} \
	$(pwd)/{input.input_vcf} \
	$(pwd)/{input.individual_list} \
	$(pwd)/{params.out_trunk} \
	$(pwd)/{params.out_Commontrunk}"

	if [ {config[cluster]} = "yes" ]; then
	export TMPDIR=$(pwd)/{params.tmp_dir}
	worker_nodes_excluded={config[worker_nodes_excluded]}
	hours_to_run=12
	DRIVE_MEM=148
	num_workers=60
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi
	"""
shell:
	"""
	plink2 \
	--bfile {params.in_plink} \
	--maf 0.01 \
	--exclude bed0 {input.bed} \
	--make-bed \
	--out {params.tmp_plink}		

	plink \
	--bfile {params.tmp_plink} \
	--indep-pairwise 1000kb 50 0.2 \
	--out {params.out_plink}

	plink \
	--bfile  {params.in_plink} \
	--extract {params.out_plink}.prune.in \
	--split-x hg38 no-fail \
	--make-bed \
	--out {params.out_plink}

	rm {params.tmp_plink}*
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--filter-females \
	--hwe 1e-3 midp include-nonctrl \
	--make-just-bim \
	--out {params.hwe_vars}

	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--extract {output.hwe_vars} \
	--check-sex 0.5 0.8 \
	--out {params.out_plink}

	"""
script: "scripts/plotFcheckSex.R"
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--chr 1-22 \
	--missing \
	--out {params.out_plink}
	"""
shell:
	"""
	mkdir -p {params.out_folder}
	cd {params.out_folder}
	wget -nc ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

	#reference genome  (GRCh38)
	#wget -nc ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	#gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
	"""
shell:
	"""
	mkdir -p {params.folder_cont}
	cd {params.folder_cont}
	prefix="ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.chr"
	suffix=".shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz"
	wget -nc "$prefix""{wildcards.contig}""$suffix" "$prefix""{wildcards.contig}""$suffix".tbi
	"""
shell:
	"""
	if bcftools view -q 0.05:minor {input.vcf} | \
	bcftools norm -m-any --check-ref w -f "{input.fasta}" | \
	bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
	bcftools norm -Ob --rm-dup both \
	> {output.bcf} ; then
	echo "no error"
	fi

	plink --noweb \
	--bcf {output.bcf} \
	--keep-allele-order \
	--vcf-idspace-to _ \
	--allow-extra-chr 0 \
	--split-x b38 no-fail \
	--make-bed \
	--out {params.bed1}

	plink --noweb \
	--bfile {params.bed1} \
	--extract {input.hailbim} \
	--maf 0.10 --indep 50 5 1.5 \
	--make-bed \
	--out {params.bed2}
	"""
shell:
	"""
	echo {input._1000G_data} | tr " " "\\n" | sed 's/\.bed//g' > {output.merge_list}
	plink --merge-list {output.merge_list} --out results/1000G/Merged
	awk '{{ print $2 }}' results/1000G/Merged.bim > results/1000G/MergeVariants.txt

	plink --bfile {params.hailplink} \
	 --extract results/1000G/MergeVariants.txt \
	 --make-bed \
	 --out results/1000G/hail_for_ancestry

	printf "results/1000G/Merged\\nresults/1000G/hail_for_ancestry" > results/1000G/ForMergeFull.list

	mkdir -p results/PCA
	plink --merge-list results/1000G/ForMergeFull.list --out results/PCA/MergeFullForPCA

	awk '{{ print $1,$2 }}' results/1000G/Merged.fam | awk '$(NF+1) = "1000G"' > results/PCA/clusters.txt
	awk '{{ print $1,$2 }}' results/1000G/hail_for_ancestry.fam | awk '$(NF+1) = "Cohort"' >> results/PCA/clusters.txt

	plink --bfile results/PCA/MergeFullForPCA \
	 --pca-cluster-names 1000G \
	 --pca 20 \
	 --out {params.pca_prefix} \
	 --within results/PCA/clusters.txt
	"""
shell:
	"""
	cp -f workflow/scripts/populations_PCA.R results/PCA/populations_PCA.R
	cd results/PCA/
	Rscript -e 'library(rmarkdown); rmarkdown::render("populations_PCA.R", "html_document")'
	"""
shell:
	"""
	king \
	-b {params.plink_in}.bed \
	--kinship \
	--prefix {params.prefix}
	"""
shell:
	"""
	cut -f1 {input.list} > {params.tmp_id}
	cat {input.fam} | grep -f {params.tmp_id} > {params.tmp_fam}

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--keep {params.tmp_fam} \
	--make-bed \
	--out {params.out_prefix}
	"""
shell:
	"""
	mkdir -p {params.path} 
	Rscript workflow/scripts/generate_regenie_pheno.R {input.individual_list} {input.fam} {wildcards.pheno} {output.fam} {output.pheno}

	cp -f {input.bim} {output.bim}
	cp -f {input.bed} {output.bed}
	"""
shell:
	"""
	mkdir -p {params.pathQC}
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--filter-cases \
	--missing \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcCAS

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--filter-controls \
	--missing \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcCON

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--hardy \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcALL

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--hardy \
	--chr 23 \
	--filter-females \
	--out {params.pathQC}/Geno05_CR_sex_snp_qc_snpqcALL_female

	### R session
	wd=$(pwd)
	cp workflow/scripts/additional_QC_cohort.R {params.pathQC}/
	cd {params.pathQC}
	Rscript -e 'library(rmarkdown); rmarkdown::render("additional_QC_cohort.R", "html_document")'
	cd $wd

	# 1e. final QC (4675116)
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--exclude {params.pathQC}/missingness_hardy_weinberg_filter.txt \
	--make-bed \
	--out {params.out_prefix}

	rm -rf {params.pathQC}
	"""
shell:
	"""
	mkdir -p {params.folderPCA}

	#common
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--chr 1-22 \
	--indep-pairwise 50 5 0.05 \
	--keep-allele-order \
	--maf 0.01 \
	--out "{params.pathinterim}"

	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--extract "{params.pathinterim}.prune.in" \
	--pca 10 \
	--out "{params.out_prefix}"

	rm {params.pathinterim}*
	"""
shell:
	"""
	Rscript workflow/scripts/generate_covar_file.R {input.individual_list} {input.PCA_cov} {output}
	"""
shell:
	"Rscript workflow/scripts/compile_phenotype_plots.R {input} {params.out_dir}"
shell:
	"""
	regenie \
	  --step 2 \
	  --minMAC 5 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --bt \
	  --ignore-pred \
	  --write-samples \
	  --bsize 5000 \
	  --out {params.regenie_step2} \
	  --af-cc \
	  --firth \
	  --firth-se \
	  --approx \
	  --gz
	  """
shell:
	"""
	regenie \
	  --step 2 \
	  --minMAC 5 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --condition-list {input.conditional_list} \
	  --bt \
	  --ignore-pred \
	  --write-samples \
	  --bsize 5000 \
	  --out {params.regenie_step2} \
	  --af-cc \
	  --firth \
	  --firth-se \
	  --approx \
	  --gz
	  """
shell:
	"""
	plink2 \
	--bim {input.bim} \
	--bed {input.bed} \
	--fam {input.fam} \
	--glm no-x-sex hide-covar log10 \
	--covar {input.cov} \
	--covar-variance-standardize \
	--out {params.plink_out} \
	--mac 5
	  """
shell:
	"""
	Rscript workflow/scripts/qqplots_GWAS.R -f {input} -o {input} -c CHROM -p LOG10P -b GENPOS --log TRUE -i ID
	"""
script:
	"scripts/add_PRS_to_COV.R"
script:
	"scripts/gene_sets.R"
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--extract {input.relevant_vars} \
	--make-bed \
	--out {params.plink_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--max-maf {wildcards.AF} \
	--make-bed \
	--out {params.plink_out}
	"""
shell:
	"""
	export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 2 \
	  --minMAC 1 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --aaf-file {input.anno_aaf} \
	  --anno-file {input.anno_csq} \
	  --mask-def {input.mask_def} \
	  --set-list {input.var_sets} \
	  --bed {params.plink_file} \
	  --aaf-bins {wildcards.AF} \
	  --bt \
	  --write-samples \
	  --ignore-pred \
	  --out {params.regenie_step2} \
	  {params.test_params} \
	  --af-cc \
	  --maxstep-null 2 \
	  --maxiter-null 100000 \
	  --gz

	  #--check-burden-files \
	  #--strict-check-burden \
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.mask_bim} \
	--bed {input.mask_bed} \
	--model fisher \
	--out {params.plink_out}
	"""
script:
	"scripts/qq_plots_RVAS.R"
shell:
	"""
	plink \
	--fam {input.famCommon} \
	--bim {input.bimCommon} \
	--bed {input.bedCommon} \
	--score {input.PRS} 1 2 3 sum \
	--out {params.out_pref}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	--chr X \
	--from-bp 12767072 \
	--to-bp 12967072 \
	--make-bed \
	--out {params.out_plink}
	"""
shell:
	"""
	export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 2 \
	  --minMAC 1 \
	  --covarFile {input.cov} \
	  --phenoFile {input.pheno} \
	  --bed {params.plink_file} \
	  --bt \
	  --firth \
	  --firth-se \
	  --write-samples \
	  --ignore-pred \
	  --out {params.regenie_step2} \
	  {params.rez} \
	  {params.sex_param} \
	  --af-cc \
	  --bsize 500 \
	  --gz
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--model fisher \
	--out {params.plink_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--filter-cases \
	--freqx \
	--out {params.out_case}

	plink \
	--fam {input.fam} \
	--bim {input.bim} \
	--bed {input.bed} \
	{params.sex_param} \
	--filter-controls \
	--freqx \
	--out {params.out_ctrl}		
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	{params.maf} \
	--homozyg \
	--out {params.ROH_out}
	"""
shell:
	"""
	plink \
	--fam {input.fam} \
	--bed {input.bed} \
	--bim {input.bim} \
	--het gz \
	--ibc \
	--out {params.ROH_out}
	"""
shell: 
	"""
	cd {params.out_folder}
	wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr{wildcards.contig}.vcf.bgz
	wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr{wildcards.contig}.vcf.bgz.tbi
	"""
shell:
	"""
	cd {params.out_dir}
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz.tbi
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.snv.tsv.gz
	wget https://kircherlab.bihealth.org/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.snv.tsv.gz.tbi
	"""
shell:
	"""
	keep_string="^INFO/AC,INFO/AN,INFO/AF,INFO/popmax,INFO/faf95_popmax,INFO/AC_oth,INFO/AN_oth,INFO/AF_oth,INFO/nhomalt_oth,INFO/AC_ami,INFO/AN_ami,INFO/AF_ami,INFO/nhomalt_ami,INFO/AC_sas,INFO/AN_sas,INFO/AF_sas,INFO/nhomalt_sas,INFO/AC_fin,INFO/AN_fin,INFO/AF_fin,INFO/nhomalt_fin,INFO/AC_eas,INFO/AN_eas,INFO/AF_eas,INFO/nhomalt_eas,INFO/AC_amr,INFO/AN_amr,INFO/AF_amr,INFO/nhomalt_amr,INFO/AC_afr,INFO/AN_afr,INFO/AF_afr,INFO/nhomalt_afr,INFO/AC_raw,INFO/AN_raw,INFO/AF_raw,INFO/nhomalt_raw,INFO/AC_mid,INFO/AN_mid,INFO/AF_mid,INFO/nhomalt_mid,INFO/nhomalt,INFO/AC_asj,INFO/AN_asj,INFO/AF_asj,INFO/nhomalt_asj,INFO/AC_nfe,INFO/AN_nfe,INFO/AF_nfe,INFO/nhomalt_nfe,INFO/AC_popmax,INFO/AN_popmax,INFO/AF_popmax,INFO/nhomalt_popmax"
	bcftools annotate -x "$keep_string" {input} -Oz -o {output}

	tabix -pvcf {output}
	#rm {input}
	"""
shell:
	"""
	bcftools concat -Oz {input} > {output.reduced}
	tabix -pvcf {output.reduced}
	#bcftools view output.reduced -i "INFO/AF>0.01" -Oz -o output.above_one_perc
	#tabix -pvcf output.above_one_perc
	"""
shell:
	"""
	# software required: bedops, bedtools
	mkdir -p {params.out_folder}
	cd {params.out_folder}

	# get bedtools reference file
	wget -Nc https://raw.githubusercontent.com/arq5x/bedtools2/master/genomes/human.hg38.genome -O bedtools_hg38_ref_file.genome

	# GENCODE
	wget -Nc ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gff3.gz

	# exons
	zcat gencode.v35.annotation.gff3.gz | \
	 awk '$3 == "exon"' - | \
	 sort -k1,1 -k4,4n -V | \
	 bedtools merge -i stdin | \
	 cut -f1-3 | gzip > gencode35Exons.bed.gz

	# add padding to gencode file
	bedtools slop -b 20 \
	-i gencode35Exons.bed.gz \
	-g bedtools_hg38_ref_file.genome | \
	gzip > gencode35ExonsPadded.bed.gz

	# promoters
	zcat gencode.v35.annotation.gff3.gz | \
	grep -P "transcript\t" | grep -P "\t\+\t" | \
	 awk '{{OFS="\t"}}{{print $1,$4-1001,$4-1}}' > plus_strand_promotor.bed

	zcat gencode.v35.annotation.gff3.gz | \
	grep -P "transcript\t" | grep -P "\t\-\t" | \
	awk '{{OFS="\t"}}{{print $1,$5-1,$5+999}}' > minus_strand_promotor.bed

	cat plus_strand_promotor.bed minus_strand_promotor.bed | \
	sort -k1,1 -k3,3n -V | \
	awk '{{OFS="\t"}}{{if ($2<0) print $1,"0",$3; else print $0}}' | \
	gzip > promoters.bed.gz


	# ClinVar
	wget -Nc ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
	zcat clinvar.vcf.gz | grep "^#"  > clinvar_pathogenic.vcf
	zcat clinvar.vcf.gz | grep -v "^#" | egrep "pathogenic|Pathogenic" | grep -v "uncertain" >> clinvar_pathogenic.vcf
	bedtools merge -i clinvar_pathogenic.vcf | cut -f1-3 | awk '{{OFS=""}}{{print "chr",$0}}' | gzip > clinvar_pathogenic.bed.gz

	# SpliceAI
	# filter splice AI for scores >0.5 and convert to BED
	zcat {params.anno_sources}'/spliceai_scores_sorted.hg38.vcf.gz' | \
	egrep "0\.[5-9]|1\.0" | \
	awk 'BEGIN {{OFS="\t"}} {{ print $1, $2-1, $2 }}' | \
	gzip > \
	splice_ai_positions.bed.gz


	# merge and sort individual bed files
	zcat gencode35ExonsPadded.bed.gz clinvar_pathogenic.bed.gz splice_ai_positions.bed.gz promoters.bed.gz ../../{input.DHS} | \
	cat - ../../{input.GenRa} | \
	cut -f1-3 | \
	bedtools sort | \
	bedtools merge > "ExonsClinVarSpliceAI.bed"
	"""
shell:
	"""
	cat {input.bedfile} | grep "^chr{wildcards.contig}	" > {output.relevant_bed}

	bcftools view {input} -R {output.relevant_bed} -Ob | \
	bcftools sort -T $TMPDIR | \
	bcftools norm -m-any -Ob | \
	bcftools norm --remove-duplicates -Oz -o {output.vcf_prefiltered}
	"""
shell:
	"""

	work_dir=$(pwd)
	database_dir={params.anno_sources}

	echo $SLURM_NODELIST			
	echo $TMPDIR

	cd $TMPDIR
	#git clone https://github.com/ImperialCardioGenetics/UTRannotator.git
	#export PERL5LIB=$(pwd)/UTRannotator
	cp -v $database_dir'/Homo_sapiens_assembly38.fasta.bgz'* . #/dev/shm/ #1G, store to RAMdisk
	cp -v $database_dir'/gnomad.genomes.v3.1.2.sites.all_red.vcf.gz'* . # 9G
	#cp -v $database_dir'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz'* . # 86 G biggest file, do not copy
	mkdir -p dbNSFP41
	cp -v $database_dir'/dbNSFP41/dbNSFP4.1a_hg38.gz'* dbNSFP41/ #31G
	cp -v $database_dir'/dbNSFP41/dbNSFP4.1a.readme.txt' dbNSFP41/
	cp -v $database_dir'/spliceai_scores_sorted.hg38.vcf.gz'* . #0,6G
	cp -v $database_dir'/cnv_database_171_samples.vcf.gz'* .
	#cp -v $database_dir'/whole_genome_SNVs.tsv.gz'* . # copy CADD 81G
	#cp -v $database_dir'/gnomad.genomes.r3.0.snv.tsv.gz'* . # copy CADD 6G
	cp -rv $database_dir'/homo_sapiens' . # 15G
	wget -N ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz  
	wget -N ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi	

	cd $work_dir

	#export PERL5LIB=$database_dir'/UTRannotator/'
	#wget https://raw.githubusercontent.com/Ensembl/VEP_plugins/postreleasefix/101/TSSDistance.pm
	export PERL5LIB=$database_dir

	vep -v  \
	--cache \
	--offline \
	--force_overwrite \
	--fork 8 \
	--buffer {params.buffer} \
	--hgvs \
	--no_stats \
	--format vcf \
	--numbers \
	--nearest gene \
	--vcf \
	--distance 50000,1000 \
	--gene_phenotype \
	--pick_allele_gene \
	--af \
	--af_esp \
	--check_existing \
	-a GRCh38 \
	--dir_cache $TMPDIR \
	--fasta $TMPDIR'/Homo_sapiens_assembly38.fasta.bgz' \
	--custom $TMPDIR'/clinvar.vcf.gz',ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
	--custom $TMPDIR'/gnomad.genomes.v3.1.2.sites.all_red.vcf.gz',gnomAD_ge,vcf,exact,0,AF,AC,AN,nhomalt,popmax,AF_oth,AF_ami,AF_sas,AF_fin,AF_eas,AF_amr,AF_afr,nhomalt_afr,AF_raw,AF_mid,AF_asj,AF_nfe,AF_popmax \
	--custom $database_dir'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz',gnomAD_ex,vcf,exact,0,AF,AN,AC,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,nhomalt \
	--plugin dbNSFP,$TMPDIR'/dbNSFP41/dbNSFP4.1a_hg38.gz',Ensembl_proteinid,MutationTaster_pred,MutationTaster_score,LRT_pred,Polyphen2_HDIV_pred,Polyphen2_HVAR_pred,SIFT_pred,REVEL_score,REVEL_rankscore,phyloP100way_vertebrate,phyloP100way_vertebrate_rankscore,phastCons100way_vertebrate,phastCons100way_vertebrate_rankscore,BayesDel_noAF_score,BayesDel_noAF_pred \
	--plugin CADD,$database_dir'/whole_genome_SNVs.tsv.gz',$database_dir'/gnomad.genomes.r3.0.snv.tsv.gz' \
	--custom $TMPDIR'/spliceai_scores_sorted.hg38.vcf.gz',SpliceAI,vcf,exact,0,ALLELE,SYMBOL,DS_AG,DS_AL,DS_DG,DS_DL  \
	--custom {input.DHS},DHS,bed,exact,0 \
	--plugin StructuralVariantOverlap,file=$TMPDIR'/cnv_database_171_samples.vcf.gz' \
	--plugin TSSDistance \
	-i {input.vcf} \
	-o stdout | bgzip > {output}

	#--plugin UTRannotator,$TMPDIR'/UTRannotator/uORF_starts_ends_GRCh38_PUBLIC.txt' \
	#--custom $TMPDIR'/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz',gnomAD_ex,vcf,exact,0,AF,AN,AC,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,nhomalt \

	#rm /dev/shm/Homo_sapiens_assembly38.fasta.bgz*

	"""
shell:
	"""
	function return_AF_string_w_genomes
	{{
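	# keep variants that are rare in gnomAD genomes (AF < $1 or AF absent), have at least LOW impact or a SpliceAI record, and fewer than 11 homozygotes; ClinVar/CLIN_SIG (likely) pathogenic entries are always kept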
	echo "((gnomAD_ge_AF < "$1" or not gnomAD_ge_AF) and ((IMPACT is HIGH) or (IMPACT is MODERATE) or (IMPACT is LOW) or (SpliceAI_SYMBOL)) and (gnomAD_ge_nhomalt < 11 or not gnomAD_ge_nhomalt)) or (ClinVar_CLNSIG and not (ClinVar_CLNSIG match benign or ClinVar_CLNSIG match Benign)) or (CLIN_SIG match pathogenic or CLIN_SIG match Pathogenic)"

	#echo "((gnomAD_ge_AF_AFR < "$1" or not gnomAD_ge_AF_AFR) and (gnomAD_ge_AF_AMR < "$1" or not gnomAD_ge_AF_AMR) and (gnomAD_ge_AF_EAS < "$1" or not gnomAD_ge_AF_EAS) and (gnomAD_ge_AF_FIN < "$1" or not gnomAD_ge_AF_FIN) and (gnomAD_ge_AF_NFE < "$1" or not gnomAD_ge_AF_NFE) and (gnomAD_ge_AF_SAS < "$1" or not gnomAD_ge_AF_SAS) and (AA_AF < "$1" or not AA_AF) and (EA_AF < "$1" or not EA_AF) and ((IMPACT is HIGH) or (IMPACT is MODERATE) or (IMPACT is LOW) or (SpliceAI_SYMBOL)) and (gnomAD_ex_nhomalt < 5 or not gnomAD_ex_nhomalt)) or (ClinVar_CLNSIG and not (ClinVar_CLNSIG match benign or ClinVar_CLNSIG match Benign)) or (CLIN_SIG match pathogenic or CLIN_SIG match Pathogenic)"
	}}

	filter_vep_str="$(return_AF_string_w_genomes 0.02)"
	echo "bcftools +fill-tags {input} -- -t AN,AC | filter_vep --format vcf --force_overwrite --filter \\""$filter_vep_str"\\" | bgzip > {output}" | bash
	"""
shell:
	"""

	zcat {input} | sed 's:;AN=:;AN_cohort=:g' | \
	sed 's:ID=AN,:ID=AN_cohort,:g' | \
	sed 's:;AC=:;AC_cohort=:g' | \
	sed 's:ID=AC,:ID=AC_cohort,:g' | \
	sed 's:||:|0|:g'| sed 's:||:|0|:g' | bgzip > {params.vep_renamed}

	split_string1='%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%QUAL\\t%AC_cohort\\t%AN_cohort\\t'
	split_string2='%CSQ\\t'
	split_string3='[%GT\\t%TGT\\t%AD\\t%GQ\\t]'

	header1=$(echo $split_string1 | sed 's:\\\\t:	:g')
	header2=$(bcftools +split-vep {params.vep_renamed} -l | cut -f2 | tr "\\n" "\\t")
	header3=$(bcftools query -u -H -f $split_string3'\\n' {params.vep_renamed} | head -n1 | cut -f2-5) ||  echo ""
	echo "sample	${{header1}}	$header2	$header3" | awk -v OFS="\\t" '$1=$1'  > {output}
	for sample in `bcftools query -l {params.vep_renamed}`; do
		bcftools view -c1 -s $sample {params.vep_renamed}  | \
		bcftools +split-vep -f $split_string1$split_string2$split_string3'\\n' -d -A tab | \
		awk -v var="$sample"  -F $'\t' 'BEGIN {{OFS = FS}} {{print var, $0 }}' >> {output}
	done
	#rm -f {params.vep_renamed}
	"""
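
A variant-level annotation table is written by listing the CSQ subfield names as a semicolon-separated header and appending one line per variant and consequence: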
shell:
	"""
	bcftools +split-vep -l {input} | cut -f2 | tr '\\n' ';' | awk 'BEGIN {{ FS = ";"}} ;{{ print "ID;"$0}}' > {output}
	bcftools +split-vep -d -f'%CHROM:%POS:%REF:%ALT;%CSQ\\n' -A ";" {input} >> {output}
	"""
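
The per-variant alternative allele frequencies (AAFs), presumably used for regenie's gene-based masks, are prepared by an R script: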
script:
	"scripts/make_aaf_anno.R"
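
The individual pieces are then concatenated: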
shell:
	"cat {input} > {output}"
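
For comparing cohort and gnomAD frequencies, a table of MAF, AN and gnomAD genome frequencies is extracted for all variants with at least one alternate allele, using the worst consequence per variant: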
shell:
	"""
	bcftools view --min-ac 1 {input} | \
	bcftools +fill-tags | \
	bcftools +split-vep -f "%CHROM\t%ID\t%MAF\t%AN\t%gnomAD_ge_AF\t%gnomAD_ge_AF_nfe\t%gnomAD_ge_AC\n" --select worst:any |
	gzip > {output}
	"""
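
Cohort allele frequencies and counts are computed with plink2; --freq writes an .afreq file and --freq counts an .acount file: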
shell:
        """
        plink2 --bfile {params.plink_in_prefix} --freq --out {params.freq_out_prefix}
        plink2 --bfile {params.plink_in_prefix} --freq counts --out {params.ac_out_prefix}
        """
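
Cohort frequencies and the annotation-derived AAFs are then merged: for every variant the larger of the two values is used, so a variant that is common in either source is treated as common downstream: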
run:
        import pandas as pd
        import numpy as np
        data_af = pd.read_csv(input.data_af_file, sep=r"\s+")
        data_af = data_af[["ID", "ALT_FREQS"]]
        anno_af = pd.read_csv(input.regenie_aaf_file, sep=" ", names=["snp", "aaf"])
        joined = data_af.join(anno_af.set_index("snp"), on="ID", how="inner")
        # use the larger of the cohort and the annotation frequency per variant
        joined["merged_aaf"] = joined[["ALT_FREQS", "aaf"]].max(axis=1)

        # AC-based sanity check: with 1220 samples, ALT_CTS > 2 already implies a
        # cohort AF above 0.001, so this floor never triggers at this sample size
        data_ac = pd.read_csv(input.data_ac_file, sep=r"\s+")
        data_ac = data_ac[["ID", "ALT_CTS"]]
        joined_AC = joined.join(data_ac.set_index("ID"), on="ID", how="inner")
        joined_AC["merged_aaf"] = np.where((joined_AC.ALT_CTS > 2) & (joined_AC.merged_aaf < 0.001), 0.005, joined_AC.merged_aaf)
        joined_AC.merged_aaf = joined_AC.merged_aaf.round(4)
        joined_AC[["ID", "merged_aaf"]].to_csv(output.data_anno_aaf_file, sep=" ", header=False, index=False, float_format='%.5E')
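
A minimal, self-contained sketch of the merging logic on invented toy values (column names as in the plink2 output):

	import pandas as pd

	data_af = pd.DataFrame({"ID": ["v1", "v2"], "ALT_FREQS": [0.0004, 0.0100]})
	anno_af = pd.DataFrame({"snp": ["v1", "v2"], "aaf": [0.0020, 0.0001]})
	joined = data_af.join(anno_af.set_index("snp"), on="ID", how="inner")
	joined["merged_aaf"] = joined[["ALT_FREQS", "aaf"]].max(axis=1)
	print(joined[["ID", "merged_aaf"]])  # v1 -> 0.002 (annotation wins), v2 -> 0.01 (cohort wins)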
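
The plink files are subset for rvtests by a Python script: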
script: "scripts/plink_subset_for_rvtest.py"
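
The subset is exported to a bgzipped, indexed VCF (id-paste=iid keeps only the within-family IDs as sample names):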
shell:
        """
        plink2 --bfile {params.plink_in_prefix} --export vcf bgz id-paste=iid --out {params.vcf_out_prefix} --allow-extra-chr
        bcftools index -f -t {params.vcf_out_prefix}.vcf.gz
        """
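
A phenotype file in fam layout is built for rvtests; because rvtests expects PLINK-style coding (1 = control, 2 = case), 1 is added to the 0/1 phenotype: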
run:
        import pandas as pd
        fam = pd.read_csv(input.fam, names=["fid", "iid", "fatid", "matid", "sex", "empty"], sep=' ')
        pheno = pd.read_csv(input.phenotypes, names=["fid", "iid", "pheno"], sep='\t', header=0).set_index("iid")
        out = fam.join(pheno.pheno, on="iid")
        # recode 0/1 to the PLINK-style 1/2 coding that rvtests expects
        out.pheno = out.pheno + 1
        out[["fid", "iid", "fatid", "matid", "sex", "pheno"]].to_csv(output.rvtest_pheno_file, sep="\t", header=False, index=False)
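
Gene-based association is tested with rvtests; here the SKAT kernel test is run on the variant sets defined in the setFile: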
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --kernel skat \
        --out {params.out_prefix}
        """
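
The same sets are also tested with the exact CMC burden test: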
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --burden exactCMC \
        --out {params.out_prefix}
        """
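
An identical SKAT call appears again in a separate rule, presumably for a different set definition (e.g. the gene-set based analyses); only the inputs and output prefix differ: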
shell:
        """
        rvtest \
        --inVcf {input.invcf} \
        --pheno {input.rvtest_pheno_file} \
        --setFile {input.setFile_rvtest} \
        --noweb \
        --kernel skat \
        --out {params.out_prefix}
        """
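
For the GECS analysis, plink files carrying the analysis phenotype in the fam are staged, the 'bfile' and 'outx' placeholders of a parameter template are filled in with sed, and the gecs binary (kept outside the repository under resources/.no_upload) is executed: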
shell:
        """
        cp -f {input.phenofam} {params.out_prefix}G.fam
        cp -f {input.bim} {params.out_prefix}G.bim
        cp -f {input.bed} {params.out_prefix}G.bed

        cp {input.temp} {params.out_prefix}.param

        sed -i "s:bfile:{params.out_prefix}G:g" {params.out_prefix}.param
        sed -i "s:outx:{params.out_prefix}:g" {params.out_prefix}.param

        resources/.no_upload/GECS/gecs {params.out_prefix}.param

        rm -f {params.out_prefix}G.b*
        """
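
ID-to-position lookups are extracted from a gzipped dbSNP table; the full table serves as position file and its first (space-delimited) column as the ID list: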
shell:
	"""
	zcat {input} > {output.pos}
	zcat {input} | cut -f1 -d" " > {output.ids}
	"""
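
The array genotypes are restricted to variants with a dbSNP ID, variant positions are updated from the dbSNP table, and the sample IDs get an 'ARRAY' suffix so they cannot collide with the WGS sample IDs when the two datasets are compared: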
shell:
	"""
	plink \
	 --bfile "{params.ar_pl_in}" \
	 --extract "{input.dbSNP_ids}" \
	 --make-bed \
	 --out {params.ar_pl_out}_tmp1

	plink \
	 --bfile {params.ar_pl_out}_tmp1 \
	 --update-map "{input.dbSNP_pos}" 3 1 \
	 --make-bed \
	 --out {params.ar_pl_out}

	awk '{{print $2"ARRAY",$2"ARRAY",$3,$4,$5,$6}}' {params.ar_pl_out}.fam > {params.ar_pl_out}_tmp.fam

	mv {params.ar_pl_out}_tmp.fam {params.ar_pl_out}.fam
	rm -f {params.ar_pl_out}_tmp*
	"""
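
On the WGS side, chromosomes 18-21 are converted from VCF to plink format, variant names are replaced with dbSNP IDs, the 'chr' prefix is stripped from the bim file, and the data are reduced to the variants present in the array dataset: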
shell:
	"""
	echo "start"

	plink \
	 --vcf {input.vcf} \
	 --chr 18,19,20,21 \
	 --vcf-half-call h \
	 --double-id \
	 --make-bed \
	 --out {params.plink_pref}_tmp

	plink \
	--bfile {params.plink_pref}_tmp \
	--update-name "{input.dbSNP_pos}" 1 4 \
	--make-just-bim \
	--out {params.plink_pref}_tmp2

	cat {params.plink_pref}_tmp2.bim | sed "s/chr//g" > {params.plink_pref}_tmp.bim


	plink \
	--fam {params.plink_pref}_tmp.fam \
	--bed {params.plink_pref}_tmp.bed \
	--bim {params.plink_pref}_tmp.bim \
	--extract {params.array_plink}.bim \
	--make-bed \
	--out {params.plink_pref}

	rm -f {params.plink_pref}_tmp*
	""" 
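
KING is run on the combined WGS and array bed files to flag duplicates and first-degree relatives and, in a second call, to estimate kinship coefficients; matching WGS/array pairs of the same individual should show up as duplicates: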
shell:
	"""
	#module load king

	king \
	-b {params.hl_pl}.bed,{params.ar_pl}.bed \
	--duplicate \
	--related \
	--degree 1 \
	--prefix {params.king_prefix}

	king \
	-b {params.hl_pl}.bed,{params.ar_pl}.bed \
	--kinship \
	--prefix {params.king_prefix}_kinship

	"""
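
regenie step 1 fits the leave-one-out whole-genome ridge regressions for the binary traits on the QC-filtered variant set: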
shell:
	"""
	#export LD_LIBRARY_PATH={config[LD_LIBRARY_PATH]}

	regenie \
	  --step 1 \
	  --bed {params.plink_file} \
	  --covarFile {input.pathCov} \
	  --phenoFile {input.pathPheno} \
	  --bt \
	  --extract {input.filtered} \
	  --loocv \
	  --bsize 1000 \
	  --out {params.step1}
	#--extract input.pruned \
	 """
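
The phenotype/covariate table is assembled by an R script: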
script: "scripts/make_phenoCov.R"
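
The variant set for regenie step 1 is produced with plink: autosomal, common (MAF >= 1%), well-genotyped (<= 1% missing) ACGT SNVs passing a lenient HWE filter, with long-range LD regions excluded; the variant IDs are written out for --extract: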
shell:
	"""
	# common, high-quality variants for regenie step 1 (long-range LD regions excluded)
	plink \
	--bed {input.bed} \
	--bim {input.bim} \
	--fam {input.fam} \
	--keep {input.keep_fam} \
	--maf 0.01 \
	--chr 1-22 \
	--geno 0.01 \
	--snps-only just-acgt \
	--hwe 1e-10 \
	--exclude range {input.longrange_LD} \
	--make-bed \
	--out {params.pruned_plink}

	cut -f2 {output.bim} > {output.var_ids}

	"""
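
In parallel, SAIGE is run. step1_fitNULLGLMM.R fits the null logistic mixed model per phenotype, with PC1-PC10, Age, Age2, age_sex and sex_for_fam as covariates and LOCO enabled: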
shell:
	"""
	step1_fitNULLGLMM.R \
	--plinkFile={params.plink_file} \
	--phenoFile={input.phenocov} \
	--phenoCol={wildcards.pheno} \
	--covarColList=PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,Age,Age2,age_sex,sex_for_fam \
	--sampleIDColinphenoFile=IID \
	--traitType=binary \
	--outputPrefix={params.step1} \
	--LOCO=TRUE \
	--IsOverwriteVarianceRatioFile=TRUE \
	--nThreads=4
	"""
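
Genotypes for SAIGE step 2 are exported to bgzipped VCFs, one genome-wide and one restricted to the X chromosome, both indexed with CSI: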
	shell:
		"""
		plink \
		--bfile {params.plink_in} \
		--recode vcf-iid \
		--out {params.vcfprefix}

		cat  {params.vcfprefix}.vcf | bgzip > {output.vcf}
		tabix --csi -pvcf {output.vcf}

		plink \
		--bfile {params.plink_in} \
		--recode vcf-iid \
		--chr X \
		--out {params.vcfprefix_justX}

		cat {params.vcfprefix_justX}.vcf | bgzip > {output.vcf_justX}
		tabix --csi -pvcf {output.vcf_justX}

		"""
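
step2_SPAtests.R then runs the single-variant score tests with saddlepoint approximation, using the null model and variance ratio from step 1 (minimum MAC 3; the LOCO setting is injected via params.loco):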
shell:
	"""

	step2_SPAtests.R \
	--vcfField=GT \
	--GMMATmodelFile={params.saige_step1}.rda \
	--varianceRatioFile={params.saige_step1}.varianceRatio.txt \
	--SAIGEOutputFile={output} \
	--numLinesOutput=1000 \
	--IsOutputNinCaseCtrl=TRUE \
	--IsOutputHetHomCountsinCaseCtrl=TRUE \
	--minMAC=3 \
	--IsOutputAFinCaseCtrl=TRUE \
	--vcfFile={params.in_vcf} {params.loco}	

	"""
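
Per-chunk SAIGE outputs are merged: the header line containing "CHR" is extracted once (the if-wrapper keeps a non-zero grep exit, e.g. from SIGPIPE after head, from aborting the shell), then all data lines are concatenated below it and gzipped: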
shell:
	"""
	if cat {input} | grep "CHR" | head -n1 > {params.header}
	then
	echo "don't worry"
	fi

	cat {input} | grep -v "CHR" | \
	cat {params.header} - |
	gzip > {output}
	"""
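
Relatedness filtering is done in hail: depending on the "cluster" setting in config/config.yaml, the Python script is either submitted to a Spark cluster spun up on top of slurm via spark_submit_command.sh (queue, runtime, driver memory and worker count are set here) or simply run with the local Python interpreter: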
shell:
	"""	
	mkdir -p {params.tmp_dir}

	hail_python_script="workflow/scripts/relatedness_filter.py $(pwd)/{input.GWAS_MT} $(pwd)/{params.tmp_dir} $(pwd)/{input.pheno_file} $(pwd)/{output.kinship_vals} $(pwd)/{output.to_remove}"

	if [ {config[cluster]} = "yes" ]; then
	queue="long"
	hours_to_run=168
	DRIVE_MEM=78
	worker_nodes_excluded={config[worker_nodes_excluded]}
	num_workers=20
	source workflow/scripts/spark_submit_command.sh
	$spark_submit_command $hail_python_script

	else
	python $hail_python_script
	fi

	rm -rf {params.tmp_dir}*
	"""
Repository: https://github.com/Ax-Sch/DeCOI_WGS