Workflow Steps and Code Snippets

186 tagged steps and code snippets that match keyword pLink

A Snakemake pipeline for variant calling of genomic FASTQ data using GATK

 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
shell: """echo '''
Information about where these files fall in the workflow can be seen under the `.smk` files under `workflow/rules`.

The directories for variant calling from `variant_calling.smk` and `phasing.smk` are created in the following order:

├─ alignments  # Duplicate sequences are marked and sorted
|   |
|   ├─ raw  # .bam files after .fastq files aligned to reference
|   |
|   ├─ markdup  # .bam files after running fixing mates pairs and marking duplicate reads
|   |
|   └─ alignments_recalibrated  # BQSR tables applied to sequences
|       |
|       └─ recal_tables  # BQSR recalibration tables that are applied to sequences
|
├─ gvcf  # Called variants as .g.vcf files
|
├─ db  # GenomicsDB datastore consolidating .g.vcf files. Not human-readable
|
├─ joint_call  # db data converted to a joint .vcf file with VQSR applied
|
├─ vcf  # Multisample VCF
|   |
|   ├─ hard_filtered # (Option 1) Filters by hard values
|   |   |
|   |   ├─ filter_applied  # FILTER column filled included those variants that didn't pass
|   |   |
|   |   └─ pass_only  # Only variants with FILTER=PASS
|   |
|   └─ VQSR  # (Option 2) Variant recalibration, filters by using truth and training data
|       |
|       ├─ model  # Models to be applied to VCFs
|       |
|       ├─ filter_applied  # FILTER column filled included those variants that didn't pass
|       |
|       └─ pass_only  # Only variants with FILTER=PASS
|
├─* plink  # Contains plink files
|
├─ genotypes
|   |
|   ├─ posteriors  # Calculate PP tag (posterior probabilites) and recaluclate MQ tag
|   |
|   ├─ filtered  # Set to missing genotypes with MQ < 20
|   |
|   ├─ filtered  # Set to missing genotypes with MQ < 20
|   |
|   └─ subsets  # Taking subsets of samples based on sequencing methods: GBS, WES, WGS
|
└─ haplotypes  # Phased/imputed VCFs
    |
    ├─ scaffolds  # Preliminary VCFs of genotypes incorportating pedigree information
    |
    └─ SHAPEIT4  # Phased/imputed VCFs using scaffold and original VCF


The directories for relations from `relations.smk` are as follows. `plink` is created first. The others do not follow an order.
However, these require the joint vcf file from `joint_call`:

├─ plink  # Contains plink files
|
├─ relatedness  # Pairwise estimates of relatedness between samples
|
├─ admixture  # Finds rates of admixture among samples
|   |
|   ├─ supervised  # Uses additional samples of known origin
|   |
|   └─ unsupervised  # No known origins
|
└─ aims  # Ancestry informative markers

Directories for determining kinship using LASAR and SEEKIN:

└─ kinship  # Allele frequencies and pairwise relatedness

Other:

└─ structural_variants  # Strutural variants

''' > {output}
"""
63
64
65
66
67
68
69
70
71
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset} \
    """
85
86
87
88
89
90
91
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.same_fid \
    """
133
134
135
136
137
138
139
140
141
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset}.chr{wildcards.chr} \
    """
155
156
157
158
159
160
161
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset}.chr{wildcards.chr} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.chr{wildcards.chr}.same_fid \
    """
272
273
274
275
276
    shell: "plink \
                --file  " + config["results"] + " plink/{wildcards.dataset} \
                --out " + config["results"] + " plink/recode12/{wildcards.dataset}_recode12 \
                --recode12"
'''
331
332
333
334
335
shell: "plink \
            --vcf {input.vcf} \
            --recode12 \
            --make-bed \
            --out {config[results]}admixture/supervised/plink/{wildcards.sample}"
364
365
366
367
shell: "plink \
            --bfile " + config["results"] + "admixture/supervised/plink/{wildcards.dataset} \
            --recode \
            --out " + config["results"] + "admixture/supervised/plink/{wildcards.dataset}"

Multi-Step Genomic Analysis Pipeline for DeCOI WGS Cohort

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
library(tidyverse)
library(readxl)
# Params:
# 1. input pheno-excel
# 2. input fam file
# 3. pheno column name in excel
# 4. Output: fam file (Plink)
# 5. Output: pheno file (Regenie)

#testing
#setwd("/media/axel/Dateien/Arbeit_Gen/COVID_WGS/WGS_GWAS_pipeline")
args=c("../../resources/EURs_unrel.tsv","data/anc_vcf/EUR_vcf_not_rel.vcf.bgz.fam", "A1", 
       "data/regenie_pheno/fam_file.fam", "data/regenie_pheno/phenox")
# first: excel_file, second: fam-file, third: Phenotype column in excel

args = commandArgs(trailingOnly=TRUE)
print(args)

# File formats:
#FAM file: 
# A text file with no header line, and one line per sample with the following six fields:
#Family ID ('FID')
#Within-family ID ('IID'; cannot be '0')
#Within-family ID of father ('0' if father isn't in dataset)
#    Within-family ID of mother ('0' if mother isn't in dataset)
#Sex code ('1' = male, '2' = female, '0' = unknown)
#Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

# Regnie pheno file
# FID IID Y1 Y2
#  0=control, 1=case, missing values must be coded as NA

samplesheet<-read_tsv(args[1])%>%
  mutate(sex_for_fam=ifelse(is_female==TRUE, 2, ifelse(is_female==FALSE,1,0)))

pheno_col_num<-which(colnames(samplesheet) == args[3])
pheno_PLINK<-as.integer(unlist(samplesheet[,pheno_col_num]))+1
samplesheet$pheno_P=replace_na(pheno_PLINK, -9)
pheno_REGENIE<-as.integer(unlist(samplesheet[,pheno_col_num]))
samplesheet$pheno_R=pheno_REGENIE

# add sex stratification / age strat
samplesheet <- samplesheet %>%
  mutate(pheno_R_female=ifelse(is_female==TRUE, pheno_R, NA), # just keep females
         pheno_R_male=ifelse(is_female==FALSE, pheno_R, NA), # just keep males
         pheno_R_GE60=ifelse(pheno_R==1 & Age < 60, NA, pheno_R),  # set cases 59 or younger to missing
         pheno_R_LT60=ifelse(pheno_R==1 & Age >= 60, NA, pheno_R),
         pheno_R_GE60CC=ifelse(Age < 60, NA, pheno_R),  # set persons 59 or younger to missing
         pheno_R_LT60CC=ifelse(Age >= 60, NA, pheno_R)) # set persons 60 or older to missing 

col_names_fam<-c("FID","IID","father","mother","sex","pheno")

# sex coding: 1=male, 2=female
# pheno coding: 1=control, 2=case
fam<-read_delim(args[2],
delim=" ", 
col_names=col_names_fam,
col_types=c(
  FID = col_character(),
  IID = col_character(),
  father = col_character(),
  mother = col_character(),
  sex = col_integer(),
  pheno = col_integer()
)
)

print("fam:")
head(fam)
print("samplesheet:")
head(samplesheet)

all_info_joined<-fam %>% left_join(samplesheet, by=c("IID"="s"))%>%
  mutate(sex=sex_for_fam)

fam_new<-all_info_joined %>% 
  select(all_of(col_names_fam[1:4]), sex_for_fam, pheno_P)
write_tsv(x = fam_new, file = args[4], col_names = FALSE) # fam_file for plink

pheno_file_regenie <- all_info_joined %>%
  select(FID, IID, pheno_R, pheno_R_female, pheno_R_male, pheno_R_GE60, pheno_R_LT60, pheno_R_GE60CC, pheno_R_LT60CC)

colnames(pheno_file_regenie)<-c("FID","IID",
                                args[3],
                                paste0(args[3],"_female"),
                                paste0(args[3],"_male"),
                                paste0(args[3],"_GE60"),
                                paste0(args[3],"_LT60"),
                                paste0(args[3],"_GE60CC"),
                                paste0(args[3],"_LT60CC"))

print("out:")
head(pheno_file_regenie)

write_tsv(x = pheno_file_regenie, file = args[5], col_names = TRUE) # pheno file for regenie
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import pandas as pd
import pandas_plink
from xarray import DataArray
import shutil
import dask.array

# %%
mask_def_file = snakemake.input.regenie_mask_def
aaf_file = snakemake.input.data_anno_aaf_file
anno_file = snakemake.input.regenie_anno_file

af_cutoffs = list(map(float, snakemake.params.af_cutoffs))
plink_in_prefix = snakemake.params.plink_in_prefix

out_prefix = snakemake.params.plink_out_folder
set_file = snakemake.output.setFile_rvtest

debug_joined_out = snakemake.output.debug_joined_out
#%%

# read plink with pandas_plink
(bim, fam, bed) = pandas_plink.read_plink(plink_in_prefix, verbose=False)
bim.cm = bim.cm.astype(int)
#%%
# Read regenie aaf_file, anno_file, mask_def
aafs = pd.read_csv(aaf_file, sep=' ', names=['ID', 'aaf'])

mask_def = pd.read_csv(mask_def_file, sep=' ', names=['mask', "annots"])
mask_def.annots = mask_def.annots.map(lambda a: a.split(','))
mask_def = mask_def.set_index("mask")

anno = pd.read_csv(anno_file, sep=' ', names=['ID', 'gene', 'annot'])
anno = anno.dropna()
# %%
# Create M0-M4 True/False columns in table
for mask, annots in mask_def.annots.items():
    anno[mask] = anno.annot.isin(annots)

# Join AF info into table
anno_aaf = anno.join(aafs.set_index("ID"), on="ID", how="outer")

# %%

# join bim (variant table) with regenie anno
joined = bim.join(anno_aaf.set_index("ID"), on="snp", how="inner")
joined = joined.sort_values(by=['gene', 'pos'])

joined.to_csv(debug_joined_out, sep="\t")

#%%

for mask in mask_def.index.unique():
    for af_cutoff in af_cutoffs:
        mask_out_prefix = f"{out_prefix}{mask}_{af_cutoff}"
        # subset variant table
        subset = joined[joined[mask] & (joined["aaf"] < af_cutoff)]
        # export genotype data (subset["i"] contains positions of variants in table)
        pandas_plink.write_plink1_bin(
            DataArray(bed[subset["i"]], dims=["variant", "sample"]),
            bed=mask_out_prefix+".bed"
            )
        # write variant subset to bim file
        subset[["gene","snp","cm","pos","a0","a1"]].to_csv(mask_out_prefix+".bim", sep=" ", header=False, index=False)
        # reuse original fam file
        shutil.copy(plink_in_prefix + ".fam", mask_out_prefix +".fam")

#%%

# create setfile, each gene has its own "chromosome"/contig
set_str = ""
for gene in joined.gene.unique():
    set_str += f"{gene}\t{gene}:0-{joined.pos.max() + 100}\n"

with open(set_file, "wt") as f:
    f.write(set_str)
tool / biotools

pLink

A high-speed search engine pLink 2 with systematic evaluation for proteome-scale identification of cross-linked peptides.