GWAS Meta-Analysis Pipeline with Snakemake and Singularity


Prerequisites

  1. Snakemake

  2. Singularity

Run

  1. Modify the config.yaml file

  2. Variant QC

  3. Single-cohort association

  4. Meta-analysis

  5. Scan signal regions

  6. Plot signals

  7. Select independent signals

1. Modify the config.yaml file

You'll need to provide the following information:

input: Path to the input phenotype files, including the cohort, group, and phenotype wildcards.
The format of the phenotype file needs to be as shown below.

ID res zres
SampleA 0.5532 0.312
SampleB 0.323 0.150

The third column will be used as the phenotype value.

cohorts: For each cohort, the plink binary prefix (bfile) and the path to the genetic-relatedness matrix (grm) created using GCTA are required.

group: An identifier for the phenotype group. This currently has no functional importance other than being used as a prefix for output file names.

phenotypes: Either a list of phenotype names or a path to a file containing such a list; these names are combined with the input value to locate the phenotype files. A sketch of a complete config.yaml follows below.
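
A minimal sketch of what config.yaml might look like, pieced together from the keys described above (the paths, cohort names, and exact nesting are illustrative assumptions, not the pipeline's verified schema; the QC thresholds used in step 2 also live in this file):

input: "data/pheno/{cohort}.{group}.{phenotype}.txt"
cohorts:
  cohortA:
    bfile: "data/geno/cohortA"   # plink .bed/.bim/.fam prefix
    grm: "data/grm/cohortA"      # GCTA genetic-relatedness matrix prefix
  cohortB:
    bfile: "data/geno/cohortB"
    grm: "data/grm/cohortB"
group: "serum"
phenotypes:
  - "triglycerides"
  - "ldl"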

2. Variant QC

This step filters the genotype data according to the thresholds specified in the config.yaml file.

snakemake --cores 1 --use-singularity all1
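
To preview the jobs this step would schedule without running anything, Snakemake's standard dry-run flag applies here (and to every other target below):

snakemake --cores 1 --use-singularity --dry-run all1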

3. Single-cohort association

Run single-cohort single-point association analysis using GCTA-MLMA.

snakemake --cores 1 --resources rate_limit=30 --use-singularity all2

4. Meta-analysis

Run single-point association meta-analysis using METAL.

snakemake --cores 1 --use-singularity create_all_metal

5. Scan signal regions

Scan the meta-analysis results to extract significant signal regions.

snakemake --cores 1 --use-singularity detect_all_peaks

6. Plot signals

Run PeakPlotter on all signal regions to create regional plots and annotated data.

snakemake --cores 1 --resources rate_limit=30 --use-singularity collect_all_peak_csvs

7. Select independent signals

Run LD-clumping and GCTA-COJO to identify independent signals within all signal regions.

snakemake --cores 1 --use-singularity run_all_cojo

Questions

Q. Why do the freq and freq_geno column values in the .jma.cojo file differ?

A. freq_geno is the frequency of the refA column allele in the input bfile (you can check this with plink --freq). freq is taken verbatim from the input cojofile, which was created from the corresponding METAL output file; it therefore comes from that file's Alt_Freq column, the "weighted average of frequency for Alt allele across all studies". In short, freq_geno is simply the allele frequency of the variant in the genotype file (plink bfile) combined from all cohorts, whereas freq is the across-cohort weighted average computed by METAL. A worked check is sketched below.
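
For example, a freq_geno value can be verified by hand like this (the bfile path and variant ID are illustrative, not paths the pipeline guarantees):

# Allele frequencies of the combined genotype file
plink --bfile output/bfile/combined --freq --out combined
# Compare this variant's A1/MAF against refA/freq_geno in the .jma.cojo file
grep -w 'chr1:123456' combined.frq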

Q. When I try to run a rule, I get an error saying "Text file busy". What do I do?

A. Delete the script and restore it using git restore workflow/script/problematic_script.sh, as shown below. Your rules should run normally after doing this.
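
Spelled out (the script path is the placeholder from the answer above; substitute whichever script triggered the error):

rm workflow/script/problematic_script.sh
git restore workflow/script/problematic_script.sh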

Snakefile order

  1. read-config.smk

  2. variant-qc.smk

  3. single-cohort.smk

  4. meta-analysis.smk

  5. detect-peaks.smk

  6. peakplot.smk

  7. cojo.smk

  8. query.smk

  9. gwas.smk
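
A sketch of how the top-level Snakefile presumably stitches these together with Snakemake's include directive, in the order listed (the rules/ directory comes from the snippet attributions below; the exact top-level layout is an assumption):

include: "rules/read-config.smk"
include: "rules/variant-qc.smk"
include: "rules/single-cohort.smk"
include: "rules/meta-analysis.smk"
include: "rules/detect-peaks.smk"
include: "rules/peakplot.smk"
include: "rules/cojo.smk"
include: "rules/query.smk"
include: "rules/gwas.smk"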

Code Snippets

Excerpts from the workflow's rule files and scripts:

From rules/cojo.smk, lines 32–47:
run:
    # Combine all per-peak .jma.cojo outputs into one table, flagging each
    # peak's top hit. (Assumes `re` and `pandas as pd` are imported at the
    # top of the Snakefile.)
    concat_list = list()
    for f in input:
        # Recover '<group>.<phenotype>' from the '.../cojo/<group>.<phenotype>/...' path
        string = re.sub(r'.*\/cojo\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace(f'{string}.', '').replace('.jma.cojo', '')
        df = pd.read_csv(f, sep = '\t', header = 0)
        min_p = df['p'].min()
        df['is cojo tophit'] = False
        df.loc[df['p'] == min_p, 'is cojo tophit'] = True
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    pd.concat(concat_list).to_csv(output[0], index = False, header = True, compression = 'gzip')

From rules/cojo.smk, lines 78–86:
shell:
    """
    workflow/scripts/make_cojofile.sh \
        {params.group} \
        {params.phenotype} \
        {params.bfile} \
        {input.metal} \
        {params.prefix} > {log} 2>&1
    """

From rules/cojo.smk, lines 114–128:
shell:
    """
    workflow/scripts/run_cojo.sh \
        {params.bfile} \
        {input.cojofile} \
        {input.metal} \
        {params.threshold} \
        {params.group} \
        {params.phenotype} \
        {params.chrom} \
        {params.start} \
        {params.end} \
        {params.prefix} \
        {resources.mem_mb} > {log} 2>&1
    """

Lines 27–37:
run:
    all_df = list()
    for f in input:
        try:
            df = pd.read_csv(f, sep = '\t', header = None)
        except pd.errors.EmptyDataError:
            continue
        all_df.append(df)
    all_df = pd.concat(all_df)
    all_df.sort_values([0, 1, 2, 3, 4], inplace = True)
    all_df.to_csv(output[0], sep = '\t', header = False, index = False)

Lines 58–59:
shell:
    "python3 workflow/scripts/collect_peaks.py {input} {params.span} {params.signif} {params.group} {params.phenotype} {output} 2>&1 > {log}"

Lines 43–78:
shell:
    """
    # Create command file
    echo 'SEPARATOR TAB'  > {output.cmd}
    echo 'MARKER SNP'     >> {output.cmd}
    echo 'ALLELE A1 A2'   >> {output.cmd} 
    echo 'FREQLABEL Freq' >> {output.cmd}
    echo 'EFFECT b'       >> {output.cmd}
    echo 'STDERR se'      >> {output.cmd}
    echo 'PVALUE p'       >> {output.cmd}
    echo 'SCHEME STDERR'  >> {output.cmd}
    echo 'AVERAGEFREQ ON' >> {output.cmd}
    echo 'MINMAXFREQ ON'  >> {output.cmd}
    for f in {input} ; do
      echo \"PROCESS $f\" >> {output.cmd}
    done
    echo 'OUTFILE {params.prefix}. .txt' >> {output.cmd}
    echo 'ANALYZE' >> {output.cmd}

    metal < {output.cmd} > {log} 2>&1

    ## CLEAN, COMPRESS, INDEX
    cat \
      <(echo "Chrom MarkerName Pos Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P-value Direction" | tr ' ' '\\t') \
      <(tail -n+2 {params.prefix}.1.txt | sed 's/^chr//' | tr ':' '\\t' | awk 'BEGIN{{FS=\"\\t\";OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$2,$2,toupper($3),toupper($4),$5,$6,$7,$8,$9,$10,$11,$12}}' | sort -k1,1n -k3,3n) \
      | bgzip > {output.metal} 2>> {log}

    tabix -s1 -b3 -e3 -S1 {output.metal} 2>> {log}
    rm {params.prefix}.1.txt

    ## Modify info file
    mv {params.prefix}.1.txt.info {output.info}

    sed -i 's,{params.prefix}.1.txt,{output.metal},' {output.info}
    sed -i 's/Marker    /MarkerName  /' {output.info}
    """

Lines 102–134:
shell:
    """
    mergelist=$(mktemp /tmp/single-point-analysis-pipeline.merge_bfiles.XXXXXX)
    exec 3>\"$mergelist\"
    echo {params.input} | tr ' ' '\\n' >&3

    plink --allow-no-sex --threads {threads} --make-bed \
      --merge-list $mergelist \
      --out {params.out} || true # || true to prevent exiting due to bash strict mode

    # If merge failed due to multiallelics, exclude multiallelics and re-merge
    if [ -f "{output.missnp}" ]
    then
      exec 3>\"$mergelist\"
      for bfile in {params.input}
      do
        tmp_bfile=${{bfile}}-combine-tmp
        plink --allow-no-sex --threads {threads} --make-bed \
          --bfile $bfile \
          --exclude {output.missnp} \
          --out $tmp_bfile
        echo $tmp_bfile >&3
      done

      plink --allow-no-sex --threads {threads} --make-bed \
        --merge-list $mergelist \
        --out {params.out}
    else
      touch {output.missnp}
    fi
    rm output/bfile/*-combine-tmp.*
    exec 3>&-
    """

Lines 148–156:
shell:
    """
    plink --threads {threads} \
        --bfile {params.bfile} \
        --freq \
        --out {params.bfile}

    awk -F '[[:space:]]+' '{params.awk}' {output.frq} > {output.frq2}
    """

Lines 185–195:
shell:
    """
    mac_threshold=$(awk -F' ' -v id='{params.id}' '{{if ($1==id){{print $4}}}}' {input.mac})

    awk -F ' ' -v mac_threshold=\"$mac_threshold\" '{{if ($2<=mac_threshold){{print $1}}}}' {params.bfile}.frq2 > {output.excludelist}

    plink --allow-no-sex --threads {threads} --make-bed \
          --bfile {params.bfile} \
          --exclude {output.excludelist} \
          --out {params.out}
    """

Lines 206–210:
shell:
    """
    zgrep -v -w -f <(cat {input.missnp} {input.excludelist}) {input.metal} | grep -v na | bgzip > {output.gz}
    tabix -s1 -b3 -e3 -S1 {output.gz}
    """

Lines 227–241:
shell:
    """
    run_manqq.R \
    --chr-col Chrom \
    --pval-col P-value \
    --pos-col Pos \
    --a1 Alt \
    --a2 Ref \
    --build 38 \
    --image png \
    --af-col Alt_Freq \
    --maf-filter {params.maf} \
    {input} \
    {params.out_prefix} > {log} 2>&1
    """

Lines 27–46:
run:
    # (Assumes `pandas as pd` and `pathlib.Path` are imported at the top of the Snakefile.)
    peaks = pd.read_csv(input[0])
    novel = peaks['ensembl_consequence']=='novel'
    subset = peaks.loc[novel, ['chrom', 'ps', 'a1', 'a2']].drop_duplicates()

    # Create POST input data
    variants = [f"{chrom} {pos} . {ref}  {alt}  . . ." for _, (chrom, pos, ref, alt) in subset.iterrows()]
    max_limit = 200
    chunks = [variants[i:i+max_limit] for i in range(0, len(variants), max_limit)]
    path = Path(output[1])
    path.mkdir(parents=True, exist_ok=True)
    files = list()
    for idx, chunk in enumerate(chunks):
        filename = path.joinpath(f"chunk_{idx}.txt")
        files.append(filename)
        with open(filename, 'w') as f:
            for line in chunk:
                f.write(line)
                f.write('\n')
    pd.DataFrame(files).to_csv(output[0], index = False, header = False)

Lines 64–78:
run:
    concat_list = list()
    for f in input:
        string = re.sub(r'.*\/peaks\/', '', f)
        string = re.sub(r'\/.*', '', string)
        group, phenotype = string.split('.')
        peak = f.split('/')[-1].replace('.500kb.csv', '')
        df = pd.read_csv(f, header = 0)
        df.insert(0, 'peak', peak)
        df.insert(0, 'phenotype', phenotype)
        df.insert(0, 'group', group)
        concat_list.append(df)
    all_data = pd.concat(concat_list)
    all_data.to_csv(output[0], index = False, header = True, compression='gzip')
    all_data[all_data['p-value']<=params.p_threshold].to_csv(output[1], index = False, header = True, compression='gzip')

Lines 100–118:
shell:
    """
    peakplotter-region  \
      --chr-col Chrom \
      --pos-col Pos \
      --rs-col MarkerName \
      --pval-col P-value \
      --a1-col Allele1 \
      --a2-col Allele2 \
      --maf-col Freq1 \
      --build 38 \
      --assoc-file {input.metal} \
      --bfiles {params.bfile} \
      --out {params.outdir} \
      --chrom {params.chrom} \
      --start {params.start} \
      --end {params.end} \
      --debug
    """

Lines 54–72:
shell:
    """
    pheno_file={params.outprefix}.pheno
    mlma={params.outprefix}.mlma
    mlma_bgz={params.outprefix}.mlma.gz

    awk '{{OFS=\"\\t\"}}NR!=1{{print $1,$1,$3}}' {input.phenotype} > $pheno_file 2> {log}

    gcta64 --mlma \
            --bfile {params.bfile} \
            --grm {params.grm} \
            --pheno $pheno_file \
            --out {params.outprefix} \
            --threads {threads} >> {log} 2>&1
    bgzip -c $mlma > $mlma_bgz 2>> {log}
    tabix --skip-lines 1 --sequence 1 --begin 3 --end 3 $mlma_bgz >> {log} 2>&1

    rm $pheno_file $mlma {params.outprefix}.log >> {log} 2>&1
    """

Lines 91–106:
shell:
    """
    run_manqq.R \
      --chr-col Chr \
      --pval-col p \
      --pos-col bp \
      --a1 A1 \
      --a2 A2 \
      --build 38 \
      --image png \
      --af-col Freq \
      --no-man \
      --maf-filter {params.filter} \
      {input} \
      {params.prefix} > {log.out} 2> {log.err}
    """

Lines 19–25:
shell:
    """
    plink \
      --bfile {params.input} \
      --hardy \
      --out {params.output}
    """

Lines 31–34:
shell:
    """
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF<={params}) {{print $2}}}}' > {output}
    """

Lines 43–49:
shell:
    """
    plink \
      --bfile {params.input} \
      --missing \
      --out {params.output}
    """

Lines 55–58:
shell:
    """
    tail -n+2 {input} | tr -s ' ' | awk '{{if ($NF>={params}) {{print $2}}}}' > {output}
    """

Lines 66–67:
shell:
    "cat {input} > {output}"

Lines 83–92:
shell:
    """
    plink \
      --bfile {params.bfile} \
      --exclude {input.exclude_list} \
      --make-bed \
      --out {params.out} \
      --threads {threads}
    awk -F$'\\t' 'BEGIN{{OFS=\"\\t\"}}{{print $1,\"chr\"$1\":\"$4,$3,$4,$5,$6}}' {params.out}.bim | sponge {params.out}.bim
    """

Lines 101–106:
run:
    # Derive a per-cohort MAF threshold from the configured MAC threshold:
    # maf_threshold = mac_threshold / (2 * n_samples). For example, a MAC
    # threshold of 10 in a cohort of 1,000 samples gives 10 / 2000 = 0.005.
    data = pd.read_csv(input[0], sep = '\t')
    n = data.shape[0]
    max_mac = n * 2
    threshold = params.threshold / max_mac
    pd.DataFrame([[params.id, n, max_mac, threshold]]).to_csv(output[0], sep = ' ', index = False, header = False)

Line 112:
shell: "cat {input} > {output}"

Lines 122–142:
run:
    # Sum per-cohort sample sizes to derive the meta-analysis-level
    # MAF threshold for each group/phenotype pair.
    mac = pd.read_csv(
        input[0],
        sep = ' ',
        header = None,
        names = ['id', 'n', 'max_mac', 'maf_threshold'])
    mac['cohort'] = [i[0] for i in mac['id'].str.split('.')]
    mac['group_pheno'] = ['.'.join(i[1:]) for i in mac['id'].str.split('.')]

    group_phenos = set(mac['group_pheno'])

    meta_mac = list()
    for gp in group_phenos:
        combined_n = mac.loc[mac['group_pheno']==gp, 'n'].sum()
        combined_max_mac = combined_n * 2
        combined_threshold = params.threshold / combined_max_mac
        meta_mac.append([gp, combined_n, combined_max_mac, combined_threshold])

    mac.drop(columns = ['cohort', 'group_pheno'], inplace = True)
    all_mac = pd.concat([mac, pd.DataFrame(meta_mac, columns = ['id', 'n', 'max_mac', 'maf_threshold'])])
    all_mac.to_csv(output[0], sep = ' ', index = False, header = True)

From workflow/scripts/collect_peaks.py, lines 3–59:
__version__ = '0.2'

import sys

import pandas as pd
from peakplotter.peakit import peakit
from peakplotter.plotpeaks import read_assoc, get_signals
from peakplotter.test_utils import get_test_logger

assocfile = sys.argv[1]
flank_bp = int(sys.argv[2])
signif = float(sys.argv[3])
group = sys.argv[4]
phenotype = sys.argv[5]
output = sys.argv[6]


chr_col = 'Chrom'
pos_col = 'Pos'
rs_col = 'MarkerName'
pval_col = 'P-value'
a1_col = 'Allele1'
a2_col = 'Allele2'
maf_col = 'Freq1' # Frequency of A1
logger = get_test_logger()


print(f'Running collect_peaks.py (v{__version__})')
print(f'''
chr_col = '{chr_col}'
pos_col = '{pos_col}'
rs_col = '{rs_col}'
pval_col = '{pval_col}'
a1_col = '{a1_col}'
a2_col = '{a2_col}'
maf_col = '{maf_col}'
''')

assoc = read_assoc(assocfile, chr_col, pos_col, pval_col, maf_col, rs_col, a1_col, a2_col, logger)
print('Getting signals')
signals = get_signals(assoc, signif, chr_col, pos_col, pval_col)
print(signals)
if signals.empty:
    print("No peaks found. Exiting.")
    pd.DataFrame().to_csv(output, sep = '\t', header = False, index = False)
    sys.exit(0)
print('Making peak_collections')
peak_collections = peakit(signals, pval_col, chr_col, pos_col, flank_bp)
print('peak_collections before merge')
print(peak_collections)
peak_collections.merge()
print('peak_collections after merge')
print(peak_collections)
peaked = peak_collections.data
peaked.insert(0, 'phenotype', phenotype)
peaked.insert(0, 'group', group)
peaked.to_csv(output, sep = '\t', header = False, index = False)

From workflow/scripts/make_cojofile.sh, lines 11–58:
make_cojofile=v0.0.2
group=$1
phenotype=$2
bfile=$3
metal_file=$4
prefix=$5

echo "=== Running make_cojofile.sh ===
make_cojofile: $make_cojofile
group: $group
phenotype: $phenotype
bfile: $bfile
metal_file: $metal_file
prefix: $prefix
================================"

samplesize=$prefix.samplesize
plink --bfile $bfile --missing --out $prefix
awk -F '[[:space:]]+' '{if(NR!=1){print $3, $5-$4}}' $prefix.lmiss > $samplesize


tmpfile=$(mktemp /tmp/make_cojofile.XXXXXX) # https://unix.stackexchange.com/a/181938
echo "[INFO] Creating temporary file: $tmpfile"
exec 3>"$tmpfile"

join \
  -j 1 \
  -o 0 1.2 1.3 1.4 1.5 1.6 1.7 2.2 \
  <(zcat $metal_file \
      | awk '{if(NR!=1){print "chr"$1":"$3, $4, $5, $6, $10, $11, $12}}' \
      | sort -k 1b,1 \
   ) \
  <(sort -k 1b,1 $samplesize) >&3
# 'sort -k 1b,1' is the recommended sort command if joining on the first field using 'join'
# See 'man join'

cat \
  <(echo "SNP A1 A2 freq b se p N") \
  <(sed 's/:/ /' $tmpfile \
    | sort -n -k1.4 -k2 \
    | sed 's/ /:/' \
   ) \
  | gzip > $prefix.ma.gz

exec 3>&-
# rm $tmpfile
# rm $prefix.{lmiss,samplesize}
rm $prefix.{imiss,nosex}
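
The result is a GCTA-COJO .ma file; its first lines would look like this (the data row is invented, purely to show the shape):

SNP A1 A2 freq b se p N
chr1:721290 G C 0.412 0.0087 0.0153 0.57 4823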

From workflow/scripts/run_cojo.sh, lines 2–94:
run_cojo=v0.3.0
bfile=$1
cojofile=$2
metal_file=$3
threshold=$4 # p-value threshold
group=$5
phenotype=$6
chrom=$7
start=$8
end=$9
prefix=${10} # output prefix
memory=${11}


echo "=== Running run_cojo.sh ===
run_cojo: $run_cojo
group: $group
phenotype: $phenotype
bfile: $bfile
chrom: $chrom
start: $start
end: $end
threshold: $threshold
cojofile: $cojofile
metal_file: $metal_file
prefix: $prefix
===========================
"
# TODO: Find a way to insert the $threshold variable
# inside the below awk command.

# See Plink 1.9's .qassoc file format for more details
qassoc=$prefix.qassoc
echo "[$(date), Step 1/4] Creating $qassoc"
tabix $metal_file ${chrom}:${start}-${end} | \
    awk 'BEGIN{OFS="\t";print "CHR\tSNP\tBP\tBETA\tSE\tR2\tT\tP"} \
      {
        split($12, a, "[eE]") 
        if($12<1e-6 || (length(a)>1 && a[2] < -6)){
          print $1, "chr"$1":"$3, $3, $10, $11, 1, 1, $12}
      }' > $qassoc


echo
echo "Extracted $(( $(wc -l < $prefix.qassoc) - 1 )) SNP(s)."
echo

## Subset bfile
echo "[$(date), Step 2/4] Subset bfile"
plink --allow-no-sex --make-bed \
    --bfile $bfile \
    --chr $chrom \
    --from-bp $start \
    --to-bp $end \
    --memory $memory \
    --out $prefix

echo -e "\n\n"
echo "[$(date), Step 3/4] Clumping"
## CLUMP from merged file 
plink \
  --bfile $prefix \
  --clump $qassoc \
  --clump-kb 1000 \
  --clump-r2 0.05 \
  --memory $memory \
  --out $prefix

clumpedlist=$prefix.clumped.list
awk -v threshold=$threshold '{if($5<threshold && $1~/^[0-9]+$/){print $3}}' $prefix.clumped > $clumpedlist



echo -e "\n\n"
echo "[$(date), Step 4/4] GCTA-COJO"
gcta64 \
  --bfile $prefix \
  --cojo-file <(zcat $cojofile) \
  --extract $clumpedlist \
  --out $prefix \
  --cojo-slct \
  --cojo-p $threshold \
  --cojo-wind 10000 \
  --diff-freq 0.2 \
  --cojo-collinear 0.9 || true

echo -e "\n\n"

echo "[$(date)] Done"

rm -f \
  $prefix.nosex \
  $prefix.clumped.list

Repository: https://github.com/hmgu-itg/single-point-analysis-pipeline