Use an ensemble of variant callers to call variants from ATAC-seq data


varCA

A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.

The entire pipeline is made up of two smaller subworkflows. The prepare subworkflow calls each variant caller and prepares the resulting data for use by the classify subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.

Code Ocean

Using our Code Ocean compute capsule, you can execute VarCA v0.2.1 on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the prepare subworkflow and the classify subworkflow in the rules README.

download

Execute the following command or download the latest release manually.

git clone https://github.com/aryarm/varCA.git

Also consider downloading the example data:

cd varCA
wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -

setup

The pipeline is written as a Snakefile, which can be executed via Snakemake. We recommend installing version 5.18.0:

conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'

We highly recommend installing Snakemake via conda as shown above, so that you can use the --use-conda flag when calling snakemake and let it automatically handle all of the pipeline's dependencies. Otherwise, you must manually install the dependencies listed in the env files.
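The run.bash script described below is the recommended way to launch the pipeline, but if you invoke snakemake directly, the flag is used as in this minimal sketch (run from the repository root with the snakemake environment active; the -j value is only an illustration, so set it to the number of cores you want to use):

snakemake --use-conda -j 4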

execution

  1. Activate snakemake via conda:

    conda activate snakemake
    
  2. Execute the pipeline on the example data

    Locally:

    ./run.bash &
    

    or on an SGE cluster:

    ./run.bash --sge-cluster &
    

Output

VarCA will place all of its output in a new directory (out/, by default). Log files describing the progress of the pipeline will also be created there: the log file contains a basic description of the progress of each step, while the qlog file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the rules README.
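To follow the pipeline's progress while it runs, you can tail these log files. This is only a sketch: it assumes the default out/ directory and that the files are named log and qlog as described above, so adjust the paths if you have changed the defaults.

tail -f out/log
tail -f out/qlog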

Executing the pipeline on your own data

You must modify the config.yaml file to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.

Executing each portion of the pipeline separately

The pipeline is made up of two subworkflows. These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the rules README for execution instructions and a description of the outputs. You will need to execute the subworkflows separately if you ever want to create your own trained models.
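As a rough sketch, a subworkflow can be run on its own by pointing Snakemake at its Snakefile with the -s flag. The file paths below are assumptions for illustration only; the rules README documents the actual Snakefile names and the config each subworkflow expects.

snakemake -s rules/prepare.smk --use-conda -j 4
snakemake -s rules/classify.smk --use-conda -j 4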

Reproducing our results

We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow these steps to create all of the plots and tables in our paper.

If this is your first time using Snakemake

We recommend that you run snakemake --help to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the -n -p -r flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow, i.e. the all rule).
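For instance, a dry run that also prints each rule's shell command and the reason it would be executed might look like this (run from the repository root with the snakemake environment active):

snakemake -n -p -r --use-conda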

Note that Snakemake will not recreate output that it has already generated, unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will simply pick up where the previous run left off. This also applies to files that you create yourself and provide in place of the files it would have generated.

By default, the pipeline will automatically delete some files it deems unnecessary (e.g., unsorted copies of a BAM). You can opt to keep these files instead by providing the --notemp flag to Snakemake when executing the pipeline.
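Because run.bash (described below) forwards its arguments directly to snakemake, one way to keep these intermediate files is the following sketch:

./run.bash --notemp &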

files and directories

Snakefile

A Snakemake pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:

  1. the prepare subworkflow, which prepares the reads for classification, and

  2. the classify subworkflow, which creates a VCF containing predicted variants

rules/

Snakemake rules for the prepare and classify subworkflows. You can either execute these subworkflows from the master Snakefile or individually as their own Snakefiles. See the rules README for more information.

configs/

Config files that define options and input for the pipeline and the prepare and classify subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out the config file for the pipeline.

callers/

Scripts for executing each of the variant callers used by the prepare subworkflow. Small pipelines can be written for each caller by following a special naming convention. See the caller README for more information.

breakCA/

Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from @Arkosen's BreakCA code.

scripts/

Various scripts used by the pipeline. See the script README for more information.

run.bash

An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.
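For example, you might pass Snakemake flags through the wrapper like this (a sketch; any valid snakemake options should work the same way):

./run.bash -n          # preview the jobs without executing anything
./run.bash -j 12 &     # limit Snakemake to 12 concurrent jobs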

citation

There is an option to "Cite this repository" on the right sidebar of the repository homepage.

Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621

Code Snippets

shell:
    "zcat {input} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|^({params.callers})~' | gzip > {output}"
shell:
    "zcat {input} | scripts/filter.bash {params.expr:q} | gzip > {output}"
shell:
    "zcat {input} | scripts/cgrep.bash - -E '^CHROM$|^POS$|~CLASS:' | gzip > {output}"
shell:
    "paste "
    "<(zcat {input.annot} | cut -f 3- | scripts/cgrep.bash - -v '{params.truth}') "
    "<(zcat {input.annot} | cut -f 3- | scripts/cgrep.bash - '{params.truth}') | "
    "sed 's/^\\t//' | paste <(zcat {input.tsv} | "
    "scripts/cgrep.bash - -v '~CLASS:' | "
    "scripts/cgrep.bash - -Evx '(CHROM|POS|REF)') - | gzip > {output}"
shell:
    "Rscript scripts/train_RF.R {input} {params.balance} {output}"
shell:
    "Rscript scripts/tune_plot.R {input} {output}"
shell:
    "Rscript scripts/predict_RF.R {input.predict} {input.model} {output}"
shell:
    "cat {input.predict} | paste <(zcat {input.annot}) "
    "<(read -r head && echo \"$head\" | tr '\\t' '\\n' | "
    "sed 's/response/CLASS:/' | sed 's/^/varca~/' | "
    "paste -s && cat) | gzip > {output}"
shell:
    "paste "
    "<(zcat {input.results} | scripts/cgrep.bash - '{params.truth}~CLASS:') "
    "<(zcat {input.results} | scripts/cgrep.bash - '{wildcards.caller}~CLASS:') "
    "<(zcat {input.predicts} | scripts/cgrep.bash - -F '{wildcards.caller}~{params.predict_col}')"
    " | tail -n+2 | scripts/metrics.py -o {output} {params.ignore_probs} {params.flip}"
shell:
    "paste "
    "<(zcat {input.annot} | scripts/cgrep.bash - '{params.truth}~CLASS:') "
    "<(zcat {input.predicts} | scripts/cgrep.bash - -F '{wildcards.caller}~{params.predict_col}') | "
    "tail -n+2 | scripts/statistics.py -o {output} {params.flip} {params.thresh}"
shell:
    "scripts/prc.py {output} {params.pts} {params.curves}"
shell:
    "scripts/2vcf.py -o {output} {params.callers} {input.results} {input.merge}"
shell:
    "bcftools reheader -f {input.genome} {input.vcf} -o {output}"
shell:
    "bwa mem {input.ref} {input.fastq1} {input.fastq2} "
    "-R '@RG\\tID:{wildcards.sample}\\tLB:lib1\\tPL:Illumina\\tPU:unit1\\tSM:{wildcards.sample}' | "
    "samtools view -b -h -F 4 -q 20 -> {output}"
shell:
    "samtools sort -n {input} | "
    "samtools fixmate -m -O bam - - | "
    "samtools sort -o {output} -"
shell:
    "samtools markdup {input} {output.final_bam}"
shell:
    "samtools index -b {input}"
shell:
    "macs2 callpeak --nomodel --extsize 200 --slocal 1000 --qvalue 0.05 "
    "-g hs -f BAMPE -t {input.final_bam} -n {wildcards.sample} --outdir {params.output_dir}"
shell:
    "mkdir -p \"{output}\" && "
    "{input.caller_script} {input.bam} {input.peaks} "
    "{input.genome} {output} {wildcards.sample} "
    "{input.shared} {params.caller_params}"
shell:
    "mkdir -p \"{params.out_dir}\" && "
    "{input.caller_script} {input.bam} {input.peaks} "
    "{input.genome} {params.out_dir} {wildcards.sample} "
    "{input.shared} {params.caller_params}"
shell:
    "bcftools view {config[bcftools_params]} {input.vcf} > {output.vcf}"
shell:
    "bcftools norm -m -any {input.vcf} | bcftools norm --check-ref xw -d "
    "all -f {input.ref} > {output.vcf}"
shell:
    "bgzip <{input.vcf} >{output.gzvcf} && tabix -p vcf -f {output.gzvcf}"
shell:
    "echo -e '{params.cols}' > {output.tsv} && "
    "bcftools query -f '{params.qstr}' {input.gzvcf} >> {output.tsv}"
shell:
    "tail -n+2 {input} | awk -F '\\t' -v 'OFS=\\t' '{{for (i=1; i<=NF; i++) if ($i==\"NA\") $i=\".\"}}1' | "
    "sed 's/\\t\+/,/' | LC_ALL=C sort -t $'\\t' -k1,1 > {output}"
shell:
    "awk '{{printf(\"%s\\t%d\\t%d\\t%s\\n\",$1,int($2)+1,int($3),$4);}}' {input} | "
    "awk -F '\\t' -v 'OFS=\\t' '{{for (i=$2;i<$3;i++) print $1\",\"i,substr($4,i-$2+1,1)}}' | "
    "LC_ALL=C sort -t $'\\t' -k1,1 > {output}"
shell:
    "LC_ALL=C join -t $'\\t' -e. -a1 -j1 -o auto --nocheck-order "
    "<(cut -f 1 {input.sites}) {input.prepared_tsv} | cut -f 2- | cat "
    "<(head -n1 {input.tsv} | cut -f 3- | tr '\\t' '\\n' | "
    "sed 's/^/{wildcards.caller}~/' | paste -s) - > {output}"
shell:
    "paste <(echo -e 'CHROM\\tPOS\\tREF'; sed 's/,/\\t/' "
    "{input.all_sites}) {input.caller_output} | "
    "(read -r head; echo \"$head\"; sort -t $'\\t' -k1,1V -k2,2n) | gzip > {output}"
shell:
    "paste <(zcat {input} | scripts/cgrep.bash - -Ev '~(REF|ALT)$') <(scripts/classify.bash {input} '{params.label:q}') | gzip >{output}"
shell:
    "scripts/fillna.bash {input} {params.na_vals:q} | "
    "awk -F $'\\t' -v 'OFS=\\t' '{{for (i=1;i<=NF;i++) if ($i==\".\") $i=0}}1' | "
    "gzip >{output}"
shell:
    "zcat {input} | scripts/filter.bash {params.expr:q} | gzip > {output}"
shell:
    "zcat {input} | awk -f scripts/norm_nums.awk -F $'\\t' -v 'OFS=\\t' | "
    "gzip >{output}"
shell:
    "ln -sf \"$(basename {input})\" {output}"
import sys
import pysam
import argparse
import warnings
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import roc_curve



def phred(vals):
    """ apply the phred scale to the vals provided """
    with np.errstate(divide='raise'):
        try:
            return -10*np.log10(1-vals)
        except FloatingPointError:
            # 1-vals contained a zero, so log10 would have returned -inf;
            # fall back to a capped phred-scaled value of 30
            return np.float64(30)

def plot_line(lst, show_discards=False):
    plt.clf()
    roc = np.copy(lst.T)
    roc[1] = phred(roc[1])
    max_val = phred(max(roc[2])/(max(roc[2])+1))
    # discard inf or na cols
    inf_or_na_cols = np.isinf(roc).any(axis=0) | np.isnan(roc).any(axis=0)
    # warn the user if we're discarding the majority of points
    discarded = np.sum(inf_or_na_cols)/roc.shape[1]*100
    if (not show_discards and discarded != 0) or discarded >= 50:
        warnings.warn("Discarding NaN or Inf points ({}% of points)".format(discarded))
    roc = roc[:,~(inf_or_na_cols)]
    # perform a simple linear regression
    p = np.polyfit(roc[0], roc[1], 1)
    r_squared = 1 - (sum((roc[1] - (p[0] * roc[0] + p[1]))**2) / ((len(roc[1]) - 1) * np.var(roc[1], ddof=1)))
    p = np.poly1d(p)
    # plot the points and the line
    plt.scatter(roc[0], roc[1], color='r', label="_nolegend_")
    if max(roc[0]) <= 1:
        plt.xlabel("RF Probability")
    elif max(roc[0]) <= np.pi/2:
        plt.xlabel("Reverse Arcsin of RF Probability")
    else:
        plt.xlabel("Phred-Scaled RF Probability")
    plt.ylabel("Phred-Scaled Precision (QUAL)")
    plt.plot(
        roc[0],
        p(roc[0]),
        label=str(p)+"\nr-squared: "+str(round(r_squared, 2))+ \
            ("\ndiscarded: "+str(int(discarded))+"%" if show_discards else "")
    )
    plt.hlines(max_val, min(roc[0]), max(roc[0]), colors='g', linestyles='dashed')
    plt.legend(frameon=False, loc='lower right')
    plt.tight_layout()

def eqbin_mean(grp, log=True, pseudo=True, discards_ok=False, inverse=False):
    if inverse:
        return np.arcsin(grp.mean())
    else:
        if log:
            if discards_ok or not pseudo:
                return phred(grp).mean()
            else:
                return phred(grp.sum()/(len(grp) + pseudo))
        else:
           return grp.mean()

def tpr_probs(df, bins=15, eqbin=True, log=True, pseudo=True, discards_ok=False, inverse=False):
    """ bin the sites and calculate an accuracy (predicted positive value) for that bin """
    # retrieve only predicted positives
    df = df[df['varca~CLASS:']>=0.5]
    if eqbin:
        bin_size = int(len(df)/bins)
        # create bins (ie each grp) and add a single false positive to it so we don't get Inf
        lst = np.array([
            (
                eqbin_mean(grp['varca~prob.1'], log, pseudo, discards_ok, inverse),
                precision_score(
                    np.append(grp['varca~truth'].values, 0) if pseudo else grp['varca~truth'],
                    np.append(grp['varca~CLASS:'].values, 1) if pseudo else grp['varca~CLASS:']
                ),
                grp['varca~prob.1'].size
            )
            for grp in (df.iloc[i:i+bin_size] for i in range(0,len(df)-bin_size+1,bin_size))
        ])
    else:
        if log:
            df = df.copy()
            df['varca~prob.1'] = phred(df['varca~prob.1'])
        start = phred(0.5) if log else 0.5
        # get the end excluding inf values (in case log == True)
        end = df.loc[df['varca~prob.1'] != np.inf, 'varca~prob.1'].max()
        # create bins (ie each grp) and add a single false positive to it so we don't get Inf
        lst = np.array([
            (
                grp[1]['varca~prob.1'].mean(),
                precision_score(
                    np.append(grp[1]['varca~truth'].values, 0) if pseudo else grp[1]['varca~truth'],
                    np.append(grp[1]['varca~CLASS:'].values, 1) if pseudo else grp[1]['varca~CLASS:']
                ),
                grp[1]['varca~prob.1'].size
            )
            for grp in df.groupby(pd.cut(df['varca~prob.1'], pd.interval_range(start, end, bins)))
        ])
    return lst



def strip_type(caller):
    """
        strip the -indel or -snp from the end of a caller name
    """
    vartype = ''
    if caller.endswith('-snp'):
        caller = caller[:-len('-snp')]
        vartype = 'snp'
    elif caller.endswith('-indel'):
        caller = caller[:-len('-indel')]
        vartype = 'indel'
    # if there is still a dash, get everything after it
    i = caller.rfind('-')
    if i != -1:
        caller = caller[i+1:]
    return caller, vartype

def isnan(val):
    return type(val) is float and np.isnan(val)

def get_calls(prepared, callers=None, pretty=False):
    """
        get the alleles in each row of prepared at the location (CHROM/POS) of loc
        when choosing an alt allele, choose from the callers in the order given
    """
    # keep track of the contigs that we've seen
    contigs = set()
    # whether we've read the header yet
    read_header = False
    # iterate through each row in the df and check whether they match loc
    for chunk in prepared:
        # do some special stuff (parsing the header) on the very first iteration
        if not read_header:
            # if callers is None, retrieve the callers from the columns of the dataframe
            if callers is None:
                callers = [
                    caller[:-len('~ALT')] for caller in chunk.columns
                    if caller.endswith('~ALT') and not caller.startswith('pg-')
                ]
            # what types of variants are we dealing with? let's count how many
            # times they appear in the caller names
            vartypes = {'snp': 0, 'indel': 0}
            # also, let's retrieve the callers as a dictionary
            pretty_callers = {}
            for caller in callers:
                pretty_caller, vartype = strip_type(caller)
                # don't beautify the callers if the user didn't request it
                pretty_callers[caller] = pretty_caller if pretty else caller
                # keep track of how often each vartype appears
                if vartype in vartypes:
                    vartypes[vartype] += 1
            callers = pretty_callers
            # retrieve the first CHROM/POS location and yield whether we are reading indels or snps
            loc, predict = yield max(vartypes, key=vartypes.get)
            read_header = True
        # now iterate through each row (and also convert the POS column to an int)
        for idx, row in chunk.iterrows():
            # check if we already passed the row -- ie we either:
            # 1) moved onto a new contig or
            # 2) moved passed the position
            while (
                idx[0] != loc[0] and loc[0] in contigs
            ) or (
                idx[0] == loc[0] and idx[1] > loc[1]
            ):
                # return None if we couldn't find loc in the df
                loc, predict = yield None
            if idx == loc:
                # we found it!
                found = False
                # now, we must figure out which caller to get the alleles from
                for caller in callers:
                    ref, alt = row[caller+"~REF"], row[caller+"~ALT"]
                    # TODO: make this work for an arbitrary number of variant types for multilabel classification using the other CLASS values in classified
                    # right now, this only works if there's a single binary label
                    if not isnan(ref) and (
                        (isnan(alt) + predict) % 2
                    ):
                        found = True
                        break
                if found:
                    loc, predict = yield callers[caller], ref, alt
                else:
                    # if we know there is a variant here, but none of the other
                    # callers found it, just label it as a non-variant
                    # TODO: figure out the alt allele from inspecting the ref genome?
                    loc, predict = yield 'varca', row['REF'], float('nan')
            # save the current contig so that we know which ones we've seen
            contigs.add(idx[0])

def prob2qual(prob, vartype):
    # values are from linear model that we created from using the "-i" option
    if vartype == 'snp':
        return 0.6237*phred(prob)+8.075
    elif vartype == 'indel':
        return 0.8463*phred(prob)+2.724
    else:
        # we shouldn't ever encounter this situation, but just in case...
        return phred(prob)

def main(prepared, classified, callers=None, cs=1000, all_sites=False, pretty=False, vartype=None):
    """
        use the results of the prepare pipeline and the classify pipeline
        to create a VCF with all of the classified sites
    """
    # first, get a generator that can read through each call in the prepared df
    prepared = get_calls(
        pd.read_csv(
            prepared, sep="\t", header=0, index_col=["CHROM", "POS"],
            dtype=str, chunksize=cs, na_values=".",
            usecols=lambda col: col in ['CHROM', 'POS', 'REF'] or col.endswith('~REF') or col.endswith('~ALT')
        ), callers, pretty
    )
    # flush out the first item in the generator: the vartype
    if vartype is None:
        vartype = next(prepared)
    else:
        # if the user already gave us the vartype, then just discard this
        next(prepared)
    # also retrieve the classifications as a df
    classified = pd.read_csv(
        classified, sep="\t", header=0, index_col=["CHROM", "POS"],
        dtype={'CHROM':str, 'POS':int}, chunksize=cs, na_values="."
    )
    # keep track of how many sites in the classifications df we've had to skip
    skipped = 0
    # keep track of how many sites we skipped but were predicted to have a variant
    no_alts = 0
    # iterate through each site in the classifications df, get its alleles, and
    # then return them in a nice-looking dictionary
    for chunk in classified:
        for idx, row in chunk.iterrows():
            try:
                # get the alleles for this CHROM/POS location
                call = prepared.send((idx, row['varca~CLASS:']))
            except StopIteration:
                call = None
            # we found a variant but couldn't find an alternate allele!
            no_alts += not (call is None or (isnan(call[2]) + row['varca~CLASS:']) % 2)
            # check: does the site appear in the prepared pipeline?
            # and does this site have a variant?
            if call is None or (not all_sites and isnan(call[2])):
                skipped += 1
                continue
            caller, ref, alt = call
            qual = prob2qual(
                row['varca~prob.'+str(int(not isnan(alt)))], vartype
            )
            # construct a dictionary with all of the relevant details
            yield {
                'contig': str(idx[0]), 'start': idx[1], 'stop': idx[1]+len(ref),
                'qual': qual, 'alleles': (ref, "." if isnan(alt) else alt), 'info': {'CALLER':caller}
            }
    if skipped:
        warnings.warn(
            "Ignored {:n} classification sites that didn't have a variant.".format(skipped)
        )
    if no_alts:
        warnings.warn(
            "Ignored {:n} sites that we predicted to have variants but didn't appear in any of the callers.".format(no_alts)
        )

def write_vcf(out, records):
    """
        write the records to the output vcf
    """
    vcf = pysam.VariantFile(out, mode='w')
    # write the necessary VCF header info
    vcf.header.info.add("CALLER", 1, 'String', "The caller from which this site was taken")
    contigs = set()
    for rec in records:
        # handle pysam increasing the start and end sites by 1
        rec['start'] -= 1
        rec['stop'] -= 1
        # parse the record into a pysam.VariantRecord
        try:
            record = vcf.new_record(
                **rec, samples=None, id=None, filter=None
            )
        except ValueError:
            # add the contig if it hasn't already been added
            if rec['contig'] not in contigs:
                vcf.header.contigs.add(rec['contig'])
                contigs.add(rec['contig'])
            else:
                raise
            # now, try again
            record = vcf.new_record(
                **rec, samples=None, id=None, filter=None
            )
        # write the record to a file
        vcf.write(record)
    return vcf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-o", "--out", default=sys.stdout, help="the filename to save the vcf (or bcf) to"
    )
    parser.add_argument(
        "classified", type=Path, help="a sorted, results.tsv.gz file from the output of the classify pipeline"
    )
    parser.add_argument(
        "prepared", type=Path, nargs="?", default=sys.stdin, help="a sorted, merge.tsv.gz file from the prepare pipeline (if not supplied, this is read from stdin)"
    )
    parser.add_argument(
        '-c', "--callers", default=None, help="a comma separated list of the callers from which to choose alleles, supplied in order of priority (default: all of the callers in the file, in the order they appear)"
    )
    parser.add_argument(
        '-s', "--chunksize", type=np.uint32, default=100000, help="how many rows to read into memory at once (default: 100,000)"
    )
    parser.add_argument(
        '-a', '--all', action='store_true', help="whether to also write non-variant sites to create a gVCF (default: no)"
    )
    parser.add_argument(
        '-p', '--pretty', action='store_true', help="should caller names appear in the vcf by their pretty form (with all dashes intelligently removed) or their original caller ID form? (default: the original form)"
    )
    parser.add_argument(
        '-t', '--type', choices=['indel', 'snp'], default=None, help="whether to recalibrate QUAL values assuming your data are SNPs or indels (default: infer from callers)"
    )
    parser.add_argument(
        '-i', '--internal', action='store_true', help="For testing and internal use: recalibrate the QUAL scores (assumes varca~truth column exists in classified)"
    )
    args = parser.parse_args()

    callers = None
    if args.callers is not None:
        callers = args.callers.split(",")

    if not args.internal:
        import matplotlib
        matplotlib.use('Agg')
        vcf = write_vcf(args.out, main(args.prepared, args.classified, callers, args.chunksize, args.all, args.pretty, args.type))
    else:
        if not sys.flags.interactive:
            sys.exit("ERROR: You must run this script in python's interactive mode (using python's -i flag) when providing the -i flag to this script.")
        try:
            df = pd.read_csv(args.classified, sep="\t", header=0, index_col=["CHROM", "POS"], usecols=['CHROM', 'POS', 'varca~truth', 'varca~prob.1', 'varca~CLASS:'], low_memory=False).sort_values(by='varca~prob.1')
        except ValueError:
            df = pd.read_csv(args.classified, sep="\t", header=0, index_col=["CHROM", "POS"], usecols=['CHROM', 'POS', 'breakca~truth', 'breakca~prob.1', 'breakca~CLASS:'], low_memory=False).sort_values(by='breakca~prob.1')
            df.columns = ['varca~truth', 'varca~prob.1', 'varca~CLASS:']
        plt.ion()
        plot_line(tpr_probs(df))
# usage: <table.tsv.gz> [<column name> <fill value>]...
# replaces missing values (".") in the named columns of a gzipped TSV
script_dir="$(dirname "$BASH_SOURCE")";

# build one awk action per (column, value) pair provided after the input file
pfiles=()
for (( i=2; i<$#; i+=2 )); do
	j=$((i+1))
	reg="${!i}"
	val="${!j}"
	# retrieve the idx of the column among the columns
	reg="$(zcat "$1" | head -n1 | tr '\t' '\n' | grep -Fn "$reg" | cut -f1 -d: | head -n1)"
	# if this column contains a ".", replace it with the requested fill value
	pfiles+=("{if (\$$reg==\".\") \$$reg=\"$val\"}")
done

# apply all of the replacements to the input table and print the result
zcat "$1" | awk -F $'\t' -v 'OFS=\t' "${pfiles[*]} 1"
import sys
import argparse
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.metrics import auc
from itertools import product


parser = argparse.ArgumentParser()
parser.add_argument(
    "out", nargs="?", default=sys.stdout, help="the filename to save the data to"
)
known_args, unknown_args = parser.parse_known_args()
# dynamically parse whatever options the user passes us
count = 0
for arg in unknown_args:
    if arg.startswith('--'):
        parser.add_argument('--{}'.format(arg[2:]), type=argparse.FileType('r'))
        count += 1
if count < 1:
    parser.error("Specify the path to at least one space separated table (w/o a header) with two rows (recall/precision) using options like --gatk-indel path/to/gatk-table")
args = parser.parse_args()
all_args = vars(args)


def get_marker():
    """ retrieve a unique marker in a deterministic order """
    yield from product(range(2, 6), range(1, 3), [52])
    # if that wasn't enough, create some more just at a different angle
    yield from product(range(2, 8), range(1, 3), [0])
    # if we ever get to this point, we need more than 20 markers
    yield from product([2]+list(range(3, 6, 2)), range(1, 3), [90])


# initialize all of the colors
callers = ['gatk_indel', 'varscan_indel', 'vardict_indel', 'breakca', 'delly', 'pindel', 'illumina_manta', 'illumina_strelka', 'pg_indel']
colors = dict(zip(callers, plt.cm.Set1(range(len(callers)))))
# add the other callers, as well
for k in list(colors.keys()):
    if k.endswith('_indel'):
        colors[k[:-len('indel')]+"snp"] = colors[k]
    elif k == 'breakca':
        colors['varca'] = colors[k]

markers = get_marker()
# go through each table and get its name from all of the args
for arg in sorted(all_args.keys()):
    if arg not in known_args:
        table = np.loadtxt(all_args[arg])
        # recall: 1st row, precision: 2nd row
        if table.ndim != 1:
            # area = auc(table[0], table[1])
            # colors[arg] = plt.step(
            #     table[0], table[1], where='post',
            #     label=arg+": area={0:0.2f}".format(area)
            # )[0].get_color()
            if arg in colors:
                plt.step(
                    table[0], table[1], where='post', color=colors[arg],
                    label=arg.replace('breakca', 'varca')
                )
            else:
                # pick the color randomly
                plt.step(
                    table[0], table[1], where='post',
                    label=arg.replace('breakca', 'varca')
                )
        else:
            area = table[1]
            # check if this pt has a curve of the same name
            # to make sure they're the same color
            if arg.endswith("_pt") and arg[:-3] in colors:
                extra = {'color': colors[arg[:-3]]}
            else:
                extra = {}
            plt.plot(
                table[0], table[1], marker=next(markers), ms=12,
                label=arg.replace('breakca', 'varca')+": height={0:0.2f}".format(area), **extra
            )

plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", fontsize='small')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.0])
plt.xlim([0.0, 1.0])
plt.gcf().set_size_inches(10, 10)
plt.savefig(args.out, bbox_inches='tight', pad_inches=0.1, dpi=350)
args <- commandArgs(trailingOnly = TRUE)
test.data<- args[1]
model <- args[2]
output<- args[3]

# load libraries
library(data.table)
library(plyr)
library(dplyr)
suppressMessages(library(mlr))

# load model
print("loading appropriate model")
load(model)

# load test
print("loading and formatting test data")
test<- read.table(test.data, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A"), skipNul=FALSE, row.names=NULL)

# making predictions
print("making predictions and outputting results")
pred= predict(fit, newdata= test, type="prob")
write.table(pred$data, sep='\t', quote=FALSE, row.names=FALSE, na=".", output)
args <- commandArgs(trailingOnly = TRUE)
training<- args[1]
balance<- args[2] # an integer (0 or 1) indicating whether to balance the data
model<- args[3]
importance<- args[4] # importance for each of the variables is saved here
tune<- args[5] # if specified, the results of cross validation are saved here

# load libraries
library(plyr)
library(dplyr)
suppressMessages(library(mlr))
library(parallelMap)
library(parallel)

# load data.frame
print("loading training data into R")
training<- read.table(training, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A","inf","Inf","infinity","Infinity"), skipNul=T, row.names=NULL)
nrows_total <- nrow(training)
training <- na.omit(training)
if (nrows_total-nrow(training)>0) {
	print(paste("ignoring", nrows_total-nrow(training), "rows that have NA values"))
}

print("creating training task and making RF learner")

# optimize hyper parameters
# make training task
traintask <- makeClassifTask(data = training, target = colnames(training)[ncol(training)], positive = 1)

# create learner
rf.lrn <- makeLearner("classif.ranger", predict.type = "prob")

if (as.integer(balance)) {
	print("calculating class weights in order to ensure data is balanced when sampled")
	# first, retrieve the inverse of the counts of each of the labels
	w <- 1/table(training[,ncol(training)])
	# calculate probabilities for each label
	w <- w/sum(w)
	# create a vector containing weights instead of labels
	weights <- rep(0, nrow(training))
	for (val in names(w)) {
		weights[training[,ncol(training)] == val] <- w[val]
	}

	# create par.vals
	rf.lrn$par.vals <- list(importance='impurity', verbose=TRUE, case.weights=weights)
} else {
	# create par.vals
	rf.lrn$par.vals <- list(importance='impurity', verbose=TRUE)
}

if (!is.na(tune)) {
	# mtry default: sqrt(number of features)
	# nodesize default: 1
	params <- makeParamSet(makeIntegerParam("mtry",lower = 1,upper = 10),
                           makeIntegerParam("min.node.size",lower = 7,upper = 25))
	# set validation strategy: 5-fold cross validation
	rdesc <- makeResampleDesc("CV",iters=5L)
	# set optimization technique
	ctrl <- makeTuneControlGrid(resolution=c(mtry=10, min.node.size=19))

	# tune hyperparameters
	print("initiating multicore tuning of hyperparameters")
	# but run the hyperparameter tuning in parallel, since it'll take a while
	# number of cores should be detected automatically (but don't use
	# all of the cores because otherwise we'll use too much memory!)
	parallelStartSocket(cpus=trunc(detectCores()/12), level="mlr.tuneParams")
	parallelLibrary("mlr")

	# create a custom F beta measure
	fbeta = makeMeasure(id = "fbeta", minimize = FALSE, best = 1, worst = 0,
		properties = c("classif", "req.pred", "req.truth"),
		name = "Fbeta measure",
		note = "Defined as: (1+beta^2) * tp/ (beta^2 * sum(truth == positive) + sum(response == positive))",
		fun = function(task, model, pred, feats, extra.args) {
			beta = 0.5
			beta = beta^2
			truth = pred$data$truth
			response = pred$data$response
			positive = pred$task.desc$positive
			(1+beta) * measureTP(truth, response, positive) /
				(beta * sum(truth == positive) + sum(response == positive))
		}
	)

	tuned = tuneParams(learner=rf.lrn, task=traintask, resampling=rdesc, measures=list(fbeta), par.set=params, control=ctrl, show.info=T)

	parallelStop()

	print("matrix of classifier performance for each pair of hyperparams")
	data = generateHyperParsEffectData(tuned)
	print(data$data)
	write.table(data$data, sep="\t", file=tune, quote=FALSE, row.names=F)
	print("tuned params are")
	print(tuned$x)
	rf.lrn$par.vals = c(rf.lrn$par.vals, tuned$x)
} else {
	# set default hyperparameter values
	rf.lrn$par.vals = c(rf.lrn$par.vals, mtry=9, min.node.size=16)
}
print("training model")
fit = mlr::train(rf.lrn, traintask)

# print out variable importance
print("recording variable importance:")
importance_df = as.data.frame(sort(fit$learner.model$variable.importance, decreasing=TRUE))
print(importance_df)
names(importance_df) <- c("variable\timportance")
write.table(importance_df, sep="\t", file=importance, quote=FALSE)

# save.data
save.image( model )
args <- commandArgs(trailingOnly = TRUE)
results<- args[1]
out<- args[2]

# load libraries
suppressMessages(library(mlr))
library(BBmisc)
library(ggplot2)
library(R.devices)

# load data.frame
print("loading results of hyperparam tuning into R")
results<- read.table(results, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A"), row.names=NULL)
data = makeS3Obj("HyperParsEffectData", data = results, measures=c("fbeta.test.mean"), hyperparameters=c("mtry", "min.node.size"), partial=F, nested=F)
plt = plotHyperParsEffect(data, x = "mtry", y = "min.node.size", z = "fbeta.test.mean", plot.type = "heatmap", show.experiments = TRUE)
# min_plt = min(data$data$acc.test.mean, na.rm = TRUE)
# max_plt = max(data$data$acc.test.mean, na.rm = TRUE)
# med_plt = mean(c(min_plt, max_plt))
# plt = plt + scale_fill_gradient2(breaks = seq(min_plt, max_plt, length.out = 5), low = "blue", mid = "white", high = "red", midpoint = med_plt)
suppressGraphics(ggsave(out, plt))