Reproducibility workflow for Gigante et al., 2018: Using long-read sequencing to detect imprinted DNA methylation


Haplotyped Methylome

Reproducibility instructions for Gigante et al., 2019.

Note: this repository is still being tested! If you find a bug, please file an issue.

Data available at ENA Accession PRJEB27157.

Author's note: there is an error in Figure S2 of the paper, in which the figure legend was marked Maternal/Paternal where it should have been Black6/Cast. A corrected version is available here.

Directed Acyclic Dependency Graph

System requirements

  • R

  • Python >= 3.5

  • Lots of RAM (at least 256GB)

  • Lots of disk space (roughly 2TB)

Dependencies

To install the dependencies with conda, run the following commands.

conda env create -f environment.yml
source activate haplotyped_methylome

You will then need to install Albacore; it is available from the Nanopore Community.

To install without conda, see the list of dependencies at the bottom of this README.

Required data

For the standard workflow, snakemake will download all the necessary files.

If you wish to avoid running albacore, bwa and nanopolish on the raw nanopore data, you can run the following command, which downloads the output of these programs and tricks snakemake into thinking you have run the pipeline from the beginning:

snakemake intermediate_download

Note: currently this only downloads the methylation files. We hope to provide the alignment and phasing data in the future.

If you wish to rerun from the beginning after running this command, you can revert to the original download with snakemake --forceall.

Running the workflow

To generate all plots, tables and notebooks, simply run from the root directory:

snakemake --cores 16

If you don't wish to run the full analysis, you can run specific rules from the Snakefile by running, for example:

snakemake --cores 16 rnaseq_analysis
snakemake --cores 16 haplotype_analysis
snakemake --cores 16 methylation_analysis
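
If you are unsure what a given rule will execute, snakemake's standard dry-run flag prints the planned jobs without running anything, e.g.:

snakemake -n rnaseq_analysis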

Installation without conda

Software dependencies (at a minimum, the workflow invokes snakemake, albacore, bwa and nanopolish, as described above):

Python package dependencies:

pip install --user -r requirements.txt

R package dependencies:

Rscript install_R_deps.R

Known Issues

  • rnaseq_analysis.Rmd failed: there is no package called 'ggrastr'

If devtools doesn't play nicely with conda, the automatic GitHub installation of ggrastr sometimes fails. You can resolve it as follows:

git clone --depth 1 https://github.com/VPetukhov/ggrastr.git
cd ggrastr
R -e 'devtools::install()'
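
Alternatively, if GitHub is reachable from within your R session, installing directly with devtools should achieve the same result:

R -e 'devtools::install_github("VPetukhov/ggrastr")'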

Directed Acyclic Dependency Graph: Methylation

Directed Acyclic Dependency Graph: Haplotyping

Directed Acyclic Dependency Graph: RNA-seq

Code Snippets

shell:
    "python build_supplementary_db.py {input.bam} --summary {output.summary}"
shell:
    "python split_methylation_by_alignment.py {input.bam} {input.meth} > {output}"
shell:
    "python calculate_methylation_frequency.py -i {input} -p > {output}"
shell:
    "python call_variant_proportion.py -b {input.bam} -t {threads} -p {input.phased_bam} -v {input.vcf} -o {output}"
shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k4,4 -k1,1 -k2n,2n -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"
shell:
    "mkdir -p {output}.tmp && tail -n +2 {input.meth_split} | "
    "sort -k1,1 -k2n,2n -k4,4 -T {output}.tmp | "
    "cat <(head -n 1 {input.meth}) - > {output} && "
    "rm -rf {output}.tmp"
shell:
    "Rscript read_summary.R {input} {output}"
shell:
    "Rscript read_haplotypes.R {input} {output}"
shell:
    "Rscript fit_reads.R ../nanopore/{params.sample} ../RData/{params.sample}"
shell:
    "python split_methylation_by_haplotype.py -m {input.meth} -p {input.phase}"
shell:
    "python calculate_methylation_frequency.py -i {input} -p > {output}"
shell:
    "Rscript dss_paired.R &> {log}"
shell:
    "python compare_haplotype_methylation.py -m {input.meth} -p {input.phase} -b 11 -o 5 -r $(cat {input.region}) > {output}"
script:
    "build_dss_dmrlist.R"
script:
    "build_methylation_df.R"
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))

load("../RData/paired_DSS.RData")
combined_dmr <- dmr_parent %>%
  mutate(type="imprinted") %>%
  bind_rows(dmr_strain %>%
              mutate(type="strain")) %>%
  mutate(chr = as.character(chr)) %>%
  arrange(-abs(areaStat)) %>%
  mutate(dmr_start=start,
         dmr_end=end,
         start = dmr_start-5000,
         end = dmr_end + 5000,
         id=row_number()) %>%
  select(chr, start, end, everything()) %>%
  fuzzyjoin::genome_left_join(read_tsv("../genome_data/Mus_musculus.GRCm38_90.chr.genes.tsv", col_types="ccccdd") %>%
                                select(chr, start, end, gene_name)) %>%
  select(-starts_with("start"), -starts_with("end"), -chr.y) %>%
  select(id, chr=chr.x, start=dmr_start, end=dmr_end, everything()) %>%
  group_by(id) %>%
  mutate(overlapping_genes=paste0(gene_name, collapse=",")) %>%
  distinct(overlapping_genes, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-gene_name)
combined_dmr %>%
  write_csv("../tables/dss_dmrlist.csv")
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

bisulfite_df <- read_tsv("../bisulfite/B6CastF1_1_pe.summary.tsv", col_names=c("chr","pos","percentMeth", "meth", "coverage"), col_types='ciddd') %>%
  arrange(chr, pos) %>%
  dplyr::rename(start=pos) %>%
  mutate(end=start,
         chr=sub("chr","",chr))
save(bisulfite_df, file="../RData/bisulfite_df.RData")

nanopolish_df <- read.table("../nanopore/b6xcast.minion.methylation.summary.tsv",
                            header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2) %>%
  dplyr::rename(chr=chromosome,
                percentMeth=methylated_frequency) %>%
  arrange(chr, start)
save(nanopolish_df, file="../RData/nanopolish_df.RData")
from __future__ import print_function
import pysam
import pickle
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('bam')
parser.add_argument('-s', '--summary', default=None,
                    help='Optional file for printing read summaries')
parser.add_argument('--summary-only', default=False,
                    action='store_true', help="Don't produce a .suppdb file")
args = parser.parse_args()

bam_fn = args.bam
reads = dict()
with pysam.AlignmentFile(bam_fn, 'r') as bam:
    if args.summary:
        import numpy as np
        quals = dict()
    for read in bam:
        if read.is_unmapped:
            continue
        try:
            overlap = False
            for i in range(len(reads[read.query_name])):
                chr, start, end = reads[read.query_name][i]
                if chr == read.reference_name and start <= read.reference_end and end >= read.reference_start:
                    reads[read.query_name][i] = (
                        chr, min(start, read.reference_start), max(end, read.reference_end))
                    overlap = True
            if not overlap:
                reads[read.query_name].append(
                    (read.reference_name, read.reference_start, read.reference_end))
        except KeyError:
            reads[read.query_name] = [
                (read.reference_name, read.reference_start, read.reference_end)]
        if args.summary:
            try:
                read_qual = np.median(read.query_qualities)
            except TypeError:
                read_qual = "NA"
            try:
                quals[read.query_name].append(read_qual)
            except KeyError:
                quals[read.query_name] = [read_qual]
non_supplementary = set()
for read, alignments in reads.items():
    if len(alignments) == 1:
        non_supplementary.add(read)
if args.summary:
    with open(args.summary, 'w') as summary:
        print("read_name\tchr\tstart\tend\tqual", file=summary)
        for read, alignments in reads.items():
            if read in non_supplementary:
                print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                    read_name=read,
                    chr=alignments[0][0],
                    start=alignments[0][1],
                    end=alignments[0][2],
                    qual=quals[read][0]), file=summary)
            else:
                for i, alignment in enumerate(alignments):
                    print("{read_name}\t{chr}\t{start}\t{end}\t{qual}".format(
                        read_name=read + "_" + str(i),
                        chr=alignment[0],
                        start=alignment[1],
                        end=alignment[2],
                        qual=quals[read][i]), file=summary)
if not args.summary_only:
    for read in non_supplementary:
        del reads[read]
    with open("{}.suppdb".format(bam_fn), 'wb') as handle:
        pickle.dump(reads, handle, pickle.HIGHEST_PROTOCOL)
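
The resulting .suppdb file is just a pickled dict mapping each read name that has supplementary alignments to its list of merged (chr, start, end) spans. A minimal sketch of consuming it downstream (the bam path below is a placeholder):

import pickle

# Load the supplementary-alignment database written by build_supplementary_db.py.
# "example.bam.suppdb" is a placeholder path for illustration.
with open("example.bam.suppdb", "rb") as handle:
    suppdb = pickle.load(handle)

# Each entry maps a read name to its merged alignment spans;
# call_variant_proportion.py labels these segments read_name_0, read_name_1, ...
for read_name, alignments in suppdb.items():
    for i, (chrom, start, end) in enumerate(alignments):
        print("{}_{}\t{}\t{}\t{}".format(read_name, i, chrom, start, end))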
from __future__ import print_function
import math
import sys
import csv
import argparse
from collections import namedtuple


def make_key(c, s, e):
    return c + ":" + str(s) + ":" + str(e)


class SiteStats:

    def __init__(self, g_size, g_seq):
        self.num_reads = 0
        self.posterior_methylated = 0
        self.called_sites = 0
        self.called_sites_methylated = 0
        self.group_size = g_size
        self.sequence = g_seq


def update_call_stats(key, num_called_cpg_sites, methylation, sequence):
    if key not in sites:
        sites[key] = SiteStats(num_called_cpg_sites, sequence)

    sites[key].num_reads += 1
    sites[key].called_sites += num_called_cpg_sites
    if methylation > 0:
        sites[key].called_sites_methylated += num_called_cpg_sites * methylation

parser = argparse.ArgumentParser(
    description='Calculate methylation frequency at genomic CpG sites')
parser.add_argument('-c', '--call-threshold', type=float,
                    required=False, default=2.5)
parser.add_argument('-i', '--input', type=str, required=False)
parser.add_argument('-s', '--split-groups', action='store_true')
parser.add_argument('-p', '--probabilistic', action='store_true')
parser.add_argument('--prior', type=float, default=0.5,
                    help="Prior probability of methylation")
args = parser.parse_args()
assert(args.call_threshold is not None)
if args.probabilistic:
    args.call_threshold = 0

sites = dict()

if args.input:
    in_fh = open(args.input)
else:
    in_fh = sys.stdin
csv_reader = csv.DictReader(in_fh, delimiter='\t')

for record in csv_reader:

    try:
        num_sites = int(record['num_motifs'])
    except KeyError:
        # backwards compatible
        num_sites = int(record['num_cpgs'])
    llr = float(record['log_lik_ratio'])

    # Skip ambiguous call
    if abs(llr) < args.call_threshold:
        continue
    sequence = record['sequence']

    if args.probabilistic:
        try:
            if llr > 700:
                llr = 700
            elif llr < -700:
                llr = -700
            methylation = 1 / (1 + (1 - args.prior) /
                               (args.prior * math.exp(llr)))
        except OverflowError:
            print(llr)
            raise
    else:
        methylation = 1 if llr > 0 else 0

    # if this is a multi-cpg group and split_groups is set, break up these
    # sites
    if args.split_groups and num_sites > 1:
        c = record['chromosome']
        s = int(record['start'])
        e = int(record['end'])

        # find the position of the first CG dinucleotide
        sequence = record['sequence']
        cg_pos = sequence.find("CG")
        first_cg_pos = cg_pos
        while cg_pos != -1:
            key = make_key(c, s + cg_pos - first_cg_pos,
                           s + cg_pos - first_cg_pos)
            update_call_stats(key, 1, methylation, "split-group")
            cg_pos = sequence.find("CG", cg_pos + 1)
    else:
        key = make_key(record['chromosome'], record['start'], record['end'])
        update_call_stats(key, num_sites, methylation, sequence)

# header
print("\t".join(["chromosome", "start", "end", "num_cpgs_in_group",
                 "called_sites", "called_sites_methylated",
                 "methylated_frequency", "group_sequence"]))

for key in sites:
    if sites[key].called_sites > 0:
        (c, s, e) = key.split(":")
        f = float(sites[key].called_sites_methylated) / sites[key].called_sites
        print("\t".join([str(x) for x in [c, s, e, sites[key].group_size,
                                          sites[key].called_sites,
                                          sites[key].called_sites_methylated,
                                          f, sites[key].sequence]]))
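
The -p (probabilistic) mode above weights each call by its posterior probability of methylation rather than a hard 0/1 call. A quick sanity check of the posterior formula, as a minimal sketch (with the default prior of 0.5, under which the posterior reduces to the logistic function of the log-likelihood ratio):

import math

def posterior_methylation(llr, prior=0.5):
    # Same transform as above: 1 / (1 + (1 - prior) / (prior * exp(llr))),
    # with the LLR clamped to avoid overflow in math.exp.
    llr = max(min(llr, 700), -700)
    return 1 / (1 + (1 - prior) / (prior * math.exp(llr)))

print(posterior_methylation(0.0))   # 0.5: no evidence either way
print(posterior_methylation(2.5))   # ~0.92: likely methylated
print(posterior_methylation(-2.5))  # ~0.08: likely unmethylated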
from __future__ import print_function
import sys
import pysam
import csv
import pickle
import functools
import multiprocessing
import argparse
import math

__min_coverage = 5
__override_ratio = 3

# fit exponential to log(1-correctness)
__slope = -0.1203
__intercept = -0.6927


def parse_args():
    parser = argparse.ArgumentParser(
        description="Phase nanopore reads from nanopolish output")
    parser.add_argument('-b', '--bam', required=True,
                        help="Path to the original bam file")
    parser.add_argument('-p', '--phased', required=True,
                        help="Path to the phased bam file from nanopolish phase-reads")
    parser.add_argument('-v', '--vcf', required=True,
                        nargs='+', help="Path to the variants file(s)")
    parser.add_argument('-d', '--detailed-output', action='store_true',
                        default=False, help="Print per-read, per-site calls to stdout")
    parser.add_argument('-o', '--outfile',
                        type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-t', '--threads', type=int,
                        default=multiprocessing.cpu_count())
    args = parser.parse_args()
    return args


def get_read_id(read, contig, position, suppdb):
    if read.query_name in suppdb:
        for i, alignment in enumerate(suppdb[read.query_name]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read.query_name, i)
        raise Exception("Position {}:{} not internal to any alignment of read {}".format(
            contig, position, read.query_name))
    else:
        return read.query_name


def check_column_support(vcf_rows, vcf_pos, column, reads, suppdb, type, detailed_output):
    if column is not None:
        valid_allele = [True if vcf_row['POS'] == vcf_pos and len(
            vcf_row['REF']) == 1 else False for vcf_row in vcf_rows]
        if column.reference_pos == vcf_pos and True in valid_allele and 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
            for read in column.pileups:
                if not read.is_del:
                    read_name = read.alignment.query_name
                    read_pos = read.query_position
                    base = read.alignment.query_sequence[read_pos]
                    call = []
                    try:
                        qual = read.alignment.query_qualities[read_pos]
                        if type == "signal":
                            # nanopolish qualities scores should be offset by
                            # 35 - they never give anything below this
                            qual = max(0, qual - 35)
                        error = math.exp(__intercept + __slope * qual)
                    except TypeError:
                        if read.alignment.query_qualities is None:
                            # quality scores missing
                            error = 0
                        else:
                            # unknown error
                            raise
                    try:
                        reads[read_name]
                    except KeyError:
                        reads[read_name] = [0] * (len(vcf_rows) + 2)

                    # only have to check reference once
                    if base == vcf_rows[0]["REF"]:
                        reads[read_name][0] += 1 - error
                        call.append("ref")
                    else:
                        reads[read_name][0] += error
                    for i in range(len(vcf_rows)):
                        if vcf_rows[i]['FILTER'] == 'PASS':
                            match = False
                            if valid_allele[i]:
                                if base in vcf_rows[i]["ALT"]:
                                    match = True
                            elif "ref" in call and base == vcf_rows[0]["REF"]:
                                # no snp here, same as reference
                                match = True
                            if match:
                                reads[read_name][i + 1] += 1 - error
                                call.append("alt{}".format(i) if len(
                                    vcf_rows) > 1 else "alt")
                            else:
                                reads[read_name][i + 1] += error
                        else:
                            # treat as heterozygous
                            if valid_allele[i] and base in vcf_rows[i]['REF'] + vcf_rows[i]['ALT']:
                                reads[read_name][i + 1] += (1 - error) / 2
                            else:
                                reads[read_name][i + 1] += error
                    # add 1 to coverage
                    reads[read_name][-1] += 1
                    if len(call) > 0 and detailed_output:
                        qual = read.alignment.query_qualities[read_pos]
                        print(vcf_rows)
                        print(base)
                        print(error)
                        print(reads[read_name])
                        print("{}\t{}\t{}\t{}\t{}\t{}".format(
                            column.reference_name, column.reference_pos, read_name, type, ",".join(call), qual))
    return reads


def ref_support(reads):
    return set([read for read, counts in reads.items() if counts[0] > counts[1]])


def summarize_reads(reads):
    if len(reads) == 0:
        return None
    support = len(ref_support(reads))
    return float(support) / len(reads)


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


def pileup_step(ref_pos, original_pileup, phased_pileup):
    try:
        original_column = next(original_pileup)
        while original_column.reference_pos < ref_pos:
            original_column = next(original_pileup)
    except StopIteration:
        original_column = None
    try:
        phased_column = next(phased_pileup)
        while phased_column.reference_pos < ref_pos:
            phased_column = next(phased_pileup)
    except StopIteration:
        phased_column = None
    return original_column, phased_column


def process_row(vcf_rows, vcf_pos, original_pileup, original_reads, phased_pileup, phased_reads, suppdb, detailed_output):
    if 'PASS' in [vcf_row['FILTER'] for vcf_row in vcf_rows]:
        # process this snp
        original_column, phased_column = pileup_step(
            vcf_pos, original_pileup, phased_pileup)
        if not (original_column or phased_column):
            return None, original_reads, None, phased_reads
        else:
            original_reads = check_column_support(
                vcf_rows, vcf_pos, original_column, original_reads, suppdb, "base", detailed_output)
            phased_reads = check_column_support(
                vcf_rows, vcf_pos, phased_column, phased_reads, suppdb, "signal", detailed_output)
    return original_pileup, original_reads, phased_pileup, phased_reads


def call_contig(contig, vcf_fns, original_fn, phased_fn, suppdb, detailed_output):
    with pysam.AlignmentFile(original_fn, 'r') as original_bam, pysam.AlignmentFile(phased_fn, 'r') as phased_bam:
        vcf_handles = []
        vcf_readers = []
        for vcf_fn in vcf_fns:
            handle = open(vcf_fn, 'r')
            vcf_handles.append(handle)
            vcf_readers.append(load_vcf(handle))

        original_pileup = original_bam.pileup(contig)
        original_reads = dict()

        phased_pileup = phased_bam.pileup(contig)
        phased_reads = dict()

        if original_pileup or phased_pileup:
            # find right location in vcf - maybe an index is better for this
            vcf_rows = [next(vcf_reader) for vcf_reader in vcf_readers]
            for i, reader in enumerate(vcf_readers):
                if not vcf_rows[i]['CHROM'] == contig:
                    for vcf_row in reader:
                        if vcf_row['CHROM'] != contig:
                            # not there yet
                            continue
                        vcf_rows[i] = vcf_row
                        break
                vcf_rows[i]['POS'] = int(vcf_rows[i]['POS']) - 1
            # process the files
            while True:
                try:
                    vcf_pos = min(
                        [vcf_row['POS'] for vcf_row in vcf_rows if vcf_row['CHROM'] == contig])
                except ValueError:
                    # empty list, all on wrong contig
                    break
                original_pileup, original_reads, phased_pileup, phased_reads = process_row(
                    vcf_rows, vcf_pos, original_pileup, original_reads, phased_pileup, phased_reads, suppdb, detailed_output)
                if not (original_pileup or phased_pileup):
                    # out of reads
                    break
                num_failed = 0
                for i in range(len(vcf_rows)):
                    try:
                        if vcf_rows[i]['CHROM'] == contig and vcf_rows[i]['POS'] <= vcf_pos:
                            vcf_rows[i] = next(vcf_readers[i])
                            if vcf_rows[i]['CHROM'] != contig:
                                vcf_rows[i]['POS'] = float('inf')
                                num_failed += 1
                            else:
                                vcf_rows[i]['POS'] = int(
                                    vcf_rows[i]['POS']) - 1
                    except StopIteration:
                        vcf_rows[i]['CHROM'] = None
                        vcf_rows[i]['POS'] = float('inf')
                        num_failed += 1
                if num_failed == len(vcf_rows):
                    # no more variants
                    break
    return original_reads, phased_reads

args = parse_args()

with open("{}.suppdb".format(args.bam), 'rb') as handle:
    suppdb = pickle.load(handle)

func = functools.partial(call_contig, vcf_fns=args.vcf, original_fn=args.bam,
                         phased_fn=args.phased, suppdb=suppdb, detailed_output=args.detailed_output)
with pysam.AlignmentFile(args.bam, 'r') as bam:
    original_reads, phased_reads = dict(), dict()
    contigs = [bam.get_reference_name(i) for i in range(bam.nreferences)]
if args.detailed_output:
    print("\t".join(["chr", "pos", "read_name", "caller", "genotype", "qual"]))

threads = 1 if args.detailed_output else min(args.threads, len(contigs))
if threads > 1:
    pool = multiprocessing.Pool(threads)
    data = pool.imap_unordered(func, contigs)
    pool.close()
    pool.join()
else:
    data = [func(contig) for contig in contigs]

for original_contig_reads, phased_contig_reads in data:
    original_reads.update(original_contig_reads)
    phased_reads.update(phased_contig_reads)

reads = set(original_reads.keys()).union(set(phased_reads.keys()))
signal_alt_header = "\t".join(["signal_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "signal_alt"
base_alt_header = "\t".join(["base_alt{}".format(
    i + 1) for i in range(len(args.vcf))]) if len(args.vcf) > 1 else "base_alt"
print("\t".join(['read', 'genotype', 'signal_ref', signal_alt_header,
                 'signal_coverage', 'base_ref', base_alt_header, 'base_coverage']), file=args.outfile)
for read in reads:
    try:
        signal_calls = phased_reads[read]
    except KeyError:
        signal_calls = [0] * (len(args.vcf) + 2)
    try:
        base_calls = original_reads[read]
    except KeyError:
        base_calls = [0] * (len(args.vcf) + 2)

    base_coverage = base_calls[-1]
    base_calls = base_calls[:-1]
    signal_coverage = signal_calls[-1]
    signal_calls = signal_calls[:-1]
    base_max = max(base_calls)
    signal_max = max(signal_calls)

    if signal_coverage == 0 or signal_calls.count(signal_max) > 1:
        signal_call = "fail"
        signal_ratio = 0
    else:
        signal_call = signal_calls.index(signal_max)
        signal_ratio = signal_max / signal_coverage
        if signal_call == 0:
            signal_call = "ref"
        else:
            signal_call = "alt{}".format(
                signal_call) if len(args.vcf) > 1 else "alt"

    if base_coverage == 0 or base_calls.count(max(base_calls)) > 1:
        base_call = "fail"
        base_ratio = 0
    else:
        base_call = base_calls.index(max(base_calls))
        base_ratio = base_max / base_coverage
        if base_call == 0:
            base_call = "ref"
        else:
            base_call = "alt{}".format(base_call) if len(
                args.vcf) > 1 else "alt"

    if base_coverage < __min_coverage and signal_coverage < __min_coverage:
        genotype = "fail"
    elif base_call == signal_call:
        genotype = base_call
    elif signal_call == "fail":
        genotype = base_call
    elif base_call == "fail":
        genotype = signal_call
    elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
        genotype = base_call
    elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
        genotype = signal_call
    elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
        genotype = base_call
    elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
        genotype = signal_call
    else:
        genotype = "fail"

    if read in suppdb:
        for j in range(len(suppdb[read])):
            supp_read = "{}_{}".format(read, j)
            print("\t".join([supp_read, genotype, "\t".join([str(i) for i in signal_calls]), str(
                signal_coverage), "\t".join([str(i) for i in base_calls]), str(base_coverage)]), file=args.outfile)
    else:
        print("\t".join([read, genotype, "\t".join([str(i) for i in signal_calls]), str(
            signal_coverage), "\t".join([str(i) for i in base_calls]), str(base_coverage)]), file=args.outfile)
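
call_variant_proportion.py converts each base's quality score into a miscall probability with the fitted exponential error = exp(__intercept + __slope * qual), offsetting nanopolish signal-level qualities by 35. A small sketch of what those constants imply:

import math

__slope = -0.1203
__intercept = -0.6927

def error_rate(qual, signal=False):
    # Mirrors the script above: signal-level (nanopolish) qualities
    # are offset by 35 before applying the exponential fit.
    if signal:
        qual = max(0, qual - 35)
    return math.exp(__intercept + __slope * qual)

for q in (0, 10, 20, 30):
    print(q, round(error_rate(q), 4))
# ~0.50 at q=0, ~0.15 at q=10, ~0.045 at q=20: error decays with quality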
from __future__ import print_function
import csv
import numpy as np
import math
import argparse
import re


def parse_region(region):
    """
    Parse a region specification.
    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)
    >>> parse_region("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    r = ''.join(region.split())  # remove whitespace
    r = re.split(':|-', r)
    start = 0
    end = None
    if len(r) < 1:
        raise argparse.ArgumentTypeError(
            "Region {} must specify a reference name".format(region))
    elif len(r) > 3:
        raise argparse.ArgumentTypeError(
            "Region {} format not recognized".format(region))
    else:
        contig = r[0]
        try:
            start = int(re.sub(r",|\.", "", r[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(r",|\.", "", r[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
parser.add_argument('-b', '--binwidth', type=int, default=11,
                    help="Genomic distance over which to aggregate tests")
parser.add_argument('-o', '--overlap', type=int, default=5,
                    help="Genomic distance for bin overlap. Limited to binwidth/2 for streaming reasons, but this could be reengineered")
parser.add_argument('-r', '--region', type=parse_region, nargs="+", metavar="CHR:START-END",
                    help="Name and range of contigs to test (Default: entire genome)")
args = parser.parse_args()

__binwidth = args.binwidth
__overlap = min(__binwidth // 2, args.overlap)  # max = __binwidth/2

meth_fn = args.methylation
haplotype_fn = args.phase

regions = dict()
# format: { chr : [start, end] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        regions[region[0]].append([region[1], region[2]])

haplotypes = dict()
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"


print("\t".join(["chr", "start", "end", "ref_mean", "ref_sd", "ref_site_count", "ref_read_count",
                 "alt_mean", "alt_sd", "alt_site_count", "alt_read_count"]))
chr = None
start = None
ref_reads = None
alt_reads = None
ref_meth = None
alt_meth = None
next_start = None
next_ref_reads = None
next_alt_reads = None
next_ref_meth = None
next_alt_meth = None
end = 0

with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    for row in reader:
        pos = (float(row["start"]) + float(row["end"])) / 2
        if regions:
            hit = False
            try:
                for region in regions[row['chromosome']]:
                    if pos >= region[0] and pos < region[1]:
                        hit = True
            except KeyError:
                pass
            if not hit:
                continue
        if pos >= end or row["chromosome"] != chr:
            if chr is not None:
                # print result
                if not (len(ref_meth) == 0 and len(alt_meth) == 0):
                    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
                    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
                    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
                    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
                    print("\t".join([chr, str(start), str(end), str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(
                        ref_reads)), str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
            if chr is not None and chr == row["chromosome"] and pos < next_start:
                start = next_start
                ref_reads = next_ref_reads
                alt_reads = next_alt_reads
                ref_meth = next_ref_meth
                alt_meth = next_alt_meth
            else:
                start = int(row["start"])
                ref_reads = set()
                alt_reads = set()
                ref_meth = list()
                alt_meth = list()
                chr = row["chromosome"]
            next_start = start + __overlap
            end = start + __binwidth
            next_ref_reads = set()
            next_alt_reads = set()
            next_ref_meth = list()
            next_alt_meth = list()
        try:
            meth = 1 / (1. + math.exp(-float(row["log_lik_ratio"])))
        except OverflowError:
            meth = 0.0
        genotype = get_genotype(row["read_name"])
        if genotype == "ref":
            ref_meth.append(meth)
            ref_reads.add(row["read_name"])
        elif genotype == "alt":
            alt_meth.append(meth)
            alt_reads.add(row["read_name"])
        if int(row["start"]) >= next_start:
            if genotype == "ref":
                next_ref_meth.append(meth)
                next_ref_reads.add(row["read_name"])
            elif genotype == "alt":
                next_alt_meth.append(meth)
                next_alt_reads.add(row["read_name"])

# print last result
if chr is not None:
    ref_mean = np.mean(ref_meth) if len(ref_meth) > 0 else "NA"
    ref_sd = np.std(ref_meth) if len(ref_meth) > 0 else "NA"
    alt_mean = np.mean(alt_meth) if len(alt_meth) > 0 else "NA"
    alt_sd = np.std(alt_meth) if len(alt_meth) > 0 else "NA"
    print("\t".join([chr, str(start), str(end), str(ref_mean), str(ref_sd), str(len(ref_meth)), str(len(
        ref_reads)), str(alt_mean), str(alt_sd), str(len(alt_meth)), str(len(alt_reads))]))
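
The windows above are --binwidth wide and advance by --overlap, so a site can contribute to two consecutive windows; overlap is capped at binwidth/2 because the streaming implementation only tracks one "next" window at a time. A toy reconstruction of the bin coordinates, assuming the defaults of -b 11 -o 5:

binwidth, overlap = 11, 5
start = 0
for _ in range(4):
    print("bin [{}, {})".format(start, start + binwidth))
    start += overlap
# bin [0, 11), bin [5, 16), bin [10, 21), bin [15, 26)
# a site at position 7 is counted in both [0, 11) and [5, 16)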
import sys
import re


def count_cpg(file, path=True):
    substring = "CG"
    print("chr\tstart\tend")
    f = open(file) if path else file
    line = f.readline()
    while line:
        if line.startswith('>'):
            chrm = line[1:].rstrip()
        else:
            index = 0
            linelen = len(line)
            # Find first non-N base
            # index = re.search(r'[^N]', line).start()
            # print("start\t{}\tend\t{}".format(index+1, linelen))
            while index < linelen:
                index = line.find(substring, index)
                if index == -1:
                    break
                print("{}\t{}\t{}".format(
                    chrm, index + 1, index + len(substring)))
                index += len(substring)
        line = f.readline()
    if path:
        f.close()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        if sys.stdin:
            count_cpg(sys.stdin, path=False)
        else:
            raise TypeError("missing file input")
    else:
        count_cpg(sys.argv[1])
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(DSS)))
suppressMessages(suppressPackageStartupMessages(library(bsseq)))

meth_fn <- "../nanopore/b6xcast.minion.methylation"

forward_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.mat") %>%
  dplyr::rename(chr=chromosome,
                percentMeth=methylated_frequency) %>%
  arrange(chr, start)
forward_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                             header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
  mutate(start=start+1,
         end=end+2,
         sampleName="b6xcast.pat") %>%
  dplyr::rename(chr=chromosome,
                percentMeth=methylated_frequency) %>%
  arrange(chr, start)

meth_fn <- "../nanopore/castxb6.promethion.methylation"

reverse_ref_df <- read.table(paste0(meth_fn, ".ref_summary.tsv"),
                              header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
    mutate(start=start+1,
           end=end+2,
         sampleName="castxb6.pat") %>%
    dplyr::rename(chr=chromosome,
                  percentMeth=methylated_frequency) %>%
    arrange(chr, start)
reverse_alt_df <- read.table(paste0(meth_fn, ".alt_summary.tsv"),
                              header=TRUE, sep="\t", stringsAsFactors = FALSE) %>%
    mutate(start=start+1,
           end=end+2,
         sampleName="castxb6.mat") %>%
    dplyr::rename(chr=chromosome,
                  percentMeth=methylated_frequency) %>%
    arrange(chr, start)

tidy_df <- function(df, sampleName) {
  Cov <- rlang::sym(paste0("Cov.",sampleName))
  M <- rlang::sym(paste0("M.",sampleName))
  df %>%
    select(chr, start, end, 
           !!Cov:=called_sites, 
           !!M:=called_sites_methylated)
}

make_bs <- function(df) {
  M <- df %>% 
    select(starts_with("M.")) %>%
    rename_all(function(x) gsub("M.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  Cov <- df %>% 
    select(starts_with("Cov.")) %>%
    rename_all(function(x) gsub("Cov.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  BSseq(chr = df$chr, 
        pos = df$start,
        Cov = Cov,
        M = M,
        sampleNames = colnames(M))
}

bs <- tidy_df(forward_ref_df, "b6xcast.mat") %>%
  full_join(tidy_df(forward_alt_df, "b6xcast.pat"),
            by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_ref_df, "castxb6.pat"),
            by=c("chr", "start", "end")) %>%
  full_join(tidy_df(reverse_alt_df, "castxb6.mat"),
            by=c("chr", "start", "end")) %>%
  group_by(chr, start) %>%
  summarise_if(is.numeric, sum) %>%
  ungroup() %>%
  make_bs()
pData(bs)$strain <- c("b6xcast", "b6xcast", "castxb6", "castxb6")
pData(bs)$parent <- c("mat", "pat", "pat", "mat")

loci.idx <- which(rowSums(getCoverage(bs, type="Cov")==0) == 0)
dml_parent <- DMLtest(BSobj = bs[loci.idx,], 
               group1=c("b6xcast.mat", "castxb6.mat"), 
               group2=c("b6xcast.pat", "castxb6.pat"), 
               equal.disp = TRUE, smoothing=TRUE)
dml_strain <- DMLtest(BSobj = bs[loci.idx,], 
               group1=c("b6xcast.mat", "castxb6.pat"), 
               group2=c("castxb6.mat", "b6xcast.pat"), 
               equal.disp = TRUE, smoothing=TRUE)

dmr_parent <- callDMR(dml_parent, p.threshold=1e-5, dis.merge=1500)
dmr_strain <- callDMR(dml_strain, p.threshold=1e-5, dis.merge=1500)

save(dml_parent, dml_strain, dmr_parent, dmr_strain, file="../RData/paired_DSS.RData")
suppressMessages(suppressPackageStartupMessages(library(magrittr)))
suppressMessages(suppressPackageStartupMessages(library(dplyr)))
suppressMessages(suppressPackageStartupMessages(library(readr)))

args <- commandArgs(trailingOnly = TRUE)
gtf <- rtracklayer::import(args[1])
gene_list <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, gene_id, gene_name) %>%
  group_by(chr, gene_id, gene_name, strand) %>%
  summarise(start=min(start),
            end=max(end))
gene_list %>%
  write_tsv(sub("gtf$", "genes.tsv", args[1]))
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(ggExtra)))
suppressMessages(suppressPackageStartupMessages(library(broom)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

n_min <- 7
args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outdir <- args[2]

load(file.path(outdir, "summary_df.RData"))
load(file.path(outdir, "haplotype_df.RData"))

fit_loess <- function(chr, read_name, start, percentMeth, ...) {
  x = data.frame(start=start, percentMeth=percentMeth)
  fit <- loess(percentMeth ~ start, x, control=loess.control(surface = "interpolate",
                                                             statistics = "none",
                                                             trace.hat = "approximate",
                                                             cell=0.5), ...)
  fit <- list(kd=fit$kd,
              start=min(start),
              end=max(start),
              chr=chr,
              read_name=read_name)
  fit
}

process_file <- function(filepath) {
  con = file(filepath, "r")
  reads=list()
  header = readLines(con, n = 1)

  # process first line
  line = readLines(con, n = 1)
  line <- strsplit(line, "\\t")[[1]]
  chr=line[1]
  read_name = line[4]
  pos=as.integer(line[2])
  meth = 1/(1+exp(-as.numeric(line[5])))
  i=1
  start=c(pos)
  percentMeth=c(meth)

  # process the rest
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    line <- strsplit(line, "\\t")[[1]]
    pos=as.integer(line[2])
    name = line[4]
    meth = 1/(1+exp(-as.numeric(line[5])))
    if (name != read_name) {
      # new line
      if (length(start) > n_min){
        len <- max(start) - min(start)
        span <- 0.1 + 8e-11*((max(-len + 1e5, 0)))^2
        reads[[i]] = fit_loess(chr, read_name, start, percentMeth, span=span)
        i <- i + 1
      }
      chr=line[1]
      read_name = name
      start=c(pos)
      percentMeth=c(meth)
    } else {
      start=c(start, pos)
      percentMeth=c(percentMeth, meth)
    }
  }
  close(con)
  reads
}

fit_reads <- process_file(paste0(infile, ".methylation.sorted.by_read.tsv"))
save(fit_reads, file=file.path(outdir, "fit_reads.RData"))
# load(file.path(outdir, "fit_reads.RData"))

fit_reads_df <- data_frame(read_name=sapply(fit_reads, function(x) { x$read_name })) %>% 
  mutate(id=row_number()) %>%
  left_join(summary_df, by="read_name") %>%
  arrange(chr, start, end) %>%
  left_join(haplotype_df %>% select(read_name, genotype), by="read_name") %>%
  dplyr::select(chr, start, end, id, read_name, genotype, qual)
save(fit_reads_df, file=file.path(outdir, "fit_reads_df.RData"))
shell:
    "python call_variant_proportion.py -b {input.bam} -p {input.phased_bam} -v {input.cast_vcf} {input.fvb_vcf} -o {output}"
shell:
    "Rscript render_notebook.R ../notebooks/nanopolish_fvb_resolution.Rmd &> {params.log}"
shell:
    "python merge_recombined_haplotypes.py -i {input.phase} -s {input.summary} -a alt2 -r $(cat {input.regions}) -o {output}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "md5sum -c {input.md5} && touch {output}"
shell:
    "gunzip {input.tsv} && touch {output.signpost}"
from __future__ import print_function
from Bio import SeqIO
from Bio import Seq
import re
import sys
import csv
import argparse


def load_vcf(handle):
    header = next(handle)
    while header.startswith("##"):
        header = next(handle)
    header = header.strip("#").strip("\n").split("\t")
    reader = csv.DictReader(handle, delimiter="\t", fieldnames=header)
    return reader


__ambiguity_codes = {
    'N': {'A', 'C', 'G', 'T'},
    'V': {'A', 'C', 'G'},
    'H': {'A', 'C', 'T'},
    'D': {'A', 'G', 'T'},
    'B': {'C', 'G', 'T'},
    'M': {'A', 'C'},
    'K': {'G', 'T'},
    'R': {'A', 'G'},
    'Y': {'C', 'T'},
    'W': {'A', 'T'},
    'S': {'C', 'G'},
    'A': {'A'},
    'C': {'C'},
    'G': {'G'},
    'T': {'T'}
}


def get_ambiguity_base(bases):
    bases = set(bases)
    for ambiguity_base, possible_bases in __ambiguity_codes.items():
        if bases == possible_bases:
            return ambiguity_base
    raise Exception("Unrecognises bases: ['" + "','".join(list(bases)) + "']")


def disambiguate_base(base):
    base = str(base)
    return list(__ambiguity_codes[base])


def parse_region(region):
    """
    Parse a region specification.
    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)
    >>> parse_region("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(r",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(r",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input genome fasta')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output masked genome fasta')
    parser.add_argument('-v', '--vcf', required=True,
                        help='filename of vcf listing snps')
    parser.add_argument('-a', '--alternate-only', default=False, action='store_true',
                        help='replace reference base with alternate allele, rather than retaining both (Default: false)')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+", metavar="CHR:START-END",
                        help="Name and range of contigs to mask (Default: entire genome)")
    parser.add_argument('-b', '--boundary', type=int, default=0, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()


def contig_to_int(contig):
    try:
        return(int(contig))
    except ValueError:
        if contig == 'X':
            return max_autosome + 1
        elif contig == 'Y':
            return max_autosome + 2
        elif contig == 'MT':
            return max_autosome + 3
        else:
            return max_autosome + 4


def cmp_contigs(contig):
    return(contig_to_int(contig), contig)


args = parse_args()
regions = dict()
# format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append(
            [region_start, region_end, args.alternate_only])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

genome = dict()
with open(args.input, 'r') as handle:
    for record in SeqIO.parse(handle, "fasta"):
        record.seq = record.seq.tomutable()
        genome[record.id] = record

with open(args.vcf, 'r') as handle:
    vcf = load_vcf(handle)
    for row in vcf:
        snp = True
        ref_base = [row['REF']]
        alt_bases = row['ALT'].split(',')
        for base in alt_bases + ref_base:
            if len(base) > 1:
                snp = False
        if not snp:
            # don't touch indels
            continue

        pos = int(row['POS']) - 1
        if regions:
            alternate_only = None
            try:
                for r in regions[row['CHROM']]:
                    if pos >= r[0] and pos < r[1]:
                        alternate_only = r[2]
            except KeyError:
                pass
            if alternate_only is None:
                # no region hit
                continue
        else:
            # genome wide
            alternate_only = args.alternate_only

        record = genome[row['CHROM']]

        bases = alt_bases
        if not alternate_only:
            bases = bases + disambiguate_base(record.seq[pos])

        record.seq[pos] = get_ambiguity_base(bases)
        genome[row['CHROM']] = record

contigs = list()
max_autosome = 0
for contig in genome.keys():
    contigs.append(contig)
    try:
        if int(contig) > max_autosome:
            max_autosome = int(contig)
    except ValueError:
        pass

contigs = sorted(contigs, key=cmp_contigs)

with open(args.output, 'w') as handle:
    for contig in contigs:
        SeqIO.write(genome[contig], handle, "fasta")
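
The masking above replaces each SNP position with the IUPAC ambiguity code covering all observed alleles, so alignment does not systematically favour the reference strain. A quick illustration, assuming the helper functions defined in the script above are in scope:

# Union of reference and alternate bases maps to a single IUPAC code.
print(get_ambiguity_base(['A', 'G']))  # 'R' (purine: A or G)
print(get_ambiguity_base(['C', 'T']))  # 'Y' (pyrimidine: C or T)
print(disambiguate_base('R'))          # ['A', 'G'] (order may vary)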
from __future__ import print_function
import re
import csv
import argparse

__min_coverage = 5
__override_ratio = 3


def parse_region(region):
    """
    Parse a region specification.
    :param region: String region specification
    :raises argparse.ArgumentTypeError: raises an error if format not recognised
    :returns contig: String contig / chromosome name
    :returns start: integer start position (0-based)
    :returns end: integer end position (1-based)
    >>> parse_region("chr1:1000-2000")
    ("chr1", 1000, 2000)
    """
    region = ''.join(region.split())  # remove whitespace
    region = re.split(':|-', region)
    start = 0
    end = None
    if len(region) < 1:
        raise argparse.ArgumentTypeError(
            "Region must specify a reference name")
    elif len(region) > 3:
        raise argparse.ArgumentTypeError("Region format not recognized")
    else:
        contig = region[0]
        try:
            start = int(re.sub(r",|\.", "", region[1]))
        except IndexError:
            pass
        try:
            end = int(re.sub(r",|\.", "", region[2]))
        except IndexError:
            pass
        finally:
            return contig, start, end


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True,
                        help='filename of input three-way haplotypes tsv')
    parser.add_argument('-o', '--output', required=True,
                        help='filename of output haplotypes tsv')
    parser.add_argument('-s', '--summary', required=True,
                        help='filename of summary generated by build_supplementary_db')
    parser.add_argument('-a', '--alt-reference', required=True, choices=[
                        'alt1', 'alt2'], help='Name of the allele for which alternate is recombined with the reference')
    parser.add_argument('-r', '--region', type=parse_region, nargs="+", metavar="CHR:START-END",
                        help="Name and range of contigs where reference is replaced by alternate reference")
    parser.add_argument('-b', '--boundary', type=int, default=50000, metavar="INT",
                        help="Width of region boundary where both alleles are included")
    return parser.parse_args()

args = parse_args()

signal_alt_ref = "signal_" + args.alt_reference
base_alt_ref = "base_" + args.alt_reference
signal_alt = "signal_" + "alt1" if args.alt_reference == "alt2" else "alt2"
base_alt = "base_" + "alt1" if args.alt_reference == "alt2" else "alt2"

regions = dict()
# format: { chr : [start, end, alternate_only] }
if args.region:
    for region in args.region:
        try:
            regions[region[0]]
        except KeyError:
            regions[region[0]] = []
        region_start, region_end = region[1], region[2]
        if region_start > 0:
            region_start = region_start + args.boundary // 2
        if region_end is not None:
            region_end = region_end - args.boundary // 2
        regions[region[0]].append([region_start, region_end, True])
        if args.boundary > 1:
            if region[1] > 0:
                regions[region[0]].append(
                    [region[1] - args.boundary // 2, region[1] + args.boundary // 2, False])
            if region[2] is not None:
                regions[region[0]].append(
                    [region[2] - args.boundary // 2, region[2] + args.boundary // 2, False])

with open(args.input, 'r') as haplotype_handle, open(args.summary, 'r') as summary_handle, open(args.output, 'w') as output:
    print("read\tgenotype\tsignal_ref\tsignal_alt\tsignal_coverage\tbase_ref\tbase_alt\tbase_coverage", file=output)
    haplotype_reader = csv.DictReader(haplotype_handle, delimiter="\t")
    summary_reader = csv.DictReader(summary_handle, delimiter="\t", fieldnames=[
                                    'read', 'chr', 'start', 'end', 'qual'])
    summary = next(summary_reader)
    while True:
        try:
            haplotype = next(haplotype_reader)
            while summary['read'] < haplotype['read']:
                summary = next(summary_reader)
                while summary['read'] > haplotype['read']:
                    haplotype = next(haplotype_reader)
            # now they should be the same read
            for key in haplotype.keys():
                if (key.startswith("signal") or key.startswith("base")) and not key.endswith("coverage"):
                    haplotype[key] = float(haplotype[key])
        except StopIteration:
            break

        use_ref = True
        use_alt_ref = False
        if summary['chr'] in regions:
            start = int(summary['start'])
            end = int(summary['end'])
            for region in regions[summary['chr']]:
                # region[1] is None when the region was given without an end
                if region[0] <= start and (region[1] is None or region[1] >= end):
                    use_alt_ref = True
                    if region[2]:
                        # alternate only
                        use_ref = False

        if use_ref and use_alt_ref:
            if haplotype['signal_ref'] + haplotype['base_ref'] > haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_alt_ref = False
            elif haplotype['signal_ref'] + haplotype['base_ref'] < haplotype[signal_alt_ref] + haplotype[base_alt_ref]:
                use_ref = False
            else:
                # tie: keep the reference whose signal- and base-space counts
                # agree more closely; a smaller spread is likely more reliable
                if abs(haplotype['signal_ref'] - haplotype['base_ref']) < abs(haplotype[signal_alt_ref] - haplotype[base_alt_ref]):
                    use_alt_ref = False
                else:
                    use_ref = False

        if use_ref:
            signal_ref = 'signal_ref'
            base_ref = 'base_ref'
        else:
            signal_ref = signal_alt_ref
            base_ref = base_alt_ref

        # remove excess counts - how?
        signal_coverage = haplotype[signal_ref] + haplotype[signal_alt]
        base_coverage = haplotype[base_ref] + haplotype[base_alt]

        # determine genotype
        signal_call = "fail"
        if signal_coverage > 0:
            signal_ratio = haplotype[signal_ref] / signal_coverage
            if signal_ratio > 0.5:
                signal_call = "ref"
            elif signal_ratio < 0.5:
                signal_call = "alt"

        base_call = "fail"
        if base_coverage > 0:
            base_ratio = haplotype[base_ref] / base_coverage
            if base_ratio > 0.5:
                base_call = "ref"
            elif base_ratio < 0.5:
                base_call = "alt"

        if base_coverage < __min_coverage and signal_coverage < __min_coverage:
            genotype = "fail"
        elif base_call == signal_call:
            genotype = base_call
        elif signal_call == "fail":
            genotype = base_call
        elif base_call == "fail":
            genotype = signal_call
        elif base_coverage > __override_ratio * signal_coverage and base_call != "fail":
            genotype = base_call
        elif signal_coverage > __override_ratio * base_coverage and signal_call != "fail":
            genotype = signal_call
        elif base_ratio - 0.5 > __override_ratio * (signal_ratio - 0.5) and base_ratio > 0.5:
            genotype = base_call
        elif signal_ratio - 0.5 > __override_ratio * (base_ratio - 0.5) and signal_ratio > 0.5:
            genotype = signal_call
        else:
            genotype = "fail"

        print("{read}\t{genotype}\t{signal_ref}\t{signal_alt}\t{signal_coverage}\t{base_ref}\t{base_alt}\t{base_coverage}".format(
            read=haplotype['read'],
            genotype=genotype,
            signal_ref=haplotype[signal_ref],
            base_ref=haplotype[base_ref],
            signal_alt=haplotype[signal_alt],
            base_alt=haplotype[base_alt],
            signal_coverage=haplotype['signal_coverage'],
            base_coverage=haplotype['base_coverage']
        ), file=output)
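The genotype cascade above is easier to see in isolation. The sketch below is a simplified, hypothetical restatement rather than the pipeline's own code: MIN_COV and OVERRIDE stand in for the script's __min_coverage and __override_ratio constants, which are defined earlier in the file, and the values used here are placeholders.

# Hypothetical, simplified restatement of the consensus cascade above.
# MIN_COV and OVERRIDE are placeholder values, not the pipeline's constants.
MIN_COV, OVERRIDE = 10, 3

def consensus(base_call, signal_call, base_cov, signal_cov,
              base_ratio, signal_ratio):
    if base_cov < MIN_COV and signal_cov < MIN_COV:
        return "fail"          # too little evidence overall
    if base_call == signal_call:
        return base_call       # the two callers agree
    if signal_call == "fail":
        return base_call       # only one caller produced a call
    if base_call == "fail":
        return signal_call
    if base_cov > OVERRIDE * signal_cov:
        return base_call       # one caller has far more coverage
    if signal_cov > OVERRIDE * base_cov:
        return signal_call
    if base_ratio - 0.5 > OVERRIDE * (signal_ratio - 0.5) and base_ratio > 0.5:
        return base_call       # one caller is far more confident
    if signal_ratio - 0.5 > OVERRIDE * (base_ratio - 0.5) and signal_ratio > 0.5:
        return signal_call
    return "fail"

# e.g. consensus("ref", "alt", 40, 8, 0.9, 0.4) returns "ref": the calls
# disagree, but base-space coverage is more than 3x signal-space coverage.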
shell:
    "wget -q {params.url} -O {output}"
shell:
    "wget -q {params.url} -O {output}"
shell:
    "gunzip {input}"
shell:
    "gunzip {input}"
shell:
    "gunzip {input}"
shell:
    "bwa index {input} &> {params.log}"
shell:
    "samtools index {input}"
shell:
    "wget -q {params.url} -O {output}"
shell:
    "python mask_genome_variants.py -i {input.fasta} -o {output} -v {input.vcf}"
shell:
    "Rscript ensembl_gtf_to_tsv.R {input}"
shell:
    "python tsv_to_region.py {input} 10000 > {output}"
shell:
    "python count_cpg.py {input} > {output}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "md5sum -c {input.md5} && touch {output}"
shell:
    "mkdir -p {output} && "
    "tar xzf {input.tar} -C {output}"
shell:
    "find {input} -mindepth 2 -type d -links 2  -exec mv -t {input} {{}} \\+ && "
    "find {input} -name '*.basecalled.*.tar.gz'  -exec rm {{}} \\+ && "
    "for i in $(find {input} -name '*.tar.gz'); do "
    "  tar -xzf $i -C {input} && rm $i; "
    "done && "
    "touch {output}"
shell:
    "mkdir -p {output.outdir} && "
    "for i in {input.indir}; do mv $i {output.outdir}; done && "
    "touch {output.outdir_clean}"
shell:
    "read_fast5_basecaller.py -c r94_450bps_linear.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"
shell:
    "read_fast5_basecaller.py -c r941_450bps_linear_prom.cfg -o fastq -i {input.indir} -r -s {params.outdir} -t {threads} -q 0"
shell:
    "find {input} -name '*.fastq' -exec cat {{}} \\+ > {output}"
shell:
    "bwa mem -t {threads} {input.genome} {input.reads} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output} &> {log}"
shell:
    "nanopolish index -d {input.fast5} {input.reads} &> {params.log}"
shell:
    "nanopolish call-methylation -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} > {output}"
shell:
    "nanopolish phase-reads -t {threads} -r {input.reads} -b {input.bam} -g {input.genome} {input.vcf} | "
    "samtools sort -T {output}.samtools.tmp -@ {threads} -o {output}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
shell:
    "Rscript render_notebook.R {input.notebook} &> {params.log}"
script:
    "prepare_visualisation.R"
suppressMessages(suppressPackageStartupMessages(library(GenomicRanges)))
suppressMessages(suppressPackageStartupMessages(library(tidyverse)))
suppressMessages(suppressPackageStartupMessages(library(data.table)))

gtf <- rtracklayer::import("../genome_data/Mus_musculus.GRCm38_90.chr.gtf")
knownGene <- gtf %>%
  as_data_frame() %>%
  rename(chr=seqnames) %>%
  select(chr, start, end, strand, type, gene_id, gene_name, exon_number) %>%
  filter(type %in% c("exon", "five_prime_utr", "three_prime_utr")) %>%
  filter(end > start) %>%
  mutate(type=fct_recode(type, utr="three_prime_utr", utr="five_prime_utr")) %>%
  unique() %>%
  group_by(gene_name, start) %>%
  filter(end == min(end)) %>%
  ungroup() %>%
  arrange(gene_name, start, end) %>%
  makeGRangesFromDataFrame(keep.extra.columns=TRUE)
save(knownGene, file="../RData/knownGene.RData")

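# NB: for the reverse runs below, the paternal column precedes the maternal
# one (the opposite of the forward runs); this presumably mirrors the column
# layout of all_runs_with_reverse_coverage.tsv.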
rna_seq <- read_tsv("../rna_seq/all_runs_with_reverse_coverage.tsv", 
                    col_names=c("chr", "pos", "fwd1_mat", "fwd1_pat", 
                                "fwd2_mat", "fwd2_pat", "fwd3_mat", "fwd3_pat", 
                                "fwd4_mat", "fwd4_pat", "rev1_pat", "rev1_mat", 
                                "rev2_pat", "rev2_mat", "rev3_pat", "rev3_mat", 
                                "rev4_pat", "rev4_mat"),
                    col_types = 'ciiiiiiiiiiiiiiiii') %>%
  mutate(fwd_mat=(fwd1_mat+fwd2_mat+fwd3_mat+fwd4_mat)/4,
         fwd_pat=(fwd1_pat+fwd2_pat+fwd3_pat+fwd4_pat)/4,
         rev_pat=(rev1_pat+rev2_pat+rev3_pat+rev4_pat)/4,
         rev_mat=(rev1_mat+rev2_mat+rev3_mat+rev4_mat)/4) %>% 
  select(chr, pos, fwd_mat, fwd_pat, rev_mat, rev_pat)
save(rna_seq, file="../RData/rna_seq_with_reverse.RData")
suppressMessages(library(tidyverse))

min_coverage <- 5
args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

haplotype_df <- read_tsv(infile, col_types='ccddiddi') %>%
  mutate(signal_coverage = signal_ref + signal_alt,
         signal_ratio = signal_ref / signal_coverage,
         base_coverage = base_ref + base_alt,
         base_ratio=base_ref / base_coverage) %>% 
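  # "pass" below means both callers place the read on the same side of 0.5;
  # the other labels record which caller (signal or base) looks unreliable.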
  mutate(info=case_when(.$signal_coverage < min_coverage & .$base_coverage < min_coverage ~ "low_coverage",
                        .$signal_ratio < 0.5 & .$base_ratio < 0.5 ~ "pass",
                        .$signal_ratio > 0.5 & .$base_ratio > 0.5 ~ "pass",
                        .$base_coverage > 3*.$signal_coverage ~ "signal_low_coverage",
                        .$signal_coverage > 3*.$base_coverage ~ "base_low_coverage",
                        abs(0.5-.$signal_ratio)*3 < abs(0.5-.$base_ratio) ~ "signal_uncertain",
                        abs(0.5-.$base_ratio)*3 < abs(0.5-.$signal_ratio) ~ "base_uncertain",
                        TRUE ~ "fail"),
         base_genotype=ifelse(base_ratio>0.5, "ref", "alt"),
         signal_genotype=ifelse(signal_ratio>0.5, "ref", "alt"),
         genotype="fail",
         genotype=ifelse(startsWith(info, "signal") | info == "pass", base_genotype, genotype),
         genotype=ifelse(startsWith(info, "base"), signal_genotype, genotype)) %>%
  rename(read_name=read)
save(haplotype_df, file=outfile)
suppressMessages(library(tidyverse))

args = commandArgs(trailingOnly=TRUE)
infile <- args[1]
outfile <- args[2]

summary_df <- read_tsv(infile,
                       col_names=c("read_name", "chr", "start", "end", "qual"), 
                       col_types='cciid',
                       skip=1) %>% 
  unique()
save(summary_df, file=outfile)
rmd_files <- commandArgs(trailingOnly = TRUE)
purrr::map(rmd_files, 
           ~ rmarkdown::render(., 
                               output_format='html_document', 
                               output_file=sub("Rmd$", "html", .)))
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "md5sum -c {input.md5} && touch {output}"
shell:
    "trim_galore --phred33 --fastqc --gzip --paired -o {params.outdir} {input.r1} {input.r2} &> {params.log}"
shell:
    "cp {input} {output}"
shell:
    "cd $(dirname {input.genome}) && "
    "SNPsplit_genome_preparation --vcf_file {input.vcf} --strain CAST_EiJ --reference_genome ./ &> {params.log} && "
    "cd ../scripts/"
shell:
    "cat {input.fasta} > {output.fasta} && "
    "mv {input.snp} {output.snp} && "
    "rm -r {params.tempdir}"
shell:
    "hisat2-build {input} {params.base} &> {params.log}"
shell:
    "hisat2 -p {threads} --no-softclip -x {params.base} -1 {input.r1} -2 {input.r2} | "
    "samtools sort -n -@ {threads} -T {output}.samtools.tmp -o {output} &> {log}"
shell:
    "SNPsplit --snp_file {input.snp} --paired --no_sort {input.bam} -o ../rna_seq/ &> {params.log}"
shell:
    "samtools sort -T {input}.samtools.tmp -@ {threads} -o {output} {input} &> {log}"
shell:
    "samtools depth {input.bam} > {output}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "wget -q -O {output} {params.url}"
shell:
    "cp {input} {output}"
shell:
    "cd $(dirname {input}) && "
    "bismark_genome_preparation --bowtie2 ./ &> {params.log} && "
    "cd ../scripts/"
shell:
    "bismark --gzip --bam --bowtie2 -p {threads} -B {params.basename} "
    "-o ../bisulfite/ {params.genome} -1 {input.r1} -2 {input.r2} &> {params.log}"
shell:
    "samtools sort -n -@ {threads} -T {input}.samtools.tmp -o {output} {input} &> {log}"
shell:
    "SNPsplit --paired --bisulfite --snp_file {input.snp} --no_sort {input.bam} -o ../bisulfite/ &> {params.log}"
shell:
    "bismark_methylation_extractor --ignore 13 --paired-end --multicore {threads} "
    "--comprehensive --merge_non_CpG --report --output $(dirname {output.txt}) --gzip "
    "--bedGraph {input} &> {params.log}"
shell:
    "python summarize_bisulfite_methylation.py {output} {input}"
shell:
    "python summarize_bisulfite_methylation.py {output} {input}"
shell:
    "python summarize_bisulfite_methylation.py {output} {input}"
from __future__ import print_function
import csv
import sys
import pickle
import copy


def get_read_id(read, contig, position, suppdb):
    if read in suppdb:
        for i, alignment in enumerate(suppdb[read]):
            chr, start, end = alignment
            if contig == chr and position >= start and position <= end:
                return "{}_{}".format(read, i)
        raise Exception("Position {}:{} not internal to any alignment of read {} ({})".format(
            contig, position, read, suppdb[read]))
    else:
        return read

bam_fn = sys.argv[1]
if len(sys.argv) > 2:
    meth_fn = sys.argv[2]
    handle = open(meth_fn, 'r')
else:
    print("Reading from stdin.", file=sys.stderr)
    handle = sys.stdin

with open("{}.suppdb".format(bam_fn), 'rb') as db_handle:
    suppdb = pickle.load(db_handle)

reader = csv.DictReader(handle, delimiter="\t")
fieldnames = reader.fieldnames
if "strand" in reader.fieldnames:
    fieldnames = copy.copy(fieldnames)
    del fieldnames[fieldnames.index("strand")]
    del_strand = True
else:
    del_strand = False
writer = csv.DictWriter(sys.stdout, delimiter="\t",
                        fieldnames=fieldnames)
writer.writeheader()
for row in reader:
    row["read_name"] = get_read_id(
        row["read_name"], row["chromosome"], int(row["start"]), suppdb)
    if del_strand:
        del row["strand"]
    try:
        writer.writerow(row)
    except Exception:
        print(row, file=sys.stderr)
        raise
handle.close()
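The pickled suppdb maps a read name to its list of alignment intervals; reads with supplementary alignments get an index suffix so that methylation calls from different alignments of the same read stay separate. A toy illustration, with made-up read names and coordinates:

# Toy illustration of get_read_id (made-up read names and coordinates).
suppdb = {"read_A": [("chr1", 0, 5000), ("chr7", 10000, 16000)]}
# read_A has two alignments, so its calls are tagged by alignment index:
#   get_read_id("read_A", "chr7", 12000, suppdb)  ->  "read_A_1"
# read_B has no supplementary alignments, so its name passes through:
#   get_read_id("read_B", "chr1", 100, suppdb)    ->  "read_B"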
from __future__ import print_function
import sys
import csv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--methylation', required=True,
                    help="Methylation data, split by supplementaries, sorted by genomic location")
parser.add_argument('-p', '--phase', required=True,
                    help="Phased reads (haplotyping data)")
args = parser.parse_args()

meth_fn = args.methylation
haplotype_fn = args.phase

haplotypes = dict()
genotypes = set(["fail"])
with open(haplotype_fn, 'r') as haplotype_handle:
    reader = csv.DictReader(haplotype_handle, delimiter="\t")
    for row in reader:
        haplotypes[row["read"]] = row["genotype"]
        genotypes.add(row["genotype"])


def get_genotype(read_name):
    try:
        return haplotypes[read_name]
    except KeyError:
        return "fail"

with open(meth_fn, 'r') as meth_handle:
    reader = csv.DictReader(meth_handle, delimiter="\t")
    out_handles = list()
    writers = dict()
    for genotype in genotypes:
        handle = open("{}.{}.tsv".format(meth_fn, genotype), 'w')
        out_handles.append(handle)
        writers[genotype] = csv.DictWriter(
            handle, fieldnames=reader.fieldnames, delimiter="\t")
        writers[genotype].writeheader()
    for row in reader:
        writers[get_genotype(row["read_name"])].writerow(row)
    for handle in out_handles:
        handle.close()
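One output file is opened per genotype observed in the phasing table, plus the catch-all "fail" genotype for reads missing from it, with the genotype appended to the methylation filename: hypothetically, passing -m meth.tsv -p phased.tsv would produce meth.tsv.ref.tsv, meth.tsv.alt.tsv and meth.tsv.fail.tsv.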
import sys
import csv
import gzip
from functools import partial
__true = "+"
__false = "-"


def add(summary, chr, pos, meth):
    try:
        summary[chr]
    except KeyError:
        summary[chr] = dict()
    try:
        summary[chr][pos]
    except KeyError:
        summary[chr][pos] = [0., 0.]
    if meth:
        summary[chr][pos][0] += 1
    summary[chr][pos][1] += 1
    return summary


def summarize(in_handle, summary=None):
    if summary is None:
        summary = dict()
    next(in_handle)  # skip the header line
    reader = csv.reader(in_handle, delimiter="\t")
    for line in reader:
        meth = line[1] == __true
        chr = line[2]
        pos = int(line[3])
        summary = add(summary, chr, pos, meth)
    return summary


def write_out_summary(output_fn, summary):
    with open(output_fn, 'w') as out_handle:
        for chr, positions in summary.items():
            for pos, data in positions.items():
                meth, total = tuple(data)
                out_handle.write("{}\t{}\t{:.3f}\t{}\t{}\n".format(
                    chr, pos, meth / total, meth, total))


output_fn = sys.argv[1]
input_fns = sys.argv[2:]

summary = None
for input_fn in input_fns:
    if input_fn.endswith("gz"):
        if sys.version_info >= (3,):
            open_fun = partial(gzip.open, mode='rt')
        else:
            open_fun = gzip.open
    else:
        open_fun = open
    with open_fun(input_fn) as handle:
        summary = summarize(handle, summary)

write_out_summary(output_fn, summary)
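Each call is folded into a per-chromosome, per-position pair of (methylated, total) counts, and each output line reports the methylation frequency alongside both counts. A minimal sketch of the accumulator, with fabricated calls:

# Fabricated example of the per-position (methylated, total) accumulator.
summary = {}
for chrom, pos, meth in [("chr1", 3000000, True),
                         ("chr1", 3000000, False),
                         ("chr1", 3000000, True)]:
    counts = summary.setdefault(chrom, {}).setdefault(pos, [0., 0.])
    if meth:
        counts[0] += 1
    counts[1] += 1
# summary == {"chr1": {3000000: [2.0, 3.0]}}
# written out as: chr1  3000000  0.667  2.0  3.0  (tab-separated)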
import sys
import csv
tsv_fn = sys.argv[1]
if len(sys.argv) > 2:
    overhang = int(sys.argv[2])
else:
    overhang = 0
output = []
with open(tsv_fn, 'r') as tsv_handle:
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        output.append(
            "{}:{}-{}".format(row["chr"], int(row["start"]) - overhang, int(row["end"]) + overhang))

print(" ".join(output))
URL: https://github.com/scottgigante/haplotyped-methylome
License: GNU General Public License v3.0