AMBIC Epigenomics project analysis


This is the git repository for the AMBIC Epigenomics project, a collaboration between the Betenbaugh and Timp labs at JHU.
Use this repository to process the raw data prior to analysis.
We are working on uploading the scripts used in our analysis of the data.

Getting Started

Prerequisites

The list of packages necessary for nanopore analysis:

R packages:

The list of packages necessary for ATAC-seq analysis:

Installing

All of these tools are available via conda (environment.yml) or from their respective websites.
To install via conda:

conda env update -n base --file ./environment.yml # to install to the base environment
conda env create -f ./environment.yml # to create a new environment

Snakemake pipeline

We use snakemake to build reproducible and easy-to-follow pipelines.
You can either install snakemake and follow the steps below to run the pipeline, or read the snakemake scripts directly to determine which commands to run.
Note: the snakemake workflow may not run end-to-end everywhere, as we have only tested it in limited settings; you may need to edit the code to make it work.

Preparing snakemake

To use snakemake, first modify the snakemake_config.yml file as appropriate for your environment.

Second, modify nanopore_sample_info.csv and atacseq_sample_info.csv as necessary.
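
As a loose sketch only (the authoritative keys are whatever snakemake_config.yml in this repo defines; the field names below are assumptions based on how the rules reference a working directory and a code directory, not the real schema):

# hypothetical sketch -- consult snakemake_config.yml for the actual keys
cat > snakemake_config.yml <<'EOF'
workdir: /path/to/workdir           # assumed key: top-level working directory
codedir: /path/to/ambic-epigenome   # assumed key: path to this repository clone
EOF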

For nanopore sequencing data, we start from basecalled data;
set up the data as follows (example staging commands appear after these lists):

  • fastq name: [sample].fastq.gz

  • summary name (optional): [sample].summary.txt

  • move fastq and summary files to [workdir]/reads

  • fast5 files in [workdir]/reads/[sample]/

For ATAC-seq data:

  • fastq name: [sample].fastq.gz

  • move fastq files into [workdir]/data/atacseq/fastq/
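
For example, for a hypothetical nanopore sample sampleA and ATAC-seq sample sampleB (names and source paths are illustrative):

# stage nanopore data
mkdir -p [workdir]/reads/sampleA
mv sampleA.fastq.gz sampleA.summary.txt [workdir]/reads/
mv fast5_dir/*.fast5 [workdir]/reads/sampleA/
# stage ATAC-seq data
mkdir -p [workdir]/data/atacseq/fastq
mv sampleB.fastq.gz [workdir]/data/atacseq/fastq/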

Running snakemake

To run snakemake, use the snakemake command in the root directory of this repo.

snakemake --cores 16

To run specific parts of the pipeline, use any of the following (currently broken):

snakemake --cores 16 parse_nanopore
snakemake --cores 16 nanopore_methylation
snakemake --cores 16 nanopore_sv
snakemake --cores 16 parse_atacseq

For specific outputs,

snakemake --cores 16 path/to/output

e.g.:

snakemake --cores 16 methylation/[sample].cpg.meth.tsv.gz

will perform methylation calling on [sample], automatically running any required upstream steps, using 16 cores.
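
To preview which jobs such a target would trigger without executing anything, snakemake's dry-run flag can be used first:

snakemake -n --cores 16 methylation/[sample].cpg.meth.tsv.gz  # dry run, prints the job plan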

Code Snippets

The snippets below are the helper scripts and Snakemake shell directives extracted from the ATAC-seq and nanopore workflows in this repository.

import sys
import argparse

def parse_args():
    '''
    Parses command-line options
    '''
    parser = argparse.ArgumentParser(description='Keeps reads below an alignment-count cutoff and discards all others')
    parser.add_argument('-k', type=int, required=True, help='Alignment number cutoff')
    parser.add_argument('--paired-end', dest='paired_ended', action='store_true', help='Data is paired-end')
    args = parser.parse_args()

    return args.k, args.paired_ended


if __name__ == "__main__":
    '''
    Runs the filtering step of choosing multimapped reads
    '''

    alignment_cutoff, paired_ended = parse_args()

    if paired_ended:
        # each fragment contributes two records, so double the cutoff
        alignment_cutoff *= 2

    # Buffer all SAM lines sharing the current query name so the
    # whole group can be kept or discarded together
    current_reads = []
    current_qname = ''

    for line in sys.stdin:

        read_elems = line.strip().split('\t')

        # Pass header lines straight through
        if read_elems[0].startswith('@'):
            sys.stdout.write(line)
            continue

        if read_elems[0] == current_qname:
            # Same query name: add line to the current group
            current_reads.append(line)
        else:
            # New query name: emit the previous group only if it has
            # fewer alignments than the cutoff, otherwise discard it;
            # further filtering is done downstream with samtools
            if len(current_reads) < alignment_cutoff:
                for read in current_reads:
                    sys.stdout.write(read)
            current_reads = [line]
            current_qname = read_elems[0]

    # Flush the final group of reads at EOF
    if current_reads and len(current_reads) < alignment_cutoff:
        for read in current_reads:
            sys.stdout.write(read)
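
This script is designed to sit inside a samtools pipe, reading name-sorted SAM on stdin and writing filtered SAM to stdout; the ATAC-seq rules below use it exactly this way. A minimal standalone invocation (file names are illustrative):

samtools sort -n aligned.bam | samtools view -h - | \
python scripts/assign_multimappers.py -k 4 | \
samtools view -F 1804 -b - > filtered.bam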
import os
import sys
import argparse
import gzip

def parseArgs() :
    # dir of source code
    srcpath=sys.argv[0]
    srcdir=os.path.dirname(os.path.abspath(srcpath))
    # parser
    parser = argparse.ArgumentParser(description='make wig file')
    parser.add_argument('-v','--verbose', action='store_true',default=False,
            help="verbose output")
    parser.add_argument('-i','--input',type=str,required=False,default="stdin",
            help="input file (methylation freq, default stdin)")
    parser.add_argument('-o','--output',type=argparse.FileType('w'),required=False, 
            default=sys.stdout,help="output file (default stdout)")
    parser.add_argument('-t','--threshold',type=int,required=False,
            default=3,help="coverage threshold (default 3)")
    parser.add_argument('-c','--coverage',type=str,required=False,
            help="output path for coverage file (optional)")
    args = parser.parse_args()
    args.srcdir=srcdir
    return args

if __name__=="__main__":
    args=parseArgs()
    if args.coverage :
        covout = open(args.coverage,'w')
        print("track type=wiggle_0",file=covout)
        def printfxn(prechrom,chrom,pos,meth,cov,out,covout) :
            # start a new variableStep block whenever the chromosome changes
            if chrom != prechrom :
                prechrom = chrom
                print("variableStep chrom={}".format(chrom),file=out)
                print("variableStep chrom={}".format(chrom),file=covout)
            print("{}\t{}".format(pos,meth/cov),file=out)
            print("{}\t{}".format(pos,cov),file=covout)
            return prechrom
    else :
        covout = None
        def printfxn(prechrom,chrom,pos,meth,cov,out,covout) :
            if chrom != prechrom :
                prechrom = chrom
                print("variableStep chrom={}".format(chrom),file=out)
            print("{}\t{}".format(pos,meth/cov),file=out)
            return prechrom
    # open input: stdin, gzipped file, or plain text
    if args.input == "stdin" :
        if args.verbose : print("reading from stdin",file=sys.stderr)
        in_fh = sys.stdin
    elif args.input.split(".")[-1] == "gz" :
        if args.verbose : 
            print("reading gzipped input {}".format(args.input),file=sys.stderr)
        in_fh = gzip.open(args.input,'rt')
    else :
        if args.verbose : print("reading {}".format(args.input),file=sys.stderr)
        in_fh = open(args.input)
    i = 0
    chrom=""
    # header
    print("track type=wiggle_0",file=args.output)
    for line in in_fh :
        fields = line.strip().split("\t")
        meth = int(fields[3])
        unmeth = int(fields[4])
        cov = meth+unmeth
        # skip positions below the coverage threshold
        if cov < args.threshold : continue
        chrom = printfxn(chrom,fields[0],fields[1],meth,cov,args.output,covout)
        i += 1
        if args.verbose : 
            if i % 100000 == 0 :
                print("processed {} lines".format(i),file=sys.stderr)

    in_fh.close()
    if covout :
        covout.close()
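
The expected input is a tab-separated methylation frequency table whose first two columns are chromosome and position and whose fourth and fifth columns are methylated and unmethylated counts, as read in the loop above. A standalone call mirroring the Snakemake rule further down (file names are illustrative) would be:

python scripts/makeWig.py -v -i [sample].cpg.methfreq.txt.gz \
    -o [sample].methylation.wig -c [sample].coverage.wig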
shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"
shell:
	"bowtie2-build {input} {params} --threads {threads} &> {log}"
shell: 
	"bowtie2 --very-sensitive -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - | "
	"samtools sort -o {output} -"
shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"
shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"
shell:
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | "
	"bgzip > {output}"
shell:
	"gunzip -c {input} | "
	"sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | "
	"bgzip > {output}"
shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/allsamples "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"{params}/scripts/atacseq_bed_to_saf.sh "
	"{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}"
shell:
	"featureCounts --verbose -a {input.peaks} "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"
shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"
shell:
  "bowtie2-build {input} {params} --threads {threads} &> {log}"
shell: 
	"bowtie2 -k 4 -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - > {output}"
shell:
	"samtools sort -n -T "
	"{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | "
	"samtools view -h - | "
	"python {params}/scripts/assign_multimappers.py -k 4 | "
	"samtools view -F 1804 -Su - | "
	"samtools sort -T "
	"{wildcards.dir}/bam/{wildcards.sample}.mpsorting "
	"-o {output}"
shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"
shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"
shell:
	"[ -e {params} ]||mkdir {params} && "
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"
shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/{wildcards.sample} "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"touch {output}"
shell:
	"[ -e {params.tmpdir} ]||mkdir {params.tmpdir} && cat {input} | "
	"sort -u -k1,1 -k2,2n -k3,3n -T {params.tmpdir} | "
	"bedtools merge -i stdin | "
	"awk 'OFS=\"\t\"{{ print $0,\"peak\"NR }}' > {output.peaks} && "
	"{params.codedir}/scripts/atacseq_bed_to_saf.sh "
	"{output.peaks} > {output.saf}"
shell:
	"featureCounts --verbose -a {input.peaks} -O "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"
shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"
shell:
  "bowtie2-build {input} {params} --threads {threads} &> {log}"
shell: 
	"bowtie2 -k 4 -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - > {output}"
shell:
	"samtools sort -n -T "
	"{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | "
	"samtools view -h - | "
	"python {params}/scripts/assign_multimappers.py -k 4 | "
	"samtools view -F 1804 -Su - | "
	"samtools sort -T "
	"{wildcards.dir}/bam/{wildcards.sample}.mpsorting "
	"-o {output}"
shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"
shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"
shell:
	"[ -e {params} ]||mkdir {params} && "
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"
shell:
	"[ -e {params} ]||mkdir {params} && "
	"gunzip -c {input} | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"
shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/allsamples "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"{params}/scripts/atacseq_bed_to_saf.sh "
	"{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}"
shell:
	"featureCounts --verbose -a {input.peaks} -O "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"
shell:
	"ngmlr -x ont --bam-fix "
	"-t {threads} -r {params} -q {input} 2> {log} | "
	"samtools view -q 20 -b - | "
	"samtools sort -T bam/{wildcards.sample}.sorting "
	"-o {output} && "
	"samtools index {output}"
shell:
	"minimap2 -x map-ont -d {output} {input}"
shell:
	"minimap2 --MD -L -t {threads} -a {input} 2> {log} | "
	"samtools view -q 20 -b - | "
	"samtools sort -T bam/{wildcards.sample}.sorting "
	"-o {output} && "
	"samtools index {output}"
shell:
	"nanopolish index --verbose -d {wildcards.pre} "
shell:
	"nanopolish call-methylation -v -t {threads} -q cpg "
	"-g {params} -r {input.fq} -b {input.bam} | " 
	"gzip > {output}"
shell:
	"python {params}/nanopore-methylation-utilities/mtsv2bedGraph.py -g {input.fa} -i {input.meth} | "
	"sort -T tmp -k1,1 -k2,2n | "
	"bgzip > {output} && "
	"tabix -p bed {output}"
shell:
	"python {params}/nanopore-methylation-utilities/convert_bam_for_methylation.py "
	"-t {threads} -c {input.meth} -b {input.bam} | "
	"samtools sort -o {output} && "
	"samtools index {output}"
shell: 
	"python -u {params}/nanopore-methylation-utilities/parseMethylbed.py frequency -v " 
	"-i {input} -m {wildcards.mod} 2> {log} | " 
	"bgzip > {output} && " 
	"tabix -b 2 -e 2 {output}"
shell: 
	"python {params}/scripts/makeWig.py -v -i {input} " 
	"-o {output.methwig} -c {output.covwig} &> {output.log}"
shell: 
	"wigToBigWig {input} {output}"
shell: 
	"sniffles -m {input} -v {output} -t {threads} "
	"--tmp_file sv/{wildcards.sample}.sniffles.tmp " 
	"-s 2  &> {log}"
shell:
	"bcftools sort -o {output} -T sv {input}"
shell:
	"echo {input} | tr \" \" \"\\n\" > {output.raw} && "
	"SURVIVOR merge {output.raw} 1000 1 1 -1 -1 -1 {output.vcf}"
shell:
	"sniffles -m {input.bam} -v {output} -t {threads} "
	"--Ivcf {input.candidates} --tmp_file "
	"sv/{wildcards.sample}.SURVIVORsniffles.tmp " 
	"-s 2 -n -1 --genotype --cluster &> {log}"
shell:
	"echo {input} | tr \" \" \"\\n\" > {output.raw} && "
	"SURVIVOR merge {output.raw} 1000 -1 1 -1 -1 -1 {output.vcf}"
URL: https://github.com/timplab/ambic-epigenome
Name: ambic-epigenome
License: MIT License