Basic ChIP-seq Analysis Workflow with Snakemake


Introduction

This repository contains a Snakemake workflow implementing a basic ChIP-seq analysis pipeline. It was adapted from an existing Snakemake workflow for ChIP-seq data and modified to suit my preferred analysis approach.

This workflow performs the following steps:

  1. Optional: Retrieval of publicly available sequencing data from NCBI GEO/SRA.

  2. Optional: Merge reads from the same sample that were sequenced on separate lanes or sequencing runs.

  3. Align raw reads to a reference genome. The workflow will automatically retrieve any publicly available reference genome and build the bowtie2 index.

  4. Filter aligned reads based on alignment quality and discard multi-mapping reads. The filtering parameters can be customized.

  5. Optional: Filter aligned reads to discard reads aligning to contigs or the mitochondrial genome. This filtering can also be customized.

  6. Optional: Perform peak calling using MACS2.

  7. Generate bigWig files containing z-score normalized read depth to allow data visualization in the genome browser.

  8. Optional: Spike-in normalization for experiments where exogenous control chromatin or cells are added to each sample as an internal control.

Running the workflow

Follow the steps below to run this workflow:

Step 1: Software installation

Ensure that you have a conda-based Python 3 distribution installed. I recommend Mambaforge.

Snakemake and Snakedeploy are used to deploy and run this workflow. These can be installed using Mamba:

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

OR Conda:

conda create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

Step 2: Deploy the workflow

Activate the conda environment:

conda activate snakemake

Create a new directory for your analysis and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

Deploy the workflow:

snakedeploy deploy-workflow https://github.com/tjgibson/NGS-workflow-chipseq . --branch main

This command will create all the files necessary for running this workflow.

Step 3: Entering your sample information

The previous step should have created the file config/units.tsv in your local project directory. This file is a template that can be modified to contain information about the samples you wish to analyze. Before modifying it, take a look at the file to get a sense of how it is formatted. The fields of this file are described below:

sample_name: The name you want to use for each individual sample. I like to use some sort of naming convention like developmental-stage_genotype_treatment_IP_rep1, where each piece of important information about the sample is separated by an underscore. If a single piece of information, such as developmental stage, contains multiple words, separate them with a dash: L3-larva_rep1.

unit_name: If you sequenced the same sample across multiple lanes or sequencing runs, these will be listed as separate units (e.g. control_IP_rep1_A and control_IP_rep1_B). If you only have a single fastq file for a sample, the unit_name should be identical to the sample_name.

fq1: The path to the raw fastq file on your computer. Fastq files can be located anywhere on your computer, but I suggest placing them in a directory called data, raw, or raw_data within your analysis directory. Fastq files should be compressed via gzip to save space.

fq2: If you performed paired-end sequencing, list the second fastq file here. For single-end data, leave this field blank.

sra: This workflow can perform automatic retrieval of publicly available data from the Sequence Read Archive (SRA). Enter the SRA accession number (e.g. SRR0000001) here. If you provide both an SRA number and a local fastq file, the workflow will use the local file.

read_format: This should be set to SE for single-end or PE for paired-end.

call_peaks: This field indicates whether peak calling should be performed using MACS2 and should be set to TRUE or FALSE.

input: If you have control samples to be used during peak calling (e.g. input or IgG), this column indicates which control sample corresponds to each IP sample. Enter the sample_name of the input/control that corresponds to the IP listed in each row; the name entered here must match an entry in the sample_name column. See the example for more details. If call_peaks is set to TRUE and input is blank, MACS2 peak calling will be performed without a control dataset.

peak_type: This specifies the type of peak calling to be performed by MACS2 and should be set to narrow (e.g. transcription factors, H3K27ac) or broad (e.g. H3K27me3, H3K9me3).

sample_group: Indicates the name to use for each group of replicates. This name will be used to name bigWig and peak files for merged replicates. For example, for samples control_IP_rep1 and control_IP_rep2, sample_group could be set to control_IP.
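
To tie these fields together, here is a hypothetical units.tsv describing one input sample and one IP sample that was sequenced on two lanes. The sample names and file paths are placeholders, and the fq2 and sra columns are left blank for this single-end, local-data example; check the template created by snakedeploy for the authoritative column order.

sample_name         unit_name           fq1                              fq2  sra  read_format  call_peaks  input               peak_type  sample_group
control_input_rep1  control_input_rep1  raw_data/input_rep1.fastq.gz               SE           FALSE                           narrow     control_input
control_IP_rep1     control_IP_rep1_A   raw_data/IP_rep1_laneA.fastq.gz            SE           TRUE        control_input_rep1  narrow     control_IP
control_IP_rep1     control_IP_rep1_B   raw_data/IP_rep1_laneB.fastq.gz            SE           TRUE        control_input_rep1  narrow     control_IP

In the actual file the columns are tab-separated; the spacing above is only for readability.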

Step 4: Configuring the workflow

The workflow can be customized to suit your needs by editing the file config/config.yaml. Below is a description of the different options that can be customized within the config file.

Merging of technical replicates

If you have samples that were sequenced on multiple runs or sequencing lanes, change the following lines of the config.yaml file:

mergeReads:
 activate: True

This will result in merging of the fastq files prior to alignment.

Filtering of reads based on alignment to specific chromosomes

When working with Drosophila data (or other organisms with high-quality genome assemblies), I typically discard reads aligning to unplaced contigs or scaffolds. For Drosophila, this means retaining only reads aligning to the major chromosome arms (chr2L, chr2R, chr3L, chr3R, chrX, and chrY). This can be done with the following lines of the config file:

filter_chroms: True
keep_chroms: config/keep_chroms.txt

You can modify the keep_chroms.txt file to reflect a different filtering strategy. Make sure that the chromosome naming convention used in the file matches that of your chosen reference genome (see below). For example, chr2L for UCSC vs. 2L for Ensembl.
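
For reference, the workflow appears to read keep_chroms.txt as a plain list with one chromosome name per line (see the read_tsv call in zscore_normalize_bw.R in the Code Snippets below), so a file implementing the dm6/UCSC filtering described above would simply contain:

chr2L
chr2R
chr3L
chr3R
chrX
chrY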

Setting the reference genome

This is where you set the reference genome to use for read alignment. Give the genome a name and provide a link to the fasta file for the genome. Reference genomes can be found through the UCSC Genome Browser, Ensembl, or organism-specific genome databases (e.g. Flybase). The pipeline will automatically retrieve the reference genome from the provided link and build the bowtie2 index required for alignment. Modify the following lines as necessary for your desired reference genome:

ref_genome:
 name: dm6
 link: https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz

If you want to use a custom genome that is not publicly available, or a bowtie2 index that you have already built, create a folder called resources/ in your working directory and copy your fasta file or bowtie2 index files into this directory. If using a custom fasta file, rename it to ref_genome.fasta.gz. If using a custom bowtie2 index, use the prefix genome for the index files (e.g. genome.1.bt2).
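
For example, with placeholder paths, staging a custom genome or a pre-built index might look like this:

mkdir -p resources
# custom genome sequence:
cp /path/to/my_genome.fa.gz resources/ref_genome.fasta.gz
# or a pre-built bowtie2 index, using the prefix "genome":
cp /path/to/index/genome.*.bt2 resources/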

Setting the spike-in genome

This workflow can optionally perform spike-in normalization if your experiment included exogenous DNA that was added to each sample prior to immunoprecipitation. Similar to above, modify the following lines of the config file:

use_spikeIn: False
spikeIn_genome:
 name: yeast_w303
 link: http://sgd-archive.yeastgenome.org/sequence/strains/W303/W303_SGD_2015_JRIU00000000/W303_SGD_2015_JRIU00000000.fsa.gz

The workflow will combine the reference genome and spike-in genome into a custom fasta file and use this for alignment. Following alignment, the workflow counts the number of reads aligning to each genome (after removing multi-mapping reads, which also removes reads that align to both genomes) and then filters the spike-in reads out of the final BAM file. The percentage of spike-in reads is used to calculate a scaling factor. The values in the z-score normalized bigWig files are multiplied by this scaling factor to produce spike-in normalized bigWig files. The pipeline also outputs a table with the number of reference and spike-in reads, the percentage of spike-in reads, and the scaling factors. If you wish to use a different strategy for spike-in normalization (e.g. skipping the initial z-score normalization), you should be able to use the information in this table to implement alternative spike-in normalizations.

Note: For IP samples that have a corresponding input control, the calculation of the scaling factors takes into account the percentage of spike-in reads in both the input and the IP. This is the strategy we used in this paper, and it worked well for the experiments described therein. If IP samples lack an input, then the scaling factor will be based only on the percentage of spike-in reads in the IP.
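
As a hypothetical illustration of the calculation implemented in compute_scaling_factors.R (shown in the Code Snippets below): if an IP sample contains 5% spike-in reads and its matched input contains 10%, the enrichment is 5 / 10 = 0.5 and the scaling factor is 1 / 0.5 = 2. Scaling factors are then divided by the largest scaling factor across samples, so the final normalized factors fall between 0 and 1.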

Modifying parameters for specific steps

The final section of the config file allows customization of the parameters for individual steps in the workflow:

params:
 bowtie2_index: ""
 bowtie2_align: "-k 2 --very-sensitive --no-mixed --no-discordant -X 5000"
 filter_multireads: "-f bam -F '(mapping_quality >= 30) and ([XS] == null)'"
 bigwigs_ind: "--binSize 10"
 bigwigs_merged: "--binSize 10"
 macs2_call_peaks_narrow: "-g dm --call-summits"
 macs2_call_peaks_broad: "-g dm"
 filter_peaks_by_replicates:
  min_overlap: 10
  replicate_threshold: 2

These can be modified as desired to alter the behavior of these steps. For example, by default the workflow will run bowtie2 with the parameters -k 2 --very-sensitive --no-mixed --no-discordant -X 5000. The parameters listed here in the config file are passed directly to the corresponding step when it is run. The filter_peaks_by_replicates parameters indicate that peaks called by MACS2 will be filtered to include only peaks detected in at least 2 replicates, where a peak is considered shared between replicates if the peaks in each replicate overlap one another by at least 10 bp.
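
As an illustration (values are hypothetical, not recommendations), an analysis of human data might swap the MACS2 effective genome size and use coarser bigWig bins:

params:
 bigwigs_ind: "--binSize 25"
 bigwigs_merged: "--binSize 25"
 macs2_call_peaks_narrow: "-g hs --call-summits"
 macs2_call_peaks_broad: "-g hs"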

Step 5: Running the workflow

You are now ready to run the workflow. Before running it, it is recommended to do a "dry run", in which Snakemake checks that the required input files exist and provides a summary of what the workflow will do. To do a dry run, invoke the following command:

snakemake -n

Examining the output of this dry run can help make sure that sample names and locations of raw files were entered correctly.

If everything looks good, run the workflow with the following command:

snakemake --use-conda -c 1

The -c 1 parameter tells Snakemake to use a single core for running the workflow. Depending on the number of processors/cores available on your computer, this number can be increased, which will speed up execution by allowing multiple steps to run in parallel and allowing steps such as read alignment to use multiple cores.
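
For example, to allow Snakemake to use up to eight cores:

snakemake --use-conda -c 8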

This workflow can also be run on a computing cluster or using cloud computing services. See the relevant sections of the Snakemake documentation on cluster execution and cloud execution for more information.

Navigating the output

Once finished, the workflow will have produced various output files that will be located in a folder named results/ . The results will be organized into the following subdirectories:

aligned_reads: BAM files containing aligned, sorted, filtered reads, along with BAI indices. A number of intermediate BAM files will be generated as the workflow runs. Upon completion, intermediate files will be deleted and only the filtered, sorted BAM files will be retained.

bigwigs: z-score normalized bigWig files for individual replicates and merged replicates. If spike-in normalization is indicated in the config file, spike-in normalized bigWigs will also appear here.

peaks: narrowPeak/broadPeak and bed files containing peaks called by MACS2. This includes peaks called on individual replicates, merged replicates, and peaks filtered by replicates according to the criteria in the config file.

Combining workflows for multiple data types

It may be desirable to process multiple data types (e.g. ChIP-seq, RNA-seq, ATAC-seq) and then integrate those results in downstream analysis. This can be done in a single Snakemake workflow that contains multiple "modules" for the different data types. For an example of how to build a complex workflow including multiple datasets as well as custom downstream analysis, see this repository.
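
As a minimal sketch of the module syntax (the path to the Snakefile and a recent Snakemake version with module support are assumed here), importing this workflow from another Snakefile might look like:

module chipseq:
    snakefile:
        github("tjgibson/NGS-workflow-chipseq", path="workflow/Snakefile", branch="main")
    config:
        config

use rule * from chipseq as chipseq_*

This imports all rules from the ChIP-seq workflow with a chipseq_ prefix, so they can be combined with rules from other modules in the same Snakefile.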

To-do

In the future, I hope to add some of the following features:

  • Add small, test fastq files to repo to allow for automated testing of rules

  • Create an HTML summary file containing read alignment and filtering statistics, number of peaks called, and other useful information about the workflow.

  • Add an optional step to take unaligned reads, BLAST them, and provide a summary of which organisms the unaligned reads likely originated from.

  • Add validation to ensure that unit table and config file are formatted correctly.

Code Snippets

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
indexbase = snakemake.output[0].replace(".1.bt2", "")
shell(
    "bowtie2-build --threads {snakemake.threads} {snakemake.params.extra} "
    "{snakemake.input.reference} {indexbase}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import sys
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""

ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)

if in_contr:
    opt_input = "-c {contr}".format(contr=in_contr)

if out_dir:
    out_dir = "--outdir {dir}".format(dir=out_dir)

if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
    if any(
        out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak"))
        for out in snakemake.output
    ):
        sys.exit(
            "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
            "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
        )
    else:
        if " --broad" in params:
            sys.exit(
                "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
                "Remove --broad option from params if these files are needed.\n"
            )

if any(
    out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
    if "--broad " not in params and not params.endswith("--broad"):
        params += " --broad "

if any(
    out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg"))
    for out in snakemake.output
):
    if all(p not in params for p in ["--bdg", "-B"]):
        params += " --bdg "
else:
    if any(p in params for p in ["--bdg", "-B"]):
        sys.exit(
            "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
        )

shell(
    "(macs2 callpeak "
    "-t {snakemake.input.treatment} "
    "{opt_input} "
    "{out_dir} "
    "-n {out_name} "
    "{params}) {log}"
)
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell

in_file = snakemake.input[0]
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

if in_file.endswith(".sam") and ("-S" not in extra or "--sam-input" not in extra):
    extra += " --sam-input"

shell(
    "sambamba view {extra} -t {snakemake.threads} "
    "{snakemake.input[0]} > {snakemake.output[0]} "
    "{log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("samtools idxstats {snakemake.input.bam} > {snakemake.output[0]} {log}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {snakemake.params} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools merge
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools merge {threads} {snakemake.params} "
    "{snakemake.output[0]} {snakemake.input} "
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_name, out_ext = os.path.splitext(snakemake.output[0])

tmp_dir = snakemake.params.get("tmp_dir", "")
if tmp_dir:
    prefix = os.path.join(tmp_dir, os.path.basename(out_name))
else:
    prefix = out_name

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools sort {extra} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]} "
    "{log}"
)
script:
    "../scripts/fasterq-dump.py"

script:
    "../scripts/fasterq-dump.py"

shell:
    "cat {input} > {output} 2> {log}"

wrapper:
    "v1.1.0/bio/bowtie2/align"

wrapper:
    "v1.1.0/bio/samtools/sort"

wrapper:
    "v1.1.0/bio/samtools/index"

shell:
    "bamCoverage --bam {input.bam} -o {output} -p {threads} {params.extra}"

wrapper:
    "v1.1.0/bio/samtools/merge"

wrapper:
    "v1.1.0/bio/samtools/index"

shell:
    "bamCoverage --bam {input.bam} -o {output} -p {threads} {params.extra}"

script:
    "../scripts/zscore_normalize_bw.R"

script:
    "../scripts/zscore_normalize_bw.R"

script:
    "../scripts/compute_scaling_factors.R"

script:
    "../scripts/spikeIn_normalize_bw.R"

script:
    "../scripts/spikeIn_normalize_bw.R"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

wrapper:
    "v1.1.0/bio/macs2/callpeak"

script:
    "../scripts/filter_peaks_by_replicates.R"

wrapper:
    "v1.1.0/bio/samtools/idxstats"

wrapper:
    "v1.1.0/bio/sambamba/view"

wrapper:
    "v1.1.0/bio/samtools/index"

wrapper:
    "v1.1.0/bio/samtools/idxstats"

shell:
    "samtools view -bh -L {input.keep_chroms} --output-fmt BAM -o {output} {input.bam} 2>> {log}"

wrapper:
    "v1.1.0/bio/sambamba/view"

wrapper:
    "v1.1.0/bio/samtools/index"

wrapper:
    "v1.1.0/bio/samtools/idxstats"
(from line 100 of rules/filter.smk)

shell:
    "curl {params.link} > {output} 2> {log}"
(from line 12 of rules/ref.smk)

shell:
    "curl {params.link} > {output} 2> {log}"
(from line 27 of rules/ref.smk)

shell:
    """
    seqkit replace -p "(.+)" -r "spikeIn_\$1" -o resources/tmp_spikeIn.fasta.gz {input.spikeIn} 2> {log}
    cat {input.ref} resources/tmp_spikeIn.fasta.gz > {output} 2>> {log}
    rm resources/tmp_spikeIn.fasta.gz
    """

shell:
    "mv {input} {output} 2> {log}"
(from line 56 of rules/ref.smk)

shell:
    "seqkit grep -f {input.keep_chroms} {input.genome}"
    " | seqkit fx2tab -nil"
    " | awk -v OFS='\t' '{{print $1, 1, $2}}' > {output}"

wrapper:
    "v1.1.0/bio/bowtie2/build"
(from line 90 of rules/ref.smk)

library(tidyverse)

# import data ------------------------------------------------------------------
# read in all relevant stat files
stat_files <- snakemake@input[["mapping_stats"]]
names(stat_files) <- stat_files

mapping_stats <- stat_files %>%
  map_dfr(read_tsv, .id = "source", col_names = c("chromosome", "chromosome_size", "mapped_reads", "unmapped_reads")) %>%
  mutate(sample_name = gsub("_unireads.idxstats", "", basename(source)))


# add column marking reference or spike-in chromosomes
mapping_stats <- mapping_stats %>%
  mutate(reference_genome = case_when(str_detect(chromosome, "spikeIn")  ~ "spikeIn",
                                      !str_detect(chromosome, "spikeIn")  ~ "reference")) 


# compute % reads mapping to reference and spike-in genomes
mapping_stats <- mapping_stats %>% 
  group_by(sample_name, reference_genome) %>% 
  summarise(total_reads = sum(mapped_reads)) %>%
  pivot_wider(names_from = reference_genome, values_from = total_reads) %>%
  mutate(total_mapped_reads = reference + spikeIn) %>%
  mutate(percent_reference = reference / total_mapped_reads * 100, percent_spikeIn = spikeIn / total_mapped_reads * 100)


# read in unit table
sample_table <- snakemake@config[["units"]] %>% 
  read_tsv() %>% 

  # remove redundant technical replicates from sample_table
  distinct(sample_name, .keep_all = TRUE)


# mark samples as inputs or IPs
sample_table <- sample_table %>% 
  mutate(chip_sample_type = case_when(
    (sample_name %in% .$input) ~ "input",
    !(sample_name %in% .$input) ~ "IP"
  )) %>% 
  mutate(has_input = !is.na(input))


# create new column indicating input/IP pairs
tmp <- sample_table %>% select(input, sample_name) %>% dplyr::rename(IP = sample_name, sample_name = input)
sample_table <- sample_table %>% 
  left_join(tmp, by = "sample_name") %>% 
  mutate(IP_group = case_when(
    is.na(IP) ~ sample_name,
    !is.na(IP) ~ IP
  ))




# prepare sample_table and mapping_stats tables for join
individual_mapping_stats <- mapping_stats %>% 
  select(sample_name, percent_reference, percent_spikeIn)

# join sample_table and mapping stats
individual_scaling_factors <- sample_table %>%
  left_join(individual_mapping_stats, by = "sample_name")


# compute normalization factors for individual samples -------------------------
sample_has_input <- individual_scaling_factors %>% 
  filter(chip_sample_type == "IP") %>% 
  select(has_input) %>% 
  pull()

if (all(sample_has_input)) {

  individual_scaling_factors <- individual_scaling_factors %>% 
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>% 
    mutate(enrichment = IP / input) %>%
    mutate(scaling_factor = 1 / enrichment) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))

} else {
  warning("not all IPs have corresponding input sample, calculating scaling factors based on %spikeIn for IP samples")
  individual_scaling_factors <- individual_scaling_factors %>% 
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>% 
    mutate(scaling_factor = 1 / IP) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
}

individual_scaling_factors %>% 
  write_tsv(snakemake@output[[1]])

# prepare data for merged samples -----------------------------
merged_mapping_stats <- mapping_stats %>% 
  select(sample_name, reference, spikeIn)

# combine replicate reads
merged_mapping_stats <- sample_table %>% 
  left_join(merged_mapping_stats, by = "sample_name") %>% 
  group_by(sample_group) %>% 
  summarise(spikeIn = sum(spikeIn), reference = sum(reference))




# create new column indicating input/IP pairs
tmp <- sample_table %>% select(input, sample_group) %>% dplyr::rename(IP = sample_group, sample_name = input)
sample_table <- sample_table %>% 
  select(-IP, -IP_group) %>% 
  left_join(tmp, by = "sample_name") %>% 
  mutate(IP_group = case_when(
    is.na(IP) ~ sample_group,
    !is.na(IP) ~ IP
  ))



# join sample_table and mapping stats for merged reads
merged_scaling_factors <- merged_mapping_stats %>% 
  left_join(select(sample_table, sample_group, chip_sample_type,IP_group, has_input), by = "sample_group") %>% 
  distinct(.keep_all = TRUE) %>%
  mutate(total_mapped_reads = reference + spikeIn) %>%
  mutate(percent_reference = reference / total_mapped_reads * 100, percent_spikeIn = spikeIn / total_mapped_reads * 100)

# compute normalization factors for merged samples -------------------------
sample_has_input <- merged_scaling_factors %>% 
  filter(chip_sample_type == "IP") %>% 
  select(has_input) %>% 
  pull()
if (all(sample_has_input)) {

  merged_scaling_factors <- merged_scaling_factors %>% 
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>% 
    mutate(enrichment = IP / input) %>%
    mutate(scaling_factor = 1 / enrichment) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))

} else {
  warning("not all IPs have corresponding input sample, calculating scaling factors based on %spikeIn for IP samples")
  merged_scaling_factors <- merged_scaling_factors %>% 
    pivot_wider(names_from = chip_sample_type, values_from = percent_spikeIn, id_cols = IP_group) %>% 
    mutate(scaling_factor = 1 / IP) %>%
    mutate(norm_scaling_factor = scaling_factor / max(scaling_factor))
}

merged_scaling_factors %>% 
  write_tsv(snakemake@output[[2]])
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_mem


log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


# Parse memory
mem_mb = get_mem(snakemake, "MiB")


# Outdir
outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = f"--outdir {outdir}"


# Output compression
compress = ""
mem = f"-m{mem_mb}" if mem_mb else ""

for output in snakemake.output:
    out_name, out_ext = os.path.splitext(output)
    if out_ext == ".gz":
        compress += f"pigz -p {snakemake.threads} {out_name}; "
    elif out_ext == ".bz2":
        compress += f"pbzip2 -p{snakemake.threads} {mem} {out_name}; "


with tempfile.TemporaryDirectory() as tmpdir:
    mem = f"--mem {mem_mb}M" if mem_mb else ""

    shell(
        "(fasterq-dump --temp {tmpdir} --threads {snakemake.threads} {mem} "
        "{extra} {outdir} {snakemake.wildcards.accession}; "
        "{compress}"
        ") {log}"
    )
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(GenomicRanges))

# define functions -------------------------------------------------------------
# function to merge multiple sets of ChIP peaks into a non-overlapping master set
# supports multiple methods
# peaks should be a GRangesList
# method should be either "overlap" or "merge": 
# overlap method will determine overlaps among all peak sets and combine non-overlapping peaks
# merge method will combine overlapping peaks into a single peak
combine_peaks <- function(peaks, method = "overlap", min_overlap = 1L) {
  require(GenomicRanges)

  # check arguments
  if (length(peaks) < 2) {
    stop("Only one set of peaks provided. For merging peaks, provide multiple peak sets.")
  }

  if (!inherits(peaks, "GRangesList") ) {
    stop("one or more peak set is not a Granges object. peaks must be a GRangesList")
  }

  if (!any(method == c("overlap", "merge"))) {
    stop("Invalid method provided. method should be 'overlap' or 'merge' ")
  }

  if (!is_integer(min_overlap)) {
    stop("min_overlap must be an integer")
  }


  # merge peaks using "overlap" method
  if (method == "overlap") {
    combined_peaks <- peaks[[1]]
    for (i in 2:length(peaks)) {
      i_set <- peaks[[i]]
      combined_peaks <- c(combined_peaks,
                          subsetByOverlaps(i_set, combined_peaks, minoverlap = min_overlap, invert = TRUE))
    }

  }
  # merge peaks using "merge" method
  if (method == "merge") {
    combined_peaks <- GenomicRanges::reduce(unlist(peaks))

  }
  mcols(combined_peaks) <- NULL
  return(combined_peaks)

}

# function to build an overlap table
# takes a Granges list object
# combines all nonoverlapping peaks from all samples, then builds a logical table indicating which peaks from each sample overlap each peak on the master list
peak_overlap_table <- function(peaks, method = "overlap", min_overlap = 1L, combine_peaks = TRUE) {

  # check arguments
  if (length(peaks) < 2) {
    stop("Only one set of peaks provided. For merging peaks, provide multiple peak sets.")
  }

  if (!inherits(peaks, "GRangesList") ) {
    stop("one or more peak set is not a Granges object. peaks must be a GRangesList")
  }

  if (!any(method == c("overlap", "merge"))) {
    stop("Invalid method provided. method should be 'overlap' or 'merge' ")
  }

  if (!is_integer(min_overlap)) {
    stop("min_overlap must be an integer")
  }


  # set names
  if(is.null(names(peaks))) {
    names(peaks) <- paste0("sample_", seq((peaks)))
  }


  if (combine_peaks) {
    # get a master set of all nonoverlapping peaks from all peak sets
    all_peaks_gr <- combine_peaks(peaks, method = method, min_overlap = min_overlap)

  } else {
    all_peaks_gr <- peaks[[1]]
  }
  all_peaks_df <- as.data.frame(all_peaks_gr) %>%
    dplyr::select(1:5)

  # build a logical overlap table indicating which peaks were detected in which samples
  for (i in 1:length(peaks)) {
    sample_name <- names(peaks)[i]
    i_peaks <- peaks[[i]]

    all_peaks_df[,sample_name] <- FALSE
    overlaps <- findOverlaps(all_peaks_gr, i_peaks, minoverlap = min_overlap)@from
    all_peaks_df[overlaps,sample_name] <- TRUE
  }
  return(all_peaks_df)
}

# read peaks into overlap table ------------------------------------------------
merged_peak_file <- snakemake@input[["merged_peaks"]]
names(merged_peak_file) <- "merged_peaks"

replicate_peak_files <- snakemake@input[["ind_peaks"]]
names(replicate_peak_files) <- paste0("rep", seq(replicate_peak_files), "_peaks")

peak_files <- c(merged_peak_file, replicate_peak_files)

overlap_table <- peak_files %>% 
  map(rtracklayer::import) %>% 
  GRangesList() %>% 
  peak_overlap_table(min_overlap = as.integer(snakemake@params[["min_overlap"]]))

# filter to include only peaks detected in multiple replicates -----------------
filtered_peaks <- overlap_table %>% 
  mutate(reps_w_peak = rowSums(.[7:ncol(.)])) %>% 
  filter(merged_peaks, reps_w_peak >= snakemake@params[["replicate_threshold"]]) %>% 
  makeGRangesFromDataFrame()

# export filtered peaks to file  -----------------------------------------------
if (snakemake@wildcards[["peak_type"]] == "narrow") {
  keep_cols <- c(1:3, 6:7, 5, 8:11)
} else if (snakemake@wildcards[["peak_type"]] == "broad") {
  keep_cols <- c(1:3, 6:7, 5, 8:10)
} else {
  stop("Peak format should be 'narrow' or 'broad'. Check input file format")
}

output <- merged_peak_file %>% 
  rtracklayer::import() %>% 
  subsetByOverlaps(filtered_peaks) %>% 
  as.data.frame() %>% 
  select(all_of(keep_cols)) %>% 
  mutate(strand = ".")

write_tsv(output, snakemake@output[[1]], col_names = FALSE)
suppressPackageStartupMessages(library(GenomicRanges))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(tidyverse))

# import data --------------------------------------------------------------------------------------------------
# get filename of bigwig that will be used for normalization
bw <- snakemake@input[["bw"]]

# read in normalization factors for merged replicates
scaling_factors <- read_tsv(snakemake@input[["scaling_factors"]])

# get sample name to use for finding the correct scaling factor
file_bn <- gsub(".bw", "", basename(bw))

# rescale bigwig file based on spike-in normalization factor ---------------------------------------------------

# get scaling factor
scaling_factor <- scaling_factors %>%
  filter(IP_group == file_bn) %>%
  select(norm_scaling_factor) %>%
  pull()


# normalize bigwig
scaled.gr <- import(bw) %>%
  as.data.frame() %>%
  mutate(score = score * scaling_factor) %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# set seqinfo of normalized Granges object
seqlevels(scaled.gr) <- seqlevels(BigWigFile(bw))
seqinfo(scaled.gr) <- seqinfo(BigWigFile(bw))


# export normalized bigwig -----------------------------------------------------
export(scaled.gr, snakemake@output[[1]], format = "bw")
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(GenomicRanges))


zscore_bw <- function(bw) {
  require(tidyverse)
  require(rtracklayer)
  require(GenomicRanges)

  # import bigwig file to Granges
  if (typeof(bw) == "character") {
    message("reading bigwig file")
    bw <- import(bw)
  }

  # if using a spike-in, filter the seqlevels to only the reference genome
  if (snakemake@config[["use_spikeIn"]]) {
    message("removing spikeIn chromosomes")
    ref_chroms <- seqlevels(bw)[!grepl("spikeIn_", seqlevels(bw))]
    bw <- keepSeqlevels(bw, ref_chroms, pruning.mode = "coarse")
  }

  if (snakemake@config[["filter_chroms"]]) {
    message("filtering reference chromosomes")
    keep_chroms <- read_tsv(snakemake@config[["keep_chroms"]], col_names = c("chromosome"))
    ref_chroms <- seqlevels(bw)[seqlevels(bw) %in% keep_chroms$chromosome]
    bw <- keepSeqlevels(bw, ref_chroms, pruning.mode = "coarse")
  }


  # for large regions with the same score, expand into equal sized bins
  message("binning genome")
  min_binsize <- min(width(bw))
  all_bins <- tileGenome(seqinfo(bw), tilewidth=min_binsize,cut.last.tile.in.chrom=TRUE)

  message("getting scores for all bins")
  # add the coverage/score for both input and IP
  all_bins <- subsetByOverlaps(all_bins, bw)
  overlaps <- findOverlaps(all_bins, bw)
  all_bins$score[overlaps@from] <- bw$score[overlaps@to]

  # perform z-score normalization
  message("performing z-score normalization")
  all_bins$zscore <- scale(all_bins$score)[,1]
  all_bins$score <- NULL
  all_bins$score <- all_bins$zscore
  all_bins$zscore <- NULL
  # collapse adjacent bins with same score
  collapsed <- unlist(GenomicRanges::reduce(split(all_bins, ~score)))
  collapsed$score <- as.numeric(names(collapsed))
  names(collapsed) <- NULL
  all_bins <- collapsed

  #set seqinfo for z-score normalized version
  seqinfo(all_bins) <- seqinfo(bw)

  return(all_bins)
}


# perform z-score normalization and write new bigwig files ---------------------
zscore.gr <- zscore_bw(snakemake@input[[1]])
export(zscore.gr, snakemake@output[[1]])