Pipeline for the analysis of PE ChIP-seq data


Snakemake pipeline for (Illumina) paired-end ChIP-seq data


Aim

A Snakemake pipeline for the reproducible analysis of paired-end Illumina ChIP-seq data. The desired outputs of this pipeline are:

  • Quality control files to check the quality of the reads. Reads are processed by programs such as fastp and deeptools to produce graphs that are easy to read and quickly inform about the quality of the experiment.

  • Portable visualization files (bigwig files). These files have a .bw extension and are relatively small (compared to sam and bam files). They are used to visualize the read coverage over a genome and can simply be dragged and dropped into a genome viewer such as IGV, JBrowse or the UCSC Genome Browser.

  • Peak information files. These bed files gather the information about the peak calling produced by the MACS2 algorithm. The .narrowPeak and .broadPeak files can be used to visualize the peak locations, like the bigwig files, but also for downstream analyses such as the intersection of peaks between replicates and different treatments.

  • deeptools visualizations of the reads over genomic features such as gene transcription start sites (TSS).

Information about the outputs of the pipeline can be found in the Main outputs section of this documentation.

Content of the repository

  • Snakefile containing the "recipes" of rules to generate the desired outputs from the input files.

  • config/ , a folder containing the configuration files that make the Snakefile adaptable to any input files, genome and rule parameters. Adapt the config file and its reference in the Snakefile. Please also pay attention to the parameters selected for deeptools: for convenience and faster testing, the bins have been set to 1000 bp; do not forget to adapt this to your analysis.

  • Fastq/ , a folder containing subsetted paired-end fastq files used to test the pipeline locally. They were generated with Seqtk: seqtk sample -s100 read1.fq 5000 > sub1.fq and seqtk sample -s100 read2.fq 5000 > sub2.fq (the same seed, -s100, keeps the mate pairs in sync). Raw fastq or fastq.gz files should be placed here before running the pipeline.

  • envs/ , a folder containing the environments needed for the Snakefile to run. To use Snakemake, it is required to create and activate an environment containing snakemake (here: envs/global_env.yaml). The Snakefile will load the required environment for each rule automatically; no modification is needed here unless a rule is added.

  • units.tsv , a tab-separated values file containing the experiment name, the condition of the experiment (control or treatment) and the paths to the fastq files relative to the Snakefile. Change this file according to your samples (see the sketch after this list).

  • rules/ , a folder containing the rules called by the Snakefile to run the pipeline. This improves the clarity of the Snakefile and should make future modifications easier.
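
As a sketch only, such a sample sheet could look like the following; the exact column names (sample, condition, fq1, fq2) are an assumption based on the fq1/fq2 columns described under Configuration file below, and the sample names and paths are hypothetical. Columns must be tab separated (aligned here for readability):

    sample       condition   fq1                             fq2
    ChIP_rep1    treatment   Fastq/chip_rep1_R1.fastq.gz     Fastq/chip_rep1_R2.fastq.gz
    input_rep1   control     Fastq/input_rep1_R1.fastq.gz    Fastq/input_rep1_R2.fastq.gz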

Usage

Conda environment

First, you need to create an environment for Snakemake with the Conda package manager.

  1. Create a virtual environment named "chipseq" from the global_env.yaml file with the following command: conda env create --name chipseq --file envs/global_env.yaml

  2. Then, activate this virtual environment with source activate chipseq (or conda activate chipseq with recent Conda versions)

The Snakefile will then take care of installing and loading the packages and software required by each step of the pipeline.

Test run on small fastq test files

To test this pipeline, you will need Snakemake installed on your local machine. If you have activated the 'chipseq' environment (see above), then Snakemake 5.2.0 is already installed and in use.
If Singularity is available on your machine and you want to use 10 CPUs (--cores 10), then run: snakemake --use-conda --use-singularity --cores 10
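
If Singularity is not available, the test can also be run with Conda environments alone; this assumes every rule defines a conda environment, which the envs/ folder suggests:

    snakemake --use-conda --cores 10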

Configuration file

The config.yaml file specifies custom options for Snakemake:

  • Sample information: units: units.tsv indicates where to find the sample names. Paired-end files should be provided such that the forward read is found in the fq1 column and the reverse read in the fq2 column. Please pay attention to the separation between the filenames: it should be a tab or 4 spaces.

  • The directories:

    • working_dir : a temporary directory that can be removed after the run

    • result_dir : a directory that contains the desired output files at the end of the run

  • The links to the reference genomes used for the mapping. Please note that if you have a favourite genome version that you use in a genome browser, you should use the same one here (or update the reference genome in your genome browser) in order to avoid compatibility problems.

  • Parameters for the various tools in the pipeline. Every tool offers a range of tunable parameters; describing them all here would be too long, but they are documented in each tool's manual.

This configuration file is then used to build parameters in the main Snakefile. A minimal sketch of such a file is shown below.
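
As an illustration only, a trimmed-down config.yaml matching the options above might look like this; the key names (units, working_dir, result_dir, genome_fasta_url, gff_url and the deeptools parameter block) are assumptions, so refer to the files in config/ for the authoritative layout:

    units: units.tsv              # sample sheet with fq1/fq2 columns
    working_dir: temp/            # temporary files, removable after the run
    result_dir: results/          # final outputs
    refs:
      genome_fasta_url: <URL to the reference genome fasta>
      gff_url: <URL to the genome annotation in GFF format>
    bamCoverage:
      binSize: 1000               # 1000 bp for fast tests; adapt to your analysis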

Snakemake execution

The Snakemake workflow management system reads a master file (often called the Snakefile) that lists the steps to be executed and defines their order. It has many rich features; read more in the Snakemake documentation.

Test run

Use the command snakemake -np after activating the chipseq environment to perform a dry run that prints out the rules and commands. If there is no red error message, everything is fine and you can proceed to the Real run.

Real run

Within the folder containing the Snakefile (if unchanged, Snakemake_ChIPseq_PE/), simply type snakemake --use-conda --use-singularity and provide the number of cores with --cores, for instance --cores 10 for ten cores.

Please pay attention to --use-conda : it is required for the installation and loading of the dependencies used by the rules of the pipeline.

Main outputs

The main outputs are:

  • fastqc : provides information about the quality of the input sequences and generates an HTML report to visualize it. More information can be found in the FastQC documentation.

  • bed : provides the information generated by the MACS2 algorithm about the locations and significance of peaks. These files can be used for direct visualization of the peak locations in IGV, or as input for further analysis with bedtools.

  • bigwig files : provide fast display of read coverage tracks in any type of genome browser.

  • plotFingerprint : contains figures that answer the question: "Did my ChIP work???" . An explanation of the plot and the available options can be found in the deeptools documentation.

  • PLOTCORRELATION : this folder contains pdf files displaying the correlation between the samples tested in the ChIP experiment. Many options of the plotCorrelation rule can be changed via the configuration file. More information about this plot can be found in the deeptools documentation.

  • HEATMAP : this folder contains pdf files displaying the content of the matrix produced by the computeMatrix rule in the form of a heatmap. Many options of the computeMatrix and plotHeatmap rules can be changed in the configuration file. More information about this figure can be found in the deeptools documentation.

  • plotProfile : this folder contains pdf files displaying profile plots of scores over sets of genomic regions; the regions are defined in the matrix produced previously. Again, there are many options to customize the plot; more information can be found in the deeptools documentation.

Optional outputs of the pipeline are bamCompare files, bedgraph files and bed files from broad peak calling.
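
As an illustration of downstream use outside the pipeline itself, the narrow peaks of two replicates could be intersected with bedtools (the file names below are hypothetical):

    bedtools intersect \
        -a results/bed/rep1_peaks.narrowPeak \
        -b results/bed/rep2_peaks.narrowPeak \
        > peaks_shared_between_replicates.bed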

Directed Acyclic Graph of the Snakemake ChIP-seq PE pipeline

This graph has been produced with the command: snakemake --rulegraph | dot -Tpng > dag.png
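
Snakemake can also draw the full job graph (one node per job rather than one per rule); the output file name below is just an example:

    snakemake --dag | dot -Tpng > dag_jobs.png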

Code Snippets

bamCoverage (bigwig coverage tracks):
shell:
    "bamCoverage --bam {input.bam} \
    -o {output} \
    --effectiveGenomeSize {params.EFFECTIVEGENOMESIZE} \
    --extendReads {params.EXTENDREADS} \
    --binSize {params.binSize} \
    --smoothLength {params.smoothLength} \
    --ignoreForNormalization {params.ignoreForNormalization} \
    &>{log}"
bamCompare (treatment vs. control comparison tracks):
shell:
    "bamCompare -b1 {input.treatment} \
    -b2 {input.control}  \
    --binSize {params.binSize} \
    -o {output.bigwig} \
    --normalizeUsing {params.normalizeUsing} \
    --operation {params.operation} \
    --smoothLength {params.smoothLength} \
    --ignoreForNormalization {params.ignoreForNormalization} \
    --scaleFactorsMethod {params.scaleFactorsMethod} \
    &>{log}"
multiBamSummary (read counts over genomic bins):
shell:
    "multiBamSummary bins \
    --bamfiles {input} \
    --numberOfProcessors {threads} \
    --binSize {params.binSize} \
    --centerReads \
    --extendReads \
    -o {output} \
    2> {log}"
plotCorrelation (sample correlation plots):
shell:
    "plotCorrelation \
                --corData {input} \
                --corMethod {params.corMethod} \
                --whatToPlot {params.whatToPlot} \
                --skipZeros \
                --colorMap {params.color} \
                --plotFile {output} \
                --plotNumbers \
                2> {log}"
computeMatrix (scores around the TSS):
shell:
    "computeMatrix \
    reference-point \
    --referencePoint TSS \
    -S {input.bigwig} \
    -R {input.bed} \
    --beforeRegionStartLength {params.upstream} \
    --afterRegionStartLength {params.downstream} \
    --numberOfProcessors {threads} \
    --binSize {params.binSize} \
    -o {output} \
    2> {log}"
plotHeatmap (heatmap of the computeMatrix output):
shell:
    "plotHeatmap \
    --matrixFile {input} \
    --outFileName {output} \
    --kmeans {params.kmeans} \
    --colorMap {params.color} \
    --legendLocation best \
    --outFileSortedRegions {params.cluster}"
plotFingerprint (ChIP enrichment quality):
shell:
    "plotFingerprint \
    -b {input.treatment} {input.control} \
    --extendReads {params.EXTENDREADS} \
    --binSize {params.binSize} \
    --plotFile {output}"
plotProfile (profile plots of the computeMatrix output):
shell:
    "plotProfile \
    --matrixFile {input} \
    --outFileName {output.pdf} \
    --outFileSortedRegions {output.bed} \
    --kmeans {params.kmeans} \
    --startLabel {params.startLabel} \
    --endLabel {params.endLabel}"
Download of the reference genome:
shell: "wget -O {output} {GENOME_FASTA_URL}"
Download of the annotation (GFF):
shell: "wget -O {output} {GFF_URL}"
Conversion of the GFF annotation to GTF:
shell:
    "python scripts/gff_to_gtf.py {input} {output}"    
MACS2 narrow peak calling:
shell:
    """
    macs2 callpeak -t {input.treatment} -c {input.control} {params.format} {params.genomesize} --name {params.name} --nomodel --bdg -q {params.qvalue} --outdir results/bed/ &>{log}
    """
MACS2 broad peak calling:
shell:
    """
    macs2 callpeak -t {input.treatment} -c {input.control} {params.format} --broad --broad-cutoff 0.1 {params.genomesize} --name {params.name} --nomodel --bdg -q {params.qvalue} --outdir results/bed/ &>{log}
    """
Trimmomatic (adapter and quality trimming):
shell:
    "trimmomatic PE {params.phred} -threads {threads} "
    "{input.reads} "
    "{output.forward_reads} "
    "{output.forwardUnpaired} "
    "{output.reverse_reads} "
    "{output.reverseUnpaired} "
    "ILLUMINACLIP:{input.adapters}:{params.seedMisMatches}:{params.palindromeClipTreshold}:{params.simpleClipThreshhold} "
    "LEADING:{params.LeadMinTrimQual} "
    "TRAILING:{params.TrailMinTrimQual} "
    "SLIDINGWINDOW:{params.windowSize}:{params.avgMinQual} "
    "MINLEN:{params.minReadLen} &>{log}"
fastp (quality report):
shell:
    "fastp -i {input.fwd} -I {input.rev} -h {output} 2>{log}"
Bowtie2 index building:
shell: "bowtie2-build --threads {threads} {input} {params}"
Bowtie2 mapping:
shell:
    """
    bowtie2 {params.bowtie} --threads {threads} -x {params.index} -1 {input.forward} -2 {input.reverse} -U {input.forwardUnpaired},{input.reverseUnpaired} --un-conc-gz {params.unmapped} | samtools view -Sb - > {output.mapped} 2>{log}
    """
Sorting of the BAM files:
shell: "samtools sort -@ {threads} -o {output} {input} &>{log}"
Duplicate removal and indexing:
shell:
    """
    samtools rmdup {input} {output.bam} &>{log}
    samtools index {output.bam}
    """
scripts/gff_to_gtf.py:
# Minimal GFF3-to-GTF converter used by the gff_to_gtf rule.
# It keeps only gene, transcript and exon records (mRNA is renamed to
# transcript) and rewrites the attribute column into GTF-style
# gene_id/transcript_id fields. It assumes Ensembl-style "type:ID"
# identifiers whose trailing ".N" suffixes link exons to transcripts
# and transcripts to genes.
import sys

args = sys.argv

f = open(args[1])        # input GFF3 file
of = open(args[2], "w")  # output GTF file

keep = set(["gene", "transcript", "exon"])
for line in f:
    if line.startswith("#"):  # skip header/comment lines
        continue
    cols = line.split("\t")
    if cols[2] == "mRNA":     # treat mRNA features as transcripts
        cols[2] = "transcript"
    if cols[2] not in keep:
        continue

    # Column 9 holds ";"-separated "key=value" attribute pairs.
    attribs = cols[8].split(";")
    attribs = [x.replace("=", "\t").split("\t") for x in attribs]
    newAttribs = {}
    for t in attribs:
        if t[0] == "ID":
            # e.g. "gene:<id>" or "transcript:<id>.1"
            _, ID = t[1].split(":")
            if cols[2] == "gene":
                newAttribs['gene_id'] = ID
            elif cols[2] == "transcript":
                newAttribs['transcript_id'] = ID
                x = ID.rindex(".")
                newAttribs['gene_id'] = ID[:x]  # strip ".N" to get the gene
            elif cols[2] == "exon":
                x = ID.rindex(".")
                ID = ID[:x]
                newAttribs['transcript_id'] = ID
                x = ID.rindex(".")
                ID = ID[:x]
                newAttribs['gene_id'] = ID
    # Re-emit column 9 in GTF syntax: key "value"; key "value";
    cols[8] = " ".join("{} \"{}\";".format(x, y) for x, y in newAttribs.items())
    of.write("\t".join(cols))
    of.write("\n")
f.close()
of.close()
Cleanup of the temporary working directory:
shell: "rm -rf {WORKING_DIR}"

Source repository: https://github.com/KoesGroup/Snakemake_ChIPseq_PE (snakemake_chipseq_pe, v1.2.1)