A snakemake workflow to process ATAC-seq data



Yet another snakemake workflow for ATAC-seq data processing. This pipeline was created from code developed by others; see the author attributions in the Code Snippets section below.

For SLURM setup we reference the jdblischak/smk-simple-slurm repository (see the SLURM section below).

Workflow Overview

Snakemake pipelines promote experimental reproducibility. For this project, you should have the following inputs customized for your analysis:

  1. A config.yaml that describes the run parameters and location of reference data.

  2. A tab-delimited sample meta file that describes the experiments to download from SRA and how to group them (a sketch is shown after this list).

  3. A unique output directory.
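
For illustration only, the sample meta file and config.yaml might look like the sketches below. The column names, keys, and paths here are hypothetical placeholders, not the pipeline's actual schema; consult ./inputs/GM12878_sample.tsv and ./inputs/config.yaml in the repository for the real formats.

# Hypothetical sample meta file (tab-delimited); real column names may differ
sample          srr         group
GM12878_rep1    SRR0000001  GM12878
GM12878_rep2    SRR0000002  GM12878

# Hypothetical config.yaml sketch; see ./inputs/config.yaml for the real keys
sample_file: ./inputs/GM12878_sample.tsv
output_dir: ./outputs
idx_bt2: /path/to/bowtie2/index      # bowtie2 index used by the alignment rule
chrom_sizes: /path/to/hg38.chrom.sizes
blacklist: /path/to/hg38_maxatac_blacklist_V2.bed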

A detailed overview of the steps in the ATAC-seq data processing can be found on the maxATAC wiki site.

This version of snakeATAC is geared towards use with maxATAC and TOBIAS for making TF binding predictions.

Installation

This pipeline uses Anaconda and Snakemake. Follow the Snakemake install instructions for the best experience. Below is a brief overview of how to install Snakemake.

Create environment

Create a conda environment and install mamba:

conda create -n snakeatac -c conda-forge -c bioconda mamba snakemake

Activate the snakeatac environment:

conda activate snakeatac

Clone the snakeATAC repository

In your favorite directory, clone the snakeATAC repo:

git clone https://github.com/tacazares/snakeATAC.git

Set up run-specific parameters

If you are running this pipeline for the first time, you will need to install all of the conda environments used and perform a dry run to make sure that everything was installed correctly.

  1. Adjust the config.yaml and the tab-delimited sample meta file for your specific experiment.

  2. Change to the working directory for snakeATAC. By default, Snakemake will look for a file called Snakefile with the rules and run information. You can use a custom Snakefile with the -s flag followed by the path to the file.

    cd ./snakeATAC/
    
  3. Next, use the --conda-create-envs-only flag to create the environments.

    snakemake --cores 14 --use-conda --conda-frontend mamba --conda-create-envs-only --configfile ./inputs/config.yaml
    
  4. Test that the workflow and scripts are correctly set up by performing a dry run with the --dry-run flag.

    snakemake --cores 14 --use-conda --conda-frontend mamba --configfile ./inputs/config.yaml --dry-run
    

Test snakeATAC

The ./snakeATAC/inputs/GM12878_sample.tsv file contains information for a test run that processes GM12878 OMNI ATAC-seq data.

After installation, you can run the full workflow using your favorite HPC system.

snakemake --cores 14 --use-conda --conda-frontend mamba --configfile ./inputs/config.yaml

Use Snakemake to submit jobs through SLURM

If you want to use Snakemake to submit jobs to SLURM, you will need to follow the instructions described in the jdblischak/smk-simple-slurm repo. The directory and scripts are included in this repository, but you will need to adjust the account information. You can also adjust any defaults that you wish to use with your job submissions; a sketch of the profile config is shown below. NOTE: You will need to run chmod +x status-sacct.sh to make the status script executable.
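
For reference, the simple/ profile is a directory containing a config.yaml. Below is a minimal sketch, assuming the layout documented in the smk-simple-slurm repo; the resource values are placeholders to adapt to your cluster:

# simple/config.yaml (sketch; adapt to your cluster)
cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --account={YOUR_ACCOUNT}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
default-resources:
  - mem_mb=16000
jobs: 100
latency-wait: 60
use-conda: True
cluster-status: status-sacct.sh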

Example .bat file to drive the snakeATAC workflow

#!/bin/bash
#SBATCH -D ./outputs
#SBATCH -J dmnd_snake 
#SBATCH -t 96:00:00
#SBATCH --ntasks=8
#SBATCH --mem=16gb
#SBATCH --account={YOUR_ACCOUNT}
#SBATCH --output ./outputs/snakeatac-%j.out
#SBATCH --error ./outputs/snakeatac-%j.err
# Load modules
module load python/3.7-2019.10
# Load the snakemake/mamba env
source activate mamba
# go to a particular directory
cd ./snakeATAC
# make things fail on errors
set -o nounset
set -o errexit
set -x
### run your commands here!
# Adapted from the links below:
# https://bluegenes.github.io/snakemake-via-slurm/
# https://github.com/jdblischak/smk-simple-slurm
snakemake -s ./Snakefile \
--use-conda \
--conda-frontend mamba \
--configfile ./inputs/config.yaml \
--profile simple/
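
Assuming the script above is saved as run_snakeatac.bat (a hypothetical name), submit it with:

mkdir -p outputs   # SLURM does not create the log directory for you
sbatch run_snakeatac.bat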

Code Snippets

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc. Since there can be race conditions when multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = f"--outdir {outdir}"


compress = ""
for output in snakemake.output:
    out_name, out_ext = os.path.splitext(output)
    if out_ext == ".gz":
        compress += f"pigz -p {snakemake.threads} {out_name}; "
    elif out_ext == ".bz2":
        compress += f"pbzip2 -p{snakemake.threads} {out_name}; "


with tempfile.TemporaryDirectory() as tmp:
    shell(
        "(fasterq-dump --temp {tmp} --threads {snakemake.threads} "
        "{extra} {outdir} {snakemake.wildcards.accession}; "
        "{compress}"
        ") {log}"
    )
myargs <- commandArgs(trailingOnly=TRUE)
bamfile <- myargs[1]
species <- myargs[2]

print("loading packages (ATACseqQC, ggplot, etc)...")
suppressPackageStartupMessages(library(ggplot2, quietly=TRUE))
suppressPackageStartupMessages(library(Rsamtools, quietly=TRUE))
suppressPackageStartupMessages(library(ATACseqQC, quietly=TRUE))
suppressPackageStartupMessages(library(ChIPpeakAnno, quietly=TRUE))
suppressPackageStartupMessages(library("GenomicAlignments", quietly=TRUE))

if (species == "mm") {
  suppressPackageStartupMessages(library(TxDb.Mmusculus.UCSC.mm10.knownGene, quietly=TRUE))
  suppressPackageStartupMessages(library(BSgenome.Mmusculus.UCSC.mm10, quietly=TRUE))
  txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
  bsgenome <- BSgenome.Mmusculus.UCSC.mm10
  genome <- Mmusculus
  print("species is 'mm' using mm10 for analysis")
  ### Note: Everything below is deprecated until I can figure out a way to port a 
  ### static/local package with snakemake
  # Note: phastCons60way was manually curated from GenomicAlignments, built, and installed as an R package
  # score was obtained according to: https://support.bioconductor.org/p/96226/
  # package was built and installed according to: https://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicScores/inst/doc/GenomicScores.html
  # (section 5.1: Building an annotation package from a GScores object)
  #suppressWarnings(suppressPackageStartupMessages(library(GenomicScores, lib.loc="/users/dia6sx/snakeATAC/scripts/", quietly=TRUE)))
  #suppressWarnings(suppressPackageStartupMessages(library(phastCons60way.UCSC.mm10, lib.loc="/users/dia6sx/snakeATAC/scripts/", quietly=TRUE)))
} else if (species == "hs") {
  suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg38.knownGene, quietly=TRUE))
  suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38, quietly=TRUE))
  txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
  bsgenome <- BSgenome.Hsapiens.UCSC.hg38
  genome <- Hsapiens
  print("species is 'hs' using hg38 for analysis")
} else {
  print(paste("params ERROR: ATACseqQC is not configured to use species =", species))
  print("exiting...")
  quit(status=1)
}

doATACseqQC <- function(bamfile, txdb, bsgenome, genome) {
    # Fragment size distribution
    print(paste("generating output for ",strsplit(basename(bamfile),split='\\.')[[1]][1],"...",sep=""))
    print("calculating Fragment size distribution...")
    bamfile.labels <- gsub(".bam", "", basename(bamfile))
    loc_to_save_figures <- paste(dirname(dirname(bamfile)),"/qc/ATACseqQC",sep="")
    if (file.exists(loc_to_save_figures)) {
        print("Warning: old figures will be overwritten")
    } else {
        dir.create(loc_to_save_figures)
    }
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_fragment_size_distribution.png",sep="")
    png(png_file)
    fragSizeDist(bamfile, bamfile.labels)
    dev.off()

    # Adjust the read start sites
    print("adjusting read start sites...")
    ## bamfile tags to be read in
    possibleTag <- list("integer"=c("AM", "AS", "CM", "CP", "FI", "H0", "H1", "H2", 
                                    "HI", "IH", "MQ", "NH", "NM", "OP", "PQ", "SM",
                                    "TC", "UQ"), 
                    "character"=c("BC", "BQ", "BZ", "CB", "CC", "CO", "CQ", "CR",
                                "CS", "CT", "CY", "E2", "FS", "LB", "MC", "MD",
                                "MI", "OA", "OC", "OQ", "OX", "PG", "PT", "PU",
                                "Q2", "QT", "QX", "R2", "RG", "RX", "SA", "TS",
                                "U2"))
    bamTop100 <- scanBam(BamFile(bamfile, yieldSize = 100),
                     param = ScanBamParam(tag=unlist(possibleTag)))[[1]]$tag
    tags <- names(bamTop100)[lengths(bamTop100)>0]
    ## files will be output into outPath
    ## shift the coordinates of 5'ends of alignments in the bam file
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_shifted", sep="")
    seqinformation <- seqinfo(txdb)
    gal <- readBamFile(bamfile, tag=tags, asMates=TRUE, bigFile=TRUE)
    shiftedBamfile <- file.path(outPath, paste(bamfile.labels,"_shifted.bam",sep=""))
    # check if shifted Bam file exists from previous run
    if (file.exists(shiftedBamfile)) {
        print("Shifted Bamfile found.")
        print("Loading in...")
        gal <- readBamFile(shiftedBamfile, tag=tags, asMates=TRUE, bigFile=TRUE)
        ## This step is mostly for formatting so splitBam can
        ## take in the bamfile. We implement a shift of 0 bp on the positive strand
        ## and 0 bp on the negative strand because the shifted bamfile should
        ## already have these shifts
        gal1 <- shiftGAlignmentsList(gal, positive = 0L, negative = 0L)
    } else {
        # shifted bam file does not exist check if
        # old shifted alignments directory exists
        # if so remove and create new one
        if (file.exists(outPath)){
            unlink(outPath,recursive=TRUE)
        }
        dir.create(outPath)
        print("*** creating shifted bam file ***")
        gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamfile)
    }

    # Promoter/Transcript body (PT) score
    print("calculating Promoter/Transcript body (PT) score...")
    txs <- transcripts(txdb)
    pt <- PTscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_ptscore.png",sep="")
    png(png_file)
    plot(pt$log2meanCoverage, pt$PT_score, 
        xlab="log2 mean coverage",
        ylab="Promoter vs Transcript",
        main=paste(bamfile.labels,"PT score"))
    dev.off()

    # Nucleosome Free Regions (NFR) score
    print("calculating Nucleosome Free Regions (NFR) score")
    nfr <- NFRscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_nfrscore.png",sep="")
    png(png_file)
    plot(nfr$log2meanCoverage, nfr$NFR_score, 
        xlab="log2 mean coverage",
        ylab="Nucleosome Free Regions score",
        main=paste(bamfile.labels,"\n","NFRscore for 200bp flanking TSSs",sep=""),
        xlim=c(-10, 0), ylim=c(-5, 5))
    dev.off()

    # Transcription Start Site (TSS) Enrichment Score
    print("calculating Transcription Start Site (TSS) Enrichment score")
    tsse <- TSSEscore(gal1, txs)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_tss_enrichment_score.png",sep="")
    png(png_file)
    plot(100*(-9:10-.5), tsse$values, type="b", 
        xlab="distance to TSS",
        ylab="aggregate TSS score",
        main=paste(bamfile.labels,"\n","TSS Enrichment score",sep=""))
    dev.off()

    # Split reads, Heatmap and coverage curve for nucleosome positions
    print("splitting reads by fragment length...")
    genome <- genome
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_split", sep="")
    TSS <- promoters(txs, upstream=0, downstream=1)
    TSS <- unique(TSS)
    ## estimate the library size for normalization
    librarySize <- estLibSize(bamfile)
    ## calculate the signals around TSSs.
    NTILE <- 101
    dws <- ups <- 1010
    splitBamfiles <- paste(outPath,"/",c("NucleosomeFree", 
                                             "mononucleosome",
                                             "dinucleosome",
                                             "trinucleosome"),".bam",sep="")
    # check if split Bam files exists from previous run
    if (all(file.exists(splitBamfiles))) {
        print("*** split bam files found! ***")
        print("Loading in...")
        sigs <- enrichedFragments(bamfiles=splitBamfiles,
                                    index=splitBamfiles, 
                                    TSS=TSS,
                                    librarySize=librarySize,
                                    TSS.filter=0.5,
                                    n.tile = NTILE,
                                    upstream = ups,
                                    downstream = dws)
    } else {
        # split bam files do not exist check if
        # old split alignments directory exists
        # if so remove and create new one
        if (file.exists(outPath)){
            unlink(outPath,recursive=TRUE)
        }
        print("*** creating split bam files ***")
        dir.create(outPath)
        ## split the reads into NucleosomeFree, mononucleosome, 
        ## dinucleosome and trinucleosome.
        ## and save the binned alignments into bam files.
        objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome, outPath = outPath)
        #objs <- splitBam(bamfile, tags=tags, outPath=outPath,
            #        txs=txs, genome=genome,
            #       conservation=phastCons60way.UCSC.mm10,
            #      seqlev=paste0("chr", c(1:19, "X", "Y")))
        sigs <- enrichedFragments(gal=objs[c("NucleosomeFree", 
                                        "mononucleosome",
                                        "dinucleosome",
                                        "trinucleosome")], 
                                    TSS=TSS,
                                    librarySize=librarySize,
                                    TSS.filter=0.5,
                                    n.tile = NTILE,
                                    upstream = ups,
                                    downstream = dws)
    }
    ## log2 transformed signals
    sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))
    ## plot heatmap
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_nucleosome_pos_heatmap.png",sep="")
    png(png_file)
    featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                        zeroAt=.5, n.tile=NTILE)
    dev.off()
    ## get signals normalized for nucleosome-free and nucleosome-bound regions.
    out <- featureAlignedDistribution(sigs, 
                                    reCenterPeaks(TSS, width=ups+dws),
                                    zeroAt=.5, n.tile=NTILE, type="l", 
                                    ylab="Averaged coverage")
    ## rescale the nucleosome-free and nucleosome signals to 0~1
    range01 <- function(x){(x-min(x))/(max(x)-min(x))}
    out <- apply(out, 2, range01)
    png_file <- paste(loc_to_save_figures,"/",bamfile.labels,"_TSS_signal_distribution.png",sep="")
    png(png_file)
    matplot(out, type="l", xaxt="n", 
            xlab="Position (bp)", 
            ylab="Fraction of signal",
            main=paste(bamfile.labels,"\n","TSS signal distribution",sep=""))
    axis(1, at=seq(0, 100, by=10)+1, 
        labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
    abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
    dev.off()

    print("QC Finished.")
    print("Generated QC figures can be found in qc folder under ATACseQC")
    print(paste("*** removing temp files in",outPath,"***"))
    unlink(outPath,recursive=TRUE)
    outPath <- paste(dirname(dirname(bamfile)),"/alignments_shifted", sep="")
    print(paste("*** removing temp files in",outPath,"***"))
    unlink(outPath,recursive=TRUE)
}

doATACseqQC(bamfile, txdb, bsgenome, genome)
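
As invoked by the workflow (see the Rscript shell rule below), the script takes a BAM file and a species code of "hs" (hg38) or "mm" (mm10); the BAM path here is hypothetical:

Rscript ./scripts/doATACseqQC.R ./outputs/alignments/sample_final.bam hs

Note that the QC figures are written to a qc/ATACseqQC folder derived from the BAM path (two directory levels above the BAM file).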
# Usage: frip.sh <tn5_sites.bed.gz> <peaks.bed> <output.txt>
# Count the total number of Tn5 sites (tags)
read_counts=$(gunzip -c ${1} | wc -l)

# Sort and merge the input peaks, then intersect with the list of tags.
# Bedtools docs for -u: write the original A entry (tag) once if any overlaps are found in B (peaks). In other words, just report the fact that at least one overlap was found in B.
reads_in_peaks=$(bedtools sort -i ${2} | bedtools merge -i - | bedtools intersect -u -a ${1} -b - | wc -l)

echo ${reads_in_peaks} > ${3}
echo ${read_counts} >> ${3}

# Fraction of reads in peaks (FRiP)
echo $(bc -l <<< "${reads_in_peaks}/${read_counts}") >> ${3}
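
Matching the FRiP shell rule further down, a hypothetical invocation and its output format:

sh ./scripts/frip.sh sample_tn5_sites.bed.gz sample_peaks.narrowPeak sample_frip.txt
# sample_frip.txt then contains three lines:
#   reads in peaks
#   total reads
#   fraction of reads in peaks (FRiP)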
# Usage: generate_bigwig.sh <bam> <millions_factor> <tn5_sites.bed> <chrom.sizes> <out.bedgraph> <out.bigwig>
# Count mapped primary reads (-F 260 excludes unmapped and secondary alignments)
mapped_reads=$(samtools view -c -F 260 ${1})
reads_factor=$(bc -l <<< "1/${mapped_reads}")
rpm_factor=$(bc -l <<< "${reads_factor} * ${2}")

# Scaled coverage of Tn5 sites, sorted as bedGraphToBigWig requires
bedtools genomecov -i ${3} -g ${4} -bg -scale ${rpm_factor} | LC_COLLATE=C sort -k1,1 -k2,2n > ${5}

bedGraphToBigWig ${5} ${4} ${6}
# Usage: shift_reads.sh <bam> <chrom.sizes> <slop_bp> <blacklist.bed> <out.bed.gz>
# Shift read 5' ends to the Tn5 insertion site (+4 bp on the plus strand, -5 bp on the minus strand),
# pad each site by the slop distance, remove blacklisted regions, and compress the result
bedtools bamtobed -i ${1} | \
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 4, $2 + 5, $4, $5, $6; else print $1, $3 - 5, $3 - 4, $4, $5, $6}' | \
bedtools slop -i - -g ${2} -b ${3} | \
bedtools intersect -a - -b ${4} -v | \
pigz > ${5}
shell:
	"""
	Rscript ./scripts/doATACseqQC.R {input} {params.species} > {log}
	"""
shell:
    """
    bowtie2 {params.bowtie} -p {threads} -x {config[idx_bt2]} -1 {input[0]} -2 {input[1]} -S {output} 2> {log}
    """
shell:
    """
    samtools view -@ {threads} -b -u -q 30 {input} | \
    samtools sort -@ {threads} -n -o {output} -
    """
shell:
    """
    # Add mate information to reads for deduplication
    samtools fixmate -@ {threads} -r -m {input} - 2> {log} | \
    samtools sort -@ {threads} -o {output.fixmate_bam} - 2> {log}

    samtools index -@ {threads} {output.fixmate_bam} 2> {log}
    """
shell:
    """
    # use markdup to remove PCR duplicates
    samtools markdup -@ {threads} -r -s {input} - 2> {log} | \
    samtools sort -@ {threads} -o {output.dedup_bam} - 2> {log}

    samtools index -@ {threads} {output.dedup_bam} 2> {log}

    # keep properly paired reads (-f 3) and remove unmapped reads (-F 4)
    samtools view -@ {threads} -b -f 3 -F 4 {output.dedup_bam} {params} | \
    samtools sort -@ {threads} -o {output.final_bam} - 2> {log}

    samtools index -@ {threads} {output.final_bam} 2> {log}

    """
shell:
    """        
    sh ./scripts/shift_reads.sh {input} {params.chrom_sizes} {params.slop} {params.blacklist} {output}
    """
shell:
    """
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed} {params.chrom_sizes} {output.tn5_bedgraph} {output.tn5_bigwig}
    """
shell:
    """
    zcat {input.tn5_bed} | grep -w "+" | pigz > {output.tn5_bed_pos}
    zcat {input.tn5_bed} | grep -w "-" | pigz > {output.tn5_bed_neg}
    """
shell:
    """
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed_pos} {params.chrom_sizes} {output.tn5_bedgraph_pos} {output.tn5_bigwig_pos}
    sh ./scripts/generate_bigwig.sh {input.bam} {params.millions_factor} {input.tn5_bed_neg} {params.chrom_sizes} {output.tn5_bedgraph_neg} {output.tn5_bigwig_neg}
    """
shell:
    """
    macs2 callpeak -t {input} --name {params.NAME} -g {params.species} --outdir {params.PEAK_DIR} --nomodel --shift -{params.shift_size} --extsize {params.ext_size} --keep-dup=all -q 0.{params.qvalue}
    """
shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
shell:
    """
    sh ./scripts/frip.sh {input[0]} {input[1]} {output}
    """
shell:
    """
    bamPEFragmentSize -b {input} -p {threads} --binSize {params.bin_size} --blackListFileName {params.blacklist} --outRawFragmentLengths {output}
    """
shell:
    """
    samtools idxstats {input} | cut -f 1,3 > {output}
    """
shell:
    """
    maxatac average -i {input} --prefix {params.prefix} --output {params.outdir}
    """
shell:
    """
    maxatac normalize --signal {input} --prefix {params.prefix} --output {params.outdir}
    """
shell:
    """
    maxatac predict -tf {params.TF_model} -s {input} --prefix {params.prefix} --output {params.outdir}
    """
wrapper:
    "0.77.0/bio/sra-tools/fasterq-dump"
shell:
    """
    cat {input.R1} > {output.R1_OUT}
    cat {input.R2} > {output.R2_OUT}
    """
wrapper:
    "0.77.0/bio/fastqc"
shell:
    """
    trim_galore -q 30 -paired -j 4 -o {params.TRIM_DIR} {input.fq1} {input.fq2} 2> {log}
    """
wrapper:
    "0.77.0/bio/fastqc"
shell:
    """
    # Select Parameter Description

    # runThreadN: Number of threads to use
    # readFilesIn: Fastq files
    # outFileNamePrefix: Prefix for the output directory and file names
    # outSAMtype: Output the file as a BAM file that is sorted
    # outSAMunmapped: Output unmapped reads in the SAM file with special flag
    # outSAMattributes: Standard SAM attributes
    # alignIntronMax: Maximum intron length of 1, effectively disabling spliced alignment.
    # STAR was designed for RNA transcripts, so for ATAC-seq we disable the splicing-specific behavior
    # alignMatesGapMax: Allow a maximum gap of 2000 bp between mates.
    # alignEndsType: Align the full read end-to-end and consider the whole read in score calculation.
    # outMultimapperOrder: Output multimapped reads in random order
    # outFilterMultimapNmax: Allow at most 999 alignments for a multimapping read
    # outSAMmultNmax: Report at most 1 alignment per read
    # outFilterMismatchNoverLmax: Allow mismatches up to 0.1 of the mapped read length
    # outFilterMatchNmin: Require at least 20 matched bases
    # alignSJDBoverhangMin: 999, effectively ignoring annotated splice junctions
    # alignEndsProtrude: Allow mate ends to protrude up to 10 bp for concordant pairs
    # outFilterScoreMinOverLread: Require an alignment score of at least 0.66 of the read length
    # outFilterMatchNminOverLread: Require matched bases of at least 0.66 of the read length
    # readFilesCommand: zcat

    echo "Align Files with STAR: In Progress"

    STAR --genomeDir {params.STAR_INDEX} \
    --runThreadN {params.THREADS} \
    --readFilesIn {input[0]} {input[1]} \
    --outFileNamePrefix {params.Prefix} \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard \
    --alignIntronMax 1 \
    --alignMatesGapMax 2000 \
    --alignEndsType EndToEnd \
    --outMultimapperOrder Random \
    --outFilterMultimapNmax 999 \
    --outSAMmultNmax 1 \
    --outFilterMismatchNoverLmax 0.1 \
    --outFilterMatchNmin 20 \
    --alignSJDBoverhangMin 999 \
    --alignEndsProtrude 10 ConcordantPair  \
    --outFilterScoreMinOverLread 0.66 \
    --outFilterMatchNminOverLread 0.66 \
    --readFilesCommand zcat
    """
shell:
    """
    samtools merge -@ {threads} {output} {input}

    samtools index -@ 4 {output}
    """
shell:
    """
    macs2 callpeak -t {input} --name {params.NAME} -g {params.species} --outdir {params.PEAK_DIR} --nomodel --shift -100 --extsize 200 --broad
    """
shell:
    "TOBIAS ATACorrect --bam ${file} --genome Homo_sapiens.GRCh38.fa --peaks ${peaks} --blacklist hg38_maxatac_blacklist_V2.bed --outdir /TOBIAS/${base_filename} --cores 4"