KAS-Seq Analysis Workflow

public public 1yr ago Version: v0.2 0 bookmarks

Workflow to perform KAS-seq analysis

This is the template for a new Snakemake workflow. Replace this text with a comprehensive description, covering the purpose and domain. Insert your code into the respective folders, i.e. scripts ,

Code Snippets

22
23
wrapper:
    "0.72.0/bio/bwa/mem"
35
36
wrapper:
    "0.72.0/bio/samtools/index"
52
53
wrapper:
    "0.72.0/bio/picard/markduplicates"
71
72
wrapper:
    "0.72.0/bio/bwa/index"
14
15
16
17
18
19
20
21
22
shell:
    """
    bamCoverage \
            -b {input.bam:q} \
            -o {output:q} \
            --normalizeUsing RPGC \
            --effectiveGenomeSize {params.effectiveGenomeSize} \
            > {log:q} 2>&1
    """
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
shell:
    """
    computeMatrix \
        scale-regions \
        --regionsFileName {input.regions:q} \
        --upstream 3000 \
        --downstream 3000 \
        --regionBodyLength 6000 \
        --scoreFileName {input.score:q} \
        --outFileName {output.matrix_gz:q} \
        --outFileNameMatrix {output.matrix_tab:q} \
        --outFileSortedRegions {output.matrix_bed:q} \
        --numberOfProcessors {threads} \
        {params.extra} \
        > {log:q} 2>&1
    """
86
87
wrapper:
    "0.72.0/bio/deeptools/plotheatmap"
109
110
wrapper:
    "0.72.0/bio/deeptools/plotprofile"
36
37
wrapper:
    "0.72.0/bio/macs2/callpeak"
SnakeMake From line 36 of rules/macs.smk
49
50
51
52
53
shell:
    """
    echo -e 'chr\tstart\tend\tname\tscore\tstrand\tfold_enrichment\t-log10(pvalue)\t-log10(qvalue)' > {output:q}
    cat {input:q} >> {output:q}
    """
SnakeMake From line 49 of rules/macs.smk
10
11
wrapper:
    "0.72.0/bio/samtools/faidx"
24
25
script:
    "../scripts/save_txdb.R"
78
79
script:
    "../scripts/peak_annotation.R"
109
110
script:
    "../scripts/peak_profile.R"
134
135
script:
    "../scripts/peak_overlap_enrichment.R"
12
13
wrapper:
    "0.72.0/bio/fastqc"
25
26
wrapper:
    "0.72.0/bio/samtools/stats"
SnakeMake From line 25 of rules/qc.smk
47
48
49
50
51
52
53
54
55
56
57
58
shell:
    """
    plotCoverage \
            --bamfiles {input.bamfiles:q} \
            --plotFile {output.plotfile:q} \
            --outRawCounts {output.coverage:q} \
            --numberOfProcessors {threads} \
            --smartLabels \
            --plotTitle "Read Coverage (deduplicated)" \
            {params} \
            > {log:q} 2>&1 
    """
SnakeMake From line 47 of rules/qc.smk
78
79
80
81
82
83
84
85
86
87
88
shell:
    """
    multiBamSummary \
            bins \
            --bamfiles {input.bamfiles:q} \
            --outFileName {output.multibamsummary:q} \
            --numberOfProcessors {threads} \
            --smartLabels \
            {params} \
            > {log:q} 2>&1
    """
SnakeMake From line 78 of rules/qc.smk
103
104
105
106
107
108
109
110
111
112
113
114
shell:
    """
    plotCorrelation \
            --corData {input.cordata:q} \
            --corMethod spearman \
            --whatToPlot heatmap \
            --plotFile {output.plotfile:q} \
            --plotTitle "Alignment correlation (Spearman)" \
            --outFileCorMatrix {output.cormatrix:q} \
            {params} \
            > {log:q} 2>&1
    """
SnakeMake From line 103 of rules/qc.smk
136
137
138
139
140
141
142
143
144
145
146
147
shell:
    """
    bamPEFragmentSize \
            --bamfiles {input.bamfiles:q} \
            --histogram {output.histogram:q} \
            --plotTitle "Paired End Fragment Size (deduplicated)" \
            --numberOfProcessors {threads} \
            --table {output.table:q} \
            --outRawFragmentLengths {output.rawfragmentlengths:q} \
            {params} \
            > {log:q} 2>&1
    """
SnakeMake From line 136 of rules/qc.smk
203
204
wrapper:
    "0.72.0/bio/deeptools/plotfingerprint"
SnakeMake From line 203 of rules/qc.smk
241
242
wrapper:
    "0.72.0/bio/multiqc"
14
15
16
17
18
shell:
    """
    ln -sfr {input.fq1:q} {output.fastq1:q}
    ln -sfr {input.fq2:q} {output.fastq2:q}
    """
SnakeMake From line 14 of rules/trim.smk
30
31
32
33
shell:
    """
    ln -s {input.fq1:q} {output.fastq:q} >> {log:q} 2>&1
    """
SnakeMake From line 30 of rules/trim.smk
54
55
wrapper:
    "0.72.0/bio/cutadapt/pe"
SnakeMake From line 54 of rules/trim.smk
71
72
wrapper:
    "0.72.0/bio/cutadapt/se"
SnakeMake From line 71 of rules/trim.smk
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(org.Hs.eg.db, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("KAS-Seq Peak Annotation and Comparison")
p <- add_argument(p, "--peak_files",
                  help = "Peak files to annotate and compare",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used to ",
                                "build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--names", help = "Sample names for each peak file",
                  nargs = Inf)
p <- add_argument(p, "--output_dir",
                  help = "Directory for output files",
                  default = "peak_annotation")
p <- add_argument(p, "--annotation_distribution_plot",
                  help = "Peak annotation distribution barplot filename",
                  default = "annotationDistributionPlot.pdf")
p <- add_argument(p, "--peak_annotation_list_rdata",
                  help = "Peak annotation list Rdata file",
                  default = "peak_anno_list.Rdata")
p <- add_argument(p, "--venn_diagram",
                  help = paste0("Venn digagram of annotated genes per sample ",
                                "pdf filename"),
                  default = "annotationVennDiagram.pdf")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--peak_files", snakemake@input[["peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--names", snakemake@params[["names"]],
    "--output_dir", snakemake@params[["output_dir"]],
    "--annotation_distribution_plot",
    snakemake@output[["annotation_distribution_plot"]],
    "--venn_diagram", snakemake@output[["venn_diagram"]],
    "--peak_annotation_list_rdata",
    snakemake@output[["peak_annotation_list_rdata"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  peak_files <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  names <- c("D701", "D702", "D703", "D704", "D705")
  annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  txdb_file <- "txdb.db"
  argv <- parse_args(p, c("--peak_files", peak_files,
                          "--names", names,
                          "--txdb_file", txdb_file))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Set names
if (!anyNA(argv$names)) {
  peak_file_names <- argv$names
} else {
  peak_file_names <- sapply(argv$peak_files, basename)
}
names(argv$peak_files) <- peak_file_names

# Output directory
if (!dir.exists(argv$output_dir)) {
  dir.create(argv$output_dir, recursive = TRUE)
}

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}


# Peak Annotation
# TODO Provide config parameter for annoDb
peak_anno_list <- lapply(argv$peak_files, annotatePeak, TxDb = txdb,
                       tssRegion = c(-3000, 3000), annoDb = "org.Hs.eg.db",
                       verbose = FALSE)
lapply(names(peak_anno_list), function(name) {
  filebase <- file.path(argv$output_dir, basename(argv$peak_files[[name]]))
  write_tsv(as.data.frame(peak_anno_list[[name]]),
            file = paste0(filebase, ".annotated.tsv.gz"))
  sink(file = paste0(filebase, ".annotated.summary.txt"))
  print(peak_anno_list[[name]])
  sink()
})
saveRDS(peak_anno_list,
        file = argv$peak_annotation_list_rdata)

# Peak annotation distribution plot
peak_anno_dist_plot <- plotAnnoBar(peak_anno_list)
save_plot(
  filename = argv$annotation_distribution_plot,
  plot = peak_anno_dist_plot,
  base_height = 14,
  base_width = 14
)

# nolint start
# Functional profiles comparison
# NOTE: Not all data return enrichment
# genes = lapply(peak_anno_list, function(i) as.data.frame(i)$geneId)
# names(genes) = sub("_", "\n", names(genes))
# compKEGG <- compareCluster(geneCluster   = genes,
#                            fun           = "enrichKEGG",
#                            pvalueCutoff  = 0.05,
#                            pAdjustMethod = "BH")
# dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")
# nolint end

# Venn Diagram of gene annotated per sample
# Note: vennplot does not return a ggplot object
pdf(
  file = argv$venn_diagram,
  height = 14,
  width = 14
)
genes <- lapply(peak_anno_list, function(i) as.data.frame(i)$geneId)
vennplot(genes)
dev.off()
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("KAS-Seq Peak Overlap Enrichment Analysis")
p <- add_argument(p, "--query_peak_file",
                  help = "Query peak files to compare targets against")
p <- add_argument(p, "--target_peak_files",
                  help = "Target peak files to compare against query peaks",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used ",
                                "to build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--nshuffle", help = "Number of shuffle iterations",
                  default = 1000)
p <- add_argument(p, "--p_adjust_method", help = "pvalue adjustment method",
                  default = "BH")
p <- add_argument(p, "--output_tsv_file",
                  help = "Path and filename of output tsv file",
                  default = "peak_overlap_enrichment.tsv")
p <- add_argument(p, "--cores",
                  help = "number of cores (threads) to use",
                  default = NA)
p <- add_argument(p, "--chromlist",
                  help = paste0("List of chromosomes to include, others will ",
                                "be filtered out, new peaks stored in ",
                                "[peak_file].filtered"),
                  nargs = Inf, default = NA)

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--query_peak_file", snakemake@input[["query_peak_file"]],
    "--target_peak_files", snakemake@input[["target_peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--nshuffle", snakemake@params[["nshuffle"]],
    "--p_adjust_method", snakemake@params[["p_adjust_method"]],
    "--output_tsv_file", snakemake@output[["output_tsv_file"]],
    "--cores", snakemake@threads,
    "--chromlist", snakemake@params[["chromlist"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  query_peak_file <- "results_2020-12-03/macs2/D701-lane1_peaks.broadPeak"
  target_peak_files <- c("results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                         "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  nshuffle <- 50
  txdb_file <- "txdb.db"
  chromlist <- paste(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
                       "12", "13", "14", "15", "16", "17", "18", "19", "20",
                       "21", "22", "23"), sep = ", ")
  argv <- parse_args(p, c("--query_peak_file", query_peak_file,
                          "--target_peak_files", target_peak_files,
                          "--nshuffle", nshuffle,
                          "--txdb_file", txdb_file,
                          "--chromlist", chromlist))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}



# Number of cores
if (is.na(argv$cores)) {
  cores <- detectCores() - 1
} else {
  cores <- argv$cores
}

# Peak Overlap Enrichment
# Running with a single core can cause:
# Error in sample.int(length(x), size, replace, prob) :
#    cannot take a sample larger than the population when 'replace = FALSE'
# Running with multiple cores throws only a warning:
# Warning message:
# In mclapply(seq_along(idx), function(j) { :
#    all scheduled cores encountered errors in user code
# This results in incorrect output, thus we throw and error when we encounter
# that warning message
withCallingHandlers(
  peak_overlap_enrichment <- enrichPeakOverlap(
    queryPeak     = argv$query_peak_file,
    targetPeak    = unlist(argv$target_peak_files),
    TxDb          = txdb,
    pAdjustMethod = argv$p_adjust_method,
    nShuffle      = argv$nshuffle,
    chainFile     = NULL,
    verbose       = TRUE,
    mc.cores      = cores,
  ),
  warning = function(w) {
    if (grepl("encountered errors in user code", w$message)) {
      stop(paste0("Errors encounted executing enrichPeakOverlap: ", w$message))
    } else {
      message(w$message)
    }
  }
)

write_tsv(peak_overlap_enrichment, argv$output_tsv_file)
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
library(BiocManager, quietly = TRUE)
library(ChIPseeker, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(readr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("Profile of peaks binding to TSS regions")
p <- add_argument(p, "--peak_files",
                  help = "Peak files to annotate and compare",
                  nargs = Inf)
p <- add_argument(p, "--txdb_file",
                  help = "File to load txdb using AnnotationDbi::loadDb()")
p <- add_argument(p, "--annotation_file",
                  help = paste0("GFF3 or GTF file of gene annotations used ",
                                "to build txdb"))
p <- add_argument(p, "--txdb",
                  help = "Name of txdb package to install from Bioconductor")

# Add an optional arguments
p <- add_argument(p, "--names", help = "Sample names for each peak file",
                  nargs = Inf)
p <- add_argument(p, "--tag_profile_plot",
                  help = "Average peak profile plot filename",
                  default = "avgProfilePlot.pdf")
p <- add_argument(p, "--tag_heatmap_plot",
                  help = "Tag heatmap plot PDF filename",
                  default = "tagHeatmapPlot.pdf")
p <- add_argument(p, "--tag_matrix_list",
                  help = paste0("List of outputs from getTagMatrix ",
                                "CHiPseeker function"),
                  default = "tag_matrix_list.Rdata")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    "--peak_files", snakemake@input[["peak_files"]],
    "--txdb_file", snakemake@input[["txdb_file"]],
    "--names", snakemake@params[["names"]],
    "--tag_profile_plot", snakemake@output[["tag_profile_plot"]],
    "--tag_heatmap_plot", snakemake@output[["tag_heatmap_plot"]],
    "--tag_matrix_list", snakemake@output[["tag_matrix_list"]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  input_file <- c("results_2020-12-03/macs2/D701-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D702-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D703-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D704-lane1_peaks.broadPeak",
                  "results_2020-12-03/macs2/D705-lane1_peaks.broadPeak")
  names <- c("D701", "D702", "D703", "D704", "D705")
  annotation_file <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  txdb_file <- "txdb.db"
  argv <- parse_args(p, c("--peak_files", input_file,
                          "--names", names,
                          "--txdb_file", txdb_file))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Set names
if (!anyNA(argv$names)) {
  peak_file_names <- argv$names
} else {
  peak_file_names <- sapply(argv$peak_files, basename)
}
names(argv$peak_files) <- peak_file_names

# Get txdb object
if (!is.na(argv$txdb)) {
  # Load (install if needed) txdb from bioconductor
  library(pacman, quietly = TRUE)
  pacman::p_load(argv$txdb, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb))
} else if (!is.na(argv$txdb_file)) {
  # Load txdb
  library(AnnotationDbi, quietly = TRUE)
  txdb <- AnnotationDbi::loadDb(argv$txdb_file)
} else if (!is.na(argv$annotation_file)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  txdb <- GenomicFeatures::makeTxDbFromGFF(argv$annotation_file)
} else {
  stop("Must specify one of --txdb, --txdb_file, or --annotation_file")
}


# Profile of peaks binding to TSS regions
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tag_matrix_list <- lapply(argv$peak_files, getTagMatrix, windows = promoter)
saveRDS(tag_matrix_list, file = argv$tag_matrix_list)

# Average tag profile
tag_profile_plot <- plotAvgProf(tag_matrix_list, xlim = c(-3000, 3000))
save_plot(
  filename = argv$tag_profile_plot,
  plot = tag_profile_plot,
  base_height = 14,
  base_width = 14
)

# Tag heatmap
# Note: tagHeatMap does not return a ggplot object, we must save it differently
pdf(
  file = argv$tag_heatmap_plot,
  height = 14,
  width = 14
)
tagHeatmap(tag_matrix_list, xlim = c(-3000, 3000), color = NULL)
dev.off()
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
library(BiocManager, quietly = TRUE)
library(AnnotationDbi, quietly = TRUE)
library(readr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(argparser, quietly = TRUE)

p <- arg_parser("Get a TXDB object and save for use in other scripts")
p <- add_argument(p, "txdb_input",
                  help = paste0("Name of txdb package to install from ",
                                "Bioconductor or annotation file (GFF3 ",
                                "or GTF) to use to build a TXDB object"))
p <- add_argument(p, "--chrom_info",
                  help = paste0("Chromosome info file with columns: ",
                  "chrom, length, is_circular (optional)"))
p <- add_argument(p, "--output",
                  help = "File to save txdb object in",
                  default = "txdb.db")

# Parse arguments (interactive, snakemake, or command line)
if (exists("snakemake")) {
  # Arguments via Snakemake
  argv <- parse_args(p, c(
    snakemake@input[["txdb_input"]],
    "--chrom_info", snakemake@input[["chrom_info"]],
    "--output", snakemake@output[[1]]
  ))
} else if (interactive()) {
  # Arguments supplied inline (for debug/testing when running interactively)
  print("Running interactively...")
  txdb_input <- "genomes/hg38/annotation/Homo_sapiens.GRCh38.101.gtf"
  chrom_info <- paste0("genomes/hg38/genome/fasta/",
                       "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.fai")
  argv <- parse_args(p, c(txdb_input,
                          "--chrom_info", chrom_info))
  print(argv)
} else {
  # Arguments from command line
  argv <- parse_args(p)
  print(argv)
}

# Build txdb if txdb_input is file
if (file.exists(argv$txdb_input)) {
  # Create txdb object from supplied annotation file
  library(GenomicFeatures, quietly = TRUE)
  if (!is.na(argv$chrom_info)) {
    # Get chrom info
    is_circular_column <-
      count_fields(argv$chrom_info, tokenizer_tsv())[1] > 2
    if (is_circular_column) {
      col_names <- c("chrom", "length", "is_circular")
      col_spec <- cols_only(X1 = col_character(),
                            X2 = col_integer(),
                            X3 = col_guess())
    } else {
      col_names <- c("chrom", "length")
      col_spec <- cols(X1 = col_character(),
                       X2 = col_integer())
    }
    chrom_info <- read_tsv(argv$chrom_info,
                           col_names = FALSE,
                           col_types = col_spec)
    if (is.logical(chrom_info$X3)) {
      chrom_info <-
        chrom_info %>% select(c(
          chrom = X1,
          length = X2,
          is_circular = X3
        ))
    } else {
      chrom_info <- chrom_info %>% select(c(chrom = X1, length = X2))
    }
    txdb <- makeTxDbFromGFF(argv$txdb_input, chrominfo = chrom_info)
  } else {
    txdb <- makeTxDbFromGFF(argv$txdb_input)
  }
} else {
  library(pacman, quietly = TRUE)
  # Load (install if needed) txdb from bioconductor
  pacman::p_load(argv$txdb_input, character.only = TRUE)
  txdb <- eval(parse(text = argv$txdb_input))
}

# Save txdb
AnnotationDbi::saveDb(txdb, argv$output)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.extra}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
import re

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

jsd_sample = snakemake.input.get("jsd_sample")
out_counts = snakemake.output.get("counts")
out_metrics = snakemake.output.get("qc_metrics")
optional_output = ""
jsd = ""

if jsd_sample:
    jsd += " --JSDsample {jsd} ".format(jsd=jsd_sample)

if out_counts:
    optional_output += " --outRawCounts {out_counts} ".format(out_counts=out_counts)

if out_metrics:
    optional_output += " --outQualityMetrics {metrics} ".format(metrics=out_metrics)

shell(
    "(plotFingerprint "
    "-b {snakemake.input.bam_files} "
    "-o {snakemake.output.fingerprint} "
    "{optional_output} "
    "--numberOfProcessors {snakemake.threads} "
    "{jsd} "
    "{snakemake.params}) {log}"
)
# ToDo: remove the 'NA' string replacement when fixed in deepTools, see:
# https://github.com/deeptools/deepTools/pull/999
regex_passes = 2

with open(out_metrics, "rt") as f:
    metrics = f.read()
    for i in range(regex_passes):
        metrics = re.sub("\tNA(\t|\n)", "\tnan\\1", metrics)

with open(out_metrics, "wt") as f:
    f.write(metrics)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_matrix = snakemake.output.get("heatmap_matrix")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_matrix:
    optional_output += " --outFileNameMatrix {out_matrix} ".format(
        out_matrix=out_matrix
    )

shell(
    "(plotHeatmap "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.heatmap_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_data = snakemake.output.get("data")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_data:
    optional_output += " --outFileNameData {out_data} ".format(out_data=out_data)

shell(
    "(plotProfile "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.plot_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import sys
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""

ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)

if in_contr:
    opt_input = "-c {contr}".format(contr=in_contr)

if out_dir:
    out_dir = "--outdir {dir}".format(dir=out_dir)

if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
    if any(
        out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak"))
        for out in snakemake.output
    ):
        sys.exit(
            "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
            "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
        )
    else:
        if " --broad" in params:
            sys.exit(
                "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
                "Remove --broad option from params if these files are needed.\n"
            )

if any(
    out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
    if "--broad " not in params and not params.endswith("--broad"):
        params += " --broad "

if any(
    out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg"))
    for out in snakemake.output
):
    if all(p not in params for p in ["--bdg", "-B"]):
        params += " --bdg "
else:
    if any(p in params for p in ["--bdg", "-B"]):
        sys.exit(
            "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
        )

shell(
    "(macs2 callpeak "
    "-t {snakemake.input.treatment} "
    "{opt_input} "
    "{out_dir} "
    "-n {out_name} "
    "{params}) {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params
java_opts = get_java_opts(snakemake)

shell(
    "picard MarkDuplicates "  # Tool and its subcommand
    "{java_opts} "  # Automatic java option
    "{extra} "  # User defined parmeters
    "INPUT={snakemake.input} "  # Input file
    "OUTPUT={snakemake.output.bam} "  # Output bam
    "METRICS_FILE={snakemake.output.metrics} "  # Output metrics
    "{log}"  # Logging
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
ShowHide 36 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/lparsons/kas-seq-workflow
Name: kas-seq-workflow
Version: v0.2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...