A SnakeMake workflow to analyse whole genome bisulfite sequencing data from allopolyploids.

public public 1yr ago Version: v3.0.0 0 bookmarks

ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance wOrkflow

ARPEGGIO is a snakemake workflow that analyzes whole genome bisulfite sequencing (WGBS) data coming from (allo)polyploid species. The workflow includes all basic steps in WGBS data analysis (trimming, quality check and alignment), a read sorting tool specific for allopolyploids, the most comprehensive statistical tool for Differential Methylation (DM) analysis and a set of downstream analyses to obtain a list of genes showing differential methylation.

Motivation

In the last decade, the use of High-Throughput Sequencing (HTS) technologies has become widespread across life sciences. With technology not being a bottleneck anymore, the new challenge with HTS data has shifted towards data analysis. To process and analyze WGBS data, many tools exist, but most of them were developed and/or tested with a focus on diploid model species. For polyploid species there are some complexities that are often not taken into account. One example is the large amount of duplicated genes in polyploids (homeologous genes) which might be challenging at the mapping step and influence downstream analyses. To help with the analysis of polyploid WGBS data we developed ARPEGGIO: an automated and reproducible workflow which aims at being easy to set up and use.

Why ARPEGGIO?

ARPEGGIO is easily setup with one configuration file and once ready, it will automatically analyse your WGBS data to provide a list of differentially methylated regions (DMRs). Thanks to Snakemake, a human readable, Python based language for workflows and Conda, a widely-used package manager, ARPEGGIO takes care of installing all the software needed fo the analyses and running all the steps in the workflow in the correct order. ARPEGGIO also ensures reproducibility of your analysis, you only need to share your configuration and your initial raw data.

What's new in ARPEGGIO?

Besides the workflow itself (which is already quite a lot of new), ARPEGGIO includes an allopolyploid specific read-sorting algorithm that has been adapted to deal with BS-seq data: EAGLE-RC . Check out the papers "Homeolog expression quantification methods for allopolyploids" and "EAGLE: Explicit Alternative Genome Likelihood Evaluator" by Kuo et al. for more details. Together with EAGLE-RC, there's also dmrseq : an R package for differential methylation analysis. This package has one of the most comprehensive approaches to deal with WGBS data problems: mainly statistical and computational. If you're curious check out "Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing" by Korthauer et al .

Workflow overview

Installation

To install this workflow you first need to install Snakemake via Conda . To further ensure reproducibility you can also install Singularity . Once everything is set up, run the following commands to clone the ARPEGGIO repository to your computer and run the workflow. With Conda only:

git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda

With Conda and Singularity:

git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda --use-singularity

Setup and run

Check out the Wiki to set up and run ARPEGGIO. If you're in a hurry you can also find a Quick Setup section. The Wiki will help you better understand the workflow design, input, output and summary files.

Troubleshooting and support

Google doesn't help? Are you stuck on an error that no one else seems to be having? Have you checked all the pages mentioning your problem but the solutions are not suitable? On the Wiki there's a list of common problems together with some general solutions. If that didn't work either, feel free to open an issue. Please make sure to describe your problem/errors and your trials in detail so that you can get the best help possible.

Credits

This project was inspired by ARMOR , if you work with RNA-seq data check it out!

Citing ARPEGGIO

Milosavljevic, S., Kuo, T., Decarli, S. et al. ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw. BMC Genomics 22, 547 (2021). https://doi.org/10.1186/s12864-021-07845-2

Code Snippets

24
25
shell:
    "bismark_genome_preparation {params.genome1} 2> {log}"
44
45
shell:
    "bismark_genome_preparation {params.genome2} 2> {log}"
78
79
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.genome1} {input.fastq} 2> {log}"
112
113
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.genome2} {input.fastq} 2> {log}"
149
150
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome1} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"
186
187
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome2} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"
50
51
shell:
    "wget https://github.com/tony-kuo/eagle/archive/v{params.eagle_version}.tar.gz -O {output.eagle_tar} 2> {log}"
65
66
shell:
    "tar -C {params.build_prefix} -vxf {input.eagle_tar} 2> {log}"
79
80
shell:
    "wget https://github.com/samtools/htslib/releases/download/{params.htslib_version}/{params.htslib_tar_name} -O {output.htslib_tar} 2> {log}"
94
95
shell:
    "tar -C {params.build_prefix} -vxf {input.htslib_tar} 2> {log}"
113
114
shell:
    "{params.eagle_mk_env} make -j {threads} -C {params.eagle_dir_path} {params.eagle_mk_flags} 2> {log}"
130
131
shell:
    "{params.eagle_mk_env} make -C {params.eagle_dir_path} install {params.eagle_mk_flags} 2> {log}"
28
29
shell:
    "deduplicate_bismark -s --output_dir {params} --bam {input} 2> {log}"
52
53
shell:
    "deduplicate_bismark -p --output_dir {params} --bam {input} 2> {log}"
26
27
shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}"
47
48
shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}"
62
63
shell:
    "cat {input.p1} {input.p2} > {output}"
80
81
shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input} {params.output} {params.sample_name} 2> {log}"
104
105
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"
128
129
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"
152
153
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"
176
177
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"
200
201
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"
224
225
shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"
249
250
shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"
271
272
shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"
293
294
shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"
23
24
shell:
    "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}"
49
50
shell:
    "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}"
69
70
shell:
    "bedtools intersect -a {params.anno1} -b {input.i1} -wo > {output.o1} 2> {log}"
86
87
shell:
    "bedtools intersect -a {params.anno2} -b {input.i2} -wo > {output.o2} 2> {log}"
105
106
107
108
shell:
    "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}" if (
    sum(samples.origin == "parent1") > 0
    ) else "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}"
122
123
shell:
    "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}"
137
138
shell:
    "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}"
157
158
shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}"
174
175
shell:
    "Rscript scripts/DMGeneSummary.R {input.i2} {input.dm2} {params.geneID2} {params.o2} 2> {log}"
195
196
197
198
shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}" if (
    sum(samples.origin == "parent1") > 0
    ) else "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o1} 2> {log}"
214
215
shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}"
231
232
shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o2} 2> {log}"
71
72
73
shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
141
142
143
shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
237
238
239
shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
333
334
335
shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
410
411
412
shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
487
488
489
shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
594
595
596
shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
701
702
703
shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"
740
741
shell:
    "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}"
775
776
shell:
    "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}"
818
819
shell:
    "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}"
861
862
shell:
    "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}"
25
26
shell:
    "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}"
46
47
shell:
    "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}"
69
70
shell:
    "fastqc -o {params.FastQC} -t {threads} {input} 2> {log}"
87
88
shell:
    "bismark_genome_preparation {input.control} 2> {log}"
121
122
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.control} {input.fastq} 2> {log}"
155
156
shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.control} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"
183
184
shell:
    "samtools sort {input.p1} > {output.o1}"
211
212
shell:
    "samtools sort {input.p2} > {output.o2} 2> {log}"
247
248
shell:
    "samtools sort {input} > {output}"
276
277
shell:
    "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"
305
306
shell:
    "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"
330
331
shell:
    "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"
355
356
shell:
    "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"
376
377
shell:
    "multiqc {params.inputdir} -f -o {params.multiqcdir} 2> {log}"
37
38
shell:
    "{input.eagle_bin} --ngi {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}"
68
69
shell:
    "{input.eagle_bin} --ngi --paired {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}"
29
30
31
shell:
    "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt {input.fastq1} 2> {log}" if config[
    "RUN_TRIMMING"
69
70
71
shell:
    "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --clip_R2 {params.trim_5_r2} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt --paired {input.fastq1} {input.fastq2} 2> {log}" if config[
    "RUN_TRIMMING"
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
library(data.table)

comm_args <- commandArgs(trailingOnly = TRUE)
full_path <- normalizePath(comm_args[1])
## Read coverage file
coverage_file <- fread(full_path)
## Set working directory
out <- comm_args[2]
out <- normalizePath(out)
setwd(out)
# Save output name
output <- comm_args[3]
colnames(coverage_file) <- c("scaffold", "position", "strand", "mC", 
                             "uC", "context", "spec_context")
print("File has been read")

# Replace all NAs with 0s

coverage_file[is.na(coverage_file$mC),]$mC <- 0
coverage_file[is.na(coverage_file$uC),]$uC <- 0

# Save all positions with reads

with_reads <- (coverage_file$mC+coverage_file$uC)!=0

## We vectorize conditions to simplify our problem
## First check only contexts of interest: in our case CG, CHG and CHH (with reads)

CG_contexts <- coverage_file$context=="CG" & with_reads
CHG_contexts <- coverage_file$context=="CHG" & with_reads
CHH_contexts <- coverage_file$context=="CHH" & with_reads

## We now generate our cov files with the following format:
## <scaffold> <start_position> <end_position> <% methylation>
# <count_methylated> <count_unmethylated>

cov_CG <- cbind(coverage_file$scaffold[CG_contexts],
                coverage_file$position[CG_contexts],
                coverage_file$position[CG_contexts],
                (coverage_file$mC[CG_contexts]/(coverage_file$mC[CG_contexts]+coverage_file$uC[CG_contexts]))*100,
                coverage_file$mC[CG_contexts],
                coverage_file$uC[CG_contexts])

#Replace infinite values with 0 (not needed anymore)
#infinite <- cov_CG[,4]==Inf
#cov_CG[infinite,4] <- 0

cov_CHG <- cbind(coverage_file$scaffold[CHG_contexts],
                coverage_file$position[CHG_contexts],
                coverage_file$position[CHG_contexts],
                (coverage_file$mC[CHG_contexts]/(coverage_file$mC[CHG_contexts]+coverage_file$uC[CHG_contexts]))*100,
                coverage_file$mC[CHG_contexts],
                coverage_file$uC[CHG_contexts])

cov_CHH <- cbind(coverage_file$scaffold[CHH_contexts],
                 coverage_file$position[CHH_contexts],
                 coverage_file$position[CHH_contexts],
                 (coverage_file$mC[CHH_contexts]/(coverage_file$mC[CHH_contexts]+coverage_file$uC[CHH_contexts]))*100,
                 coverage_file$mC[CHH_contexts],
                 coverage_file$uC[CHH_contexts])

write.table(cov_CG, file = paste0(output,"_CG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(cov_CHG, file = paste0(output,"_CHG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(cov_CHH, file = paste0(output,"_CHH.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
library(data.table)
library(stringr)

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: intersection file

intersection <- comm_args[1]

# Second argument: DM regions (dmrseq)

regions <- comm_args[2]

# Third argument: geneID name

geneID <- comm_args[3]

# Fifth argument: output name

output_name <- comm_args[4]

# Check if intersection file is empty 

if (file.info(intersection)$size==0){
  print("There was no intersection between significant DMRs and the annotation file. Returning empty file")
  empty <- c("There was no intersection between significant DMRs and the annotation file")
  write.table(empty, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", 
              row.names = FALSE, col.names = TRUE)
} else {

# Read, clean and name file columns

intersection_file <- fread(intersection)
dm_regions <- fread(regions)

# Pick columns of interest
intersection_file <- intersection_file[, c(1, 3, 4, 5, 9, 11, 12, 13)]

# Rename columns
colnames(intersection_file) <- c("seqname", "feature", "start", "end", 
                                 "attribute", "overlap_start", "overlap_end",
                                 "length")

# Select only genes
intersection_file <- intersection_file[intersection_file$feature=="gene",]

# Pick geneID from attribute column
geneID_name <- paste0(geneID, "=") 

# Separate subcolumns in attribute column
attribute_col <- as.data.frame(str_split_fixed(intersection_file$attribute, ";", n=Inf))

# Look for column with geneID keyword
geneID_column <- which(grepl(geneID_name, unlist(attribute_col[1,])))

# Remove geneID keyword from column
intersection_file$attribute <- gsub(geneID_name, "", attribute_col[,geneID_column])

# Create final file (missing 1 column)
DM_genes_summary <- as.data.frame(cbind(geneID=intersection_file$attribute, 
                                        seqname=intersection_file$seqname,
                                        start=intersection_file$start, 
                                        end=intersection_file$end,
                                        region_start=intersection_file$overlap_start, 
                                        region_end=intersection_file$overlap_end, 
                                        overlap_length=intersection_file$length))

# Match DM regions coordinates

dm_coordinates <- paste(DM_genes_summary$seqname, ":", DM_genes_summary$region_start,
                        "-", DM_genes_summary$region_end, sep = "")
dm_regions <- cbind(dm_regions, coordinates=paste(dm_regions$seqname, ":", dm_regions$start,
                                      "-", dm_regions$end, sep = ""))
# Find which line of the summary corresponds to which line of the dmrseq file
corresponding_match <- match(dm_coordinates, dm_regions$coordinates)

# Add column to dmrseq with methylation status

m_status <- ifelse(dm_regions$stat>0, "decrease", "increase")

# Add final column to summary file

DM_genes_summary <- cbind(DM_genes_summary, methylation_status=m_status[corresponding_match])

# write file to output

write.table(DM_genes_summary, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", 
            row.names = FALSE, col.names = TRUE)

}
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
library(dmrseq)
library(data.table)
library(BiocParallel)

# Four command line arguements are needed: first is the number of samples
# for the first species analyzed, second is the number of samples for the
# second species analyzed, third is the output name with extension and 
# fourth is the number of cores. After those four, the cov files from the
# two species you want to compare need to be added (as many as you have).
# The order of the cov files MUST be diploid species first and polyploid species second

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: number of samples for one species
samples1 <- comm_args[1]

# Second argument: number of samples for the other species
samples2 <- comm_args[2]

# Second argument: output name (with extension)
output <- comm_args[3]

# Third argument: number of cores
cores <- comm_args[4]

# All other arguments: cov files (in the right order)
samples <- as.integer(samples1) + as.integer(samples2)
sample_counter <- 0
for (i in 1:samples){
  if (!is.na(comm_args[i+4])){
    assign(paste0("file_", i), comm_args[i+4])
    sample_counter <- sample_counter + 1
  }
}

# Create vector of cov files

cov_files <- c()
for (i in 1:sample_counter){
  cov_files[i] <- get(paste0("file_", i))
}

# Read cov files

bismarkBSseq <- read.bismark(files = c(cov_files),
                             rmZeroCov = TRUE, 
                             strandCollapse = FALSE,
                             verbose = TRUE)

# Specify conditions
sampleNames = c(rep("par", as.integer(samples1)), rep("kam", as.integer(samples2)))
pData(bismarkBSseq)$Species <- sampleNames

# Filtering step
loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(bismarkBSseq, type="Cov")==0) == 0)
sample.idx <- which(pData(bismarkBSseq)$Species %in% c("par", "kam"))

bs.filtered <- bismarkBSseq[loci.idx, sample.idx]

register(MulticoreParam(cores))

# DMRseq function, normally takes around 1.5h
regions <- dmrseq(bs = bs.filtered, testCovariate = "Species")

# Save Robject

outputR <- paste0(substr(output, 1, nchar(output)-3), "Rdata")
save(regions, file = outputR)

#This took about 1.5 h
regions_dataframe <- as.data.frame(regions)

write.csv(regions_dataframe, file=output, quote = FALSE, row.names = FALSE, col.names = TRUE)
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
library(data.table)

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: dmrseq output
dmrseq_output <- comm_args[1]

# Second argument: output name
output_name <- comm_args[2]


# Read file

candidate_regions <- fread(dmrseq_output)

# Select only significant regions

sig_regions <- candidate_regions[candidate_regions$qval < .05]

# If no regions are found, create an empty file and print message

if (nrow(sig_regions)==0){
  print("dmrseq didn't find any significant regions: all DMRs had q-value > 0.05. Returning empty file")
  empty <- c()
  write.table(empty, file=paste0(output_name, "_sorted.bed"), quote = FALSE, sep = "\t", 
              row.names = FALSE, col.names = FALSE)
} else {

# Select range of significant regions

sig_regions_bed <- sig_regions[,c("seqnames", "start", "end")]

# Some regions might have start > end, so we need to fix this

# Get index of wrong positions

wrong_pos_index <- which(sig_regions_bed$start > sig_regions_bed$end)

#select rows with wrong index

wrong_pos <- sig_regions_bed[wrong_pos_index,]

#switch starting and ending position in the original file

sig_regions_bed[wrong_pos_index,2] <- wrong_pos$end
sig_regions_bed[wrong_pos_index,3] <- wrong_pos$start

#To check whether anything is wrong use following command:
#wrong_pos_index <- which(sig_regions_bed$start>sig_regions_bed$end)

# output bed file

write.table(sig_regions_bed, file=paste0(output_name, ".bed"), quote = FALSE, sep = "\t", 
            row.names = FALSE, col.names = FALSE)

command <- paste("sort -k1,1 -k2,2n ", output_name, ".bed > ", output_name, "_sorted.bed", 
                 sep = "")

system(command = command)
}
ShowHide 63 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/supermaxiste/ARPEGGIO
Name: arpeggio
Version: v3.0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...