A SnakeMake workflow to analyse whole genome bisulfite sequencing data from allopolyploids.

public 1yr ago Version: v3.0.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance wOrkflow

ARPEGGIO is a snakemake workflow that analyzes whole genome bisulfite sequencing (WGBS) data coming from (allo)polyploid species. The workflow includes all basic steps in WGBS data analysis (trimming, quality check and alignment), a read sorting tool specific for allopolyploids, the most comprehensive statistical tool for Differential Methylation (DM) analysis and a set of downstream analyses to obtain a list of genes showing differential methylation.

Motivation

In the last decade, the use of High-Throughput Sequencing (HTS) technologies has become widespread across life sciences. With technology not being a bottleneck anymore, the new challenge with HTS data has shifted towards data analysis. To process and analyze WGBS data, many tools exist, but most of them were developed and/or tested with a focus on diploid model species. For polyploid species there are some complexities that are often not taken into account. One example is the large amount of duplicated genes in polyploids (homeologous genes) which might be challenging at the mapping step and influence downstream analyses. To help with the analysis of polyploid WGBS data we developed ARPEGGIO: an automated and reproducible workflow which aims at being easy to set up and use.

Why ARPEGGIO?

ARPEGGIO is easily setup with one configuration file and once ready, it will automatically analyse your WGBS data to provide a list of differentially methylated regions (DMRs). Thanks to Snakemake, a human readable, Python based language for workflows and Conda, a widely-used package manager, ARPEGGIO takes care of installing all the software needed fo the analyses and running all the steps in the workflow in the correct order. ARPEGGIO also ensures reproducibility of your analysis, you only need to share your configuration and your initial raw data.

What's new in ARPEGGIO?

Besides the workflow itself (which is already quite a lot of new), ARPEGGIO includes an allopolyploid specific read-sorting algorithm that has been adapted to deal with BS-seq data: EAGLE-RC . Check out the papers "Homeolog expression quantification methods for allopolyploids" and "EAGLE: Explicit Alternative Genome Likelihood Evaluator" by Kuo et al. for more details. Together with EAGLE-RC, there's also dmrseq : an R package for differential methylation analysis. This package has one of the most comprehensive approaches to deal with WGBS data problems: mainly statistical and computational. If you're curious check out "Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing" by Korthauer et al .

Workflow overview

Installation

To install this workflow you first need to install Snakemake via Conda . To further ensure reproducibility you can also install Singularity . Once everything is set up, run the following commands to clone the ARPEGGIO repository to your computer and run the workflow. With Conda only:

git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda

With Conda and Singularity:

git clone https://github.com/supermaxiste/ARPEGGIO
cd ARPEGGIO
snakemake --use-conda --use-singularity

Setup and run

Check out the Wiki to set up and run ARPEGGIO. If you're in a hurry you can also find a Quick Setup section. The Wiki will help you better understand the workflow design, input, output and summary files.

Troubleshooting and support

Google doesn't help? Are you stuck on an error that no one else seems to be having? Have you checked all the pages mentioning your problem but the solutions are not suitable? On the Wiki there's a list of common problems together with some general solutions. If that didn't work either, feel free to open an issue. Please make sure to describe your problem/errors and your trials in detail so that you can get the best help possible.

Credits

This project was inspired by ARMOR , if you work with RNA-seq data check it out!

Citing ARPEGGIO

Milosavljevic, S., Kuo, T., Decarli, S. et al. ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw. BMC Genomics 22, 547 (2021). https://doi.org/10.1186/s12864-021-07845-2

Code Snippets

shell:
    "bismark_genome_preparation {params.genome1} 2> {log}"

SnakeMake Bismark From line 24 of rules/alignment.smk

shell:
    "bismark_genome_preparation {params.genome2} 2> {log}"

SnakeMake Bismark From line 44 of rules/alignment.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.genome1} {input.fastq} 2> {log}"

SnakeMake Bismark From line 78 of rules/alignment.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.genome2} {input.fastq} 2> {log}"

SnakeMake Bismark From line 112 of rules/alignment.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome1} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"

SnakeMake Bismark From line 149 of rules/alignment.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.genome2} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"

SnakeMake Bismark From line 186 of rules/alignment.smk

shell:
    "wget https://github.com/tony-kuo/eagle/archive/v{params.eagle_version}.tar.gz -O {output.eagle_tar} 2> {log}"

SnakeMake From line 50 of rules/build_eagle.smk

shell:
    "tar -C {params.build_prefix} -vxf {input.eagle_tar} 2> {log}"

SnakeMake From line 65 of rules/build_eagle.smk

shell:
    "wget https://github.com/samtools/htslib/releases/download/{params.htslib_version}/{params.htslib_tar_name} -O {output.htslib_tar} 2> {log}"

SnakeMake From line 79 of rules/build_eagle.smk

shell:
    "tar -C {params.build_prefix} -vxf {input.htslib_tar} 2> {log}"

SnakeMake From line 94 of rules/build_eagle.smk

shell:
    "{params.eagle_mk_env} make -j {threads} -C {params.eagle_dir_path} {params.eagle_mk_flags} 2> {log}"

SnakeMake From line 113 of rules/build_eagle.smk

shell:
    "{params.eagle_mk_env} make -C {params.eagle_dir_path} install {params.eagle_mk_flags} 2> {log}"

SnakeMake From line 130 of rules/build_eagle.smk

shell:
    "deduplicate_bismark -s --output_dir {params} --bam {input} 2> {log}"

SnakeMake Bismark From line 28 of rules/deduplication.smk

shell:
    "deduplicate_bismark -p --output_dir {params} --bam {input} 2> {log}"

SnakeMake Bismark From line 52 of rules/deduplication.smk

shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}"

SnakeMake From line 26 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input.p1} {params.output} {params.sample_name} 2> {log}"

SnakeMake From line 47 of rules/differential_methylation_analysis.smk

shell:
    "cat {input.p1} {input.p2} > {output}"

SnakeMake From line 62 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/CoverageFileGeneratorComplete.R {input} {params.output} {params.sample_name} 2> {log}"

SnakeMake From line 80 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"

SnakeMake From line 104 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"

SnakeMake From line 128 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"

SnakeMake From line 152 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"

SnakeMake From line 176 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p1} {params.n_samples_allo} {output.comparison1} {threads} {input.p1} {input.allo} 2> {log}"

SnakeMake From line 200 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.n_samples_p2} {params.n_samples_allo} {output.comparison2} {threads} {input.p2} {input.allo} 2> {log}"

SnakeMake From line 224 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"

SnakeMake From line 249 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"

SnakeMake From line 271 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/dmrseq.R {params.samples_B} {params.samples_A} {output.comparison} {threads} {input.condB} {input.condA} 2> {log}"

SnakeMake From line 293 of rules/differential_methylation_analysis.smk

shell:
    "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}"

SnakeMake From line 23 of rules/downstream.smk

shell:
    "Rscript scripts/significantGenesToBed.R {input} {params} 2> {log}"

SnakeMake From line 49 of rules/downstream.smk

shell:
    "bedtools intersect -a {params.anno1} -b {input.i1} -wo > {output.o1} 2> {log}"

SnakeMake BEDTools From line 69 of rules/downstream.smk

shell:
    "bedtools intersect -a {params.anno2} -b {input.i2} -wo > {output.o2} 2> {log}"

SnakeMake BEDTools From line 86 of rules/downstream.smk

shell:
    "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}" if (
    sum(samples.origin == "parent1") > 0
    ) else "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}"

SnakeMake BEDTools From line 105 of rules/downstream.smk

shell:
    "bedtools intersect -a {params.anno1} -b {input} -wo > {output} 2> {log}"

SnakeMake BEDTools From line 122 of rules/downstream.smk

shell:
    "bedtools intersect -a {params.anno2} -b {input} -wo > {output} 2> {log}"

SnakeMake BEDTools From line 137 of rules/downstream.smk

shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}"

SnakeMake From line 157 of rules/downstream.smk

shell:
    "Rscript scripts/DMGeneSummary.R {input.i2} {input.dm2} {params.geneID2} {params.o2} 2> {log}"

SnakeMake From line 174 of rules/downstream.smk

shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}" if (
    sum(samples.origin == "parent1") > 0
    ) else "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o1} 2> {log}"

SnakeMake From line 195 of rules/downstream.smk

shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID1} {params.o1} 2> {log}"

SnakeMake From line 214 of rules/downstream.smk

shell:
    "Rscript scripts/DMGeneSummary.R {input.i1} {input.dm1} {params.geneID2} {params.o2} 2> {log}"

SnakeMake From line 231 of rules/downstream.smk

shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 71 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 141 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 237 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -s -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 333 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 410 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 487 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 594 of rules/methylation_extraction.smk

shell:
    "bismark_methylation_extractor -p -o {params.output} --gzip --genome_folder {params.genome} --multicore {params.bismark_cores} --no_overlap --comprehensive --bedGraph --scaffolds --CX {input} 2> {log}" if config[
    "UNFINISHED_GENOME"

SnakeMake Bismark From line 701 of rules/methylation_extraction.smk

shell:
    "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}"

SnakeMake From line 740 of rules/methylation_extraction.smk

shell:
    "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}"

SnakeMake From line 775 of rules/methylation_extraction.smk

shell:
    "coverage2cytosine -CX --genome_folder {params.genome1} -o {params.filename1} {input.f1} 2> {log}"

SnakeMake From line 818 of rules/methylation_extraction.smk

shell:
    "coverage2cytosine -CX --genome_folder {params.genome2} -o {params.filename2} {input.f2} 2> {log}"

SnakeMake From line 861 of rules/methylation_extraction.smk

shell:
    "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}"

SnakeMake FastQC From line 25 of rules/quality_check.smk

shell:
    "fastqc -o {params.FastQC} -t {threads} {input.fastq} 2> {log}"

SnakeMake FastQC From line 46 of rules/quality_check.smk

shell:
    "fastqc -o {params.FastQC} -t {threads} {input} 2> {log}"

SnakeMake FastQC From line 69 of rules/quality_check.smk

shell:
    "bismark_genome_preparation {input.control} 2> {log}"

SnakeMake Bismark From line 87 of rules/quality_check.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores}  -o {params.output} --temp_dir {params.output} --genome {params.control} {input.fastq} 2> {log}"

SnakeMake Bismark From line 121 of rules/quality_check.smk

shell:
    "bismark --prefix {params.prefix} --multicore {params.bismark_cores} --genome {params.control} -1 {input.fastq1} -2 {input.fastq2} -o {params.output} --temp_dir {params.output} 2> {log}"

SnakeMake Bismark From line 155 of rules/quality_check.smk

shell:
    "samtools sort {input.p1} > {output.o1}"

SnakeMake SAMtools From line 183 of rules/quality_check.smk

shell:
    "samtools sort {input.p2} > {output.o2} 2> {log}"

SnakeMake SAMtools From line 211 of rules/quality_check.smk

shell:
    "samtools sort {input} > {output}"

SnakeMake SAMtools From line 247 of rules/quality_check.smk

shell:
    "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"

SnakeMake QualiMap From line 276 of rules/quality_check.smk

shell:
    "qualimap bamqc -bam {input} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"

SnakeMake QualiMap From line 305 of rules/quality_check.smk

shell:
    "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"

SnakeMake QualiMap From line 330 of rules/quality_check.smk

shell:
    "qualimap bamqc -bam {input.genome} -outdir {params.output} -nt {threads} --java-mem-size=4G 2> {log}"

SnakeMake QualiMap From line 355 of rules/quality_check.smk

shell:
    "multiqc {params.inputdir} -f -o {params.multiqcdir} 2> {log}"

SnakeMake MultiQC From line 376 of rules/quality_check.smk

shell:
    "{input.eagle_bin} --ngi {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}"

SnakeMake From line 37 of rules/read_sorting.smk

shell:
    "{input.eagle_bin} --ngi --paired {params.phred} --ref1={params.genome1} --bam1={input.reads1} --ref2={params.genome2} --bam2={input.reads2} -o {params.output} --bs=3 > {params.list} 2> {log}"

SnakeMake From line 68 of rules/read_sorting.smk

shell:
    "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt {input.fastq1} 2> {log}" if config[
    "RUN_TRIMMING"

SnakeMake Cutadapt Trim_Galore From line 29 of rules/trimming.smk

shell:
    "trim_galore -q 20 --clip_R1 {params.trim_5_r1} --clip_R2 {params.trim_5_r2} --phred33 --length 20 -j {params.trim_cores} -o {params.FASTQtrimmeddir} --path_to_cutadapt cutadapt --paired {input.fastq1} {input.fastq2} 2> {log}" if config[
    "RUN_TRIMMING"

SnakeMake Cutadapt Trim_Galore From line 69 of rules/trimming.smk

library(data.table)

comm_args <- commandArgs(trailingOnly = TRUE)
full_path <- normalizePath(comm_args[1])
## Read coverage file
coverage_file <- fread(full_path)
## Set working directory
out <- comm_args[2]
out <- normalizePath(out)
setwd(out)
# Save output name
output <- comm_args[3]
colnames(coverage_file) <- c("scaffold", "position", "strand", "mC", 
                             "uC", "context", "spec_context")
print("File has been read")

# Replace all NAs with 0s

coverage_file[is.na(coverage_file$mC),]$mC <- 0
coverage_file[is.na(coverage_file$uC),]$uC <- 0

# Save all positions with reads

with_reads <- (coverage_file$mC+coverage_file$uC)!=0

## We vectorize conditions to simplify our problem
## First check only contexts of interest: in our case CG, CHG and CHH (with reads)

CG_contexts <- coverage_file$context=="CG" & with_reads
CHG_contexts <- coverage_file$context=="CHG" & with_reads
CHH_contexts <- coverage_file$context=="CHH" & with_reads

## We now generate our cov files with the following format:
## <scaffold> <start_position> <end_position> <% methylation>
# <count_methylated> <count_unmethylated>

cov_CG <- cbind(coverage_file$scaffold[CG_contexts],
                coverage_file$position[CG_contexts],
                coverage_file$position[CG_contexts],
                (coverage_file$mC[CG_contexts]/(coverage_file$mC[CG_contexts]+coverage_file$uC[CG_contexts]))*100,
                coverage_file$mC[CG_contexts],
                coverage_file$uC[CG_contexts])

#Replace infinite values with 0 (not needed anymore)
#infinite <- cov_CG[,4]==Inf
#cov_CG[infinite,4] <- 0

cov_CHG <- cbind(coverage_file$scaffold[CHG_contexts],
                coverage_file$position[CHG_contexts],
                coverage_file$position[CHG_contexts],
                (coverage_file$mC[CHG_contexts]/(coverage_file$mC[CHG_contexts]+coverage_file$uC[CHG_contexts]))*100,
                coverage_file$mC[CHG_contexts],
                coverage_file$uC[CHG_contexts])

cov_CHH <- cbind(coverage_file$scaffold[CHH_contexts],
                 coverage_file$position[CHH_contexts],
                 coverage_file$position[CHH_contexts],
                 (coverage_file$mC[CHH_contexts]/(coverage_file$mC[CHH_contexts]+coverage_file$uC[CHH_contexts]))*100,
                 coverage_file$mC[CHH_contexts],
                 coverage_file$uC[CHH_contexts])

write.table(cov_CG, file = paste0(output,"_CG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(cov_CHG, file = paste0(output,"_CHG.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(cov_CHH, file = paste0(output,"_CHH.cov"), sep="\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

R data.table From line 10 of scripts/CoverageFileGeneratorComplete.R

library(data.table)
library(stringr)

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: intersection file

intersection <- comm_args[1]

# Second argument: DM regions (dmrseq)

regions <- comm_args[2]

# Third argument: geneID name

geneID <- comm_args[3]

# Fifth argument: output name

output_name <- comm_args[4]

# Check if intersection file is empty 

if (file.info(intersection)$size==0){
  print("There was no intersection between significant DMRs and the annotation file. Returning empty file")
  empty <- c("There was no intersection between significant DMRs and the annotation file")
  write.table(empty, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", 
              row.names = FALSE, col.names = TRUE)
} else {

# Read, clean and name file columns

intersection_file <- fread(intersection)
dm_regions <- fread(regions)

# Pick columns of interest
intersection_file <- intersection_file[, c(1, 3, 4, 5, 9, 11, 12, 13)]

# Rename columns
colnames(intersection_file) <- c("seqname", "feature", "start", "end", 
                                 "attribute", "overlap_start", "overlap_end",
                                 "length")

# Select only genes
intersection_file <- intersection_file[intersection_file$feature=="gene",]

# Pick geneID from attribute column
geneID_name <- paste0(geneID, "=") 

# Separate subcolumns in attribute column
attribute_col <- as.data.frame(str_split_fixed(intersection_file$attribute, ";", n=Inf))

# Look for column with geneID keyword
geneID_column <- which(grepl(geneID_name, unlist(attribute_col[1,])))

# Remove geneID keyword from column
intersection_file$attribute <- gsub(geneID_name, "", attribute_col[,geneID_column])

# Create final file (missing 1 column)
DM_genes_summary <- as.data.frame(cbind(geneID=intersection_file$attribute, 
                                        seqname=intersection_file$seqname,
                                        start=intersection_file$start, 
                                        end=intersection_file$end,
                                        region_start=intersection_file$overlap_start, 
                                        region_end=intersection_file$overlap_end, 
                                        overlap_length=intersection_file$length))

# Match DM regions coordinates

dm_coordinates <- paste(DM_genes_summary$seqname, ":", DM_genes_summary$region_start,
                        "-", DM_genes_summary$region_end, sep = "")
dm_regions <- cbind(dm_regions, coordinates=paste(dm_regions$seqname, ":", dm_regions$start,
                                      "-", dm_regions$end, sep = ""))
# Find which line of the summary corresponds to which line of the dmrseq file
corresponding_match <- match(dm_coordinates, dm_regions$coordinates)

# Add column to dmrseq with methylation status

m_status <- ifelse(dm_regions$stat>0, "decrease", "increase")

# Add final column to summary file

DM_genes_summary <- cbind(DM_genes_summary, methylation_status=m_status[corresponding_match])

# write file to output

write.table(DM_genes_summary, file=paste0(output_name, ".txt"), quote = FALSE, sep = "\t", 
            row.names = FALSE, col.names = TRUE)

}

R data.table stringr From line 6 of scripts/DMGeneSummary.R

library(dmrseq)
library(data.table)
library(BiocParallel)

# Four command line arguements are needed: first is the number of samples
# for the first species analyzed, second is the number of samples for the
# second species analyzed, third is the output name with extension and 
# fourth is the number of cores. After those four, the cov files from the
# two species you want to compare need to be added (as many as you have).
# The order of the cov files MUST be diploid species first and polyploid species second

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: number of samples for one species
samples1 <- comm_args[1]

# Second argument: number of samples for the other species
samples2 <- comm_args[2]

# Second argument: output name (with extension)
output <- comm_args[3]

# Third argument: number of cores
cores <- comm_args[4]

# All other arguments: cov files (in the right order)
samples <- as.integer(samples1) + as.integer(samples2)
sample_counter <- 0
for (i in 1:samples){
  if (!is.na(comm_args[i+4])){
    assign(paste0("file_", i), comm_args[i+4])
    sample_counter <- sample_counter + 1
  }
}

# Create vector of cov files

cov_files <- c()
for (i in 1:sample_counter){
  cov_files[i] <- get(paste0("file_", i))
}

# Read cov files

bismarkBSseq <- read.bismark(files = c(cov_files),
                             rmZeroCov = TRUE, 
                             strandCollapse = FALSE,
                             verbose = TRUE)

# Specify conditions
sampleNames = c(rep("par", as.integer(samples1)), rep("kam", as.integer(samples2)))
pData(bismarkBSseq)$Species <- sampleNames

# Filtering step
loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(bismarkBSseq, type="Cov")==0) == 0)
sample.idx <- which(pData(bismarkBSseq)$Species %in% c("par", "kam"))

bs.filtered <- bismarkBSseq[loci.idx, sample.idx]

register(MulticoreParam(cores))

# DMRseq function, normally takes around 1.5h
regions <- dmrseq(bs = bs.filtered, testCovariate = "Species")

# Save Robject

outputR <- paste0(substr(output, 1, nchar(output)-3), "Rdata")
save(regions, file = outputR)

#This took about 1.5 h
regions_dataframe <- as.data.frame(regions)

write.csv(regions_dataframe, file=output, quote = FALSE, row.names = FALSE, col.names = TRUE)

R data.table dmrseq From line 4 of scripts/dmrseq.R

library(data.table)

comm_args <- commandArgs(trailingOnly = TRUE)

# First argument: dmrseq output
dmrseq_output <- comm_args[1]

# Second argument: output name
output_name <- comm_args[2]


# Read file

candidate_regions <- fread(dmrseq_output)

# Select only significant regions

sig_regions <- candidate_regions[candidate_regions$qval < .05]

# If no regions are found, create an empty file and print message

if (nrow(sig_regions)==0){
  print("dmrseq didn't find any significant regions: all DMRs had q-value > 0.05. Returning empty file")
  empty <- c()
  write.table(empty, file=paste0(output_name, "_sorted.bed"), quote = FALSE, sep = "\t", 
              row.names = FALSE, col.names = FALSE)
} else {

# Select range of significant regions

sig_regions_bed <- sig_regions[,c("seqnames", "start", "end")]

# Some regions might have start > end, so we need to fix this

# Get index of wrong positions

wrong_pos_index <- which(sig_regions_bed$start > sig_regions_bed$end)

#select rows with wrong index

wrong_pos <- sig_regions_bed[wrong_pos_index,]

#switch starting and ending position in the original file

sig_regions_bed[wrong_pos_index,2] <- wrong_pos$end
sig_regions_bed[wrong_pos_index,3] <- wrong_pos$start

#To check whether anything is wrong use following command:
#wrong_pos_index <- which(sig_regions_bed$start>sig_regions_bed$end)

# output bed file

write.table(sig_regions_bed, file=paste0(output_name, ".bed"), quote = FALSE, sep = "\t", 
            row.names = FALSE, col.names = FALSE)

command <- paste("sort -k1,1 -k2,2n ", output_name, ".bed > ", output_name, "_sorted.bed", 
                 sep = "")

system(command = command)
}