DeepG4ToolsComparison: A snakemake pipeline to run and evaluate G4’s DNA prediction tools

public public 1yr ago 0 bookmarks

DeepG4ToolsComparison: A snakemake pipeline to run and compare G4 DNA prediction tools with DeepG4

The predictions for differents tissues and cancer with DeepG4 is available here

The code to generate the precision/recall curve is available here .

Overview

It’s based on Snakemake to manage the workflow and Docker to isolate the application and run it with the appropriate tool versions.

Installation

Clone the repository :

git clone https://github.com/morphos30/DeepG4ToolsComparison.git
cd DeepG4ToolsComparison

Install the docker image and run it :

docker build . -t morphos30/g4docker -f Dockerfile/Dockerfile
docker run -it -v $(pwd):/DeepG4ToolsComparison morphos30/g4docker /bin/bash

Where $(pwd) is the working directory of DeepG4ToolsComparison on your computer.

Launch the pipeline :

cd /DeepG4ToolsComparison
snakemake --use-conda -j 30

You have to set the option --use-conda in order to install and run each tool in its proper environment.

Workflow specifications

Input

  • DNA sequences into bed format, split into positive set and negative set, written into the bed directory.

Note : if you want add a new dataset, edit the Snakefile file and add the bed files in the dictionnary EXPERIMENTS , without the .bed extension. Example :

TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM.bed TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42.bed

EXPERIMENTS = { "TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM":{"CTRL":"TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42_Ctrl_gkmSVM","EXP":"TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_0.8_42"}
}

Where CTRL is the negative set and EXP is the positive set.

  • DNA Accessibility (ATAC-seq/DNAse-seq/MNase-seq) in bigwig format or directly the averaged value for each sequence in a one-column tsv file.
ATACFILE = { "TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM":["ATAC_entinostat_mean.bw"]
}

or one-column tsv file in fasta/{Experiment_name}/{Experiment_name}_atac_merged.tsv . Example :

fasta/TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM/TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_atac_merged.tsv

head TestSet_Peaks_BG4_G4seq_HaCaT_GSE76688_hg19_201b_Ctrl_gkmSVM_atac_merged.tsv
0.01628741641898675
0.028752257447422012
0.028878783223623482
0.055516399884055316
0.02825982069785745
0.03582923041809851
0.023904436394151577
0.07724288611280039
0.01740800116454673
0.05779605688479145

Rulegraph :

Workflow output for each tools :

Outputs Tools Methods
ATACDeepG4_ATACnormBG ATACDeepG4 DeepG4 using accessibily (DeepG4 in paper)
ATACDeepG4_classictuningOH5 ATACDeepG4 DeepG4 without accessibility (DeepG4* in paper)
penguinn_retrained penguinn penguinn using custom model trained on BG4G4seq dataset
penguinn penguinn penguinn using default model
G4detector_retrained G4detector G4detector using custom model trained on BG4G4seq dataset
G4detector G4detector G4detector using default model
quadron_retrained quadron quadron using custom model trained on BG4G4seq dataset
quadron_score quadron quadron using default model

Code Snippets

10
11
shell:
  "Rscript "+AUC_script+" {input} {output.pdf}"
SnakeMake From line 10 of rules/AUC.smk
22
23
shell:
  "Rscript "+supp_metrics_script+" {input} {output.pdf}"
SnakeMake From line 22 of rules/AUC.smk
35
36
shell:
  "Rscript "+subseq_fasta+" {params.sub} {input.fas} {output.fasta_trimmed}"
82
83
script:
  "../scripts/Snakemake/generate_fasta_atac_from_bed_bgnorm.R"
11
12
shell:
  "python2 "+G4CatchAll_script+" --fasta {input.fas} > {output}"
26
27
shell:
  "Rscript "+G4CatchAll_tsv+" {input.table} {input.fas} {output} {params.calc}"
15
16
shell:
  "python "+G4detector_script+" test {input.fas} {params.model} {output}"
29
30
shell:
  "Rscript "+G4detector_to_tsv+" {input.table} {input.fas} {output}"
44
45
shell:
  "python "+G4detector_script+" test {input.fas} {params.model} {output}"
59
60
shell:
  "Rscript "+G4detector_to_tsv+" {input.table} {input.fas} {output}"
40
41
shell:
  "Rscript "+G4Hunter_tsv+" {input} {output} {params.calc} {params.score_selec}"
58
59
shell:
  "Rscript "+G4Hunter_retrained_script+" {params.model_path} {input} {output.score} {params.calc}"
12
13
14
15
16
17
18
19
20
21
22
23
24
25
run:
  from Bio import SeqIO
  with open(output[0],"w") as outfile:
    with open(input.fas, "rU") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            try:
              cmd = shell("echo '"+str(record.seq)+"' | timeout 20s "+GQRS_Mapper_script+" -csv -notitle",iterable=True)
              for line in cmd:
                if len(line) == 0:
                  outfile.write(record.id+",,,,,,,,,\n")
                else:
                  outfile.write(record.id+","+line+"\n")
            except:
              outfile.write(record.id+",,,,,,,,,\n")
39
40
shell:
  "Rscript "+GQRS_Mappper_tsv+" {input.table} {output} {params.calc}"
27
28
shell:
  "Rscript "+DeepG4OH4_script+" {input.fas} {input.atac_merged} {output} {params.model}"
46
47
shell:
  "Rscript "+penguinn_script_R+" {input.fas} {output} {params.model}"
61
62
shell:
  "Rscript "+penguinn_script_R+" {input.fas} {output} {params.model}"
13
14
shell:
  "python2 "+qparse_script+" -i {input.fas} -o {output}"
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
run:
  fasta = {}
  with open(output[0],"w") as outfile:
    with open(input.qparse_file, "r") as qparsefile:
        for line in qparsefile:
          line = line.strip()
          if not line:
            continue
          if line.startswith("#"):
            continue
          if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
          sequence = line
          fasta[active_sequence_name].append(sequence)
    for i,v in fasta.items():
       outfile.write("\n".join([i+"\t"+j for j in v])+"\n")
56
57
shell:
  "Rscript "+qparse_tsv+" {input.fas} {input.qparse_file} {output} {params.calc}"
14
15
shell:
  "g4predict intra -M -f {input} -b {output}"
30
31
shell:
  "Rscript "+quadparser_tsv+" {input.table} {input.fas} {output} {params.calc}"
23
24
shell:
  "Rscript "+generate_quadron_dat+" {params.type} {params.Quadron_lib} {params.source} {threads} {input.fas} {output.quadron_table}"
43
44
shell:
  "Rscript "+generate_quadron_dat+" {params.type} {params.Quadron_lib} {params.source} {threads} {input.fas} {output.quadron_table}"
61
62
shell:
  "Rscript "+quadron_retrained_script+" {params.model_path} {input.quadron_table} {output.score}"
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
require(Biostrings)
if (!require(plyranges))
{
  BiocManager::install("plyranges")
}
require(plyranges)
if (!require(BSgenome.Hsapiens.UCSC.hg19))
{
  BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19)
require(tidyverse)
seqlens <- seqlengths(BSgenome.Hsapiens.UCSC.hg19)
binbed <- snakemake@params[["random_regions"]] %>% read_bed()
#Functions (copied from DeepG4 package)
getScoreBW <- function (WIG, BED,forScan=F)
{
  res <- do.call("rbind",lapply(split(BED, droplevels(GenomeInfoDb::seqnames(BED))), function(zz) {
    cov <- WIG[[unique(as.character(GenomeInfoDb::seqnames(zz)))]]
    score <- IRanges::Views(cov, start = BiocGenerics::start(zz), end = BiocGenerics::end(zz))
    return(as.matrix(score))
  }))
  if(forScan){
    return(res)
  }else{
    return(rowMeans(res))
  }
}
NormBW <- function(x,binbed){
  ranges_bin <-  base::range(IRanges::subsetByOverlaps(x,binbed)$score,na.rm = TRUE, finite = TRUE)
  x$score <- scales::rescale(x$score,to = c(0,1),from = ranges_bin)
  x <- IRanges::coverage(x,weight = "score")
  return(x)
}


#read bed files
readBed <- . %>% read_tsv(col_names = F) %>% dplyr::select(1:3) %>% setNames(c("seqnames","start","end")) %>% as_granges()
G4pos.bed <- snakemake@input[["bed_pos"]] %>% read_bed()
if(unique(width(G4pos.bed)) != 201){
  G4pos.bed <- snakemake@input[["bed_pos"]] %>% readBed
}
G4neg.bed <- snakemake@input[["bed_ctrl"]] %>% read_bed()
if(unique(width(G4neg.bed)) != 201){
  G4neg.bed <- snakemake@input[["bed_ctrl"]] %>% readBed
}

G4pos.bed <- G4pos.bed %>% filter(seqnames != "chrY")
G4neg.bed <- G4neg.bed %>% filter(seqnames != "chrY")

#generate open chromatin dataset (ATAC or DNAse-seq)
my_seuil_bg <- snakemake@params[["seuil"]]
my_window_bg <- as.numeric(snakemake@params[["window"]])
ATAC_dataset <- snakemake@input[["atac_data"]]


G4all.bed <- c(G4pos.bed,G4neg.bed)
G4all.bed$order <- 1:length(G4all.bed)
G4all.bed <- BiocGenerics::sort(GenomeInfoDb::sortSeqlevels(G4all.bed))
G4all.atac <- ATAC_dataset %>% map(function(x){
  xbw <- x %>% import.bw %>% NormBW(binbed)
  res <- xbw %>% getScoreBW(G4all.bed)
  res[is.na(res)] <- 0
  return(res)
})%>% purrr::reduce(`+`) / length(ATAC_dataset)



G4all.bed.bg <- G4all.bed %>% anchor_center() %>% mutate(width=my_window_bg)%>%
  as_tibble() %>%
  left_join(enframe(seqlens),by = c("seqnames"="name")) %>%
  mutate(end = ifelse(end>value,value,end)) %>% dplyr::select(-width) %>% as_granges()

G4all.atac.bg <- ATAC_dataset %>% map(function(x){
  xbw <- x %>% import.bw() %>% NormBW(binbed)
  res <- xbw %>% getScoreBW(G4all.bed.bg)
  res[is.na(res)] <- 0
  return(res)
})%>% purrr::reduce(`+`) / length(ATAC_dataset)

my_test <- (G4all.atac/G4all.atac.bg)<my_seuil_bg
G4all.atac[my_test] <- 0


tibble(atac_seq=G4all.atac) %>% write_tsv(snakemake@output[["atac_merged"]],col_names = F)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
require(tidyverse)

# files <- list.files("results", recursive=TRUE, full.names=TRUE)
# files <- files[str_detect(files,"AUC/.*\\.tsv")]
res <- snakemake@input  %>% map(read_tsv) %>% map(gather,key = metric,value = value,-file) %>% bind_rows()

fres <- res %>% 
  mutate(Exp = str_extract(file,"GSE[0-9]+|breast-cancer-PDTX")) %>% 
  mutate(size = str_extract(file,"201b|500b|253b")) %>% 
  mutate(Ctrl = str_extract(file,"Ctrl(_[A-Za-z0-9]+|)")) %>% 
  mutate(Tool = str_extract(basename(file),"\\..+")) %>%
  mutate(Tool = str_remove(Tool,".8_42_Ctrl_(gkmSVM|neg).|.")) %>%
  mutate(cell_line = str_extract(file,"qG4|HaCaT|K562|HEKnp|H1975|A549|293T|HeLaS3")) %>% 
  mutate(TypeExp = str_extract(file,"Peaks_G4P_G4seq|TestSet_Peaks_qG4|TestSet_Peaks_G4seqpm_BG4|TestSet_Peaks_BG4_G4seq|Promoters_qG4|Promoters_G4seq_BG4|Promoters_BG4_G4seq|Peaks_G4seqpm_BG4|Peaks_BG4_G4seq|Peaks_G4seq_qG4|Peaks_G4seq_BG4|Peaks_qG4|Peaks_G4seqpm_qG4")) %>% 
  dplyr::select(-file)

#Raf en veut
fres  %>% 
  unite(Ctrl_dat,TypeExp,Exp,cell_line,Ctrl,size) %>% unite(Ctrl_dat,metric,Ctrl_dat) %>% spread(key = Ctrl_dat,value = value) %>% write_tsv(snakemake@output[[1]])
#Raf en veut
fres  %>%  group_by(Exp,Ctrl,cell_line,TypeExp,size,metric) %>% arrange(desc(value)) %>% slice(1) %>% 
  write_tsv(snakemake@output[[2]])

fres %>% unite(Ctrl_dat,TypeExp,Exp,cell_line,Ctrl,size) %>% group_by(Ctrl_dat,metric) %>% arrange(desc(value)) %>% mutate(Classement = 1:dplyr::n()) %>% 
  dplyr::select(-value) %>% spread(key = Ctrl_dat,value = Classement) %>% write_tsv(snakemake@output[[3]])

fres %>% write_tsv(snakemake@output[[4]])
98
99
script:
  "scripts/Snakemake/report_AUC.R"
SnakeMake From line 98 of main/Snakefile
ShowHide 26 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://www.biorxiv.org/content/10.1101/2020.07.22.215699v1
Name: deepg4toolscomparison
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...