Indels are not ideal - quick test for interrupted ORFs in bacterial/microbial genomes

public public 1yr ago 0 bookmarks

ideel

This repo builds on code by Mick Watson who wrote a blog post and follow up about a quick way to test the viability of a (long-read) assembly.

Dependencies:

  • Snakemake

  • Prodigal

  • Diamond

  • R (including libraries readr and ggplot2)

You will need a Diamond index like UniProt TREMBL.

run

Clone the repo.

The output of the workflow will be written to --directory . In there, make a directory called "genomes", put assemblies in there with .fa file extension

Edit the config.json file, specifying e.g. the path to the Diamond database.

Then:

# http://snakemake.readthedocs.io/en/stable/executable.html
# http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html
snakemake --configfile config.json --directory ~/tmp/ --cores 4
# per default, Snakemake uses only 1 CPU core

Code Snippets

 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
library(ggplot2)

# get command line arguments as an array
# args <- commandArgs(trailingOnly = TRUE)

# files
# filein <- args[1]
# fileout <- args[2]
filein <- snakemake@input[[1]]
fileout <- snakemake@output[[1]]

# http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts
# In the R script, an S4 object named snakemake analog to the Python case
# above is available and allows access to input and output files and other
# parameters. Here the syntax follows that of S4 classes with attributes
# that are R lists, e.g. we can access the first input file with snakemake@input[[1]]


# data
# data <- read_tsv(filein, col_names=c('qlen', 'slen'))
data <- read.table(filein, header=FALSE, sep='\t')
names(data) <- c('qlen', 'slen')


pseudogenes <- sum(data$qlen / data$slen < 0.9)
print(paste0('Encountered genes < 0.9 reference length: ', pseudogenes))


theme_min = function (
    size=10, font=NA, face='plain', 
    panelColor=backgroundColor, axisColor='#999999', 
    gridColor=gridLinesColor, textColor='black') 
{   
theme_text = function(...)
    ggplot2::theme_text(family=font, face=face, colour=textColor, size=size, ...)

opts(
    axis.text.x = theme_text(),
    axis.text.y = theme_text(),
    #axis.line = theme_blank(),
    axis.ticks = theme_segment(colour=axisColor, size=0.25),

    panel.border = theme_rect(colour=backgroundColor),
    # panel.border = theme_blank(),

    legend.background = theme_blank(),
    legend.key = theme_blank(),
    legend.key.size = unit(1.5, 'lines'),
    legend.text = theme_text(hjust=0),
    legend.title = theme_text(hjust=0),

    # panel.background = theme_rect(fill=panelColor, colour=NA),
    panel.background = element_blank(),

    # panel.grid.major = theme_line(colour=gridColor, size=0.33),
    panel.grid.major = element_blank(),

    # panel.grid.minor = theme_blank(),
    panel.grid.minor = element_blank(),

    strip.background = theme_rect(fill=NA, colour=NA),
    strip.text.x = theme_text(hjust=0),
    strip.text.y = theme_text(angle=-90),
    plot.title = theme_text(hjust=0),
    plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), 'lines'))
}


p <- ggplot(data, aes(x=qlen/slen)) + 
		geom_histogram(fill='white', color='grey25', bins=20) +
		xlab('query length / hit length') +
		ylab('frequency') +
		# scale_y_log10() +
        scale_x_continuous(limits=c(0, 1.3)) +
        theme_minimal()



# breaks
# bks <- seq(0,max(data$V1/data$V2)+1,by=0.05)

# main hist
# png(fileout, width=800, height=800, type='cairo')
# hist(data$V1/data$V2, breaks=bks, col='red', xlim=c(0,2), xlab='query len / hit len', ylab='frequency', main=filein)
# dev.off() 

# scaled hist
# fileout <- gsub('.png','.500.png', fileout)
# png(fileout, width=800, height=800, type='cairo')
# hist(data$V1/data$V2, ylim=c(0,500), breaks=bks, col='purple', xlim=c(0,2), xlab='query len / hit len', ylab='frequency', main=filein)
# dev.off()

ggsave(fileout, p, height=7, width=7, units='cm')
27
shell: 'prodigal -a {output} -q -i {input}'
37
shell: 'diamond blastp --threads {threads} --max-target-seqs 1 --db {params.db} --query {input} --outfmt {params.of} --out {output}'    
43
script: 'scripts/hist.R'
ShowHide 1 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/phiweger/ideel
Name: ideel
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...