hisss

A Snakemake workflow to download and/or align reads to targets and produce useful outputs.


With hisss, you can:

  1. Download reads for all samples from a study archived in NCBI's Sequence Read Archive (optional). If you'd like to ONLY do this, check out grabseqs, an easy-to-use utility that spun off from this project.

  2. Stringently align .fastq files to one or more target sequences using Bowtie 2.

  3. Generate coverage maps for each positive sample.

  4. Generate standardized summary tables with information on depth and breadth of coverage for each positive sample.

Installing

Getting it up and running is hopefully simple:

git clone https://github.com/louiejtaylor/hisss
cd hisss

We use conda to handle dependencies; you can install Miniconda from the conda website. Make a new conda environment, then install dependencies from requirements.txt like so:

conda create -n hisss
conda activate hisss
conda install -c bioconda -c conda-forge -c louiejtaylor --file requirements.txt

Conda is great for managing dependencies and environments without requiring any admin privileges.

Configuration

Hisss can run on both local and remote fastqs that are either paired or unpaired. The options in config_template.yml should be self-explanatory--just replace the placeholders with the relevant paths to your samples and alignment targets.
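As a rough sketch of what a filled-in config might look like (the key names below are illustrative only; use the exact keys from config_template.yml):

```yaml
# Illustrative sketch -- the real key names live in config_template.yml
samples_dir: /project/fastq/        # where your local fastqs live
target_fasta: /project/targets.fa   # sequences to align reads against
paired: true                        # paired- vs. single-end reads
samples:
  Sample_A: ""
  Sample_B: ""
```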

The final step in setting up your config is to add your samples. We include two utilities to simplify adding samples to your config file, depending on where your data are located: list_SRA.py and list_samples.py.

SRA or MG-RAST data

If you're using data from the SRA, grabbing all the samples from a study is as simple as passing the project identifier (SRP#) to list_SRA.py like so:

python ./scripts/list_SRA.py SRP####### >> my_config.yml

This command will append nicely-formatted sample names to my_config.yml, along with some metadata of questionable utility. It also saves the full SRA metadata file as a .csv (so we recommend running this in your project directory). You also don't need to know whether the reads are paired- or single-end beforehand--as long as the information is in the SRA metadata, it'll be included.

A similar script is available for MG-RAST data, which can be used like so:

python ./scripts/list_MGRAST.py mgp###### >> my_config.yml

Local data

If you're running on local samples, use list_samples.py. Let's say your fastqs are paired, located in /project/fastq/, and are named like Sample_[sample_name]_R[pair].fastq:

python ./scripts/list_samples.py -pattern "Sample_{sample}_R{rp}.fastq" /project/fastq/ >> my_config.yml
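To see how the {sample} and {rp} fields map onto filenames, here is a rough shell equivalent of the extraction (list_samples.py itself does more; the files and directory below are made up for the demo):

```shell
# Fake paired fastqs following the Sample_{sample}_R{rp}.fastq convention
mkdir -p /tmp/hisss_pattern_demo
touch /tmp/hisss_pattern_demo/Sample_A_R1.fastq /tmp/hisss_pattern_demo/Sample_A_R2.fastq
touch /tmp/hisss_pattern_demo/Sample_B_R1.fastq /tmp/hisss_pattern_demo/Sample_B_R2.fastq

# Pull out the {sample} field; each sample appears once despite having two mates
ls /tmp/hisss_pattern_demo | sed -E 's/^Sample_(.+)_R[12]\.fastq$/\1/' | sort -u
```

The two mate files per sample collapse to a single sample name, which is what ends up in the config.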

Running

To run, simply execute the following in the hisss root dir. The -p flag prints the shell commands as they are executed. If you'd like a dry run (see the commands without running them), pass -np instead of -p.

snakemake -p --configfile path/to/my_config.yml all

If you're running on SRA data, we recommend using --restart-times since we've encountered issues with downloads randomly failing:

snakemake -p --restart-times 5 --configfile path/to/my_config.yml all

And if you'd like to use hisss only to grab the data from the SRA, pass the --notemp flag and specify download_only as the target rule:

snakemake -p --restart-times 5 --notemp --configfile path/to/my_config.yml download_only

There are many more useful options you can pass to snakemake that are beyond the scope of this tutorial; read more about them in the Snakemake documentation!

When you're done, leave the conda environment with:

conda deactivate

Troubleshooting

To run the dummy data (which should complete very quickly):

snakemake -p --configfile test_data/test_config.yml all

If you want to run the dummy data again after tinkering with the Snakefile or rules, you can clean up the test output like so:

cd test_data
bash clean_test.sh

Feel free to open an issue or tweet @Louviridae or @A2_Insanity if you have problems. Good luck!

Current workflow

[Directed acyclic graph of the workflow, generated by Graphviz]

Code Snippets

shell:
	"""
	# Per-base depth at every position (-a includes zero-depth positions)
	samtools depth -a {input} > {output.cov}
	# Only attempt plotting if any positions were covered
	if [ -s {output.cov} ]; then
		Rscript scripts/plot_coverage.R {output.cov} {params.plot};
	else
		echo "No valid alignments detected";
	fi
	"""
shell:
	"""
	cat {params.plot_dir}*.cov.depth.txt > {output}
	"""
shell:
	"bowtie2-build --threads {threads} -f {input} {input}"
shell:
	"""
	# Convert SAM to BAM (-b), using the target fasta (-T) for the reference header
	samtools view -bT {params.target} {input} > {output.bam}
	# Sort by coordinate and index for random access
	samtools sort -o {output.sorted} {output.bam}
	samtools index {output.sorted} {output.bai}
	"""
shell:
	"""
	samtools idxstats {input.bam} > {output}
	"""
shell:
	"""
	# Per-target coverage histogram from the BAM, dropping the whole-genome
	# summary lines, then tally breadth of coverage per target
	samtools view -b {input.bam} | genomeCoverageBed -ibam stdin | grep -v 'genome' | perl scripts/coverage_counter.pl > {output}
	"""
shell:
	"""
	Rscript scripts/summarize_alignments.R {input.cov} {input.stats} {output}
	"""
shell:
	"""
	echo -e "Sample\tAlignTarget\tFractionCoverage\tTargetLength\tMappedReads" > {output}
	cat {params.align_dir}*.align.summary.txt >> {output}
	"""
library(methods)
library(ggplot2)
library(magrittr)
library(reshape2)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()

# With `Rscript plot_coverage.R <coverage_file> <output_pdf>`,
# the positional arguments start at index 6 of commandArgs()
cov <- args[6]
output <- args[7]

if (!file_test("-f", cov)) {
  stop("Coverage file not found.")
}

if (is.na(output)) {
  stop("Output file not defined.")
}

#Convert to a dataframe if a positive hit--------------------------------------

# Exit quietly when there were no alignments (empty coverage file)
if (file.size(cov) == 0) {
  quit()
}
cov_df <- read.table(cov, sep = "\t")

colnames(cov_df) <- c("AlignTarget", "Position", "Count")

#Coverage Maps-----------------------------------------------------------------

##Function for creating plots.
plot_nuc_cov <- function(cov_df) {
  p.list <- lapply(sort(unique(cov_df$AlignTarget)), function(i) {
    ggplot(cov_df[cov_df$AlignTarget==i,], 
           aes(x = Position, 
               y = Count), fill = "black") +
    geom_col() +
    facet_wrap(~AlignTarget, scales = "free", ncol = 1) +
    ggtitle(NULL) +
    xlab("Position") + 
    ylab("Coverage") +
    theme_bw(base_size = 14) + 
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.spacing = unit(1.2, "lines"),
          strip.background = element_rect(colour="white", fill="white"),
          legend.position="right",
          axis.title = element_text(size = rel(1.5)),
          axis.text = element_text(size = rel(1.25), color = "black"),
          strip.text = element_text(size = rel(1.25), color = "black"))
  })

  return(p.list)
}


#Plot alignments---------------------------------------------------------------
plots <- plot_nuc_cov(cov_df)
pdf(output, onefile = TRUE)
invisible(lapply(plots, print))
dev.off()
library(methods)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()

# With `Rscript summarize_alignments.R <cov> <idxstats> <output>`,
# the positional arguments start at index 6 of commandArgs()
cov <- args[6]
read <- args[7]
output <- args[8]

if (is.na(cov)) {
  stop("Coverage file not defined.")
}

if (is.na(read)) {
  stop("Mapped read file not defined.")
}

if (is.na(output)) {
  stop("Output file not defined.")
}

#Define sample name------------------------------------------------------------
# cov <- "test-A.genomecoverage.txt"
# read <- "test-A.sorted.idxstats.tsv"
sample_name <- sub(".genomecoverage.txt", "", cov, fixed = TRUE)
sample_name <- sub(".*\\/", "", sample_name)  # strip any leading directory path

##Build a dataframe of coverage and mapped reads-------------------------------
cov_df <- read.delim(cov, sep = "\t", header=F)
cov_df$V3 <- sample_name
colnames(cov_df) <- c("AlignTarget", "Coverage", "Sample")

read_df <- read.delim(read, sep = "\t", header = F)
read_df$V5 <- sample_name
colnames(read_df) <- c("AlignTarget", "TargetLength",
                       "MappedReads", "UnmappedReads",
                       "Sample")
read_df <- subset(read_df, AlignTarget != "*", 
                  select = c(Sample, AlignTarget, TargetLength, MappedReads))

#Merge data. Keep all observations
all_align_data <- merge(cov_df, read_df, 
                        by = c("Sample", "AlignTarget"), 
                        all.x = TRUE, all.y = TRUE)

##Write output
write.table(all_align_data, sep = "\t",
            file = output,
            quote = FALSE,
            col.names = FALSE, row.names = FALSE)

License: GNU General Public License v3.0