hisss

A Snakemake workflow to download and/or align reads to targets and produce useful outputs.


With hisss, you can:

  1. Download reads for all samples from a study archived in NCBI's Sequence Read Archive (optional). If you'd like to ONLY do this, check out grabseqs, an easy-to-use utility that spun off from this project.

  2. Stringently align .fastq files to one or more target sequences using Bowtie 2.

  3. Generate coverage maps for each positive sample.

  4. Generate standardized summary tables with information on depth and breadth of coverage for each positive sample.

Installing

Getting it up and running is hopefully simple:

git clone https://github.com/louiejtaylor/hisss
cd hisss

We use conda to handle dependencies; you can install Miniconda from the conda website. Make a new conda environment, then install dependencies from requirements.txt like so:

conda create -n hisss
conda activate hisss
conda install -c bioconda -c conda-forge -c louiejtaylor --file requirements.txt

Conda is great for managing dependencies and environments without requiring any admin privileges.

Configuration

Hisss can run on both local and remote fastqs that are either paired or unpaired. The options in config_template.yml should be self-explanatory--just replace the placeholders with the relevant paths to your samples and alignment targets.
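As a rough sketch of what a filled-in config might look like (the key names below are illustrative only; use the exact keys from config_template.yml):

```yaml
# Illustrative sketch -- the real key names live in config_template.yml
samples_dir: /project/fastq/        # where your local fastqs live
target_fasta: /project/targets.fa   # sequences to align reads against
paired: true                        # paired- vs. single-end reads
samples:
  Sample_A: ""
  Sample_B: ""
```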

The final step in setting up your config is to add your samples. We include two utilities to simplify adding samples to your config file, depending on where your data are located: list_SRA.py and list_samples.py.

SRA or MG-RAST data

If you're using data from the SRA, grabbing all the samples from a study is as simple as passing the project identifier (SRP#) to list_SRA.py like so:

python ./scripts/list_SRA.py SRP####### >> my_config.yml

This command will append nicely-formatted sample names to my_config.yml, along with some metadata of questionable utility. It also saves the full SRA metadata file as a .csv (so we recommend running this in your project directory). You also don't need to know whether the reads are paired- or single-end beforehand--as long as the information is in the SRA metadata, it'll be included.

A similar script is available for MG-RAST data, which can be used like so:

python ./scripts/list_MGRAST.py mgp###### >> my_config.yml

Local data

If you're running on local samples, use list_samples.py. Let's say your fastqs are paired, located in /project/fastq/, and are named like Sample_[sample_name]_R[pair].fastq:

python ./scripts/list_samples.py -pattern "Sample_{sample}_R{rp}.fastq" /project/fastq/ >> my_config.yml
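To see how the {sample} and {rp} fields map onto filenames, here is a rough shell equivalent of the extraction (list_samples.py itself does more; the files and directory below are made up for the demo):

```shell
# Fake paired fastqs following the Sample_{sample}_R{rp}.fastq convention
mkdir -p /tmp/hisss_pattern_demo
touch /tmp/hisss_pattern_demo/Sample_A_R1.fastq /tmp/hisss_pattern_demo/Sample_A_R2.fastq
touch /tmp/hisss_pattern_demo/Sample_B_R1.fastq /tmp/hisss_pattern_demo/Sample_B_R2.fastq

# Pull out the {sample} field; each sample appears once despite having two mates
ls /tmp/hisss_pattern_demo | sed -E 's/^Sample_(.+)_R[12]\.fastq$/\1/' | sort -u
```

The two mate files per sample collapse to a single sample name, which is what ends up in the config.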

Running

To run, simply execute the following in the hisss root dir. The -p flag prints the shell commands as they are executed. If you'd like a dry run (see the commands without running them), pass -np instead of -p.

snakemake -p --configfile path/to/my_config.yml all

If you're running on SRA data, we recommend using --restart-times since we've encountered issues with downloads randomly failing:

snakemake -p --restart-times 5 --configfile path/to/my_config.yml all

And if you'd like to use hisss only to grab the data from the SRA, pass the --notemp flag and specify download_only as the target rule:

snakemake -p --restart-times 5 --notemp --configfile path/to/my_config.yml download_only

There are many more useful options you can pass to snakemake that are beyond the scope of this tutorial; read more about them in the Snakemake documentation!

When you're done, leave the conda environment with:

conda deactivate

Troubleshooting

To run the dummy data (which should complete very quickly):

snakemake -p --configfile test_data/test_config.yml all

If you want to run the dummy data again after tinkering with the Snakefile or rules, you can clean up the test output like so:

cd test_data
bash clean_test.sh

Feel free to open an issue or tweet @Louviridae or @A2_Insanity if you have problems. Good luck!

Current workflow

[Directed acyclic graph of the workflow, generated by Graphviz]

Code Snippets

shell:
	"""
	# Per-base depth at every position (-a includes zero-depth positions)
	samtools depth -a {input} > {output.cov}
	# Only attempt plotting if any positions were covered
	if [ -s {output.cov} ]; then
		Rscript scripts/plot_coverage.R {output.cov} {params.plot};
	else
		echo "No valid alignments detected";
	fi
	"""
shell:
	"""
	cat {params.plot_dir}*.cov.depth.txt > {output}
	"""
shell:
	"bowtie2-build --threads {threads} -f {input} {input}"
shell:
	"""
	# Convert SAM to BAM (-b), using the target fasta (-T) for the reference header
	samtools view -bT {params.target} {input} > {output.bam}
	# Sort by coordinate and index for random access
	samtools sort -o {output.sorted} {output.bam}
	samtools index {output.sorted} {output.bai}
	"""
shell:
	"""
	samtools idxstats {input.bam} > {output}
	"""
shell:
	"""
	# Per-target coverage histogram from the BAM, dropping the whole-genome
	# summary lines, then tally breadth of coverage per target
	samtools view -b {input.bam} | genomeCoverageBed -ibam stdin | grep -v 'genome' | perl scripts/coverage_counter.pl > {output}
	"""
shell:
	"""
	Rscript scripts/summarize_alignments.R {input.cov} {input.stats} {output}
	"""
shell:
	"""
	echo -e "Sample\tAlignTarget\tFractionCoverage\tTargetLength\tMappedReads" > {output}
	cat {params.align_dir}*.align.summary.txt >> {output}
	"""
library(methods)
library(ggplot2)
library(magrittr)
library(reshape2)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()

# With `Rscript plot_coverage.R <coverage_file> <output_pdf>`,
# the positional arguments start at index 6 of commandArgs()
cov <- args[6]
output <- args[7]

if (!file_test("-f", cov)) {
  stop("Coverage file not found.")
}

if (is.na(output)) {
  stop("Output file not defined.")
}

#Convert to a dataframe if a positive hit--------------------------------------

# Exit quietly when there were no alignments (empty coverage file)
if (file.size(cov) == 0) {
  quit()
}
cov_df <- read.table(cov, sep = "\t")

colnames(cov_df) <- c("AlignTarget", "Position", "Count")

#Coverage Maps-----------------------------------------------------------------

##Function for creating plots.
plot_nuc_cov <- function(cov_df) {
  p.list <- lapply(sort(unique(cov_df$AlignTarget)), function(i) {
    ggplot(cov_df[cov_df$AlignTarget==i,], 
           aes(x = Position, 
               y = Count), fill = "black") +
    geom_col() +
    facet_wrap(~AlignTarget, scales = "free", ncol = 1) +
    ggtitle(NULL) +
    xlab("Position") + 
    ylab("Coverage") +
    theme_bw(base_size = 14) + 
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.spacing = unit(1.2, "lines"),
          strip.background = element_rect(colour="white", fill="white"),
          legend.position="right",
          axis.title = element_text(size = rel(1.5)),
          axis.text = element_text(size = rel(1.25), color = "black"),
          strip.text = element_text(size = rel(1.25), color = "black"))
  })

  return(p.list)
}


#Plot alignments---------------------------------------------------------------
plots <- plot_nuc_cov(cov_df)
pdf(output, onefile = TRUE)
invisible(lapply(plots, print))
dev.off()
library(methods)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()

# With `Rscript summarize_alignments.R <cov> <idxstats> <output>`,
# the positional arguments start at index 6 of commandArgs()
cov <- args[6]
read <- args[7]
output <- args[8]

if (is.na(cov)) {
  stop("Coverage file not defined.")
}

if (is.na(read)) {
  stop("Mapped read file not defined.")
}

if (is.na(output)) {
  stop("Output file not defined.")
}

#Define sample name------------------------------------------------------------
# cov <- "test-A.genomecoverage.txt"
# read <- "test-A.sorted.idxstats.tsv"
sample_name <- sub(".genomecoverage.txt", "", cov, fixed = TRUE)
sample_name <- sub(".*\\/", "", sample_name)  # strip any leading directory path

##Build a dataframe of coverage and mapped reads-------------------------------
cov_df <- read.delim(cov, sep = "\t", header=F)
cov_df$V3 <- sample_name
colnames(cov_df) <- c("AlignTarget", "Coverage", "Sample")

read_df <- read.delim(read, sep = "\t", header = F)
read_df$V5 <- sample_name
colnames(read_df) <- c("AlignTarget", "TargetLength",
                       "MappedReads", "UnmappedReads",
                       "Sample")
read_df <- subset(read_df, AlignTarget != "*", 
                  select = c(Sample, AlignTarget, TargetLength, MappedReads))

#Merge data. Keep all observations
all_align_data <- merge(cov_df, read_df, 
                        by = c("Sample", "AlignTarget"), 
                        all.x = TRUE, all.y = TRUE)

##Write output
write.table(all_align_data, sep = "\t",
            file = output,
            quote = FALSE,
            col.names = FALSE, row.names = FALSE)

License: GNU General Public License v3.0