A straightforward bioinformatic pipeline for detecting a bacterial strain in one or more metagenome(s).

StrainSifter is based on Snakemake. The pipeline outputs phylogenetic trees showing the relatedness of strains across input samples, as well as pairwise counts of single-nucleotide variants (SNVs) between input samples.

Installation

To run StrainSifter, you must have miniconda3 and Snakemake installed.

Install instructions (One time only)

  1. Download and install miniconda3:

    For Linux:

     wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
     bash Miniconda3-latest-Linux-x86_64.sh
    
  2. Clone the StrainSifter workflow to the directory where you wish to run the pipeline:

     git clone https://github.com/bhattlab/strainsifter
    
  3. Create the new conda environment:

     cd strainsifter
     conda env create -f envs/environment.yaml
    
  4. Install Snakemake:

    conda install snakemake -c bioconda -c conda-forge

StrainSifter has been developed and tested with Snakemake version 5.1.4 or higher. Check your version by typing:

snakemake --version

If you are running a version of Snakemake prior to 5.1.4, update to the latest version:

conda update snakemake -c bioconda -c conda-forge

Activate the conda environment (Every time you use StrainSifter)

source activate ssift

Dependencies

We recommend running StrainSifter in the provided conda environment. If you wish to run StrainSifter without the conda environment, the following tools (inferred from the commands the pipeline invokes) must be installed and on your system PATH: bwa, samtools, bamtools, bedtools, MUSCLE, FastTree, Python 3, and R with the ggtree, phangorn, and ggplot2 packages.
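
As a quick sanity check, here is a minimal Python sketch (the tool list is inferred from the pipeline's shell commands, and check_deps.py is a hypothetical helper, not part of StrainSifter) that verifies each command-line dependency is discoverable on your PATH:

# check_deps.py (hypothetical helper): verify pipeline tools are on PATH
import shutil

# command-line tools invoked by the pipeline's shell rules
TOOLS = ["bwa", "samtools", "bamtools", "bedtools", "muscle", "fasttree"]

missing = [tool for tool in TOOLS if shutil.which(tool) is None]
if missing:
    raise SystemExit("Missing from PATH: " + ", ".join(missing))
print("All dependencies found.")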

Running StrainSifter

Due to the computing demands of the StrainSifter pipeline, we recommend running it on a computing cluster if possible. Instructions for configuring Snakemake to schedule cluster jobs with SLURM can be found at https://github.com/bhattlab/slurm

Input files

  • Reference genome assembly in fasta format (can be a draft genome or a finished reference genome). Acceptable file extensions: ".fasta", ".fa", ".fna"

  • Two or more short read datasets in fastq format (metagenomic or isolate reads), optionally gzipped. Acceptable file extensions: ".fq", ".fastq", ".fq.gz", ".fastq.gz"

Short read data can be paired- or single-end.

You will need to indicate the input files for each sample you wish to run StrainSifter on in the config file, as described in the Config section below.

Config

You must update the config.yaml file as follows:

reference: Path to reference genome (fasta format)

reads: Samples and the file path(s) to the input reads.


Optionally, you can update the following parameters:

prefix: desired filename prefix for output files. If blank, the name of the reference genome will be used.

mapq: minimum mapping quality score required to evaluate a read alignment

n_mismatches: consider only reads with this many mismatches or fewer

min_cvg: minimum read depth required to call the nucleotide at any given position

min_genome_percent: the minimum fraction of bases that must be covered at min_cvg or greater for a sample to be processed

base_freq: minimum frequency of a nucleotide required to call a base at any position


Example config.yaml:

##### input files #####
# reference genome (required)
reference: /path/to/ref.fna
# short read data (at least two samples required)
reads:
  sample1:
    [/path/to/sample1_R1.fq,
     /path/to/sample1_R2.fq]
  sample2:
    [/path/to/sample2_R1.fq,
     /path/to/sample2_R2.fq]
  sample3: /path/to/sample3.fq
# prefix for output files (optional - can leave blank)
prefix:
##### workflow parameters #####
# alignment parameters:
mapq: 60
n_mismatches: 5
# variant calling parameters:
min_cvg: 5
min_genome_percent: 0.5
base_freq: 0.8
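
A small sketch for validating the config before a run (assumes PyYAML is available; checkConfig.py is a hypothetical helper, not part of StrainSifter):

# checkConfig.py (hypothetical helper): confirm all input files in config.yaml exist
import os
import yaml  # PyYAML

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

paths = [config["reference"]]
for sample, reads in config["reads"].items():
    # reads may be a single path (single-end) or a list of paths (paired-end)
    paths.extend(reads if isinstance(reads, list) else [reads])

for p in paths:
    if not os.path.isfile(p):
        print("Missing input file:", p)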

Running StrainSifter

To run StrainSifter, the config file must be present in the directory in which you wish to run the workflow. You should then be able to run StrainSifter as follows:

Phylogeny

To generate a phylogenetic tree showing all of the input samples that contain your strain of interest at sufficient coverage to profile:

snakemake {prefix}.tree.pdf

SNV counts

To generate a list of pairwise SNV counts between all input samples (a tab-separated table with columns Sample1, Sample2, SNVs, BasesCompared, and TotalBases):

snakemake {prefix}.dist.tsv

FAQ

Q: Can StrainSifter be used for non-bacterial genomes (e.g. yeast)?

A: At present, we recommend StrainSifter for bacteria only. In theory, StrainSifter should work for yeast if a haploid reference genome is provided.

Code Snippets

scripts/callSNPs.py (calls consensus bases from samtools mpileup output):

import sys
import gzip
import re

DEBUG = 0

ASCII_OFFSET = ord("!")
INDEL_PATTERN = re.compile(r"[+-](\d+)")
INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
ROUND = 3  # places to right of decimal point

min_coverage, min_proportion, min_qual = snakemake.params
pileup_file = snakemake.input[0]
out_file = snakemake.output[0]

min_coverage = int(min_coverage)
min_proportion = float(min_proportion)
min_qual = int(min_qual)

## function to process each line of pileup

def parse_pileup(line):

    # read fields from line of pileup
    chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")

    # proportion is 0 if no consensus base, top base is N by default
    parsed = {
        "proportion": 0.0,
        "chromosome": chromosome,
        "position": int(position),
        "reference": reference,
        "coverage": int(coverage),
        "pileup": pileup,
        "quality": quality,
        "top_base": "N"}

    # if the base coverage is below the minimum, leave the call as N
    if parsed["coverage"] < min_coverage:
        return parsed

    # uppercase pileup string for processing
    pileup = pileup.upper()

    # Remove start and stop characters from pileup string
    pileup = re.sub(r"\^.|\$", "", pileup)

    # Remove indels from pileup string
    start = 0

    while True:
        match = INDEL_PATTERN.search(pileup, start)

        if match:
            integer = match.group(1)
            pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
            start = match.start()
        else:
            break

    # get total base count and top base count
    total = 0
    top_base = "N"
    top_base_count = 0

    # uppercase reference base for comparison
    reference = reference.upper()

    base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}

    quality_length = len(quality)

    for i in range(quality_length):

        # convert ASCII character to phred base quality
        base_quality = ord(quality[i]) - ASCII_OFFSET

        # only count high-quality bases
        if base_quality >= min_qual:

            currBase = pileup[i]

            if currBase in base_counts:
                base_counts[currBase] += 1

            else:
                # '.' and ',' in a pileup denote matches to the reference base
                base_counts[reference] += 1

            total += 1

    parsed["total"] = total

    # find top base
    for base in base_counts:
        if base_counts[base] > top_base_count:
            top_base = base
            top_base_count = base_counts[base]

    # if more than 0 bases were processed
    if total > 0:
        prop = top_base_count / float(total)

        if prop >= min_proportion:
            parsed["proportion"] = prop
            parsed["top_base"] = top_base

    return parsed

## process pileup file

# read in input file
if not DEBUG:
    out_file = open(out_file, "w")

with open(pileup_file, "rt") as pileup_file:
    for pileup_file_line in pileup_file:

        parsed = parse_pileup(pileup_file_line)

        proportion = parsed["proportion"]

        # print to output file
        if DEBUG:
            print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
                parsed["pileup"], parsed["quality"]]))
        else:
            print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
                parsed["pileup"], parsed["quality"]]), file=out_file)

if not DEBUG:
    out_file.close()
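
To make the filtering concrete, here is a toy illustration (made-up pileup and quality values, not pipeline output) of how the script decodes base qualities and calls a consensus base; phred scores are recovered by subtracting ord("!") (ASCII 33) from each quality character:

# toy illustration of the consensus-calling logic above (made-up values)
quality = "IIII#"                        # 'I' decodes to phred 40, '#' to phred 2
pileup = "AAAGT"
min_qual, min_proportion = 20, 0.7

counts = {}
for base, q in zip(pileup, quality):
    if ord(q) - ord("!") >= min_qual:    # keep high-quality bases; 'T' is dropped
        counts[base] = counts.get(base, 0) + 1

top = max(counts, key=counts.get)        # 'A': 3 of the 4 surviving bases
total = sum(counts.values())
print(top if counts[top] / total >= min_proportion else "N")  # 0.75 >= 0.7, prints A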

scripts/coreSNPs2fasta.py (converts the core SNP table into per-sample fasta sequences):

import sys

HEADER_POS = 0
FASTA_POS = 1
#coreSNPs_file = sys.argv[1]

snp_file = snakemake.input[0]
out_file = snakemake.output[0]

# hold sequences as we go
fastas = []

with open(snp_file, "r") as core_snps:

    # process header
    header = core_snps.readline().rstrip("\n").split("\t")

    # pair each sample name from the header with an empty sequence string
    for h in range(len(header)):
        fastas += [[header[h], ""]]

    for line in core_snps:
        line = line.rstrip("\n").split("\t")

        for pos in range(len(line)):
            # print(fastas[pos])
            # print(fastas[pos][FASTA_POS])
            fastas[pos][FASTA_POS] += line[pos]

# rebind sys.stdout so that print() writes to the output file
with open(out_file, "w") as sys.stdout:
    for f in range(len(fastas)):
        print(">" + fastas[f][0])
        print(fastas[f][1])
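
In effect, the script transposes the per-position SNP table into one sequence per sample. A toy illustration (made-up sample names and bases):

# toy illustration of the table-to-fasta transposition (made-up data)
header = ["sampleA", "sampleB"]
rows = [["A", "G"], ["C", "C"], ["T", "A"]]

for col, name in enumerate(header):
    print(">" + name)
    print("".join(row[col] for row in rows))
# sampleA gets ACT; sampleB gets GCA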

scripts/findCoreSNPs.py (filters SNP positions to variable, fully resolved sites):

import sys

snp_file = snakemake.input[0]
out_file = snakemake.output[0]

# read each line and print if base is called for every sample and
# bases are not the same in every sample
with open(snp_file, "r") as snps:
    # rebind sys.stdout so that print() writes to the output file
    with open(out_file, "w") as sys.stdout:

        # print header
        print(snps.readline().rstrip('\n'))

        # process each line and output positions with no unknown bases ("N")
        # and where all samples do not have same base
        for line in snps:
            positions = line.rstrip('\n').split('\t')
            if (len(set(positions)) > 1) and ("N" not in positions):
                print("\t".join(positions))

scripts/getCoverage.py (summarizes read depth from a bedtools genomecov histogram):

import sys

# snakemake input and output files
sample_file = snakemake.input[0]
out_file = snakemake.output[0]

# minimum coverage threshold -- report percentage of bases covered at or
# beyond this depth
minCvg = int(snakemake.params[0])

totalBases = 0
coveredBases = 0
weightedAvg = 0
with open(sample_file, 'r') as sample:
    for line in sample:
        if line.startswith("genome"):
            chr, depth, numBases, size, fraction = line.rstrip('\n').split('\t')

            depth = int(depth)
            numBases = int(numBases)

            totalBases += numBases
            weightedAvg += depth * numBases

            if depth >= minCvg:
                coveredBases += numBases
avgCvg = 0
percCovered = 0

if totalBases > 0:
    avgCvg = float(weightedAvg) / float(totalBases)
    percCovered = float(coveredBases) / float(totalBases)

with open(out_file, 'w') as out:
    print("\t".join([str(round(avgCvg, 2)), str(round(percCovered, 2))]), file = out)

scripts/pairwiseDist.py (counts pairwise SNVs between consensus base calls):

import sys
import itertools
import subprocess
import re

# input and output files from snakemake
in_files = snakemake.input
out_file = snakemake.output[0]

cmd = " ".join(["cat", in_files[0], "| wc -l"])
totalBases = subprocess.check_output(cmd, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')
totalBases = int(totalBases) - 1

# for every pairwise combination of files, check SNV distance
with open(out_file, "w") as sys.stdout:

    # print header
    print('\t'.join(["Sample1", "Sample2", "SNVs", "BasesCompared", "TotalBases"]))

    # get all pairwise combinations of input files
    for element in itertools.combinations(in_files, 2):
        file1, file2 = element

        cmd1 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | wc -l"])
        totalPos = subprocess.check_output(cmd1, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')

        cmd2 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | awk '$1 != $2 {print $0}' | wc -l"])
        diffPos = subprocess.check_output(cmd2, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')

        fname1 = re.findall(r'consensus/(.+)\.', file1)[0]
        fname2 = re.findall(r'consensus/(.+)\.', file2)[0]

        # dist = subprocess.check_output(['./count_snvs.sh', file1, file2]).decode('ascii').rstrip('\n')
        print('\t'.join([fname1, fname2, str(diffPos), str(totalPos), str(totalBases)]))
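
The shell pipeline pastes the two consensus columns together, drops the header, ignores positions with an N in either sample, and counts disagreements. The same logic in plain Python (a sketch for clarity, not the pipeline's code):

# pure-Python sketch of the paste/sed/grep/awk pipeline above
def snv_count(file1, file2):
    with open(file1) as f1, open(file2) as f2:
        next(f1); next(f2)                  # skip the header line of each file
        compared = snvs = 0
        for b1, b2 in zip(f1, f2):
            b1, b2 = b1.strip(), b2.strip()
            if "N" in (b1, b2):             # skip unresolved positions
                continue
            compared += 1
            snvs += b1 != b2
    return snvs, compared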

scripts/renderTree.R (renders the phylogenetic tree):

library(ggtree)
library(phangorn)
library(ggplot2)

# input files
tree_file <- snakemake@input[[1]]
out_file <- snakemake@output[[1]]

# read tree file
tree <- read.tree(tree_file)

# midpoint root tree if there is more than 1 node
if (tree$Nnode > 1){
  tree <- midpoint(tree)
}

# draw tree
p <- ggtree(tree, branch.length = 1) +
  geom_tiplab(offset = .05) +
  geom_tippoint(size=3) +
  xlim(0, 4) +
  geom_treescale(width = 0.1, offset=0.25)

# save output file
ggsave(out_file, plot = p, height = 1*length(tree$tip.label))

shell:
	"bwa index {input}"
From line 34 of master/Snakefile

shell:
	"bwa mem -t {threads} {params.ref} {input.r} | "\
	"samtools view -b -q {params.qual} | "\
	"bamtools filter -tag 'NM:<={params.nm}' | "\
	"samtools sort --threads {threads} -o {output}"
From line 52 of master/Snakefile

shell:
	"bedtools genomecov -ibam {input} > {output}"
From line 68 of master/Snakefile

script:
	"scripts/getCoverage.py"
From line 83 of master/Snakefile

run:
	samps = input
	for samp in samps:
		with open(samp) as s:
			cvg, perc = s.readline().rstrip('\n').split('\t')
		if (float(cvg) >= params.min_cvg and float(perc) > params.min_perc):
			# drop the ".cvg" extension; rstrip(".cvg") would strip characters, not the suffix
			sample = os.path.basename(samp)[:-len(".cvg")]
			shell("ln -s $PWD/filtered_bam/{s}.filtered.bam passed_samples/{s}.bam".format(s=sample))
From line 98 of master/Snakefile

shell:
	"samtools faidx {input}"
From line 113 of master/Snakefile

shell:
	"samtools mpileup -f {input.ref} -B -aa -o {output} {input.bam}"
From line 127 of master/Snakefile

script:
	"scripts/callSNPs.py"
From line 142 of master/Snakefile

shell:
	"echo {wildcards.sample} > {output}; cut -f4 {input} >> {output}"
From line 153 of master/Snakefile

shell:
	"paste {input} > {output}"
From line 165 of master/Snakefile

script:
	"scripts/findCoreSNPs.py"
From line 177 of master/Snakefile

script:
	"scripts/coreSNPs2fasta.py"
From line 188 of master/Snakefile

shell:
	"muscle -in {input} -out {output}"
From line 199 of master/Snakefile

shell:
	"fasttree -nt {input} > {output}"
From line 210 of master/Snakefile

script:
	"scripts/renderTree.R"
From line 221 of master/Snakefile

script:
	"scripts/pairwiseDist.py"
From line 232 of master/Snakefile
