Genome Re-sequencing Analysis Snakemake Workflow: De-novo and Variant Calling Modes


This Snakemake workflow is for analysing genome re-sequencing experiments. It features two modes. The **de-novo** mode is used to confirm sample relationships from the raw sequencing reads with [kwip](https://github.com/kdmurray91/kWIP) and [mash](https://github.com/marbl/Mash). The **varcall** mode performs read alignments to one or several reference genomes followed by variant detection. Read alignments can be performed with [bwa](http://bio-bwa.sourceforge.net/bwa.shtml) and/or [NextGenMap](https://github.com/Cibiv/NextGenMap).
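
The mode to run is typically selected through the Snakemake target you request. As a rough sketch, assuming the Snakefile exposes target rules named `denovo` and `varcall` (check the Snakefile for the actual rule names and the configuration they expect):

snakemake -j 16 -p denovo     # k-mer based sample relationship checks (kwip/mash)
snakemake -j 16 -p varcall    # read alignment and variant calling against the configured reference(s)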

Usage

  1. Create a new GitHub repository in your GitHub account using this workflow as a template.

  2. Clone your newly created repository to your local system where you want to perform the analysis.

  3. Set up the software dependencies.

  4. Configure the workflow for your needs and input files.

  5. Run the workflow.

  6. Archive your workflow for documenting your work and easy reproduction (see the example below this list).
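
For step 6, Snakemake's built-in `--archive` option can bundle the workflow code, configuration, and input files into a single tarball. A minimal sketch (the archive file name is arbitrary; `--archive` expects the workflow directory to be under git version control):

snakemake -j 1 --archive dna-proto-analysis.tar.gz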

Some pointers for setting up, configuring, and running the workflow are given below; for details, please consult the technical documentation.

Setup

An easy way to set up the dependencies is with conda.

Get the Miniconda Python 3 distribution:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Create an environment with the required software:

NOTE: the conda environment name in these examples is dna-proto.

$ conda env create --name dna-proto --file envs/condaenv.yml

Activate the environment:

$ conda activate dna-proto

Some additional conda commands that are often useful are shown below.
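
These are generic conda commands, not specific to this workflow; the environment name dna-proto matches the one created above:

$ conda env list                                   # list available environments
$ conda list -n dna-proto                          # show packages installed in the environment
$ conda env export -n dna-proto > environment.yml  # snapshot the environment for reproducibility
$ conda deactivate                                 # leave the environment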


----

Check config and metadata

We provide scripts in utils/ to list the metadata and configuration parameters:

python utils/check_metadata.py
python utils/check_config.py

Visualising the workflow

You can inspect the workflow in graphical form by printing the so-called DAG (directed acyclic graph) of jobs.

snakemake --dag -npr -j 1 | dot -Tsvg > dag.svg
eog dag.svg
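
For projects with many samples the full DAG can become unwieldy; a condensed, one-node-per-rule view is available via Snakemake's --rulegraph option (same idea, different level of detail):

snakemake --rulegraph -npr | dot -Tsvg > rulegraph.svg
eog rulegraph.svg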

Pretending a run of the workflow

Prior to running the workflow, perform a dry run (a "pretend" run) and confirm it will do what is intended.

snakemake -npr
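
Once the dry run looks right, launch the workflow for real. The core count below is only an example; adjust it to your machine, or use your cluster submission setup:

snakemake -j 16 -p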

Data

Main directory content:

.
├── envs
├── genomes_and_annotations
├── metadata
├── output
├── rules
├── scripts
├── utils
├── config.yml
├── Snakefile
├── snpEff.config

NOTE: the output directory and some files in the metadata directory are or will be generated by the workflow.

You will need to configure the workflow for your specific project. For details, see the technical documentation. The following files and directories will need editing (a quick sanity check after editing is suggested below the list):

  • Snakefile

  • genomes_and_annotations/

  • metadata/

  • config.yml

  • snpEff.config
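
Once you have edited these, it is worth re-running the checking scripts from utils/ introduced above to confirm that the configuration and metadata still parse as expected:

python utils/check_config.py
python utils/check_metadata.py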

You can download example data for testing the workflow.

----

Clone the repository

Now **clone the forked repository** to your machine. Go to your GitHub account, open the forked repository, click the *Clone* button and then the *copy to clipboard* icon. The URL will look like `https://github.com/your-username/dna-proto-workflow.git`, where `your-username` is your GitHub username.

Open a terminal and run the following git command:

git clone https://github.com/your-username/dna-proto-workflow.git

Once you've cloned your fork, you can edit your local copy. However, if you want to contribute, you'll need to create a new branch.

Create a branch

Change to the repository directory on your computer (if you are not already there):

NOTE: Don't change the name of this directory!

cd dna-proto-workflow

You can check your branches and the active branch using the git branch command:

git branch -a

Now create a branch using the git checkout command:

git checkout -b new-branch-name

For example:

git checkout -b development

From this point, you are in the new branch, and edits only affect that branch. If things go wrong, simply remove your branch (after switching back to another branch) using

git branch -d name-of-the-branch

Or switch back to the master branch using

git checkout master

Make changes and commit

Once you've modified something, you can confirm that there are changes with git status (called from the top-level directory). Add those changes to your branch with git add:

git status
git add .
or
git add name_of_the_file_you_modified

Commit those changes with git commit:

git commit -m "write a message"
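
If you want your changes to end up on GitHub as well (for example, to open a pull request against the original repository), push the branch to your fork; the branch name is whatever you chose above:

git push origin development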

Code Snippets

The following snippets are excerpts from the workflow's rule files and helper scripts.

shell:
    "python3 scripts/tidybamstat.py"
    "   -o output/alnstats/everything"  # prefix
    "   {input}"
    "   >'{log}' 2>&1"
shell:
    "( bwa mem"
    "   -p" # paired input
    "   -t {threads}"
    "   -R '@RG\\tID:{wildcards.run}_{wildcards.lib}\\tSM:{params.sample}'"
    "   {input.ref}"
    "   {input.reads}"
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
shell:
    "( java"
    "   -{params.mem}"
    "   -jar {params.abra_release}"
    "   --in {input.set}"
    "   --out {output}"
    "   --ref {params.ref}"
    "   --threads {params.threads}"
    "   --targets {params.region}"
    "   --tmpdir {params.abra_temp}"
    ") >'{log}' 2>&1"
shell:
    "( samtools fixmate "
    "   -m"
    "   -@ {threads}"
    "   --output-fmt bam,level=0"
    "   {input.bam}"
    "   -"
    " | samtools sort"
    "   -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_sort_$RANDOM"
    "   --output-fmt bam,level=0"
    "   -@ {threads}"
    "   -m 1g"
    "   -"
    " | samtools markdup"
    "   -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_markdup_$RANDOM"
    "   -s" # report stats
    "   -@ {threads}"
    "   --output-fmt bam,level=3"
    "   -"
    "   {output.bam}"
    " ) >'{log}' 2>&1"
shell:
    "( samtools merge"
    "   -@ {threads}"
    "   --output-fmt bam,level=4"
    "   {output.bam}"
    "   {input}"
    " ) >'{log}' 2>&1"
shell:
    "( unset DISPLAY; qualimap bamqc"
    "   --java-mem-size=4G"
    "   -bam {input.bam}"
    "   -nr 10000"
    "   -nt {threads}"
    "   -outdir {output}"
    " ) >'{log}' 2>&1"
run:
    with open(output[0], "w") as fh:
        for s in input:
            print(s, file=fh)
shell:
    "( samtools merge"
    "   --output-fmt bam,level=7"
    "   -@ {threads}"
    "   -"
    "   {input}"
    " | tee {output.bam}"
    " | samtools index - {output.bai}"  # indexing takes bloody ages, we may as well do this on the fly
    " ) >'{log}' 2>&1"
shell:
    "(samtools stats -i 5000 -x {input} >{output}) >'{log}'"
shell:
    "samtools index {input}"
shell:
    "samtools view -Suh {input} >{output}"
shell:
    "( ngm"
    "   -q {input.reads}"
    "   --paired --broken-pairs"
    "   -r {input.ref}"
    "   -t {threads}"
    "   --rg-id {wildcards.run}_{wildcards.lib}"
    "   --rg-sm {params.sample}"
    "   --sensitivity {params.sensitivity}" # this is the mean from a bunch of different runs
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
shell:
    "( snpEff ann"
    " -filterInterval {input.bed}"
    " -csvStats {output.csvstats}"
    " -htmlStats {output.htmlStats}"
    " {params.extra}"
    " {params.database}"
    " {input.vcf}"
    " > {output.vcf}"
    " ) >'{log}' 2>&1"
script:
    "../scripts/pca_mash.R"
script:
    "../scripts/pca_kwip.R"
shell:
    " mash sketch"
    "   -k {wildcards.ksize}"
    "   -s {wildcards.sketchsize}"
    "   -p {threads}"
    "   -o {output}"
    "   {input}"
    " >{log} 2>&1"
shell:
    "mash dist"
    "   -p {threads}"
    "   -t" # tabular format
    "   {input} {input}" # needs input twice
    " >{output}"
    " 2>{log}"
shell:
    "load-into-counting.py"
    "   -N 1"
    "   -x {wildcards.sketchsize}"
    "   -k {wildcards.ksize}"
    "   -b"
    "   -f"
    "   -s tsv"
    "   -T {threads}"
    "   {output.ct}"
    "   {input}"
    " >{log} 2>&1"
shell:
    "kwip"
    " -d {output.d}"
    " -k {output.k}"
    " -t {threads}"
    " {input}"
    " >{log} 2>&1"
shell:
    "( kdm-unique-kmers.py"
    "    -t {threads}"
    "    -k {params.kmersize}"
    "    {input}"
    "    >{output}"
    " ) 2>{log}"
shell:
    "( sourmash compute"
    "   --name '{wildcards.sample}'"
    "   -k {wildcards.ksize}"
    "   -n {wildcards.sketchsize}"
    "   -o {output}"
    "   {input}"
    ") >{log} 2>&1"
shell:
    "(sourmash compare -k {wildcards.ksize} -o {output} {input} ) >{log} 2>&1"
shell:
    "( AdapterRemoval"
    "   --file1 {input.r1}"
    "   --file2 {input.r2}"
    "   --adapter1 {params.adp1}"
    "   --adapter2 {params.adp2}"
    "   --combined-output"
    "   --interleaved-output"
    "   --trimns"
    "   --trimqualities"
    "   --trimwindows 10"
    "   --minquality {params.minqual}"
    "   --threads 2"
    "   --settings {log.settings}"
    "   --output1 /dev/stdout"
    " | seqhax pairs"
    "   -l 20"
    "   -b >(pigz -p 5 >{output.reads})"
    "   /dev/stdin"
    ") >'{log.log}' 2>&1"
shell:
    "( AdapterRemoval"
    "   --file1 {input}"
    "   --adapter1 {params.adp1}"
    "   --adapter2 {params.adp2}"
    "   {params.extra}"
    "   --minquality {params.minqual}"
    "   --threads 2"
    "   --settings {log.settings}"
    "   --output1 /dev/stdout"
    " | seqhax pairs"
    "   -l 20"
    "   -b >(pigz -p 5 >{output.reads})"
    "   /dev/stdin"
    ") >'{log.log}' 2>&1"
shell:
    "cat {input} > {output}"
shell:
    "( seqhax stats"
    "    -t {threads}"
    "    {input}"
    "    >{output}"
    " ) 2>'{log}'"
shell:
    "( seqhax stats"
    "    -t {threads}"
    "    {input}"
    "    >{output}"
    " ) 2>''{log}'"
script:
    "../scripts/plot-quals.py"
shell:
    "(  freebayes"
    "   --theta {params.theta}"
    "   --samples {input.sset}"
    "   --ploidy 2"
    "   --use-best-n-alleles 3"
    "   --min-mapping-quality {params.minmq}"
    "   --min-base-quality {params.minbq}"
    "   --read-max-mismatch-fraction 0.1"
    "   --min-alternate-fraction 0"
    "   --min-alternate-count 2" # per sample
    "   --min-alternate-total 5" # across all samples
    "   --min-coverage 10" # across all samples
    "   --prob-contamination 1e-6"
    "   --use-mapping-quality"
    "   --strict-vcf"
    "   --region '{wildcards.region}'"
    "   --fasta-reference {input.ref}"
    "   {input.bam}"
    " | bcftools view"
    "   -O b  -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
shell:
    "( bcftools mpileup"
    "   --adjust-MQ 50"
    "   --redo-BAQ"
    "   --max-depth 20000" # the default per file max (250x) is insane, i.e. <1x for most sets. new limit of 20000x  equates to a max. of 20x across all samples.
    "   --min-MQ {params.minmq}"
    "   --min-BQ {params.minbq}"
    "   --fasta-ref {input.ref}"
    "   --samples-file {input.sset}"
    "   --annotate FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD" #output extra tags
    "   --region '{wildcards.region}'"
    "   --output-type u" #uncompressed
    "   {input.bam}"
    " | bcftools call"
    "   --targets '{wildcards.region}'" # might not be needed
    "   --multiallelic-caller"
    "   --prior {params.theta}"
    "   -O b"
    "   -o {output.bcf}"
    " ) >'{log}' 2>&1"
shell:
    "( bcftools view"
    "   {params.filtarg}"
    "   '{input.bcf}'"
    "   -O b  -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
run:
    with open(output[0], "w") as fh:
        for s in sorted(input):
            print(s, file=fh)
shell:
    "( bcftools concat"
    "   --threads {threads}"
    "   -O b"
    "   -o {output.bcf}"
    "   --file-list {input.fofn}"
    " ) >'{log}' 2>&1"
shell:
    "( bcftools view"
    "   {input.bcf}"
    "   -O z"
    "   --threads {threads}"
    "   -o {output.vcf}"
    " ) >'{log}' 2>&1"
shell:
    "( bcftools view"
    "   {input.bcf}"
    "   -O v"
    "   --threads {threads}"
    "   -o {output.vcf}"
    " ) >'{log}' 2>&1"
shell:
    "bcftools index -c -f {input} && bcftools index -t -f {input}"
shell:
    "bcftools stats -s - -d 0,1000,2 --threads {threads} {input} >{output}"
library(rgl)

print (snakemake@input[[1]])
print (snakemake@output[[1]])

y0<-read.delim(snakemake@input[[1]], header=T)
data<-as.matrix(y0[, -1])
labels<-colnames(data)
pca <- prcomp(data, scale=F)
print ('PCA done!')
plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]],"pdf")
library(rgl)

print (snakemake@input[[1]])
print (snakemake@output[[1]])

y0<-read.delim(snakemake@input[[1]], header=T)
data<-as.matrix(y0[, -1])
labels<-as.data.frame(do.call(rbind, strsplit(colnames(data), "[.]")))$V4
pca <- prcomp(data, scale=F)
print ('PCA done!')
plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]],"pdf")
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the plot can be written on headless nodes
import matplotlib.pyplot as plt
from pysam import VariantFile
import numpy as np

print ("$ Script: plot-quals")
print ("INPUT: ", snakemake.input)
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
from sys import stderr, stdout, stdin, argv
from os.path import basename, splitext
import re
import json
import csv
import argparse

try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda x: x


def doSN(line, data):
    if "SN" not in data:
        data["SN"] = {}
    row = line.rstrip('\n').split('\t')
    field = re.sub(r'\(|\)|:|%', '', row[1])
    field = re.sub(r'\s+$|^\s+', '', field)
    field = re.sub(r'\s+', '_', field)
    val = row[2]
    data["SN"][field] = val


# TODO: not handling this right now
# def doTableByCycle(line, data, key):
#     cols = {
#         "FFQ": ["cycle_number", "quality"],
#         "LFQ": ["cycle_number", "quality"],
#     }


TABLE_COLS = {
    "IS": ["insert_size", "pairs", "pairs_in", "pairs_out", "pairs_other"],
    "ID": ["indel_size", "n_insertion", "n_deletion"],
    "GCF": ["gc_content", "n_reads"],
    "GCL": ["gc_content", "n_reads"],
    "GCC": ["cycle_number", "a_pct", "c_pct", "g_pct", "t_pct", "n_pct", None],
    "RL": ["read_length", "n_reads"],
    "COV": ["coverage_bin", None, "genome_bp"],
}

def doTable(line, data):
    key = line.strip().split('\t')[0]
    if key not in data:
        data[key] = []
    vals = line.strip().split("\t")
    valdict = {}
    for i, col in enumerate(TABLE_COLS[key]):
        if col is None:
            continue
        valdict[col] = vals[i+1]
    data[key].append(valdict)


def dump(samples, key, outfile):
    with open(outfile, "w") as fh:
        csvfh = None
        for sample, alldata in samples.items():
            if key not in alldata:
                continue
            data = alldata[key]
            if isinstance(data, dict):
                # for some where rows are split over multiple lines, e.g. SN
                data = [data, ]
            if csvfh is None:
                header = ["sample", ] + list(data[0].keys())
                csvfh = csv.DictWriter(fh, header)
                csvfh.writeheader()
            for row in data:
                row["sample"] = sample
                csvfh.writerow(row)



def main():
    p = argparse.ArgumentParser(prog="tidybamstat")
    p.add_argument("-o", "--outprefix", type=str, required=True,
                   help="Output prefix")
    p.add_argument("input", type=str, nargs='+')
    args = p.parse_args()

    samples = {}
    print("Load alignment stats:", file=stderr)
    for file in tqdm(args.input):
        data = dict()
        with open(file) as fh:
            for line in fh:
                dtype = line.strip().split("\t")[0]
                if line.startswith("#"):
                    continue
                elif line.startswith("SN"):
                    doSN(line, data)
                elif dtype in TABLE_COLS:
                    doTable(line, data)
        sample = splitext(basename(file))[0]
        samples[sample] = data

    allkeys = ["SN"] + list(TABLE_COLS.keys())
    print("Dumping data by data type: ", file=stderr, end='', flush=True)
    for key in allkeys:
        print(f"{key}, ", file=stderr, end='',  flush=True)
        outfile = f"{args.outprefix}_{key}.csv"
        dump(samples, key, outfile)
    print(" Done!", file=stderr)

if __name__  == "__main__":
    main()