Amplicon Sequencing Snakemake workflow


AmpSeeker is a Snakemake workflow for amplicon sequencing data analysis. The pipeline is a work in progress; it currently implements:

  • BCL to Fastq conversion
  • Genome alignment
  • Variant calling
  • Quality control
  • Coverage
  • Visualisation of reads in IGV
  • VCF to DataFrame/.xlsx
  • Allele frequency calculation
  • Principal component analysis
  • Geographic sample maps

The workflow uses Papermill and Jupyter Book together, so that users can conveniently explore the results in a local web page.
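In outline, each analysis rule executes a parameterised notebook with Papermill, and Jupyter Book then assembles the executed notebooks into a static site. A minimal sketch of that pattern run by hand, with illustrative paths (the workflow's actual notebook layout may differ):

    papermill notebooks/example.ipynb results/notebooks/example.ipynb \
        -k AmpSeq_python -p metadata_path config/metadata.tsv
    jupyter-book build --all docs/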

Usage

Please see the wiki for more information on running the workflow. If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository (https://github.com/sanjaynagi/AmpSeeker) and, if available, its DOI.
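A typical invocation, assuming a standard Snakemake deployment with conda-managed environments (the exact flags and any profile to use are covered in the wiki):

    snakemake --cores 4 --use-conda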

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.
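To run the bundled test case locally, something along these lines should work, though the exact command used by the CI workflow may differ:

    snakemake --cores 2 --use-conda --directory .test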

Code Snippets

The snippets below are shell commands and wrapper scripts excerpted from the workflow's Snakemake rules. Placeholders in braces ({input}, {output}, {params}, {threads}, {log}) are filled in by Snakemake at run time.

shell:
    """
    samtools faidx {input.ref} 2> {log}
    """
shell:
    """
    bwa index {input.ref} 2> {log}
    """
shell:
    """
    bwa mem -t {threads} -R {params.tag} {input.ref} {input.reads} 2> {log.align} |
    samtools sort -@{threads} -o {output} 2> {log.sort}
    """
shell:
    "samtools index {input} {output} 2> {log}"
shell:
    """
    bcftools mpileup -Ov -f {params.ref} -R {params.regions} --max-depth {params.depth} {input.bam} 2> {log.mpileup} |
    bcftools call -m -Ov 2> {log.call} | bcftools sort -Ov -o {output.calls} 2>> {log.call}
    """
shell:
    """
    bcftools mpileup -Ov -f {params.ref} --max-depth {params.depth} {input.bam} 2> {log.mpileup} |
    bcftools call -m -Ov 2> {log.call} | bcftools sort -Ov -o {output.calls} 2>> {log.call}
    """
shell:
    "snpEff download {params.ref} -dataDir {params.dir} 2> {log}"
shell:
    """
    snpEff eff {params.db} -dataDir {params.dir} -csvStats {output.csvStats} {input.calls} > {output.calls} 2> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p genome_name {params.reference_name} -p reference_fasta {input.genome} -p reference_gff3 {input.gff3} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p dataset {params.dataset} -p vcf_path {input.vcf} -p colour_column {params.column} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p dataset {params.dataset} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p dataset {params.dataset} -p bed_path {input.bed} -p cohort_column {params.cohort_column} -p vcf_path {input.vcf} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p dataset {params.dataset} -p wkdir {params.wkdir} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    "bcl-convert --bcl-input-directory {input.illumina_in_dir} --output-directory {output.reads_dir} --sample-sheet {input.sample_csv} --force 2> {log}"
shell:
    """
    # iterate over bcl-convert's fastq_list.csv (columns: RGID,RGSM,RGLB,Lane,Read1File,Read2File)
    while IFS="," read -r _ sample_id _ _ read1 read2; do

        # skip the header row
        if [[ $sample_id == "RGSM" ]]; then
           continue
        fi

        echo renaming $read1 and $read2 to ${{sample_id}}_1.fastq.gz and ${{sample_id}}_2.fastq.gz
        mv $read1 resources/reads/${{sample_id}}_1.fastq.gz
        mv $read2 resources/reads/${{sample_id}}_2.fastq.gz
    done < {input.fastq_list}
    """
shell:
    """
    # register the AmpSeq_python Jupyter kernel used by the papermill rules
    python -m ipykernel install --user --name=AmpSeq_python 2> {log}
    """
shell:
    """
    jupyter-book build --all {input.pages} --path-output results/ampseeker-results &&
    ln -sf results/ampseeker-results/_build/html/index.html AmpSeeker-results.html
    """
shell:
    """
    papermill -k AmpSeq_python {input.input_nb} {output.out_nb} -p wkdir {params.wkdir} 2> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p index_read_qc {params.index_qc} -p wkdir {params.wd} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k AmpSeq_python -p metadata_path {input.metadata} -p bed_targets_path {input.targets} -p wkdir {params.wkdir} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    zcat resources/reads/*I1*.fastq.gz | fastqc stdin --outdir results/qc/index-read-qc/ 2>> {log}
    mv results/qc/index-read-qc/stdin_fastqc.html results/qc/index-read-qc/I1.html 2>> {log}
    mv results/qc/index-read-qc/stdin_fastqc.zip results/qc/index-read-qc/I1.zip 2>> {log}

    zcat resources/reads/*I2*.fastq.gz | fastqc stdin --outdir results/qc/index-read-qc/ 2>> {log}
    mv results/qc/index-read-qc/stdin_fastqc.html results/qc/index-read-qc/I2.html 2>> {log}
    mv results/qc/index-read-qc/stdin_fastqc.zip results/qc/index-read-qc/I2.zip 2>> {log}
    """
wrapper:
    "v1.25.0/bio/fastp"
shell:
    """
    # --fast-mode skips internal CIGAR parsing and mate-overlap correction for speed
    mosdepth {params.prefix} {input.bam} --fast-mode --threads {threads} 2> {log}
    """
shell:
    """
    samtools flagstat {input.bam} > {output} 2> {log}
    """
shell:
    """
    qualimap bamqc -bam {input.bam} -outdir {output.folder} 2> {log}
    """
shell:
    """
    bcftools stats {input.vcf} > {output.stats} 2> {log}
    """
wrapper:
    "v2.2.1/bio/multiqc"
wrapper:
    "v1.17.4-17-g62b55d45/bio/bgzip"
shell:
    """
    tabix {input.calls} 2> {log}
    """
shell:
    """
    # --force-samples tolerates duplicate sample names across inputs by renaming them
    bcftools merge --threads {threads} -o {output.vcf} -O v {input.vcfs} --force-samples 2> {log}
    """
shell:
    """
    bcftools merge --threads {threads} -o {output.vcf} -O v {input.vcfs} 2> {log}
    """
wrapper:
    "v1.17.4-17-g62b55d45/bio/bgzip"
shell:
    """
    tabix {input.vcfgz} 2> {log}
    """
shell:
    """
    bcftools merge -o {output.vcf} -Ov {input.vcf} 2> {log}
    """
The bio/bgzip wrapper source:
__author__ = "William Rowell"
__copyright__ = "Copyright 2020, William Rowell"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    """
    (bgzip -c {extra} --threads {snakemake.threads} \
        {snakemake.input} > {snakemake.output}) {log}
    """
)
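For a rule with a single input and output, the wrapper above reduces to an invocation of this shape (paths and thread count illustrative):

    (bgzip -c --threads 4 input.vcf > output.vcf.gz) > log.txt 2>&1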
The bio/fastp wrapper source:
__author__ = "Sebastian Kurscheid"
__copyright__ = "Copyright 2019, Sebastian Kurscheid"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
import re

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


# Assert input
n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."


# Input files
if n == 1:
    reads = "--in1 {}".format(snakemake.input.sample)
else:
    reads = "--in1 {} --in2 {}".format(*snakemake.input.sample)


# Output files
trimmed_paths = snakemake.output.get("trimmed", None)
if trimmed_paths:
    if n == 1:
        trimmed = "--out1 {}".format(snakemake.output.trimmed)
    else:
        trimmed = "--out1 {} --out2 {}".format(*snakemake.output.trimmed)

        # Output unpaired files
        unpaired = snakemake.output.get("unpaired", None)
        if unpaired:
            trimmed += f" --unpaired1 {unpaired} --unpaired2 {unpaired}"
        else:
            unpaired1 = snakemake.output.get("unpaired1", None)
            if unpaired1:
                trimmed += f" --unpaired1 {unpaired1}"
            unpaired2 = snakemake.output.get("unpaired2", None)
            if unpaired2:
                trimmed += f" --unpaired2 {unpaired2}"

        # Output merged PE reads
        merged = snakemake.output.get("merged", None)
        if merged:
            if not re.search(r"--merge\b", extra):
                raise ValueError(
                    "output.merged specified but '--merge' option missing from params.extra"
                )
            trimmed += f" --merged_out {merged}"
else:
    trimmed = ""


# Output failed reads
failed = snakemake.output.get("failed", None)
if failed:
    trimmed += f" --failed_out {failed}"


# Stats
html = "--html {}".format(snakemake.output.html)
json = "--json {}".format(snakemake.output.json)


shell(
    "(fastp --thread {snakemake.threads} "
    "{extra} "
    "{adapters} "
    "{reads} "
    "{trimmed} "
    "{json} "
    "{html} ) {log}"
)
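For a paired-end rule with trimmed, json, and html outputs, the wrapper above expands to an invocation along these lines (paths illustrative):

    fastp --thread 4 --in1 sample1_1.fastq.gz --in2 sample1_2.fastq.gz \
        --out1 sample1_trimmed_1.fastq.gz --out2 sample1_trimmed_2.fastq.gz \
        --json sample1.fastp.json --html sample1.fastp.html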
The bio/multiqc wrapper source:
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
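Given a set of per-sample QC outputs, the wrapper above reduces to an invocation of this shape (directory and report names illustrative):

    multiqc --force -o results/multiqc -n multiqc_report.html results/qc results/alignments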