aMeta: Ancient DNA Shotgun Metagenomics Analysis Workflow

public public 1yr ago Version: 1.0.0 0 bookmarks

aMeta is a Snakemake workflow for identifying microbial sequences in ancient DNA shotgun metagenomics samples. The workflow performs:

  • trimming adapter sequences and removing reads shorter than 30 bp with Cutadapt

  • quaity control before and after trimming with FastQC and MultiQC

  • taxonomic sequence kmer-based classification with KrakenUniq

  • sequence alignment with Bowtie2 and screening for common microbial pathogens

  • deamination pattern analysis with MapDamage2

  • Lowest Common Ancestor (LCA) sequence alignment with Malt

  • authentication and validation of identified microbial species with MaltExtract

When using aMeta and / or pre-built databases provided together with the wokflow for your research projects, please cite our preprint: https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1

Installation

Clone the github repository, then create and activate aMeta conda environment (here and below cd aMeta implies navigating to the cloned root aMeta directory). For this purpose, we recommend installing own conda , for example from here https://docs.conda.io/en/latest/miniconda.html, and mamba https://mamba.readthedocs.io/en/latest/installation.html:

git clone https://github.com/NBISweden/aMeta
cd aMeta
mamba env create -f workflow/envs/environment.yaml
# alternatively: conda env create -f workflow/envs/environment.yaml, however it takes longer
conda activate aMeta

Run a test to make sure that the workflow was installed correctly:

cd .test
./runtest.sh -j 4

Here, and below, by -j you can specify the number of threads that the workflow can use. Please make sure that the installation and test run accomplished successfully before proceeding with running aMeta on your data. Potential problems with installation and test run often come from unstable internet connection and particular conda settings used e.g. at computer clusters, therefore we advise you to use your own freshly installed conda .

Quick start

To run the worflow you need to prepare a tab-delimited sample-file config/samples.tsv with at least two columns, and a configuration file config/config.yaml , below we provide examples for both files.

Here is an example of samples.tsv , this implies that the fastq-files are located in aMeta/data folder:

sample	fastq
foo	data/foo.fq.gz
bar	data/bar.fq.gz

Currently, it is important that the sample names in the first column exactly match the names of the fastq-files in the second column. For example, a fastq-file "data/foo.fq.gz" specified in the "fastq" column, must have a name "foo" in the "sample" column. Please make sure that the names in the first and second columns match.

Main results of the workflow and their interpretation

All output files of the workflow are located in aMeta/results directory. To get a quick overview of ancient microbes present in your samples you should check a heatmap in results/overview_heatmap_scores.pdf .

Overview

The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns). The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of eight quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10, the higher is the score the more likely that a microbe is present in a sample and is ancient. Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe you think about. The former is a more common case scenario. The latter often happens when an ancient microbe is correctly detected on a genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference which leads to a lot of mismatches or poor evennes of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evedence (very few reads) to claim presence of a microbe in a sample.

 

Code Snippets

23
24
shell:
    "bowtie2-build-l --threads {threads} {input.ref} {input.ref} > {log} 2>&1"
47
48
49
50
51
52
53
54
55
56
57
shell:
    """bowtie2 --large-index -x {params.BOWTIE2_DB} --end-to-end --threads {threads} --very-sensitive -U {input.fastq} > $(dirname {output.bam})/AlignedToBowtie2DB.sam 2> {log};"""
    """grep @ $(dirname {output.bam})/AlignedToBowtie2DB.sam | awk '!seen[$2]++' > $(dirname {output.bam})/header_nodups.txt;"""
    """grep -v '^@' $(dirname {output.bam})/AlignedToBowtie2DB.sam > $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;"""
    """cat $(dirname {output.bam})/header_nodups.txt $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam > $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;"""
    """samtools view -bS -q 1 -h -@ {threads} $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam | samtools sort - -@ {threads} -o {output.bam};"""
    """samtools index {output.bam};"""
    """rm $(dirname {output.bam})/header_nodups.txt;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.sam;"""
19
20
21
22
shell:
    "mkdir -p {params.dir}; "
    "while read taxid; do mkdir -p {params.dir}/$taxid; touch {params.dir}/$taxid/.done; done<{input.species};"
    "touch {output.done}"
48
49
shell:
    "touch {output}; "
63
64
shell:
    "awk -v var={wildcards.taxid} '{{ if($1==var) print $0 }}' {params.tax_db}/taxDB | cut -f3 > {output.node_list}"
94
95
shell:
    "time MaltExtract -Xmx32G -i {input.rma6} -f def_anc -o {params.extract} -r {params.ncbi_db} --reads --threads {threads} --matches --minPI 85.0 --maxReadLength 0 --minComp 0.0 --meganSummary -t {input.node_list} -v 2> {log}"
114
115
shell:
    "postprocessing.AMPS.r -m def_anc -r {params.extract} -t {threads} -n {input.node_list} || echo 'postprocessing failed for {wildcards.sample}_{wildcards.taxid}' > {output.analysis}  2> {log}"
134
135
shell:
    "samtools faidx {input.fna} 2> {log}"
159
160
161
162
163
164
165
166
167
shell:
    "echo {params.ref_id} > {output.name_list}; "
    "zgrep {params.ref_id} {input.sam} | uniq > {output.sam}; "
    "samtools view -bS {output.sam} > results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam; "
    "samtools sort results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam > {output.sorted_bam}; "
    "samtools index {output.sorted_bam}; "
    "samtools depth -a {output.sorted_bam} > {output.breadth_of_coverage}; "
    "grep -w -f {output.name_list} {input.malt_fasta_fai} | awk '{{printf(\"%s:1-%s\\n\", $1, $2)}}' > {output.name_list}.regions; "
    "samtools faidx {input.malt_fasta} -r {output.name_list}.regions -o results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.fasta"
184
185
shell:
    "samtools view {input.bam} | awk '{{ print length($10) }}' > {output.distribution}"
202
203
shell:
    "(samtools view -h {input.bam} || true) | pmdtools --printDS > {output.scores}"
226
227
shell:
    "Rscript {params.exe} {wildcards.taxid} {wildcards.sample}.trimmed.rma6 {input.dir}/"
245
246
247
248
shell:
    "(samtools view {input.bam} || true) | pmdtools --platypus --number 2000000 > {output.tmp}; "
    "cd results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}; "
    "R CMD BATCH $(which plotPMD); "
269
270
shell:
    "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};"
23
24
25
26
27
28
29
30
31
32
shell:
    "mkdir {output.dir}; "
    "if [ -s {input.species_tax_id} ]; then "
    'cat {input.species_tax_id} | parallel -j {threads} "grep -w {{}} {params.bowtie2_seqid2taxid} | cut -f1 > {output.dir}/{{}}.seq_ids" ; '
    "for i in $(cat {input.species_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; "
    "find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.BOWTIE2_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; "
    "for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; "
    "mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; "
    "rm -r {output.dir}/mapDamage_results >> {log} 2>&1; "
    "else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi"
21
22
shell:
    "krakenuniq --preload --db {params.DB} --fastq-input {input.fastq} --threads {threads} --output {output.seqs} --report-file {output.report} --gzip-compressed --only-classified-out &> {log}"
49
50
51
52
shell:
    """{params.exe} {input.krakenuniq} {params.n_unique_kmers} {params.n_tax_reads} {input.pathogenomesFound} &> {log}; """
    """cut -f7 {output.pathogens} | tail -n +2 > {output.pathogen_tax_id};"""
    """cut -f7 {output.filtered} | tail -n +2 > {output.species_tax_id}"""
78
79
80
81
shell:
    "{params.exe} {input.report} {input.seqs} &> {log}; "
    "cat {output.seqs} | cut -f 2,3 > {output.krona}; "
    "ktImportTaxonomy {output.krona} -o {output.html} {params.DB} &>> {log}"
109
110
111
shell:
    "Rscript {params.exe} results/KRAKENUNIQ {output.out_dir} {params.n_unique_kmers} {params.n_tax_reads} &> {log};"
    "Rscript {params.exe_plot} {output.out_dir} {output.out_dir} &> {log}"
25
26
script:
    "../scripts/malt-build.py"
SnakeMake From line 25 of rules/malt.smk
49
50
shell:
    "unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} -sup 1 -mq 100 -top 1 -mpi 85.0 -id 85.0 -v &> {log}"
69
70
71
shell:
    "mkdir -p {output.out_dir}; "
    "{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}"
SnakeMake From line 69 of rules/malt.smk
95
96
shell:
    "Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}"
SnakeMake From line 95 of rules/malt.smk
118
119
120
121
shell:
    "{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; "
    "mv results/MALT/count_table.tsv {output.out_dir}; "
    "mv {output.out_dir}/count_table.tsv {output.abundance_matrix}"
SnakeMake From line 118 of rules/malt.smk
141
142
143
shell:
    "mv {input.tre} {output.tre};"
    "mv {input.map} {output.map};"
SnakeMake From line 141 of rules/malt.smk
19
20
shell:
    "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> {log}"
41
42
shell:
    "cutadapt {params.adapters} --minimum-length 31 -o {output.fastq} {input.fastq} &> {log}"
62
63
shell:
    "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> {log}"
85
86
87
shell:
    'echo {input} | tr " " "\n" > {output.html}.fof;'
    "multiqc -c {params.config} -l {output.html}.fof --verbose --force --outdir results/MULTIQC &> {log}"
17
18
shell:
    "Rscript {params.exe} results/AUTHENTICATION $(dirname {output.heatmap}) &> {log}"
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
__author__ = "Per Unneberg"
__copyright__ = "Copyright 2022, Per Unneberg"
__email__ = "[email protected]"
__license__ = "MIT"

import re
from snakemake.shell import shell
import subprocess as sp

out = sp.run(["malt-build", "--help"], capture_output=True)
regex = re.compile("version (?P<major>\d+)\.(?P<minor>\d+)")
m = regex.search(out.stderr.decode())
if m is None:
    # Assume minor version 4
    minor = 4
else:
    minor = int(m.groupdict()["minor"])

a2t_option = "-a2taxonomy" if minor <= 4 else "-a2t"

log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True)
options = snakemake.params.get("extra", "")
unique_taxids = snakemake.input.unique_taxids

seqid2taxid = snakemake.params.seqid2taxid
nt_fasta = snakemake.params.nt_fasta
accession2taxid = snakemake.params.accession2taxid

output = snakemake

shell(
    "grep -wFf {unique_taxids} {seqid2taxid} > {snakemake.output.seqid2taxid_project}; "
    "cut -f1 {snakemake.output.seqid2taxid_project} > {snakemake.output.seqids_project}; "
    "grep -Ff {snakemake.output.seqids_project} {nt_fasta} | sed 's/>//g' > {snakemake.output.project_headers}; "
    "seqtk subseq {nt_fasta} {snakemake.output.project_headers} > {snakemake.output.project_fasta} {log}; "
    "unset DISPLAY; "
    "malt-build -i {snakemake.output.project_fasta} {a2t_option} {accession2taxid} -s DNA -t {snakemake.threads} -d {snakemake.output.db} {log}"
)
ShowHide 20 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/NBISweden/ancient-microbiome-smk
Name: ancient-microbiome-smk
Version: 1.0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...