aMeta: Ancient DNA Shotgun Metagenomics Analysis Workflow

public 1yr ago Version: 1.0.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

aMeta is a Snakemake workflow for identifying microbial sequences in ancient DNA shotgun metagenomics samples. The workflow performs:

trimming adapter sequences and removing reads shorter than 30 bp with Cutadapt
quaity control before and after trimming with FastQC and MultiQC
taxonomic sequence kmer-based classification with KrakenUniq
sequence alignment with Bowtie2 and screening for common microbial pathogens
deamination pattern analysis with MapDamage2
Lowest Common Ancestor (LCA) sequence alignment with Malt
authentication and validation of identified microbial species with MaltExtract

When using aMeta and / or pre-built databases provided together with the wokflow for your research projects, please cite our preprint: https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1

Installation

Clone the github repository, then create and activate aMeta conda environment (here and below cd aMeta implies navigating to the cloned root aMeta directory). For this purpose, we recommend installing own conda , for example from here https://docs.conda.io/en/latest/miniconda.html, and mamba https://mamba.readthedocs.io/en/latest/installation.html:

git clone https://github.com/NBISweden/aMeta
cd aMeta
mamba env create -f workflow/envs/environment.yaml
# alternatively: conda env create -f workflow/envs/environment.yaml, however it takes longer
conda activate aMeta

Run a test to make sure that the workflow was installed correctly:

cd .test
./runtest.sh -j 4

Here, and below, by -j you can specify the number of threads that the workflow can use. Please make sure that the installation and test run accomplished successfully before proceeding with running aMeta on your data. Potential problems with installation and test run often come from unstable internet connection and particular conda settings used e.g. at computer clusters, therefore we advise you to use your own freshly installed conda .

Quick start

To run the worflow you need to prepare a tab-delimited sample-file config/samples.tsv with at least two columns, and a configuration file config/config.yaml , below we provide examples for both files.

Here is an example of samples.tsv , this implies that the fastq-files are located in aMeta/data folder:

sample	fastq
foo	data/foo.fq.gz
bar	data/bar.fq.gz

Currently, it is important that the sample names in the first column exactly match the names of the fastq-files in the second column. For example, a fastq-file "data/foo.fq.gz" specified in the "fastq" column, must have a name "foo" in the "sample" column. Please make sure that the names in the first and second columns match.

Main results of the workflow and their interpretation

All output files of the workflow are located in aMeta/results directory. To get a quick overview of ancient microbes present in your samples you should check a heatmap in results/overview_heatmap_scores.pdf .

Overview

The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns). The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of eight quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10, the higher is the score the more likely that a microbe is present in a sample and is ancient. Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe you think about. The former is a more common case scenario. The latter often happens when an ancient microbe is correctly detected on a genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference which leads to a lot of mismatches or poor evennes of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evedence (very few reads) to claim presence of a microbe in a sample.

Code Snippets

shell:
    "bowtie2-build-l --threads {threads} {input.ref} {input.ref} > {log} 2>&1"

SnakeMake From line 23 of rules/align.smk

shell:
    """bowtie2 --large-index -x {params.BOWTIE2_DB} --end-to-end --threads {threads} --very-sensitive -U {input.fastq} > $(dirname {output.bam})/AlignedToBowtie2DB.sam 2> {log};"""
    """grep @ $(dirname {output.bam})/AlignedToBowtie2DB.sam | awk '!seen[$2]++' > $(dirname {output.bam})/header_nodups.txt;"""
    """grep -v '^@' $(dirname {output.bam})/AlignedToBowtie2DB.sam > $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;"""
    """cat $(dirname {output.bam})/header_nodups.txt $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam > $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;"""
    """samtools view -bS -q 1 -h -@ {threads} $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam | samtools sort - -@ {threads} -o {output.bam};"""
    """samtools index {output.bam};"""
    """rm $(dirname {output.bam})/header_nodups.txt;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.noheader.sam;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.nodups.sam;"""
    """rm $(dirname {output.bam})/AlignedToBowtie2DB.sam;"""

SnakeMake SAMtools Bowtie 2 From line 47 of rules/align.smk

shell:
    "mkdir -p {params.dir}; "
    "while read taxid; do mkdir -p {params.dir}/$taxid; touch {params.dir}/$taxid/.done; done<{input.species};"
    "touch {output.done}"

SnakeMake From line 19 of rules/authentic.smk

shell:
    "touch {output}; "

SnakeMake From line 48 of rules/authentic.smk

shell:
    "awk -v var={wildcards.taxid} '{{ if($1==var) print $0 }}' {params.tax_db}/taxDB | cut -f3 > {output.node_list}"

SnakeMake From line 63 of rules/authentic.smk

shell:
    "time MaltExtract -Xmx32G -i {input.rma6} -f def_anc -o {params.extract} -r {params.ncbi_db} --reads --threads {threads} --matches --minPI 85.0 --maxReadLength 0 --minComp 0.0 --meganSummary -t {input.node_list} -v 2> {log}"

SnakeMake From line 94 of rules/authentic.smk

shell:
    "postprocessing.AMPS.r -m def_anc -r {params.extract} -t {threads} -n {input.node_list} || echo 'postprocessing failed for {wildcards.sample}_{wildcards.taxid}' > {output.analysis}  2> {log}"

SnakeMake AMPS From line 114 of rules/authentic.smk

shell:
    "samtools faidx {input.fna} 2> {log}"

SnakeMake SAMtools From line 134 of rules/authentic.smk

shell:
    "echo {params.ref_id} > {output.name_list}; "
    "zgrep {params.ref_id} {input.sam} | uniq > {output.sam}; "
    "samtools view -bS {output.sam} > results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam; "
    "samtools sort results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.bam > {output.sorted_bam}; "
    "samtools index {output.sorted_bam}; "
    "samtools depth -a {output.sorted_bam} > {output.breadth_of_coverage}; "
    "grep -w -f {output.name_list} {input.malt_fasta_fai} | awk '{{printf(\"%s:1-%s\\n\", $1, $2)}}' > {output.name_list}.regions; "
    "samtools faidx {input.malt_fasta} -r {output.name_list}.regions -o results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}/{params.ref_id}.fasta"

SnakeMake SAMtools From line 159 of rules/authentic.smk

shell:
    "samtools view {input.bam} | awk '{{ print length($10) }}' > {output.distribution}"

SnakeMake SAMtools From line 184 of rules/authentic.smk

shell:
    "(samtools view -h {input.bam} || true) | pmdtools --printDS > {output.scores}"

SnakeMake pmdtools From line 202 of rules/authentic.smk

shell:
    "Rscript {params.exe} {wildcards.taxid} {wildcards.sample}.trimmed.rma6 {input.dir}/"

SnakeMake From line 226 of rules/authentic.smk

shell:
    "(samtools view {input.bam} || true) | pmdtools --platypus --number 2000000 > {output.tmp}; "
    "cd results/AUTHENTICATION/{wildcards.sample}/{wildcards.taxid}; "
    "R CMD BATCH $(which plotPMD); "

SnakeMake pmdtools From line 245 of rules/authentic.smk

shell:
    "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};"

SnakeMake From line 269 of rules/authentic.smk

shell:
    "mkdir {output.dir}; "
    "if [ -s {input.species_tax_id} ]; then "
    'cat {input.species_tax_id} | parallel -j {threads} "grep -w {{}} {params.bowtie2_seqid2taxid} | cut -f1 > {output.dir}/{{}}.seq_ids" ; '
    "for i in $(cat {input.species_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; "
    "find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.BOWTIE2_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; "
    "for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; "
    "mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; "
    "rm -r {output.dir}/mapDamage_results >> {log} 2>&1; "
    "else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi"

SnakeMake SAMtools MapDamage From line 23 of rules/damage.smk

shell:
    "krakenuniq --preload --db {params.DB} --fastq-input {input.fastq} --threads {threads} --output {output.seqs} --report-file {output.report} --gzip-compressed --only-classified-out &> {log}"

SnakeMake KrakenUniq From line 21 of rules/krakenuniq.smk

shell:
    """{params.exe} {input.krakenuniq} {params.n_unique_kmers} {params.n_tax_reads} {input.pathogenomesFound} &> {log}; """
    """cut -f7 {output.pathogens} | tail -n +2 > {output.pathogen_tax_id};"""
    """cut -f7 {output.filtered} | tail -n +2 > {output.species_tax_id}"""

SnakeMake From line 49 of rules/krakenuniq.smk

shell:
    "{params.exe} {input.report} {input.seqs} &> {log}; "
    "cat {output.seqs} | cut -f 2,3 > {output.krona}; "
    "ktImportTaxonomy {output.krona} -o {output.html} {params.DB} &>> {log}"

SnakeMake Krona From line 78 of rules/krakenuniq.smk

shell:
    "Rscript {params.exe} results/KRAKENUNIQ {output.out_dir} {params.n_unique_kmers} {params.n_tax_reads} &> {log};"
    "Rscript {params.exe_plot} {output.out_dir} {output.out_dir} &> {log}"

SnakeMake KrakenUniq From line 109 of rules/krakenuniq.smk

script:
    "../scripts/malt-build.py"

SnakeMake From line 25 of rules/malt.smk

shell:
    "unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} -sup 1 -mq 100 -top 1 -mpi 85.0 -id 85.0 -v &> {log}"

SnakeMake Malt From line 49 of rules/malt.smk

shell:
    "mkdir -p {output.out_dir}; "
    "{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}"

SnakeMake From line 69 of rules/malt.smk

shell:
    "Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}"

SnakeMake From line 95 of rules/malt.smk

shell:
    "{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; "
    "mv results/MALT/count_table.tsv {output.out_dir}; "
    "mv {output.out_dir}/count_table.tsv {output.abundance_matrix}"

SnakeMake From line 118 of rules/malt.smk

shell:
    "mv {input.tre} {output.tre};"
    "mv {input.map} {output.map};"

SnakeMake From line 141 of rules/malt.smk

shell:
    "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> {log}"

SnakeMake FastQC From line 19 of rules/qc.smk

shell:
    "cutadapt {params.adapters} --minimum-length 31 -o {output.fastq} {input.fastq} &> {log}"

SnakeMake Cutadapt From line 41 of rules/qc.smk

shell:
    "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> {log}"

SnakeMake FastQC From line 62 of rules/qc.smk

shell:
    'echo {input} | tr " " "\n" > {output.html}.fof;'
    "multiqc -c {params.config} -l {output.html}.fof --verbose --force --outdir results/MULTIQC &> {log}"

SnakeMake MultiQC From line 85 of rules/qc.smk

shell:
    "Rscript {params.exe} results/AUTHENTICATION $(dirname {output.heatmap}) &> {log}"

SnakeMake From line 17 of rules/summary.smk

__author__ = "Per Unneberg"
__copyright__ = "Copyright 2022, Per Unneberg"
__email__ = "[email protected]"
__license__ = "MIT"

import re
from snakemake.shell import shell
import subprocess as sp

out = sp.run(["malt-build", "--help"], capture_output=True)
regex = re.compile("version (?P<major>\d+)\.(?P<minor>\d+)")
m = regex.search(out.stderr.decode())
if m is None:
    # Assume minor version 4
    minor = 4
else:
    minor = int(m.groupdict()["minor"])

a2t_option = "-a2taxonomy" if minor <= 4 else "-a2t"

log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True)
options = snakemake.params.get("extra", "")
unique_taxids = snakemake.input.unique_taxids

seqid2taxid = snakemake.params.seqid2taxid
nt_fasta = snakemake.params.nt_fasta
accession2taxid = snakemake.params.accession2taxid

output = snakemake

shell(
    "grep -wFf {unique_taxids} {seqid2taxid} > {snakemake.output.seqid2taxid_project}; "
    "cut -f1 {snakemake.output.seqid2taxid_project} > {snakemake.output.seqids_project}; "
    "grep -Ff {snakemake.output.seqids_project} {nt_fasta} | sed 's/>//g' > {snakemake.output.project_headers}; "
    "seqtk subseq {nt_fasta} {snakemake.output.project_headers} > {snakemake.output.project_fasta} {log}; "
    "unset DISPLAY; "
    "malt-build -i {snakemake.output.project_fasta} {a2t_option} {accession2taxid} -s DNA -t {snakemake.threads} -d {snakemake.output.db} {log}"
)