FooDMe

FooDMe is a reproducible and scalable snakemake workflow for the analysis of DNA metabarcoding experiments, with a special focus on food and feed samples.

Usage

The documentation for this workflow is hosted on our homepage. If you use this workflow for research, you can cite this repository using the DOI above.

This workflow supports snakemake's standardized usage and is referenced in the snakemake workflow catalog.
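
As a pointer, the standardized usage boils down to deploying the workflow with snakedeploy and running it with conda-managed environments. The sketch below is illustrative only: the repository URL is inferred from the project homepage and the release tag from the version listed on this page, so check the documentation for the authoritative commands.

    # Deploy the workflow into a fresh project directory (URL and tag are assumptions)
    snakedeploy deploy-workflow https://github.com/CVUA-RRW/FooDMe foodme-project --tag v1.6.6
    cd foodme-project

    # Edit config/config.yaml and the sample sheet for your run, then execute with conda environments
    snakemake --cores 8 --use-conda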

Code Snippets

script:
    "../scripts/filter_taxonomy.py"
shell:
    """
    exec 2> {log}

    export BLASTDB={params.taxdb}

    blastdbcmd -db {params.blast_DB} -tax_info -outfmt %T \
    > {output.taxlist}
    """
script:
    "../scripts/make_blast_mask.py"
script:
    "../scripts/apply_blocklist.py"
shell:
    """
    touch {output.mask} 2> {log}
    """
shell:
    """
    touch {output.block} 2> {log}
    """
SnakeMake From line 107 of rules/blast.smk
shell:
    """
    export BLASTDB={params.taxdb}

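    # Restrict the search with a taxid list only when a mask file was actually generated
    # upstream (signalled by its path); otherwise query the whole database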
    if [ {input.mask} = "common/blast_mask.txt" ]
    then
        masking="-taxidlist common/blast_mask.txt"
    else
        masking=""
    fi

    blastn -db {params.blast_DB} \
        -query {input.query} \
        -out {output.report} \
        -task 'megablast' \
        -evalue {params.e_value} \
        -perc_identity {params.perc_identity} \
        -qcov_hsp_perc {params.qcov} $masking \
        -outfmt '6 qseqid sseqid evalue pident bitscore sacc staxid length mismatch gaps stitle' \
        -num_threads {threads} \
    2> {log} 

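    # Prepend a header row naming the custom outfmt columns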
    sed -i '1 i\query\tsubject\tevalue\tidentity\tbitscore\tsubject_acc\tsubject_taxid\talignment_length\tmismatch\tgaps\tsubject_name' {output.report}
    """
shell:
    """
    exec 2> {log}
    if [ -s {input.report} ]; then
      grep -v -f {params.acc_list} {input.report} > {output.report}
    else
      touch {output.report}
    fi
    """
SnakeMake From line 176 of rules/blast.smk
script:
    "../scripts/filter_blast.py"
SnakeMake From line 200 of rules/blast.smk
script:
    "../scripts/min_consensus_filter.py"
SnakeMake From line 218 of rules/blast.smk
shell:
    """
    exec 2> {log}

    if [ -s {input.blast} ]
    then
        # Get list of all OTUs
        OTUs=$(grep "^>" {input.otus} | cut -d";" -f1 | tr -d '>' | sort -u)

        for otu in $OTUs
        do
            size=$(grep -E "^>${{otu}}\>" {input.otus}  | cut -d"=" -f2)
            bhits=$(grep -c -E "^${{otu}};" {input.blast} || true)
            if [ $bhits -eq 0 ]
            then
                # When there is no blast hit
                echo "{wildcards.sample}\t$otu\t$size\t0\t0\t0\t0\t0\t-\t-\t-\t- (1.0)\t../{input.blast}\t../{input.filtered}" >> {output}
            else
                # Otherwise collect and print stats to file
                bit_best=$(grep -E "^${{otu}};" {input.blast} | cut -f5 | cut -d. -f1 | sort -rn | head -n1)
                bit_low=$(grep -E "^${{otu}};" {input.blast} | cut -f5 | cut -d. -f1 | sort -n | head -n1)
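                # bit_thr: best bit-score minus the configured difference, i.e. the threshold used when filtering hits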
                bit_thr=$(($bit_best - {params.bit_diff}))
                shits=$(grep -c -E "^${{otu}}\>" {input.filtered})
                cons=$(grep -E "^${{otu}}\>" {input.lca} | cut -d'\t' -f2-5)

                echo "{wildcards.sample}\t$otu\t$size\t$bhits\t$bit_best\t$bit_low\t$bit_thr\t$shits\t$cons\t../{input.blast}\t../{input.filtered}" >> {output}
            fi
        done
        # Sort by size and add header (just to get hits on top)
        sort -k3,3nr -o {output} {output}
        sed -i '1 i\Sample\tQuery\tCount\tBlast hits\tBest bit-score\tLowest bit-score\tBit-score threshold\tSaved Blast hits\tConsensus\tRank\tTaxid\tDisambiguation\tlink_report\tlink_filtered' {output}

    else
        echo "{wildcards.sample}\t-\t-\t0\t0\t0\t0\t0\t-\t-\t-\t-\t../{input.blast}\t../{input.filtered}" > {output}
        sed -i '1 i\Sample\tQuery\tCount\tBlast hits\tBest bit-score\tLowest bit-score\tBit-score threshold\tSaved Blast hits\tConsensus\tRank\tTaxid\tDisambiguation\tlink_report\tlink_filtered' {output}
    fi
    """
shell:
    """
    head -n 1 {input.report[0]} > {output.agg}
    for i in {input.report}; do 
      cat ${{i}} | tail -n +2 >> {output.agg}
    done
    """
SnakeMake From line 297 of rules/blast.smk
shell:
    """
    exec 2> {log}

    echo "Sample\tQuery\tUnknown sequences\tUnknown sequences [%]\t(Sub-)Species consensus\t(Sub-)Species consensus [%]\tGenus consensus\tGenus consensus [%]\tFamily consensus\tFamily consensus [%]\tHigher rank consensus\tHigher rank consensus [%]" > {output}

    all=$(grep -c -E "OTU_|ASV_" <(tail -n +2 {input}) || true)
    nohits=$(grep -c "[[:blank:]]-[[:blank:]]" {input} || true)
    spec=$(grep -c "species" {input} || true)
    gen=$(grep -c "genus" {input} || true)
    fam=$(grep -c "family" {input} || true)
    other=$(( $all - $nohits - $spec - $gen - $fam ))

    if [ $all -ne 0 ]
    then
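        # Shell arithmetic is integer-only: scale by 10^3 and let printf's e-3 exponent render two decimals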
        nohits_perc=$(printf %.2f "$((10**3 * (100* $nohits / $all)))e-3")
        spec_perc=$(printf %.2f "$((10**3 * (100* $spec / $all)))e-3")
        gen_perc=$(printf %.2f "$((10**3 * (100* $gen / $all)))e-3")
        fam_perc=$(printf %.2f "$((10**3 * (100* $fam / $all)))e-3")
        other_perc=$(printf %.2f "$((10**3 * (100* $other / $all)))e-3")

        echo "{wildcards.sample}\t$all\t$nohits\t$nohits_perc\t$spec\t$spec_perc\t$gen\t$gen_perc\t$fam\t$fam_perc\t$other\t$other_perc" >> {output}

    else
        echo "{wildcards.sample}\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0" >> {output}
    fi
    """
shell:
    """
    exec 2> {log}
    cat {input.report[0]} | head -n 1 > {output.agg}
    for i in {input.report}; do 
        cat ${{i}} | tail -n +2 >> {output.agg}
    done
    """
SnakeMake From line 364 of rules/blast.smk
script:
    "../scripts/summarize_results.py"
SnakeMake From line 392 of rules/blast.smk
shell:
    """
    exec 2> {log}

    cat {input.report[0]} | head -n 1 > {output.agg}
    for i in {input.report}; do 
        cat ${{i}} | tail -n +2 >> {output.agg}
    done
    """
SnakeMake From line 412 of rules/blast.smk
script:
    "../scripts/krona_table.py"
shell:
    "ktImportText -o {output.graph} {input.table} 2> {log}"
shell:
    """
    exec 2> {log}
    i=0
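    # Build "path,label" arguments for ktImportText, labelling each chart with the first path component (the per-sample directory)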
    for file in {input.report}
    do
        file_list[$i]="${{file}},$(echo ${{file}} | cut -d'/' -f1)"
        ((i+=1))
    done

    ktImportText -o {output.agg} ${{file_list[@]}}
    """
shell:
    """
    exec 2> {log}

    if [[ {params.method} == "otu" ]] 
    then
        echo "Sample\tQ30 rate\tInsert size peak\tRead number\tPseudo-reads\tReads in OTU\tOTU number\tAssigned reads\t(Sub-)Species consensus\tGenus consensus\tHigher rank consensus\tNo match" > {output.report}

        Q30=$(tail -n +2 {input.fastp} | cut -d'\t' -f9)
        size=$(tail -n +2 {input.fastp} | cut -d'\t' -f11)
        reads=$(tail -n +2 {input.merging} | cut -d'\t' -f2)
        pseudo=$(tail -n +2 {input.merging} | cut -d'\t' -f5)
        clustered=$(tail -n +2 {input.clustering} | cut -d'\t' -f10)
        otu=$(tail -n +2 {input.tax} | cut -d'\t' -f2)
        assigned=$(tail -n +2 {input.compo} | awk '$2 != "No match"' | cut -d'\t' -f5 | awk '{{s+=$1}}END{{print s}}')
        spec=$(tail -n +2 {input.tax} | cut -d'\t' -f5)
        gen=$(tail -n +2 {input.tax} | cut -d'\t' -f7)
        high=$(($(tail -n +2 {input.tax} | cut -d'\t' -f9) + $(tail -n +2 {input.tax} | cut -d'\t' -f11)))
        none=$(tail -n +2 {input.tax} | cut -d'\t' -f3)

        echo "{wildcards.sample}\t$Q30\t$size\t$reads\t$pseudo\t$clustered\t$otu\t$assigned\t$spec\t$gen\t$high\t$none" >> {output.report}
    else
        echo "Sample\tQ30 rate\tInsert size peak\tRead number\tPseudo-reads\tReads in ASV\tASV number\tAssigned reads\t(Sub-)Species consensus\tGenus consensus\tHigher rank consensus\tNo match" > {output.report}

        Q30=$(tail -n +2 {input.fastp} | cut -d'\t' -f9)
        size=$(tail -n +2 {input.fastp} | cut -d'\t' -f11)
        reads=$(tail -n +2 {input.clustering} | cut -d'\t' -f2)
        pseudo=$(tail -n +2 {input.clustering} | cut -d'\t' -f6)
        clustered=$(tail -n +2 {input.clustering} | cut -d'\t' -f16)
        otu=$(tail -n +2 {input.tax} | cut -d'\t' -f2)
        assigned=$(tail -n +2 {input.compo} | awk '$2 != "No match"' | cut -d'\t' -f5 | awk '{{s+=$1}}END{{print s}}')
        spec=$(tail -n +2 {input.tax} | cut -d'\t' -f5)
        gen=$(tail -n +2 {input.tax} | cut -d'\t' -f7)
        high=$(($(tail -n +2 {input.tax} | cut -d'\t' -f9) + $(tail -n +2 {input.tax} | cut -d'\t' -f11)))
        none=$(tail -n +2 {input.tax} | cut -d'\t' -f3)

        echo "{wildcards.sample}\t$Q30\t$size\t$reads\t$pseudo\t$clustered\t$otu\t$assigned\t$spec\t$gen\t$high\t$none" >> {output.report}
    fi
    """
shell:
    """
    exec 2> {log}
    cat {input.report[0]} | head -n 1 > {output.agg}
    for i in {input.report}; do 
        cat ${{i}} | tail -n +2 >> {output.agg}
    done
    """
script:
    "../scripts/write_report.Rmd"
script:
    "../scripts/write_report.Rmd"
script:
    "../scripts/conda_collector.py"
shell:
    """
    exec 2> {log}

    echo "Database\tLast modified\tFull path" \
        > {output.report}

    paste \
        <(echo "BLAST") \
        <(date +%F -r {params.blast}.nto) \
        <(echo {params.blast}) \
        >> {output.report}

    paste \
        <(echo "taxdb.bti") \
        <(date +%F -r {params.taxdb}/taxdb.bti) \
        <(echo {params.taxdb}/taxdb.bti) \
        >> {output.report}

    paste \
        <(echo "taxdb.btd") \
        <(date +%F -r {params.taxdb}/taxdb.btd) \
        <(echo {params.taxdb}/taxdb.btd) \
        >> {output.report}

    paste \
        <(echo "taxdump lineages") \
        <(date +%F -r {params.taxdump_lin}) \
        <(echo {params.taxdump_lin}) \
        >> {output.report}

    paste \
        <(echo "taxdump nodes") \
        <(date +%F -r {params.taxdump_nodes}) \
        <(echo {params.taxdump_nodes}) \
        >> {output.report}
    """
script:
    "../scripts/primer_disambiguation.py"
shell:
    """
    seqtk seq -r {input.primers} 1> {output.primers_rc} 2> {log}
    """
shell:
    """
    # Simple case only 5p trimming
    if [[ {params.primer_3p} == False ]]
    then
        cutadapt {input.r1} \
            {input.r2} \
            -o {output.r1} \
            -p {output.r2} \
            -g file:{params.primers} \
            -G file:{params.primers} \
            --untrimmed-output {output.trash_R1_5p} \
            --untrimmed-paired-output {output.trash_R2_5p} \
            --error-rate {params.error_rate} \
            2>&1 > {log}
        touch {output.trash_R1_3p}
        touch {output.trash_R2_3p}

    # If trimming at the 3' end is also necessary, chain two cutadapt passes:
    # the first removes 5' primers, the second (reading the interleaved stream
    # from the pipe) removes the reverse-complemented primers from the 3' end
    else
        cutadapt --interleaved \
            {input.r1} \
            {input.r2} \
            -g file:{params.primers} \
            -G file:{params.primers} \
            --untrimmed-output {output.trash_R1_5p} \
            --untrimmed-paired-output {output.trash_R2_5p} \
            --error-rate {params.error_rate} \
        2>> {log} \
        | cutadapt --interleaved \
            -o {output.r1} \
            -p {output.r2} \
            -a file:{input.primers_rc} \
            -A file:{input.primers_rc} \
            --untrimmed-output {output.trash_R1_3p} \
            --untrimmed-paired-output {output.trash_R2_3p} \
            --error-rate {params.error_rate} \
            - \
        2>&1 >> {log}
    fi
    """
shell:
    """
    exec 2> {log}

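    # Count reads in each gzipped FASTQ: four lines per record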
    before_r1=$(zcat {input.before_r1} | echo $((`wc -l`/4)))
    after_r1=$(zcat {input.after_r1} | echo $((`wc -l`/4)))
    before_r2=$(zcat {input.before_r2} | echo $((`wc -l`/4)))
    after_r2=$(zcat {input.after_r2} | echo $((`wc -l`/4)))

    before=$(( before_r1 + before_r2 ))
    after=$(( after_r1 + after_r2 ))

    if [ $after -ne 0 ] 
    then
        perc_discarded=$( python -c "print(f'{{round(100*(1-${{after}}/${{before}}),2)}}')" )
    else
        perc_discarded=0.00
    fi

    echo "Sample\tTotal raw reads\tTotal reads after primer trim\tNo primer found [%]" > {output.report}
    echo "{wildcards.sample}\t$before\t$after\t$perc_discarded" >> {output.report}
    """
shell:
    """
    fastp -i {input.r1} -I {input.r2} \
        -o {output.r1} -O {output.r2} \
        -h {output.html} -j {output.json}\
        --length_required {params.length_required} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        --cut_by_quality3 \
        --cut_window_size {params.window_size} \
        --cut_mean_quality {params.mean_qual} \
        --disable_adapter_trimming \
        --thread {threads} \
        --report_title 'Sample {wildcards.sample}' \
    > {log} 2>&1
    """
script:
    "../scripts/parse_fastp.py"
shell:
    """
    paste {input.cutadapt} {input.fastp} 1> {output.report} 2> {log}
    """
shell:
    """
    exec 2> {log}

    cat {input.report[0]} | head -n 1 > {output.agg}
    for i in {input.report}; do 
        cat ${{i}} | tail -n +2 >> {output.agg}
    done
    """
import sys


sys.stderr = open(snakemake.log[0], "w")


def main(taxids, blocklist, output):
    with open(taxids, 'r') as fi:
        taxs = set([line.strip() for line in fi.readlines()])

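    # Blocklist lines may carry trailing '#' comments; keep only the taxid part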
    with open(blocklist, 'r') as bl:
        blocks = set([line.split('#')[0].strip() for line in bl.readlines()])

    listout = taxs.difference(blocks)

    with open(output, 'w') as fo:
        for tax in listout:
            fo.write(f"{tax}\n")


if __name__ == '__main__':
    main(snakemake.input["taxids"],
         snakemake.input["blocklist"],
         snakemake.output['mask'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import os
import yaml
import pandas as pd


def extract_package_version(envfile):
    with open(envfile, 'r') as stream:
        env = yaml.safe_load(stream)
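        # Assumes every dependency is pinned as 'package=version' in the environment files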
        for dep in env['dependencies']:
            p, v = dep.split("=")
            yield p, v


def main(report, basedir):
    mypath = os.path.join(basedir, "envs")
    envs = [
        os.path.join(mypath, f) for f in os.listdir(mypath)
        if os.path.isfile(os.path.join(mypath, f)) and f.lower().endswith(('.yaml', '.yml'))
    ]
    df = []
    for ef in envs:
        for p, v in extract_package_version(ef):
            df.append({'Package': p, 'Version': v})
    df = pd.DataFrame(df)
    df.sort_values('Package').to_csv(report, sep="\t", header=True, index=False)


if __name__ == '__main__':
    main(
        report=snakemake.output['report'],
        basedir=snakemake.params['dir']
    )
import sys


sys.stderr = open(snakemake.log[0], "w")


from os import stat
import pandas as pd


def main(report, filtered, bit_diff):
    if stat(report).st_size == 0:
        with open(filtered, "w") as fout:
            fout.write(
                "query\tsubject\tevalue\tidentity\tbitscore\tsubject_acc\t"
                "subject_taxid\talignment_length\tmismatch\tgaps\tsubject_name"
            )
    else:
        df = pd.read_csv(report, sep="\t", header=0)
        if df.empty:
            df.to_csv(filtered, sep="\t", header=True, index=False)
        else:
            sd = dict(tuple(df.groupby("query")))
            dfout = pd.DataFrame()
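            # For each query, keep only the hits whose bit-score is within bit_diff of that query's best hit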
            for key, val in sd.items():
                dfout = pd.concat(
                    [dfout, val[val["bitscore"] >= max(val["bitscore"]) - bit_diff]]
                )
            dfout["query"] = dfout["query"].str.split(";").str[0]
            dfout.to_csv(filtered, sep="\t", header=True, index=False)


if __name__ == '__main__':
    main(snakemake.input['report'],
         snakemake.output['filtered'],
         snakemake.params['bit_diff'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import taxidTools as txd


def main(nodes, lineage, taxid, out):
    tax = txd.Taxonomy.from_taxdump(nodes, lineage)
    tax.prune(taxid)
    tax.write(out)


if __name__ == '__main__':
    main(snakemake.params['nodes'],
         snakemake.params['rankedlineage'],
         snakemake.params['taxid'],
         snakemake.output['tax'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import taxidTools as txd
import pandas as pd


def get_lineage(taxid, tax):
    if taxid == "-":
        return ["Unassigned"]
    elif taxid == "Undetermined":
        return ["Undetermined"]
    else:
        return [node.name for node in tax.getAncestry(taxid)][::-1]
        # inverting list to have the lineage descending for Krona


def main(input, output, taxonomy):
    tax = txd.load(taxonomy)
    df = pd.read_csv(input, sep='\t', header=0)
    with open(output, "w") as out:
        for index, row in df.iterrows():
            out.write(
                "\t".join(
                    [str(row["Count"])] + get_lineage(row['Taxid'], tax)
                ) + "\n")


if __name__ == '__main__':
    main(snakemake.input['compo'],
         snakemake.output['krt'],
         snakemake.input['tax'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import taxidTools as txd


def main(taxid_file, parent, output, taxonomy):

    tax = txd.load(taxonomy)

    with open(taxid_file, "r") as fin:
        db_entries = set(fin.read().splitlines()[1:])

    with open(output, "w") as fout:
        for taxid in db_entries:
            try:
                if tax.isDescendantOf(str(taxid).strip(), str(parent).strip()):
                    fout.write(taxid + "\n")
                else:
                    pass
            except KeyError:
                pass  # Ignoring missing taxids as they are either not in the
                # taxdumps or actively filtered by the user.


if __name__ == '__main__':
    main(snakemake.input['taxlist'],
         snakemake.params["taxid"],
         snakemake.output['mask'],
         snakemake.input['tax'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import taxidTools as txd
from collections import Counter, defaultdict


def parse_blast(blast_file):
    """
    Parse a BLAST report and return a dictionary where keys are query
    sequence names and values are lists of taxids, one per hit.
    The BLAST report must have the following formatting:
        '6 qseqid sseqid evalue pident bitscore sacc
        staxid length mismatch gaps stitle'
    """
    dictout = defaultdict()
    with open(blast_file, 'r') as fi:
        next(fi)  # Skip header
        for line in fi:
            ls = line.split()
            taxids = ls[6].split(";")  # split multi-taxid entries if necessary
            # extend the taxid list for this OTU
            if ls[0] in dictout.keys():
                dictout[ls[0]].extend(taxids)
            # or initiate the list
            else:
                dictout[ls[0]] = taxids

    # Make sure everything is str formatted
    dictout = {k: [str(e) for e in v] for k, v in dictout.items()}

    return dictout


def main(blast_report, output, min_consensus, taxonomy):
    if min_consensus <= 0.5 or min_consensus > 1:
        raise ValueError("'min_consensus' must be in the interval (0.5 , 1]")

    tax = txd.load(taxonomy)
    otu_dict = parse_blast(blast_report)
    with open(output, 'w') as out:
        out.write("queryID\tConsensus\tRank\tTaxid\tDisambiguation\n")

        for queryID, taxid_list in otu_dict.items():
            try:
                consensus = tax.consensus(taxid_list, min_consensus)

            except KeyError:
                # Taxid not present in the Taxdump version
                # used raises a KeyError
                # Filter out missing sequences (verbose)
                taxid_list_new = []
                for taxid in taxid_list:
                    if taxid not in tax.keys():
                        pass  # Most likely the result of active filtering by the user,
                        # so no need to be overly verbose about it
                        # print(f"WARNING: taxid {taxid} missing from Taxonomy "
                        #      f"reference, it will be ignored")
                    else:
                        taxid_list_new.append(taxid)

                # Update list
                taxid_list = taxid_list_new

                # Empty list case:
                if not taxid_list:
                    consensus = "Undetermined"
                else:
                    # Get the consensus with the filtered taxids
                    consensus = tax.consensus(taxid_list, min_consensus)

            finally:
                if consensus != "Undetermined":
                    rank = consensus.rank
                    name = consensus.name
                    taxid = consensus.taxid
                else:
                    taxid = "Undetermined"
                    rank = "Undetermined"
                    name = "Undetermined"

                # (freq, name) tuple to sort
                freqs = [((v/len(taxid_list)), tax.getName(k))
                         for k, v in Counter(taxid_list).items()]
                sorted_freqs = sorted(freqs, reverse=True)

                names = "; ".join([f"{f} ({round(n,2)})"
                                   for (n, f) in sorted_freqs])
                out.write(f"{queryID}\t{name}\t{rank}\t{taxid}\t{names}\n")


if __name__ == '__main__':
    main(snakemake.input['blast'],
         snakemake.output['consensus'],
         snakemake.params["min_consensus"],
         snakemake.input['tax'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import os
import json
import csv


def main(injson, inhtml, outtsv):
    with open(injson, "r") as handle:
        data = json.load(handle)
        link_path = os.path.join("..", inhtml)
        header = (
            "Total bases before quality trim\tTotal reads after quality trim"
            "\tTotal bases after quality trim\tQ20 rate after\tQ30 rate after"
            "\tDuplication rate\tInsert size peak\tlink_to_report"
        )
        datalist = [
            data["summary"]["before_filtering"]["total_bases"],
            data["summary"]["after_filtering"]["total_reads"],
            data["summary"]["after_filtering"]["total_bases"],
            data["summary"]["after_filtering"]["q20_rate"],
            data["summary"]["after_filtering"]["q30_rate"],
            data["duplication"]["rate"],
            data["insert_size"]["peak"],
            link_path,
        ]
    with open(outtsv, "w") as outfile:
        outfile.write(f"{header}\n")
        writer = csv.writer(outfile, delimiter="\t")
        writer.writerow(datalist)


if __name__ == '__main__':
    main(snakemake.input['json'],
         snakemake.input['html'],
         snakemake.output['tsv'])
import sys


sys.stderr = open(snakemake.log[0], "w")


from Bio import SeqIO
from itertools import product


def extend_ambiguous_dna(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = {
        'A': 'A',
        'C': 'C',
        'G': 'G',
        'T': 'T',
        'M': ['A', 'C'],
        'R': ['A', 'G'],
        'W': ['A', 'T'],
        'S': ['C', 'G'],
        'Y': ['C', 'T'],
        'K': ['G', 'T'],
        'V': ['A', 'C', 'G'],
        'H': ['A', 'C', 'T'],
        'D': ['A', 'G', 'T'],
        'B': ['C', 'G', 'T'],
        'N': ['G', 'A', 'T', 'C']
    }
    return list(map("".join, product(*map(d.get, seq))))


def primers_to_fasta(name, seq_list):
    """return fasta string of primers with tracing newline"""
    fas = ""
    for i in range(len(seq_list)):
        fas += f">{name}[{i}]\n{seq_list[i]}\n"
    return fas


def main(fastain, fastaout):
    with open(fastain, "r") as fin, open(fastaout, "w") as fout:
        for record in SeqIO.parse(fin, "fasta"):
            explicit = extend_ambiguous_dna(record.seq)
            fasta = primers_to_fasta(record.id, explicit)
            fout.write(fasta)


if __name__ == '__main__':
    main(snakemake.params['primers'],
         snakemake.output['primers'])
import sys


sys.stderr = open(snakemake.log[0], "w")


import pandas as pd


def concatenate_uniq(entries):
    s = "; ".join(entries.to_list())
    df = pd.DataFrame(
        [e.rsplit(" (", 1) for e in s.split("; ")], columns=["name", "freq"]
        )  # rsplit on the last " (" because taxon names may themselves contain parentheses
    df.loc[:, "freq"] = df["freq"].str.replace(")", "", regex=False).astype(float)
    # Aggregate, normalize, and sort
    tot = df["freq"].sum()
    df = df.groupby("name").apply(lambda x: x.sum() / tot)
    df = df.sort_values(by=["freq"], ascending=False)
    # Format as string
    uniq = df.to_dict()["freq"]
    uniq = [f"{name} ({round(freq, 2)})" for name, freq in uniq.items()]
    return "; ".join(uniq)


def main(compo, report, sample):
    df = pd.read_csv(compo, sep="\t", header=0).fillna(0)

    # Empty input case
    if len(df["Query"]) == 1 and df["Query"].head(1).item() == "-":
        with open(report, "w") as fout:
            fout.write(
                "Sample\tConsensus\tRank\tTaxid\tCount\tDisambiguation\tPercent of total\tPercent of assigned"
            )

    else:
        groups = df.groupby(["Consensus", "Rank", "Taxid"]).agg(
            {"Count": "sum", "Disambiguation": concatenate_uniq}
        )
        groups = groups.sort_values("Count", ascending=False).reset_index()

        # Get percs of total
        groups["perc"] = round(groups["Count"] / groups["Count"].sum() * 100, 2)

        # Get percs of assigned
        assigned, notassigned = (
            groups[groups["Consensus"] != "-"],
            groups[groups["Consensus"] == "-"],
        )
        assigned["perc_ass"] = round(assigned["Count"] / assigned["Count"].sum() * 100, 2)
        notassigned["perc_ass"] = "-"
        groups = pd.concat([assigned, notassigned])

        # Formatting
        groups.insert(0, "Sample", sample)
        groups.rename(columns={"perc": "Percent of total",
                               "perc_ass": "Percent of assigned"},
                      inplace=True)
        groups["Consensus"].replace({"-": "No match"}, inplace=True)
        groups["Taxid"].replace({0: "-"}, inplace=True)
        groups.to_csv(report, sep="\t", index=False)


if __name__ == '__main__':
    main(snakemake.input['compo'],
         snakemake.output['report'],
         snakemake.params['sample_name'])
# logging
log = file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type = "message")

knitr::opts_chunk$set(out.width = '80%',fig.asp= 0.5,fig.align='center',echo=FALSE, warning=FALSE, message=FALSE)
options(markdown.HTML.header = system.file("misc", "datatables.html", package = "knitr"))

library(DT, quietly = T)
library(tidyverse, quietly = T)
library(htmltools, quietly = T)

executor <- Sys.info()["user"]
htmltools::a(
    href="https://cvua-rrw.github.io/FooDMe/",
    htmltools::img(
        src = knitr::image_uri(snakemake@params[['logo']]), 
        alt = 'FooDMe documentation', 
        style = 'position:absolute; top:0; right:0; padding:10px;',
        width=200
    )
)
workdir <- snakemake@params[["workdir"]]

overview <- snakemake@input[["summary"]]
fastp <- snakemake@input[["fastp"]]
qc_filtering <- snakemake@input[["qc_filtering"]]
clustering <- snakemake@input[["clustering"]]
blast_rep <- snakemake@input[["blast_rep"]]
taxonomy <- snakemake@input[["taxonomy"]]
result <- snakemake@input[["result"]]
db <- snakemake@input[["db"]]
soft <- snakemake@input[["soft"]]

OTU_bool <- snakemake@params[["method"]] == "otu" # store TRUE if using the OTU method

# infer run name from workdir
run <- basename(workdir)
#head(tail(strsplit(workdir,"/")[[1]],2),1)

reportAll <- snakemake@params[["sample"]] == "all"

# Number of samples
nsamples <- nrow(read.csv(file = overview, sep = "\t", check.names=FALSE))
data_table <- read.csv(file = overview, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
data_table <- read.csv(file = fastp, sep = "\t", check.names=FALSE)

# Create hyperlinks
data_table$links <- paste0("<a href=", data_table$link_to_report, ">file</a>")
data_table$link_to_report = NULL

datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
cat("## Read filtering statistics\n")
data_table <- read.csv(file = qc_filtering, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
cat("## Clustering statistics\n")
data_table <- read.csv(file = clustering, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
cat("## Denoising statistics\n")
data_table <- read.csv(file = clustering, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
data_table <- read.csv(file = blast_rep, sep = "\t", check.names=FALSE)
#Process links
data_table$blast_report <- paste0("<a href=", data_table$link_report, ">file</a>")
data_table$link_report = NULL
data_table$filtered_report <- paste0("<a href=", data_table$link_filtered, ">file</a>")
data_table$link_filtered = NULL

datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						),
						deferRender = TRUE,
						scroller = TRUE
						))
data_table <- read.csv(file = taxonomy, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						)
						))
data_table <- read.csv(file = result, sep = "\t", check.names=FALSE)
datatable(data_table, filter = 'top', rownames = FALSE, escape = FALSE,
		extensions = list("ColReorder" = NULL, "Buttons" = NULL),
		options = list(	
					dom = 'BRrltpi',
					autoWidth=FALSE,
					scrollX = TRUE,
					lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
					ColReorder = TRUE,
					buttons =
					list(
						'copy',
						'print',
						list(
						extend = 'collection',
						buttons = c('csv', 'excel', 'pdf'),
						text = 'Download'
						),
						I('colvis')
						),
						deferRender = TRUE,
						scroller = TRUE
						))
if (snakemake@params[["sample"]] == "all") {
	krona_source <- "krona_chart.html"
} else {
	krona_source <- paste0(snakemake@params[["sample"]], "_krona_chart.html")
}

htmltools::tags$iframe(title = "Krona chart", src = krona_source, width ="100%", height="800px") 
db_table <- read.csv(file = db, sep = "\t", check.names=FALSE)
knitr::kable(db_table)
soft_table <- read.csv(file = soft, sep = "\t", check.names=FALSE)
knitr::kable(soft_table)

Maintainers: public
URL: https://cvua-rrw.github.io/FooDMe
Name: foodme
Version: 1.6.6
Copyright: Public Domain
License: BSD 3-Clause "New" or "Revised" License