Pipeline for Predicting Antibiotic Resistance in Mycobacterium tuberculosis Using Snakemake


This Snakemake-based pipeline predicts antibiotic resistance in Mycobacterium tuberculosis.

Background

Tuberculosis, caused by Mycobacterium tuberculosis, remains a global health concern. To address the emerging problem of resistant strains, various diagnostic methods are needed. Here, we present a pipeline that predicts mycobacterial resistances in an OS-independent, reproducible way.

Structure

[workflow structure diagram: mdrmtbworkflow]

Input

  • FastQ files of Mycobacterium tuberculosis (Illumina-sequenced short reads)

  • The PointFinder database

  • A table of the gene loci to be investigated (names must match those in the PointFinder DB)

  • Several parameters in the config file

  • A samples table listing the sample name and the forward and reverse FastQ files (see config/pep/samples.csv)
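The samples table could look like the following sketch. The column names here are an assumption based on the description above; check config/pep/samples.csv in the repository for the exact header:

```csv
sample_name,fq1,fq2
sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz
sample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz
```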

Usage

Step 1: Obtain a copy of this workflow

Create a new GitHub repository using this workflow as a template. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Setup database

As mutation database, the pointfinder_db database is used (Johnsen et al. 2019, https://bitbucket.org/genomicepidemiology/pointfinder_db/src/master/). Some changes were made to make this database parseable; the modified database files are stored in a folder called database. Rename this folder to 'resources' to match the path configuration of the pipeline. This folder also contains a file called 'gene_loci.csv', which specifies the loci to be investigated by the pipeline. If you change the set of investigated loci, the name of each new locus must be present in the resistens-overview.txt database file from PointFinder.
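A hypothetical sketch of what entries in gene_loci.csv might look like; the exact format is defined by the file shipped in the database folder, and the locus names shown here are merely typical M. tuberculosis resistance loci that appear in the PointFinder database:

```csv
locus
rpoB
katG
embB
```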

Step 3: Setup data

After downloading the code, move the FastQ files to be investigated into a folder called data in the same directory as the Snakefile. The sample names and the names of the individual read files need to be written in samples.csv (config/pep/samples.csv). If all reads should be considered, set the parameter 'reducing' under reduce_reads in the config file (config/config.yaml) to "False".
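A hypothetical sketch of the relevant section of config/config.yaml; the key layout is an assumption based on the parameter names mentioned here and in the read-reduction script (which reads a percentage parameter), so consult the shipped config file for the actual structure:

```yaml
reduce_reads:
  reducing: "False"   # "False" keeps all reads; "True" enables subsampling
  red: 50             # percentage of reads to drop when reducing is enabled
```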

Step 4: Install snakemake

MYCRes is a Snakemake-based workflow, so Snakemake must be installed before the pipeline can be started. We used Snakemake version 7.18.2 in a conda environment, which can be created as follows:

conda create --name sm
conda activate sm
conda install -c conda-forge -c bioconda snakemake=7.18.2

Step 5: Start snakemake

Start the workflow with the command "snakemake --cores all --use-conda" from the directory containing the Snakefile. A Snakemake report.html file is created automatically in the results folder.

Output

As output, an interactive graphic shows all resistance-causing mutations that were found, along with coverage details for each mutation and gene locus.

[example interactive visualization]

Code Snippets

shell:
    "freebayes -f {input.fna} --region {params.region} {input.bam} |"
    " vcffilter -f {params.filter} | vcfallelicprimitives -kg > {output}"
script:
    "../scripts/sum_vars.py"
script:
    "../scripts/compare_variants.py"
script:
    "../scripts/sum_depths.py"
script:
    "../scripts/integr_res.py"
script:
    "../scripts/sum_res.py"
script:
    "../scripts/altair_plot_single.py"
script:
    "../scripts/altair_plot_list.py"
script:
    "../scripts/altair_summary.py"
script:
    "../scripts/covsum_to_html.py"
shell:
    "snakemake --nolock --report {output}"
shell:
    "snakemake --nolock --report {output}"
shell:
    "curl -SL -o {output} {params.mtb_genome} 2> {log}"
shell:
    "gzip -dkf {input} > {output}"
shell:
    "samtools faidx {input} > {output}"
wrapper:
    "v1.14.1/bio/bwa/index"
wrapper:
    "v1.14.1/bio/bwa/mem"
wrapper:
    "v1.14.1/bio/samtools/sort"
wrapper:
    "v1.17.2/bio/samtools/index"
wrapper:
    "v1.14.1/bio/fastqc"
wrapper:
    "v1.14.1/bio/multiqc"
wrapper:
    "v1.14.1/bio/fastqc"
wrapper:
    "v1.14.1/bio/multiqc"
shell:
    "samtools depth -H -d 1000000 -r {params.region} -o {output} {input.bam}"
shell:
    "(samtools coverage -r {params.region} -o {output} {input.bam} &&"
    " sed -i 's/AL123456.3/{wildcards.loci}/' {output})"
From line 98 of rules/qc.smk
shell:
    "cat {input} >> {output} ; "
    "echo -ne '\n' >> {output} ; "
    "sed -i '1!{{/^#rname/d;}}' {output}"
From line 120 of rules/qc.smk
wrapper:
    "v1.14.1/bio/samtools/sort"
shell:
    "samtools view -h -o {output} {input}"
script:
    "../scripts/rremove_reads.py"
shell:
    "samtools view -bS {input} > {output}"
wrapper:
    "v1.14.1/bio/cutadapt/pe"
import pandas as pd
import altair as alt
import warnings
from altair.expr import datum

alt.data_transformers.disable_max_rows()
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

abresprofile = snakemake.input.res
depthprofile = snakemake.input.depth

lis = []
for abres, depth in zip(abresprofile, depthprofile):
    ab_df = pd.read_csv(abres, header=0, sep=",")
    dp_df = pd.read_csv(depth, header=0, sep=",")

    dp_df["Gene"] = dp_df["Gene"].str.split("_").str[0]
    ab_df["Mut_gene"] = ab_df["Mut_gene"].str.split("_").str[0]
    lsmut = []
    lslocmut = []
    for i in range(ab_df.shape[0]):
        lsmut.append(
            ab_df.loc[i]["Ref_Codon"]
            + str(ab_df.loc[i]["Codon_num"])
            + ab_df.loc[i]["Alt_Codon"]
        )
        lslocmut.append(
            ab_df.loc[i]["Mut_gene"]
            + "_"
            + ab_df.loc[i]["Ref_Codon"]
            + str(ab_df.loc[i]["Codon_num"])
            + ab_df.loc[i]["Alt_Codon"]
        )
    ab_df["Mutation"] = lsmut
    ab_df["LocMutation"] = lslocmut

    barchart = (
        alt.Chart(dp_df)
        .mark_bar(color="#1f77b4")
        .encode(
            x=alt.X("Gene:O", title=None, axis=alt.Axis(labelAngle=-45)),
            y=alt.Y("mean(Depth)", scale=alt.Scale(type="log")),
        )
        .properties(title="Mean Depth per locus", width=800, height=200)
    )

    base2 = (
        alt.Chart(ab_df)
        .transform_calculate(color='datum.Var_Qual >= 1 ? "red" : "orange"',)
        .encode(
            x=alt.X(
                "Resistance",
                sort=["Resistance"],
                axis=alt.Axis(
                    orient="top",
                    labelAngle=35,
                    labelFontSize=10,
                    tickWidth=0,
                    domain=False,
                ),
            ),
        )
        .properties(width=360, height=100)
    )

    res = (
        base2.mark_circle(size=1800).encode(color=alt.Color("color:N", scale=None))
        + base2.mark_text(size=12).encode(text="Mut_gene")
        + base2.mark_text(dy=28, size=7).encode(text="Mutation")
    ).transform_filter(datum.Mut_status == "res_mut")

    qual = (
        alt.Chart(ab_df)
        .mark_bar(color="#1f77b4")
        .encode(
            x=alt.X(
                "LocMutation",
                sort=["Resistance"],
                axis=alt.Axis(
                    orient="top", labelAngle=35, title="Mutation specifics", tickWidth=0
                ),
            ),
            y=alt.Y(
                "Read_depth",
                scale=alt.Scale(type="log"),
                axis=alt.Axis(title="Read depth"),
            ),
            tooltip=["Var_Qual", "Read_depth"],
        )
        .properties(width=360, height=100)
        .transform_filter(datum.Mut_status == "res_mut")
    )

    lis.append(
        alt.vconcat(
            (res | qual),
            barchart,
            title=alt.TitleParams(
                text=str(snakemake.params)
                + " resistance plot -"
                + abres.split("/")[1]
                + "%",
                anchor="middle",
                fontWeight="bold",
                fontSize=20,
            ),
        )
    )
a = alt.vconcat(*lis)
a.save(snakemake.output[0])
import altair as alt
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
alt.data_transformers.disable_max_rows()

abresprofile = str(snakemake.input[0])
depthprofile = str(snakemake.input[1])

ab_df = pd.read_csv(abresprofile, header=0, sep=",")
dp_df = pd.read_csv(depthprofile, header=0, sep=",")

dp_df["Gene"] = dp_df["Gene"].str.split("_").str[0]
ab_df["Mut_gene"] = ab_df["Mut_gene"].str.split("_").str[0]
lsmut = []
lslocmut = []
for i in range(ab_df.shape[0]):
    lsmut.append(
        ab_df.loc[i]["Ref_Codon"]
        + str(ab_df.loc[i]["Codon_num"])
        + ab_df.loc[i]["Alt_Codon"]
    )
    lslocmut.append(
        ab_df.loc[i]["Mut_gene"]
        + "_"
        + ab_df.loc[i]["Ref_Codon"]
        + str(ab_df.loc[i]["Codon_num"])
        + ab_df.loc[i]["Alt_Codon"]
    )
ab_df["Mutation"] = lsmut
ab_df["LocMutation"] = lslocmut
ab_df = ab_df[ab_df["Mut_status"] == "res_mut"]

selector = alt.selection_single(empty="all", fields=["Gene"])
intervall_selector = alt.selection_interval(empty="all", encodings=["x"])

base = alt.Chart(dp_df).transform_calculate(
    color='datum.Antibiotic !== "-" ? "red" : "lightgrey"',
)

base_selector = base.add_selection(selector)

barchart = (
    base_selector.mark_bar()
    .encode(
        x=alt.X("Gene:O", title=None, axis=alt.Axis(labelAngle=-45)),
        y=alt.Y("mean(Depth):Q", scale=alt.Scale(type="log")),
        tooltip=alt.Tooltip(["mean(Depth)"], format=".2f"),
        color=alt.condition(selector, alt.value("#1f77b4"), alt.value("lightgrey")),
    )
    .properties(title="Mean Depth per locus", width=800, height=200)
)

poschart = (
    base_selector.mark_bar()
    .encode(x="Position", y="Depth", color=alt.Color("color:N", scale=None),)
    .transform_filter(selector)
    .add_selection(intervall_selector)
    .properties(title="Locus specific depth", width=360, height=200)
)

zoom_chart = (
    base.mark_bar()
    .encode(
        x="Position",
        y="Depth",
        color=alt.Color("color:N", scale=None),
        tooltip=["Mutation:N", "Antibiotic:N", "Depth"],
    )
    .transform_filter(intervall_selector)
    .properties(title="Zoomed locus specific depth", width=360, height=200)
)


base2 = (
    alt.Chart(ab_df)
    .transform_calculate(color='datum.Var_Qual >= 1 ? "red" : "orange"',)
    .encode(
        x=alt.X(
            "Resistance",
            sort=["Resistance"],
            axis=alt.Axis(
                orient="top", labelAngle=35, labelFontSize=10, tickWidth=0, domain=False
            ),
        ),
        tooltip=["AvLocus_Depth", "Read_depth", "Alt_num", "Var_Qual"],
    )
    .properties(width=360, height=100)
)

res = (
    base2.mark_circle(size=1800).encode(color=alt.Color("color:N", scale=None))
    + base2.mark_text(size=12).encode(text="Mut_gene")
    + base2.mark_text(dy=28, size=7).encode(text="Mutation")
)

qual = (
    alt.Chart(ab_df)
    .mark_bar(color="#1f77b4")
    .encode(
        x=alt.X(
            "LocMutation",
            sort=["Resistance"],
            axis=alt.Axis(
                orient="top", labelAngle=35, title="Mutation specifics", tickWidth=0
            ),
        ),
        y=alt.Y(
            "Read_depth", scale=alt.Scale(type="log"), axis=alt.Axis(title="Read depth")
        ),
        tooltip=["Var_Qual", "Read_depth"],
    )
    .properties(width=360, height=100)
)

plot = alt.vconcat(
    (res | qual),
    barchart,
    (poschart | zoom_chart),
    title=alt.TitleParams(
        text=str(snakemake.params) + " resistance plot",
        anchor="middle",
        fontWeight="bold",
        fontSize=20,
    ),
)
plot.save(snakemake.output[0], embed_options={"renderer": "svg"})
import pandas as pd
import altair as alt
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
alt.data_transformers.disable_max_rows()

dfres = pd.read_csv(str(snakemake.input), header=0, sep=",")
dfres["Sample"] = dfres["Sample"].str.split("_").str[0]
dfres["resistance"] = dfres["resistance"].replace(
    {
        "RIFAMPICIN": "RIF",
        "ISONIAZID": "INH",
        "KANAMYCIN": "KAN",
        "PYRAZINAMIDE": "PZA",
        "ETHAMBUTOL": "EMB",
        "ETHIONAMIDE": "ETH",
        "STREPTOMYCIN": "SM",
        "FLUOROQUINOLONE": "FQ",
    }
)
sorting = [
    "RIF",
    "INH",
    "PZA",
    "EMB",
    "SM",
    "FQ",
    "KAN",
    "AMI",
    "CAP",
    "ETH",
    "LIN",
    "BDQ",
    "CFZ",
]

plot = (
    alt.Chart(dfres)
    .mark_rect(stroke="black", strokeWidth=0.1)
    .encode(
        y=alt.Y("Sample:O", sort=dfres["Sample"].to_list()),
        x=alt.X(
            "resistance:O",
            axis=alt.Axis(orient="top", title="Resistotype", tickWidth=0),
            sort=sorting,
        ),
        color=alt.Color(
            "quality",
            scale=alt.Scale(
                domain=[0, 1, 2, 3],
                range=["lightgrey", "yellow", "orange", "red"],
                interpolate="rgb",
            ),
            title="Variant quality",
        ),
        tooltip=["locus", "mutation", "quality"],
    )
    .properties(title="Resistances MYCRes")
)

plot.save(snakemake.output[0], embed_options={"renderer": "svg"})
import pandas as pd

RES_DB_PATH = "resources/pointfinder_db/mycobacterium_tuberculosis/resistens-overview.txt"
STOP_DB_PATH = "resources/pointfinder_db/mycobacterium_tuberculosis/stop-overview.txt"

def Average(lst):
    return sum(lst) / len(lst)

def read_resistance_db(path):
    if "resistens-overview.txt" in path:
        db = pd.read_csv(path,header=0,sep="\t",)
        db["PMID"] = db["PMID"].str.replace(",", ";")
        db["Resistance"] = db["Resistance"].str.replace(",", ";")
        return db
    elif "stop-overview.txt" in path:
        stopcodonlist = ["TAG", "TAA", "TGA"]
        db = pd.read_csv(path,header=0,sep="\t",)
        return db,stopcodonlist
    else:
        return "ERROR: Wrong database called"

def comparison_method(gene_or_codon):
    if gene_or_codon == "gene":
        pd_col = 2
    elif gene_or_codon == "codon":
        pd_col = 3
    return pd_col

def create_compare_lists(comp, sumvar, gene_df):
    index = sumvar.loc[i][4] - 1
    hit_df = gene_df.loc[gene_df["Codon_pos"] == int(sumvar.loc[i][comp])] if isinstance(comp, int) else None
    if comp == 2:
        complist, mutlist = [sumvar.loc[i][5][index],sumvar.loc[i][7][index]], list(hit_df.loc[:,"Ref_codon":"Res_codon"].values)[0]
    elif comp == 3:
        complist, mutlist = list(sumvar.loc[i,"Ref_Codon":"Alt_aas"].values), list(hit_df.loc[:,"Ref_nuc":"Res_codon"].values)[0]
        del complist[2]
    elif comp == "negative":
        ref_cod, alt_cod = list(sumvar.loc[i][5]), list(sumvar.loc[i][7])
        ref_cod.reverse()
        alt_cod.reverse()
        hit_df = gene_df.loc[gene_df["Codon_pos"] == int(sumvar.loc[i][2])]
        complist, mutlist = [ref_cod[index],alt_cod[index]], list(hit_df.loc[:,"Ref_codon":"Res_codon"].values)[0]
    return complist, mutlist, hit_df

def prepare_output(i,sumvar):
    res = ""
    for elem in sumvar.loc[i,].tolist():
        res += str(elem)+","
    return res

def process_res_mut(rtype, hit_df, res):
    if rtype == "res":
        for elem in hit_df.loc[:,"Resistance":"PMID"].values[0]:
            res += str(elem)+","
        outcsv.write("res_mut,{}\n".format(res.rstrip(",")))
    elif rtype == "no_res":
        outcsv.write("no_res_mut,{},-,-\n".format(res.rstrip(",")))

resdb = read_resistance_db(RES_DB_PATH)
stopdb,stopcodonlist = read_resistance_db(STOP_DB_PATH)
sumvar = pd.read_csv(str(snakemake.input), header=0, sep="\t")


with open(str(snakemake.output), "w") as outcsv:
    outcsv.writelines(
        "Mut_status,Mut_gene,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual,Resistance,PMID\n"
    )
    for i in range(sumvar.shape[0]):
        lenlst = []
        gene_df = resdb.loc[resdb["#Gene_ID"] == sumvar.loc[i][0]]
        for col in gene_df:
            lenlst.append(gene_df["Ref_nuc"].apply(len).mean())
        if Average(lenlst) == 1:
            comp = comparison_method("gene")
            for j in gene_df["Codon_pos"].unique():
                if int(j) == int(sumvar.loc[i][comp]):
                    complist,mutlist,hit_df = create_compare_lists(comp, sumvar, gene_df)
                    res = prepare_output(i,sumvar)
                    if mutlist[0] == complist[0] and complist[1] in list(mutlist[1]):
                        process_res_mut("res", hit_df, res)
                    else:
                        process_res_mut("no_res", hit_df, res)
        else:
            comp = comparison_method("codon")
            for j in gene_df["Codon_pos"].unique():
                if int(j) == int(sumvar.loc[i][comp]):
                    complist,mutlist,hit_df = create_compare_lists(comp, sumvar, gene_df)
                    res = prepare_output(i,sumvar)
                    if mutlist[0] == complist[0] and mutlist[1] == complist[1] and complist[2] in list(mutlist[2]):
                        process_res_mut("res", hit_df, res)
                    else:
                        process_res_mut("no_res", hit_df, res)       
            if not gene_df[gene_df["Codon_pos"] < 0].empty:
                comp = comparison_method("gene")
                gene_df = gene_df[gene_df["Codon_pos"] < 0]            
                for j in gene_df["Codon_pos"].unique():
                    if int(j) == int(sumvar.loc[i][comp]):
                        complist,mutlist,hit_df = create_compare_lists("negative", sumvar, gene_df)
                        res = prepare_output(i,sumvar)
                        if mutlist[0] == complist[0] and complist[1] in list(mutlist[1]):
                            process_res_mut("res", hit_df, res)
                        else:
                            process_res_mut("no_res", hit_df, res)

    for i in range(stopdb.shape[0]):
        var_df = sumvar.loc[sumvar["Gene_name"] == stopdb.loc[i][0]]
        for j in var_df["Alt_Codon"].values:
            if j in stopcodonlist:
                res = var_df[
                    (var_df["Gene_name"] == stopdb.loc[i][0]) & (var_df["Alt_Codon"] == j)
                ].to_string(header=False, index=False)
                outcsv.write("res_mut\t{}".format(res.replace(" ", ",")))
import pandas as pd

depthsum = snakemake.input
tabtitle = str(snakemake.params) + " coverage summary"

frames = []
for sum in depthsum:
    df = pd.read_csv(sum, header=0, sep="\t")
    df["#rname"] = df["#rname"].str.split("_").str[0]
    df["startpos"] = df["endpos"] - df["startpos"] + 1
    df = df.drop(["endpos"], axis=1)
    df = df.rename(columns={"#rname": "locus", "startpos": "numpos"})
    df.insert(1, "read_reduction", str(sum).split("/")[1])
    frames.append(df)

dfall = pd.concat(frames).sort_values(["locus", "read_reduction"])


with open(str(snakemake.output), "w") as outhtml:
    outhtml.write(tabtitle)
    dfall.to_html(buf=outhtml, header=True, index=False, justify="center")
import pandas as pd
import numpy as np
import ast

ab_df = pd.read_csv(str(snakemake.input[0]), header=0, sep=",")
dp_df = pd.read_csv(str(snakemake.input[1]), header=0, sep=",")

# drops all mutation with a positional read depth below 8
ab_df = ab_df.drop(ab_df[ab_df.Read_depth < 8].index)

# converts columns which have int and lists as values
def convert_lists(value):
    try:
        value = int(value)
    except:
        value = sum(ast.literal_eval(value))
    return value


# integrate locus average in abresprofile
ls = []
gene_loci = dp_df["Gene"].drop_duplicates().to_list()
for gene in gene_loci:
    gene_df = dp_df.loc[dp_df["Gene"] == gene]
    ls.append([gene, round(gene_df["Depth"].mean())])
dp_df = pd.DataFrame(ls)
dp_df = dp_df.rename(columns={0: "Mut_gene", 1: "AvLocus_Depth"})
res = ab_df.merge(dp_df, on="Mut_gene")
locdepth = res.pop("AvLocus_Depth")
res.insert(11, "AvLocus_Depth", locdepth)
res["Alt_num"] = res["Alt_num"].apply(convert_lists)
res["Var_Qual"] = round(
    np.square(res["Alt_num"] / res["Read_depth"]) * np.log10(res["Alt_num"]), 2
)
res.to_csv(str(snakemake.output.resout), mode="a", index=False, header=True)

dp_df = pd.read_csv(str(snakemake.input[1]), header=0, sep=",")

# integrate resistance in depth profile
for i in range(res.shape[0]):
    pos = res.loc[i]["Genome_pos"]
    locidx = dp_df.index[dp_df["Position"] == pos].tolist()[0]
    dp_df.at[locidx, "Mutation"] = (
        res.loc[i]["Ref_Codon"] + str(res.loc[i]["Codon_num"]) + res.loc[i]["Alt_Codon"]
    )
    dp_df.at[locidx, "Antibiotic"] = res.loc[i]["Resistance"]
    dp_df.at[locidx, "Var_Qual"] = res.loc[i]["Var_Qual"]
dp_df.to_csv(str(snakemake.output.depthout), mode="a", index=False, header=True)
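The Var_Qual score computed above combines the allele fraction with the absolute alternate-read count. A minimal standalone sketch of the same formula:

```python
import numpy as np

def var_qual(alt_num, read_depth):
    # squared allele fraction, weighted by log10 of the alternate-read count,
    # rounded to two decimals as in the pipeline script
    return round(float(np.square(alt_num / read_depth) * np.log10(alt_num)), 2)

# a variant with 50 of 100 reads supporting the alternate allele
print(var_qual(50, 100))  # → 0.42
```

The score rewards variants that are both well supported in absolute terms and make up a large fraction of the reads at the position.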
import random

perc = 1 - (int(snakemake.params.red) / 100)
if perc == 1:
    with open(str(snakemake.input), "r") as insam, open(
        str(snakemake.output), "w"
    ) as outsam:
        lines = insam.readlines()
        for line in lines:
            outsam.write(line)
else:
    samdict = {}
    with open(str(snakemake.input), "r") as insam, open(
        str(snakemake.output), "w"
    ) as outsam:
        lines = insam.readlines()
        num_lines = len(lines)
        keep = int(num_lines * perc)
        linidx = random.sample(range(6, num_lines - 1), keep)
        linidx.sort()
        for number, line in enumerate(lines):
            samdict[number] = line
            if number < 6:
                outsam.write(line)
        for ran in linidx:
            outsam.write(samdict[ran])
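The read-reduction logic above can be sketched as a standalone function. The fixed six-line header length mirrors the script; in general, SAM header lines are better detected by their leading "@", so this is a simplified sketch:

```python
import random

def reduce_reads(lines, reduction_percent, header_len=6):
    """Keep the SAM header plus a random subset of alignment lines.

    reduction_percent is the percentage of alignment lines to drop.
    """
    frac_keep = 1 - reduction_percent / 100
    if frac_keep >= 1:
        return list(lines)
    header, body = lines[:header_len], lines[header_len:]
    # sorted indices preserve the original read order in the output
    kept_idx = sorted(random.sample(range(len(body)), int(len(body) * frac_keep)))
    return header + [body[i] for i in kept_idx]
```

For example, with a 6-line header and 100 alignment lines, reduce_reads(lines, 50) returns 56 lines: the header plus a random half of the reads.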
import pandas as pd

path_list = list(snakemake.input[0:])
genenames = list(snakemake.params)
outfile = str(snakemake.output)

with open(str(snakemake.output), "w+") as outcsv:
    if len(outcsv.readlines()) == 0:
        outcsv.write("Gene,Position,Depth,Mutation,Antibiotic\n")
    for path in path_list:
        locus_df = pd.read_csv(path, header=0, sep="\t")
        for name in genenames[0]:
            if name in path:
                locus_df.rename(
                    columns={
                        locus_df.columns[0]: "Gene",
                        locus_df.columns[1]: "Position",
                        locus_df.columns[2]: "Depth",
                    },
                    inplace=True,
                )
                locus_df = locus_df.replace("AL123456.3", name, regex=True)
                locus_df["Mutation"] = "-"
                locus_df["Antibiotic"] = "-"
                break
        locus_df.to_csv(outcsv, mode="a", index=False, header=False)
import pandas as pd

with open(str(snakemake.output), "w+") as outcsv:

    sum_df = pd.DataFrame(
        columns=["Sample", "locus", "mutation", "resistance", "quality"]
    )
    for i in list(snakemake.input):
        temp_df = pd.read_csv(i, header=0, sep=",")
        temp_df = temp_df.loc[temp_df["Mut_status"] == "res_mut"]
        temp_df.insert(0, "Sample", i.split("/")[3])
        temp_df = temp_df.drop(columns=["Mut_status"]).rename(
            columns={"Mut_gene": "locus"}
        )
        temp_df.insert(
            2,
            "mutation",
            temp_df["Ref_Codon"]
            + temp_df["Codon_num"].astype(str)
            + temp_df["Alt_Codon"],
        )
        temp_df.insert(3, "resistance", temp_df["Resistance"])
        temp_df.insert(4, "quality", temp_df["Var_Qual"])
        temp_df.drop(temp_df.columns[range(5, 20)], axis=1, inplace=True)
        sum_df = pd.concat([sum_df, temp_df])
    sum_df["resistance"] = sum_df["resistance"].apply(lambda x: x.split(";"))
    sum_df = sum_df.explode("resistance")
    sum_df.to_csv(outcsv, mode="a", index=False, header=True)
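The final split-and-explode step turns one row whose resistance field lists several antibiotics into one row per (sample, antibiotic) pair, which is what the summary heatmap consumes. A minimal sketch of that step in isolation:

```python
import pandas as pd

# one mutation can confer resistance to several antibiotics, joined by ";"
df = pd.DataFrame({"Sample": ["s1"], "resistance": ["RIF;INH"]})
df["resistance"] = df["resistance"].apply(lambda x: x.split(";"))
df = df.explode("resistance")
# → two rows: (s1, RIF) and (s1, INH)
```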
import sys
import vcf
import codpos_genidx as cg
import retrieve_codon as rc
import retrieve_aac as ra

# The Script is meant to be called from the snakemake-rule "create_variant_profile"
# Input are vcf files which are created for every sample
# The Script parses the vcf file and iterates over all variants called
# Two external scripts are called with the variant position
# 1. genomeidx_to_gene gives Codon-infos and gene name
# 2. rescinfo_to_codon gives the whole reference Codon in reading frame
# These information together with various information from the vcf file are gathered
# over all vcf files per sample and safed in a csv-file

with open(str(snakemake.output), "w") as outcsv, open(
    str(snakemake.output).replace(".csv", "_complex.csv"), "w"
) as outcsv2:
    outcsv.writelines(
        "Gene_name,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual\n".replace(
            ",", "\t"
        )
    )
    outcsv2.writelines(
        "Gene_name,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual\n".replace(
            ",", "\t"
        )
    )
    for i in snakemake.input:
        v = vcf.Reader(filename=str(i))
        for recs in v:
            line = []
            res = ""
            cod = cg.genomeidx_to_gene(recs.POS)
            retcod = rc.refcodinfo_to_codon(recs.POS, cod[0][2], cod[1])
            retaac = ra.codon_to_as(retcod)
            varcod = rc.varcodinfo_to_codon(retcod, cod[0][2], recs.ALT, cod[1])
            varaac = ra.codon_to_as(varcod)
            line.extend(
                [
                    cod[1],
                    recs.POS,
                    cod[0][1],
                    cod[0][0],
                    cod[0][2],
                    retcod,
                    retaac,
                    varcod,
                    varaac,
                    recs.INFO["TYPE"],
                ]
            )
            for rec in recs:
                line.extend([rec["DP"], rec["AO"]])
            line.append(int(recs.QUAL))
            for elem in line:
                res += str(elem) + "\t"
            if len(recs.INFO["TYPE"]) == 1 and recs.INFO["TYPE"][0] == "snp":
                outcsv.write(res[:-1] + "\n")
            else:
                outcsv2.write(res[:-1] + "\n")
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os.path import splitext

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
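The wrapper above derives the BWA index prefix by stripping a single extension from the index input (or from the first file when a list of index files is given). A standalone sketch of that logic (function name and example paths are illustrative, not from the workflow):

```python
from os import path

def index_prefix(idx):
    """Mimic the wrapper's prefix derivation: strip one extension
    from the index file, or from the first file of a list."""
    if isinstance(idx, str):
        return path.splitext(idx)[0]
    return path.splitext(idx[0])[0]

# Only the last extension is removed, so a '.fa.bwt' index file
# yields the '.fa' reference prefix that bwa mem expects.
print(index_prefix("resources/genome.fa.bwt"))
print(index_prefix(["ref.amb", "ref.ann", "ref.bwt"]))
```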
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
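A rule that could drive this cutadapt wrapper might look like the following sketch; the sample wildcard, file paths, adapter sequences, and wrapper version are assumptions for illustration, not taken from the workflow:

```python
rule cutadapt_pe:
    input:
        ["data/{sample}_R1.fastq.gz", "data/{sample}_R2.fastq.gz"],
    output:
        fastq1="results/trimmed/{sample}_R1.fastq.gz",
        fastq2="results/trimmed/{sample}_R2.fastq.gz",
        qc="results/trimmed/{sample}.qc.txt",
    params:
        # At least one of 'adapters' or 'extra' must be set,
        # or the wrapper's assertion fails.
        adapters="-a AGATCGGAAGAGC -A AGATCGGAAGAGC",
        extra="--minimum-length 30",
    log:
        "logs/cutadapt/{sample}.log",
    threads: 4
    wrapper:
        "v1.25.0/bio/cutadapt/pe"
```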
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc. Since there can be race conditions if multiple jobs
# use the same output dir, fastqc is run in a temporary directory first.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
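The suffix stripping in `basename_without_ext` determines where fastqc writes its `_fastqc.html`/`_fastqc.zip` outputs inside the temp dir. A standalone sketch of the same behaviour (rewritten as a loop over the suffixes, in the order the wrapper applies them):

```python
import re
from os import path

def basename_without_ext(file_path):
    """Strip the basename's compression suffix first, then the
    format suffix, mirroring fastqc's internal naming."""
    base = path.basename(file_path)
    for ext in (".gz", ".bz2", ".txt", ".fastq", ".fq", ".sam", ".bam"):
        base = re.sub(re.escape(ext) + "$", "", base)
    return base

# Both suffixes of a compressed FastQ file are removed:
print(basename_without_ext("data/sample_R1.fastq.gz"))  # sample_R1
```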
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
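By default the wrapper hands MultiQC the parent directories of the input files rather than the files themselves, deduplicated via a set. A standalone sketch of that collapsing step (function name and example paths are illustrative):

```python
from os import path

def multiqc_inputs(inputs, use_input_files_only=False):
    """Collapse input files to their (deduplicated) parent
    directories, or pass the files through unchanged."""
    if not use_input_files_only:
        return set(path.dirname(fp) for fp in inputs)
    return set(inputs)

files = ["qc/a_fastqc.zip", "qc/b_fastqc.zip", "logs/cutadapt/a.log"]
# Two files share the 'qc' directory, so only two paths remain:
print(sorted(multiqc_inputs(files)))
```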
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts


samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


with tempfile.TemporaryDirectory() as tmpdir:
    tmp_prefix = Path(tmpdir) / "samtools_fastq.sort_"

    shell(
        "samtools sort {samtools_opts} {extra} -T {tmp_prefix} {snakemake.input[0]} {log}"
    )
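A rule invoking this samtools sort wrapper could look like the following sketch; paths, the memory flag, and the wrapper version are assumptions for illustration:

```python
rule samtools_sort:
    input:
        "results/mapped/{sample}.bam",
    output:
        "results/sorted/{sample}.sorted.bam",
    params:
        extra="-m 4G",
    log:
        "logs/samtools_sort/{sample}.log",
    threads: 8
    wrapper:
        "v1.25.0/bio/samtools/sort"
```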
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@.
# One thread is used by samtools index itself;
# the remaining threads are passed as *additional* threads via '-@'.
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
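The thread accounting above is easy to get wrong: samtools counts `-@` threads on top of its main thread. A standalone sketch of the wrapper's computation (function name is illustrative):

```python
def extra_threads(total_threads):
    """The main samtools process counts as one thread; any
    remainder is passed as *additional* threads via '-@'."""
    if total_threads <= 1:
        return ""
    return " -@ {} ".format(total_threads - 1)

# A single allotted thread adds no flag at all:
print(repr(extra_threads(1)))  # ''
print(repr(extra_threads(4)))  # ' -@ 3 '
```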
Maintainers: public
URL: https://github.com/adriandrr/mdr_mtb-master
Name: mdr_mtb-master
Version: 1
Copyright: Public Domain
License: None