Pipeline for Predicting Antibiotic Resistance in Mycobacterium tuberculosis Using Snakemake


This Snakemake-based pipeline predicts antibiotic resistance in Mycobacterium tuberculosis.

Background

Tuberculosis, caused by Mycobacterium tuberculosis, remains a global health concern. To address the emerging problem of resistant strains, various diagnostic methods are needed. Here, we present a pipeline that predicts mycobacterial resistances in an OS-independent, reproducible way.

Structure

[workflow structure diagram: mdrmtbworkflow]

Input

  • FastQ files of Mycobacterium tuberculosis (Illumina-sequenced short reads)

  • The PointFinder database

  • A table of the gene loci to be investigated (names must match those in the PointFinder DB)

  • Several parameters in the config file

  • A samples table listing the sample name and the forward and reverse FastQ files (see config/pep/samples.csv)
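The samples table could look like the following sketch. The column names here are an assumption based on the description above; check config/pep/samples.csv in the repository for the exact header:

```csv
sample_name,fq1,fq2
sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz
sample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz
```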

Usage

Step 1: Obtain a copy of this workflow

Create a new GitHub repository using this workflow as a template. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Setup database

As mutation database, the pointfinder_db database is used (Johnsen et al. 2019, https://bitbucket.org/genomicepidemiology/pointfinder_db/src/master/). Some changes were made to make this database parseable; the modified database files are stored in a folder called database. Rename this folder to 'resources' to match the path configuration of the pipeline. This folder also contains a file called 'gene_loci.csv', which specifies the loci to be investigated by the pipeline. If you change the set of investigated loci, the name of each new locus must be present in the resistens-overview.txt database file from PointFinder.
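A hypothetical sketch of what entries in gene_loci.csv might look like; the exact format is defined by the file shipped in the database folder, and the locus names shown here are merely typical M. tuberculosis resistance loci that appear in the PointFinder database:

```csv
locus
rpoB
katG
embB
```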

Step 3: Setup data

After downloading the code, move the FastQ files to be investigated into a folder called data in the same directory as the Snakefile. The sample names and the names of the individual read files need to be written in samples.csv (config/pep/samples.csv). If all reads should be considered, set the parameter 'reducing' under reduce_reads in the config file (config/config.yaml) to "False".
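A hypothetical sketch of the relevant section of config/config.yaml; the key layout is an assumption based on the parameter names mentioned here and in the read-reduction script (which reads a percentage parameter), so consult the shipped config file for the actual structure:

```yaml
reduce_reads:
  reducing: "False"   # "False" keeps all reads; "True" enables subsampling
  red: 50             # percentage of reads to drop when reducing is enabled
```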

Step 4: Install snakemake

MYCRes is a Snakemake-based workflow, so Snakemake must be installed before the pipeline can be started. We used Snakemake version 7.18.2 in a conda environment, which can be created as follows:

conda create --name sm
conda activate sm
conda install -c conda-forge -c bioconda snakemake=7.18.2

Step 5: Start snakemake

Start the workflow with the command "snakemake --cores all --use-conda" from the directory containing the Snakefile. A Snakemake report.html file is created automatically in the results folder.

Output

As output, an interactive graphic shows all resistance-causing mutations that were found, along with coverage details for each mutation and gene locus.

[example interactive visualization]

Code Snippets

shell:
    "freebayes -f {input.fna} --region {params.region} {input.bam} |"
    " vcffilter -f {params.filter} | vcfallelicprimitives -kg > {output}"
script:
    "../scripts/sum_vars.py"
script:
    "../scripts/compare_variants.py"
script:
    "../scripts/sum_depths.py"
script:
    "../scripts/integr_res.py"
script:
    "../scripts/sum_res.py"
script:
    "../scripts/altair_plot_single.py"
script:
    "../scripts/altair_plot_list.py"
script:
    "../scripts/altair_summary.py"
script:
    "../scripts/covsum_to_html.py"
shell:
    "snakemake --nolock --report {output}"
shell:
    "snakemake --nolock --report {output}"
shell:
    "curl -SL -o {output} {params.mtb_genome} 2> {log}"
shell:
    "gzip -dkf {input} > {output}"
shell:
    "samtools faidx {input} > {output}"
wrapper:
    "v1.14.1/bio/bwa/index"
wrapper:
    "v1.14.1/bio/bwa/mem"
wrapper:
    "v1.14.1/bio/samtools/sort"
wrapper:
    "v1.17.2/bio/samtools/index"
wrapper:
    "v1.14.1/bio/fastqc"
wrapper:
    "v1.14.1/bio/multiqc"
wrapper:
    "v1.14.1/bio/fastqc"
wrapper:
    "v1.14.1/bio/multiqc"
shell:
    "samtools depth -H -d 1000000 -r {params.region} -o {output} {input.bam}"
shell:
    "(samtools coverage -r {params.region} -o {output} {input.bam} &&"
    " sed -i 's/AL123456.3/{wildcards.loci}/' {output})"
From line 98 of rules/qc.smk
shell:
    "cat {input} >> {output} ; "
    "echo -ne '\n' >> {output} ; "
    "sed -i '1!{{/^#rname/d;}}' {output}"
From line 120 of rules/qc.smk
wrapper:
    "v1.14.1/bio/samtools/sort"
shell:
    "samtools view -h -o {output} {input}"
script:
    "../scripts/rremove_reads.py"
shell:
    "samtools view -bS {input} > {output}"
wrapper:
    "v1.14.1/bio/cutadapt/pe"
import pandas as pd
import altair as alt
import warnings
from altair.expr import datum

alt.data_transformers.disable_max_rows()
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

abresprofile = snakemake.input.res
depthprofile = snakemake.input.depth

lis = []
for abres, depth in zip(abresprofile, depthprofile):
    ab_df = pd.read_csv(abres, header=0, sep=",")
    dp_df = pd.read_csv(depth, header=0, sep=",")

    dp_df["Gene"] = dp_df["Gene"].str.split("_").str[0]
    ab_df["Mut_gene"] = ab_df["Mut_gene"].str.split("_").str[0]
    lsmut = []
    lslocmut = []
    for i in range(ab_df.shape[0]):
        lsmut.append(
            ab_df.loc[i]["Ref_Codon"]
            + str(ab_df.loc[i]["Codon_num"])
            + ab_df.loc[i]["Alt_Codon"]
        )
        lslocmut.append(
            ab_df.loc[i]["Mut_gene"]
            + "_"
            + ab_df.loc[i]["Ref_Codon"]
            + str(ab_df.loc[i]["Codon_num"])
            + ab_df.loc[i]["Alt_Codon"]
        )
    ab_df["Mutation"] = lsmut
    ab_df["LocMutation"] = lslocmut

    barchart = (
        alt.Chart(dp_df)
        .mark_bar(color="#1f77b4")
        .encode(
            x=alt.X("Gene:O", title=None, axis=alt.Axis(labelAngle=-45)),
            y=alt.Y("mean(Depth)", scale=alt.Scale(type="log")),
        )
        .properties(title="Mean Depth per locus", width=800, height=200)
    )

    base2 = (
        alt.Chart(ab_df)
        .transform_calculate(color='datum.Var_Qual >= 1 ? "red" : "orange"',)
        .encode(
            x=alt.X(
                "Resistance",
                sort=["Resistance"],
                axis=alt.Axis(
                    orient="top",
                    labelAngle=35,
                    labelFontSize=10,
                    tickWidth=0,
                    domain=False,
                ),
            ),
        )
        .properties(width=360, height=100)
    )

    res = (
        base2.mark_circle(size=1800).encode(color=alt.Color("color:N", scale=None))
        + base2.mark_text(size=12).encode(text="Mut_gene")
        + base2.mark_text(dy=28, size=7).encode(text="Mutation")
    ).transform_filter(datum.Mut_status == "res_mut")

    qual = (
        alt.Chart(ab_df)
        .mark_bar(color="#1f77b4")
        .encode(
            x=alt.X(
                "LocMutation",
                sort=["Resistance"],
                axis=alt.Axis(
                    orient="top", labelAngle=35, title="Mutation specifics", tickWidth=0
                ),
            ),
            y=alt.Y(
                "Read_depth",
                scale=alt.Scale(type="log"),
                axis=alt.Axis(title="Read depth"),
            ),
            tooltip=["Var_Qual", "Read_depth"],
        )
        .properties(width=360, height=100)
        .transform_filter(datum.Mut_status == "res_mut")
    )

    lis.append(
        alt.vconcat(
            (res | qual),
            barchart,
            title=alt.TitleParams(
                text=str(snakemake.params)
                + " resistance plot -"
                + abres.split("/")[1]
                + "%",
                anchor="middle",
                fontWeight="bold",
                fontSize=20,
            ),
        )
    )
a = alt.vconcat(*lis)
a.save(snakemake.output[0])
import altair as alt
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
alt.data_transformers.disable_max_rows()

abresprofile = str(snakemake.input[0])
depthprofile = str(snakemake.input[1])

ab_df = pd.read_csv(abresprofile, header=0, sep=",")
dp_df = pd.read_csv(depthprofile, header=0, sep=",")

dp_df["Gene"] = dp_df["Gene"].str.split("_").str[0]
ab_df["Mut_gene"] = ab_df["Mut_gene"].str.split("_").str[0]
lsmut = []
lslocmut = []
for i in range(ab_df.shape[0]):
    lsmut.append(
        ab_df.loc[i]["Ref_Codon"]
        + str(ab_df.loc[i]["Codon_num"])
        + ab_df.loc[i]["Alt_Codon"]
    )
    lslocmut.append(
        ab_df.loc[i]["Mut_gene"]
        + "_"
        + ab_df.loc[i]["Ref_Codon"]
        + str(ab_df.loc[i]["Codon_num"])
        + ab_df.loc[i]["Alt_Codon"]
    )
ab_df["Mutation"] = lsmut
ab_df["LocMutation"] = lslocmut
ab_df = ab_df[ab_df["Mut_status"] == "res_mut"]

selector = alt.selection_single(empty="all", fields=["Gene"])
intervall_selector = alt.selection_interval(empty="all", encodings=["x"])

base = alt.Chart(dp_df).transform_calculate(
    color='datum.Antibiotic !== "-" ? "red" : "lightgrey"',
)

base_selector = base.add_selection(selector)

barchart = (
    base_selector.mark_bar()
    .encode(
        x=alt.X("Gene:O", title=None, axis=alt.Axis(labelAngle=-45)),
        y=alt.Y("mean(Depth):Q", scale=alt.Scale(type="log")),
        tooltip=alt.Tooltip(["mean(Depth)"], format=".2f"),
        color=alt.condition(selector, alt.value("#1f77b4"), alt.value("lightgrey")),
    )
    .properties(title="Mean Depth per locus", width=800, height=200)
)

poschart = (
    base_selector.mark_bar()
    .encode(x="Position", y="Depth", color=alt.Color("color:N", scale=None),)
    .transform_filter(selector)
    .add_selection(intervall_selector)
    .properties(title="Locus specific depth", width=360, height=200)
)

zoom_chart = (
    base.mark_bar()
    .encode(
        x="Position",
        y="Depth",
        color=alt.Color("color:N", scale=None),
        tooltip=["Mutation:N", "Antibiotic:N", "Depth"],
    )
    .transform_filter(intervall_selector)
    .properties(title="Zoomed locus specific depth", width=360, height=200)
)


base2 = (
    alt.Chart(ab_df)
    .transform_calculate(color='datum.Var_Qual >= 1 ? "red" : "orange"',)
    .encode(
        x=alt.X(
            "Resistance",
            sort=["Resistance"],
            axis=alt.Axis(
                orient="top", labelAngle=35, labelFontSize=10, tickWidth=0, domain=False
            ),
        ),
        tooltip=["AvLocus_Depth", "Read_depth", "Alt_num", "Var_Qual"],
    )
    .properties(width=360, height=100)
)

res = (
    base2.mark_circle(size=1800).encode(color=alt.Color("color:N", scale=None))
    + base2.mark_text(size=12).encode(text="Mut_gene")
    + base2.mark_text(dy=28, size=7).encode(text="Mutation")
)

qual = (
    alt.Chart(ab_df)
    .mark_bar(color="#1f77b4")
    .encode(
        x=alt.X(
            "LocMutation",
            sort=["Resistance"],
            axis=alt.Axis(
                orient="top", labelAngle=35, title="Mutation specifics", tickWidth=0
            ),
        ),
        y=alt.Y(
            "Read_depth", scale=alt.Scale(type="log"), axis=alt.Axis(title="Read depth")
        ),
        tooltip=["Var_Qual", "Read_depth"],
    )
    .properties(width=360, height=100)
)

plot = alt.vconcat(
    (res | qual),
    barchart,
    (poschart | zoom_chart),
    title=alt.TitleParams(
        text=str(snakemake.params) + " resistance plot",
        anchor="middle",
        fontWeight="bold",
        fontSize=20,
    ),
)
plot.save(snakemake.output[0], embed_options={"renderer": "svg"})
import pandas as pd
import altair as alt
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
alt.data_transformers.disable_max_rows()

dfres = pd.read_csv(str(snakemake.input), header=0, sep=",")
dfres["Sample"] = dfres["Sample"].str.split("_").str[0]
dfres["resistance"] = dfres["resistance"].replace(
    {
        "RIFAMPICIN": "RIF",
        "ISONIAZID": "INH",
        "KANAMYCIN": "KAN",
        "PYRAZINAMIDE": "PZA",
        "ETHAMBUTOL": "EMB",
        "ETHIONAMIDE": "ETH",
        "STREPTOMYCIN": "SM",
        "FLUOROQUINOLONE": "FQ",
    }
)
sorting = [
    "RIF",
    "INH",
    "PZA",
    "EMB",
    "SM",
    "FQ",
    "KAN",
    "AMI",
    "CAP",
    "ETH",
    "LIN",
    "BDQ",
    "CFZ",
]

plot = (
    alt.Chart(dfres)
    .mark_rect(stroke="black", strokeWidth=0.1)
    .encode(
        y=alt.Y("Sample:O", sort=dfres["Sample"].to_list()),
        x=alt.X(
            "resistance:O",
            axis=alt.Axis(orient="top", title="Resistotype", tickWidth=0),
            sort=sorting,
        ),
        color=alt.Color(
            "quality",
            scale=alt.Scale(
                domain=[0, 1, 2, 3],
                range=["lightgrey", "yellow", "orange", "red"],
                interpolate="rgb",
            ),
            title="Variant quality",
        ),
        tooltip=["locus", "mutation", "quality"],
    )
    .properties(title="Resistances MYCRes")
)

plot.save(snakemake.output[0], embed_options={"renderer": "svg"})
import pandas as pd

RES_DB_PATH = "resources/pointfinder_db/mycobacterium_tuberculosis/resistens-overview.txt"
STOP_DB_PATH = "resources/pointfinder_db/mycobacterium_tuberculosis/stop-overview.txt"

def Average(lst):
    return sum(lst) / len(lst)

def read_resistance_db(path):
    if "resistens-overview.txt" in path:
        db = pd.read_csv(path,header=0,sep="\t",)
        db["PMID"] = db["PMID"].str.replace(",", ";")
        db["Resistance"] = db["Resistance"].str.replace(",", ";")
        return db
    elif "stop-overview.txt" in path:
        stopcodonlist = ["TAG", "TAA", "TGA"]
        db = pd.read_csv(path,header=0,sep="\t",)
        return db,stopcodonlist
    else:
        return "ERROR: Wrong database called"

def comparison_method(gene_or_codon):
    if gene_or_codon == "gene":
        pd_col = 2
    elif gene_or_codon == "codon":
        pd_col = 3
    return pd_col

def create_compare_lists(comp, sumvar, gene_df):
    index = sumvar.loc[i][4] - 1
    hit_df = gene_df.loc[gene_df["Codon_pos"] == int(sumvar.loc[i][comp])] if isinstance(comp, int) else None
    if comp == 2:
        complist, mutlist = [sumvar.loc[i][5][index],sumvar.loc[i][7][index]], list(hit_df.loc[:,"Ref_codon":"Res_codon"].values)[0]
    elif comp == 3:
        complist, mutlist = list(sumvar.loc[i,"Ref_Codon":"Alt_aas"].values), list(hit_df.loc[:,"Ref_nuc":"Res_codon"].values)[0]
        del complist[2]
    elif comp == "negative":
        ref_cod, alt_cod = list(sumvar.loc[i][5]), list(sumvar.loc[i][7])
        ref_cod.reverse()
        alt_cod.reverse()
        hit_df = gene_df.loc[gene_df["Codon_pos"] == int(sumvar.loc[i][2])]
        complist, mutlist = [ref_cod[index],alt_cod[index]], list(hit_df.loc[:,"Ref_codon":"Res_codon"].values)[0]
    return complist, mutlist, hit_df

def prepare_output(i,sumvar):
    res = ""
    for elem in sumvar.loc[i,].tolist():
        res += str(elem)+","
    return res

def process_res_mut(rtype, hit_df, res):
    if rtype == "res":
        for elem in hit_df.loc[:,"Resistance":"PMID"].values[0]:
            res += str(elem)+","
        outcsv.write("res_mut,{}\n".format(res.rstrip(",")))
    elif rtype == "no_res":
        outcsv.write("no_res_mut,{},-,-\n".format(res.rstrip(",")))

resdb = read_resistance_db(RES_DB_PATH)
stopdb,stopcodonlist = read_resistance_db(STOP_DB_PATH)
sumvar = pd.read_csv(str(snakemake.input), header=0, sep="\t")


with open(str(snakemake.output), "w") as outcsv:
    outcsv.writelines(
        "Mut_status,Mut_gene,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual,Resistance,PMID\n"
    )
    for i in range(sumvar.shape[0]):
        lenlst = []
        gene_df = resdb.loc[resdb["#Gene_ID"] == sumvar.loc[i][0]]
        for col in gene_df:
            lenlst.append(gene_df["Ref_nuc"].apply(len).mean())
        if Average(lenlst) == 1:
            comp = comparison_method("gene")
            for j in gene_df["Codon_pos"].unique():
                if int(j) == int(sumvar.loc[i][comp]):
                    complist,mutlist,hit_df = create_compare_lists(comp, sumvar, gene_df)
                    res = prepare_output(i,sumvar)
                    if mutlist[0] == complist[0] and complist[1] in list(mutlist[1]):
                        process_res_mut("res", hit_df, res)
                    else:
                        process_res_mut("no_res", hit_df, res)
        else:
            comp = comparison_method("codon")
            for j in gene_df["Codon_pos"].unique():
                if int(j) == int(sumvar.loc[i][comp]):
                    complist,mutlist,hit_df = create_compare_lists(comp, sumvar, gene_df)
                    res = prepare_output(i,sumvar)
                    if mutlist[0] == complist[0] and mutlist[1] == complist[1] and complist[2] in list(mutlist[2]):
                        process_res_mut("res", hit_df, res)
                    else:
                        process_res_mut("no_res", hit_df, res)       
            if not gene_df[gene_df["Codon_pos"] < 0].empty:
                comp = comparison_method("gene")
                gene_df = gene_df[gene_df["Codon_pos"] < 0]            
                for j in gene_df["Codon_pos"].unique():
                    if int(j) == int(sumvar.loc[i][comp]):
                        complist,mutlist,hit_df = create_compare_lists("negative", sumvar, gene_df)
                        res = prepare_output(i,sumvar)
                        if mutlist[0] == complist[0] and complist[1] in list(mutlist[1]):
                            process_res_mut("res", hit_df, res)
                        else:
                            process_res_mut("no_res", hit_df, res)

    for i in range(stopdb.shape[0]):
        var_df = sumvar.loc[sumvar["Gene_name"] == stopdb.loc[i][0]]
        for j in var_df["Alt_Codon"].values:
            if j in stopcodonlist:
                res = var_df[
                    (var_df["Gene_name"] == stopdb.loc[i][0]) & (var_df["Alt_Codon"] == j)
                ].to_string(header=False, index=False)
                outcsv.write("res_mut\t{}".format(res.replace(" ", ",")))
import pandas as pd

depthsum = snakemake.input
tabtitle = str(snakemake.params) + " coverage summary"

frames = []
for sum in depthsum:
    df = pd.read_csv(sum, header=0, sep="\t")
    df["#rname"] = df["#rname"].str.split("_").str[0]
    df["startpos"] = df["endpos"] - df["startpos"] + 1
    df = df.drop(["endpos"], axis=1)
    df = df.rename(columns={"#rname": "locus", "startpos": "numpos"})
    df.insert(1, "read_reduction", str(sum).split("/")[1])
    frames.append(df)

dfall = pd.concat(frames).sort_values(["locus", "read_reduction"])


with open(str(snakemake.output), "w") as outhtml:
    outhtml.write(tabtitle)
    dfall.to_html(buf=outhtml, header=True, index=False, justify="center")
import pandas as pd
import numpy as np
import ast

ab_df = pd.read_csv(str(snakemake.input[0]), header=0, sep=",")
dp_df = pd.read_csv(str(snakemake.input[1]), header=0, sep=",")

# drops all mutation with a positional read depth below 8
ab_df = ab_df.drop(ab_df[ab_df.Read_depth < 8].index)

# converts columns which have int and lists as values
def convert_lists(value):
    try:
        value = int(value)
    except:
        value = sum(ast.literal_eval(value))
    return value


# integrate locus average in abresprofile
ls = []
gene_loci = dp_df["Gene"].drop_duplicates().to_list()
for gene in gene_loci:
    gene_df = dp_df.loc[dp_df["Gene"] == gene]
    ls.append([gene, round(gene_df["Depth"].mean())])
dp_df = pd.DataFrame(ls)
dp_df = dp_df.rename(columns={0: "Mut_gene", 1: "AvLocus_Depth"})
res = ab_df.merge(dp_df, on="Mut_gene")
locdepth = res.pop("AvLocus_Depth")
res.insert(11, "AvLocus_Depth", locdepth)
res["Alt_num"] = res["Alt_num"].apply(convert_lists)
res["Var_Qual"] = round(
    np.square(res["Alt_num"] / res["Read_depth"]) * np.log10(res["Alt_num"]), 2
)
res.to_csv(str(snakemake.output.resout), mode="a", index=False, header=True)

dp_df = pd.read_csv(str(snakemake.input[1]), header=0, sep=",")

# integrate resistance in depth profile
for i in range(res.shape[0]):
    pos = res.loc[i]["Genome_pos"]
    locidx = dp_df.index[dp_df["Position"] == pos].tolist()[0]
    dp_df.at[locidx, "Mutation"] = (
        res.loc[i]["Ref_Codon"] + str(res.loc[i]["Codon_num"]) + res.loc[i]["Alt_Codon"]
    )
    dp_df.at[locidx, "Antibiotic"] = res.loc[i]["Resistance"]
    dp_df.at[locidx, "Var_Qual"] = res.loc[i]["Var_Qual"]
dp_df.to_csv(str(snakemake.output.depthout), mode="a", index=False, header=True)
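The Var_Qual score computed above combines the allele fraction with the absolute alternate-read count. A minimal standalone sketch of the same formula:

```python
import numpy as np

def var_qual(alt_num, read_depth):
    # squared allele fraction, weighted by log10 of the alternate-read count,
    # rounded to two decimals as in the pipeline script
    return round(float(np.square(alt_num / read_depth) * np.log10(alt_num)), 2)

# a variant with 50 of 100 reads supporting the alternate allele
print(var_qual(50, 100))  # → 0.42
```

The score rewards variants that are both well supported in absolute terms and make up a large fraction of the reads at the position.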
import random

perc = 1 - (int(snakemake.params.red) / 100)
if perc == 1:
    with open(str(snakemake.input), "r") as insam, open(
        str(snakemake.output), "w"
    ) as outsam:
        lines = insam.readlines()
        for line in lines:
            outsam.write(line)
else:
    samdict = {}
    with open(str(snakemake.input), "r") as insam, open(
        str(snakemake.output), "w"
    ) as outsam:
        lines = insam.readlines()
        num_lines = len(lines)
        keep = int(num_lines * perc)
        linidx = random.sample(range(6, num_lines - 1), keep)
        linidx.sort()
        for number, line in enumerate(lines):
            samdict[number] = line
            if number < 6:
                outsam.write(line)
        for ran in linidx:
            outsam.write(samdict[ran])
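The read-reduction logic above can be sketched as a standalone function. The fixed six-line header length mirrors the script; in general, SAM header lines are better detected by their leading "@", so this is a simplified sketch:

```python
import random

def reduce_reads(lines, reduction_percent, header_len=6):
    """Keep the SAM header plus a random subset of alignment lines.

    reduction_percent is the percentage of alignment lines to drop.
    """
    frac_keep = 1 - reduction_percent / 100
    if frac_keep >= 1:
        return list(lines)
    header, body = lines[:header_len], lines[header_len:]
    # sorted indices preserve the original read order in the output
    kept_idx = sorted(random.sample(range(len(body)), int(len(body) * frac_keep)))
    return header + [body[i] for i in kept_idx]
```

For example, with a 6-line header and 100 alignment lines, reduce_reads(lines, 50) returns 56 lines: the header plus a random half of the reads.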
import pandas as pd

path_list = list(snakemake.input[0:])
genenames = list(snakemake.params)
outfile = str(snakemake.output)

with open(str(snakemake.output), "w+") as outcsv:
    if len(outcsv.readlines()) == 0:
        outcsv.write("Gene,Position,Depth,Mutation,Antibiotic\n")
    for path in path_list:
        locus_df = pd.read_csv(path, header=0, sep="\t")
        for name in genenames[0]:
            if name in path:
                locus_df.rename(
                    columns={
                        locus_df.columns[0]: "Gene",
                        locus_df.columns[1]: "Position",
                        locus_df.columns[2]: "Depth",
                    },
                    inplace=True,
                )
                locus_df = locus_df.replace("AL123456.3", name, regex=True)
                locus_df["Mutation"] = "-"
                locus_df["Antibiotic"] = "-"
                break
        locus_df.to_csv(outcsv, mode="a", index=False, header=False)
import pandas as pd

with open(str(snakemake.output), "w+") as outcsv:

    sum_df = pd.DataFrame(
        columns=["Sample", "locus", "mutation", "resistance", "quality"]
    )
    for i in list(snakemake.input):
        temp_df = pd.read_csv(i, header=0, sep=",")
        temp_df = temp_df.loc[temp_df["Mut_status"] == "res_mut"]
        temp_df.insert(0, "Sample", i.split("/")[3])
        temp_df = temp_df.drop(columns=["Mut_status"]).rename(
            columns={"Mut_gene": "locus"}
        )
        temp_df.insert(
            2,
            "mutation",
            temp_df["Ref_Codon"]
            + temp_df["Codon_num"].astype(str)
            + temp_df["Alt_Codon"],
        )
        temp_df.insert(3, "resistance", temp_df["Resistance"])
        temp_df.insert(4, "quality", temp_df["Var_Qual"])
        temp_df.drop(temp_df.columns[range(5, 20)], axis=1, inplace=True)
        sum_df = pd.concat([sum_df, temp_df])
    sum_df["resistance"] = sum_df["resistance"].apply(lambda x: x.split(";"))
    sum_df = sum_df.explode("resistance")
    sum_df.to_csv(outcsv, mode="a", index=False, header=True)
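The final split-and-explode step turns one row whose resistance field lists several antibiotics into one row per (sample, antibiotic) pair, which is what the summary heatmap consumes. A minimal sketch of that step in isolation:

```python
import pandas as pd

# one mutation can confer resistance to several antibiotics, joined by ";"
df = pd.DataFrame({"Sample": ["s1"], "resistance": ["RIF;INH"]})
df["resistance"] = df["resistance"].apply(lambda x: x.split(";"))
df = df.explode("resistance")
# → two rows: (s1, RIF) and (s1, INH)
```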
import sys
import vcf
import codpos_genidx as cg
import retrieve_codon as rc
import retrieve_aac as ra

# The Script is meant to be called from the snakemake-rule "create_variant_profile"
# Input are vcf files which are created for every sample
# The Script parses the vcf file and iterates over all variants called
# Two external scripts are called with the variant position
# 1. genomeidx_to_gene gives Codon-infos and gene name
# 2. rescinfo_to_codon gives the whole reference Codon in reading frame
# These information together with various information from the vcf file are gathered
# over all vcf files per sample and safed in a csv-file

with open(str(snakemake.output), "w") as outcsv, open(
    str(snakemake.output).replace(".csv", "_complex.csv"), "w"
) as outcsv2:
    outcsv.writelines(
        "Gene_name,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual\n".replace(
            ",", "\t"
        )
    )
    outcsv2.writelines(
        "Gene_name,Genome_pos,Gene_pos,Codon_num,Codon_pos,Ref_Codon,Ref_aas,Alt_Codon,Alt_aas,Var_type,Read_depth,Alt_num,Var_Qual\n".replace(
            ",", "\t"
        )
    )
    for i in snakemake.input:
        v = vcf.Reader(filename=str(i))
        for recs in v:
            line = []
            res = ""
            cod = cg.genomeidx_to_gene(recs.POS)
            retcod = rc.refcodinfo_to_codon(recs.POS, cod[0][2], cod[1])
            retaac = ra.codon_to_as(retcod)
            varcod = rc.varcodinfo_to_codon(retcod, cod[0][2], recs.ALT, cod[1])
            varaac = ra.codon_to_as(varcod)
            line.extend(
                [
                    cod[1],
                    recs.POS,
                    cod[0][1],
                    cod[0][0],
                    cod[0][2],
                    retcod,
                    retaac,
                    varcod,
                    varaac,
                    recs.INFO["TYPE"],
                ]
            )
            for rec in recs:
                line.extend([rec["DP"], rec["AO"]])
            line.append(int(recs.QUAL))
            for elem in line:
                res += str(elem) + "\t"
            if len(recs.INFO["TYPE"]) == 1 and recs.INFO["TYPE"][0] == "snp":
                outcsv.write(res[:-1] + "\n")
            else:
                outcsv2.write(res[:-1] + "\n")
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os.path import splitext

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Contrunction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
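The wrapper above derives the BWA index prefix by stripping a single extension from the index input (or from the first file when a list of index files is given). A standalone sketch of that logic (function name and example paths are illustrative, not from the workflow):

```python
from os import path

def index_prefix(idx):
    """Mimic the wrapper's prefix derivation: strip one extension
    from the index file, or from the first file of a list."""
    if isinstance(idx, str):
        return path.splitext(idx)[0]
    return path.splitext(idx[0])[0]

# Only the last extension is removed, so a '.fa.bwt' index file
# yields the '.fa' reference prefix that bwa mem expects.
print(index_prefix("resources/genome.fa.bwt"))
print(index_prefix(["ref.amb", "ref.ann", "ref.bwt"]))
```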
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " --cores {snakemake.threads}"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
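A rule that could drive this cutadapt wrapper might look like the following sketch; the sample wildcard, file paths, adapter sequences, and wrapper version are assumptions for illustration, not taken from the workflow:

```python
rule cutadapt_pe:
    input:
        ["data/{sample}_R1.fastq.gz", "data/{sample}_R2.fastq.gz"],
    output:
        fastq1="results/trimmed/{sample}_R1.fastq.gz",
        fastq2="results/trimmed/{sample}_R2.fastq.gz",
        qc="results/trimmed/{sample}.qc.txt",
    params:
        # At least one of 'adapters' or 'extra' must be set,
        # or the wrapper's assertion fails.
        adapters="-a AGATCGGAAGAGC -A AGATCGGAAGAGC",
        extra="--minimum-length 30",
    log:
        "logs/cutadapt/{sample}.log",
    threads: 4
    wrapper:
        "v1.25.0/bio/cutadapt/pe"
```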
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc. Since there can be race conditions if multiple jobs
# use the same output dir, fastqc is run in a temporary directory first.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
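The suffix stripping in `basename_without_ext` determines where fastqc writes its `_fastqc.html`/`_fastqc.zip` outputs inside the temp dir. A standalone sketch of the same behaviour (rewritten as a loop over the suffixes, in the order the wrapper applies them):

```python
import re
from os import path

def basename_without_ext(file_path):
    """Strip the basename's compression suffix first, then the
    format suffix, mirroring fastqc's internal naming."""
    base = path.basename(file_path)
    for ext in (".gz", ".bz2", ".txt", ".fastq", ".fq", ".sam", ".bam"):
        base = re.sub(re.escape(ext) + "$", "", base)
    return base

# Both suffixes of a compressed FastQ file are removed:
print(basename_without_ext("data/sample_R1.fastq.gz"))  # sample_R1
```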
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
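By default the wrapper hands MultiQC the parent directories of the input files rather than the files themselves, deduplicated via a set. A standalone sketch of that collapsing step (function name and example paths are illustrative):

```python
from os import path

def multiqc_inputs(inputs, use_input_files_only=False):
    """Collapse input files to their (deduplicated) parent
    directories, or pass the files through unchanged."""
    if not use_input_files_only:
        return set(path.dirname(fp) for fp in inputs)
    return set(inputs)

files = ["qc/a_fastqc.zip", "qc/b_fastqc.zip", "logs/cutadapt/a.log"]
# Two files share the 'qc' directory, so only two paths remain:
print(sorted(multiqc_inputs(files)))
```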
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts


samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


with tempfile.TemporaryDirectory() as tmpdir:
    tmp_prefix = Path(tmpdir) / "samtools_fastq.sort_"

    shell(
        "samtools sort {samtools_opts} {extra} -T {tmp_prefix} {snakemake.input[0]} {log}"
    )
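A rule invoking this samtools sort wrapper could look like the following sketch; paths, the memory flag, and the wrapper version are assumptions for illustration:

```python
rule samtools_sort:
    input:
        "results/mapped/{sample}.bam",
    output:
        "results/sorted/{sample}.sorted.bam",
    params:
        extra="-m 4G",
    log:
        "logs/samtools_sort/{sample}.log",
    threads: 8
    wrapper:
        "v1.25.0/bio/samtools/sort"
```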
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@.
# One thread is used by samtools index itself;
# the remaining threads are passed as *additional* threads via '-@'.
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
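The thread accounting above is easy to get wrong: samtools counts `-@` threads on top of its main thread. A standalone sketch of the wrapper's computation (function name is illustrative):

```python
def extra_threads(total_threads):
    """The main samtools process counts as one thread; any
    remainder is passed as *additional* threads via '-@'."""
    if total_threads <= 1:
        return ""
    return " -@ {} ".format(total_threads - 1)

# A single allotted thread adds no flag at all:
print(repr(extra_threads(1)))  # ''
print(repr(extra_threads(4)))  # ' -@ 3 '
```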
Maintainers: public
URL: https://github.com/adriandrr/mdr_mtb-master
Name: mdr_mtb-master
Version: 1
Copyright: Public Domain
License: None