Anlysis of pharmacogenomic targets from Genomic Medicine Swedens heamatology panel.

public public 1yr ago 0 bookmarks

This workflow is designed to generate sample-specific report with clinical guidelines dependent on variants detected in a Genomic Medicine Sweden sequencing panel.

It is designed to be coupled to an existing pipeline and need to be given the location of analysis ready bam files, path to which is specified in data/example_config .

Setup

Set paths in data/example_congif.yaml for reference_fasta and dbsnp to suitable files, as well as bam_location to input-pattern to bamfiles. specify the samples and sequencerun desired.

If using singularity with snakemake please run script envs/get_containers.sh

This was tested using:

  • Snakemake 5.6.0

  • Hg19 reference genome

  • dbsnp build 138

Code Snippets

13
14
15
16
17
18
19
20
21
shell:
    """
    gatk VariantAnnotator \
        -R {params.ref} \
        -V {input.vcf} \
        -I {input.bam} \
        -O {output.vcf} \
        --dbsnp {params.dbsnp}
    """
11
12
13
14
15
16
17
18
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --padding={params.padding} \
        --format='bed'
    """
33
34
35
36
37
shell:
    """
    samtools view -b {input.bam} -L {input.region_list} > {output.bam}
    samtools index {output.bam}
    """
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
from pysam import VariantFile
import argparse
import sys


def filter_variants(vcf, read_ratio, depth, output):
    """
    Soft filter all variants with suspicious read ratio and insufficient read-depth
    """
    vcf_in = VariantFile(vcf)
    new_header = vcf_in.header
    new_header.filters.add(f"AR{read_ratio}", None, None,
                           f"Ratio of ref/alt reads lower than {read_ratio}")
    new_header.filters.add(f"DP{depth}", None, None, f"DP is lower than {depth}x")
    vcf_out = VariantFile(output, "w", header=new_header)

    for record in vcf_in.fetch():
        ad = record.samples[0]["AD"]
        #  No multiallelic split

        if record.info["DP"] < depth:
            record.filter.add("DP100")

        elif len(ad) == 2:
            n_ref, n_alt = ad
            if n_alt / (n_ref + n_alt) < read_ratio:
                record.filter.add(f"AR{read_ratio}")
            else:
                record.filter.add("PASS")
        vcf_out.write(record)


def main():
    parser = argparse.ArgumentParser(
        description="Filter variants on depth and read ratio"
    )
    parser.add_argument("--input_vcf", type=str)
    parser.add_argument("--read_ratio", type=float)
    parser.add_argument("--depth", type=int)
    parser.add_argument("--output_file", type=str)

    args = parser.parse_args(sys.argv[1:])

    input_vcf = args.input_vcf
    read_ratio = args.read_ratio
    depth = args.depth
    output_file = args.output_file

    filter_variants(input_vcf, read_ratio, depth, output_file)


if __name__ == '__main__':
    main()
13
14
15
16
17
18
19
20
shell:
    """
    python3 {params.script_location}/src/Filtering/variant_filtration.py \
        --input_vcf={input.vcf} \
        --read_ratio={params.read_ratio} \
        --depth={params.DP} \
        --output_file={output.filtered_vcf} 
    """
17
18
19
20
21
22
23
24
25
26
shell:
    """
    python3 {params.script_location}/src/Summary/get_possible_diplotypes.py \
        --variant_csv {input.found_variants} \
        --haplotype_definitions {params.haplotype_definitions} \
        --clinical_guidelines {params.clinical_guidelines} \
        --haplotype_activity {params.haplotype_activity} \
        --output {output.csv} \
        --hidden_haplotypes {params.hidden_haplotypes}
    """
39
40
41
42
43
44
45
shell:
    """
    python3 {params.script_location}/src/Summary/get_interaction_guidelines.py \
        --diploids {input.diploids} \
        --interaction_guidelines {params.interacting_targets} \
        --output {output.csv}
    """
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
shell:
    """
    wkdir=$(pwd)  # Needed since Rscript will set wd to location of file not session
    intdir=$(echo {output.html} | head -c -6)
    Rscript \
        -e ".libPaths('/lib/rlib'); library(rmdformats); rmarkdown::render('{params.script_location}/src/Report/generate_sample_report.Rmd', output_file='$wkdir/{output.html}', output_format=c('readthedown'), intermediates_dir='$wkdir/$intdir')" \
        --args --title='Farmakogenomisk analys av {wildcards.sample}' --author=joel \
        --found_variants=$wkdir/{input.found_variants} \
        --missed_variants=$wkdir/{input.missed_variants}  \
        --haplotype_definitions={params.haplotype_definitions} \
        --clinical_guidelines=$wkdir/{input.diploids} \
        --interaction_guidelines=$wkdir/{input.interactions} \
        --data_location={params.script_location}/data \
        --depth_file=$wkdir/{input.depth_at_baits} \
        --sample={wildcards.sample} \
        --seqid={wildcards.seqID} \
        --dbsnp=$(basename {params.dbsnp}) \
        --ref=$(basename {params.ref}) \
        --name="{params.name}" \
        --adress="{params.adress}" \
        --mail="{params.mail}" \
        --phone="{params.phone}"

        rmdir $wkdir/$intdir
    """
13
14
15
16
17
18
19
shell:
     """
     python3 {params.script_location}/src/Summary/append_rsid_to_gdf.py \
        --input_gdf={input.gdf} \
        --target_bed={params.target_bed} \
        --output_file={output.gdf}
     """
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from get_target_variants import GDF
import argparse
import sys

# As script to run "shell:" in snakemake rule to allow singularity run

def main():
    parser = argparse.ArgumentParser(
        description="Rewrite bed to chr:start-end list. Removing wt targets or adding padding"
    )
    parser.add_argument("--input_gdf", type=str)
    parser.add_argument("--target_bed", type=str)
    parser.add_argument("--output_file", type=str)

    args = parser.parse_args(sys.argv[1:])

    input_gdf = args.input_gdf
    target_bed = args.target_bed
    output_file = args.output_file

    c_gdf = GDF(input_gdf)
    c_gdf.rsid_per_position(target_bed)
    c_gdf.write_proccessed_gdf(output_file)


if __name__ == '__main__':
    main()
12
13
14
15
16
17
18
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --detected_variants={input.detected_variants}
    """
34
35
36
37
shell:
    """
    java -jar /usr/GenomeAnalysisTK.jar -T DepthOfCoverage -R {params.ref} -I {input.bam} -o {output.gdf} -L {input.interval}
    """
49
50
51
52
53
54
55
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --padding={params.padding}
    """
71
72
73
74
75
shell:
    """
    # NOTE: does not work with openjdk-11, openjdk-8 works
    java -jar /usr/GenomeAnalysisTK.jar -T DepthOfCoverage -R {params.ref} -I {input.bam} -o {output.gdf} -L {input.interval}
    """
14
15
16
17
18
19
20
shell:
    """
    python3 {params.script_location}/src/Summary/get_target_variants.py \
        --target_bed {params.target_bed} \
        --vcf {input.vcf} \
        --output {output.csv} 
    """
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
import pandas as pd
import argparse
import sys
import numpy as np
from itertools import product


class GetInteractions:
    def __init__(self, variants_file, guideline_file):
        self.variants_df = pd.read_csv(variants_file, sep="\t")
        self.variants_df = self.variants_df.sort_values(by=["gene"])
        self.interaction_guidelines_df = pd.read_csv(guideline_file, sep="\t")

    def get_possible_interactions(self):
        # TODO: refactor inefficient and ugly
        interacting_genes = set(self.interaction_guidelines_df[["gene1", "gene2"]].values.flat)
        if self.variants_df.empty:
            return self.interaction_guidelines_df[np.zeros(len(self.interaction_guidelines_df), dtype=bool)]
        self.variants_df = self.variants_df[self.variants_df.gene.isin(interacting_genes)]

        if self.variants_df.empty:
            return self.interaction_guidelines_df[np.zeros(len(self.interaction_guidelines_df), dtype=bool)]

        index_combinations = product(
            range(len(self.variants_df)),
            range(len(self.variants_df))
        )

        prev_idx = np.zeros(len(self.interaction_guidelines_df), dtype=bool)
        haplotypes = [""] * len(self.interaction_guidelines_df)
        for i, j in index_combinations:
            if not i == j:
                gene1 = self.variants_df.iloc[[i]].gene.values[0]
                gene2 = self.variants_df.iloc[[j]].gene.values[0]

                activity1 = int(self.variants_df.iloc[[i]].Genotype_activity)
                activity2 = int(self.variants_df.iloc[[j]].Genotype_activity)

                all_haplotypes = ",".join([
                    self.variants_df.iloc[[i]]["Haplotype1"].values.flat[0],
                    self.variants_df.iloc[[i]]["Haplotype2"].values.flat[0],
                    self.variants_df.iloc[[j]]["Haplotype1"].values.flat[0],
                    self.variants_df.iloc[[j]]["Haplotype2"].values.flat[0]
                ])

                idx = (self.interaction_guidelines_df.gene1 == gene1) & \
                      (self.interaction_guidelines_df.gene2 == gene2) & \
                      (self.interaction_guidelines_df.activity_1 == activity1) & \
                      (self.interaction_guidelines_df.activity_2 == activity2)

                if any(idx):
                    prev_idx += idx
                    for idx_i in idx:
                        if idx_i:
                            haplotypes[idx_i] = all_haplotypes

        haplotypes = [h for h in haplotypes if h != ""]
        interactions = self.interaction_guidelines_df[prev_idx]
        interactions["haplotypes"] = haplotypes
        return interactions

    def run_and_write(self, output):
        interactions = self.get_possible_interactions()
        interactions.to_csv(output, index=None, sep="\t")


def main():
    parser = argparse.ArgumentParser(
        description="Get guidelines based on interacting variants in different genes"
    )
    parser.add_argument("--diploids", type=str)
    parser.add_argument("--interaction_guidelines", type=str)
    parser.add_argument("--output", type=str, help="Location of output")
    args = parser.parse_args(sys.argv[1:])
    get_interactions = GetInteractions(args.diploids, args.interaction_guidelines)
    get_interactions.run_and_write(args.output)


if __name__ == '__main__':
    main()
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
import pandas as pd
import re
import argparse
import sys
import numpy as np


class ArrangeHaplotype:
    """
    Get possible haplotypes from variant combinations detected in file.
    Add clinical guidelines based on the haplotypes detected
    """

    def __init__(self,
                 detected_variants,
                 haplotype_definitions,
                 activity_scores,
                 clinical_guidelines):

        self.detected_variants = pd.read_csv(detected_variants, sep="\t")
        self.haplotype_definitions = pd.read_csv(haplotype_definitions, sep="\t")
        self.activity_scores = pd.read_csv(activity_scores, sep="\t")
        self.clinical_guidelines = pd.read_csv(clinical_guidelines, sep="\t")
        if not self.detected_variants.empty:
            self.merge_data()

    def merge_data(self):
        """
        Join possible haplotypes containing ids
        :return:
        """

        self.detected_variants["multival_haplotype"] = \
            self.detected_variants.ID.apply(
                lambda x: list(self.haplotype_definitions["HAPLOTYPE"][
                                   self.haplotype_definitions.ID == x
                                   ])
            )
        self.detected_variants["CN"] = self.detected_variants["GT"].apply(
            lambda x: sum(map(int, re.split('[/|]+', x)))
        )

    def get_haplotypes(self):
        """
        Return tree-structure of possible combinations of haplotypes explaning seen variants.
        Assumptions: Haplotype explaning most variants goes first, if any of variants in haplotype
         is zero the all haplotypes containing that variant is removed for futher chocies.
        :return: Gene - haplotype-tree dict
        """

        def _get_haplotypes(variant_subdf, current_haplotype, depth=2):
            idx = variant_subdf["multival_haplotype"].apply(lambda x: current_haplotype in x)
            variant_subdf.loc[idx, "CN"] -= 1

            if any(variant_subdf["CN"] == 0):
                remove_hap = lambda x, y: x.remove(y) if y in x else x
                variant_subdf["multival_haplotype"] = variant_subdf["multival_haplotype"].apply(
                    lambda x: remove_hap(x, current_haplotype)
                )

            variant_subdf = variant_subdf[variant_subdf["CN"] != 0]

            if depth == 1:
                if len(variant_subdf) == 0:
                    return [current_haplotype, True]
                else:
                    return [current_haplotype, False]

            if len(variant_subdf) == 0 or not any(variant_subdf["multival_haplotype"].apply(lambda x: bool(x))):
                wt_haplotype = "WT"
                return [current_haplotype, [_get_haplotypes(variant_subdf.copy(), wt_haplotype, depth - 1)]]

            remaining_haplo = set([element for halpolist in variant_subdf["multival_haplotype"]
                                   for element in halpolist])
            return [current_haplotype, [
                _get_haplotypes(variant_subdf.copy(), hap, depth - 1) for hap in remaining_haplo
            ]]

        genes = set(self.detected_variants["GENE"])
        full_mat = {}
        for gene in genes:
            gene_subset = self.detected_variants[self.detected_variants.GENE == gene]
            print(gene_subset)
            candidate_haplotypes = np.array(list(set(
                [element for halpolist in gene_subset["multival_haplotype"] for element in halpolist]
            )))

            order = list(reversed(np.argsort(
                [sum(gene_subset["multival_haplotype"].apply(lambda y: x in y)) for x in candidate_haplotypes]
            )))
            candidate_haplotypes = candidate_haplotypes[order]

            gene_haps = []
            for current_haplotype in candidate_haplotypes:
                gene_haps.append(_get_haplotypes(gene_subset.copy(), current_haplotype))
                idx = gene_subset["multival_haplotype"].apply(lambda x: current_haplotype in x)
                cn = gene_subset.loc[idx, "CN"]
                if any([(c - 1) == 0 for c in cn]):
                    remove_hap = lambda x, y: x.remove(y) if y in x else x
                    gene_subset["multival_haplotype"] = gene_subset["multival_haplotype"].apply(
                        lambda x: remove_hap(x, current_haplotype)
                    )

            full_mat.update({gene: gene_haps})

        return full_mat

    def get_haplotype_dataframe(self):  # Wow what a mess
        hap_mat = self.get_haplotypes()

        def _haplot_to_row(hap, gene):
            def prim_haplot_to_row(hap, gene):
                current_hap = (f"{gene}-1" if hap[0] == "WT" else hap[0])
                if not type(hap[1]) is list:
                    return [current_hap, gene, hap[1]]
                else:
                    next_hap = [prim_haplot_to_row(c_hap, gene) for c_hap in hap[1]]
                    return [[current_hap] + c_hap for c_hap in next_hap][0]

            return [prim_haplot_to_row(c_hap, gene) for c_hap in hap]

        out = []
        for gene, hap in hap_mat.items():
            if len(hap) != 0:
                out += _haplot_to_row(hap, gene)

        hap_df = pd.DataFrame(out, columns=["Haplotype1", "Haplotype2", "gene", "pass"])
        hap_df = hap_df[hap_df["pass"]]

        return hap_df[["gene", "Haplotype1", "Haplotype2"]]

    def get_clinical_guidelines_table(self):
        if self.detected_variants.empty:
            columns = [
                "gene", "Haplotype1", "Haplotype2",
                "HAPLOTYPE1", "ACTIVITY_SCORE1", "HAPLOTYPE2",
                "ACTIVITY_SCORE2", "Genotype_activity",
                "Gene", "Activity", "Guideline"
            ]
            return pd.DataFrame(columns=columns)
        hap_df = self.get_haplotype_dataframe()
        hap_df = hap_df.merge(self.activity_scores, how="left",
                              left_on="Haplotype1", right_on="HAPLOTYPE")
        hap_df = hap_df.merge(self.activity_scores, how="left",
                              left_on="Haplotype2", right_on="HAPLOTYPE", suffixes=("1", "2"))
        hap_df["Genotype_activity"] = hap_df["ACTIVITY_SCORE1"] + hap_df["ACTIVITY_SCORE2"]

        hap_df = hap_df.merge(self.clinical_guidelines, how="left",
                              left_on=["gene", "Genotype_activity"], right_on=["Gene", "Activity"])

        return hap_df

    def get_wildtypes(self, hap_df):
        hap_genes = list(hap_df.gene.values)
        for gene in set(self.clinical_guidelines.Gene):
            if hap_df.empty or gene not in hap_genes:
                gene_df = pd.DataFrame(
                    {"gene": [gene], "Haplotype1": [gene + "-1"], "Haplotype2": [gene + "-1"],
                     "HAPLOTYPE1": [gene + "-1"], "ACTIVITY_SCORE1": [1], "HAPLOTYPE2": [gene + "-1"],
                     "ACTIVITY_SCORE2": [1], "Genotype_activity": [2.0]})
                gene_df = gene_df.merge(self.clinical_guidelines, how="left",
                                        left_on=["gene", "Genotype_activity"], right_on=["Gene", "Activity"])
                hap_df = hap_df.append(gene_df, ignore_index=True)
        return hap_df


def main():
    parser = argparse.ArgumentParser(
        description="Finds selected RSIDs form bed file in input VCF"
    )
    parser.add_argument("--variant_csv", type=str)
    parser.add_argument("--haplotype_definitions", type=str)
    parser.add_argument("--clinical_guidelines", type=str)
    parser.add_argument("--haplotype_activity", type=str)
    parser.add_argument("--hidden_haplotypes", type=str)

    parser.add_argument("--output", type=str, help="Location of output")

    args = parser.parse_args(sys.argv[1:])
    variant_csv = args.variant_csv
    haplotype_definitions = args.haplotype_definitions
    clinical_guidelines = args.clinical_guidelines
    haplotype_activity = args.haplotype_activity
    output = args.output
    hidden_haplotypes = args.hidden_haplotypes

    ah = ArrangeHaplotype(
        variant_csv,
        haplotype_definitions,
        haplotype_activity,
        clinical_guidelines
    )

    df = ah.get_clinical_guidelines_table()
    df = ah.get_wildtypes(df)
    columns = [
        "gene", "Haplotype1", "Haplotype2",
        "HAPLOTYPE1", "ACTIVITY_SCORE1", "HAPLOTYPE2",
        "ACTIVITY_SCORE2", "Genotype_activity",
        "Gene", "Activity", "Guideline"
    ]
    if not df.empty:
        hidden_haplotypes = pd.read_csv(
            hidden_haplotypes, sep="\t"
        )
        hidden_haplotypes["comb"] = hidden_haplotypes[["Haplotype1", "Haplotype2"]].apply(
            lambda x: "".join(sorted(x.tolist())), axis=1
        )
        df["comb"] = df[["Haplotype1", "Haplotype2"]].apply(
            lambda x: "".join(sorted(x.tolist())), axis=1
        )
        print(df["comb"])
        print(hidden_haplotypes["comb"])
        df = df[~df["comb"].isin(hidden_haplotypes["comb"])]
    df.to_csv(output, sep="\t", index=False, columns=columns)


if __name__ == '__main__':
    main()
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
import pandas as pd
import gzip
import re
import os
import argparse
import sys


class GDF:
    """
    We are assuming single sample GDF
    """

    def __init__(self, filename):
        self.data = pd.read_csv(filename, sep="\t")
        self.data_cols = self.data.columns.values

        pos_df = pd.DataFrame(
            self.data.Locus.apply(lambda x: x.split(":")).to_list(),
            columns=["CHROM", "POS"]
        )
        pos_df.POS = pos_df.POS.astype('int64')
        self.data = pd.concat([self.data, pos_df], axis=1)

    def rsid_per_position(self, target_bed):
        def _annotate(x, targets):
            try:
                ids = targets[(targets.CHROM == x.CHROM) &
                              (targets.START <= x.POS) &
                              (x.POS <= targets.END)].ID.to_list()
                return ", ".join(ids)
            except IndexError:
                return "-"

        targets = pd.read_csv(
            target_bed, sep="\t",
            names=["CHROM", "START", "END", "ID", "GENE"]
        )
        targets["save"] = targets.START
        idx_swap = targets.START > targets.END
        targets.loc[idx_swap, "START"] = targets.loc[idx_swap, "END"]
        targets.loc[idx_swap, "END"] = targets.loc[idx_swap, "save"]
        targets["CHROM"] = targets.CHROM.apply(lambda x: f"chr{x}")
        self.data["ID"] = self.data.apply(lambda x: _annotate(x, targets), axis=1)

    def write_proccessed_gdf(self, filename, annotate=True):
        if annotate:
            self.data.to_csv(filename, sep="\t", index=False)
        else:
            self.data.to_csv(filename, sep="\t", columns=self.data_cols, index=False)


class VCF:
    """
     We are assuming single sample VCF
     """
    def __init__(self, filename):
        self.meta = []
        self.data = pd.DataFrame()
        self.original_header = []
        self.read_vcf(filename)

    def read_vcf(self, filename):
        if ".gz" in filename:
            f = gzip.open(filename, "rt")
        else:
            f = open(filename, "r")

        lines = f.readlines()
        lines = [l.strip() for l in lines]
        i = None
        for i, line in enumerate(lines):
            if re.search("^#CHROM", line):
                break

        if i is None:
            raise ImportError("No lines in: " + filename)

        self.meta = lines[:i - 1]
        data = [l.split("\t") for l in lines[i:]]
        self.data = pd.DataFrame(data[1:], columns=data[0])
        self.original_header = self.data.columns.values
        if not self.data.empty:
            sample_column = self.data.columns.values[-1]
            max_len_idx = self.data.FORMAT.str.len().idxmax
            format_columns = self.data.FORMAT[max_len_idx].split(":")
            format_split = pd.DataFrame(self.data[sample_column].apply(lambda x: x.split(":")).to_list(),
                                        columns=format_columns)
            self.data = pd.concat([self.data, format_split], axis=1)

    def filter_snp(self, filter_file, exclude=True, column="SNP"):
        filter_snps = pd.read_csv(filter_file, sep="\t")[column]
        if exclude:
            self.data = self.data[~self.data.ID.isin(filter_snps)]
        else:
            self.data = self.data[self.data.ID.isin(filter_snps)]

    def write_vcf(self, filename_out):
        with open(filename_out, "w+") as f:
            for line in self.meta:
                f.write(line + "\n")

            self.data.to_csv(f, mode="a", sep="\t", columns=self.original_header, index=False)


class VariantQCCollection:
    def __init__(self, target_bed, vcf_filename):
        self.bed_targets = pd.read_csv(target_bed, names=None, sep="\t")
        self.bed_targets.columns = ["#CHROM", "START", "END", "ID", "GENE"]
        self.sample = os.path.basename(vcf_filename)
        self.vcf = VCF(vcf_filename)
        self.detected_variants = []

    def detect_variants(self):
        """
        Select all variants within VCF with ID same as in target_bed
        """
        self.detected_variants = self.vcf.data.ID[
            (self.vcf.data.ID.isin(self.bed_targets.ID)) &
            (self.vcf.data.FILTER == "PASS")
        ].tolist()

    def write_detected_variant_qc(self, output_file):
        """
        Write detected variants to csv
        """
        if not self.detected_variants:
            self.detect_variants()

        current_variants = self.vcf.data[self.vcf.data.ID.isin(self.detected_variants)]
        current_variants = current_variants.merge(
            self.bed_targets[["ID", "GENE"]],
            on="ID"
        )

        current_variants.to_csv(output_file, index=False, sep="\t")


def main():
    parser = argparse.ArgumentParser(
        description="Finds selected RSIDs form bed file in input VCF"
    )
    parser.add_argument("--target_bed", type=str, help="Bed-file containing RSIDs of interest")
    parser.add_argument("--vcf", type=str)
    parser.add_argument("--output", type=str, help="Location of output, NOTE: will overwrite")

    args = parser.parse_args(sys.argv[1:])
    vcf_f = args.vcf
    bed_f = args.target_bed
    output_f = args.output

    var_collect = VariantQCCollection(bed_f, vcf_f)
    var_collect.write_detected_variant_qc(output_f)


if __name__ == '__main__':
    main()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
import pandas as pd
import argparse
import sys


def reform(target_bed, output_f, detected_variants, padding, file_format):
    targets = pd.read_csv(target_bed, sep="\t",
                          names=["CHROM", "START", "END", "ID", "GENE"],
                          dtype={"START": int, "END": int})
    if detected_variants != "":
        detected_rsid = pd.read_csv(detected_variants, sep="\t").ID
        targets = targets[~targets.ID.isin(detected_rsid)]

    bed = file_format == "bed"
    with open(output_f, "w+") as f:
        for i, row in targets.iterrows():
            chrom, start, end, id = row[0:4]
            if start > end:
                end, start = start, end
            start -= padding
            end += padding
            if bed:
                f.write(f"chr{chrom}\t{start}\t{end}\t{id}\n")
            else:
                f.write(f"chr{chrom}:{start}-{end}\n")


def main():
    parser = argparse.ArgumentParser(
        description="Rewrite bed to chr:start-end list. Removing wt targets or adding padding"
    )
    parser.add_argument("--target_bed", type=str)
    parser.add_argument("--output_file", type=str)
    parser.add_argument("--detected_variants", type=str, default="")
    parser.add_argument("--padding", type=int, default=0)
    parser.add_argument("--format", type=str, default="list")

    args = parser.parse_args(sys.argv[1:])
    target_bed = args.target_bed
    output_file = args.output_file
    detected_variants = args.detected_variants
    padding = args.padding
    file_format = args.format
    reform(target_bed, output_file, detected_variants, padding, file_format)


if __name__ == '__main__':
    main()
13
14
15
16
17
18
19
20
shell:
     """
     gatk HaplotypeCaller \
        -R {params.ref} \
        -I {input.bam} \
        --dbsnp {params.dbsnp} \
        -O {output.vcf}
     """
ShowHide 15 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/JoelAAs/pgx_module
Name: pgx_module
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...