Haplotype Phasing for large-scale genotype datasets

public public 1yr ago 0 bookmarks

This pipeline conducts internal population phasing for assorted datasets using pre-existing software.

Authors

  • Arjun Biddanda (@aabiddanda)

This was largely built while @aabiddanda was employed by 54Gene, but has been

Code Snippets

12
13
shell:
    "tabix -f {input}"
24
25
shell:
    "bcftools index -f {input.bcf}"
39
40
shell:
    "for i in {input.vcfs}; do tabix -l $i; done > {output}"
52
53
shell:
    "bcftools query -l {input.vcf} > {output}"
86
87
shell:
    "bcftools view -r {wildcards.chrom} --threads {threads} {input} -Oz -o {output}"
105
106
shell:
    "bcftools concat -a -D -Ou {input.vcfs} | bcftools sort -Oz -o {output}"
119
120
shell:
    "bcftools view -r {wildcards.chrom} {input.unphased_vcf} --threads {threads} -Ob -o {output.bcf}"
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
run:
    if analysis_configs[wildcards.outfix]["reference_panel"] == "":
        shell("touch {output}")
    else:
        ref_panel_manifest = pd.read_csv(
            analysis_configs[wildcards.outfix]["reference_panel"], sep="\t"
        )
        for c in ref_panel_manifest.chroms:
            assert c in CHROM
        assert (
            ref_panel_manifest.chroms.size
            == np.unique(ref_panel_manifest.chroms.values).size
        )
        index_files = [
            pc.determine_index(r) for r in ref_panel_manifest.ref_panel.values
        ]
        ref_panel_manifest["file_index"] = index_files
        ref_panel_manifest.to_csv(str(output), sep="\t", index=False)
SnakeMake From line 127 of rules/common.smk
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
run:
    if analysis_configs[wildcards.outfix]["recombination_maps"] == "":
        raise ValueError("Cannot convert a recombination map if none are provided!")
    else:
        recomb_map_manifest = pd.read_csv(
            analysis_configs[wildcards.outfix]["recombination_maps"],
            sep="\t",
            dtype=str,
        )
        for c in recomb_map_manifest.chroms:
            assert c in CHROM
        assert (
            recomb_map_manifest.chroms.size
            == np.unique(recomb_map_manifest.chroms.values).size
        )
        assert wildcards.chrom in recomb_map_manifest.chroms.values
        filename = recomb_map_manifest[
            recomb_map_manifest.chroms == wildcards.chrom
        ].recombination_map.values[0]
        transformed_df = pc.convert_hapmap_genmap(filename, wildcards.algo)
        transformed_df.to_csv(str(output), index=False, sep="\t")
SnakeMake From line 172 of rules/common.smk
13
14
15
16
shell:
    """
    cp ${{CONDA_PREFIX}}/share/eagle/tables/genetic_map_{params.hg_notation}_withX.txt.gz {output}
    """
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
shell:
    """
    eagle --chrom={wildcards.chrom}\
        --Kpbwt={params.kpbwt}\
        --pbwtIters={params.pbwt_iters}\
        --histFactor={params.hist_factor}\
        --genoErrProb={params.geno_err_prob}\
        --expectIBDcM={params.expect_ibd}\
        --numThreads={threads}\
        {params.imp_missing}\
        {params.vcf_target}\
        {params.ref_panel}\
        --geneticMapFile={input.genetic_map}\
        --outPrefix={params.outprefix} 2>&1 | tee {log}
    """
14
15
16
17
18
19
20
21
22
23
run:
    sample_ids = [x.rstrip() for x in open(input.sample_list).readlines()]
    fam_file_validator = FamFileValidator(params.fam)
    res = fam_file_validator.validate_fam(sample_ids)
    if res:
        fam_file_validator.fam_df.to_csv(
            output[0], sep=" ", index=False, header=False
        )  # noqa
    else:
        shell("touch {output}")
49
50
shell:
    "switchError --gen {input.gen} --hap {input.hap} --reg {wildcards.chrom} --fam {input.fam_file} --maf {params.maf} --out {params.outprefix} 2>&1 | tee {log}"
25
26
27
28
29
30
31
32
shell:
    """
    mv {input} resources/shapeit4.tar.gz
    cd resources/
    tar -zxvf shapeit4.tar.gz shapeit4-4.2.2/maps/
    cd shapeit4-4.2.2/maps/
    tar -xvf genetic_maps.{params.build}.tar.gz
    """
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
shell:
    """
    shapeit4 --input {input.unphased_bcf}\
        --map {input.genetic_map}\
        --seed {params.seed}\
        --region {wildcards.chrom}\
        --mcmc-iterations {params.mcmc_iterations}\
        {params.sequencing}\
        {params.ref_panel}\
        --thread {threads}\
        --output {output.phased_vcf} 2>&1 | tee {log}
    """
ShowHide 11 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/aabiddanda/haplotype-phasing
Name: haplotype-phasing
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...