A Snakemake workflow for the MSCi framework in BP&P

public public 1yr ago 0 bookmarks

A Snakemake workflow for the MSCi framework in BP&P

Description

Please note that this pipeline is not maintained, and that there is at least one known bug. Please feel free to steal the code to set up your own pipeline, but it wo

Code Snippets

1
awk '$3="CDS"' $1 | awk '{{print $1, $4, $5}}' | awk '!visited[$0]++' | sed '/^#/d' | sed 's/ /\t/g'
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import pandas as pd
import fileinput
import random
import argparse

def generate_control_file(ctl_template, imap, n_loci, tree, theta_beta, tau_beta, mcmc_samples, seqfile, seqtype, rep):

    imap_df = pd.read_csv(imap, sep = " ")
    populations = list(imap_df.iloc[:,1].unique())

    n_samples_per_species = str(round(int(n_loci)/len(populations)))
    list_of_n_samples = [n_samples_per_species] * len(populations)
    n_samples = " ".join(list_of_n_samples)

    replacements = {
        "RUNNAME":f"{seqtype}_{rep}",
        "IMAP": imap,
        "SEED":random.randint(1,10000),
        "SEQFILE":seqfile,
        "SPECIES_LINE":str(len(populations)) + " " + ' '.join(populations),
        "N_SAMPLES_PER_SPECIES": n_samples,
        "TREE":tree,
        "N_LOCI":n_loci,
        "THETA_BETA":theta_beta,
        "TAU_BETA":tau_beta,
        "MCMC_SAMPLES":mcmc_samples,
        "BURNIN":str(round(int(mcmc_samples) * 0.1))
        }

    with open (ctl_template, "r") as f:
        data = f.read()
        for key, replacement in replacements.items():
            data = data.replace(key, str(replacement))
    print(data)

parser = argparse.ArgumentParser()
parser.add_argument("--imap")
parser.add_argument("--n_loci")
parser.add_argument("--tree")
parser.add_argument("--theta_beta")
parser.add_argument("--tau_beta")
parser.add_argument("--mcmc_samples")
parser.add_argument("--seqfile")
parser.add_argument("--seqtype")
parser.add_argument("--rep")
parser.add_argument("--ctl_template")
args = parser.parse_args()


if __name__ == "__main__":
    generate_control_file(
        ctl_template = args.ctl_template, imap = args.imap, n_loci = args.n_loci,
        tree = args.tree, theta_beta = args.theta_beta,
        tau_beta = args.tau_beta, mcmc_samples = args.mcmc_samples,
        seqfile = args.seqfile, seqtype = args.seqtype, rep = args.rep)
1
bedtools makewindows -b $1 -w 1000 | awk '{if($3-$2 <= 1000 && $3-$2 >= 500) print}' | shuf | head -n 1000
10
11
shell:
        "mkdir -p {output}"
SnakeMake From line 10 of main/Snakefile
20
21
shell:
        """awk '{{print $1}}' {params.imap} > {output} """
SnakeMake From line 20 of main/Snakefile
31
32
33
34
35
36
shell:
        """
        grep ">" {input[1]} | sed 's/>//' | nl | awk '$1=$1' > temp/chr_names.txt
        bcftools annotate --rename-chrs temp/chr_names.txt {input[0]} | bgzip > {output}
        tabix {output}
        """
44
45
46
47
shell:
        """
        bcftools view --samples-file {input[0]} {input[1]} --min-alleles 2 --max-alleles 2 --force-samples -Oz > {output}
        """
56
57
58
59
60
shell:
        """
        bcftools +prune -m 0.5 -w 10000 {input[0]} -Ov | bgzip > {output} # note use of old BCFtools; -l is now -m
        tabix {output}
        """
79
80
shell:
        "bedtools complement -i {input[0]} -g {input[1]} > {output}"
89
90
91
92
93
shell:
        """
# awk '$3="CDS"' {input} | awk '{{print $1, $4, $5}}' | awk '!visited[$0]++' | sed '/^#/d' | sed '/ /\\t/g' > {output}
bash scripts/generate_coding_bed.sh {input} > {output}
"""
SnakeMake From line 89 of main/Snakefile
106
107
108
109
110
shell:
        """
bash scripts/makewindows.sh {input[0]} > {output[0]}
bash scripts/makewindows.sh {input[1]} > {output[1]}
        """
SnakeMake From line 106 of main/Snakefile
119
120
121
122
123
shell:
    """
            awk 'BEGIN{{OFS=":"}} {{print $1,$2,$3}}' {input[0]} | sed 's/:/-/2' | sed '/^#/d' > {output[0]}
            awk 'BEGIN{{OFS=":"}} {{print $1,$2,$3}}' {input[1]} | sed 's/:/-/2' | sed '/^#/d' > {output[1]}
            """
SnakeMake From line 119 of main/Snakefile
132
133
134
135
136
shell:
        """
        bcftools query -l {input} > {output}
#comm -12 <(sort temp/vcf_samples_temp.txt) <(sort {input[0]}) > temp/vcf_samples.txt
        """
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
shell:
        """

        mkdir -p temp/sequences

        REGIONS=$(cat {input[0]} {input[1]})

        for region in ${{REGIONS}}
        do
                for sample in $(cat {input[4]})
                do
                        printf '>'$(echo ${{region}} | tr -s -c [:alnum:] _)'^'${{sample}} #header
                samtools faidx {input[2]} ${{region}} | bcftools consensus -s ${{sample}} {input[3]}
                        printf '\\n'
        done | cut -f1,2 -d'>' | awk 'BEGIN {{RS = ">" ; FS = "\\n" ; ORS = ""}} $2 {{print ">"$0}}' > temp/sequences/${{region}}.txt
        done

        for region in ${{REGIONS}}
do
    mafft --leavegappyregion --retree 2 --reorder temp/sequences/${{region}}.txt > temp/sequences/${{region}}_aligned.txt
    trimal -in temp/sequences/${{region}}_aligned.txt -out temp/sequences/${{region}}.ph -phylip_paml -gt 0.2
done

        for region in $(cat {input[0]}) # coding
        do
                FILE=temp/sequences/${{region}}.ph
                if [ -f temp/sequences/${{region}}.ph ]
                then
                        cat ${{FILE}}
                        printf '\\n'
                fi
        done > {output[0]}

        for region in $(cat {input[1]}) # noncoding
do
    FILE=temp/sequences/${{region}}.ph
    if [ -f temp/sequences/${{region}}.ph ]
    then
        cat ${{FILE}}
        printf '\\n'
    fi
done > {output[1]}
        """
199
200
201
202
203
204
205
shell:
        """
        wget https://github.com/bpp/bpp/releases/download/v4.4.1/bpp-4.4.1-linux-x86_64.tar.gz
        tar zxvf bpp-4.4.1-linux-x86_64.tar.gz
        rm bpp-4.4.1-linux-x86_64.tar.gz
        mv bpp-4.4.1-linux-x86_64/bin/bpp .
        """
SnakeMake From line 199 of main/Snakefile
213
214
215
216
shell:
        """
        ./bpp --msci {input} | grep -A1 "Newick tree:" | grep -v "Newick tree:" > tree.txt
        """
SnakeMake From line 213 of main/Snakefile
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
shell:
        """
        mkdir -p control_files

        TREE=$(cat {input[4]})

        for SEQTYPE in "coding" "noncoding"
        do
                END={config[number_of_repeats]}
                for REP in $(seq 1 $END)
                do
                        python scripts/generate_control_files.py --imap {config[imap]} \
                        --n_loci {config[number_of_loci]} --tree "${{TREE}}" \
                                --theta_beta {config[theta_beta]} --tau_beta {config[tau_beta]} \
                                        --mcmc_samples {config[mcmc_samples]} --seqfile ${{SEQTYPE}}_sequences.ph \
                                                --seqtype ${{SEQTYPE}} --rep ${{REP}} --ctl_template {config[ctl_template]} \
                                                        > control_files/${{SEQTYPE}}_${{REP}}.ctl
                done
        done
        """
SnakeMake From line 229 of main/Snakefile
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
shell:
        """
        # Generate run scripts + submit to cluster
        for SEQTYPE in "coding" "noncoding"
        do
                END={config[number_of_repeats]}
                for REP in $(seq 1 $END)
                do
                        RUNFILE=run_${{SEQTYPE}}_${{REP}}.sh
                        CTL_FILE=control_files/${{SEQTYPE}}_${{REP}}.ctl
                        cp config_files/run_bpp.sh ${{RUNFILE}}
                        sed -i "s/RUNNAME/${{SEQTYPE}}_${{REP}}/g" ${{RUNFILE}}
                        sed -i "s/EMAIL/{config[email]}/g" ${{RUNFILE}}
                        sed -i "s/ACCOUNT/{config[account]}/g" ${{RUNFILE}}
                        sed -i "s|CTL_FILE|${{CTL_FILE}}|g" ${{RUNFILE}}

                        bash ${{RUNFILE}}
                done
        done

        echo 'Done' > debug.txt
        """
SnakeMake From line 256 of main/Snakefile
ShowHide 14 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/simonharnqvist/bpp-msci-workflow
Name: bpp-msci-workflow
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...