A Snakemake pipeline for variant calling of genomic FASTQ data using GATK


Introduction

This project is a Snakemake workflow for processing .fastq.gz files and performing downstream analyses. The variant calling is based primarily on the GATK best practices for germline short variant discovery. Later analyses include pairwise relationship estimation and admixture.

Dependencies

conda is a package manager that can install many of the tools used by this workflow. Once conda is installed, use the following commands to obtain the required packages:

conda install mamba
mamba install -c bioconda snakemake
mamba install -c conda-forge biopython
mamba install -c anaconda pandas

Some tools still have to be installed manually outside of conda. Depending on which part of the workflow is being used, additional tools may be required as well and must be accessible through $PATH . The main one is gatk .

Other necessary libraries have their packages and versions stored in workflow/envs . These will be installed automatically in their own conda environments when running this Snakemake pipeline.

Configuration

Configuration settings are defined in a .py file inside the config directory. By default, the file template.py is active. It can be copied and renamed to create multiple config files for convenient testing with different datasets. The desired config file for a particular run is selected in workflow/Snakefile on the line that reads from template import __dict__ as config . For example, say we create a new config rhesus.py . This file would then be imported by changing template in that line to rhesus .

The following files are required in the config file for the first steps of variant calling:

  • Species reference genome

  • Raw FASTQ files (ending with R1.fastq.gz and R2.fastq.gz )

Subsequent steps require their own additional user-created files, which can all be specified in the config file.

Also, be sure to set the value of path , which determines where the resources/ and results/ directories go.

Setting Target Files

In addition to setting input files and variables inside the config files, target outputs (the files that we want to generate) must also be listed. These go inside the variable target_files . When run, Snakemake will create those files (as well as any intermediates) based on what is listed there.

Writing the target files requires finding the rule in workflow/rules/ that produces the desired files and copying its output . For instance, if I want to make BAM files, I can find the corresponding rule align , which is in variant_calling.smk , and find:

output: alignment = config["results"] + "alignments/raw/{sample}.bam",

Now, place that inside the list of target files back in our Snakefile , making sure to replace config["results"] with just results . Also replace {sample} with the actual sample name of interest.

target_files = [ results + "alignments/raw/WGS12345.bam",
]

Often, many samples will be worked on at the same time. To facilitate this, we could just add more to the target list like so:

target_files = [ results + "alignments/raw/WGS12345.bam",
                 results + "alignments/raw/WGS23456.bam",
]

A more efficient way, however, is to use the expand function:

target_files = [ expand(results + "alignments/raw/{sample}.bam", sample=["WGS12345", "WGS23456"]),
]

A couple of convenience variables are preset: SAMPLE_NAMES and CHROMOSOMES . SAMPLE_NAMES stores all of the sample names listed in config["samples"] , and CHROMOSOMES stores all numbered chromosomes plus X, Y, and MT if present in the reference genome.

target_files = [ expand(results + "alignments/raw/{sample}.bam", sample=SAMPLE_NAMES),
]

As another example, to create genotyped VCFs for each chromosome, we can look at rule genotype_passing :

output: config["results"] + "genotypes/pass/{dataset}.{mode}.chr{chr}.vcf.gz",

For our config file, we could have the following. {dataset} can essentially be named anything; just be consistent. Modifying this name is helpful for keeping multiple parallel analyses when tweaking parameters. Currently, this workflow only works with {mode} set to SNP; other modes are reserved for future work with indels, or indels and SNPs together.

target_files = [ expand(results + "genotypes/pass/rhesus.SNP.chr{chr}.vcf.gz", chr=CHROMOSOMES),
]

Running

Running snakemake directly within the project's main directory on the command line will initiate the program by calling workflow/Snakefile . To perform a "dry-run", use the command snakemake -n . This checks that the rules are functional without actually processing any data, which is helpful for quickly catching obvious problems with new rules. If an upstream file has been modified after a downstream one, Snakemake will try to redo the pipeline starting from the modified file. If the modification was accidental, this deletes the downstream files and reruns the rules between them. So it is good practice to perform a dry-run before each actual run to make sure important files won't be deleted. If a file was accidentally modified and the downstream files should not be deleted, the Unix command touch -d can be used to set the timestamp back to an earlier time. For example, setting file.txt to a timestamp of October 3rd could look like touch -d '3 Oct' file.txt .

To find the reason Snakemake is rerunning files, run snakemake -n -r . One of the proposed jobs should say "reason: Updated input files: _________". Those file(s) are the source of the problem.

On SGE Cluster

To actually run the program on a Sun Grid Engine cluster, the following command can be issued from the project directory:

NAME=smk_variant; LOG=log/dirname; nohup snakemake --profile profile --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" > $LOG/$NAME.smk.log 2>&1

Most of the Snakemake settings are stored in profile/config.yaml , which is loaded by the --profile profile option. The values there can be modified to adjust the maximum number of jobs and other settings. The --cluster option sets the options passed to the cluster and can generally be left as is. However, set values for the variables NAME and LOG . NAME will be the name given to the cluster's queue, viewable with the qstat command, and is also used for the log files. LOG is the directory in which the log files will be stored; change it as needed to help organize logs.

Locally

Ideally, use the cluster configuration above. Otherwise, to run locally, use: nohup snakemake --use-conda -c2 . The integer in -c2 sets the number of cores to use. Most rules list the number of threads they use, which can guide this value, but unlike the cluster method above, it must be changed manually. Higher values (up to what the system actually has) lead to faster processing. This command leaves the error log in a file named nohup.out in the current directory.

Log Files

In addition to generating the files given in target_files , the above cluster command will also generate log files. Check these when errors occur. There are three logs ending with the following patterns:

  • .smk.log tells what jobs have been submitted, what files each is trying to make, and when jobs have completed.

  • .err.log gives details about the actual jobs themselves. When submitting more than one job at once, this can become cluttered with interleaved messages from different jobs. To isolate the error message for a specific job, it may be necessary to run only a single job at a time, which can be done by setting -j 1 .

  • .out.log prints output messages from the jobs. Once again, these will be interleaved. In general, though, this file contains less information and is often less important than the error log.

Make sure to create the directories in which the log files will be placed prior to running to avoid errors.

Common Errors

Error: Directory cannot be locked - All that is required is to add --unlock to the command from earlier, as below. Then, once that finishes, run again without the --unlock option.

NAME=smk_variant; LOG=log/dirname; nohup snakemake --use-conda --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" -j 20 --unlock > $LOG/$NAME.smk.log 2>&1

MissingOutputException - Usually the run can simply be restarted and will work; the error is likely due to filesystem latency. Otherwise, if the same file continues to cause this problem, the files the rule actually writes (if any) do not match the file(s) listed under the rule's output .

JSONDecodeError - Removing the .snakemake/metadata directory (which holds JSON files) appears to fix this. The issue is documented here: https://github.com/snakemake/snakemake/issues/1342. Newer versions of Snakemake may not have this problem.

Sometimes, submitted Snakemake commands will persist even after the jobs have stopped running. To deal with this, occasionally check running processes with ps -ax | grep variant-analysis . The first column of each row shows the process ID. For a "zombie process" with ID 12345 , we could clear it with kill -9 12345 .

Code Snippets

shell: """
    awk -v START=0 'BEGIN {{PREV_START=START}} {{$4=(($2-PREV_START)*$4*0.000001)+PREV_CM; printf "%s %s %s %0.9f\n", $1, $2, $3, $4; PREV_START=$2; PREV_CM=$4}}' {input.recomb_rates_chrom} \
    > {output.genetic_map}
    """
shell: """
    liftOver {input.recomb_rates} {input.chain} {output.remapped} {output.unmapped}; \
    """
shell: """
    rfmix \
        -f {input.query_VCF} \
        -r {input.ref_VCF} \
        -m {input.sample_map} \
        -g {input.genetic_map} \
        -o {params.output_base} \
        --chromosome={wildcards.chr} \
        -e 3 \
        --reanalyze-reference \
    """
shell: """
    FILES=({input}); \
    head -n 2 ${{FILES[0]}} > {output.concat}; \
    for FILE in "${{FILES[@]}}"; do \
        sed '1d;2d' $FILE >> {output.concat}; \
    done \
    """
shell: """
    bcftools view {input.joint_vcf} \
        -s {wildcards.sample} \
        -Oz \
        -o {output.indiv_vcf} \
    """
shell: """
    bcftools annotate {input.vcf} \
        -x INFO,FORMAT \
    | bcftools view \
        -s {wildcards.sample1},{wildcards.sample2} \
    | bcftools view \
        -i 'F_MISSING=0' -H \
    | cut -f 10-11 \
    | sed 's/|/\//g' \
    | sed 's/1\/0/0\/1/g' \
    | awk -v OFS='\t' \
        'BEGIN {{print "SAMPLE1","SAMPLE2","MATCHES","MISMATCHES","CONCORDANCE"}} \
        {{if ($1 == $2) MATCH+=1; else MISMATCH+=1}} \
        END {{CONCORDANCE = MATCH/(MATCH + MISMATCH); print "{wildcards.sample1}","{wildcards.sample2}",MATCH,MISMATCH,CONCORDANCE}}' \
    > {output.concordance} \
    """
shell: """
        cat {input[0]} | head -n 1 > {output}; \
        for FILE in {input}; do \
            cat $FILE | sed '1d' | sed 's/\t\t/\t0\t/g' >> {output}; \
        done \
    """
shell: """
    bcftools view {input.vcf} \
        -s {wildcards.sample} \
    | bcftools view \
        -i "F_MISSING=0" -H \
    | wc -l \
    > {output.called_count} \
    """
shell: """
    cat {input.called_counts} > {output.combined_counts}; \
    paste {config[samples]} {output.combined_counts} > {output.tsv} \
    """
shell: "bedtools genomecov \
            -ibam {input.bam} \
            -bg > {output.counts}"
shell: "bedtools unionbedg \
            -i {input} \
            > {output}"
shell: """
    awk -v 'OFS=\t' '{{for (i=4; i<=NF; i++) {{if ($i > 0) count += 1}}; if (count/(NF - 3) > {params.cutoff}) print $1,$2,$3; count = 0}}' {input} \
    | grep '^NC' \
    | bedtools merge \
        -d 1 \
    > {output}"""
shell: """
    bedtools intersect \
        -a {input.target} \
        -b {input.ref_bed} \
    | samtools view \
    | wc -l \
    > {output}; \
    samtools view {input.target} \
    | wc -l \
    >> {output} \
    """
shell: """
    for i in {{1..2}}; \
    do awk -v i=$i 'FNR==i {{ print }}' {input.intersections} | awk '{{ sum+=$1 }} END {{ print sum }}'; \
    done \
    > {output.total} \
    """
shell: """
    bcftools view {input.vcf} \
        -R {input.bed} \
        -S {input.subset} \
        --min-ac 3 \
        -Oz \
        -o {output.vcf} \
    """
run:
    import allel
    import pandas as pd

    callset = allel.read_vcf(input.vcf, ['samples', 'calldata/GT'])
    g = allel.GenotypeArray(callset['calldata/GT'])
    nonmissing_GT = g.count_called(axis=0)/len(g)

    data = {
        "sample": callset['samples'],
        "nonmissing_GT": nonmissing_GT,
    }
    # `output` here is a Snakemake list of paths; index its first element
    with open(output[0], 'w') as f:
        f.write("# Based on the file: " + input.vcf + "\n")
    df = pd.DataFrame(data).set_index('sample')
    df.to_csv(output[0], mode="a")
shell: """
    mosdepth {params.prefix} {input.bam} \
        --by {params.window_size} \
        -n \
        --fast-mode \
        -t {threads} \
    """
shell: """
    for BED in {input.bed}; do \
        SAMPLE=$(echo $BED | cut -d "." -f 1 | rev | cut -d "/" -f 1 | rev); \
        gunzip -c $BED \
        | awk -v SAMPLE=$SAMPLE 'BEGIN {{OFS="\\t"}} {{print SAMPLE, $0}}' \
        >> {output.merged_bed}; \
    done \
    """
shell: """echo '''
Information about where these files fall in the workflow can be seen under the `.smk` files under `workflow/rules`.

The directories for variant calling from `variant_calling.smk` and `phasing.smk` are created in the following order:

├─ alignments  # Duplicate sequences are marked and sorted
|   |
|   ├─ raw  # .bam files after .fastq files aligned to reference
|   |
|   ├─ markdup  # .bam files after running fixing mates pairs and marking duplicate reads
|   |
|   └─ alignments_recalibrated  # BQSR tables applied to sequences
|       |
|       └─ recal_tables  # BQSR recalibration tables that are applied to sequences
|
├─ gvcf  # Called variants as .g.vcf files
|
├─ db  # GenomicsDB datastore consolidating .g.vcf files. Not human-readable
|
├─ joint_call  # db data converted to a joint .vcf file with VQSR applied
|
├─ vcf  # Multisample VCF
|   |
|   ├─ hard_filtered # (Option 1) Filters by hard values
|   |   |
|   |   ├─ filter_applied  # FILTER column filled in, including variants that didn't pass
|   |   |
|   |   └─ pass_only  # Only variants with FILTER=PASS
|   |
|   └─ VQSR  # (Option 2) Variant recalibration, filters by using truth and training data
|       |
|       ├─ model  # Models to be applied to VCFs
|       |
|       ├─ filter_applied  # FILTER column filled in, including variants that didn't pass
|       |
|       └─ pass_only  # Only variants with FILTER=PASS
|
├─* plink  # Contains plink files
|
├─ genotypes
|   |
|   ├─ posteriors  # Calculate PP tag (posterior probabilities) and recalculate MQ tag
|   |
|   ├─ filtered  # Set to missing genotypes with MQ < 20
|   |
|   └─ subsets  # Taking subsets of samples based on sequencing methods: GBS, WES, WGS
|
└─ haplotypes  # Phased/imputed VCFs
    |
    ├─ scaffolds  # Preliminary VCFs of genotypes incorporating pedigree information
    |
    └─ SHAPEIT4  # Phased/imputed VCFs using scaffold and original VCF


The directories for relations from `relations.smk` are as follows. `plink` is created first. The others do not follow an order.
However, these require the joint vcf file from `joint_call`:

├─ plink  # Contains plink files
|
├─ relatedness  # Pairwise estimates of relatedness between samples
|
├─ admixture  # Finds rates of admixture among samples
|   |
|   ├─ supervised  # Uses additional samples of known origin
|   |
|   └─ unsupervised  # No known origins
|
└─ aims  # Ancestry informative markers

Directories for determining kinship using LASAR and SEEKIN:

└─ kinship  # Allele frequencies and pairwise relatedness

Other:

└─ structural_variants  # Structural variants

''' > {output}
"""
shell: """echo '''
Recommended layout for resources:

├─ barcodes  # Table of barcodes for samples (used for GBS and AMP reads)
|
├─ reads  # .fastq.gz file for each sample (R1 and R2)
|
├─ ref_fna  # Reference genome
|
├─ ref_vcf  # Reference .vcf.gz files
|
└─ samples  # List of samples

''' > {output}
"""
script: "../scripts/fst.py"
shell: """
    vcftools \
        --bcf {input.bcf} \
        --weir-fst-pop {input.pop1} \
        --weir-fst-pop {input.pop2} \
        --out {params.out_prefix} \
        --fst-window-size {params.size} \
        --fst-window-step {params.step} \
    """
shell: """
    gatk --java-options '-Xmx8g' VariantFiltration \
        -V {input.vcf} \
        -O {output.filtered} \
        {params.filters} """
shell: """
    bcftools view {input.vcf} \
        -f PASS \
        -Oz \
        -o {output.vcf}"""
shell: """
    bcftools concat {input.vcfs} \
        --allow-overlaps \
        -Oz \
        -o {output} \
        --threads {threads}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools reheader \
        -s {input.sample_map} \
    | bcftools view \
        --min-ac=1 \
        -Ou \
    | bcftools norm \
        -m+any \
        -Oz \
        -o {output}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ov \
    | vcftools \
        --vcf - \
        --max-missing 0.7 \
        -c \
        --recode --recode-INFO-all \
    | bgzip > {output}"""
shell: """
    gunzip -c {input.ref_fna} \
    | grep ">" \
    | cut -d ":" -f 4,6 \
    | sed 's/:/\t/' \
    > {output.contig_lengths} \
    """
shell: """
    python3 workflow/scripts/runs_of_homozygosity.py {input.vcf} {input.contig_lengths} {output.roh_pickle} {output.froh_pickle} {wildcards.chr} \
    """
shell: """
    makeScaffold \
        --gen {input.vcf} \
        --fam {input.fam} \
        --reg {wildcards.chr} \
        --out {output.scaffold} \
    """
shell: """
    bcftools annotate {input.vcf} \
        -a {input.phased} \
        -c FORMAT/GT \
        -o {output.annotated} \
        -Oz \
    """
shell: """
    sed 's/\\t\\t/\\tNA\\t/g;s/\\t$/\\tNA/g' {input.tsv} \
    | grep -v NA$'\\t'NA \
    > {output.fam} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --reference {input.ref} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    java -Xmx50g -jar {params.jar} \
        gt={input.vcf} \
        out={output.out} \
        nthreads={threads} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    GLIMPSE2_chunk \
        --input {input.vcf} \
        --region {wildcards.chr} \
        --sequential \
        --output {output.chunks} \
        --log {log} \
        --threads {threads} \
    """
shell: """
        ORG=$(grep {wildcards.chr}:{wildcards.start}-{wildcards.end} {input.chunks} | cut -f 4); \
        GLIMPSE2_split_reference \
            --reference {input.vcf} \
            --input-region {wildcards.chr}:{wildcards.start}-{wildcards.end} \
            --output-region $ORG \
            --output {params.out_prefix} \
            --threads {threads} \
            --log {log.log}; \
    """
shell: """
    for SAMPLE in $(bcftools query -l {input.vcf}); do \
        echo {params.prefix}${{SAMPLE}}{params.suffix}; \
    done > {output.bam_list} \
    """
shell: """
    GLIMPSE2_phase \
        --bam-list {input.bam_list} \
        --reference {input.ref} \
        --output {output.vcf} \
        --threads {threads} \
    """
run:
    with open(output.txt, "w") as f:
        for line in input.chunk_files:
            f.write(line + "\n")
shell: """
    GLIMPSE2_ligate \
        --input {input.txt} \
        --output {output.vcf} \
        --threads {threads} \
    """
shell: "bwa index {input}"
shell: "samtools faidx {input}"
shell: "gatk CreateSequenceDictionary \
        -R {input.fasta}"
shell: "samtools index {input}"
shell: "bcftools index {input} \
            --tbi"
shell: "bcftools index {input} \
            --csi"
shell: """
    bcftools concat {input.vcfs} \
        -o {output.vcf} \
        -Oz \
    """
shell: """
    gatk --java-options "-Xmx8g" CalculateGenotypePosteriors \
        -V {input.vcf} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
shell: """
    gatk --java-options "-Xmx8g" VariantFiltration \
        -V {input.vcf} \
        --genotype-filter-expression "GQ < {params.GQ}" \
        --genotype-filter-name "GQ{params.GQ}" \
        --set-filtered-genotype-to-no-call \
        -O {output.vcf}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples} \
        -Ou \
    | bcftools view \
        --min-af {params.min_AF} \
        --max-af {params.max_AF} \
        --min-ac {params.min_AC} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep WGS > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep WES > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    sed 's/./&\t/3' {input.samples} \
    | awk -v OFS='\t' '{{print $2,$1}}' \
    | sort \
    | awk '$1!=p{{if(p)print s; p=$1; s=$0; next}}{{sub(p,x); s=s $0}} END {{print s}}' \
    | awk -v OFS='' '{{print $NF,$1}}' \
    | grep -v "_" \
    | sort > {output.largest_samples} \
    """
shell: """
    sed 's/AMP/AMP\t/g;s/GBS/GBS\t/g; s/WES/WES\t/g; s/WGS/WGS\t/g' {input.samples} \
    | sort -k 2 \
    | join - {input.parents} -1 2 -2 1 \
    | awk '{{print $2$1"\t"$3"\t"$4}}' \
    > {output.parents} \
    """
run:
    samples = None
    with open(input.samples, "r") as f:
        samples = f.read().strip().split("\n")
    with open(input.parents, "r") as f:
        with open(output.tsv, "a") as out:
            for line in f:
                child, sire, dam = line.strip("\n").split("\t")
                seq_type = child[:3]
                if f"WGS{sire}" in samples:
                    sire = f"WGS{sire}"
                elif f"{seq_type}{sire}" in samples:
                    sire = f"{seq_type}{sire}"
                else:
                    sire = ""
                if f"WGS{dam}" in samples:
                    dam = f"WGS{dam}"
                elif f"{seq_type}{dam}" in samples:
                    dam = f"{seq_type}{dam}"
                else:
                    dam = ""
                out.write(f"{child}\t{sire}\t{dam}\n")
shell: """
    awk 'BEGIN {{OFS="\t"}} {{print 0,$1,$2,$3,0,0}}' {input.tsv} \
    | sed 's/\t\t/\t0\t/g' \
    | sed 's/\t\t/\t0\t/g' \
    > {output.ped} \
    """
shell: """
    grep -E "(WGS.*|WES.*|GBS.*|AMP.*){{3}}" {input.tsv} \
    > {output.tsv} \
    """
shell: """
    read -r CHILD SIRE DAM <<< $(grep ^{wildcards.sample} {input.tsv}); \
    if [ -n "$SIRE" ]; \
    then SIRE=",$SIRE"; \
    fi; \
    if [ -n "$DAM" ]; \
    then DAM=",$DAM"; \
    fi; \
    bcftools view {input.vcf} \
        -s {wildcards.sample}$SIRE$DAM \
        -Oz \
        -o {output.trio} \
    """
shell: """
    bcftools annotate -x INFO,^FORMAT/GT {input.trio} \
    | bcftools view -H -i "F_MISSING==0" \
    | awk '{{print $10,$11,$12}}' \
    | python3 {params.script} \
    | sort \
    | uniq -c > {output.counts} \
    """
shell: """
    whatshap phase {input.vcf} {params.child_bam} {params.sire_bam} {params.dam_bam} \
        --reference={input.ref_fasta} \
        --ped={input.ped} \
        --tag=PS \
        -o {output.phased} \
    """
shell: """
    SAMPLE=$(bcftools query -l {input.trio} | head -n 1); \
    bcftools view {input.trio} \
        -s $SAMPLE \
        -Ou \
        -o {output.indiv} \
    """
shell: """
    bcftools merge {input.trios} \
        -Oz \
        -o {output.merged} \
    """
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset} \
    """
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.same_fid \
    """
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset}.chr{wildcards.chr} \
    """
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset}.chr{wildcards.chr} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.chr{wildcards.chr}.same_fid \
    """
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
shell: """
    vcftools \
        --gzvcf {input.vcf} \
        --het \
        --stdout \
    > {output.het} \
    """
shell: "lcmlkin \
            -i {input.vcf} \
            -o {output} \
            -g all \
            -l phred \
            -u {input.founders} \
            -t {threads}"
script: "../scripts/fst.py"
script: "../scripts/diversity.py"
    shell: "plink \
                --file  " + config["results"] + " plink/{wildcards.dataset} \
                --out " + config["results"] + " plink/recode12/{wildcards.dataset}_recode12 \
                --recode12"
shell: "admixture {input.ped} {wildcards.clusters} \
            -j{threads} \
            --cv | tee {output.out}; \
        mv {wildcards.dataset}_recode12.{wildcards.clusters}.Q {output.q}; \
        mv {wildcards.dataset}_recode12.{wildcards.clusters}.P {output.p}"
shell: "echo -e 'sample_id\tCluster_1\tCluster_2' > {output}; \
        TMP=delimited.tmp; \
        sed 's/ /\t/g' {input.q} > $TMP; \
        cut {input.ped} -d ' ' -f 2 \
        | paste - $TMP >> {output}; \
        rm $TMP;"
shell: "plink \
            --vcf {input.vcf} \
            --recode12 \
            --make-bed \
            --out {config[results]}admixture/supervised/plink/{wildcards.sample}"
shell: "admixture {input.bed} {wildcards.clusters} \
            --supervised \
            -j{threads}; \
        mv {wildcards.dataset}.{wildcards.clusters}.Q {output.q}; \
        mv {wildcards.dataset}.{wildcards.clusters}.P {output.p}"
shell: "plink \
            --bfile " + config["results"] + "admixture/supervised/plink/{wildcards.dataset} \
            --recode \
            --out " + config["results"] + "admixture/supervised/plink/{wildcards.dataset}"
shell: "echo -e 'Sample_id\tChinese\tIndian' > {output}; \
        TMP=delimited.tmp; \
        TMP2=pasted.tmp; \
        sed 's/ /\t/g' {input.q} > $TMP; \
        cut {input.ped} -d ' ' -f 2 \
        | paste - $TMP >> $TMP2; \
        tail -n 674 $TMP2 >> {output}; \
        rm $TMP; rm $TMP2"
run:
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    data = pd.read_table(str(input))
    sns.set_theme()
    ax = sns.histplot(data=data, bins=20, multiple="stack")
    ax.set(title="Ancestral Admixture in SNPRC Rhesus Macaques", xlabel="Ancestry ratio", xlim=(0.5, 1), ylabel="Number of individuals")
    plt.savefig(str(output))
shell: "awk '{{print $1,$4}}' {input.map} \
        | paste - {input.p} \
        | awk '$4 - $3 > {params.max_diff} || $3 - $4 > {params.max_diff} {{print $0}}' - \
        > {output}"
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
    > {output.bcf} \
    """
shell: """
    delly merge {input.bcfs} \
        -o {output.bcf} \
    """
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
        -v {input.bcf} \
        -o {output.bcf} \
    """
shell: """
    bcftools merge {input.bcfs}\
        -m id \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    delly filter {input.bcf} \
        -f germline \
        -o {output.bcf} \
    """
shell: """
    bcftools view {input.bcf} \
        -i "ALT='<{wildcards.SV_type}>'" \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    dicey chop {input.ref_fasta} \
        --fq1 {params.R1} \
        --fq2 {params.R2} \
    """
shell: """
    bwa mem {input.ref_fasta} {input.R1} {input.R2} \
        -t {params.bwa_threads} \
    | samtools sort \
        -@ {params.samtools_threads} \
        -o {output.bam} \
    """
shell: """
    dicey mappability2 {input.bam} \
        -o {output.map}; \
    gunzip {output.map} && bgzip {params.map} \
    """
shell: """
    delly cnv {input.bam} \
        -g {input.ref_fasta} \
        -l {input.map} \
        -o {output.bcf} \
    """
shell: """
    delly merge {input.bcfs} \
        --cnvmode \
        --pass \
        --minsize 1000 \
        --maxsize 100000 \
        -o {output.bcf} \
    """
shell: """
    delly cnv {input.bam} \
        --segmentation \
        -v {input.bcf} \
        -g {input.ref_fasta} \
        -m {input.map} \
        -o {output.bcf} \
    """
shell: """
    bcftools merge {input.bcfs} \
        -m id \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    delly classify {input.bcf} \
        --filter germline \
        -o {output.bcf} \
    """
shell: """
    # ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    # i7=$(echo $ADAPTERS | cut -d "+" -f 1);
    # i5=$(echo $ADAPTERS | cut -d "+" -f 2);
    R1_END_ADAPTER="{params.pre_i7}";  # ${{i7}}{params.post_i7}";
    R2_END_ADAPTER="{params.pre_i5}";  # ${{i5}}{params.post_i5}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
shell: """
    ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    R1_END_ADAPTER="{params.R1_adapter}";
    R2_END_ADAPTER="{params.barcode}{params.R2_adapter}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
shell: """
    bash workflow/scripts/align.sh {input.trimmed[0]} {input.trimmed[1]} {input.ref} {output} {threads} {wildcards.sample}"""
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
    | samtools markdup - {output.alignment} \
        -T ~/tmp/{rule} \
    """
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
        -o {output.alignment} \
    """
shell: """
    gatk --java-options '-Xmx8g' BaseRecalibrator \
        -R {input.ref} \
        -I {input.bam} \
        --known-sites {input.known_variants} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
shell: """
    gatk --java-options '-Xmx8g' ApplyBQSR \
        -R {input.ref} \
        -I {input.bam} \
        --bqsr-recal-file {input.recal} \
        -O {output} \
        --tmp-dir ~/tmp/{rule} \
        --create-output-bam-index false"""
shell: """
    gatk --java-options '-Xmx16g' HaplotypeCaller \
        -R {input.ref} \
        -I {input.bam} \
        -O {output.vcf} \
        -ERC GVCF \
        --native-pair-hmm-threads {threads} \
        --tmp-dir ~/tmp/{rule}"""
run:
    with open(output.sample_map, "w") as sample_map:
        for gvcf in input.gvcfs:
            sample = gvcf.split("/")[-1].split(".")[0]
            sample_map.write(f"{sample}\t{gvcf}\n")
shell: """
    zcat {input.ref} | grep "^>" | cut -c 2- | cut -d " " -f 1 | grep -x -E "^[0-9]+|X|Y|MT" > {output}
    """
shell:  """
    CONTIGS=$(awk 'BEGIN {{ORS = ","}} {{print $0}}' {input.contigs}); \
    if [ -d {params.db} ]; \
    then WORKSPACE_FLAG="genomicsdb-update-workspace-path"; \
    else WORKSPACE_FLAG="genomicsdb-workspace-path"; \
    fi; \
    gatk --java-options '-Xmx8g' GenomicsDBImport \
        --$WORKSPACE_FLAG {params.db} \
        --intervals {input.contigs} \
        --sample-name-map {input.sample_map} \
        --batch-size 50 \
        --genomicsdb-shared-posixfs-optimizations true \
        --reader-threads {threads} \
        --max-num-intervals-to-import-in-parallel {params.parallel_intervals} \
    && touch {output}
    """
shell: """
    gatk --java-options '-Xmx16g' GenotypeGVCFs \
        -R {input.ref} \
        -V gendb://{params.db} \
        -O {output.vcf} \
        -L {wildcards.chr} \
    """
shell: """
    bcftools norm {input.vcf} \
        -m-any \
        --fasta-ref {input.ref_fasta} \
        -Ou \
    | bcftools view \
        -e'type{params.equality}"snp"' \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ou \
    | bcftools view \
        -M2 \
        -m2 \
        -Oz \
        -o {output.split} \
    """
READ1=$1
READ2=$2
REF=$3
OUTPUT=$4
THREADS=$5
SAMPLE=$6

echo "In .sh file";
ID=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "ID:"$1"."$2}');
echo $ID;
echo post_id;
PU=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "PU:"$3"."$4}');
echo $PU;
echo post_pu;

echo "$SAMPLE"
echo '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA'

bwa mem $REF $READ1 $READ2 \
    -t $THREADS \
    -R '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA' \
| samtools view -Sb - > $OUTPUT
import allel
import numpy as np
import pandas as pd

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
chromosome_file = snakemake.output.chromosomes
annotation_file = snakemake.output.annotations

# Reference genome Mmul_8.0.1 chromosomes
chromosomes = {"name": ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
                        "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19",
                        "chr20", "chrX"],
               "start": [1 for i in range(1, 22)],
               "stop": [225_584_828, 204_787_373, 185_818_997, 172_585_720, 190_429_646, 180_051_392, 169_600_520, 144_306_982, 129_882_849,
               92_844_088, 133_663_169, 125_506_784, 108_979_918, 127_894_412, 111_343_173, 77_216_781, 95_684_472, 70_235_451, 53_671_032,
               74_971_481, 149_150_640]}

chromosomes_df = pd.DataFrame(data=chromosomes)
chromosomes_df.to_csv(chromosome_file, header=False, index=False, sep="\t")

# Current ids don't include any individuals with missing years.
# Colony 1
subpop = ['17509', '17515', '17511', '17514', '17516', '17584', '17534', '17535', '17527', '17536', '17525', '17553', '17546', '17552', '17551', '17557', '8796', '17563', '17572', '17570', '17567', '9793', '17562', '18314', '17578', '17581', '17582', '17521', '17575', '17588', '17583', '17585', '17740', '10244', '17590', '19325', '17591', '17597', '19323', '17606', '12908',
    '17640', '17620', '17639', '17619', '17626', '17624', '17615', '17622', '17629', '16340', '16345', '16346', '16347', '16349', '16350', '16351', '16353', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16647', '16649', '16653', '16655', '16656', '16658', '16342', '16344', '16348', '16354', '17657', '16657', '16645', '16646', '16650', '16652', '17674', '17663', '17661', '17647', '17679', '17678', '17656', '17642', '17660', '17641', '14970', '17666', '17654', '17658', '17659', '17662', '17645',
    '16339', '17701', '17684', '17710', '17680', '17723', '17732', '17733', '17730', '17714', '17717', '17745', '18400', '18408', '17133', '17724', '18416', '18401', '17737', '17736', '19223', '19224', '19150', '19177', '19312', '18987', '19315', '18419', '18994', '18421', '19316', '18990', '19314', '18983', '18985', '18422', '19699', '19702', '19852', '19857', '19853', '19846', '20064', '19689', '19695', '19845', '19691', '20077', '20082', '20078', '20080', '20083', '19851', '19856', '19688', '19700', '19318',
    '29495', '29501', '26837', '26858', '26353', '26935', '27010', '26957', '29453', '29461', '29454', '29455', '29458', '26786', '26875', '26955', '26899', '27128', '27441', '26515', '27037', '26695', '26177', '26813', '26232', '27056', '26992', '26732', '28035', '28324', '28282', '28303', '28403', '28356', '28092', '28138', '28302', '28215', '27646', '28168', '28195', '28210', '27783', '28196', '27682', '28109', '28110', '28066', '28240', '28189', '28144', '28060', '28071', '28310', '27886', '28829', '28796', '28898', '28935', '29141', '28927', '28937', '28794',
    '29576', '29372', '29429', '29445', '29555', '29532', '29364', '29362', '29560', '29437', '29436', '29507', '30618', '30142', '30009', '30125', '30118', '30019', '30204', '30024', '30122', '30128', '30139', '30285', '30003', '30034', '30086', '30739', '30649', '30506', '30513', '30633', '30732', '30644', '30663', '30742', '30736', '30735', '30643', '30876', '30881', '30651', '30956', '30638', '30755', '30565', '30855', '30505', '30654', '30610', '30641', '30848', '31310', '31204', '31209', '31225', '31163', '31127', '31414', '31357', '31160', '31397', '31236', '31143', '31229', '31154', '31351', '31137', '31144', '31161', '31194', '31147',
    '31808', '31798', '31776', '31809', '31802', '31810', '31833', '32351', '31834', '32111', '31715', '31691', '31832', '32022', '32217', '31806', '31814', '31885', '31762', '31781', '31982', '32017', '31783', '31840', '32767', '32642', '32578', '32594', '32593', '32609', '32872', '32592', '32764', '32914', '32693', '32789', '32571', '32542', '33055', '32694', '32639', '33202', '33229', '33397', '33326', '33221', '33177', '33442', '33312', '33171', '33576', '33231', '33166', '33325', '33313', '33302',
    '34746', '34921', '34353', '34561', '35034', '34717', '34901', '34986', '34696', '34722', '34362', '34223', '34186', '34723', '34764', '35422', '35955', '35325', '35351', '36041', '35454', '36089', '36033', '35949', '36063', '35444', '36066', '36032', '35964', '35443', '35948', '35947', '35958', '35975', '35956', '36806', '36671', '37218', '37005', '36793', '36992', '36931', '36906', '36894', '36878', '36881', '36862', '36875', '36672', '36805', '36932'
    ]

# Colony 2
# subpop = ['33914', '33916', '33917', '33919', '33918', '33920', '33923', '33926', '33935', '33929', '33932', '33938', '33924', '33937', '33933', '33930', '33931', '33943', '34229', '34235', '34241', '34228', '33945', '34231', '34230', '33942', '33947',
#     '33952', '33951', '33949', '33948', '34257', '34245', '34246', '34252', '34256', '34243', '34244', '34249', '34247', '34250', '34254', '34273', '34272', '34274', '34266', '34271', '34261', '34260', '34265', '34258', '34270', '34264', '34259',
#     '34283', '34285', '34282', '34281', '34276', '34284', '33959', '34277', '33960', '34280', '34296', '34298', '34294', '34297', '34293', '34292', '33963', '34291', '34295', '34290', '34289', '34286', '34288', '34299', '34287', '33965', '33964',
#     '33990', '33997', '34007', '34002', '34004', '34305', '33979', '33982', '33984', '33993', '33998', '33992', '34008', '34009', '34006', '33971', '34311', '34309', '34308', '33980', '33981', '33968', '34301', '33967', '33976', '34319', '33987', '33978', '34001', '34005', '33999', '33991', '33995', '33986', '34003', '34000', '34320', '34313', '34312', '34310', '34306', '34304', '34321', '34316', '34307', '34318', '34314', '34322', '33970', '33994', '33988', '33973', '34300', '34302', '33966', '33975', '33972', '33989', '33977', '33983', '33996', '33985', '34303', '34315', '33969', '34023', '34018', '34342', '34024', '34021', '34010', '34343', '34349', '34346', '34326', '34329', '34328', '34014', '34019', '34020', '34344', '34331', '34332', '34341', '34012', '34011', '34013', '34017', '34352', '34333', '34334', '34015',
#     '35162', '35087', '35043', '34911', '35169', '34899', '34898', '35196', '35191', '35434', '35195', '35268', '35170', '34846', '35038', '34707', '34841', '34695', '34988', '34697', '34698', '34948', '34845', '34980', '34969', '36405', '35270', '35373', '35326', '35269', '35475', '36044', '36029', '35969', '35372', '35324', '36258', '36131', '35430', '36102', '36239', '35961', '36421', '35366', '36198', '36380', '35301', '35455', '36176', '36873', '36930', '36720', '36936', '36905', '36880', '36911', '37169', '36997', '36699', '36710', '36856', '36874', '36821', '37006', '37076', '36929', '36677', '37053'
#     ]


# Pull genotype data from VCF.
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT'])

#sample_filter = np.isin(callset["samples"], subpop)
sample_subpop_indices = [list(callset["samples"]).index(id) for id in subpop]

# Create pandas dataframe for annotation file (for chromoMap)
header = ["id", "chrom", "win_start", "win_end", "diversity"]
df = pd.DataFrame(columns=header)

# Loop through each chromosome
for chrom in sorted(list(set(callset['variants/CHROM']))):
    chrom_filter = np.equal(callset['variants/CHROM'], chrom)

    # Take genotypes from current chromosome
    genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_filter])

    # Keep only genotypes from subpopulation
    genotypes = genotypes.take(sample_subpop_indices, axis=1)

    ac = genotypes.count_alleles()
    pos = callset["variants/POS"][chrom_filter]
    end = pos.max()

    pi, windows, n_bases, counts = allel.windowed_diversity(pos, ac, size=1_000_000, start=1, stop=end)

    data_dict = {"id": [f"chr{chrom}:{window[0]}" for window in windows],
                "chrom": [f"chr{chrom}"]*len(pi),
                "win_start": [window[0] for window in windows],
                "win_end": [window[1] for window in windows],
                "diversity": pi,
                }
    chr_df = pd.DataFrame.from_dict(data_dict)
    df = pd.concat([df, chr_df])

df = df.sort_values(["chrom", "win_start"])
df.to_csv(annotation_file, header=False, index=False, sep="\t")
import allel
from itertools import combinations
import numpy as np
import pandas as pd
from statistics import mean
from snakemake.shell import shell

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
pickle = snakemake.output.pickle

# Taken from config file. Needs an implementation to pull from there
subpops_dict = {
    '1980-1995': ['10235', '10244', '10250', '10814', '11452', '11491', '12001', '12147', '12540', '12543', '12908', '12932', '14392', '14393', '14396', '17504', '17505', '17507', '17508', '17509', '17511', '17514', '17515', '17516', '17521', '17522', '17525', '17527', '17534', '17535', '17536', '17537', '17539', '17544', '17546', '17551', '17552', '17553', '17557', '17562', '17563', '17566', '17567', '17568', '17569', '17570', '17572', '17573', '17575', '17578', '17581', '17582', '17583', '17584', '17585', '17588', '17590', '17591', '17597', '17606', '17740', '18314', '19323', '19325', '7139', '7893', '8796', '8896', '9038', '9590', '9710', '9793', '9908'],
    '1996-2001': ['14970', '16339', '16340', '16342', '16344', '16345', '16346', '16347', '16348', '16349', '16350', '16351', '16353', '16354', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16645', '16646', '16647', '16649', '16650', '16652', '16653', '16655', '16656', '16657', '16658', '17133', '17615', '17619', '17620', '17622', '17624', '17626', '17629', '17639', '17640', '17641', '17642', '17645', '17647', '17654', '17656', '17657', '17658', '17659', '17660', '17661', '17662', '17663', '17666', '17674', '17678', '17679', '17680', '17684', '17701', '17710', '17714', '17717', '17723', '17724', '17730', '17732', '17733', '17745', '18400', '18408', '33914', '33916', '33917', '33918', '33919', '33920'],
    '2002-2005': ['17736', '17737', '18401', '18416', '18419', '18421', '18422', '18983', '18985', '18987', '18990', '18994', '19150', '19177', '19223', '19224', '19312', '19314', '19315', '19316', '19318', '19688', '19689', '19691', '19695', '19699', '19700', '19702', '19845', '19846', '19851', '19852', '19853', '19856', '19857', '20064', '20077', '20078', '20080', '20082', '20083', '26177', '26232', '26353', '26515', '26695', '26732', '26786', '26813', '26837', '26858', '26875', '26899', '26935', '26955', '26957', '26992', '27010', '27037', '27056', '27128', '27441', '29453', '29454', '29455', '29458', '29461', '29495', '29501', '33923', '33924', '33926', '33929', '33930', '33931', '33932', '33933', '33935', '33937', '33938', '33942', '33943', '33945', '33947', '33948', '33949', '33951', '33952', '34228', '34229', '34230', '34231', '34235', '34241'],
    '2006-2009': ['27646', '27682', '27783', '27886', '28035', '28060', '28066', '28071', '28092', '28109', '28110', '28138', '28144', '28168', '28189', '28195', '28196', '28210', '28215', '28240', '28282', '28302', '28303', '28310', '28324', '28356', '28403', '28794', '28796', '28829', '28898', '28927', '28935', '28937', '29141', '29362', '29364', '29372', '29429', '29436', '29437', '29445', '29507', '29532', '29555', '29560', '29576', '30003', '30009', '30019', '30024', '30034', '30086', '30118', '30122', '30125', '30128', '30139', '30142', '30204', '30285', '30618', '33959', '33960', '34243', '34244', '34245', '34246', '34247', '34249', '34250', '34252', '34254', '34256', '34257', '34258', '34259', '34260', '34261', '34264', '34265', '34266', '34270', '34271', '34272', '34273', '34274', '34276', '34277', '34280', '34281', '34282', '34283', '34284', '34285'],
    '2010-2012': ['30505', '30506', '30513', '30565', '30610', '30633', '30638', '30641', '30643', '30644', '30649', '30651', '30654', '30663', '30732', '30735', '30736', '30739', '30742', '30755', '30848', '30855', '30876', '30881', '30956', '31127', '31137', '31143', '31144', '31147', '31154', '31160', '31161', '31163', '31194', '31204', '31209', '31225', '31229', '31236', '31310', '31351', '31357', '31397', '31414', '31691', '31715', '31762', '31776', '31781', '31783', '31798', '31802', '31806', '31808', '31809', '31810', '31814', '31832', '31833', '31834', '31840', '31885', '31982', '32017', '32022', '32111', '32217', '32351', '33963', '33964', '33965', '34286', '34287', '34288', '34289', '34290', '34291', '34292', '34293', '34294', '34295', '34296', '34297', '34298', '34299'],
    '2013': ['32542', '32571', '32578', '32592', '32593', '32594', '32609', '32639', '32642', '32693', '32694', '32764', '32767', '32789', '32872', '32914', '33055', '33966', '33967', '33968', '33969', '33970', '33971', '33972', '33973', '33975', '33976', '33977', '33978', '33979', '33980', '33981', '33982', '33983', '33984', '33985', '33986', '33987', '33988', '33989', '33990', '33991', '33992', '33993', '33994', '33995', '33996', '33997', '33998', '33999', '34000', '34001', '34002', '34003', '34004', '34005', '34006', '34007', '34008', '34009', '34300', '34301', '34302', '34303', '34304', '34305', '34306', '34307', '34308', '34309', '34310', '34311', '34312', '34313', '34314', '34315', '34316', '34318', '34319', '34320', '34321', '34322'],
    '2014-2015': ['33166', '33171', '33177', '33202', '33221', '33229', '33231', '33302', '33312', '33313', '33325', '33326', '33397', '33442', '33576', '34010', '34011', '34012', '34013', '34014', '34015', '34017', '34018', '34019', '34020', '34021', '34023', '34024', '34186', '34223', '34326', '34328', '34329', '34331', '34332', '34333', '34334', '34341', '34342', '34343', '34344', '34346', '34349', '34352', '34353', '34362', '34561', '34695', '34696', '34697', '34698', '34707', '34717', '34722', '34723', '34746', '34764', '34841', '34845', '34846', '34898', '34899', '34901', '34911', '34921', '34948', '34969', '34980', '34986', '34988', '35034', '35038', '35043', '35087', '35162', '35169', '35170', '35191', '35195', '35196', '35268', '35434'],
    '2016-2017': ['35269', '35270', '35301', '35324', '35325', '35326', '35351', '35366', '35372', '35373', '35422', '35430', '35443', '35444', '35454', '35455', '35475', '35947', '35948', '35949', '35955', '35956', '35958', '35961', '35964', '35969', '35975', '36029', '36032', '36033', '36041', '36044', '36063', '36066', '36089', '36102', '36131', '36176', '36198', '36239', '36258', '36380', '36405', '36421', '36671', '36672', '36677', '36699', '36710', '36720', '36793', '36805', '36806', '36821', '36856', '36862', '36873', '36874', '36875', '36878', '36880', '36881', '36894', '36905', '36906', '36911', '36929', '36930', '36931', '36932', '36936', '36992', '36997', '37005', '37006', '37053', '37076', '37169', '37218'],
}

# # Stringify subpops (because ids later were interpreted as strings and subpops as integers)
# stringified_subpops = []
# for subpop in subpops:
#     stringified_subpops.append([str(item) for item in subpop])
# subpops = stringified_subpops

ids = list(shell(f'bcftools query -l {vcf}', iterable=True))
print("ids:", ids)

# Find indicies of individuals
#subpops_by_indices = []
subpops_by_indices = {}
#print("subs:", subpops_dict)
for subpop_name, subpop_members in subpops_dict.items():
    print("subpop_name:", subpop_name, "subpop_members", subpop_members)
    indices = [ids.index(id) for id in subpop_members]
    #subpops_by_indices.append(indices)
    subpops_by_indices[subpop_name] = indices
print("subpops_by_indices:", subpops_by_indices)

# Pull genotype data from VCF.
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'calldata/GT'])




# Create pandas dataframe
header = ["chrom", "center", "fst", "pop1", "pop2"]
df = pd.DataFrame(columns=header)

# Loop through populations pariwise
for pair_names, pair_indices in zip(combinations(subpops_by_indices.keys(), 2), combinations(subpops_by_indices.values(), 2)):

    # Loop through each chromosome
    for chrom in sorted(list(set(callset['variants/CHROM']))):
        chrom_sites = np.equal(callset['variants/CHROM'], chrom)

        genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites])
        positions = callset['variants/POS'][chrom_sites]

        # Run Fst.
        fsts, windows, counts = allel.windowed_weir_cockerham_fst(positions, genotypes, pair_indices, size=500000)

        # Set negative Fst values to 0.
        for idx, fst in enumerate(fsts):
            if fst < 0:
                fsts[idx] = 0

        # Create chromosome dataframe
        window_centers = [mean(positions) for positions in windows]
        print("Lengths:")
        print("chrom", len([chrom]*len(fsts)))
        print("center", len(window_centers))
        print("fst", len(fsts))
        print("pop1", len([pair_names[0]]*len(fsts)))
        print("pop2", len([pair_names[1]]*len(fsts)))
        data_dict = {"chrom": [chrom]*len(fsts),
                    "center": window_centers,
                    "fst": fsts,
                    "pop1": [pair_names[1]]*len(fsts),
                    "pop2": [pair_names[0]]*len(fsts)}
        chr_df=pd.DataFrame.from_dict(data_dict)
        df = pd.concat([df, chr_df])

df = df.sort_values(["chrom", "center"])

# Store data as a pickle file.
# This can then be opened into a pandas dataframe and used to make graphs with Seaborn
df.to_pickle(pickle)
import allel
import numpy as np
import pandas as pd

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
roh_pickle = snakemake.output.roh_pickle
froh_pickle = snakemake.output.froh_pickle

# Reference genome Mmul_8.0.1 chromosomes
chromosome_lengths = {"1": 225_584_828,
               "2": 204_787_373,
               "3": 185_818_997,
               "4": 172_585_720,
               "5": 190_429_646,
               "6": 180_051_392,
               "7": 169_600_520,
               "8": 144_306_982,
               "9": 129_882_849,
               "10": 92_844_088,
               "11": 133_663_169,
               "12": 125_506_784,
               "13": 108_979_918,
               "14": 127_894_412,
               "15": 111_343_173,
               "16": 77_216_781,
               "17": 95_684_472,
               "18": 70_235_451,
               "19": 53_671_032,
               "20": 74_971_481,
               "X": 149_150_640}

# Create empty roh list for all samples
roh_list = []

# Create empty froh list, which will contain the proportion of the genome in an ROH for each sample and chromosome.
froh_list = []

# Pull genotype data from VCF
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT'])

# Loop through each sample
for sample_idx, sample in enumerate(callset['samples']):
    for chrom in sorted(list(set(callset['variants/CHROM']))):
        chrom_sites = np.equal(callset['variants/CHROM'], chrom)

        # Select genotypes by chromosome and sample
        genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites])
        genotypes = genotypes[:, sample_idx]

        positions = callset['variants/POS'][chrom_sites]

        roh_df, froh = allel.roh_poissonhmm(genotypes, positions, min_roh=1_000_000, contig_size=chromosome_lengths[chrom])

        # Record sample and chromosome so identities survive the final concat
        roh_df = roh_df.assign(sample=sample, chrom=chrom)

        # Append to samples list
        roh_list.append(roh_df)

        # Add froh for this sample and chromosome
        # (inside the loop so every chromosome is recorded, not just the last)
        froh_list.append((sample, chrom, froh))


# Store data as pickle files.
# These can then be opened into a pandas dataframe and used to make graphs with Seaborn
pd.concat(roh_list, ignore_index=True).to_pickle(roh_pickle)
pd.DataFrame(froh_list, columns=['samples', 'chrom', 'froh']).to_pickle(froh_pickle)

Repository: https://github.com/allytrope/variant-analysis