A Snakemake pipeline for variant calling of genomic FASTQ data using GATK


Introduction

This project is a Snakemake workflow for processing .fastq.gz files and performing downstream analyses. The variant calling is based primarily on the GATK best practices for germline short variant discovery. Later analyses include pairwise relationship estimation and admixture.

Dependencies

conda is a package manager that can install many of the tools used by this workflow. Once conda is installed, use the following commands to obtain the required packages:

conda install mamba
mamba install -c bioconda snakemake
mamba install -c conda-forge biopython
mamba install -c anaconda pandas

Some tools still have to be installed manually outside of conda. Depending on which part of the workflow is being used, additional tools may be required as well and must be accessible through $PATH . The main one is gatk .

Other necessary libraries have their packages and versions stored in workflow/envs . These will be installed automatically in their own conda environments when running this Snakemake pipeline.

Configuration

Configuration settings are defined in a .py file inside the config directory. By default, the file template.py is active. It can be copied and renamed to create multiple config files for convenient testing with different datasets. The desired config file for a particular run is selected in workflow/Snakefile on the line that reads from template import __dict__ as config . For example, say we create a new config rhesus.py . This file would then be imported by changing template in that line to rhesus .

The following files are required in the config file for the first steps of variant calling:

  • Species reference genome

  • Raw FASTQ files (ending with R1.fastq.gz and R2.fastq.gz )

Subsequent steps require their own additional user-created files, which can all be specified in the config file.

Also, be sure to set the value of path , which determines where the resources/ and results/ directories go.

Setting Target Files

In addition to setting input files and variables inside the config files, target outputs (the files that we want to generate) must also be listed. These go inside the variable target_files . When run, Snakemake will create those files (as well as any intermediates) based on what is listed there.

Writing the target files requires finding the rule in workflow/rules/ that produces the desired files and copying its output . For instance, if I want to make BAM files, I can find the corresponding rule align , which is in variant_calling.smk , and find:

output: alignment = config["results"] + "alignments/raw/{sample}.bam",

Now, place that inside the list of target files back in our Snakefile , making sure to replace config["results"] with just results . Also replace {sample} with the actual sample name of interest.

target_files = [ results + "alignments/raw/WGS12345.bam",
]

Often, many samples will be worked on at the same time. To facilitate this, we could just add more to the target list like so:

target_files = [ results + "alignments/raw/WGS12345.bam",
                 results + "alignments/raw/WGS23456.bam",
]

A more efficient way, however, is to use the expand function:

target_files = [ expand(results + "alignments/raw/{sample}.bam", sample=["WGS12345", "WGS23456"]),
]

A couple of convenience variables are preset: SAMPLE_NAMES and CHROMOSOMES . SAMPLE_NAMES stores all of the sample names listed in config["samples"] , and CHROMOSOMES stores all numbered chromosomes plus X, Y, and MT if present in the reference genome.

target_files = [ expand(results + "alignments/raw/{sample}.bam", sample=SAMPLE_NAMES),
]

As another example, to create genotyped VCFs for each chromosome, we can look at rule genotype_passing :

output: config["results"] + "genotypes/pass/{dataset}.{mode}.chr{chr}.vcf.gz",

For our config file, we could have the following. {dataset} can essentially be named anything; just be consistent. Modifying this name is helpful for keeping multiple parallel analyses when tweaking parameters. Currently, this workflow only works with {mode} set to SNP; other modes are reserved for future work with indels, or indels and SNPs together.

target_files = [ expand(results + "genotypes/pass/rhesus.SNP.chr{chr}.vcf.gz", chr=CHROMOSOMES),
]

Running

Running snakemake directly within the project's main directory on the command line will initiate the program by calling workflow/Snakefile . To perform a "dry-run", use the command snakemake -n . This checks that the rules are functional without actually processing any data, which is helpful for quickly catching obvious problems with new rules. If an upstream file has been modified after a downstream one, Snakemake will try to redo the pipeline starting from the modified file. If the modification was accidental, this deletes the downstream files and reruns the rules between them. So it is good practice to perform a dry-run before each actual run to make sure important files won't be deleted. If a file was accidentally modified and the downstream files should not be deleted, the Unix command touch -d can be used to set the timestamp back to an earlier time. For example, setting file.txt to a timestamp of October 3rd could look like touch -d '3 Oct' file.txt .

To find the reason Snakemake is rerunning files, run snakemake -n -r . One of the proposed jobs should say "reason: Updated input files: _________". Those file(s) are the source of the problem.

On SGE Cluster

To actually run the program on a Sun Grid Engine cluster, the following command can be issued from the project directory:

NAME=smk_variant; LOG=log/dirname; nohup snakemake --profile profile --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" > $LOG/$NAME.smk.log 2>&1

Most of the Snakemake settings are stored in profile/config.yaml , which is loaded by the --profile profile option. The values there can be modified to adjust the maximum number of jobs and other settings. The --cluster option sets the options passed to the cluster and can generally be left as is. However, set values for the variables NAME and LOG . NAME will be the name given to the cluster's queue, viewable with the qstat command, and is also used for the log files. LOG is the directory in which the log files will be stored; change it as needed to help organize logs.

Locally

Ideally, use the cluster configuration above. Otherwise, to run locally, use: nohup snakemake --use-conda -c2 . The integer in -c2 sets the number of cores to use. Most rules list the number of threads they use, which can guide this value, but unlike the cluster method above, it must be changed manually. Higher values (up to what the system actually has) lead to faster processing. This command leaves the error log in a file named nohup.out in the current directory.

Log Files

In addition to generating the files given in target_files , the above cluster command will also generate log files. Check these when errors occur. There are three logs ending with the following patterns:

  • .smk.log tells what jobs have been submitted, what files each is trying to make, and when jobs have completed.

  • .err.log gives details about the actual jobs themselves. When submitting more than one job at once, this can become cluttered with interleaved messages from different jobs. To isolate the error message for a specific job, it may be necessary to run only a single job at a time, which can be done by setting -j 1 .

  • .out.log prints output messages from the jobs. Once again, these will be interleaved. In general, though, this file contains less information and is often less important than the error log.

Make sure to create the directories in which the log files will be placed prior to running to avoid errors.

Common Errors

Error: Directory cannot be locked - All that is required is to add --unlock to the command from earlier, as below. Then, once that finishes, run again without the --unlock option.

NAME=smk_variant; LOG=log/dirname; nohup snakemake --use-conda --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" -j 20 --unlock > $LOG/$NAME.smk.log 2>&1

MissingOutputException - Usually the run can simply be restarted and will work; the error is likely due to filesystem latency. Otherwise, if the same file continues to cause this problem, the files the rule actually writes (if any) do not match the file(s) listed under the rule's output .

JSONDecodeError - Removing the .snakemake/metadata directory (which holds JSON files) appears to fix this. The issue is documented here: https://github.com/snakemake/snakemake/issues/1342. Newer versions of Snakemake may not have this problem.

Sometimes, submitted Snakemake commands will persist even after the jobs have stopped running. To deal with this, occasionally check running processes with ps -ax | grep variant-analysis . The first column of each row shows the process ID. For a "zombie process" with ID 12345 , we could clear it with kill -9 12345 .

Code Snippets

shell: """
    awk -v START=0 'BEGIN {{PREV_START=START}} {{$4=(($2-PREV_START)*$4*0.000001)+PREV_CM; printf "%s %s %s %0.9f\n", $1, $2, $3, $4; PREV_START=$2; PREV_CM=$4}}' {input.recomb_rates_chrom} \
    > {output.genetic_map}
    """
shell: """
    liftOver {input.recomb_rates} {input.chain} {output.remapped} {output.unmapped}; \
    """
shell: """
    rfmix \
        -f {input.query_VCF} \
        -r {input.ref_VCF} \
        -m {input.sample_map} \
        -g {input.genetic_map} \
        -o {params.output_base} \
        --chromosome={wildcards.chr} \
        -e 3 \
        --reanalyze-reference \
    """
shell: """
    FILES=({input}); \
    head -n 2 ${{FILES[0]}} > {output.concat}; \
    for FILE in "${{FILES[@]}}"; do \
        sed '1d;2d' $FILE >> {output.concat}; \
    done \
    """
shell: """
    bcftools view {input.joint_vcf} \
        -s {wildcards.sample} \
        -Oz \
        -o {output.indiv_vcf} \
    """
shell: """
    bcftools annotate {input.vcf} \
        -x INFO,FORMAT \
    | bcftools view \
        -s {wildcards.sample1},{wildcards.sample2} \
    | bcftools view \
        -i 'F_MISSING=0' -H \
    | cut -f 10-11 \
    | sed 's/|/\//g' \
    | sed 's/1\/0/0\/1/g' \
    | awk -v OFS='\t' \
        'BEGIN {{print "SAMPLE1","SAMPLE2","MATCHES","MISMATCHES","CONCORDANCE"}} \
        {{if ($1 == $2) MATCH+=1; else MISMATCH+=1}} \
        END {{CONCORDANCE = MATCH/(MATCH + MISMATCH); print "{wildcards.sample1}","{wildcards.sample2}",MATCH,MISMATCH,CONCORDANCE}}' \
    > {output.concordance} \
    """
shell: """
        cat {input[0]} | head -n 1 > {output}; \
        for FILE in {input}; do \
            cat $FILE | sed '1d' | sed 's/\t\t/\t0\t/g' >> {output}; \
        done \
    """
shell: """
    bcftools view {input.vcf} \
        -s {wildcards.sample} \
    | bcftools view \
        -i "F_MISSING=0" -H \
    | wc -l \
    > {output.called_count} \
    """
shell: """
    cat {input.called_counts} > {output.combined_counts}; \
    paste {config[samples]} {output.combined_counts} > {output.tsv} \
    """
shell: "bedtools genomecov \
            -ibam {input.bam} \
            -bg > {output.counts}"
shell: "bedtools unionbedg \
            -i {input} \
            > {output}"
shell: """
    awk -v 'OFS=\t' '{{for (i=4; i<=NF; i++) {{if ($i > 0) count += 1}}; if (count/(NF - 3) > {params.cutoff}) print $1,$2,$3; count = 0}}' {input} \
    | grep '^NC' \
    | bedtools merge \
        -d 1 \
    > {output}"""
shell: """
    bedtools intersect \
        -a {input.target} \
        -b {input.ref_bed} \
    | samtools view \
    | wc -l \
    > {output}; \
    samtools view {input.target} \
    | wc -l \
    >> {output} \
    """
shell: """
    for i in {{1..2}}; \
    do awk -v i=$i 'FNR==i {{ print }}' {input.intersections} | awk '{{ sum+=$1 }} END {{ print sum }}'; \
    done \
    > {output.total} \
    """
shell: """
    bcftools view {input.vcf} \
        -R {input.bed} \
        -S {input.subset} \
        --min-ac 3 \
        -Oz \
        -o {output.vcf} \
    """
run:
    import allel
    import pandas as pd

    callset = allel.read_vcf(input.vcf, ['samples', 'calldata/GT'])
    g = allel.GenotypeArray(callset['calldata/GT'])
    nonmissing_GT = g.count_called(axis=0)/len(g)

    data = {
        "sample": callset['samples'],
        "nonmissing_GT": nonmissing_GT,
    }
    # `output` here is a Snakemake list of paths; index its first element
    with open(output[0], 'w') as f:
        f.write("# Based on the file: " + input.vcf + "\n")
    df = pd.DataFrame(data).set_index('sample')
    df.to_csv(output[0], mode="a")
shell: """
    mosdepth {params.prefix} {input.bam} \
        --by {params.window_size} \
        -n \
        --fast-mode \
        -t {threads} \
    """
shell: """
    for BED in {input.bed}; do \
        SAMPLE=$(echo $BED | cut -d "." -f 1 | rev | cut -d "/" -f 1 | rev); \
        gunzip -c $BED \
        | awk -v SAMPLE=$SAMPLE 'BEGIN {{OFS="\\t"}} {{print SAMPLE, $0}}' \
        >> {output.merged_bed}; \
    done \
    """
shell: """echo '''
Information about where these files fall in the workflow can be seen under the `.smk` files under `workflow/rules`.

The directories for variant calling from `variant_calling.smk` and `phasing.smk` are created in the following order:

├─ alignments  # Duplicate sequences are marked and sorted
|   |
|   ├─ raw  # .bam files after .fastq files aligned to reference
|   |
|   ├─ markdup  # .bam files after running fixing mates pairs and marking duplicate reads
|   |
|   └─ alignments_recalibrated  # BQSR tables applied to sequences
|       |
|       └─ recal_tables  # BQSR recalibration tables that are applied to sequences
|
├─ gvcf  # Called variants as .g.vcf files
|
├─ db  # GenomicsDB datastore consolidating .g.vcf files. Not human-readable
|
├─ joint_call  # db data converted to a joint .vcf file with VQSR applied
|
├─ vcf  # Multisample VCF
|   |
|   ├─ hard_filtered # (Option 1) Filters by hard values
|   |   |
|   |   ├─ filter_applied  # FILTER column filled in, including variants that didn't pass
|   |   |
|   |   └─ pass_only  # Only variants with FILTER=PASS
|   |
|   └─ VQSR  # (Option 2) Variant recalibration, filters by using truth and training data
|       |
|       ├─ model  # Models to be applied to VCFs
|       |
|       ├─ filter_applied  # FILTER column filled in, including variants that didn't pass
|       |
|       └─ pass_only  # Only variants with FILTER=PASS
|
├─* plink  # Contains plink files
|
├─ genotypes
|   |
|   ├─ posteriors  # Calculate PP tag (posterior probabilities) and recalculate MQ tag
|   |
|   ├─ filtered  # Set to missing genotypes with MQ < 20
|   |
|   └─ subsets  # Taking subsets of samples based on sequencing methods: GBS, WES, WGS
|
└─ haplotypes  # Phased/imputed VCFs
    |
    ├─ scaffolds  # Preliminary VCFs of genotypes incorporating pedigree information
    |
    └─ SHAPEIT4  # Phased/imputed VCFs using scaffold and original VCF


The directories for relations from `relations.smk` are as follows. `plink` is created first. The others do not follow an order.
However, these require the joint vcf file from `joint_call`:

├─ plink  # Contains plink files
|
├─ relatedness  # Pairwise estimates of relatedness between samples
|
├─ admixture  # Finds rates of admixture among samples
|   |
|   ├─ supervised  # Uses additional samples of known origin
|   |
|   └─ unsupervised  # No known origins
|
└─ aims  # Ancestry informative markers

Directories for determining kinship using LASAR and SEEKIN:

└─ kinship  # Allele frequencies and pairwise relatedness

Other:

└─ structural_variants  # Structural variants

''' > {output}
"""
shell: """echo '''
Recommended layout for resources:

├─ barcodes  # Table of barcodes for samples (used for GBS and AMP reads)
|
├─ reads  # .fastq.gz file for each sample (R1 and R2)
|
├─ ref_fna  # Reference genome
|
├─ ref_vcf  # Reference .vcf.gz files
|
└─ samples  # List of samples

''' > {output}
"""
script: "../scripts/fst.py"
shell: """
    vcftools \
        --bcf {input.bcf} \
        --weir-fst-pop {input.pop1} \
        --weir-fst-pop {input.pop2} \
        --out {params.out_prefix} \
        --fst-window-size {params.size} \
        --fst-window-step {params.step} \
    """
shell: """
    gatk --java-options '-Xmx8g' VariantFiltration \
        -V {input.vcf} \
        -O {output.filtered} \
        {params.filters} """
shell: """
    bcftools view {input.vcf} \
        -f PASS \
        -Oz \
        -o {output.vcf}"""
shell: """
    bcftools concat {input.vcfs} \
        --allow-overlaps \
        -Oz \
        -o {output} \
        --threads {threads}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools reheader \
        -s {input.sample_map} \
    | bcftools view \
        --min-ac=1 \
        -Ou \
    | bcftools norm \
        -m+any \
        -Oz \
        -o {output}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ov \
    | vcftools \
        --vcf - \
        --max-missing 0.7 \
        -c \
        --recode --recode-INFO-all \
    | bgzip > {output}"""
shell: """
    gunzip -c {input.ref_fna} \
    | grep ">" \
    | cut -d ":" -f 4,6 \
    | sed 's/:/\t/' \
    > {output.contig_lengths} \
    """
shell: """
    python3 workflow/scripts/runs_of_homozygosity.py {input.vcf} {input.contig_lengths} {output.roh_pickle} {output.froh_pickle} {wildcards.chr} \
    """
shell: """
    makeScaffold \
        --gen {input.vcf} \
        --fam {input.fam} \
        --reg {wildcards.chr} \
        --out {output.scaffold} \
    """
shell: """
    bcftools annotate {input.vcf} \
        -a {input.phased} \
        -c FORMAT/GT \
        -o {output.annotated} \
        -Oz \
    """
shell: """
    sed 's/\\t\\t/\\tNA\\t/g;s/\\t$/\\tNA/g' {input.tsv} \
    | grep -v NA$'\\t'NA \
    > {output.fam} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --reference {input.ref} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    java -Xmx50g -jar {params.jar} \
        gt={input.vcf} \
        out={output.out} \
        nthreads={threads} \
    """
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
shell: """
    GLIMPSE2_chunk \
        --input {input.vcf} \
        --region {wildcards.chr} \
        --sequential \
        --output {output.chunks} \
        --log {log} \
        --threads {threads} \
    """
shell: """
        ORG=$(grep {wildcards.chr}:{wildcards.start}-{wildcards.end} {input.chunks} | cut -f 4); \
        GLIMPSE2_split_reference \
            --reference {input.vcf} \
            --input-region {wildcards.chr}:{wildcards.start}-{wildcards.end} \
            --output-region $ORG \
            --output {params.out_prefix} \
            --threads {threads} \
            --log {log.log}; \
    """
shell: """
    for SAMPLE in $(bcftools query -l {input.vcf}); do \
        echo {params.prefix}${{SAMPLE}}{params.suffix}; \
    done > {output.bam_list} \
    """
shell: """
    GLIMPSE2_phase \
        --bam-list {input.bam_list} \
        --reference {input.ref} \
        --output {output.vcf} \
        --threads {threads} \
    """
run:
    with open(output.txt, "w") as f:
        for line in input.chunk_files:
            f.write(line + "\n")
shell: """
    GLIMPSE2_ligate \
        --input {input.txt} \
        --output {output.vcf} \
        --threads {threads} \
    """
shell: "bwa index {input}"
shell: "samtools faidx {input}"
shell: "gatk CreateSequenceDictionary \
        -R {input.fasta}"
shell: "samtools index {input}"
shell: "bcftools index {input} \
            --tbi"
shell: "bcftools index {input} \
            --csi"
shell: """
    bcftools concat {input.vcfs} \
        -o {output.vcf} \
        -Oz \
    """
shell: """
    gatk --java-options "-Xmx8g" CalculateGenotypePosteriors \
        -V {input.vcf} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
shell: """
    gatk --java-options "-Xmx8g" VariantFiltration \
        -V {input.vcf} \
        --genotype-filter-expression "GQ < {params.GQ}" \
        --genotype-filter-name "GQ{params.GQ}" \
        --set-filtered-genotype-to-no-call \
        -O {output.vcf}"""
shell: """
    bcftools view {input.vcf} \
        -S {input.samples} \
        -Ou \
    | bcftools view \
        --min-af {params.min_AF} \
        --max-af {params.max_AF} \
        --min-ac {params.min_AC} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep WGS > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep WES > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    sed 's/./&\t/3' {input.samples} \
    | awk -v OFS='\t' '{{print $2,$1}}' \
    | sort \
    | awk '$1!=p{{if(p)print s; p=$1; s=$0; next}}{{sub(p,x); s=s $0}} END {{print s}}' \
    | awk -v OFS='' '{{print $NF,$1}}' \
    | grep -v "_" \
    | sort > {output.largest_samples} \
    """
shell: """
    sed 's/AMP/AMP\t/g;s/GBS/GBS\t/g; s/WES/WES\t/g; s/WGS/WGS\t/g' {input.samples} \
    | sort -k 2 \
    | join - {input.parents} -1 2 -2 1 \
    | awk '{{print $2$1"\t"$3"\t"$4}}' \
    > {output.parents} \
    """
run:
    samples = None
    with open(input.samples, "r") as f:
        samples = f.read().strip().split("\n")
    with open(input.parents, "r") as f:
        with open(output.tsv, "a") as out:
            for line in f:
                child, sire, dam = line.strip("\n").split("\t")
                seq_type = child[:3]
                if f"WGS{sire}" in samples:
                    sire = f"WGS{sire}"
                elif f"{seq_type}{sire}" in samples:
                    sire = f"{seq_type}{sire}"
                else:
                    sire = ""
                if f"WGS{dam}" in samples:
                    dam = f"WGS{dam}"
                elif f"{seq_type}{dam}" in samples:
                    dam = f"{seq_type}{dam}"
                else:
                    dam = ""
                out.write(f"{child}\t{sire}\t{dam}\n")
shell: """
    awk 'BEGIN {{OFS="\t"}} {{print 0,$1,$2,$3,0,0}}' {input.tsv} \
    | sed 's/\t\t/\t0\t/g' \
    | sed 's/\t\t/\t0\t/g' \
    > {output.ped} \
    """
shell: """
    grep -E "(WGS.*|WES.*|GBS.*|AMP.*){{3}}" {input.tsv} \
    > {output.tsv} \
    """
shell: """
    read -r CHILD SIRE DAM <<< $(grep ^{wildcards.sample} {input.tsv}); \
    if [ -n "$SIRE" ]; \
    then SIRE=",$SIRE"; \
    fi; \
    if [ -n "$DAM" ]; \
    then DAM=",$DAM"; \
    fi; \
    bcftools view {input.vcf} \
        -s {wildcards.sample}$SIRE$DAM \
        -Oz \
        -o {output.trio} \
    """
shell: """
    bcftools annotate -x INFO,^FORMAT/GT {input.trio} \
    | bcftools view -H -i "F_MISSING==0" \
    | awk '{{print $10,$11,$12}}' \
    | python3 {params.script} \
    | sort \
    | uniq -c > {output.counts} \
    """
shell: """
    whatshap phase {input.vcf} {params.child_bam} {params.sire_bam} {params.dam_bam} \
        --reference={input.ref_fasta} \
        --ped={input.ped} \
        --tag=PS \
        -o {output.phased} \
    """
shell: """
    SAMPLE=$(bcftools query -l {input.trio} | head -n 1); \
    bcftools view {input.trio} \
        -s $SAMPLE \
        -Ou \
        -o {output.indiv} \
    """
shell: """
    bcftools merge {input.trios} \
        -Oz \
        -o {output.merged} \
    """
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset} \
    """
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.same_fid \
    """
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset}.chr{wildcards.chr} \
    """
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset}.chr{wildcards.chr} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.chr{wildcards.chr}.same_fid \
    """
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
shell: """
    vcftools \
        --gzvcf {input.vcf} \
        --het \
        --stdout \
    > {output.het} \
    """
shell: "lcmlkin \
            -i {input.vcf} \
            -o {output} \
            -g all \
            -l phred \
            -u {input.founders} \
            -t {threads}"
script: "../scripts/fst.py"
script: "../scripts/diversity.py"
    shell: "plink \
                --file  " + config["results"] + " plink/{wildcards.dataset} \
                --out " + config["results"] + " plink/recode12/{wildcards.dataset}_recode12 \
                --recode12"
shell: "admixture {input.ped} {wildcards.clusters} \
            -j{threads} \
            --cv | tee {output.out}; \
        mv {wildcards.dataset}_recode12.{wildcards.clusters}.Q {output.q}; \
        mv {wildcards.dataset}_recode12.{wildcards.clusters}.P {output.p}"
shell: "echo -e 'sample_id\tCluster_1\tCluster_2' > {output}; \
        TMP=delimited.tmp; \
        sed 's/ /\t/g' {input.q} > $TMP; \
        cut {input.ped} -d ' ' -f 2 \
        | paste - $TMP >> {output}; \
        rm $TMP;"
shell: "plink \
            --vcf {input.vcf} \
            --recode12 \
            --make-bed \
            --out {config[results]}admixture/supervised/plink/{wildcards.sample}"
shell: "admixture {input.bed} {wildcards.clusters} \
            --supervised \
            -j{threads}; \
        mv {wildcards.dataset}.{wildcards.clusters}.Q {output.q}; \
        mv {wildcards.dataset}.{wildcards.clusters}.P {output.p}"
shell: "plink \
            --bfile " + config["results"] + "admixture/supervised/plink/{wildcards.dataset} \
            --recode \
            --out " + config["results"] + "admixture/supervised/plink/{wildcards.dataset}"
shell: "echo -e 'Sample_id\tChinese\tIndian' > {output}; \
        TMP=delimited.tmp; \
        TMP2=pasted.tmp; \
        sed 's/ /\t/g' {input.q} > $TMP; \
        cut {input.ped} -d ' ' -f 2 \
        | paste - $TMP >> $TMP2; \
        tail -n 674 $TMP2 >> {output}; \
        rm $TMP; rm $TMP2"
run:
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    data = pd.read_table(str(input))
    sns.set_theme()
    ax = sns.histplot(data=data, bins=20, multiple="stack")
    ax.set(title="Ancestral Admixture in SNPRC Rhesus Macaques", xlabel="Ancestry ratio", xlim=(0.5, 1), ylabel="Number of individuals")
    plt.savefig(str(output))
shell: "awk '{{print $1,$4}}' {input.map} \
        | paste - {input.p} \
        | awk '$4 - $3 > {params.max_diff} || $3 - $4 > {params.max_diff} {{print $0}}' - \
        > {output}"
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
    > {output.bcf} \
    """
shell: """
    delly merge {input.bcfs} \
        -o {output.bcf} \
    """
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
        -v {input.bcf} \
        -o {output.bcf} \
    """
shell: """
    bcftools merge {input.bcfs}\
        -m id \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    delly filter {input.bcf} \
        -f germline \
        -o {output.bcf} \
    """
shell: """
    bcftools view {input.bcf} \
        -i "ALT='<{wildcards.SV_type}>'" \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    dicey chop {input.ref_fasta} \
        --fq1 {params.R1} \
        --fq2 {params.R2} \
    """
shell: """
    bwa mem {input.ref_fasta} {input.R1} {input.R2} \
        -t {params.bwa_threads} \
    | samtools sort \
        -@ {params.samtools_threads} \
        -o {output.bam} \
    """
shell: """
    dicey mappability2 {input.bam} \
        -o {output.map}; \
    gunzip {output.map} && bgzip {params.map} \
    """
shell: """
    delly cnv {input.bam} \
        -g {input.ref_fasta} \
        -l {input.map} \
        -o {output.bcf} \
    """
shell: """
    delly merge {input.bcfs} \
        --cnvmode \
        --pass \
        --minsize 1000 \
        --maxsize 100000 \
        -o {output.bcf} \
    """
shell: """
    delly cnv {input.bam} \
        --segmentation \
        -v {input.bcf} \
        -g {input.ref_fasta} \
        -m {input.map} \
        -o {output.bcf} \
    """
shell: """
    bcftools merge {input.bcfs} \
        -m id \
        -Ob \
        -o {output.bcf} \
    """
shell: """
    delly classify {input.bcf} \
        --filter germline \
        -o {output.bcf} \
    """
shell: """
    # ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    # i7=$(echo $ADAPTERS | cut -d "+" -f 1);
    # i5=$(echo $ADAPTERS | cut -d "+" -f 2);
    R1_END_ADAPTER="{params.pre_i7}";  # ${{i7}}{params.post_i7}";
    R2_END_ADAPTER="{params.pre_i5}";  # ${{i5}}{params.post_i5}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
shell: """
    ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    R1_END_ADAPTER="{params.R1_adapter}";
    R2_END_ADAPTER="{params.barcode}{params.R2_adapter}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
shell: """
    bash workflow/scripts/align.sh {input.trimmed[0]} {input.trimmed[1]} {input.ref} {output} {threads} {wildcards.sample}"""
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
    | samtools markdup - {output.alignment} \
        -T ~/tmp/{rule} \
    """
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
        -o {output.alignment} \
    """
shell: """
    gatk --java-options '-Xmx8g' BaseRecalibrator \
        -R {input.ref} \
        -I {input.bam} \
        --known-sites {input.known_variants} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
shell: """
    gatk --java-options '-Xmx8g' ApplyBQSR \
        -R {input.ref} \
        -I {input.bam} \
        --bqsr-recal-file {input.recal} \
        -O {output} \
        --tmp-dir ~/tmp/{rule} \
        --create-output-bam-index false"""
shell: """
    gatk --java-options '-Xmx16g' HaplotypeCaller \
        -R {input.ref} \
        -I {input.bam} \
        -O {output.vcf} \
        -ERC GVCF \
        --native-pair-hmm-threads {threads} \
        --tmp-dir ~/tmp/{rule}"""
run:
    with open(output.sample_map, "w") as sample_map:
        for gvcf in input.gvcfs:
            sample = gvcf.split("/")[-1].split(".")[0]
            sample_map.write(f"{sample}\t{gvcf}\n")
shell: """
    zcat {input.ref} | grep "^>" | cut -c 2- | cut -d " " -f 1 | grep -x -E "^[0-9]+|X|Y|MT" > {output}
    """
shell:  """
    CONTIGS=$(awk 'BEGIN {{ORS = ","}} {{print $0}}' {input.contigs}); \
    if [ -d {params.db} ]; \
    then WORKSPACE_FLAG="genomicsdb-update-workspace-path"; \
    else WORKSPACE_FLAG="genomicsdb-workspace-path"; \
    fi; \
    gatk --java-options '-Xmx8g' GenomicsDBImport \
        --$WORKSPACE_FLAG {params.db} \
        --intervals {input.contigs} \
        --sample-name-map {input.sample_map} \
        --batch-size 50 \
        --genomicsdb-shared-posixfs-optimizations true \
        --reader-threads {threads} \
        --max-num-intervals-to-import-in-parallel {params.parallel_intervals} \
    && touch {output}
    """
shell: """
    gatk --java-options '-Xmx16g' GenotypeGVCFs \
        -R {input.ref} \
        -V gendb://{params.db} \
        -O {output.vcf} \
        -L {wildcards.chr} \
    """
shell: """
    bcftools norm {input.vcf} \
        -m-any \
        --fasta-ref {input.ref_fasta} \
        -Ou \
    | bcftools view \
        -e'type{params.equality}"snp"' \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ou \
    | bcftools view \
        -M2 \
        -m2 \
        -Oz \
        -o {output.split} \
    """
READ1=$1
READ2=$2
REF=$3
OUTPUT=$4
THREADS=$5
SAMPLE=$6

echo "In .sh file";
ID=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "ID:"$1"."$2}');
echo $ID;
echo post_id;
PU=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "PU:"$3"."$4}');
echo $PU;
echo post_pu;

echo "$SAMPLE"
echo '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA'

bwa mem $REF $READ1 $READ2 \
    -t $THREADS \
    -R '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA' \
| samtools view -Sb - > $OUTPUT
import allel
import numpy as np
import pandas as pd

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
chromosome_file = snakemake.output.chromosomes
annotation_file = snakemake.output.annotations

# Reference genome Mmul_8.0.1 chromosomes
chromosomes = {"name": ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
                        "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19",
                        "chr20", "chrX"],
               "start": [1 for i in range(1, 22)],
               "stop": [225_584_828, 204_787_373, 185_818_997, 172_585_720, 190_429_646, 180_051_392, 169_600_520, 144_306_982, 129_882_849,
               92_844_088, 133_663_169, 125_506_784, 108_979_918, 127_894_412, 111_343_173, 77_216_781, 95_684_472, 70_235_451, 53_671_032,
               74_971_481, 149_150_640]}

chromosomes_df = pd.DataFrame(data=chromosomes)
chromosomes_df.to_csv(chromosome_file, header=False, index=False, sep="\t")

# Current ids don't include any individuals with missing years.
# Colony 1
subpop = ['17509', '17515', '17511', '17514', '17516', '17584', '17534', '17535', '17527', '17536', '17525', '17553', '17546', '17552', '17551', '17557', '8796', '17563', '17572', '17570', '17567', '9793', '17562', '18314', '17578', '17581', '17582', '17521', '17575', '17588', '17583', '17585', '17740', '10244', '17590', '19325', '17591', '17597', '19323', '17606', '12908',
    '17640', '17620', '17639', '17619', '17626', '17624', '17615', '17622', '17629', '16340', '16345', '16346', '16347', '16349', '16350', '16351', '16353', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16647', '16649', '16653', '16655', '16656', '16658', '16342', '16344', '16348', '16354', '17657', '16657', '16645', '16646', '16650', '16652', '17674', '17663', '17661', '17647', '17679', '17678', '17656', '17642', '17660', '17641', '14970', '17666', '17654', '17658', '17659', '17662', '17645',
    '16339', '17701', '17684', '17710', '17680', '17723', '17732', '17733', '17730', '17714', '17717', '17745', '18400', '18408', '17133', '17724', '18416', '18401', '17737', '17736', '19223', '19224', '19150', '19177', '19312', '18987', '19315', '18419', '18994', '18421', '19316', '18990', '19314', '18983', '18985', '18422', '19699', '19702', '19852', '19857', '19853', '19846', '20064', '19689', '19695', '19845', '19691', '20077', '20082', '20078', '20080', '20083', '19851', '19856', '19688', '19700', '19318',
    '29495', '29501', '26837', '26858', '26353', '26935', '27010', '26957', '29453', '29461', '29454', '29455', '29458', '26786', '26875', '26955', '26899', '27128', '27441', '26515', '27037', '26695', '26177', '26813', '26232', '27056', '26992', '26732', '28035', '28324', '28282', '28303', '28403', '28356', '28092', '28138', '28302', '28215', '27646', '28168', '28195', '28210', '27783', '28196', '27682', '28109', '28110', '28066', '28240', '28189', '28144', '28060', '28071', '28310', '27886', '28829', '28796', '28898', '28935', '29141', '28927', '28937', '28794',
    '29576', '29372', '29429', '29445', '29555', '29532', '29364', '29362', '29560', '29437', '29436', '29507', '30618', '30142', '30009', '30125', '30118', '30019', '30204', '30024', '30122', '30128', '30139', '30285', '30003', '30034', '30086', '30739', '30649', '30506', '30513', '30633', '30732', '30644', '30663', '30742', '30736', '30735', '30643', '30876', '30881', '30651', '30956', '30638', '30755', '30565', '30855', '30505', '30654', '30610', '30641', '30848', '31310', '31204', '31209', '31225', '31163', '31127', '31414', '31357', '31160', '31397', '31236', '31143', '31229', '31154', '31351', '31137', '31144', '31161', '31194', '31147',
    '31808', '31798', '31776', '31809', '31802', '31810', '31833', '32351', '31834', '32111', '31715', '31691', '31832', '32022', '32217', '31806', '31814', '31885', '31762', '31781', '31982', '32017', '31783', '31840', '32767', '32642', '32578', '32594', '32593', '32609', '32872', '32592', '32764', '32914', '32693', '32789', '32571', '32542', '33055', '32694', '32639', '33202', '33229', '33397', '33326', '33221', '33177', '33442', '33312', '33171', '33576', '33231', '33166', '33325', '33313', '33302',
    '34746', '34921', '34353', '34561', '35034', '34717', '34901', '34986', '34696', '34722', '34362', '34223', '34186', '34723', '34764', '35422', '35955', '35325', '35351', '36041', '35454', '36089', '36033', '35949', '36063', '35444', '36066', '36032', '35964', '35443', '35948', '35947', '35958', '35975', '35956', '36806', '36671', '37218', '37005', '36793', '36992', '36931', '36906', '36894', '36878', '36881', '36862', '36875', '36672', '36805', '36932'
    ]

# Colony 2
# subpop = ['33914', '33916', '33917', '33919', '33918', '33920', '33923', '33926', '33935', '33929', '33932', '33938', '33924', '33937', '33933', '33930', '33931', '33943', '34229', '34235', '34241', '34228', '33945', '34231', '34230', '33942', '33947',
#     '33952', '33951', '33949', '33948', '34257', '34245', '34246', '34252', '34256', '34243', '34244', '34249', '34247', '34250', '34254', '34273', '34272', '34274', '34266', '34271', '34261', '34260', '34265', '34258', '34270', '34264', '34259',
#     '34283', '34285', '34282', '34281', '34276', '34284', '33959', '34277', '33960', '34280', '34296', '34298', '34294', '34297', '34293', '34292', '33963', '34291', '34295', '34290', '34289', '34286', '34288', '34299', '34287', '33965', '33964',
#     '33990', '33997', '34007', '34002', '34004', '34305', '33979', '33982', '33984', '33993', '33998', '33992', '34008', '34009', '34006', '33971', '34311', '34309', '34308', '33980', '33981', '33968', '34301', '33967', '33976', '34319', '33987', '33978', '34001', '34005', '33999', '33991', '33995', '33986', '34003', '34000', '34320', '34313', '34312', '34310', '34306', '34304', '34321', '34316', '34307', '34318', '34314', '34322', '33970', '33994', '33988', '33973', '34300', '34302', '33966', '33975', '33972', '33989', '33977', '33983', '33996', '33985', '34303', '34315', '33969', '34023', '34018', '34342', '34024', '34021', '34010', '34343', '34349', '34346', '34326', '34329', '34328', '34014', '34019', '34020', '34344', '34331', '34332', '34341', '34012', '34011', '34013', '34017', '34352', '34333', '34334', '34015',
#     '35162', '35087', '35043', '34911', '35169', '34899', '34898', '35196', '35191', '35434', '35195', '35268', '35170', '34846', '35038', '34707', '34841', '34695', '34988', '34697', '34698', '34948', '34845', '34980', '34969', '36405', '35270', '35373', '35326', '35269', '35475', '36044', '36029', '35969', '35372', '35324', '36258', '36131', '35430', '36102', '36239', '35961', '36421', '35366', '36198', '36380', '35301', '35455', '36176', '36873', '36930', '36720', '36936', '36905', '36880', '36911', '37169', '36997', '36699', '36710', '36856', '36874', '36821', '37006', '37076', '36929', '36677', '37053'
#     ]


# Pull genotype data from VCF.
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT'])

#sample_filter = np.isin(callset["samples"], subpop)
sample_subpop_indices = [list(callset["samples"]).index(id) for id in subpop]

# Create pandas dataframe for annotation file (for chromoMap)
header = ["id", "chrom", "win_start", "win_end", "diversity"]
df = pd.DataFrame(columns=header)

# Loop through each chromosome
for chrom in sorted(list(set(callset['variants/CHROM']))):
    chrom_filter = np.equal(callset['variants/CHROM'], chrom)

    # Take genotypes from current chromosome
    genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_filter])

    # Keep only genotypes from subpopulation
    genotypes = genotypes.take(sample_subpop_indices, axis=1)

    ac = genotypes.count_alleles()
    pos = callset["variants/POS"][chrom_filter]
    end = pos.max()

    pi, windows, n_bases, counts = allel.windowed_diversity(pos, ac, size=1_000_000, start=1, stop=end)

    data_dict = {"id": [f"chr{chrom}:{window[0]}" for window in windows],
                "chrom": [f"chr{chrom}"]*len(pi),
                "win_start": [window[0] for window in windows],
                "win_end": [window[1] for window in windows],
                "diversity": pi,
                }
    chr_df = pd.DataFrame.from_dict(data_dict)
    df = pd.concat([df, chr_df])

df = df.sort_values(["chrom", "win_start"])
df.to_csv(annotation_file, header=False, index=False, sep="\t")
import allel
from itertools import combinations
import numpy as np
import pandas as pd
from statistics import mean
from snakemake.shell import shell

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
pickle = snakemake.output.pickle

# Taken from config file. Needs an implementation to pull from there
subpops_dict = {
    '1980-1995': ['10235', '10244', '10250', '10814', '11452', '11491', '12001', '12147', '12540', '12543', '12908', '12932', '14392', '14393', '14396', '17504', '17505', '17507', '17508', '17509', '17511', '17514', '17515', '17516', '17521', '17522', '17525', '17527', '17534', '17535', '17536', '17537', '17539', '17544', '17546', '17551', '17552', '17553', '17557', '17562', '17563', '17566', '17567', '17568', '17569', '17570', '17572', '17573', '17575', '17578', '17581', '17582', '17583', '17584', '17585', '17588', '17590', '17591', '17597', '17606', '17740', '18314', '19323', '19325', '7139', '7893', '8796', '8896', '9038', '9590', '9710', '9793', '9908'],
    '1996-2001': ['14970', '16339', '16340', '16342', '16344', '16345', '16346', '16347', '16348', '16349', '16350', '16351', '16353', '16354', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16645', '16646', '16647', '16649', '16650', '16652', '16653', '16655', '16656', '16657', '16658', '17133', '17615', '17619', '17620', '17622', '17624', '17626', '17629', '17639', '17640', '17641', '17642', '17645', '17647', '17654', '17656', '17657', '17658', '17659', '17660', '17661', '17662', '17663', '17666', '17674', '17678', '17679', '17680', '17684', '17701', '17710', '17714', '17717', '17723', '17724', '17730', '17732', '17733', '17745', '18400', '18408', '33914', '33916', '33917', '33918', '33919', '33920'],
    '2002-2005': ['17736', '17737', '18401', '18416', '18419', '18421', '18422', '18983', '18985', '18987', '18990', '18994', '19150', '19177', '19223', '19224', '19312', '19314', '19315', '19316', '19318', '19688', '19689', '19691', '19695', '19699', '19700', '19702', '19845', '19846', '19851', '19852', '19853', '19856', '19857', '20064', '20077', '20078', '20080', '20082', '20083', '26177', '26232', '26353', '26515', '26695', '26732', '26786', '26813', '26837', '26858', '26875', '26899', '26935', '26955', '26957', '26992', '27010', '27037', '27056', '27128', '27441', '29453', '29454', '29455', '29458', '29461', '29495', '29501', '33923', '33924', '33926', '33929', '33930', '33931', '33932', '33933', '33935', '33937', '33938', '33942', '33943', '33945', '33947', '33948', '33949', '33951', '33952', '34228', '34229', '34230', '34231', '34235', '34241'],
    '2006-2009': ['27646', '27682', '27783', '27886', '28035', '28060', '28066', '28071', '28092', '28109', '28110', '28138', '28144', '28168', '28189', '28195', '28196', '28210', '28215', '28240', '28282', '28302', '28303', '28310', '28324', '28356', '28403', '28794', '28796', '28829', '28898', '28927', '28935', '28937', '29141', '29362', '29364', '29372', '29429', '29436', '29437', '29445', '29507', '29532', '29555', '29560', '29576', '30003', '30009', '30019', '30024', '30034', '30086', '30118', '30122', '30125', '30128', '30139', '30142', '30204', '30285', '30618', '33959', '33960', '34243', '34244', '34245', '34246', '34247', '34249', '34250', '34252', '34254', '34256', '34257', '34258', '34259', '34260', '34261', '34264', '34265', '34266', '34270', '34271', '34272', '34273', '34274', '34276', '34277', '34280', '34281', '34282', '34283', '34284', '34285'],
    '2010-2012': ['30505', '30506', '30513', '30565', '30610', '30633', '30638', '30641', '30643', '30644', '30649', '30651', '30654', '30663', '30732', '30735', '30736', '30739', '30742', '30755', '30848', '30855', '30876', '30881', '30956', '31127', '31137', '31143', '31144', '31147', '31154', '31160', '31161', '31163', '31194', '31204', '31209', '31225', '31229', '31236', '31310', '31351', '31357', '31397', '31414', '31691', '31715', '31762', '31776', '31781', '31783', '31798', '31802', '31806', '31808', '31809', '31810', '31814', '31832', '31833', '31834', '31840', '31885', '31982', '32017', '32022', '32111', '32217', '32351', '33963', '33964', '33965', '34286', '34287', '34288', '34289', '34290', '34291', '34292', '34293', '34294', '34295', '34296', '34297', '34298', '34299'],
    '2013': ['32542', '32571', '32578', '32592', '32593', '32594', '32609', '32639', '32642', '32693', '32694', '32764', '32767', '32789', '32872', '32914', '33055', '33966', '33967', '33968', '33969', '33970', '33971', '33972', '33973', '33975', '33976', '33977', '33978', '33979', '33980', '33981', '33982', '33983', '33984', '33985', '33986', '33987', '33988', '33989', '33990', '33991', '33992', '33993', '33994', '33995', '33996', '33997', '33998', '33999', '34000', '34001', '34002', '34003', '34004', '34005', '34006', '34007', '34008', '34009', '34300', '34301', '34302', '34303', '34304', '34305', '34306', '34307', '34308', '34309', '34310', '34311', '34312', '34313', '34314', '34315', '34316', '34318', '34319', '34320', '34321', '34322'],
    '2014-2015': ['33166', '33171', '33177', '33202', '33221', '33229', '33231', '33302', '33312', '33313', '33325', '33326', '33397', '33442', '33576', '34010', '34011', '34012', '34013', '34014', '34015', '34017', '34018', '34019', '34020', '34021', '34023', '34024', '34186', '34223', '34326', '34328', '34329', '34331', '34332', '34333', '34334', '34341', '34342', '34343', '34344', '34346', '34349', '34352', '34353', '34362', '34561', '34695', '34696', '34697', '34698', '34707', '34717', '34722', '34723', '34746', '34764', '34841', '34845', '34846', '34898', '34899', '34901', '34911', '34921', '34948', '34969', '34980', '34986', '34988', '35034', '35038', '35043', '35087', '35162', '35169', '35170', '35191', '35195', '35196', '35268', '35434'],
    '2016-2017': ['35269', '35270', '35301', '35324', '35325', '35326', '35351', '35366', '35372', '35373', '35422', '35430', '35443', '35444', '35454', '35455', '35475', '35947', '35948', '35949', '35955', '35956', '35958', '35961', '35964', '35969', '35975', '36029', '36032', '36033', '36041', '36044', '36063', '36066', '36089', '36102', '36131', '36176', '36198', '36239', '36258', '36380', '36405', '36421', '36671', '36672', '36677', '36699', '36710', '36720', '36793', '36805', '36806', '36821', '36856', '36862', '36873', '36874', '36875', '36878', '36880', '36881', '36894', '36905', '36906', '36911', '36929', '36930', '36931', '36932', '36936', '36992', '36997', '37005', '37006', '37053', '37076', '37169', '37218'],
}

# # Stringify subpops (because ids later were interpreted as strings and subpops as integers)
# stringified_subpops = []
# for subpop in subpops:
#     stringified_subpops.append([str(item) for item in subpop])
# subpops = stringified_subpops

ids = list(shell(f'bcftools query -l {vcf}', iterable=True))
print("ids:", ids)

# Find indicies of individuals
#subpops_by_indices = []
subpops_by_indices = {}
#print("subs:", subpops_dict)
for subpop_name, subpop_members in subpops_dict.items():
    print("subpop_name:", subpop_name, "subpop_members", subpop_members)
    indices = [ids.index(id) for id in subpop_members]
    #subpops_by_indices.append(indices)
    subpops_by_indices[subpop_name] = indices
print("subpops_by_indices:", subpops_by_indices)

# Pull genotype data from VCF.
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'calldata/GT'])




# Create pandas dataframe
header = ["chrom", "center", "fst", "pop1", "pop2"]
df = pd.DataFrame(columns=header)

# Loop through populations pariwise
for pair_names, pair_indices in zip(combinations(subpops_by_indices.keys(), 2), combinations(subpops_by_indices.values(), 2)):

    # Loop through each chromosome
    for chrom in sorted(list(set(callset['variants/CHROM']))):
        chrom_sites = np.equal(callset['variants/CHROM'], chrom)

        genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites])
        positions = callset['variants/POS'][chrom_sites]

        # Run Fst.
        fsts, windows, counts = allel.windowed_weir_cockerham_fst(positions, genotypes, pair_indices, size=500000)

        # Set negative Fst values to 0.
        for idx, fst in enumerate(fsts):
            if fst < 0:
                fsts[idx] = 0

        # Create chromosome dataframe
        window_centers = [mean(positions) for positions in windows]
        print("Lengths:")
        print("chrom", len([chrom]*len(fsts)))
        print("center", len(window_centers))
        print("fst", len(fsts))
        print("pop1", len([pair_names[0]]*len(fsts)))
        print("pop2", len([pair_names[1]]*len(fsts)))
        data_dict = {"chrom": [chrom]*len(fsts),
                    "center": window_centers,
                    "fst": fsts,
                    "pop1": [pair_names[1]]*len(fsts),
                    "pop2": [pair_names[0]]*len(fsts)}
        chr_df=pd.DataFrame.from_dict(data_dict)
        df = pd.concat([df, chr_df])

df = df.sort_values(["chrom", "center"])

# Store data as a pickle file.
# This can then be opened into a pandas dataframe and used to make graphs with Seaborn
df.to_pickle(pickle)
import allel
import numpy as np
import pandas as pd

# Variables taken from Snakemake rule
vcf = snakemake.input.vcf
roh_pickle = snakemake.output.roh_pickle
froh_pickle = snakemake.output.froh_pickle

# Reference genome Mmul_8.0.1 chromosomes
chromosome_lengths = {"1": 225_584_828,
               "2": 204_787_373,
               "3": 185_818_997,
               "4": 172_585_720,
               "5": 190_429_646,
               "6": 180_051_392,
               "7": 169_600_520,
               "8": 144_306_982,
               "9": 129_882_849,
               "10": 92_844_088,
               "11": 133_663_169,
               "12": 125_506_784,
               "13": 108_979_918,
               "14": 127_894_412,
               "15": 111_343_173,
               "16": 77_216_781,
               "17": 95_684_472,
               "18": 70_235_451,
               "19": 53_671_032,
               "20": 74_971_481,
               "X": 149_150_640}

# Create empty roh list for all samples
roh_list = []

# Create empty froh list, which will contain the proportion of the genome in an ROH for each sample and chromosome.
froh_list = []

# Pull genotype data from VCF
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT'])

# Loop through each sample
for sample_idx, sample in enumerate(callset['samples']):
    for chrom in sorted(list(set(callset['variants/CHROM']))):
        chrom_sites = np.equal(callset['variants/CHROM'], chrom)

        # Select genotypes by chromosome and sample
        genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites])
        genotypes = genotypes[:, sample_idx]

        positions = callset['variants/POS'][chrom_sites]

        roh_df, froh = allel.roh_poissonhmm(genotypes, positions, min_roh=1_000_000, contig_size=chromosome_lengths[chrom])

        # Record sample and chromosome so identities survive the final concat
        roh_df = roh_df.assign(sample=sample, chrom=chrom)

        # Append to samples list
        roh_list.append(roh_df)

        # Add froh for this sample and chromosome
        # (inside the loop so every chromosome is recorded, not just the last)
        froh_list.append((sample, chrom, froh))


# Store data as pickle files.
# These can then be opened into a pandas dataframe and used to make graphs with Seaborn
pd.concat(roh_list, ignore_index=True).to_pickle(roh_pickle)
pd.DataFrame(froh_list, columns=['samples', 'chrom', 'froh']).to_pickle(froh_pickle)

Repository: https://github.com/allytrope/variant-analysis