Bo_demography

Demographic analysis of Brassica oleracea

Alignment of sequence data

Follows GATK4 best practices (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145)

Implemented as a Snakemake workflow (https://snakemake.readthedocs.io/en/stable/index.html)

TODO

  • Clean up filtering steps

  • Add adjustable parameters & metrics for filtering criteria (depth, missingness)

  • More efficient use of scratch space

  • Include sample IDs & read groups in the bwa-mem step (thanks, Michelle!):
    bwa mem -R $(echo "@RG\tID:${LANE}_${SAMPLE}\tSM:$SAMPLE\tLB:${SAMPLE}.1\tPL:ILLUMINA") -a $B73FA $R1 $R2

  • Add a config file to specify samples/reference/paths/etc. (will make the workflow more general; see the sketch after this list)

    • include a variable for the scratch directory

  • Add a rule for generating a mappability mask using SNPable

  • Find a better solution to create input for ADMIXTURE (currently requires manual editing of chromosome names and SNP IDs)
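
A minimal sketch of what such a config file could look like; the keys shown (samples, ref, scratch, filtering thresholds) are assumptions for illustration, not the repository's actual schema:

    # config.yaml (hypothetical layout)
    samples: models/sample_ids.txt                       # list of sample IDs
    ref: data/external/ref/Boleracea_chromosomes.fasta   # reference genome
    scratch: /scratch/bo-demography                      # scratch directory variable
    filtering:
      min_depth: 5             # per-genotype depth bounds used in the hard filters
      max_depth: 200
      max_nocall_fraction: 0.1

Rules would then read these values through Snakemake's config object (e.g. config["scratch"]) after adding configfile: "config.yaml" to the Snakefile.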

Workflow

Snakefile provides the general organization (sample names, calls to other scripts/rules); a skeleton tying the pieces together is sketched below

/rules contains rules (*.smk) that tell Snakemake what jobs to run, organized broadly into categories:
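
A minimal skeleton of how the Snakefile might tie the rule files together; the sample IDs and the final target are illustrative assumptions:

    # Snakefile (abridged, illustrative sketch)
    SAMPLES = ["SamC_001", "SamC_002"]   # hypothetical sample IDs

    include: "rules/mapping.smk"
    include: "rules/calling.smk"
    include: "rules/filtering.smk"
    include: "rules/smc.smk"
    include: "rules/admixture.smk"

    # request the final outputs; Snakemake works backwards to schedule all jobs
    rule all:
        input:
            "data/processed/filtered_snps.vcf.gz"   # hypothetical final target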

  1. mapping.smk - map reads to the reference genome (a rule sketch follows this item)

    • download fastq files (fasterq-dump)

    • trim adapters (Trimmomatic PE)

    • map to the reference genome (bwa mem; B. oleracea HDEM reference)

    • sort BAM files (gatk/SortSam)

    • mark duplicates with Picard (gatk/MarkDuplicates)

    • add read groups with Picard (gatk/AddOrReplaceReadGroups)
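
A hedged sketch of what the read-mapping rule could look like, folding in the read-group TODO above; the rule name, paths, and thread count are illustrative, not the repository's exact code:

    # illustrative Snakemake rule (names and paths are assumptions)
    # -R attaches the sample ID and read group so downstream GATK steps can tell samples apart
    rule bwa_map:
        input:
            ref = "data/external/ref/Boleracea_chromosomes.fasta",
            r1 = "{sample}.forward.1.fastq",
            r2 = "{sample}.reverse.2.fastq"
        output:
            temp("{sample}.sam")
        threads: 8
        shell:
            "bwa mem -t {threads} "
            "-R '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}\\tPL:ILLUMINA' "
            "{input.ref} {input.r1} {input.r2} > {output}"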

  2. calling.smk - call variants (a condensed sketch of the chain follows this item)

    • identify raw SNPs/indels (gatk/HaplotypeCaller)

    • combine GVCFs (gatk/GenomicsDBImport)

    • joint genotyping (gatk/GenotypeGVCFs)
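
Condensed, the calling chain looks like this (a sketch using the same flags as the snippets below; file and interval names are illustrative):

    # per-sample GVCFs at base-pair resolution
    gatk HaplotypeCaller -R ref.fa -I sample.bam \
        --emit-ref-confidence BP_RESOLUTION -O sample.g.vcf.gz
    # combine the per-sample GVCFs into a GenomicsDB workspace for one interval
    gatk GenomicsDBImport --sample-name-map sample_map.txt \
        --intervals C1 --genomicsdb-workspace-path genomicsdb_C1
    # joint genotyping across samples, keeping invariant sites
    gatk GenotypeGVCFs -R ref.fa -V gendb://genomicsdb_C1 \
        --include-non-variant-sites -O C1.vcf.gz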

  3. filtering.smk - filter SNPs

    • select SNPs (gatk/SelectVariants)

    • apply hard filters (gatk/VariantFiltration)

    • export diagnostics (QUAL, QD, DP, MQ, etc.) (gatk/VariantsToTable)

    • remove sites with missing data and select biallelic SNPs (gatk/SelectVariants)

    • filter on individual sample depth (gatk/VariantFiltration + gatk/SelectVariants)

    • export filtered depth for each accession (gatk/VariantsToTable)

    • combine VCFs into a single file (bcftools concat)

  4. smc.smk - estimate population size history with SMC++ (a vcf2smc sketch follows this item)

    • convert VCF to SMC format - requires a mappability mask (smc++ vcf2smc)

    • fit population size history with cross-validation (smc++ cv)

    • fit population size history without cross-validation (smc++ estimate)
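
For the vcf2smc step, each chromosome and population is converted separately; a sketch with illustrative sample, population, and file names (-d sets the distinguished individual, as in the snippets below):

    smc++ vcf2smc \
        --mask mappability_mask_C1.bed.gz \
        -d SamC_001 SamC_001 \
        allsites.vcf.gz capitata.C1.smc.gz C1 capitata:SamC_001,SamC_002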

  5. admixture.smk - maximum likelihood estimation of individual ancestries (a K-selection sketch follows this item)

    • create input file (some bash commands + LD pruning in plink)

    • run admixture (admixture)
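
ADMIXTURE is typically run over a range of K and the value with the lowest cross-validation error kept; a sketch assuming the LD-pruned PLINK fileset from the previous step is named pruned:

    # run ADMIXTURE with bootstrapping and cross-validation for K = 2..10
    for K in $(seq 2 10); do
        admixture -B --cv -j4 pruned.bed $K | tee admixture_K${K}.log
    done
    # compare CV error across K; the smallest value suggests the best-supported K
    grep -h "CV error" admixture_K*.log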

To run the workflow:

  1. Create a conda environment (only need to do this once):
    conda env create -n bo-demography --file environment.yaml

    • to activate it in future sessions, start a screen session with screen and use:
      source activate bo-demography
  2. If running on a cluster, update the cluster config file (submit.json)

  3. Dry run to test that everything is in place: snakemake -n

  4. Submit the workflow with ./submit.sh (a sketch of this wrapper follows)
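
A minimal sketch of what submit.sh might contain, assuming a SLURM scheduler and the submit.json cluster config from step 2 (the resource keys are illustrative):

    #!/bin/bash
    # launch Snakemake in cluster mode, submitting one job per rule instance
    snakemake --jobs 100 \
        --cluster-config submit.json \
        --cluster "sbatch -p {cluster.partition} -t {cluster.time} --mem={cluster.mem}" \
        --rerun-incomplete \
        --latency-wait 60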

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── data           <- Scripts to download or generate data
    ├── features       <- Scripts to turn raw data into features for modeling
    ├── models         <- Scripts to train models and then use trained models to make
    │                     predictions
    └── visualization  <- Scripts to create exploratory and results oriented visualizations

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Code Snippets

The snippets below are rule bodies and helper scripts extracted from the workflow.

FastQC quality check:
run:
	shell("fastqc \
	-t {threads} \
	-o {params} \
	--noextract \
	{input}")

MultiQC reports, grouped by sequencing lane:
run:
	shell("multiqc {params.indir}/*L004* -o {params.outdir} -n STJRI01_multiqc.html")
	shell("multiqc {params.indir}/*L001* -o {params.outdir} -n STJRI02_multiqc.html")
	shell("multiqc {params.indir}/*L002* -o {params.outdir} -n STJRI03_multiqc.html")

Download and index the reference genome:
run:
    shell("wget --directory-prefix=data/external/ref \
    http://www.genoscope.cns.fr/externe/plants/data/Boleracea_chromosomes.fasta")
    shell("bwa index -a bwtsw {output.ref}")
    shell("samtools faidx {output.ref}")
    shell("gatk CreateSequenceDictionary \
    -R={output.ref} \
    -O={output.dict}")

Download FASTQs and trim adapters (snippet truncated):
run:
    print(params.sample, params.SRR)
    shell("mkdir -p {params.tmp}")
    shell("fasterq-dump \
    {params.SRR} \
    -O {params.tmp} \
    -o {params.sample} \
    -t {params.tmp} \
    -e {threads} \
    -p")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
    {params.stem}_1.fastq {params.stem}_2.fastq \
    {params.stem}.forward.1.fastq \
    {params.stem}.unpaired.1.fastq \
    {params.stem}.reverse.2.fastq \
    {params.stem}.unpaired.2.fastq \
    LEADING:3 \

Trim adapters for local FASTQ input (snippet truncated):
run:
    shell("mkdir -p {params.tmp}")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
    {input.fastq1} {input.fastq2} \
    {params.stem}.forward.1.fastq \
    {params.stem}.unpaired.1.fastq \
    {params.stem}.reverse.2.fastq \
    {params.stem}.unpaired.2.fastq \

Truncated snippet (scratch directory setup):
run:
    shell("mkdir -p {params.tmp}")

Mark duplicates:
run:
    shell("mkdir -p {params.tmp}")
    shell("gatk MarkDuplicates \
    -I={input} \
    -O={output.bam} \
    --METRICS_FILE={output.metrics} \
    --CREATE_INDEX=true \
    -MAX_FILE_HANDLES=1000 \
    --ASSUME_SORT_ORDER=coordinate \
    --TMP_DIR={params.tmp}")
    shell("rm -rf {params.tmp}")

Truncated snippet (scratch directory setup):
run:
    shell("mkdir -p {params.tmp}")

Per-sample BAM QC (Qualimap):
run:
	shell("qualimap bamqc \
	-bam {input.bam} \
	--paint-chromosome-limits \
	-gff {input.gff} \
	-nt {threads} \
	-outdir {params.dir} \
	-outformat HTML \
	--skip-duplicated")

Aggregate BAM QC (Qualimap multi-bamqc):
run:
	shell("find reports/bamqc -mindepth 1 -maxdepth 1 -type d | grep SamC > models/ALL.bamqclist.txt")
	import pandas as pd
	data = pd.read_csv("models/ALL.bamqclist.txt", sep = " ", header = None, names = ['filename'])
	data['sample'] = data['filename'].str.split('.').str[0].str.split('/').str[2].str.split('_stats').str[0]
	data = data.sort_values('sample', axis = 0, ascending = True)
	data = data[['sample','filename']]
	data.to_csv(r'models/bamqc_list.txt', header = None, index = None, sep = ' ', mode = 'a')
	shell("qualimap multi-bamqc \
	--data {params.infile} \
	--paint-chromosome-limits \
	-outdir {params.outdir} \
	-outformat html \
	--java-mem-size=40G")

Per-sample variant calling (HaplotypeCaller, base-pair resolution):
run:
	shell("gatk HaplotypeCaller \
	-I {input.bam} \
	-O {output} \
	-R {input.ref} \
	-L {params.regions} \
	-G StandardAnnotation \
	-G AS_StandardAnnotation \
	--emit-ref-confidence BP_RESOLUTION")

Split the genome into intervals for scatter-gather:
run:
	shell("gatk SplitIntervals \
	-R {input.ref} \
	-L {params.regions} \
	--scatter-count {params.intervals} \
	-O {params.outdir}")

Combine GVCFs (GenomicsDBImport):
run:
	shell("mkdir -p {params.tmp}")
	shell("gatk --java-options \"-Xmx60g -Xms60g\" \
	GenomicsDBImport \
	--genomicsdb-workspace-path {output} \
	--batch-size 50 \
	--reader-threads 6 \
	--sample-name-map {input.map} \
	--intervals {input.region} \
	--tmp-dir {params.tmp}")
	shell("rm -rf {params.tmp}")

Joint genotyping (GenotypeGVCFs):
run:
	shell("gatk GenotypeGVCFs \
	-R {input.ref} \
	-V {params.db} \
	-L {params.region} \
	-new-qual \
	-G StandardAnnotation \
	-G AS_StandardAnnotation \
	--include-non-variant-sites \
	-O {output}")

Concatenate and index per-interval VCFs:
run:
	shell("bcftools concat {input} -Oz -o {output}")
	shell("tabix -p vcf {output}")

Select biallelic SNPs:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.vcf} \
	-select-type SNP \
	--restrict-alleles-to BIALLELIC \
	-O {output}")

Select invariant sites:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.vcf} \
	-select-type NO_VARIATION \
	-O {output}")

Export diagnostics tables and plot them:
run:
	shell("gatk VariantsToTable \
	-R {input.ref} \
	-V {input.variant_vcf} \
	-F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
	-O {output.variant_stats}")
	shell("gatk VariantsToTable \
	-R {input.ref} \
	-V {input.invariant_vcf} \
	-F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
	-O {output.invariant_stats}")
	shell("Rscript src/filtering_diagnostics.R")

Flag SNPs failing hard filters:
run:
	shell("gatk VariantFiltration \
	-V {input.raw_snp_vcf} \
	--intervals {params.chr} \
	-filter \"QD < 2.0\" --filter-name \"QD2\" \
	-filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
	-filter \"SOR > 3.0\" --filter-name \"SOR3\" \
	-filter \"FS > 60.0\" --filter-name \"FS60\" \
	-filter \"MQ < 40.0\" --filter-name \"MQ40\" \
	-filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
	-filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
	-G-filter \"DP < 5 || DP > 200\" \
	-G-filter-name \"DP_5-200\" \
	-O {output.snps_flagged}")

Set filtered SNP genotypes to no-call:
run:
	shell("gatk VariantFiltration \
	-V {input.snps_flagged} \
	--set-filtered-genotype-to-no-call true \
	-O {output.snps_nocall}")

Keep biallelic SNPs that pass filters with <10% missing genotypes:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.snps_nocall} \
	-select-type SNP \
	--restrict-alleles-to BIALLELIC \
	--max-nocall-fraction 0.1 \
	--exclude-filtered true \
	-O {output.filtered_snps}")

Flag invariant sites failing hard filters:
run:
	shell("gatk VariantFiltration \
	-V {input.raw_invariant_vcf} \
	--intervals {params.chr} \
	-filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
	-filter \"MQ < 40.0\" --filter-name \"MQ40\" \
	-filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
	-filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
	-G-filter \"DP < 5 || DP > 200\" \
	-G-filter-name \"DP_5-200\" \
	-O {output.invariant_flagged}")

Set filtered invariant genotypes to no-call:
run:
	shell("gatk VariantFiltration \
	-V {input.invariant_flagged} \
	--set-filtered-genotype-to-no-call true \
	-O {output.invariant_nocall}")

Keep invariant sites with <10% missing genotypes:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.invariant_nocall} \
	--max-nocall-fraction 0.1 \
	--exclude-filtered true \
	-O {output.filtered_invariant}")

Compress and index:
run:
	shell("bgzip {input}")
	shell("tabix -p vcf {input}.gz")

Merge per-chromosome SNP VCFs:
run:
	shell("gatk MergeVcfs \
	-I={params.input} \
	-O={output.merged_snps}")

Merge SNP and invariant VCFs into an all-sites VCF:
run:
	shell("gatk MergeVcfs \
	-I={input.snps} \
	-I={input.invariant} \
	-O={output.allsites_vcf}")

Per-individual missingness and heterozygosity (vcftools):
run:
	shell("vcftools \
	--gzvcf {input.merged_snps} \
	--missing-indv \
	--out reports/filtered.qual.dp5_200.maxnocall10")
	shell("vcftools \
	--gzvcf {input.merged_snps} \
	--het \
	--out reports/filtered.qual.dp5_200.maxnocall10")

Create ADMIXTURE input (PLINK filtering + LD pruning):
run:
    shell("{params.plink2} --vcf {input.vcf} \
    --allow-extra-chr \
    --max-alleles 2 \
    --vcf-filter \
    --geno 0.1 \
    --mind {params.mind} \
    --maf 0.05 \
    --make-bed \
    --out {params.stem}")
    shell("cp {params.stem}.bim {params.stem}_temp.bim")
    shell("""sed \"s/^C//g" {params.stem}_temp.bim > {params.stem}_temp2.bim""")
    shell("./src/replace_rsids.sh {params.stem}_temp2.bim > {params.stem}.bim")
    shell("rm {params.stem}_temp*.bim")
    shell("{params.plink2} --bfile {params.stem} \
    --allow-extra-chr \
    --indep-pairwise 50 10 0.1 \
    --out {params.stem}")
    shell("{params.plink2} --bfile {params.stem} \
    --allow-extra-chr \
    --extract {params.stem}.prune.in \
    --out {params.pruned} \
    --make-bed")

Run ADMIXTURE with cross-validation:
run:
    shell("admixture -B --cv -j{threads} {input.bed} {params.k}")
    shell("mv {params.stem}.{params.k}.* models/admixture")

Windowed nucleotide diversity (pixy):
run:
    shell("tabix -p vcf -f {input.allsites_vcf}")
    shell("pixy --stats pi \
    --vcf {input.allsites_vcf} \
    --populations {input.samps} \
    --window_size {params.window_size} \
    --n_cores {threads} \
    --output_folder models/pixy \
    --output_prefix B_oleracea_grouped_{params.chr} \
    --chunk_size 50000")

Generate a mappability mask (SNPable):
run:
    shell("mkdir -p {params.tmp}")
    shell("samtools faidx {input} {params.chr} > {params.newfasta}")
    shell("{snpable}/splitfa {params.newfasta} 100 | split -l 80000000 - {params.tmp}/x")
    shell("cat {params.tmp}/x* > {params.tmp}/simu_reads.fq")
    shell("bwa aln -R 1000000 -O 3 -E 3 -t {threads} {input} {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sai")
    shell("bwa samse {input} {params.tmp}/simu_reads.sai {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sam")
    shell("perl {snpable}/gen_raw_mask.pl {params.tmp}/simu_reads.sam > {params.tmp}/raw_mask_100.fa")
    shell("{snpable}/gen_mask -l 100 -r 0.5  {params.tmp}/raw_mask_100.fa > {params.famask}")
    shell("rm -rf {params.tmp}")
    shell("./src/makeMappabilityMask.py {params.famask}")

Fit population size history (smc++ cv):
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"

Plot SMC++ results:
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"

Generate bootstrap replicates of the SMC++ input:
shell:
    "python3 src/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/bootstrap_input/{params.population}.{params.distind}_rep \
    {params.input_dir}"

Fit population size history to bootstrap replicates (smc++ cv):
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"

Plot bootstrap SMC++ results:
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"

Convert VCF to SMC++ format for a population pair (direction 1→2):
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind1} {params.distind1} \
    {input.vcf} {output.out12} {params.chrom} {params.pop_pair_string12}
    """

Convert VCF to SMC++ format for a population pair (direction 2→1):
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind2} {params.distind2} \
    {input.vcf} {output.out21} {params.chrom} {params.pop_pair_string21}
    """

Estimate divergence between populations (smc++ split):
shell:
    "smc++ split \
    --cores 8 \
    -o {params.model_out_dir} \
    {input}"

Bootstrap the joint input (direction 1→2):
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_12.{params.distind1}_rep \
    {params.input_dir}"

Bootstrap the joint input (direction 2→1):
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_21.{params.distind1}_rep \
    {params.input_dir}"

Run smc++ split on bootstrap replicates:
shell:
    "smc++ split \
    --cores 2 \
    -o {params.model_out_dir} \
    {input}"

Plot split models and their bootstrap replicates:
shell:
    """
    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.split} \
    {input.smc_split}
    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.bootstrap} \
    {input.smc_split_bootstrap}
    """

Create an excluded-sites BED for RAiSD (bedtools complement):
shell:
    """
    cat {input.allsites_vcf} | sed -e 's/chr//' | awk '{{OFS="\\t"; if (!/^#/){{print $1,$2,$2}}}}' | bedtools complement -i stdin -g {input.chr_lengths} | grep {params.chr} > {output}
    """

Decompress input:
run:
    shell("gunzip -c {input} > {output}")

Selection scan (RAiSD):
run:
    shell("RAiSD -R -s -m 0.05 -P -X {input.excluded_sites} -n {params.population} -I {input.allsites_vcf} -S {input.samples} -w 500 -f -c 2")
    shell("mv {params.out}* {params.outdir}")

Filtering diagnostics in R (src/filtering_diagnostics.R):
library(tidyverse)
library(cowplot)

VCFsnps <- read.csv('reports/filtering_bpres/all_samps.raw.snps.table',
                    header = T,
                    na.strings=c("","NA"),
                    sep = "\t")

dim(VCFsnps)

VCFinvariant <- read.csv('reports/filtering_bpres/all_samps.raw.invariant.table',
                         header = T,
                         na.strings=c("","NA"),
                         sep = "\t")

dim(VCFinvariant)

VCF <- bind_rows(list(variant = VCFsnps,
                      invariant = VCFinvariant),
                 .id = "class")

dim(VCF)

xmax <- mean(VCF$DP)*3

# colors
snps <- '#A9E2E4'
invariant <- '#F4CCCA'

DP <- ggplot(VCF, aes(x=DP, fill=class)) +
  geom_density() +
  geom_vline(xintercept=c(10,xmax)) +
  xlim(c(0, 6000)) +
  theme_bw()

QD <- ggplot(VCF, aes(x=QD, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=2, size=0.7) +
  theme_bw()

FS <- ggplot(VCF, aes(x=FS, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(60, 200), size=0.7) +
  ylim(0,0.1) +
  theme_bw()

MQ <- ggplot(VCF, aes(x=MQ, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=40, size=0.7) +
  theme_bw()

MQRankSum <- ggplot(VCF, aes(x=MQRankSum, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=-20, size=0.7) +
  theme_bw()

SOR <- ggplot(VCF, aes(x=SOR, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(4, 10), size=1, colour = c(snps, invariant)) +
  theme_bw()

ReadPosRankSum <- ggplot(VCF, aes(x=ReadPosRankSum, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(-10,10), size=1) +
  xlim(-30, 30) +
  theme_bw()

plot_grid(QD, DP, FS, MQ, MQRankSum, SOR, ReadPosRankSum, nrow=4)

ggsave("reports/filtering_bpres/unfiltered_diagnostics.png")

Bootstrap SMC++ input files (src/smc_bootstrap.py):
import click
import os
import sys
import random
import gzip
import time


@click.command()
@click.option('--nr_bootstraps',  type=int, help="nr of bootstraps [20]", default=20)
@click.option("--chunk_size", type=int, help="size of bootstrap chunks [5000000]", default=5000000)
@click.option("--chunks_per_chromosome", type=int,
              help="nr of chunks to put on one chromosome in the bootstrap [20]", default=20)
@click.option("--nr_chromosomes", type=int, help="nr of chromosomes to write [30]", default=24)
@click.option("--seed", type=int, help="initialize the random number generator", default=None)
@click.argument("out_dir_prefix")
@click.argument("files", nargs=-1)
def main(nr_bootstraps, chunk_size, chunks_per_chromosome, nr_chromosomes, seed, out_dir_prefix, files):
    chunks = []
    if not seed:
        seed = int(time.time())
    random.seed(seed)
    print('seed: %s' % seed)
    for file in files:
        print(file)
        # reset per-file state so chunk positions restart on each chromosome
        # and chunks aren't duplicated when extending the global list below
        chunks_in_chrom = []
        offset = 0
        with gzip.open(file, 'rb') as f:
            header = f.readline().decode()
            prev_chunk_idx = -1
            for line_count,line in enumerate(f):
                line = line.decode()
                line = [int(x) for x in line.strip().split()]
                if line_count == 0:
                    pos = 1
                    offset += 1
                else:
                    pos = line[0] + offset
                    offset += line[0]
                chunk_index = (pos - 1) // chunk_size
                # The code below won't work if there are super large hom. stretches in the smc input file
                # This might happen if, for example you're only analyzing part of the chromosome, and you
                # mask everything outside of that part. The last line in the input file might look like:
                # 72402527 -1 0 0
                # Which tells SMC++ to mask the final ~72MB of the chromosome.
                # I've written a check below that should deal with this and stop the for loop if the next
                # position is more than (2*chunk_size) away
                if (chunk_index - prev_chunk_idx) >= 2:
                    print('\n')
                    print('Next position is in %s bases' % pos)
                    print('Last relative position was at %s bases' % lastpos)
                    print('This might be the result of how vcf2smc masks large stretches of the vcf (say for subsampling a chromosome), or it might be a real error.\nDouble check that this input file is formatted how you intend.\nStopping "chunking" of input file and moving on\n')
                    break
                print('Processing chunk %s of input file' % chunk_index)
                ##
                if chunk_index > len(chunks_in_chrom)-1:
                    chunks_in_chrom.append([])
                chunks_in_chrom[chunk_index].append(line)
                ##
                prev_chunk_idx = chunk_index
                lastpos = pos

        chunks.extend(chunks_in_chrom)

    for bootstrap_id in range(1, nr_bootstraps +1):
        for chr_ in range(1, nr_chromosomes + 1):
            chr_dir = "{}_{}".format(out_dir_prefix, bootstrap_id)
            if not os.path.exists(chr_dir):
                os.makedirs(chr_dir)
            chr_file = "{}/bootstrap_chr{}.gz".format(chr_dir, chr_)
            print("writing", chr_file, file=sys.stderr)
            with gzip.open(chr_file, 'wb') as f:
                f.write(header.encode())
                for i in range(chunks_per_chromosome):
                    chunk_id = random.randrange(len(chunks))
                    for line in chunks[chunk_id]:
                        line = ' '.join([str(x) for x in line]) + '\n'
                        f.write(line.encode())



if __name__ == "__main__":
    main()
URL: https://github.com/mishaploid/Bo-demography
License: MIT License