Bo_demography

Demographic analysis of Brassica oleracea

Alignment of sequence data

Follows GATK4 best practices (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145)

Implemented as a Snakemake workflow (https://snakemake.readthedocs.io/en/stable/index.html)

TODO

  • Clean up filtering steps

  • Add adjustable parameters & metrics for filtering criteria (depth, missingness)

  • More efficient use of scratch space

  • Include sample IDs & read groups in the bwa-mem step (thanks, Michelle!):
    bwa mem -R $(echo "@RG\tID:${LANE}_${SAMPLE}\tSM:$SAMPLE\tLB:${SAMPLE}.1\tPL:ILLUMINA") -a $B73FA $R1 $R2

  • Add a config file to specify samples/reference/paths/etc. (will make the workflow more general; see the sketch after this list)

    • include a variable for the scratch directory

  • Add a rule for generating a mappability mask using SNPable

  • Find a better solution to create input for ADMIXTURE (currently requires manual editing of chromosome names and SNP IDs)
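
A minimal sketch of what such a config file could look like; the keys shown (samples, ref, scratch, filtering thresholds) are assumptions for illustration, not the repository's actual schema:

    # config.yaml (hypothetical layout)
    samples: models/sample_ids.txt                       # list of sample IDs
    ref: data/external/ref/Boleracea_chromosomes.fasta   # reference genome
    scratch: /scratch/bo-demography                      # scratch directory variable
    filtering:
      min_depth: 5             # per-genotype depth bounds used in the hard filters
      max_depth: 200
      max_nocall_fraction: 0.1

Rules would then read these values through Snakemake's config object (e.g. config["scratch"]) after adding configfile: "config.yaml" to the Snakefile.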

Workflow

Snakefile provides the general organization (sample names, calls to other scripts/rules); a skeleton tying the pieces together is sketched below

/rules contains rules (*.smk) that tell Snakemake what jobs to run, organized broadly into categories:
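
A minimal skeleton of how the Snakefile might tie the rule files together; the sample IDs and the final target are illustrative assumptions:

    # Snakefile (abridged, illustrative sketch)
    SAMPLES = ["SamC_001", "SamC_002"]   # hypothetical sample IDs

    include: "rules/mapping.smk"
    include: "rules/calling.smk"
    include: "rules/filtering.smk"
    include: "rules/smc.smk"
    include: "rules/admixture.smk"

    # request the final outputs; Snakemake works backwards to schedule all jobs
    rule all:
        input:
            "data/processed/filtered_snps.vcf.gz"   # hypothetical final target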

  1. mapping.smk - map reads to the reference genome (a rule sketch follows this item)

    • download fastq files (fasterq-dump)

    • trim adapters (Trimmomatic PE)

    • map to the reference genome (bwa mem; B. oleracea HDEM reference)

    • sort BAM files (gatk/SortSam)

    • mark duplicates with Picard (gatk/MarkDuplicates)

    • add read groups with Picard (gatk/AddOrReplaceReadGroups)
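
A hedged sketch of what the read-mapping rule could look like, folding in the read-group TODO above; the rule name, paths, and thread count are illustrative, not the repository's exact code:

    # illustrative Snakemake rule (names and paths are assumptions)
    # -R attaches the sample ID and read group so downstream GATK steps can tell samples apart
    rule bwa_map:
        input:
            ref = "data/external/ref/Boleracea_chromosomes.fasta",
            r1 = "{sample}.forward.1.fastq",
            r2 = "{sample}.reverse.2.fastq"
        output:
            temp("{sample}.sam")
        threads: 8
        shell:
            "bwa mem -t {threads} "
            "-R '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}\\tPL:ILLUMINA' "
            "{input.ref} {input.r1} {input.r2} > {output}"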

  2. calling.smk - call variants (a condensed sketch of the chain follows this item)

    • identify raw SNPs/indels (gatk/HaplotypeCaller)

    • combine GVCFs (gatk/GenomicsDBImport)

    • joint genotyping (gatk/GenotypeGVCFs)
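
Condensed, the calling chain looks like this (a sketch using the same flags as the snippets below; file and interval names are illustrative):

    # per-sample GVCFs at base-pair resolution
    gatk HaplotypeCaller -R ref.fa -I sample.bam \
        --emit-ref-confidence BP_RESOLUTION -O sample.g.vcf.gz
    # combine the per-sample GVCFs into a GenomicsDB workspace for one interval
    gatk GenomicsDBImport --sample-name-map sample_map.txt \
        --intervals C1 --genomicsdb-workspace-path genomicsdb_C1
    # joint genotyping across samples, keeping invariant sites
    gatk GenotypeGVCFs -R ref.fa -V gendb://genomicsdb_C1 \
        --include-non-variant-sites -O C1.vcf.gz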

  3. filtering.smk - filter SNPs

    • select SNPs (gatk/SelectVariants)

    • apply hard filters (gatk/VariantFiltration)

    • export diagnostics (QUAL, QD, DP, MQ, etc.) (gatk/VariantsToTable)

    • remove sites with missing data and select biallelic SNPs (gatk/SelectVariants)

    • filter on individual sample depth (gatk/VariantFiltration + gatk/SelectVariants)

    • export filtered depth for each accession (gatk/VariantsToTable)

    • combine VCFs into a single file (bcftools concat)

  4. smc.smk - estimate population size history with SMC++ (a vcf2smc sketch follows this item)

    • convert VCF to SMC format - requires a mappability mask (smc++ vcf2smc)

    • fit population size history with cross-validation (smc++ cv)

    • fit population size history without cross-validation (smc++ estimate)
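
For the vcf2smc step, each chromosome and population is converted separately; a sketch with illustrative sample, population, and file names (-d sets the distinguished individual, as in the snippets below):

    smc++ vcf2smc \
        --mask mappability_mask_C1.bed.gz \
        -d SamC_001 SamC_001 \
        allsites.vcf.gz capitata.C1.smc.gz C1 capitata:SamC_001,SamC_002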

  5. admixture.smk - maximum likelihood estimation of individual ancestries (a K-selection sketch follows this item)

    • create input file (some bash commands + LD pruning in plink)

    • run admixture (admixture)
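
ADMIXTURE is typically run over a range of K and the value with the lowest cross-validation error kept; a sketch assuming the LD-pruned PLINK fileset from the previous step is named pruned:

    # run ADMIXTURE with bootstrapping and cross-validation for K = 2..10
    for K in $(seq 2 10); do
        admixture -B --cv -j4 pruned.bed $K | tee admixture_K${K}.log
    done
    # compare CV error across K; the smallest value suggests the best-supported K
    grep -h "CV error" admixture_K*.log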

To run the workflow:

  1. Create a conda environment (only need to do this once):
    conda env create -n bo-demography --file environment.yaml

    • to activate it in future sessions, start a screen session with screen and use:
      source activate bo-demography
  2. If running on a cluster, update the cluster config file (submit.json)

  3. Dry run to test that everything is in place: snakemake -n

  4. Submit the workflow with ./submit.sh (a sketch of this wrapper follows)
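
A minimal sketch of what submit.sh might contain, assuming a SLURM scheduler and the submit.json cluster config from step 2 (the resource keys are illustrative):

    #!/bin/bash
    # launch Snakemake in cluster mode, submitting one job per rule instance
    snakemake --jobs 100 \
        --cluster-config submit.json \
        --cluster "sbatch -p {cluster.partition} -t {cluster.time} --mem={cluster.mem}" \
        --rerun-incomplete \
        --latency-wait 60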

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── data           <- Scripts to download or generate data
    ├── features       <- Scripts to turn raw data into features for modeling
    ├── models         <- Scripts to train models and then use trained models to make
    │                     predictions
    └── visualization  <- Scripts to create exploratory and results oriented visualizations

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Code Snippets

The snippets below are rule bodies and helper scripts extracted from the workflow.

FastQC quality check:
run:
	shell("fastqc \
	-t {threads} \
	-o {params} \
	--noextract \
	{input}")

MultiQC reports, grouped by sequencing lane:
run:
	shell("multiqc {params.indir}/*L004* -o {params.outdir} -n STJRI01_multiqc.html")
	shell("multiqc {params.indir}/*L001* -o {params.outdir} -n STJRI02_multiqc.html")
	shell("multiqc {params.indir}/*L002* -o {params.outdir} -n STJRI03_multiqc.html")

Download and index the reference genome:
run:
    shell("wget --directory-prefix=data/external/ref \
    http://www.genoscope.cns.fr/externe/plants/data/Boleracea_chromosomes.fasta")
    shell("bwa index -a bwtsw {output.ref}")
    shell("samtools faidx {output.ref}")
    shell("gatk CreateSequenceDictionary \
    -R={output.ref} \
    -O={output.dict}")

Download FASTQs and trim adapters (snippet truncated):
run:
    print(params.sample, params.SRR)
    shell("mkdir -p {params.tmp}")
    shell("fasterq-dump \
    {params.SRR} \
    -O {params.tmp} \
    -o {params.sample} \
    -t {params.tmp} \
    -e {threads} \
    -p")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
    {params.stem}_1.fastq {params.stem}_2.fastq \
    {params.stem}.forward.1.fastq \
    {params.stem}.unpaired.1.fastq \
    {params.stem}.reverse.2.fastq \
    {params.stem}.unpaired.2.fastq \
    LEADING:3 \

Trim adapters for local FASTQ input (snippet truncated):
run:
    shell("mkdir -p {params.tmp}")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
    {input.fastq1} {input.fastq2} \
    {params.stem}.forward.1.fastq \
    {params.stem}.unpaired.1.fastq \
    {params.stem}.reverse.2.fastq \
    {params.stem}.unpaired.2.fastq \

Truncated snippet (scratch directory setup):
run:
    shell("mkdir -p {params.tmp}")

Mark duplicates:
run:
    shell("mkdir -p {params.tmp}")
    shell("gatk MarkDuplicates \
    -I={input} \
    -O={output.bam} \
    --METRICS_FILE={output.metrics} \
    --CREATE_INDEX=true \
    -MAX_FILE_HANDLES=1000 \
    --ASSUME_SORT_ORDER=coordinate \
    --TMP_DIR={params.tmp}")
    shell("rm -rf {params.tmp}")

Truncated snippet (scratch directory setup):
run:
    shell("mkdir -p {params.tmp}")

Per-sample BAM QC (Qualimap):
run:
	shell("qualimap bamqc \
	-bam {input.bam} \
	--paint-chromosome-limits \
	-gff {input.gff} \
	-nt {threads} \
	-outdir {params.dir} \
	-outformat HTML \
	--skip-duplicated")

Aggregate BAM QC (Qualimap multi-bamqc):
run:
	shell("find reports/bamqc -mindepth 1 -maxdepth 1 -type d | grep SamC > models/ALL.bamqclist.txt")
	import pandas as pd
	data = pd.read_csv("models/ALL.bamqclist.txt", sep = " ", header = None, names = ['filename'])
	data['sample'] = data['filename'].str.split('.').str[0].str.split('/').str[2].str.split('_stats').str[0]
	data = data.sort_values('sample', axis = 0, ascending = True)
	data = data[['sample','filename']]
	data.to_csv(r'models/bamqc_list.txt', header = None, index = None, sep = ' ', mode = 'a')
	shell("qualimap multi-bamqc \
	--data {params.infile} \
	--paint-chromosome-limits \
	-outdir {params.outdir} \
	-outformat html \
	--java-mem-size=40G")

Per-sample variant calling (HaplotypeCaller, base-pair resolution):
run:
	shell("gatk HaplotypeCaller \
	-I {input.bam} \
	-O {output} \
	-R {input.ref} \
	-L {params.regions} \
	-G StandardAnnotation \
	-G AS_StandardAnnotation \
	--emit-ref-confidence BP_RESOLUTION")

Split the genome into intervals for scatter-gather:
run:
	shell("gatk SplitIntervals \
	-R {input.ref} \
	-L {params.regions} \
	--scatter-count {params.intervals} \
	-O {params.outdir}")

Combine GVCFs (GenomicsDBImport):
run:
	shell("mkdir -p {params.tmp}")
	shell("gatk --java-options \"-Xmx60g -Xms60g\" \
	GenomicsDBImport \
	--genomicsdb-workspace-path {output} \
	--batch-size 50 \
	--reader-threads 6 \
	--sample-name-map {input.map} \
	--intervals {input.region} \
	--tmp-dir {params.tmp}")
	shell("rm -rf {params.tmp}")

Joint genotyping (GenotypeGVCFs):
run:
	shell("gatk GenotypeGVCFs \
	-R {input.ref} \
	-V {params.db} \
	-L {params.region} \
	-new-qual \
	-G StandardAnnotation \
	-G AS_StandardAnnotation \
	--include-non-variant-sites \
	-O {output}")

Concatenate and index per-interval VCFs:
run:
	shell("bcftools concat {input} -Oz -o {output}")
	shell("tabix -p vcf {output}")

Select biallelic SNPs:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.vcf} \
	-select-type SNP \
	--restrict-alleles-to BIALLELIC \
	-O {output}")

Select invariant sites:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.vcf} \
	-select-type NO_VARIATION \
	-O {output}")

Export diagnostics tables and plot them:
run:
	shell("gatk VariantsToTable \
	-R {input.ref} \
	-V {input.variant_vcf} \
	-F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
	-O {output.variant_stats}")
	shell("gatk VariantsToTable \
	-R {input.ref} \
	-V {input.invariant_vcf} \
	-F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
	-O {output.invariant_stats}")
	shell("Rscript src/filtering_diagnostics.R")

Flag SNPs failing hard filters:
run:
	shell("gatk VariantFiltration \
	-V {input.raw_snp_vcf} \
	--intervals {params.chr} \
	-filter \"QD < 2.0\" --filter-name \"QD2\" \
	-filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
	-filter \"SOR > 3.0\" --filter-name \"SOR3\" \
	-filter \"FS > 60.0\" --filter-name \"FS60\" \
	-filter \"MQ < 40.0\" --filter-name \"MQ40\" \
	-filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
	-filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
	-G-filter \"DP < 5 || DP > 200\" \
	-G-filter-name \"DP_5-200\" \
	-O {output.snps_flagged}")

Set filtered SNP genotypes to no-call:
run:
	shell("gatk VariantFiltration \
	-V {input.snps_flagged} \
	--set-filtered-genotype-to-no-call true \
	-O {output.snps_nocall}")

Keep biallelic SNPs that pass filters with <10% missing genotypes:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.snps_nocall} \
	-select-type SNP \
	--restrict-alleles-to BIALLELIC \
	--max-nocall-fraction 0.1 \
	--exclude-filtered true \
	-O {output.filtered_snps}")

Flag invariant sites failing hard filters:
run:
	shell("gatk VariantFiltration \
	-V {input.raw_invariant_vcf} \
	--intervals {params.chr} \
	-filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
	-filter \"MQ < 40.0\" --filter-name \"MQ40\" \
	-filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
	-filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
	-G-filter \"DP < 5 || DP > 200\" \
	-G-filter-name \"DP_5-200\" \
	-O {output.invariant_flagged}")

Set filtered invariant genotypes to no-call:
run:
	shell("gatk VariantFiltration \
	-V {input.invariant_flagged} \
	--set-filtered-genotype-to-no-call true \
	-O {output.invariant_nocall}")

Keep invariant sites with <10% missing genotypes:
run:
	shell("gatk SelectVariants \
	-R {input.ref} \
	-V {input.invariant_nocall} \
	--max-nocall-fraction 0.1 \
	--exclude-filtered true \
	-O {output.filtered_invariant}")

Compress and index:
run:
	shell("bgzip {input}")
	shell("tabix -p vcf {input}.gz")

Merge per-chromosome SNP VCFs:
run:
	shell("gatk MergeVcfs \
	-I={params.input} \
	-O={output.merged_snps}")

Merge SNP and invariant VCFs into an all-sites VCF:
run:
	shell("gatk MergeVcfs \
	-I={input.snps} \
	-I={input.invariant} \
	-O={output.allsites_vcf}")

Per-individual missingness and heterozygosity (vcftools):
run:
	shell("vcftools \
	--gzvcf {input.merged_snps} \
	--missing-indv \
	--out reports/filtered.qual.dp5_200.maxnocall10")
	shell("vcftools \
	--gzvcf {input.merged_snps} \
	--het \
	--out reports/filtered.qual.dp5_200.maxnocall10")

Create ADMIXTURE input (PLINK filtering + LD pruning):
run:
    shell("{params.plink2} --vcf {input.vcf} \
    --allow-extra-chr \
    --max-alleles 2 \
    --vcf-filter \
    --geno 0.1 \
    --mind {params.mind} \
    --maf 0.05 \
    --make-bed \
    --out {params.stem}")
    shell("cp {params.stem}.bim {params.stem}_temp.bim")
    shell("""sed \"s/^C//g" {params.stem}_temp.bim > {params.stem}_temp2.bim""")
    shell("./src/replace_rsids.sh {params.stem}_temp2.bim > {params.stem}.bim")
    shell("rm {params.stem}_temp*.bim")
    shell("{params.plink2} --bfile {params.stem} \
    --allow-extra-chr \
    --indep-pairwise 50 10 0.1 \
    --out {params.stem}")
    shell("{params.plink2} --bfile {params.stem} \
    --allow-extra-chr \
    --extract {params.stem}.prune.in \
    --out {params.pruned} \
    --make-bed")

Run ADMIXTURE with cross-validation:
run:
    shell("admixture -B --cv -j{threads} {input.bed} {params.k}")
    shell("mv {params.stem}.{params.k}.* models/admixture")

Windowed nucleotide diversity (pixy):
run:
    shell("tabix -p vcf -f {input.allsites_vcf}")
    shell("pixy --stats pi \
    --vcf {input.allsites_vcf} \
    --populations {input.samps} \
    --window_size {params.window_size} \
    --n_cores {threads} \
    --output_folder models/pixy \
    --output_prefix B_oleracea_grouped_{params.chr} \
    --chunk_size 50000")

Generate a mappability mask (SNPable):
run:
    shell("mkdir -p {params.tmp}")
    shell("samtools faidx {input} {params.chr} > {params.newfasta}")
    shell("{snpable}/splitfa {params.newfasta} 100 | split -l 80000000 - {params.tmp}/x")
    shell("cat {params.tmp}/x* > {params.tmp}/simu_reads.fq")
    shell("bwa aln -R 1000000 -O 3 -E 3 -t {threads} {input} {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sai")
    shell("bwa samse {input} {params.tmp}/simu_reads.sai {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sam")
    shell("perl {snpable}/gen_raw_mask.pl {params.tmp}/simu_reads.sam > {params.tmp}/raw_mask_100.fa")
    shell("{snpable}/gen_mask -l 100 -r 0.5  {params.tmp}/raw_mask_100.fa > {params.famask}")
    shell("rm -rf {params.tmp}")
    shell("./src/makeMappabilityMask.py {params.famask}")

Fit population size history (smc++ cv):
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"

Plot SMC++ results:
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"

Generate bootstrap replicates of the SMC++ input:
shell:
    "python3 src/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/bootstrap_input/{params.population}.{params.distind}_rep \
    {params.input_dir}"

Fit population size history to bootstrap replicates (smc++ cv):
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"

Plot bootstrap SMC++ results:
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"

Convert VCF to SMC++ format for a population pair (direction 1→2):
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind1} {params.distind1} \
    {input.vcf} {output.out12} {params.chrom} {params.pop_pair_string12}
    """

Convert VCF to SMC++ format for a population pair (direction 2→1):
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind2} {params.distind2} \
    {input.vcf} {output.out21} {params.chrom} {params.pop_pair_string21}
    """

Estimate divergence between populations (smc++ split):
shell:
    "smc++ split \
    --cores 8 \
    -o {params.model_out_dir} \
    {input}"

Bootstrap the joint input (direction 1→2):
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_12.{params.distind1}_rep \
    {params.input_dir}"

Bootstrap the joint input (direction 2→1):
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_21.{params.distind1}_rep \
    {params.input_dir}"

Run smc++ split on bootstrap replicates:
shell:
    "smc++ split \
    --cores 2 \
    -o {params.model_out_dir} \
    {input}"

Plot split models and their bootstrap replicates:
shell:
    """
    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.split} \
    {input.smc_split}
    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.bootstrap} \
    {input.smc_split_bootstrap}
    """

Create an excluded-sites BED for RAiSD (bedtools complement):
shell:
    """
    cat {input.allsites_vcf} | sed -e 's/chr//' | awk '{{OFS="\\t"; if (!/^#/){{print $1,$2,$2}}}}' | bedtools complement -i stdin -g {input.chr_lengths} | grep {params.chr} > {output}
    """

Decompress input:
run:
    shell("gunzip -c {input} > {output}")

Selection scan (RAiSD):
run:
    shell("RAiSD -R -s -m 0.05 -P -X {input.excluded_sites} -n {params.population} -I {input.allsites_vcf} -S {input.samples} -w 500 -f -c 2")
    shell("mv {params.out}* {params.outdir}")

Filtering diagnostics in R (src/filtering_diagnostics.R):
library(tidyverse)
library(cowplot)

VCFsnps <- read.csv('reports/filtering_bpres/all_samps.raw.snps.table',
                    header = T,
                    na.strings=c("","NA"),
                    sep = "\t")

dim(VCFsnps)

VCFinvariant <- read.csv('reports/filtering_bpres/all_samps.raw.invariant.table',
                         header = T,
                         na.strings=c("","NA"),
                         sep = "\t")

dim(VCFinvariant)

VCF <- bind_rows(list(variant = VCFsnps,
                      invariant = VCFinvariant),
                 .id = "class")

dim(VCF)

xmax <- mean(VCF$DP)*3

# colors
snps <- '#A9E2E4'
invariant <- '#F4CCCA'

DP <- ggplot(VCF, aes(x=DP, fill=class)) +
  geom_density() +
  geom_vline(xintercept=c(10,xmax)) +
  xlim(c(0, 6000)) +
  theme_bw()

QD <- ggplot(VCF, aes(x=QD, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=2, size=0.7) +
  theme_bw()

FS <- ggplot(VCF, aes(x=FS, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(60, 200), size=0.7) +
  ylim(0,0.1) +
  theme_bw()

MQ <- ggplot(VCF, aes(x=MQ, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=40, size=0.7) +
  theme_bw()

MQRankSum <- ggplot(VCF, aes(x=MQRankSum, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=-20, size=0.7) +
  theme_bw()

SOR <- ggplot(VCF, aes(x=SOR, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(4, 10), size=1, colour = c(snps, invariant)) +
  theme_bw()

ReadPosRankSum <- ggplot(VCF, aes(x=ReadPosRankSum, fill=class)) +
  geom_density(alpha=.3) +
  geom_vline(xintercept=c(-10,10), size=1) +
  xlim(-30, 30) +
  theme_bw()

plot_grid(QD, DP, FS, MQ, MQRankSum, SOR, ReadPosRankSum, nrow=4)

ggsave("reports/filtering_bpres/unfiltered_diagnostics.png")

Bootstrap SMC++ input files (src/smc_bootstrap.py):
import click
import os
import sys
import random
import gzip
import time


@click.command()
@click.option('--nr_bootstraps',  type=int, help="nr of bootstraps [20]", default=20)
@click.option("--chunk_size", type=int, help="size of bootstrap chunks [5000000]", default=5000000)
@click.option("--chunks_per_chromosome", type=int,
              help="nr of chunks to put on one chromosome in the bootstrap [20]", default=20)
@click.option("--nr_chromosomes", type=int, help="nr of chromosomes to write [30]", default=24)
@click.option("--seed", type=int, help="initialize the random number generator", default=None)
@click.argument("out_dir_prefix")
@click.argument("files", nargs=-1)
def main(nr_bootstraps, chunk_size, chunks_per_chromosome, nr_chromosomes, seed, out_dir_prefix, files):
    chunks = []
    if not seed:
        seed = int(time.time())
    random.seed(seed)
    print('seed: %s' % seed)
    for file in files:
        print(file)
        # reset per-file state so chunk positions restart on each chromosome
        # and chunks aren't duplicated when extending the global list below
        chunks_in_chrom = []
        offset = 0
        with gzip.open(file, 'rb') as f:
            header = f.readline().decode()
            prev_chunk_idx = -1
            for line_count,line in enumerate(f):
                line = line.decode()
                line = [int(x) for x in line.strip().split()]
                if line_count == 0:
                    pos = 1
                    offset += 1
                else:
                    pos = line[0] + offset
                    offset += line[0]
                chunk_index = (pos - 1) // chunk_size
                # The code below won't work if there are super large hom. stretches in the smc input file
                # This might happen if, for example you're only analyzing part of the chromosome, and you
                # mask everything outside of that part. The last line in the input file might look like:
                # 72402527 -1 0 0
                # Which tells SMC++ to mask the final ~72MB of the chromosome.
                # I've written a check below that should deal with this and stop the for loop if the next
                # position is more than (2*chunk_size) away
                if (chunk_index - prev_chunk_idx) >= 2:
                    print('\n')
                    print('Next position is in %s bases' % pos)
                    print('Last relative position was at %s bases' % lastpos)
                    print('This might be the result of how vcf2smc masks large stretches of the vcf (say for subsampling a chromosome), or it might be a real error.\nDouble check that this input file is formatted how you intend.\nStopping "chunking" of input file and moving on\n')
                    break
                print('Processing chunk %s of input file' % chunk_index)
                ##
                if chunk_index > len(chunks_in_chrom)-1:
                    chunks_in_chrom.append([])
                chunks_in_chrom[chunk_index].append(line)
                ##
                prev_chunk_idx = chunk_index
                lastpos = pos

        chunks.extend(chunks_in_chrom)

    for bootstrap_id in range(1, nr_bootstraps +1):
        for chr_ in range(1, nr_chromosomes + 1):
            chr_dir = "{}_{}".format(out_dir_prefix, bootstrap_id)
            if not os.path.exists(chr_dir):
                os.makedirs(chr_dir)
            chr_file = "{}/bootstrap_chr{}.gz".format(chr_dir, chr_)
            print("writing", chr_file, file=sys.stderr)
            with gzip.open(chr_file, 'wb') as f:
                f.write(header.encode())
                for i in range(chunks_per_chromosome):
                    chunk_id = random.randrange(len(chunks))
                    for line in chunks[chunk_id]:
                        line = ' '.join([str(x) for x in line]) + '\n'
                        f.write(line.encode())



if __name__ == "__main__":
    main()
URL: https://github.com/mishaploid/Bo-demography
License: MIT License