A Snakemake workflow for high-throughput amplicon sequencing


This is a Snakemake workflow for generating a list of variants (single nucleotide polymorphisms, SNPs) from amplicon sequencing data and for producing haplotypes from these variants with rad_haplotyper, a program designed to produce SNP haplotypes that is included in this workflow.
The workflow is designed to handle both single-end and paired-end sequencing data, and it gives the user the flexibility either to filter the identified SNPs under stringent conditions and use them for haplotype calling, or to intersect them with an existing list of high-quality variants and use only the overlapping SNPs for haplotype calling. Additionally, the workflow has the option to produce only the variants, without calling haplotypes, to reduce analysis time when only the variants are needed.

Overview

The standard snakemake_haps workflow essentially performs the following steps:

1) Quality control of the raw reads.
2) Quality trimming and adapter removal of the raw reads.
3) Mapping of the processed reads against the reference sequence.
4) Marking of duplicate reads in the resulting alignments.
5) Sorting and indexing of the alignments.
6) Variant calling with FreeBayes.
7) Stringent filtering of the variants to identify high-quality SNPs.
8) Haplotype calling with rad_haplotyper.

Getting started

If you are new to [Snakemake], you might want to start by working through the available [tutorial for beginners], which includes instructions for installing Snakemake and Miniconda Python 3 and for setting up an environment with all required software.

1. Installation

"Snakemake_haps" can be downloaded and installed from a [Github repository], however, since "snakemake_haps" has many dependencies, it is suggested to use one of the [Anaconda] or [Miniconda] repositories to avoid compiling dependecies one at a time which is time consuming.

1.1 Downloading and Installing (Ana|mini)conda:

  • [Anaconda] - Free

  • [Miniconda] - Free

For installing Anaconda:

- Steps for downloading and installing Anaconda:
# Curl is needed so if not installed you can install it by typing:
$ sudo apt install curl	
# Installing Anaconda
$ curl -o /tmp/Anaconda.sh https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh && bash /tmp/Anaconda.sh
or
Download the available installer version of Anaconda from [Anaconda]
#Change directory to where the installer was downloaded
$ cd ~/path/to/installer_directory/
#Run the installer by typing 
$ bash Anaconda.sh

For installing the Miniconda Python 3 distribution:

- Steps for downloading and installing Miniconda3:
# If Curl is not installed you can install it by typing:
$ sudo apt install curl	
# Installing Miniconda3
$ curl -o /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash /tmp/miniconda.sh
or
Download the available installer version of Miniconda3 from [Miniconda]
#Change directory to where the installer was downloaded
$ cd ~/path/to/installer_directory/
#Run the installer by typing 
$ bash Miniconda3-latest-Linux-x86_64.sh
- Answer yes to the question of whether conda shall be put into your PATH.
- You must restart your shell session for the changes to take effect, or run source ~/.bashrc from your terminal.

1.2 Installing Snakemake workflow manager:

Via Bioconda channel:
#Installing Snakemake from the Bioconda channel:
$ conda install -c bioconda snakemake
With a working Python >=3.5 setup, installation of Snakemake can also be performed via PyPI:
#Installing Snakemake via PyPI:
$ pip3 install snakemake
or
$ easy_install3 snakemake

1.3 Downloading and installing "Snakemake_haps" files:

Description

This repository contains scripts that are part of a complete workflow for DNA-seq analysis and mapping reads to a reference genome for the identification of SNPs and designing of haplotypes.

Clone the repository using the Command Line:

#Download files
$ git clone https://github.com/chollenbeck/snakemake_haps.git
#Change directory to snakemake_haps
$ cd snakemake_haps
or

Visit ("link") and select "Clone or download" option and select "Download Zip"

Unzip the .zip file to the preferred location and then change directory:

#Change directory to snakemake_haps
$ cd snakemake_haps

1.4 Create an environment

For using "Snakemake_haps" it is suggested to create an environment dedicated to "Snakemake_haps". This environemnt is a directory that contains all dependencies for "Snakemake_haps" and it can be activated before running the workflow and deactivatedd after it finishes. The environment configuration for "Snakemake_haps" is located in /bin/install/environment.yml and can be created as follows:

#Installing available environment
$ conda env create --name *environment_name* --file bin/install/environment.yml
#Activate environment
$ source activate *environment_name* 

and deactivate it by using

#Deactivating environment
$ source deactivate

1.5 Quick run of "Snakemake_haps"

For running "Snakemake_haps with default parameters, the following set of commands must be executed.

  1. Change to the directory where "Snakemake_haps" was downloaded:
#Change directory
$ cd /home/Snakemake_haps_directory/
  2. Run the build_hap_config.py script to create the config.yaml file with all analysis parameters and the paths of the raw FASTQ files:
#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta" -d /home/Path/of/Fastq/files
  1. Run "Snakemake_haps"
#Start workflow by running snakemake
$ snakemake

A more detailed explanation of the above steps is given in the following "Tutorial" section.

2 Tutorial

This next part will use the sample data files available in [data/fastq_raw] to illustrate how to run the workflow using both paired-end and single-end reads. Both the single-end and paired-end FASTQ files come from MiSeq sequencing of King scallop (Pecten maximus) samples. For testing purposes, the sample files can be found in the "Snakemake_haps" data/fastq_raw/Single_end_reads and data/fastq_raw/Paired_end_reads directories. Their corresponding reference sequences can be found in data/genome/Single_end_sample.fasta and data/genome/Paired_end_sample.fasta.

2.1 Steps for running "Snakemake_haps" with default parameters

  • After activating the newly created environment, set the working directory to where the "Snakemake_haps" workflow is located:
#Change directory
$ cd /home/Snakemake_haps_directory/
2.1.1 To use the default workflow parameters:
  • Raw data must be FASTQ sequencing files stored in the data/fastq_raw/ directory in the "Snakemake_haps" folder.

  • A reference sequence must also be provided and stored in the data/genome/ directory in the "Snakemake_haps" folder.

2.1.2 Create the config.yaml file with default parameters; it contains the list of the samples and their paths in data/fastq_raw/ and the reference sequence path in data/genome/, as well as mapping and variant-filtering parameters.
#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta"

This Python script will create a .yaml file containing all the parameters needed for the workflow to run successfully, as well as the paths of the FASTQ sample files with their designation as paired-end or single-end reads.

- The config.yaml file is then stored in the home directory of "Snakemake_haps" and can be opened with a typical text editor. Before running the workflow, it is advisable to open the config file and visually inspect the sequencing-read paths, as well as the default parameters for mapping, variant calling, and variant filtering.

- Other options that can be controlled from the config.yaml file include whether to run the QC step and whether to skip the haplotype-calling step.

2.1.3 After creating the config.yaml file with all the necessary parameters and the correct paths to the raw sequencing data, the workflow can be executed.
#Start workflow by running snakemake
$ snakemake

2.2 Steps for running "Snakemake_haps" only for variant calling.

2.2.1 The "Snakemake_haps" program is coming with the option to be used only for variant calling when the designing of haplotypes are not needed. To configure "Snakemake_haps" to proceed with the analysis without the [rad_haplotyper] step, [build_hap_config.py] script (described in 2.1.2) must be used with a changed option "-q" as False (Default value = True).
#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta" -q False

This creates a config.yaml file with the [rad_haplotyper] step deactivated, with all other parameters remaining the same.

Then "Snakemake_haps" workflow can be activated as described in 2.1.3

#Start workflow by running snakemake
$ snakemake
2.2.2 For paired-end reads, the default usage includes the "FLASH" (Fast Length Adjustment of SHort reads) step, which merges paired-end reads. FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the read length. The resulting longer reads can significantly improve genome assemblies; they can also improve transcriptome assembly when FLASH is used to merge RNA-seq data. For variant calling with paired-end reads that do not need to be merged, the FLASH option can be deactivated by running the [build_hap_config.py] script with the option "-f" set to False (default value = True). A sketch of the kind of merge FLASH performs is given at the end of this subsection.
#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta" -f False

This creates a config.yaml file with the FLASH step deactivated, with all other parameters remaining the same.

Then "Snakemake_haps" workflow can be activated as described in 2.1.3

#Start workflow by running snakemake
$ snakemake
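
For reference, the merge that FLASH performs inside the workflow corresponds roughly to the following command. This is only a sketch: the file names and output prefix are hypothetical, and the workflow runs FLASH internally with its own paths.

#Merge overlapping read pairs into single extended reads (sketch; hypothetical file names)
$ flash Sample_1_R1.fastq.gz Sample_1_R2.fastq.gz -z -o Sample_1.flashed
#The merged reads are written to Sample_1.flashed.extendedFrags.fastq.gz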

2.3 Steps for running "Snakemake_haps" without the QC step (not recommended).

In the same way as before, "Snakemake_haps" can be used for variant analysis and haplotype calling while skipping the QC step on the sequencing reads, which can take a large amount of time. To do that, the [build_hap_config.py] script has to be run with the option "-p" set to False (default value = True).

#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta" -p False

This creates a config.yaml file with the QC step deactivated, with all other parameters remaining the same.

Then "Snakemake_haps" workflow can be activated as desctibed in 2.1.3

#Start workflow by running snakemake
$ snakemake

2.4 Steps for running "Snakemake_haps" without both QC and haplotype calling:

For using "Snakemake_haps" only for variant calling with no QC and haplotype calling, the two previiously explained options "-p" and "-q" can be combined when [build_hap_config.py] is run.

#Run the build_hap_config.py script
$ python bin/scripts/build_hap_config.py -g data/genome/"name_of_ref_seq.fasta" -p False -q False

Then "Snakemake_haps" workflow can be activated as desctibed in 2.1.3

#Start workflow by running snakemake
$ snakemake

2.5 Other parameters that can be changed in [build_hap_config.py]

As mentioned before, [build_hap_config.py] is the script that creates the [config.yaml] containing all the parameters necessary for "Snakemake_haps" to run, and there are a number of other options that can be configured. Some of these are:

  • "MANDATORY" "-g", "--genome" : This option must be included by the user when [build_hap_config.py] is run, since the complete relative path and name of the reference sequence is needed for the workflow. If the sequence is not in data/genome directory, the correct relative path can also be added by using the option "-g" followed by the name of the .fasta file (Default value: data/genome/).

  • "-n", "--run_name" : When there is the need to use a different name for each run (Default value: haps_run).

  • "-a", "--adapter_dir" : If adapters for adapter removal are not located in the directory data/adapters, the new relative path for the adapters can be added by using the option "-a" (Default value: data/adapters).

  • "-s", "--split_genome" : This option splits the reference genome into n pieces for parallel variant calling (Default value: 5)

  • "-v", "--diff_vcf" : This option allows the "Snakemake_haps" workflow use a VCF file with variants that the user has previsouly identified as of high-quality and only the overlapping SNPs between this VCF and the identified SNPs in the VCF file of the "Snakemake_haps" workflow will be used for the haplotype calling. The relative path of the VCF file can be added to the workflow with option "-v" (Default value: "Null").

2.6 Parameters that can be changed in [config.yaml]

As mentioned before, [config.yaml] is the file created by the [build_hap_config.py] script and contains all of the parameters for each step of the workflow, including mapping, variant calling, and variant filtering. Although these parameters work well with sequencing data from MiSeq runs, they can be non-ideal for other sequencing data or for the needs of a particular project; they can be edited by opening [config.yaml] in a text editor.

  • "Trimmomatic read trimming" : Trimmomatic performs a variety of useful trimming steps for illumina paired-end and single ended reads. The trimming tasks that it performs include the removal of adapters, leading and trailing low quality bases, the scanning of the read with a 4-base wide sliding window and cutting when the average quality per base dropw below a threshold, and the removal of low quality reads. The parameters for these tasks, that can be also found in the [config.yaml], are the following:

    • LEADING:[20] Cut bases off the start of a read, if below a threshold quality

    • TRAILING:[20] Cut bases off the end of a read, if below a threshold quality

    • AVGQUAL:[30] Drop the read if the average quality is below the specified level

    • MINLEN:[100] Drop the read if it is below a specified length

    • SLIDINGWINDOW:[4:15] Performs a sliding-window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.

The values of these parameters can be easily modified by changing the values inside the brackets and then saving the changes. For more information about Trimmomatic and the available options, please go to [Trimmomatic].
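
For context, the settings above correspond approximately to a Trimmomatic command of the following form. This is a sketch with hypothetical file names; the adapter file and the ILLUMINACLIP clipping thresholds are assumptions, and the workflow builds the actual command from [config.yaml].

#Approximate paired-end Trimmomatic command (sketch; file names and ILLUMINACLIP settings are hypothetical)
$ trimmomatic PE sample_R1.fq.gz sample_R2.fq.gz out_R1_paired.fq.gz out_R1_unpaired.fq.gz out_R2_paired.fq.gz out_R2_unpaired.fq.gz ILLUMINACLIP:data/adapters/adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 AVGQUAL:30 MINLEN:100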

  • "BWA-MEM" : BWA-MEM algorithm performs the local alignment of the reads to a reference sequence. The parameters taken for the mapping of the reads in this workflow are the default, with the only the number of threads used being changed from default for better performance in multithreading mode. However, there are more options available in [BWA-MEM] that can be used and added in pconfig.yaml[] file.

    • "-t" INT : Number of threads [8]

    • "-B" INT : Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4] (Default)

    • "-O" INT : Gap open penalty. [6] (Default)

The values of these parameters can be easily modified by changing the values inside the brackets and then saving the changes. For more information about the BWA-MEM algorithm and the available options, please go to [BWA-MEM].
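
These parameters translate to a command of roughly the following form. This is a sketch with hypothetical file names; in the workflow, the BWA-MEM output is piped directly into samtools rather than written to a SAM file.

#Approximate BWA-MEM command with the parameters above (sketch; hypothetical file names)
$ bwa mem -t 8 -B 4 -O 6 data/genome/ref_seq.fasta sample_R1.fq.gz sample_R2.fq.gz > sample.sam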

  • "Samtools Fixmate" : This step fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment. This step is essential for the duplicate read marking and removal with samtools markdup.

    • "-r" : Remove secondary and unmapped reads.

    • "-m" : Add ms (mate score) tags. These are used by markdup to select the best reads to keep.

  • "Samtools MarkDup" : Mark duplicate alignments from a coordinate sorted file that has been run through fixmate with the -m option. This program relies on the MC and ms tags that fixmate provides.

    • "-s" : Print some basic stats. Extra options:

    • "-r" : Remove duplicate reads. (Default: Null)

    • "-S" : Mark supplementary reads of duplicates as duplicates. (Default: Null)

To activate these options, just add "-r" and/or "-S" next to "-s" in the [config.yaml] file and then save it.
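
Together, these two steps form the standard samtools duplicate-marking chain, sketched below with hypothetical file names (the workflow runs the equivalent commands through its map.py rules, with the exact options taken from [config.yaml]).

#Name-sort, fix mate information, coordinate-sort, then mark duplicates (sketch; hypothetical file names)
$ samtools sort -n -O bam sample.unsorted.bam > sample.namesorted.bam
$ samtools fixmate -r -m sample.namesorted.bam sample.fixmate.bam
$ samtools sort -O bam sample.fixmate.bam > sample.fixmate_sorted.bam
$ samtools markdup -s sample.fixmate_sorted.bam sample.sorted.bam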

  • "Samtools View" : Samtools view command is a very versatile command from the Samtools package where its main function is to convert the binary form of the alignments into a human readable format. However, Samtools view has other functions, such as count the total lines of a BAM file and exctract reads that fulfill specific characteristics, like paired-reads mapped properly, unmapped reads and mates that are unmaped. For more information about the SAM flags that correspond to ecah property, visit [SAMformat] webpage. For this workflow, the parameters taken for filtering the reads that were aligned are:

    • "-f" INT : Keep only reads that have the INT present in the FLAG field [3] (Default value: Keep paired reads and mapped in proper way (for paired-end reads only))

    • "-F" INT : Discard reads that have the INT present in the FLAG field [780] (Default value: Discard reads that are unmapped, the mate is unmmaped, the read has split hits, and read fails quality checks.

The values of these parameters can be easily modified by removing them or adding new ones and then saving the changes. For more information about samtools fixmate, markdup, and view and their available options, please go to [Samtools].
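
As an example, the default flag filters above correspond to a command like the following (a sketch; the file names are hypothetical).

#Keep properly paired, mapped reads (-f 3), discard flag-780 reads (-F 780), and write BAM output (sketch)
$ samtools view -b -f 3 -F 780 sample.unfiltered.bam > sample.filtered.bam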

  • "FreeBayes Variant Calling" : FreeBayes is a genetic variant detector, specifically designed to identify small polymorphisms, IndDels, MNPs and compelx events (composite insertion and substitution events). It uses short-read alignments (BAM files), which in this workflow all BAM files for a number of samples are merged into one BAM file, and a reference genome (in FASTA format) are used to determine the most-likely combination of genotypes for the population at each position in the reference:

    • --haplotype-length : [0] Simply annotate observation counts of SNPs and indels.

    • --no-complex : Do not generate information about complex events

    • --use-duplicate-reads : Duplicate reads are included in the analysis

The values of these parameters can be easily modified by removing them or adding new ones and then saving the changes. For more information about FreeBayes, please go to [FreeBayes].
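
The options above correspond to a FreeBayes invocation of roughly this form (a sketch; the file names are hypothetical).

#Approximate FreeBayes command with the parameters above (sketch; hypothetical file names)
$ freebayes -f data/genome/ref_seq.fasta --haplotype-length 0 --no-complex --use-duplicate-reads merged.bam > raw_variants.vcf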

  • "VCFtools Variant Filtering" : VCFtools is a program package and a toolset that is designed to work wth VCF file and perform a number of different operations on VCF files. On this wokrflow, VCFtools is used to filter out specific variants based on their available tags. These filters are:

    • --minDP [10] : Includes only genotypes with a depth greater than or equal to the "--minDP" value.

    • --minGQ [20] : Excludes all genotypes with a quality below the specified threshold.

    • --max-missing [0.25] : Excludes sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).

    • --maf [0.05] : Includes only sites with a minor allele frequency greater than or equal to the "--maf" value.

The values of these parameters can be easily modified by removing them or adding new ones and then saving the changes. For more information about VCFtools, please go to [VCFtools].
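
These filters correspond approximately to the following VCFtools command. This is a sketch; the input name is hypothetical, although the output prefix matches the workflow's filtered_snps.recode.vcf naming.

#Approximate VCFtools filtering command with the thresholds above (sketch; hypothetical input name)
$ vcftools --vcf raw_variants.vcf --minDP 10 --minGQ 20 --max-missing 0.25 --maf 0.05 --recode --recode-INFO-all --out filtered_snps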

  • "Rad_haplotyper" : The filtered set of SNPs from previous step are used for haplotype calling using rad_haplotyper tool (Willis et al., 2017), which is a program designed to produce SNP haplotypes. Haplotyping in this program is based on the idea that single- and paired-end reads capture the phase of linked SNPs. Apart from the filtered VCF with high quality SNPs, a BAM file with the aligned reads is also required. This program works first by iterating through each locus attempting to construct haplotypes from the raw reads and then by comparing the haplotypes to the bases appeared in the VCF file, haplotypes that are unique only on the raw reads are removed while haplotypes that can be matched with the bases called in the VCF are kept. The filter options that are taken for this step are:

    • --miss_cutoff [0.75] : Proportion of missing data cutoff for removing loci from the final output.

    • --hap_rescue [3] : Specify a rescue parameter that controls the behaviour of the script when dealing with loci that have more observed haplotypes than are possible given the genotypes.

    The values of these parameters can be easily modified by removing them or adding new ones and then saving the changes. For more information about rad_haplotyper, please go to [rad_haplotyper].
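
Within the workflow, rad_haplotyper is invoked along the following lines (a sketch based on the haplotype.py rule shown under "Code Snippets"; the thread count is illustrative, and the filter options are passed in from [config.yaml]).

#Approximate rad_haplotyper invocation (sketch; thread count illustrative)
$ rad_haplotyper.pl -v filtered_snps.recode.vcf -x 20 -g haps.gen -p popmap --miss_cutoff 0.75 --hap_rescue 3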

3 Results & output files

The results of the analysis are stored in multiple locations, based on the step of the workflow that produced them (QC, mapping, variant calling, haplotype calling). However, the main results of the workflow are stored in the results/Final_results/ directory. The files produced during the workflow contain figures and statistics about the reads and their alignment against the reference sequence, as well as two VCF files with the raw and filtered variants from the variant analysis and the haplotypes designed using [rad_haplotyper]. A detailed description of these files is given in the next part.

3.1 - Quality_control: The quality-control directory contains statistics, produced with FastQC, about the quality of the reads before and after the quality-trimming and adapter-removal step of the workflow. The FASTQ files received from sequencing can contain various artefacts, which must be removed prior to variant calling. These artefacts include low-quality bases at the ends of the reads and contamination from Illumina adapters.

The main functions of FastQC are:
- Import of data from FastQ files
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML-based permanent report
- Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample.
All figures and statistics are then stored in results/final_results/Quality_control/QC_reads and results/final_results/Quality_control/Raw_reads directories, depending on whether the FastQC was performed before or after the quality trimming step.

As mentioned earlier, the FastQC step can be skipped (even though this is not recommended), in which case the final FastQC files will be missing at the end of the workflow.

3.2 BAM_files: This directory contains the merged sequence-alignment data from mapping all samples against the reference sequence. The BAM format is the compressed binary version of the SAM file produced by aligning each sample against the reference sequence, and the conversion from SAM to BAM can save valuable space when storing a large number of samples.

  • This merged BAM file is the main input file for the variant analysis with FreeBayes.

3.3 Variants: After completing the alignment step, in which all the processed reads are mapped against the reference sequence, FreeBayes proceeds with the identification of variants. The identified calls are stored in a VCF-format file, which can then be further filtered to remove low-quality variants and false-positive SNPs. The results from this step are stored in two directories, results/Final_results/Variants/Raw_Variants and results/Final_results/Variants/Filtered_Variants, for the raw and filtered variants respectively.

- Inside the Filtered_Variants directory, a .txt file is reported with p-values for each site from a Hardy-Weinberg equilibrium test, together with the observed numbers of homozygotes and heterozygotes and the corresponding expected numbers under HWE.
- In case a set of pre-defined high-quality SNPs has been used for the selection of SNPs overlapping with the newly identified SNPs, a new VCF file will be stored in the results/Final_results/Variants/Filtered_Variants directory, containing only the SNPs common to the two datasets. This file can then be used for the haplotype calling.

3.4 Haplotypes: The results from using the previously identified SNPs to create haplotypes are stored in results/Final_results/Haplotypes. Various files with information about the haplotypes can be found in this directory, such as:

- codes.haps.gen: This file contains a complete list of haplotypes that passed the quality control of [rad_haplotyper] for each locus.
- haps.gen: This file has the list of the loci that have haplotypes that passed the quality control of [rad_haplotyper].
- stats.out: This file lists the number of sites at the locus (SNPs or indels), the number of haplotype alleles observed in the dataset, the number of individuals haplotyped, the total number of individuals in the data set, and the proportion of individuals successfully haplotyped per locus. It also indicates the status of each locus as either passed or failed and, if it failed, the possible reason for failure. Possibilities include: possible paralogs, possible low-coverage/genotyping errors, previously missing genotypes, or a complex polymorphism that is difficult to haplotype. See the section on haplotype calling for more details about how this works.

This (tidy) file can be loaded into R and can be used to further filter loci from the data set based on user-defined cut-off values.

- ind_stats.out: This file lists the number of loci that are possible paralogs, have low coverage/errors, or have missing genotypes, the number of failed loci, the total number of loci, and the proportion of successful loci on a per-individual basis. This file can be used to identify and remove problematic individuals from the final data set.
- This step of the analysis also produces the appropriate input files for generating a visual summary of the microhaplotypes found by this workflow using the [microhaplot] R package.

3.5 Documents: This directory contains the files with statistics about the reads and the QC step, as well as about the mapping of the reads. Some of the information that can be found in these files is:

- The total number of reads per sample
- The number of mapped and unmapped reads per sample
- The number of duplicate reads
- The percentage of properly paired reads, if the reads are paired-end
- The minimum and maximum length of the reads
- Their average quality
- The total number of reads aligned to every locus

4 Snakefiles (These files should not be changed)

In Snakemake, snakefiles are the files that contain the rules denoting how to create output files from input files. For this workflow, the snakefiles are categorised based on their function and on the step of the workflow in which they are activated.

-[raw.py] creates links to the reference genome and the FASTQ files for downstream analysis by the workflow.

-[folders.py] creates folders for all output files produced by each step of the workflow.

-[qc.py] This snakefile includes adapter removal by Trimmomatic, which removes Illumina adapters that can cause problems in downstream analysis and in the identification of high-quality SNPs. Additionally, Trimmomatic trims low-quality regions of reads. Apart from Trimmomatic, qc.py contains a quality-control step using the FastQC tool to confirm the high quality of the trimmed reads, producing figures and tables with statistics that are explained on the [fastQC] help page.

-[map.py] This script contains the full workflow for mapping the reads to a reference genome using the [BWA-MEM] mapper (Li and Durbin, 2009) and the [FreeBayes] variant caller (Garrison et al., 2012).

  • Steps

    1. Indexing of the reference genome.

    2. Mapping the trimmed reads against the reference genome and sorting the output [BAM] file.

    3. Apply the [SAMtools] fixmate and markdup utilities to the sorted BAM files for the removal of potential PCR duplicates. The output file of this step must be indexed.

    4. For variant calling, the output BAM files from mapping each sample to the reference genome must be merged and indexed.

-[variant_calling.py] contains the workflow for the identification of variants from the produced BAM files. Initially, this script splits the reference sequence into multiple components for faster variant analysis. FreeBayes is the variant caller; it requires the indexed merged BAM file as input and produces a VCF file with raw variants. A filtering procedure is included in this script to remove low-quality variants from the final variant set.

-[haplotype.py] builds SNP haplotypes using the filtered variants from previous steps, providing a test for paralogy.

You can read more about the method in the available manual of [rad_haplotyper] and the following publication:

Willis, S. C., Hollenbeck, C. M., Puritz, J. B., Gold, J. R. and Portnoy, D. S. (2017), Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage. Mol Ecol Resour, 17: 955–965. doi:10.1111/1755-0998.12647 [link]

-[BAM_to_SAM.py] is an essential part of [haplotype.py], which prepares the files for an extra analysis with the R package [microhaplot]. Microhaplot generates visual summaries of microhaplotypes found in short-read alignments, using the alignment SAM files and a variant-call VCF file. It was designed for extracting and visualising haplotypes from high-quality amplicon sequencing data. For more information about this R package, see https://github.com/ngthomas/microhaplot. All the output files required by microhaplot are stored in the results directory results/Final_results/Microhaplot.

-[clean.py] removes all unnecessary files and folders created throughout the analysis.

5 Sample data

This repository includes two directories with FASTQ files for three samples each: one with single-end FASTQ files and the other with paired-end FASTQ files. These files can be used as examples with the tutorial in this manual. Additionally, two reference sequences are available, one for the single-end and one for the paired-end sample tutorial.

Single-end reads [data/fastq_raw/Single_end_sample_fastq/]

  • Sample_1

    • S1_001

      • S1_L001.fastq.gz
  • Sample_2

    • S2_001

      • S2_L001.fastq.gz
  • Sample_3

    • S3_001

      • S3_L001.fastq.gz

Paired-end reads [data/fastq_raw/Paired_end_sample_fastq/]

  • Sample_1

    • S1_001

      • FR14992874_S1_L001_R1_001.fastq.gz : Containing the forward reads of Sample 1

      • FR14992874_S1_L001_R2_001.fastq.gz : Containing the reverse reads of Sample 1

  • Sample_2

    • S2_001

      • FR14992882_S2_L001_R1_001.fastq.gz : Containing the forward reads of Sample 2

      • FR14992882_S2_L001_R2_001.fastq.gz : Containing the reverse reads of Sample 2

  • Sample_3

    • S3_001

      • FR14992890_S3_L001_R1_001.fastq.gz : Containing the forward reads of Sample 3

      • FR14992890_S3_L001_R2_001.fastq.gz : Containing the reverse reads of Sample 3

Scripts

- config.yaml contains the configuration parameters that are used in the workflow, as well as links to the input files.

-[Snakefile] contains the rules that are used in the workflow, written as a Python script and stored in the home directory. This file creates all the links with the scripts that contain the rules of the workflow. Executing [Snakefile] with the command "snakemake" starts the specified workflow.

-[build_hap_config.py] creates the [config.yaml] with the default parameters for mapping and filtering, as well as the links to the raw data of the samples.

References

Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012

Li H. and Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25(14), 1754-1760.

This pipeline is a contribution from the University of St Andrews, a partner in the European Marine Biological Research Infrastructure Cluster (EMBRIC) project funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 654008.

Code Snippets

if config["illumina_se"] is None and config["Haplotyper"] is True:
    rule map_BAM_to_SAM_pe:
        input:
            MAP_DIR + "{sample}.sorted.bam"
        output:
            MAP_SAM + "{sample}.sorted.sam"
        params:
            scripts_dir = config["scripts_dir"]
        shell:
            "samtools view -h -o {output} {input}"


elif config["illumina_pe"] is None and config["Haplotyper"] is True:
    rule map_BAM_to_SAM_se:
        input:
            MAP_DIR + "{sample}.sorted.bam"
        output:
            MAP_SAM + "{sample}.sorted.sam"
        shell:
            "samtools view -h -o {output} {input}"

else:
    pass

##########################################################################################

if config["illumina_se"] is None and config["Haplotyper"] is True:
    rule make_label:
        input:
            expand(MAP_SAM + "{sample}.sorted.sam", sample = SAMPLES_PE)
        output:
            MICROHAPLOT + "labels.txt"
        params:
            scripts_dir = config["scripts_dir"]
        shell:
            "{params.scripts_dir}/Make_labels.sh {output}"

elif config["illumina_pe"] is None and config["Haplotyper"] is True:
    rule make_label:
        input:
            expand(MAP_SAM + "{sample}.sorted.sam", sample = SAMPLES_SE)
        output:
            MICROHAPLOT + "labels.txt"
        params:
            scripts_dir = config["scripts_dir"]
        shell:
            "{params.scripts_dir}/Make_labels.sh {output}"

else:
    pass

##########################################################################################

if config["Haplotyper"] is True:
    rule make_SAM_link:
        """
        Make symlinks to run "microhaplot"
        """
        input:
            MAP_SAM + "{sample}.sorted.sam"
        output:
            MICROHAPLOT + "{sample}.sorted.sam"
        log:
            HAPLOTYPE_DOC + "make_SAM_link.{sample}.log"
        threads:
            1
        shell:
            "ln -s $(readlink -f {input}) {output} 2> {log} "

else:
    pass 


########################################################################################


if config["Haplotyper"] is True and config["Validation"] is None:
    rule make_VCF_link:
        """
        Make symlinks to run "microhaplot"
        """
        input:
            vcf = FINAL_VARIANTS + "filtered_snps.recode.vcf"
        output:
            vcf = MICROHAPLOT + "snps.vcf"
        log:
            HAPLOTYPE_DOC + "make_VCF_link.log"
        threads:
            1
        shell:
            "ln -s $(readlink -f {input.vcf}) {output.vcf} 2> {log} "

elif config["Haplotyper"] is True and config["Validation"] is not None:
    rule make_VCF_link:
        """
        Make symlinks to run "microhaplot"
        """
        input:
            vcf = FINAL_VARIANTS + "Overlapping_SNPs.vcf"
        output:
            vcf = MICROHAPLOT + "snps.vcf"
        log:
            HAPLOTYPE_DOC + "make_VCF_link.log"
        threads:
            1
        shell:
            "ln -s $(readlink -f {input.vcf}) {output.vcf} 2> {log} "

else:
    pass
rule clean:
    shell:
        "rm -rf "
            "{RAW_DIR} {RAW_DOC} "
            "{MAP_DIR} {MAP_DOC} "
            "{QC_DIR} {QC_DOC} "
            "{ASSEMBLY_QC_DIR} {ASSEMBLY_QC_DOC} "

rule clean_raw:
    shell:
        "rm -rf {RAW_DIR} {RAW_DOC} "

rule clean_map:
    shell:
        "rm -rf {MAP_DIR} {MAP_DOC} "

rule clean_qc:
    shell:
        "rm -rf {QC_DIR} {QC_DOC} "

rule clean_assembly_qc:
    shell:
        "rm -rf {ASSEMBLY_QC_DIR} {ASSEMBLY_QC_DOC} "
RAW_DIR = "results/raw/"
RAW_DOC = "doc/raw/"
FLASHED_READS = "data/Flashed/"

MAP_DIR = "results/map/"
MAP_DOC = "doc/map/"
MAP_SAM = "results/map/SAM/"
MICROHAPLOT = "results/Final_results/Microhaplot/"
#MAP_DIR_SE = "results/map/se/"
#MAP_DOC_SE = "doc/map/se/"

QC_DIR = "results/qc/"
QC_DOC = "doc/qc/"
FLASHED_READS = "results/map/Flashed/"

VARIANT_CALL_DIR = "results/variant_calling/"
VARIANT_CALL_DOC = "doc/variant_calling/"

HAPLOTYPE_DIR = "results/Haplotype/"
HAPLOTYPE_DOC = "doc/haplotype/"

fastQC_raw = "results/Final_results/Quality_control/Raw_reads/"
fastQC = "results/Final_results/Quality_control/QC_reads/"
FINAL_VARIANTS = "results/Final_results/Variants/Filtered_Variants/"
RAW_VARIANTS = "results/Final_results/Variants/Raw_Variants/"
BAM_FILES = "results/Final_results/BAM_files/"
FINAL_DOCS_GENERALSTATS = "results/Final_results/Documents/GeneralStats/"
FINAL_DOCS_IDXSTATS = "results/Final_results/Documents/Idxstats/"
FINAL_DOCS_FLAGSTAT = "results/Final_results/Documents/Flagstat/"
FINAL_DOCS_QUALIMAP = "results/Final_results/Documents/Qualimap/"
FINAL_DOCS_MULTIQC_IDXSTATS = "results/Final_results/Documents/Idxstats/multiQC/"
FINAL_DOCS_MULTIQC_FLAGSTAT = "results/Final_results/Documents/Flagstat/multiQC/"
FINAL_DOCS_MULTIQC_QUALIMAP = "results/Final_results/Documents/Qualimap/multiQC/"
FINAL_DOCS_MULTIQC_FASTQC_RAW = "results/Final_results/Documents/FastQC/multiQC/Raw_reads/"
FINAL_DOCS_MULTIQC_FASTQC_PROCESSED = "results/Final_results/Documents/FastQC/multiQC/Processed_reads/"
FINAL_HAPLOTYPES = "results/Final_results/Haplotypes/"
if config["Haplotyper"] is True and config["Validation"] is None:
    rule haplotype_make_vcf_link_pe:
        """
        Make symlinks to run rad_haplotyper
        """
        input:
            vcf = FINAL_VARIANTS + "filtered_snps.recode.vcf"
        output:
            vcf = HAPLOTYPE_DIR + "filtered_snps.recode.vcf"
        threads:
            1
        log:
            HAPLOTYPE_DOC + "make_vcf_link.log"
        benchmark:
            HAPLOTYPE_DOC + "make_vcf_link.json"
        shell:
            "ln -s $(readlink -f {input.vcf}) {output.vcf} 2> {log} "


elif config["Haplotyper"] is True and config["Validation"] is not None:
    rule haplotype_make_overlapping_vcf_link_pe:
        """
        Make symlinks to run rad_haplotyper
        """
        input:
            vcf = FINAL_VARIANTS + "Overlapping_SNPs.vcf"
        output:
            vcf = HAPLOTYPE_DIR + "Overlapping_filtered_snps.recode.vcf"
        threads:
            1
        log:
            HAPLOTYPE_DOC + "make_vcf_link.log"
        benchmark:
            HAPLOTYPE_DOC + "make_vcf_link.json"
        shell:
            "ln -s $(readlink -f {input.vcf}) {output.vcf} 2> {log} "

else:
    pass

#########################################################################################################

if config["Haplotyper"] is True and config["Validation"] is None:
    rule haplotype_make_bam_links_pe:
        """
        Make symlinks to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "filtered_snps.recode.vcf",
            bam = MAP_DIR + "{sample}.sorted.bam",
            bai = MAP_DIR + "{sample}.sorted.bam.bai"
        output:
            bam = HAPLOTYPE_DIR + "{sample}.bam",
            bai = HAPLOTYPE_DIR + "{sample}.bam.bai"
        threads:
            1
        log:
            HAPLOTYPE_DOC + "make_bam_links.{sample}.log"
        benchmark:
            HAPLOTYPE_DOC + "make_bam_links.{sample}.json"
        shell:
            "ln -s $(readlink -f {input.bam}) {output.bam} 2> {log}; "
            "ln -s $(readlink -f {input.bai}) {output.bai} 2> {log}; "



elif  config["Haplotyper"] is True and config["Validation"] is not None:
    rule haplotype_make_bam_links_overlapping_SNPs_pe:
        """
        Make symlinks to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "Overlapping_filtered_snps.recode.vcf",
            bam = MAP_DIR + "{sample}.sorted.bam",
            bai = MAP_DIR + "{sample}.sorted.bam.bai"
        output:
            bam = HAPLOTYPE_DIR + "{sample}.bam",
            bai = HAPLOTYPE_DIR + "{sample}.bam.bai"
        threads:
            1
        log:
            HAPLOTYPE_DOC + "make_bam_links.{sample}.log"
        benchmark:
            HAPLOTYPE_DOC + "make_bam_links.{sample}.json"
        shell:
            "ln -s $(readlink -f {input.bam}) {output.bam} 2> {log}; "
            "ln -s $(readlink -f {input.bai}) {output.bai} 2> {log}; "



else:
    pass
#########################################################################################################

if config["illumina_se"] is None and config["Haplotyper"] is True and config["Validation"] is None:
    rule haplotype_make_popmap_pe:
        """
        Make popmap to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "filtered_snps.recode.vcf",
            bam = expand(HAPLOTYPE_DIR + "{sample}.bam",
                         sample = SAMPLES_PE)
        output:
            popmap = HAPLOTYPE_DIR + "popmap"
        threads:
            1
        params:
            hap_dir = HAPLOTYPE_DIR
        log:
            HAPLOTYPE_DOC + "make_popmap.log"
        benchmark:
            HAPLOTYPE_DOC + "make_popmap.json"
        shell:
            "ls {params.hap_dir}/*.bam | sed 's/.bam//' | awk '{{print $1, 1}}' > {output.popmap} "


elif config["illumina_pe"] is None and config["Haplotyper"] is True and config["Validation"] is None:
    rule haplotype_make_popmap_se:
        """
        Make popmap to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "filtered_snps.recode.vcf",
            bam = expand(HAPLOTYPE_DIR + "{sample}.bam",
                         sample = SAMPLES_SE)
        output:
            popmap = HAPLOTYPE_DIR + "popmap"
        threads:
            1
        params:
            hap_dir = HAPLOTYPE_DIR
        log:
            HAPLOTYPE_DOC + "make_popmap.log"
        benchmark:
            HAPLOTYPE_DOC + "make_popmap.json"
        shell:
            "ls {params.hap_dir}/*.bam | sed 's/.bam//' | awk '{{print $1, 1}}' > {output.popmap} "


elif config["illumina_se"] is None and config["Haplotyper"] is True and config["Validation"] is not None:
    rule haplotype_make_popmap_overlapping_SNPs_pe:
        """
        Make popmap to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "Overlapping_filtered_snps.recode.vcf",
            bam = expand(HAPLOTYPE_DIR + "{sample}.bam",
                         sample = SAMPLES_PE)
        output:
            popmap = HAPLOTYPE_DIR + "popmap"
        threads:
            1
        params:
            hap_dir = HAPLOTYPE_DIR
        log:
            HAPLOTYPE_DOC + "make_popmap.log"
        benchmark:
            HAPLOTYPE_DOC + "make_popmap.json"
        shell:
            "ls {params.hap_dir}/*.bam | sed 's/.bam//' | awk '{{print $1, 1}}' > {output.popmap} "


elif config["illumina_pe"] is None and config["Haplotyper"] is True and config["Validation"] is not None:
    rule haplotype_make_popmap_overlapping_SNPs_se:
        """
        Make popmap to run rad_haplotyper
        """
        input:
            vcf = HAPLOTYPE_DIR + "Overlapping_filtered_snps.recode.vcf",
            bam = expand(HAPLOTYPE_DIR + "{sample}.bam",
                         sample = SAMPLES_SE)
        output:
            popmap = HAPLOTYPE_DIR + "popmap"
        threads:
            1
        params:
            hap_dir = HAPLOTYPE_DIR
        log:
            HAPLOTYPE_DOC + "make_popmap.log"
        benchmark:
            HAPLOTYPE_DOC + "make_popmap.json"
        shell:
            "ls {params.hap_dir}/*.bam | sed 's/.bam//' | awk '{{print $1, 1}}' > {output.popmap} "


else:
    pass

#########################################################################################################


if  config["Haplotyper"] is True and config["Validation"] is None:
    rule haplotype_run_rad_haplotyper:
        """
        Run rad_haplotyper
        """
        input:
            popmap = HAPLOTYPE_DIR + "popmap",
            vcf = HAPLOTYPE_DIR + "filtered_snps.recode.vcf"
        output:
            output_1 = HAPLOTYPE_DIR + "haplotypes.done",
            output_2 = FINAL_HAPLOTYPES + "haplotypes.done"
        threads:
            20
        params:
            hap_dir = HAPLOTYPE_DIR,
            hap_dir_final_results = FINAL_HAPLOTYPES,
            home_dir = "../..",
            Haplotyping_params = config["Haplotyping_params"]["rad_haplotyper_params"]
        log:
            HAPLOTYPE_DOC + "rad_haplotyper.log"
        benchmark:
            HAPLOTYPE_DOC + "rad_haplotyper.json"
        shell:
            "cd {params.hap_dir}; "
            "rad_haplotyper.pl -v filtered_snps.recode.vcf -x {threads} -g haps.gen -p popmap {params.Haplotyping_params}; "
            "cd {params.home_dir}; "
            "touch {output.output_1}; "
            "cp {output.output_1} {output.output_2}; "
            "cp results/Haplotype/codes.haps.gen results/Final_results/Haplotypes/; "
            "cp results/Haplotype/stats.out results/Final_results/Haplotypes/; "
            "cp results/Haplotype/ind_stats.out results/Final_results/Haplotypes/; "
            "cp results/Haplotype/haps.gen results/Final_results/Haplotypes/; "
            "cp results/Haplotype/hap_loci.txt results/Final_results/Haplotypes/; "


elif  config["Haplotyper"] is True and config["Validation"] is not None:
    rule haplotype_run_rad_haplotyper_overlapping_SNPs_pe:
        """
        Run rad_haplotyper
        """
        input:
            popmap = HAPLOTYPE_DIR + "popmap",
            vcf = HAPLOTYPE_DIR + "Overlapping_filtered_snps.recode.vcf"
        output:
            output_1 = HAPLOTYPE_DIR + "haplotypes.done",
            output_2 = FINAL_HAPLOTYPES + "haplotypes.done"
        threads:
            20
        params:
            hap_dir = HAPLOTYPE_DIR,
            hap_dir_final_results = FINAL_HAPLOTYPES,
            home_dir = "../..",
            Haplotyping_params = config["Haplotyping_params"]["rad_haplotyper_params"]
        log:
            HAPLOTYPE_DOC + "rad_haplotyper.log"
        benchmark:
            HAPLOTYPE_DOC + "rad_haplotyper.json"
        shell:
            "cd {params.hap_dir}; "
            "rad_haplotyper.pl -v Overlapping_filtered_snps.recode.vcf -x {threads} -g haps.gen -p popmap {params.Haplotyping_params}; "
            "cd {params.home_dir}; "
            "touch {output.output_1}; "
            "cp {output.output_1} {output.output_2}; "
            "cp results/Haplotype/codes.haps.gen results/Final_results/Haplotypes/; "
            "cp results/Haplotype/stats.out results/Final_results/Haplotypes/; "
            "cp results/Haplotype/ind_stats.out results/Final_results/Haplotypes/; "
            "cp results/Haplotype/haps.gen results/Final_results/Haplotypes/; "
            "cp results/Haplotype/hap_loci.txt results/Final_results/Haplotypes/; "



else:
    pass
rule map_bwa_index_reference:
    input:
        RAW_DIR + "genome.fasta"
    output:
        RAW_DIR + "genome.fasta.bwt"
    log:
        MAP_DOC + "bwa_index_reference.log"
    benchmark:
        MAP_DOC + "bwa_index_reference.json"
    threads:
        4
    shell:
        "bwa index {input} &> {log}"

#########################################################################################################

if config["illumina_se"] is None and config["Flash"] is False:
   rule map_bwa_sample_pe:
        input:
            reference = RAW_DIR + "genome.fasta",
            index = RAW_DIR + "genome.fasta.bwt",
            forward = QC_DIR + "{sample}.final.1.fq.gz",
            reverse = QC_DIR + "{sample}.final.2.fq.gz"
        output:
            temp(MAP_DIR + "{sample}.unsorted.bam")
        params:
            rg="@RG\tID:{sample}\tSM:{sample}\tPL:Illumina",
            bwa_params = config["BWA-Mapping"]["BWA_mem_params"],
            samtools_params = config["Samtools_mapping_params"]["samtools_view_params"]
        log:
            MAP_DOC + "bwa_{sample}_pe.log"
        benchmark:
            MAP_DOC + "bwa_{sample}_pe.json"
        shell:
            "(bwa mem -R '{params.rg}' {params.bwa_params} {input.reference} {input.forward} {input.reverse} | "
            "samtools view {params.samtools_params} - "
            "> {output}) "
            "2> {log}"

elif config["illumina_se"] is None and config["Flash"] is True:
   rule map_bwa_sample_pe_flashed:
        input:
            reference = RAW_DIR + "genome.fasta",
            index = RAW_DIR + "genome.fasta.bwt",
            Flashed_read = FLASHED_READS + "{sample}.flashed.extendedFrags.fastq.gz"
        output:
            temp(MAP_DIR + "{sample}.unsorted.bam")
        params:
            rg="@RG\tID:{sample}\tSM:{sample}\tPL:Illumina",
            bwa_params = config["BWA-Mapping"]["BWA_mem_params"],
            samtools_params = config["Samtools_mapping_params"]["samtools_view_params"]
        log:
            MAP_DOC + "bwa_{sample}.log"
        benchmark:
            MAP_DOC + "bwa_{sample}.json"
        shell:
            "(bwa mem -R '{params.rg}' {params.bwa_params} {input.reference} {input.Flashed_read} | "
            "samtools view {params.samtools_params} - "
            "> {output}) "
            "2> {log}"


elif config["illumina_pe"] is None:
    rule map_bwa_sample_se:
        input:
            reference = RAW_DIR + "genome.fasta",
            index = RAW_DIR + "genome.fasta.bwt",
            single = QC_DIR + "{sample}.se.final.fq.gz"
        output:
            temp(MAP_DIR + "{sample}.unsorted.bam")
        params:
            rg="@RG\tID:{sample}\tSM:{sample}\tPL:Illumina",
            bwa_params = config["BWA-Mapping"]["BWA_mem_params"],
            samtools_params = config["Samtools_mapping_params"]["samtools_view_params"]
        log:
            MAP_DOC + "bwa_{sample}.log"
        benchmark:
            MAP_DOC + "bwa_{sample}.json"
        shell:
            "(bwa mem -R '{params.rg}' {params.bwa_params} {input.reference} {input.single} | "
            "samtools view {params.samtools_params} - "
            "> {output}) "
            "2> {log}"

#########################################################################################################


rule map_sort_sample:
    input:
        MAP_DIR + "{sample}.unsorted.bam"
    output:
        temp(MAP_DIR + "{sample}.Rmdup.bam")
    log:
        MAP_DOC + "sort_{sample}.log"
    benchmark:
        MAP_DOC + "sort_{sample}.json"
    threads:
        4
    shell:
        "samtools sort -n -@ {threads} "
            "-T $(mktemp --dry-run) "
            "-O bam {input} "
        "> {output} "
        "2> {log}"

#########################################################################################################


rule map_fixmate_sample:
    input:
        MAP_DIR + "{sample}.Rmdup.bam"
    output:
        temp(MAP_DIR + "{sample}.fixmate.bam")
    params:
        fixmate_params = config["SamTools_Fixmate"]["fixmate_params"]
    log:
        MAP_DOC + "fixmate_{sample}.log"
    benchmark:
        MAP_DOC + "fixmate_{sample}.json"
    shell:
        "samtools fixmate {params.fixmate_params} {input} {output} "




#########################################################################################################


rule map_sort_2_sample:
    input:
        MAP_DIR + "{sample}.fixmate.bam"
    output:
        temp(MAP_DIR + "{sample}.fixmate_sorted.bam")
    log:
        MAP_DOC + "sort_2_{sample}.log"
    benchmark:
        MAP_DOC + "sort_2_{sample}.json"
    threads:
        4
    shell:
        "samtools sort -@ {threads} "
            "-T $(mktemp --dry-run) "
            "-O bam {input} "
        "> {output} "
        "2> {log}"


#########################################################################################################


rule map_markdup_sample:
    input:
        MAP_DIR + "{sample}.fixmate_sorted.bam"
    output:
        BAM = MAP_DIR + "{sample}.sorted.bam",
        BAI = MAP_DIR + "{sample}.sorted.bam.bai"
    params:
        markdup_params = config["Marking_Duplicates"]["markdup_params"]
    log:
        MAP_DOC + "rmdup_{sample}.log"
    benchmark:
        MAP_DOC + "rmdup_{sample}.json"
    shell:
        """
        samtools markdup {params.markdup_params} {input} {output.BAM}
        samtools index {output.BAM} {output.BAI}
        """

#########################################################################################################


rule map_stats_sample:
    input:
        MAP_DIR + "{sample}.sorted.bam"
    output:
        FINAL_DOCS_GENERALSTATS + "{sample}.generalStats"
    log:
        MAP_DOC + "stats_{sample}.log"
    benchmark:
        MAP_DOC + "stats_{sample}.json"
    shell:
        "samtools stats {input} | "
        "grep ^SN | cut -f 2- > {output}"


#########################################################################################################


rule map_idxstats_sample:
    input:
        MAP_DIR + "{sample}.sorted.bam"
    output:
        idxstats = FINAL_DOCS_IDXSTATS + "{sample}.idxstats",
        flagstat = FINAL_DOCS_FLAGSTAT + "{sample}.flagstat"
    benchmark:
        MAP_DOC + "index_{sample}_idxstats.json"
    shell:
        """
        samtools idxstats {input} > {output.idxstats}
        samtools flagstat {input} > {output.flagstat}
        """


#########################################################################################################



rule map_QualiMap_bam_sample:
    input:
        MAP_DIR + "{sample}.sorted.bam"
    output:
        sample_header = temp(MAP_DIR + "{sample}.header.sam"),
        new_sample_header = temp(MAP_DIR + "{sample}.header.new.sam"),
        new_bam = temp(MAP_DIR + "{sample}.new.bam"),
        new_bam_index = temp(MAP_DIR + "{sample}.new.bam.bai"),
        qualimap_outfile = FINAL_DOCS_QUALIMAP + "{sample}.qualimap"
    benchmark:
        MAP_DOC + "qualimap_{sample}.json"
    shell:
        """
        samtools view -H {input} > {output.sample_header}
        head -n -3 {output.sample_header} > {output.new_sample_header}
        samtools reheader {output.new_sample_header} {input} > {output.new_bam}
        samtools index {output.new_bam} {output.new_bam_index}
        qualimap bamqc -bam {output.new_bam} -outdir {output.qualimap_outfile} -outfile {output.qualimap_outfile} -outformat html
        """

#########################################################################################################

if config["illumina_se"] is None:
    rule map_MultiQC_Samtools_sample_pe:
        input:
            idxstats = expand(FINAL_DOCS_IDXSTATS + "{sample}.idxstats", sample = SAMPLES_PE),
            flagstat = expand(FINAL_DOCS_FLAGSTAT + "{sample}.flagstat", sample = SAMPLES_PE)
        output:
            MultiQC_idxstats = FINAL_DOCS_MULTIQC_IDXSTATS,
            MultiQC_flagstat = FINAL_DOCS_MULTIQC_FLAGSTAT
        shell:
            """
            multiqc -m samtools {input.idxstats} -o {output.MultiQC_idxstats}
            multiqc -m samtools {input.flagstat} -o {output.MultiQC_flagstat}
            """

elif config["illumina_pe"] is None:
    rule map_MultiQC_Samtools_se:
        input:
            idxstats = expand(FINAL_DOCS_IDXSTATS + "{sample}.idxstats", sample = SAMPLES_SE),
            flagstat = expand(FINAL_DOCS_FLAGSTAT + "{sample}.flagstat", sample = SAMPLES_SE)
        output:
            MultiQC_idxstats = FINAL_DOCS_MULTIQC_IDXSTATS,
            MultiQC_flagstat = FINAL_DOCS_MULTIQC_FLAGSTAT
        shell:
            """
            multiqc -m samtools {input.idxstats} -o {output.MultiQC_idxstats}
            multiqc -m samtools {input.flagstat} -o {output.MultiQC_flagstat}
            """

#########################################################################################################

if config["illumina_se"] is None and config["QC_Tests"] is True:
    rule map_MultiQC_Qualimap_sample_pe:
        input:
            QUALIMAP = expand(FINAL_DOCS_QUALIMAP + "{sample}.qualimap", sample = SAMPLES_PE)
        output:
            MULTIQC_QUALIMAP = FINAL_DOCS_MULTIQC_QUALIMAP
        shell:
            """
            multiqc -m qualimap {input.QUALIMAP} -o {output.MULTIQC_QUALIMAP}
            """

elif config["illumina_pe"] is None and config["QC_Tests"] is True:
    rule map_MultiQC_Qualimap_sample_se:
        input:
            QUALIMAP = expand(FINAL_DOCS_QUALIMAP + "{sample}.qualimap", sample = SAMPLES_SE)
        output:
            MULTIQC_QUALIMAP = FINAL_DOCS_MULTIQC_QUALIMAP
        shell:
            """
            multiqc -m qualimap {input.QUALIMAP} -o {output.MULTIQC_QUALIMAP}
            """
else:
    pass

#########################################################################################################

if config["illumina_se"] is None:
    rule map_merge_bam_pe:
        input:
            bam = expand(MAP_DIR + "{sample}.sorted.bam",
                         sample = SAMPLES_PE),
            bai = expand(MAP_DIR + "{sample}.sorted.bam.bai",
                         sample = SAMPLES_PE)
        output:
            BAM_1 = MAP_DIR + "merged.bam",
            BAM_2 = BAM_FILES + "merged.bam",
            BAI_1 = BAM_FILES + "merged.bam.bai",
            BAI_2 = MAP_DIR + "merged.bam.bai",
            MD5_1 = MAP_DIR + "mergedMD5.txt",
            MD5_2 = BAM_FILES + "mergedMD5.txt"
        log:
            MAP_DOC + "merge_bam_pe.log"
        benchmark:
            MAP_DOC + "merge_bam_pe.json"
        shell:
            """
            samtools merge {output.BAM_1} {input.bam} > {log} 2>&1
            cp {output.BAM_1} {output.BAM_2}
            md5sum {output.BAM_1} > {output.MD5_1}
            md5sum {output.BAM_1} > {output.MD5_2}
            samtools index {output.BAM_1} {output.BAI_1}
            cp {output.BAI_1} {output.BAI_2}
            """

elif config["illumina_pe"] is None:
    rule map_merge_bam_se:
        input:
            bam = expand(MAP_DIR + "{sample}.sorted.bam",
                         sample = SAMPLES_SE),
            bai = expand(MAP_DIR + "{sample}.sorted.bam.bai",
                         sample = SAMPLES_SE)
        output:
            BAM_1 = MAP_DIR + "merged.bam",
            BAM_2 = BAM_FILES + "merged.bam",
            BAI_1 = BAM_FILES + "merged.bam.bai",
            BAI_2 = MAP_DIR + "merged.bam.bai",
            MD5_1 = MAP_DIR + "mergedMD5.txt",
            MD5_2 = BAM_FILES + "mergedMD5.txt"
        log:
            MAP_DOC + "merge_bam_se.log"
        benchmark:
            MAP_DOC + "merge_bam_se.json"
        shell:
            """
            samtools merge {output.BAM_1} {input.bam} > {log} 2>&1
            cp {output.BAM_1} {output.BAM_2}
            md5sum {output.BAM_1} > {output.MD5_1}
            md5sum {output.BAM_1} > {output.MD5_2}
            samtools index {output.BAM_1} {output.BAI_1}
            cp {output.BAI_1} {output.BAI_2}
            """

#########################################################################################################


if config["illumina_se"] is None and config["Flash"] is False:
    rule map_samtools_properly_paired_pe:
        input:
            MAP_DIR + "merged.bam"
        output:
            BAM_1 = MAP_DIR + "merged_properly_paired.bam",
            BAM_2 = BAM_FILES + "merged_properly_paired.bam"
        params:
            samtools_params = config["Samtools_mapping_params"]["samtools_view_params"],
            samtools_filtering_params = config["Samtools_read_filtering"]["samtools_view_filtering"]
        log:
            MAP_DOC + "properly_paired_merged.log"
        benchmark:
            MAP_DOC + "properly_paired_merged.json"
        shell:
            """
            samtools view {params.samtools_params} {params.samtools_filtering_params} {input} > {output.BAM_1}
            cp {output.BAM_1} {output.BAM_2}
            """

elif config["illumina_se"] is None and config["Flash"] is True:
    pass


elif config["illumina_pe"] is None:
    pass
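
In the non-FLASH paired-end branch above, the samtools flags come entirely from the config; a typical (hypothetical) choice for keeping only properly paired reads is the -f 2 filter:

# Hypothetical filtering call: keep only reads mapped in a proper pair
$ samtools view -b -f 2 merged.bam > merged_properly_paired.bam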


#########################################################################################################

if config["illumina_se"] is None:
    rule properly_paired_index_pe:
        input:
            MAP_DIR + "merged_properly_paired.bam"
        output:
            BAI_1 = MAP_DIR + "merged_properly_paired.bam.bai",
            BAI_2 = BAM_FILES + "merged_properly_paired.bam.bai"
        log:
            MAP_DOC + "index_merged_properly_paired.log"
        benchmark:
            MAP_DOC + "index_merged_properly_paired.json"
        shell:
            """
            samtools index {input} {output.BAI_1}
            cp {output.BAI_1} {output.BAI_2}
            """

elif config["illumina_se"] is None and config["Flash"] is True:
    pass

elif config["illumina_pe"] is None:
    pass

#########################################################################################################


if config["illumina_se"] is None:
    rule map:
        input:
            MAP_DIR + "merged.bam.bai",
            expand([FINAL_DOCS_IDXSTATS + "{sample}.idxstats", FINAL_DOCS_FLAGSTAT + "{sample}.flagstat", FINAL_DOCS_GENERALSTATS + "{sample}.generalStats"],
                sample = SAMPLES_PE)

elif config["illumina_pe"] is None:
    rule map:
        input:
            MAP_DIR + "merged.bam.bai",
            expand([FINAL_DOCS_IDXSTATS + "{sample}.idxstats", FINAL_DOCS_FLAGSTAT + "{sample}.flagstat", FINAL_DOCS_GENERALSTATS + "{sample}.generalStats"],
                sample = SAMPLES_SE)
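
Whichever branch applies, the mapping stage can be requested on its own by asking Snakemake for the map target, for example:

# Run only the mapping stage of the workflow (adjust the core count to your machine)
$ snakemake --cores 24 map
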
if config["illumina_se"] is None and config["QC_Tests"] is True:
    rule fastqc_raw_pe:
        input:
            RAW_DIR + "{sample}_{pair}.fq.gz",
        output:
            fastQC_raw + "{sample}.raw.{pair}.fastqc"
        shell:
            """
            fastqc --outdir {fastQC_raw} {input} > {output}
            """

elif config["illumina_pe"] is None and config["QC_Tests"] is True:
    rule fastqc_raw_se:
        input:
            RAW_DIR + "{sample}.se.fq.gz"
        output:
            fastQC_raw + "{sample}.raw.fastqc"
        shell:
            """
            fastqc --outdir {fastQC_raw} {input} > {output}
            """
else:
    pass


############################################################################################################
#"""Run trimmomatic in paired end mode to eliminate Illumina adaptors and remove
 #low quality regions and reads."""
############################################################################################################

if config["illumina_se"] is None:
    rule qc_trimmomatic_pe:
        """
        Run trimmomatic in paired end mode to eliminate Illumina adaptors and
        remove low quality regions and reads.
        Inputs _1 and _2 are piped through gzip/pigz.
        Outputs _1 and _2 are piped to gzip/pigz (level 9).
        Outputs _3 and _4 are compressed with the builtin compressor from
        Trimmomatic. Further on they are catted and compressed with gzip/pigz
        (level 9).
        Note: The cut -f 1 -d " " is to remove additional fields in the FASTQ
        header. It is done posterior to the trimming since the output comes
        slower than the input is read.
        Number of threads used:
            4 for trimmomatic
            2 for gzip inputs
            2 for gzip outputs
            Total: 8
        """
        input:
            forward = RAW_DIR + "{sample}_1.fq.gz",
            reverse = RAW_DIR + "{sample}_2.fq.gz"
        output:
            forward     = QC_DIR + "{sample}.final.1.fq.gz",
            reverse     = QC_DIR + "{sample}.final.2.fq.gz",
            unpaired_pe    = protected(QC_DIR + "{sample}.final.unpaired.fq.gz")
        params:
            unpaired_1  = QC_DIR + "{sample}_3.fq.gz",
            unpaired_2  = QC_DIR + "{sample}_4.fq.gz",
            adapter     = lambda wildcards: config["illumina_pe"][wildcards.sample]["adapter"],
            phred       = lambda wildcards: config["illumina_pe"][wildcards.sample]["phred"],
            trimmomatic_params = config["Read Trimming"]["trimmomatic_params"]
        log:
            QC_DOC + "trimmomatic_pe_{sample}.log"
        benchmark:
            QC_DOC + "trimmomatic_pe_{sample}.json"
        threads:
            24 # I've been able to work with pigz and 24 trimmomatic threads.
        shell:
            """
            trimmomatic PE \
                -threads {threads} \
                -{params.phred} \
                {input.forward} {input.reverse} \
                {output.forward} {params.unpaired_1} {output.reverse} {params.unpaired_2} \
                ILLUMINACLIP:{params.adapter}:2:30:10 \
                {params.trimmomatic_params} \
                2> {log}

                zcat {params.unpaired_1} {params.unpaired_2} |
                cut -f 1 -d " " |
                pigz -9 > {output.unpaired_pe}

            rm {params.unpaired_1} {params.unpaired_2}
            """

elif config["illumina_pe"] is None:
    rule qc_trimmomatic_se:
        """
        Run trimmomatic in paired end mode to eliminate Illumina adaptors and
        remove low quality regions and reads.
        Inputs _1 and _2 are piped through gzip/pigz.
        Outputs _1 and _2 are piped to gzip/pigz (level 9).
        Outputs _3 and _4 are compressed with the builtin compressor from
        Trimmomatic. Further on they are catted and compressed with gzip/pigz
        (level 9).
        Note: The cut -f 1 -d " " is to remove additional fields in the FASTQ
        header. It is done posterior to the trimming since the output comes
        slower than the input is read.
        Number of threads used:
            4 for trimmomatic
            2 for gzip inputs
            2 for gzip outputs
            Total: 8
        """
        input:
            single = RAW_DIR + "{sample}.se.fq.gz"
        output:
            single	= QC_DIR + "{sample}.se.final.fq.gz"
        params:
            adapter	= lambda wildcards: config["illumina_se"][wildcards.sample]["adapter"],
            phred	= lambda wildcards: config["illumina_se"][wildcards.sample]["phred"],
            trimmomatic_params = config["Read Trimming"]["trimmomatic_params"]
        log:
            QC_DOC + "trimmomatic_se_{sample}.log"
        benchmark:
            QC_DOC + "trimmomatic_se_{sample}.json"
        threads:
            24 # I've been able to work with pigz and 24 trimmomatic threads.
        shell:
            """
            trimmomatic SE \
            	-threads {threads} \
            	-{params.phred} \
            	{input.single} \
            	{output.single} \
            	ILLUMINACLIP:{params.adapter}:2:30:10 \
            	{params.trimmomatic_params} \
            	2> {log}
            """



############################################################################################################

if config["illumina_se"] is None and config["QC_Tests"] is True:
    rule fastqc_pe:
        input:
            QC_DIR + "{sample}.final.{pair}.fq.gz"
        output:
            fastQC + "{sample}.{pair}"
        shell:
            """
            fastqc --outdir {fastQC} {input} > {output}
            """

elif config["illumina_pe"] is None and config["QC_Tests"] is True:
    rule fastqc_se:
        input:
            QC_DIR + "{sample}.se.final.fq.gz"
        output:
            fastQC + "{sample}.se"
        shell:
            """
            fastqc --outdir {fastQC} {input} > {output}
            """
else:
    pass

############################################################################################################

if config["illumina_se"] is None and config["Flash"] is True:
    rule flash_pe_reads:
        input:
            forward = QC_DIR + "{sample}.final.1.fq.gz",
            reverse = QC_DIR + "{sample}.final.2.fq.gz"
        output:
            name = FLASHED_READS + "{sample}.flashed",
            file_name = FLASHED_READS + "{sample}.flashed.extendedFrags.fastq.gz"
        params:
            scripts_dir = config["scripts_dir"],
            flash_params = config["FLASH"]["FLASH_params"]
        threads:
            4
        shell:
            "flash {params.flash_params} {input.forward} {input.reverse} "
            "-o {output.name} -z "
            "> {output.name}"


elif config["illumina_pe"] is None:
    pass
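
FLASH writes everything under the -o prefix, and -z gzip-compresses the read outputs, so a run of the rule above leaves files named after FLASH's own conventions:

# Files produced by: flash ... -o sampleA.flashed -z
sampleA.flashed.extendedFrags.fastq.gz    # successfully merged read pairs
sampleA.flashed.notCombined_1.fastq.gz    # forward reads that could not be merged
sampleA.flashed.notCombined_2.fastq.gz    # reverse reads that could not be merged
sampleA.flashed.hist                      # histogram of merged fragment lengths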


############################################################################################################

if config["illumina_se"] is None:
    rule qc_results_pe:
        '''Generate only the resulting files, not the reports'''
        input:
            pe_files = expand(
                [QC_DIR + "{sample}.final.{pair}.fq.gz", fastQC_raw + "{sample}.raw.{pair}.fastqc", FLASHED_READS + "{sample}.flashed.extendedFrags.fastq.gz", FLASHED_READS + "{sample}.flashed"],
                sample = SAMPLES_PE,
                pair = "1 2".split()
            )

elif config["illumina_pe"] is None:
    rule qc_results_se:
        '''Generate only the resulting files, not the reports'''
        input:
            se_files = expand(
                [QC_DIR + "{sample}.se.final.fq.gz", fastQC_raw + "{sample}.raw.fastqc"],
                sample = SAMPLES_SE
            )

############################################################################################################



if config["illumina_se"] is None and config["QC_Tests"] is True:
    rule raw_results_fastqc_pe:
        """Checkpoint to generate all the links for raw data"""
        input:
            expand(
                fastQC_raw + "{sample}.raw.{pair}.fastqc.{extension}", 
                sample = SAMPLES_PE,
                pair = "1 2".split(),
                extension = "zip html".split()
                )
elif config["illumina_pe"] is None and config["QC_Tests"] is True:
    rule raw_results_fastqc_se:
        """Checkpoint to generate all the links for raw data"""
        input:
            expand(
                fastQC_raw + "{sample}.raw.fastqc.{extension}",
                sample = SAMPLES_SE,
                extension = "zip html".split()
           )
rule raw_make_link_reference:
    """
    Make a link to the reference genome.
    """
    input:
        config["genome"]
    output:
        RAW_DIR + "genome.fasta"
    log:
        RAW_DOC + "make_link_reference.log"
    benchmark:
        RAW_DOC + "make_link_reference.json"
    shell:
        "ln -s $(readlink -f {input}) {output} 2> {log}"



if config["illumina_se"] is None:
    rule raw_make_links_pe_sample:
        """
        Make a link next to the original file, with a prettier name than default.
        """
        input:
            forward = lambda wildcards: config["illumina_pe"][wildcards.sample]["forward"],
            reverse = lambda wildcards: config["illumina_pe"][wildcards.sample]["reverse"]
        output:
            forward = protected(RAW_DIR + "{sample}_1.fq.gz"),
            reverse = protected(RAW_DIR + "{sample}_2.fq.gz")
        log:
            RAW_DOC + "make_links_pe_{sample}.log"
        benchmark:
            RAW_DOC + "make_links_pe_{sample}.json"
        shell:
            "ln -s $(readlink -f {input.forward}) {output.forward} 2> {log};"
            "ln -s $(readlink -f {input.reverse}) {output.reverse} 2>> {log}"

elif config["illumina_pe"] is None:
    rule raw_make_links_se_sample:
        """
        Make a link next to the original file, with a prettier name than default.
        """
        input:
           single = lambda wildcards: config["illumina_se"][wildcards.sample]["single"]
        output:
            single = protected(RAW_DIR + "{sample}.se.fq.gz")
        log:
            RAW_DOC + "make_links_se_{sample}.log"
        benchmark:
            RAW_DOC + "make_links_se_{sample}.json"
        shell:
            "ln -s $(readlink -f {input.single}) {output.single} 2> {log}"


if config["illumina_se"] is None:
    rule raw_results_pe:
        """Checkpoint to generate all the links for raw data"""
        input:
            expand(
                RAW_DIR + "{sample}_{pair}.fq.gz",
                sample = SAMPLES_PE,
                pair = "1 2".split()
                )
elif config["illumina_pe"] is None:
    rule raw_results_se:
        """Checkpoint to generate all the links for raw data"""
        input:
            expand(
                RAW_DIR + "{sample}.se.fq.gz",
                sample = SAMPLES_SE
           )
rule variant_call_split_genome:
    """
    Split the reference genome into separate components for parallel variant calling
    """
    input:
        genome = RAW_DIR + "genome.fasta"
    output:
        expand(
            RAW_DIR + "genome.{index}.fasta",
            index = map(lambda n: "%01d" % n, range(0, int(config["variant_call_params"]["split_genome"])))
            )
    threads:
        1
    params:
        split = config["variant_call_params"]["split_genome"]
    log:
        VARIANT_CALL_DOC + "variant_call_split_fasta.log"
    benchmark:
        VARIANT_CALL_DOC + "variant_call_split_fasta.json"
    shell:
        "pyfasta split -n {params.split} {input} 2> {log}"


rule variant_call_make_bed:
    """
    Generate a BED file for each of the split FASTA files
    """
    input:
        RAW_DIR + "genome.{index}.fasta"
    output:
        VARIANT_CALL_DIR + "genome.{index}.bed"
    params:
        scripts_dir = config["scripts_dir"]
    threads:
        2
    log:
        VARIANT_CALL_DOC + "variant_call_make_bed.{index}.log"
    benchmark:
        VARIANT_CALL_DOC + "variant_call_make_bed.{index}.json"
    shell:
        "{params.scripts_dir}/fa_to_bed.sh {input}"
        "> {output} "
        "2> {log}"


#########################################################################################################

if config["illumina_se"] is None and config["Flash"] is False:
    rule variant_call_subset_bam_pe:
        """
        Generate a BAM file for each of the genome partitions
        """
        input:
            bam = MAP_DIR + "merged_properly_paired.bam",
            bai = MAP_DIR + "merged_properly_paired.bam.bai",
            bed = VARIANT_CALL_DIR + "genome.{index}.bed"
        output:
            bam = VARIANT_CALL_DIR + "merged.{index}.bam",
            bai = VARIANT_CALL_DIR + "merged.{index}.bam.bai"
        params:
            scripts_dir = config["scripts_dir"]
        threads:
            2
        log:
            VARIANT_CALL_DOC + "variant_call_make_bam.log"
        benchmark:
            VARIANT_CALL_DOC + "variant_call_make_bam.json"
        shell:
            "samtools view -L {input.bed} -b {input.bam} "
            "> {output.bam} "
            "2> {log} && "
            "samtools index {output.bam}"

elif config["illumina_se"] is None and config["Flash"] is True:
    rule variant_call_flashed_bam_pe:
        """
        Generate a BAM file for each of the genome partitions
        """
        input:
            bam = MAP_DIR + "merged.bam",
            bai = MAP_DIR + "merged.bam.bai",
            bed = VARIANT_CALL_DIR + "genome.{index}.bed"
        output:
            bam = VARIANT_CALL_DIR + "merged.{index}.bam",
            bai = VARIANT_CALL_DIR + "merged.{index}.bam.bai"
        params:
            scripts_dir = config["scripts_dir"]
        threads:
            2
        log:
            VARIANT_CALL_DOC + "variant_call_make_bam.log"
        benchmark:
            VARIANT_CALL_DOC + "variant_call_make_bam.json"
        shell:
            "samtools view -L {input.bed} -b {input.bam} "
            "> {output.bam} "
            "2> {log} && "
            "samtools index {output.bam}"


elif config["illumina_pe"] is None:
    rule variant_call_subset_bam_se:
        """
        Generate a BAM file for each of the genome partitions
        """
        input:
            bam = MAP_DIR + "merged.bam",
            bai = MAP_DIR + "merged.bam.bai",
            bed = VARIANT_CALL_DIR + "genome.{index}.bed"
        output:
            bam = VARIANT_CALL_DIR + "merged.{index}.bam",
            bai = VARIANT_CALL_DIR + "merged.{index}.bam.bai"
        params:
            scripts_dir = config["scripts_dir"]
        threads:
            2
        log:
            VARIANT_CALL_DOC + "variant_call_make_bam.log"
        benchmark:
            VARIANT_CALL_DOC + "variant_call_make_bam.json"
        shell:
            "samtools view -L {input.bed} -b {input.bam} "
            "> {output.bam} "
            "2> {log} && "
            "samtools index {output.bam}"

#########################################################################################################


rule variant_call_run_freebayes:
    """
    Run FreeBayes
    """
    input:
        bam = VARIANT_CALL_DIR + "merged.{index}.bam",
        bai = VARIANT_CALL_DIR + "merged.{index}.bam.bai",
        fasta = RAW_DIR + "genome.{index}.fasta"
    output:
        VARIANT_CALL_DIR + "snps.{index}.vcf"
    params:
        fb_params = config["FreeBayes_variant_calling"]["freebayes_params"]
    threads:
        4
    log:
        VARIANT_CALL_DOC + "freebayes.{index}.log"
    benchmark:
        VARIANT_CALL_DOC + "freebayes.{index}.json"
    shell:
        "freebayes {params.fb_params} -f {input.fasta} {input.bam} > {output}"


#########################################################################################################


rule variant_call_bgzip_vcf:
    """
    Compress the VCF files with bgzip
    """
    input:
        VARIANT_CALL_DIR + "snps.{index}.vcf"
    output:
        VARIANT_CALL_DIR + "snps.{index}.vcf.gz"
    threads:
        1
    log:
        VARIANT_CALL_DOC + "bgzip_{index}.log"
    benchmark:
        VARIANT_CALL_DOC + "bgzip_{index}.json"
    shell:
        "bgzip -c {input} > {output}"


#########################################################################################################


rule variant_call_index_vcf:
    """
    Index the VCF files with tabix
    """
    input:
        VARIANT_CALL_DIR + "snps.{index}.vcf.gz"
    output:
        VARIANT_CALL_DIR + "snps.{index}.vcf.gz.tbi"
    threads:
        2
    log:
        VARIANT_CALL_DOC + "vcf_index_{index}.log"
    benchmark:
        VARIANT_CALL_DOC + "vcf_index_{index}.json"
    shell:
        "tabix {input}"


#########################################################################################################


rule variant_call_merge_vcf:
    """
    Merge all VCF files into a single file with bcftools concat
    """
    input:
        vcf = expand(
            VARIANT_CALL_DIR + "snps.{index}.vcf.gz",
            index = map(lambda n: "%01d" % n, range(0, int(config["variant_call_params"]["split_genome"])))),
        index = expand(
            VARIANT_CALL_DIR + "snps.{index}.vcf.gz.tbi",
            index = map(lambda n: "%01d" % n, range(0, int(config["variant_call_params"]["split_genome"]))))
    output:
        VARIANTS_1 = VARIANT_CALL_DIR + "raw_snps.vcf.gz",
        VARIANTS_2 = RAW_VARIANTS + "raw_snps.vcf.gz"
    threads:
        10
    log:
        VARIANT_CALL_DOC + "bcftools_merge.log"
    benchmark:
        VARIANT_CALL_DOC + "bcftools_merge.json"
    shell:
        """
        bcftools concat -Oz {input.vcf} > {output.VARIANTS_1} 2> {log}
        cp {output.VARIANTS_1} {output.VARIANTS_2}
        """

#########################################################################################################


rule variant_call_filter_vcf:
    """
    Basic site quality filtering for the VCF file
    """
    input:
        VARIANT_CALL_DIR + "raw_snps.vcf.gz"
    output:
        FILTERED_VARIANTS_1 = VARIANT_CALL_DIR + "filtered_snps.recode.vcf",
        FILTERED_VARIANTS_2 = FINAL_VARIANTS + "filtered_snps.recode.vcf",
        HARDY_WEINBERG_VARIANTS = FINAL_VARIANTS + "Hardy_Weinberg_values"
    threads:
        2
    params:
        vcftools_params = config["VCFtools"]["variant_filtering"]
    log:
        VARIANT_CALL_DOC + "variant_call_filter_vcf.log"
    benchmark:
        VARIANT_CALL_DOC + "variant_call_filter_vcf.json"
    shell:
        """
        vcftools --gzvcf {input} {params.vcftools_params} -c > {output.FILTERED_VARIANTS_1}
        vcftools --gzvcf {output.FILTERED_VARIANTS_1} --hardy -c > {output.HARDY_WEINGBERG_VARIANTS}
        cp {output.FILTERED_VARIANTS_1} {output.FILTERED_VARIANTS_2}
        """

#########################################################################################################

if config["illumina_se"] is None and config["Validation"] is not None:
    rule variant_validation_vcf:
        """
        Use a previous variant dataset to compare, extract and store common high-value SNPs between the two datasets
        """
        input:
            FILTERED_VARIANTS = FINAL_VARIANTS + "filtered_snps.recode.vcf",
            INPUT_VCF_FILE = config["Validation"]
        output:
            FINAL_VARIANTS + "Overlapping_SNPs"
        log:
            VARIANT_CALL_DOC + "variant_validation.log"
        benchmark:
            VARIANT_CALL_DOC + "variant_validation.json"
        shell:
            """
            vcftools --vcf {input.FILTERED_VARIANTS} --diff {input.INPUT_VCF_FILE} --diff-site -c > {output}
            """

else:
    pass

#########################################################################################################

if config["illumina_se"] is None and config["Validation"] is not None:
    rule intersect_vcf:
        input:
            FILTERED_VARIANTS = FINAL_VARIANTS + "filtered_snps.recode.vcf",
            INPUT_VCF_FILE = config["Validation"]
        output:
            FINAL_VARIANTS + "Overlapping_SNPs.vcf"
        log:
            VARIANT_CALL_DOC + "bedtools_intersect.log"
        benchmark:
            VARIANT_CALL_DOC + "bedtools_intersect.json"
        shell:
            """
            bedtools intersect -a {input.FILTERED_VARIANTS} -b {input.INPUT_VCF_FILE} -header > {output}
            """

else:
    pass