Running the haplotype/variant calling pipeline with Snakemake


Overview: Using samples from amplicon deep sequencing of gene targets, call either haplotypes or variants.

Method: Targeted amplicon deep sequencing which produces forward and reverse fastq files for each sample.

Snakemake is a workflow manager that enables massively parallel and reproducible analyses. Snakemake is a suitable tool to use when you can break a workflow down into discrete steps, with each step having input and output files.

We provide this example workflow as a template to get started running the pipeline with Snakemake. To adjust to your specific data, you can customize the config.yaml file.
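
To illustrate the shape of a rule and how config.yaml values feed into it, here is a minimal, hypothetical sketch; the rule name, paths, and output naming below are illustrative and are not copied from the pipeline's Snakefile:

    configfile: "config.yaml"

    # Hypothetical rule: run FastQC on one sample's forward reads.
    # config["pair1"] is the folder of forward reads declared in config.yaml.
    rule example_fastqc:
        input:
            fastq=config["pair1"] + "/{sample}_R1.fastq.gz"
        output:
            html="output/fastqc_out/{sample}_R1_fastqc.html"
        shell:
            "fastqc {input.fastq} --outdir output/fastqc_out"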

For more details on Snakemake, see the Snakemake tutorial.

The Workflow

The Snakefile contains rules which define the output files we want and how to make them. Snakemake automatically builds a directed acyclic graph (DAG) of jobs to figure out the dependencies of each of the rules and what order to run them in.

[Figure: directed acyclic graph (DAG) of the pipeline's jobs]

The DAG shows how the pipeline can call haplotypes and variants simultaneously, and how calls can run in parallel if Snakemake is allowed to run more than one job at a time. Here, the pipeline is parallelized across two targets each for haplotype and variant calling (AMA and CSP for haplotype calling, PFCRT and PFMDR1 for variant calling) and six quality scores (2, 5, 10, 15, 20, 25) at the same time.
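
This fan-out is the kind of pattern Snakemake's expand() helper expresses; the sketch below shows one plausible way the per-target, per-quality-score outputs could be requested (the rule name and output paths are illustrative, not taken from the pipeline's Snakefile):

    # Hypothetical top-level rule requesting one trim-and-filter table per
    # haplotype-calling target and per truncQ value listed in config.yaml
    # (assumes configfile: "config.yaml" is declared at the top of the Snakefile)
    rule all_example:
        input:
            expand(
                "output/haplotype_output/{target}/trim_filter_out/{target}_{q}_trim_and_filter_table",
                target=config["haplotype_calling_targets"],
                q=config["truncQ_values"],
            )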

call_fastqc uses FastQC to perform quality control on the raw reads and lets you visualize read quality.

call_trimmomatic cleans and filters the raw reads. It uses CutAdapt to trim the primers and adapter sequences from sequencing reads, and Trimmomatic to quality-filter reads, discarding reads whose average Phred quality score over any 4-nucleotide window falls below 15 or that end up shorter than 80 nucleotides.
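
As a rough sketch of that step, the commands below pair CutAdapt primer trimming with a Trimmomatic sliding-window filter; the actual rule's flags, file names, and ordering may differ:

    # Hedged sketch of primer trimming plus quality filtering for one sample
    # (file names are placeholders; the pipeline's actual command line may differ)
    rule example_trim:
        input:
            r1="raw/{sample}_R1.fastq.gz",
            r2="raw/{sample}_R2.fastq.gz"
        output:
            r1="trimmed_reads/1/{sample}.1.fastq.gz",
            r2="trimmed_reads/2/{sample}.2.fastq.gz"
        shell:
            # CutAdapt trims primers/adapters listed in the primer FASTA files;
            # Trimmomatic then drops reads whose 4-base sliding-window average
            # quality is below 15 or whose remaining length is under 80 nt.
            "cutadapt -g file:forwardPrimers.fasta -G file:reversePrimers.fasta "
            "-o cut.{wildcards.sample}.1.fastq.gz -p cut.{wildcards.sample}.2.fastq.gz "
            "{input.r1} {input.r2} && "
            "trimmomatic PE cut.{wildcards.sample}.1.fastq.gz cut.{wildcards.sample}.2.fastq.gz "
            "{output.r1} {wildcards.sample}.1_unpaired.fq.gz "
            "{output.r2} {wildcards.sample}.2_unpaired.fq.gz "
            "SLIDINGWINDOW:4:15 MINLEN:80"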

generate_multiqc_report uses the MultiQC reporting tool to parse summary statistics from results and log files generated by fastqc and cutadapt.

These first three rules are called for the given input data, regardless of whether the pipeline is being used to call haplotypes or call variants. Following this, the pipeline splits up into haplotype and/or variant calling.

variant_calling uses the Burrows-Wheeler Aligner (BWA), which maps low-divergence sequences against a large reference genome. This is used for variant calling for drug resistance genes.
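
The page does not spell out the exact command chain, so the sketch below is only a generic BWA-style path from trimmed reads to a per-sample VCF; the alignment flags, the choice of variant caller, and all file names are assumptions:

    # Generic, assumed alignment-to-VCF chain; the pipeline's actual rule and
    # choice of variant caller may differ
    rule example_variant_calling:
        input:
            ref="refs/{target}.fasta",
            r1="trimmed_reads/1/{sample}.1.fastq.gz",
            r2="trimmed_reads/2/{sample}.2.fastq.gz"
        output:
            vcf="variant_output/{target}/vcf/{sample}.vcf"
        shell:
            "bwa index {input.ref} && "
            "bwa mem {input.ref} {input.r1} {input.r2} | "
            "samtools sort -o {wildcards.sample}.sorted.bam - && "
            "samtools index {wildcards.sample}.sorted.bam && "
            "bcftools mpileup -f {input.ref} {wildcards.sample}.sorted.bam | "
            "bcftools call -m -Ov -o {output.vcf}"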

analyze_vcf takes in the vcf file(s) produced by the previous rule and returns a table of drug-resistance-associated mutations among the amplicon vcf file(s).

combine_vcf_analysis uses each target's table returned by analyze_vcf and combines them into a summary table that allows for easy analysis of the variant calling results.
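
At its core this is a table-combining step; the lines below are a hypothetical pandas sketch of just the concatenation part (paths and column handling assumed), not the pipeline's own script:

    # Hypothetical sketch: stack each target's DR_mutations.csv into one summary table
    import pandas as pd

    targets = ["PFCRT", "PFMDR1"]  # e.g. the variant_calling_targets from config.yaml
    tables = [pd.read_csv(f"variant_output/{t}/DR_mutations.csv") for t in targets]
    combined = pd.concat(tables, ignore_index=True)
    combined.to_csv("variant_output/dr_depths_freqs.csv", index=False)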

get_marker_lengths calculates the length of each marker using the lengths of the reference and primer sequences.

synchronize_reads cleans, filters, and maps the raw reads. It uses BBSplit (part of BBMap) to map all reads to the reference sequences, CutAdapt to trim the primers and adapter sequences from sequencing reads, and Trimmomatic to quality-filter reads, discarding reads whose average Phred quality score over any 4-nucleotide window falls below 15 or that end up shorter than 80 nucleotides. This is the first step in read processing on the cluster.
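
For reference, BBSplit takes a comma-separated list of per-target references and writes one read pair per target; the rule below is a hedged sketch with assumed paths, not the pipeline's actual rule:

    # Hedged BBSplit sketch: % in basename becomes the reference (target) name,
    # # becomes the read number, giving files like BF1_AMA_1.fastq.gz
    rule example_bbsplit:
        input:
            r1="trimmed_reads/1/{sample}.1.fastq.gz",
            r2="trimmed_reads/2/{sample}.2.fastq.gz"
        params:
            refs="refs/AMA.fasta,refs/CSP.fasta"  # comma-separated reference FASTAs
        output:
            "mapped_reads/{sample}_AMA_1.fastq.gz",
            "mapped_reads/{sample}_AMA_2.fastq.gz",
            "mapped_reads/{sample}_CSP_1.fastq.gz",
            "mapped_reads/{sample}_CSP_2.fastq.gz"
        shell:
            "bbsplit.sh in1={input.r1} in2={input.r2} ref={params.refs} "
            "basename=mapped_reads/{wildcards.sample}_%_#.fastq.gz"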

trim_and_filter filters and trims the forward and reverse reads using the DADA2 program and outputs them into a filtered folder.

call_haplotypes calls haplotypes for your target(s) and writes the results out into a reads table.

optimize_reads finds the quality score that produces the highest number of read counts and uses only the filtered reads created using that quality score.
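
In other words, this rule reads the per-q read counts (like the read_count table shown later) and keeps the quality score with the largest count. A small Python sketch of just that selection logic, assuming a whitespace-separated "q count" file, is:

    # Illustrative selection of the optimal truncQ value from a read_count file;
    # the pipeline's own script may differ (file path and format are assumptions)
    def best_q(read_count_path):
        counts = {}
        with open(read_count_path) as fh:
            for line in fh:
                q, count = line.split()
                counts[int(q)] = int(count)
        # the quality score whose filtered reads gave the highest count wins
        return max(counts, key=counts.get)

    print(best_q("haplotype_output/AMA/read_count"))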

merge_read_tracking combines the read count files from the previous three rules into one table that shows the number of reads at different points of the workflow.
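
Conceptually this is a join keyed on sample; the hypothetical pandas sketch below merges a trim-and-filter table with the final tracking table for one target (file names, CSV formatting, and the sample-name cleanup are assumptions):

    # Hypothetical merge of trim/filter counts with DADA2 tracking counts for one target
    import pandas as pd

    trim = pd.read_csv("AMA_10_trim_and_filter_table", index_col=0)             # reads.in, reads.out
    track = pd.read_csv("AMA_final_track_reads_through_pipeline", index_col=0)  # merged, tabled, nonchim

    # trim rows are keyed by fastq file name (e.g. BF1_AMA_1.fastq.gz); reduce to sample IDs
    trim.index = trim.index.str.replace(r"_AMA_1\.fastq\.gz$", "", regex=True)

    # left-join so samples with no surviving reads keep NA tracking columns
    merged = trim.join(track, how="left").rename_axis("sample")
    merged.to_csv("AMA_track_reads_through_dada2.csv")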

censor_haplotypes censors falsely detected haplotypes. Censoring criteria are applied in this order (a sketch of the first two steps follows the list):

  1. Haplotypes that occur in fewer than 250 of the sample’s reads are removed. You can change this criterion in the config file ("read_depth").

  2. Haplotypes that occur in less than 3% of the sample’s reads are removed. You can change this criterion in the config file ("proportion").

  3. Haplotypes that are a different length than the majority of haplotypes (300 nucleotides for pfama1, 288 nucleotides for pfcsp) are removed. You can change the length in the config file ("haplotype_length").

  4. For haplotype pairs that differ by 1 SNP, occur in the same sample, and have a greater than 8-fold read depth difference between them within that sample, the haplotype with the lower read depth is removed from that sample. You can change the 8 to a different read depth ratio in the config file ("read_depth_ratio").

  5. If a haplotype is defined by a single variant position that is only variable within that haplotype, then it is removed.
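
The first two criteria amount to thresholding the per-sample haplotype read counts. A hedged pandas sketch of just those two steps, applied to a table shaped like the precensored table shown later (rows are samples, columns are haplotypes), might look like the following; the pipeline's own censoring script also handles steps 3-5 and may be implemented differently:

    # Hedged sketch of censoring steps 1 and 2 (read depth and within-sample proportion);
    # the file name and CSV layout are assumptions based on this page's examples
    import pandas as pd

    haps = pd.read_csv("AMA_haplotype_table_precensored.csv", index_col="MiSeq.ID")

    read_depth = 250   # config "read_depth"
    proportion = 0.03  # config "proportion"

    # Step 1: zero out haplotypes supported by fewer than `read_depth` reads in a sample
    haps = haps.where(haps >= read_depth, 0)

    # Step 2: zero out haplotypes making up less than `proportion` of a sample's reads
    frac = haps.div(haps.sum(axis=1), axis=0).fillna(0)
    haps = haps.where(frac >= proportion, 0)

    # Drop haplotype columns that no longer occur in any sample
    haps = haps.loc[:, haps.sum(axis=0) > 0]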

get_read_summaries runs a bash script that pulls statistics from the summary files produced during the trim_and_filter rule for later manipulation in R.

create_summaries outputs csv files that summarize the trim and read counts.

Quick Start

  1. Clone or download this repo.

    git clone https://github.com/kathiehuang/haplotype_calling_pipeline.git
    

    Alternatively, if you're viewing this on GitHub, you can click the green Use this template button to create your own version of the repo on GitHub, then clone it.

  2. Install the dependencies.

    1. If you don't have conda yet, we recommend installing miniconda .

      If you use a MacBook with an M1 chip, you will need to install Rosetta 2:

      softwareupdate --install-rosetta
      

      Download the x86-64 version of miniconda instead of the M1 (arm64) version.

    2. Next, install mamba , a fast drop-in replacement for conda:

      conda install mamba -n base -c conda-forge
      
    3. Finally, create the environment and activate it:

      mamba env create -f environment.yaml # you only have to do this once
      conda activate haplotype_calling # you have to do this every time
      
    • Alternatively, you can install the dependencies listed in environment.yaml however you like.
  3. Edit the configuration file config.yaml .

    • haplotype_calling_targets : the list of target(s) you want to perform haplotype calling on.

    • variant_calling_targets : the list of target(s) you want to perform variant calling on.

    • pair1 : the path to the folder containing the forward reads.

    • pair2 : the path to the folder containing the reverse reads.

    • refs : the path to the folder containing reference sequences for the polymorphic gene target that will be used to map the raw reads to the appropriate gene targets of interest.

    • forward : the path to the file with the list of forward primers.

    • rev : the path to the file with the list of reverse primers.

    • variant_table : the path to the table that holds positions of interest for each target run through variant calling.

    • root : the path to your desired output directory.

    • truncQ_values : the list of values of truncQ to be used in the filterAndTrim function to find the optimal truncQ value.

    • cutoff : the read-count cutoff; samples with fewer than this number of reads after sampling are removed.

    • seed : the seed of R's random number generator for the purpose of obtaining a reproducible random result.

    • read_depth : the read-depth cutoff; haplotypes that occur in fewer than this number of a sample's reads are removed.

    • proportion : the proportion cutoff; haplotypes that occur in less than this percentage of a sample's reads are removed.

    • read_depth_ratio : the read-depth-ratio cutoff; for haplotype pairs that differ by 1 SNP, occur in the same sample, and have a read depth ratio greater than this value within that sample, the haplotype with the lower read depth is removed.

    You can leave these options as-is if you'd like to first make sure the workflow runs without error on your machine before using your own dataset and custom parameters.

  4. Modify your adapter files to notify CutAdapt that you want to use linked adapters.

    A linked adapter combines a 5’ and a 3’ adapter, so we use this if a sequence is surrounded by a 5’ and a 3’ adapter and we want to remove both adapters. Linked adapters aren't required for this pipeline to run, but we found that for our data, using linked adapters produced better results.

    Linked adapters are specified as two sequences separated by ... (three dots). For the forward-primer adapter, the second sequence is the reverse complement of the reverse primer; for the reverse-primer adapter, it is the reverse complement of the forward primer. We use -g instead of -a (5' rather than 3' adapters) when calling CutAdapt, which causes both adapters to be required, even if they are not anchored. However, we want the non-anchored adapters to be optional. You can mark each adapter explicitly as required or optional using the search parameters required and optional , so we append ;optional to the second sequence after the three dots. This is what our forwardPrimers.fasta file looks like for CSP:

    >Pfcsp-f
    TTAAGGAACAAGAAGGATAATACCA...CATTTCGGTTTGGGTCATTT;optional
    

    And our reversePrimers.fasta file looks like this:

    >Pfcsp-r_rc
    AAATGACCCAAACCGAAATG...TGGTATTATCCTTCTTGTTCCTTAA;optional
    

    If you choose to forgo using linked adapters, this is what your adapters file would look like:

    >Pfcsp-f
    TTAAGGAACAAGAAGGATAATACCA
    
    >Pfcsp-r_rc
    AAATGACCCAAACCGAAATG
    

    For more information about the linked adapters parameter, visit the CutAdapt documentation.

  5. Do a dry run to make sure the snakemake workflow is valid.

    snakemake -n
    
  6. Run the workflow.

    Run it locally with:

    snakemake --cores {NCORES} --stats output/stats
    

    Here NCORES is the number of cores; without parallelization, use snakemake --cores 1. You can run snakemake -n to see how many jobs the workflow will schedule given the targets and quality scores in your config file. The --stats option writes a JSON file with statistics about your run, such as runtime.

    To run the workflow on an HPC with Slurm:

    1. Edit your email address (YOUR_EMAIL_HERE) in:

    2. Edit the number of cores you want to use for this pipeline run (it currently defaults to 12 cores).

    3. Submit the snakemake workflow with:

      sbatch scripts/submit_slurm.sh
      

      Slurm output files will be written to logs/ . You will receive an email when the job is finished.

Usage Examples

Below is what the output files looked like after running this pipeline on a small set of samples.

call_fastqc should produce an output directory with a fastqc_out folder that contains FastQC reports for each input fastq file. This allows you to visualize and assess quality collectively across all reads within a sample. To open the .zip files, download and extract the files. A copy of the config.yaml file will also be added to the output folder so that the parameters for each run can be saved.

Example of FastQC output files for a single sample: BF1_S1_L001_R1_001_fastqc.html , BF1_S1_L001_R1_001_fastqc.zip , BF1_S1_L001_R2_001_fastqc.html , BF1_S1_L001_R2_001_fastqc.zip

call_trimmomatic should add a trimmed_reads folder within the output directory containing the results from the Trimmomatic program, which filters poor quality reads and trims poor quality bases from your samples and produces statistical summaries.

The trimmed_reads folder contains four folders labeled 1 , 2 , singleton , and summaries . As an example, the trim/1 folder contains a BF1.1.fastq.gz file, the trim/2 folder contains a BF1.2.fastq.gz file, the singleton folder contains BF1.1_unpaired.fq.gz and BF1.2_unpaired.fq.gz files, and the summaries folder contains a BF1.summary file.

generate_multiqc_report should create a file multiqc_report.html in the output directory. If you download this file to your local computer, you can open it in a browser to see the aggregated results as an HTML report.

multiqc

Variant Calling

variant_calling should create a variant_output directory, as well as a directory within it for each target, with each one containing four folders: bam , sam , sort_bam , and vcf .

analyze_vcf should temporarily create a file called DR_mutations.csv in the directory for each target. After running the pipeline on a small set of samples, the PFCRT table looked like this:

FIELD1 Target Sample POS REF ALT ALT_DEPTH DEPTH ALT_FREQ
0 PFCRT BF252 404 A . 0 11 0.0
1 PFCRT BF252 454 A . 0 11 0.0
2 PFCRT BF252 466 C . 0 12 0.0
3 PFCRT BF252 610 C . 0 7 0.0

combine_vcf_analysis should create a file called dr_depths_freqs.csv in the variant_output directory. After running the pipeline on a small set of samples with targets PFCRT and PFMDR1, our depths frequency table looked like this:

Target Sample POS BASE TOTAL_DEPTH DEPTH FREQ
PFMDR1 BF246 57 T 111 34 0.3063063063063063
PFMDR1 BF259 57 T 115 64 0.5565217391304348
PFMDR1 BF260 57 T 89 64 0.7191011235955056
PFMDR1 BF261 57 T 139 67 0.48201438848920863
PFMDR1 BF262 57 T 125 96 0.768
PFMDR1 BF263 57 T 137 127 0.927007299270073
PFMDR1 BF246 352 T 116 116 1
PFMDR1 BF247 352 T 68 68 1
PFMDR1 BF248 352 T 119 119 1
PFMDR1 BF249 352 T 88 88 1
PFMDR1 BF251 352 T 99 99 1
PFMDR1 BF252 352 T 84 67 0.7976190476190477
PFMDR1 BF254 352 T 80 80 1
PFMDR1 BF255 352 T 108 108 1
PFMDR1 BF257 352 T 79 78 0.9873417721518988
PFMDR1 BF259 352 T 118 49 0.4152542372881356
PFMDR1 BF261 352 T 143 43 0.3006993006993007
PFMDR1 BF246 57 A 111 77 0.6936936936936937
PFMDR1 BF259 57 A 115 51 0.4434782608695652
PFMDR1 BF260 57 A 89 25 0.2808988764044944
PFMDR1 BF261 57 A 139 72 0.5179856115107914
PFMDR1 BF262 57 A 125 29 0.23199999999999998
PFMDR1 BF263 57 A 137 10 0.07299270072992703
PFMDR1 BF246 352 A 116 0 0
PFMDR1 BF247 352 A 68 0 0
PFMDR1 BF248 352 A 119 0 0
PFMDR1 BF249 352 A 88 0 0
PFMDR1 BF251 352 A 99 0 0
PFMDR1 BF252 352 A 84 17 0.20238095238095233
PFMDR1 BF254 352 A 80 0 0
PFMDR1 BF255 352 A 108 0 0
PFMDR1 BF257 352 A 79 1 0.012658227848101222
PFMDR1 BF259 352 A 118 69 0.5847457627118644
PFMDR1 BF261 352 A 143 100 0.6993006993006994
PFMDR1 BF247 57 A 67 67 1
PFMDR1 BF248 57 A 118 118 1
PFMDR1 BF249 57 A 88 88 1
PFMDR1 BF250 57 A 81 81 1
PFMDR1 BF251 57 A 95 95 1
PFMDR1 BF252 57 A 85 85 1
PFMDR1 BF253 57 A 103 103 1
PFMDR1 BF254 57 A 79 79 1
PFMDR1 BF255 57 A 105 105 1
PFMDR1 BF256 57 A 121 121 1
PFMDR1 BF257 57 A 79 79 1
PFMDR1 BF258 57 A 94 94 1
PFMDR1 BF250 352 A 82 82 1
PFMDR1 BF253 352 A 102 102 1
PFMDR1 BF256 352 A 121 121 1
PFMDR1 BF258 352 A 95 95 1
PFMDR1 BF260 352 A 96 96 1
PFMDR1 BF262 352 A 126 126 1
PFMDR1 BF263 352 A 142 142 1
PFCRT BF252 404 A 11 11 1
PFCRT BF252 454 A 11 11 1
PFCRT BF252 466 C 12 12 1
PFCRT BF252 610 C 7 7 1

Haplotype Calling

get_marker_lengths should add a csv file containing the names of each marker and their corresponding lengths called marker_lengths.csv to the output directory.

synchronize_reads should create a haplotype_output directory, as well as a directory within it for each target. For each target, the rule creates a bbsplit_out folder containing three folders: mapped_reads , ref , and map_info . mapped_reads contains the cleaned and mapped fastq files, ref contains information about the indexes produced and referenced by the BBSplit program, and map_info contains summaries of the BBSplit call on each sample.

As an example, for AMA, the mapped_reads folder contains BF1_AMA_1.fastq.gz and BF1_AMA_2.fastq.gz files, the ref folder contains a genome/1 folder that has merged_ref_64917.fa.gz , namelist.txt , and reflist.txt files, and map_info contains a BF1.txt file.

trim_and_filter should produce a trim_filter_out folder in each target's directory that contains a summary for read trimming and filtering for each q value (which can be changed in the config file), {target}_{q_values}_trim_and_filter_table . It also produces a {target}/read_count file that lists the read counts for each q value.

After running the pipeline on a small set of samples, the AMA trim and filter table for a q value of 2 looked like this:

reads.in reads.out
BF1_AMA_1.fastq.gz 13072 8297
BF10_AMA_1.fastq.gz 14875 9243
BF2_AMA_1.fastq.gz 15269 10247
BF3_AMA_1.fastq.gz 10116 5075
BF4_AMA_1.fastq.gz 13672 8852
BF5_AMA_1.fastq.gz 12650 7720
BF6_AMA_1.fastq.gz 14591 9754
BF7_AMA_1.fastq.gz 12760 8208
BF8_AMA_1.fastq.gz 11474 7382
BF9_AMA_1.fastq.gz 12432 6318

And the AMA read count table looked like this:

25 76565
15 103295
20 103224
2 81096
5 81096
10 116036

call_haplotypes should add temporary files {target}_{q_values}_haplotypes.rds to the haplotype_output folder as well as a file {target}_{q_values}_track_reads_through_pipeline.csv to the trim_filter_out folder.

  • {target}_{q_values}_haplotypes.rds : R file that stores the haplotype results data set for each q value for further manipulation in censor_haplotypes .

  • {target}_{q_values}_track_reads_through_pipeline.csv : tracks the reads, looking at the number of reads that made it through each step of the pipeline for each q value.

With our small sample, the AMA track reads through pipeline table for a q value of 2 looked like this:

merged tabled nonchim
BF1 8296 8296 8296
BF10 9241 9241 9241
BF2 10239 10239 10239
BF3 5056 5056 5056
BF4 8850 8850 8850
BF5 7614 7614 7210
BF6 9752 9752 9752
BF7 8125 8125 7795
BF8 7378 7378 7378
BF9 6317 6317 6317

optimize_reads should produce an optimize_reads_out folder with 3 files: {target}_final_q_value , {target}_max_read_count , and {target}_final_track_reads_through_pipeline , and remove the temporary files from the last step.

  • {target}_final_q_value : contains the quality score that produced the highest number of read counts.

  • {target}_max_read_count : contains the value of the highest number of read counts—the read count associated with the optimal quality score.

  • {target}_final_track_reads_through_pipeline : contains the reads table that was created with the optimal quality score.

With our small sample, the AMA final q value was: 10

The AMA max read count was: 114561

And the final AMA track_reads_through_pipeline table looked like this:

merged tabled nonchim
BF1 11768 11768 11768
BF10 13377 13377 13377
BF2 13681 13681 13681
BF3 8450 8450 8450
BF4 12211 12211 12211
BF5 10584 10584 10360
BF6 13078 13078 13078
BF7 11049 11049 10773
BF8 10314 10314 10314
BF9 10549 10549 10549

merge_read_tracking should add a file {target}_track_reads_through_dada2.csv to the optimize reads directory. This file combines the trim and filter table from the trim_and_filter rule that corresponds to the optimal q value and the {target}_final_track_reads_through_pipeline table that was produced in the last step.

With our small sample, the AMA track_reads_through_dada2 table looked like this:

sample reads.in reads.out merged tabled nonchim
1 BF1 13072 11777 11768 11768 11768
2 BF10 14875 13384 13377 13377 13377
3 BF2 15269 13718 13681 13681 13681
4 BF246 0 0 NA NA NA
5 BF247 0 0 NA NA NA
6 BF248 0 0 NA NA NA
7 BF249 0 0 NA NA NA
8 BF250 0 0 NA NA NA
9 BF251 0 0 NA NA NA
10 BF252 0 0 NA NA NA
11 BF253 0 0 NA NA NA
12 BF254 0 0 NA NA NA
13 BF255 0 0 NA NA NA
14 BF256 0 0 NA NA NA
15 BF257 0 0 NA NA NA
16 BF258 0 0 NA NA NA
17 BF259 0 0 NA NA NA
18 BF260 0 0 NA NA NA
19 BF261 0 0 NA NA NA
20 BF262 0 0 NA NA NA
21 BF263 0 0 NA NA NA
22 BF3 10116 8491 8450 8450 8450
23 BF4 13672 12226 12211 12211 12211
24 BF5 12650 11119 10584 10584 10360
25 BF6 14591 13085 13078 13078 13078
26 BF7 12760 11350 11049 11049 10773
27 BF8 11474 10326 10314 10314 10314
28 BF9 12432 10560 10549 10549 10549

censor_haplotypes should produce a haplotypes folder for each target with two files {target}_haplotype_table_precensored.csv and {target}_haplotype_table_censored_final_version.csv . It should also produce a sequences folder with four files: {target}_snps_between_haps_within_samples.fasta , {target}_unique_seqs.fasta , {target}_aligned_seqs.fasta , and {target}_unique_seqs_final_censored.fasta .

  • {target}_haplotype_table_precensored.csv : outputs the haplotype data set prior to beginning the censoring process (essentially {target}_haplotypes.rds in a formatted csv file).

  • {target}_snps_between_haps_within_samples.fasta : fasta file of the haplotypes after the first three steps of the censoring process are completed. This is used to tally up the number of SNPs between all haplotype pairings.

  • {target}_unique_seqs.fasta : fasta file of the haplotypes after the fourth step of the censoring process is completed.

  • {target}_aligned_seqs.fasta : fasta file of the sequences after alignment.

  • {target}_unique_seqs_final_censored.fasta : fasta file of the haplotype results after all five steps of the censoring process are completed.

  • {target}_haplotype_table_censored_final_version.csv : outputs the final censored haplotype data set in a formatted table.

With our small sample, the AMA haplotype_table_precensored file looked like this:

H1 H2 H3 H4 H5 H6 H7 MiSeq.ID
0 0 0 0 11768 0 0 BF1
0 0 0 13377 0 0 0 BF10
13665 0 0 16 0 0 0 BF2
0 8450 0 0 0 0 0 BF3
18 12186 7 0 0 0 0 BF4
268 0 9862 0 0 230 0 BF5
13078 0 0 0 0 0 0 BF6
0 603 10170 0 0 0 0 BF7
0 0 0 0 0 10314 0 BF8
51 0 0 0 0 0 10498 BF9

The AMA snps_between_haps_within_samples file looked like this:

>Seq1
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAATCAATATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCATATGTCACCAATGACATTAGATGAAATGAGACATTTTTATAAAGATAATAAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGATTCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq2
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq3
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAGATGATATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATCAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACGAAGATAAAAAGTGTCATATATTATATATTG
>Seq4
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAAACAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAGATCATATGAGAGATTTTTATAAAAAAAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq5
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGAAGATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAAATAAAAAGTGTCATATATTATATATTG
>Seq6
GTAAAGGTATAATTATTGAGAATTCAAAAACTACTTTTTTAACACCGGTAGCTACGGAAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCCTATGTCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACAATGATAATAAGTGTCATATATTATATATTG
>Seq7
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAAGGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG

The AMA unique_seqs file looked like this:

>Seq1
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAATCAATATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCATATGTCACCAATGACATTAGATGAAATGAGACATTTTTATAAAGATAATAAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGATTCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq2
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq3
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAGATGATATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATCAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACGAAGATAAAAAGTGTCATATATTATATATTG
>Seq4
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAAACAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAGATCATATGAGAGATTTTTATAAAAAAAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq5
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGAAGATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAAATAAAAAGTGTCATATATTATATATTG
>Seq6
GTAAAGGTATAATTATTGAGAATTCAAAAACTACTTTTTTAACACCGGTAGCTACGGAAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCCTATGTCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAAATTATAAATATCCAGCTGTTTATGATTACAATGATAATAAGTGTCATATATTATATATTG
>Seq7
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAAGGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG

The AMA aligned_seqs file looked like this:

>Seq5
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGA
AGATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATGACAAAAATAAAAAGTGTCATATATTATATATTG
>Seq6
GTAAAGGTATAATTATTGAGAATTCAAAAACTACTTTTTTAACACCGGTAGCTACGGAAAATCAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAAATCCTCCTATGTCACCAATGACATTAAATGGTATGAGAGATTTATATAAAAATAATGA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATAAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATTACAATGATAATAAGTGTCATATATTATATATTG
>Seq4
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAAACAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAGATCATATGAGAGATTTTTATAAAAAAAATGA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATTACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq1
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAACACCGGTAGCTACGGGAAATCAATATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAGAACCTCATATGTCACCAATGACATTAGATGAAATGAGACATTTTTATAAAGATAATAA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGATTCCAGATAATGATAAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq2
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG
>Seq3
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAGATGATATGAGAGATTTTTATAAAAATAATGA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATCAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATTACGAAGATAAAAAGTGTCATATATTATATATTG
>Seq7
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGA
GGTTTTGCTTTTCCTCCAACAGAACCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGA
ATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAAGGATGAAAATTCAA
ATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG

The AMA unique_seqs_final_censored file looked like this:

>Seq1
GTAAAGGTATAATTATTGAGAATTCAAATACTACTTTTTTAAAACCGGTAGCTACGGGAAATCAAGATTTAAAAGATGGAGGTTTTGCTTTTCCTCCAACAAATCCTCTTATATCACCAATGACATTAAATGGTATGAGAGATTTTTATAAAAATAATGAATATGTAAAAAATTTAGATGAATTGACTTTATGTTCAAGACATGCAGGAAATATGAATCCAGATAATGATGAAAATTCAAATTATAAATATCCAGCTGTTTATGATGACAAAGATAAAAAGTGTCATATATTATATATTG

And the AMA haplotype_table_censored_final table looked like this:

H2 MiSeq.ID
8450 BF3
12186 BF4
603 BF7

get_read_summaries should produce two files, pre-filt_fastq_read_counts and filt_fastq_read_counts , in the haplotype_output folder, and two files, trim_summaries and trim_summary_names , in the output/trimmed_reads folder.

  • trim_summaries consolidates all the output/trimmed_reads/summaries data into one file.

  • trim_summary_names contains all the file names in output/trimmed_reads/summaries .

  • pre-filt_fastq_read_counts contains the read counts prior to filtering.

  • filt_fastq_read_counts contains the read counts after filtering.

With our small sample, one of the consolidated trim summaries looked like this:

Input Read Pairs: 33577
Both Surviving Reads: 32298
Both Surviving Read Percent: 96.19
Forward Only Surviving Reads: 454
Forward Only Surviving Read Percent: 1.35
Reverse Only Surviving Reads: 659
Reverse Only Surviving Read Percent: 1.96
Dropped Reads: 166
Dropped Read Percent: 0.49

The trim summary names file looked like this:

BF10.summary
BF1.summary
BF246.summary
BF247.summary
BF248.summary
BF249.summary
BF250.summary
BF251.summary
BF252.summary
BF253.summary
BF254.summary
BF255.summary
BF256.summary
BF257.summary
BF258.summary
BF259.summary
BF260.summary
BF261.summary
BF262.summary
BF263.summary
BF2.summary
BF3.summary
BF4.summary
BF5.summary
BF6.summary
BF7.summary
BF8.summary
BF9.summary

Some of the pre-filtered read counts looked like this:

AMA/out/fastq/all_samples/BF10_AMA_1.fastq.gz	14875
AMA/out/fastq/all_samples/BF10_AMA_2.fastq.gz	14875
AMA/out/fastq/all_samples/BF1_AMA_1.fastq.gz	13072
AMA/out/fastq/all_samples/BF1_AMA_2.fastq.gz	13072
AMA/out/fastq/all_samples/BF2_AMA_1.fastq.gz	15269
AMA/out/fastq/all_samples/BF2_AMA_2.fastq.gz	15269
AMA/out/fastq/all_samples/BF3_AMA_1.fastq.gz	10116
AMA/out/fastq/all_samples/BF3_AMA_2.fastq.gz	10116

And some of the filtered read counts looked like this:

AMA/out/fastq/all_samples/final_filtered/BF10_final_F_filt.fastq.gz	13384
AMA/out/fastq/all_samples/final_filtered/BF10_final_R_filt.fastq.gz	13384
AMA/out/fastq/all_samples/final_filtered/BF1_final_F_filt.fastq.gz	11777
AMA/out/fastq/all_samples/final_filtered/BF1_final_R_filt.fastq.gz	11777
AMA/out/fastq/all_samples/final_filtered/BF2_final_F_filt.fastq.gz	13718
AMA/out/fastq/all_samples/final_filtered/BF2_final_R_filt.fastq.gz	13718
AMA/out/fastq/all_samples/final_filtered/BF3_final_F_filt.fastq.gz	8491
AMA/out/fastq/all_samples/final_filtered/BF3_final_R_filt.fastq.gz	8491

create_summaries should produce 2 files in the haplotype_output folder: long_summary and wide_summary . long_summary is a csv file that summarizes the read counts with columns read_type, read_ct, sample, and target, which makes later manipulation of the dataframe easy. wide_summary is a csv file that summarizes the read counts with one row per sample, which makes it easy to see how each sample fared throughout the workflow.
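
The wide table is essentially the long table pivoted so that each target/read-type pair becomes a column; a hypothetical pandas sketch of that reshaping (file names and column handling assumed) is:

    # Hypothetical reshaping of the long summary into the wide per-sample summary
    import pandas as pd

    long = pd.read_csv("haplotype_output/long_summary")
    # column names like "ama_filt" or "NA_Input Read Pairs" come from target + read_type
    long["column"] = long["target"].fillna("NA").astype(str) + "_" + long["read_type"]
    wide = long.pivot_table(index="sample", columns="column", values="read_ct", aggfunc="first")
    wide.to_csv("haplotype_output/wide_summary")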

With our small samples of AMA and CSP, the long summary looked like this:

read_type read_ct sample target
Input Read Pairs 33577 BF10 NA
Both Surviving Reads 32298 BF10 NA
Forward Only Surviving Reads 454 BF10 NA
Reverse Only Surviving Reads 659 BF10 NA
Dropped Reads 166 BF10 NA
Input Read Pairs 26485 BF1 NA
Both Surviving Reads 25565 BF1 NA
Forward Only Surviving Reads 359 BF1 NA
Reverse Only Surviving Reads 461 BF1 NA
Dropped Reads 100 BF1 NA
Input Read Pairs 28008 BF2 NA
Both Surviving Reads 26932 BF2 NA
Forward Only Surviving Reads 412 BF2 NA
Reverse Only Surviving Reads 516 BF2 NA
Dropped Reads 148 BF2 NA
Input Read Pairs 25840 BF3 NA
Both Surviving Reads 24730 BF3 NA
Forward Only Surviving Reads 443 BF3 NA
Reverse Only Surviving Reads 504 BF3 NA
Dropped Reads 163 BF3 NA
Input Read Pairs 29520 BF4 NA
Both Surviving Reads 28500 BF4 NA
Forward Only Surviving Reads 392 BF4 NA
Reverse Only Surviving Reads 546 BF4 NA
Dropped Reads 82 BF4 NA
Input Read Pairs 30472 BF5 NA
Both Surviving Reads 29347 BF5 NA
Forward Only Surviving Reads 455 BF5 NA
Reverse Only Surviving Reads 533 BF5 NA
Dropped Reads 137 BF5 NA
Input Read Pairs 32025 BF6 NA
Both Surviving Reads 30863 BF6 NA
Forward Only Surviving Reads 444 BF6 NA
Reverse Only Surviving Reads 591 BF6 NA
Dropped Reads 127 BF6 NA
Input Read Pairs 29455 BF7 NA
Both Surviving Reads 28429 BF7 NA
Forward Only Surviving Reads 391 BF7 NA
Reverse Only Surviving Reads 524 BF7 NA
Dropped Reads 111 BF7 NA
Input Read Pairs 27293 BF8 NA
Both Surviving Reads 26279 BF8 NA
Forward Only Surviving Reads 303 BF8 NA
Reverse Only Surviving Reads 609 BF8 NA
Dropped Reads 102 BF8 NA
Input Read Pairs 26675 BF9 NA
Both Surviving Reads 25541 BF9 NA
Forward Only Surviving Reads 477 BF9 NA
Reverse Only Surviving Reads 514 BF9 NA
Dropped Reads 143 BF9 NA
filt 11777 BF1 ama
prefilt 13072 BF1 ama
filt 11352 BF1 csp
prefilt 12491 BF1 csp
filt 13384 BF10 ama
prefilt 14875 BF10 ama
filt 15886 BF10 csp
prefilt 17422 BF10 csp
filt 13718 BF2 ama
prefilt 15269 BF2 ama
filt 10659 BF2 csp
prefilt 11659 BF2 csp
filt 8491 BF3 ama
prefilt 10116 BF3 ama
filt 13099 BF3 csp
prefilt 14611 BF3 csp
filt 12226 BF4 ama
prefilt 13672 BF4 ama
filt 13584 BF4 csp
prefilt 14828 BF4 csp
filt 11119 BF5 ama
prefilt 12650 BF5 ama
filt 15006 BF5 csp
prefilt 16695 BF5 csp
filt 13085 BF6 ama
prefilt 14591 BF6 ama
filt 14725 BF6 csp
prefilt 16269 BF6 csp
filt 11350 BF7 ama
prefilt 12760 BF7 ama
filt 14179 BF7 csp
prefilt 15668 BF7 csp
filt 10326 BF8 ama
prefilt 11474 BF8 ama
filt 13359 BF8 csp
prefilt 14801 BF8 csp
filt 10560 BF9 ama
prefilt 12432 BF9 ama
filt 11500 BF9 csp
prefilt 13108 BF9 csp

And the wide summary looked like this:

sample NA_Input Read Pairs NA_Both Surviving Reads NA_Forward Only Surviving Reads NA_Reverse Only Surviving Reads NA_Dropped Reads ama_filt ama_prefilt csp_filt csp_prefilt
BF10 33577 32298 454 659 166 13384 14875 15886 17422
BF1 26485 25565 359 461 100 11777 13072 11352 12491
BF2 28008 26932 412 516 148 13718 15269 10659 11659
BF3 25840 24730 443 504 163 8491 10116 13099 14611
BF4 29520 28500 392 546 82 12226 13672 13584 14828
BF5 30472 29347 455 533 137 11119 12650 15006 16695
BF6 32025 30863 444 591 127 13085 14591 14725 16269
BF7 29455 28429 391 524 111 11350 12760 14179 15668
BF8 27293 26279 303 609 102 10326 11474 13359 14801
BF9 26675 25541 477 514 143 10560 12432 11500 13108


Code Snippets

The get_read_summaries rule runs the following bash script (scripts/step13_get_read_summaries.sh):

echo "getting trim summaries"

ls ${snakemake_params[pre_trim_summaries]} > ${snakemake_params[trim_summary_names]}
cat ${snakemake_params[all_pre_trim_summaries]} > ${snakemake_params[trim_summaries]}

echo "getting pre-filt"

for i in $(ls ${snakemake_params[all_fastq_files]}); do paste <(echo $i) <(zcat $i | grep -c "^+$"); done > ${snakemake_output[pre_filt_fastq_counts]}

echo "getting filt"

for i in $(ls ${snakemake_params[all_filtered_files]}); do paste <(echo $i) <(zcat $i | grep -c "^+$"); done > ${snakemake_output[filt_fastq_counts]}
