Fastq to Joint-Called Cohort VCF Pipeline with GATK4 and Cromwell on SLURM


A SLURM HPC Cromwell implementation of the GATK4 germline variant calling pipeline.

See the GATK website for more information on this toolset.

Assumptions

  • Using hg38 human reference genome build
  • Running on an HPC system with SLURM scheduling. This repo was specifically tested on the Pawsey Zeus machine, primarily running in the /scratch partition.
  • Starting from short-read Illumina paired-end fastq files as input

Dependencies

The following versions have been tested and work. GATK and Cromwell are updated regularly, so consider whether you would prefer to use newer versions of these tools.

  • BWA/0.7.15
  • GATK v4.0.6.0
  • SAMtools/1.5
  • picard/2.9
  • Python/2.7
  • Cromwell v61
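
On a module-based HPC system such as Pawsey Zeus, the compiled tools are typically loaded as environment modules before the pipeline is run. The module names below are assumptions based on the version list above, so check what is actually available with module avail on your machine; GATK and Cromwell are run from their downloaded jar/zip releases as described in the quick start guide.

module load BWA/0.7.15
module load SAMtools/1.5
module load picard/2.9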

Quick start guide

Installing and preparing environment for GATK4 with Cromwell

  1. Clone repository

git clone https://github.com/SarahBeecroft/slurmCromwellGATK4.git
cd slurmCromwellGATK4
chmod +x *.sh

  2. Install Miniconda if you haven't already. It is best placed in your /group directory, to avoid filling your small /home directory or having the files purged if placed in the /scratch directory.
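
A minimal sketch of installing Miniconda into a /group location (the project and user directory names are placeholders):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /group/projectID/userID/miniconda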

  3. Create a Conda environment using the supplied conda environment file

conda env create --file gatk4_pipeline.yml

  4. Download the necessary .jar files
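
For example, Cromwell and Picard ship as standalone jars and GATK4 as a zip archive containing the gatk wrapper script. The release URLs below match the tested versions but are assumptions that should be checked against each project's GitHub releases page:

wget https://github.com/broadinstitute/cromwell/releases/download/61/cromwell-61.jar
wget https://github.com/broadinstitute/picard/releases/download/2.9.0/picard.jar
wget https://github.com/broadinstitute/gatk/releases/download/4.0.6.0/gatk-4.0.6.0.zip
unzip gatk-4.0.6.0.zip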

  5. If you do not have the resource bundle files already, they need to be downloaded (in future they will be cached on Pawsey systems). The bundle data should be downloaded from the Google Cloud bucket, not from the FTP site, which is missing various files; refer to this handy blog post on how to download the resource files using the Google Cloud SDK. There is a SLURM script (download_bundle.slurm) that can be used to download all hg38 files from the Google Cloud bucket. The files were downloaded to /scratch/pawsey0001/sbeecroft/hg38/v0 and need to be moved elsewhere before the data is purged after 30 days. Note that Homo_sapiens_assembly38.dbsnp138.vcf.gz was taken from the FTP bundle, as this file could not be downloaded using the Conda version of the Google Cloud SDK.
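
The core of that download is a recursive gsutil copy; a minimal sketch is shown below, assuming the Broad's public hg38 resource bucket (verify the bucket path and substitute your own destination directory):

gsutil -m cp -r gs://gcp-public-data--broad-references/hg38/v0/ /scratch/projectID/userID/hg38/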

  6. Set up the config files. Files that you need to edit with the correct paths to your data/jar files or other specific configurations are:

    • Multisample_Fastq_to_Gvcf_GATK4_inputs_hg38.json
    • Multisample_jointgt_GATK4_inputs_hg38.json
      • Both JSON files will need the correct paths to your reference file locations and to the file specifying your inputs (i.e. samples.txt or gvcfs.txt)
    • samples.txt
    • gvcfs.txt
      • These are the sample input files (tab-separated).
      • The format for samples.txt is: sampleID, sampleID_readgroup, path_to_fastq_R1_file, path_to_fastq_R2_file
      • The format for gvcfs.txt is: sample ID, gvcf file, gvcf .tbi index file
      • Examples are included in this repo; an illustrative example is also shown below, after this list.
      • NOTE: Having tabs, not spaces, is vital for parsing the file. Visual Studio Code tends to introduce spaces, so if you are having issues, check the file with another text editor such as Sublime Text.
    • launch_cromwell.sh
    • launch_jointgt.sh
      • These are the scripts which launch the pipeline.
      • launch_cromwell.sh launches the fastq to gvcf stage
      • launch_jointgt.sh launches the gvcf joint genotyping to cohort vcf step. This is performed once you have run all samples through the fastq to gvcf stage.
      • Check that the paths and parameters make sense for your machine
    • slurm.conf
      • The main options here relate to the job scheduler. If you are running on Zeus at Pawsey, you should not need to alter these parameters.
    • cromwell.options
      • cromwell.options requires editing to provide the directory where you would like the final workflow outputs to be written
    • Multisample_Fastq_to_Gvcf_GATK4.wdl
    • ruddle_fastq_to_gvcf_single_sample_gatk4.wdl
      • The paths to your jar files will need to be updated
      • The path to your conda activate binary will need to be updated (e.g. /group/projectID/userID/miniconda/bin/activate)
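
For illustration only, a samples.txt for two hypothetical samples might look like the following (the columns are shown separated by spaces for readability here, but the real file must use single literal tabs):

SAMPLE1  SAMPLE1_LANE1  /path/to/SAMPLE1_R1.fastq.gz  /path/to/SAMPLE1_R2.fastq.gz
SAMPLE2  SAMPLE2_LANE1  /path/to/SAMPLE2_R1.fastq.gz  /path/to/SAMPLE2_R2.fastq.gz

and the corresponding gvcfs.txt for the joint genotyping stage:

SAMPLE1  /path/to/SAMPLE1.g.vcf.gz  /path/to/SAMPLE1.g.vcf.gz.tbi
SAMPLE2  /path/to/SAMPLE2.g.vcf.gz  /path/to/SAMPLE2.g.vcf.gz.tbi
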
  7. Launch the job using sbatch launch_cromwell.sh. When that has completed successfully, you can launch the second stage of the pipeline (joint calling) with sbatch launch_jointgt.sh.
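
For reference, a Cromwell launch script for this kind of setup typically wraps a single java invocation in a SLURM batch script. The sketch below is illustrative only (the SLURM header, jar path and resource requests are placeholders), so check the actual launch_cromwell.sh shipped in the repo:

#!/bin/bash
#SBATCH --job-name=cromwell_fastq_to_gvcf
#SBATCH --time=24:00:00
#SBATCH --ntasks=1

# Run the fastq-to-gvcf workflow; Cromwell submits the individual tasks to SLURM
# according to the backend configuration in slurm.conf.
java -Dconfig.file=slurm.conf -jar /path/to/cromwell-61.jar run \
  Multisample_Fastq_to_Gvcf_GATK4.wdl \
  --inputs Multisample_Fastq_to_Gvcf_GATK4_inputs_hg38.json \
  --options cromwell.options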

Code Snippets

call make_uniq_samples_file {
  input:
    inputSamplesFile = inputSamplesFile,
}

call make_fastq_file {
 input:
   inputSamplesFile = inputSamplesFile,
   sample_name = sample
}

call single_wf.Fastq_to_Gvcf_GATK4 {
   input: 

     sample_fastq_file = make_fastq_file.sample_fastq_file,
     sample_name = sample,


     unmapped_bam_suffix = unmapped_bam_suffix,

     ref_name = ref_name,

     ref_fasta = ref_fasta,
     ref_fasta_index = ref_fasta_index,
     ref_dict = ref_dict,
     ref_amb = ref_amb,
     ref_ann = ref_ann,
     ref_bwt = ref_bwt,
     ref_pac = ref_pac,
     ref_sa = ref_sa,
     ref_alt = ref_alt,

     bwa_commandline = bwa_commandline,
     compression_level = compression_level,

     dbSNP_vcf = dbSNP_vcf,
     dbSNP_vcf_index = dbSNP_vcf_index,
     known_indels_sites_VCFs = known_indels_sites_VCFs,
     known_indels_sites_indices = known_indels_sites_indices,

     scattered_calling_intervals_list = scattered_calling_intervals_list
}

call FastqToUbam {
  input:
    sample_name = fastq[0],
    readgroup = readgroup,
    fastq1 = fastq[2],
    fastq2 = fastq[3],
    outpref = sample_name + "." + readgroup
}

    call SamToFastqAndBwaMem {
      input:
        input_bam = FastqToUbam.unmapped_bam,
        bwa_commandline = bwa_commandline,
        output_bam_basename = sample_name + "." + readgroup + ".unmerged",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        ref_amb = ref_amb,
        ref_ann = ref_ann,
        ref_bwt = ref_bwt,
        ref_pac = ref_pac,
        ref_sa = ref_sa,
        ref_alt = ref_alt,
        compression_level = compression_level
    }

call MergeBamAlignment {
  input:
    unmapped_bam = FastqToUbam.unmapped_bam,
    bwa_commandline = bwa_commandline,
    bwa_version = GetBwaVersion.version,
    aligned_bam = SamToFastqAndBwaMem.output_bam,
    output_bam_basename = sample_name + "." + readgroup + ".aligned.unsorted",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    compression_level = compression_level
}

call MarkDuplicates {
  input:
    input_bams = MergeBamAlignment.output_bam,
    output_bam_basename = base_file_name + ".aligned.unsorted.duplicates_marked",
    metrics_filename = base_file_name + ".duplicate_metrics",
    compression_level = compression_level
}

call SortAndFixTags {
  input:
    input_bam = MarkDuplicates.output_bam,
    output_bam_basename = base_file_name + ".aligned.duplicate_marked.sorted",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    compression_level = compression_level
}

call CreateSequenceGroupingTSV {
  input:
    ref_dict = ref_dict,
}

call BaseRecalibrator {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    recalibration_report_filename = base_file_name + ".recal_data.csv",
    sequence_group_interval = subgroup,
    dbSNP_vcf = dbSNP_vcf,
    dbSNP_vcf_index = dbSNP_vcf_index,
    known_indels_sites_VCFs = known_indels_sites_VCFs,
    known_indels_sites_indices = known_indels_sites_indices,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}

call GatherBqsrReports {
  input:
    input_bqsr_reports = BaseRecalibrator.recalibration_report,
    output_report_filename = base_file_name + ".recal_data.csv",
}

call ApplyBQSR {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    output_bam_basename = base_file_name + ".aligned.duplicates_marked.recalibrated",
    recalibration_report = GatherBqsrReports.output_bqsr_report,
    sequence_group_interval = subgroup,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}

call GatherBamFiles {
   input:
     input_bams = ApplyBQSR.recalibrated_bam,
     output_bam_basename = base_file_name,
     compression_level = compression_level  
}

call HaplotypeCaller {
  input:
    input_bam = GatherBamFiles.output_bam,
    input_bam_index = GatherBamFiles.output_bam_index,
    interval_list = interval_file,
    output_filename = base_file_name + ".g.vcf.gz",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index
}

call MergeGVCFs {
  input:
    input_vcfs = HaplotypeCaller.output_vcf,
    input_vcfs_indexes = HaplotypeCaller.output_vcf_index,
    output_filename = base_file_name + ".g.vcf.gz"
}

  command <<<
    set -o pipefail
    set -e

    # set the bash variable needed for the command-line
    bash_ref_fasta=${ref_fasta}

    java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /scratch/pawsey0001/sbeecroft/tool/picard.jar \
      SamToFastq \
      INPUT=${input_bam} \
      FASTQ=/dev/stdout \
      INTERLEAVE=true \
      NON_PF=true \
    | \
    ${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) \
    | \
    samtools view -1 - > ${output_bam_basename}.bam

  >>>

command <<<
  python <<CODE
  with open("${ref_dict}", "r") as ref_dict_file:
      sequence_tuple_list = []
      longest_sequence = 0
      for line in ref_dict_file:
          if line.startswith("@SQ"):
              line_split = line.split("\t")
              # (Sequence_Name, Sequence_Length)
              sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
      longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]
  # We are adding this to the intervals because b37 has contigs named with embedded colons (:) and a bug in 
  # some versions of GATK strips off the last element after a colon, so we add this as a sacrificial element.
  b37_protection_tag = ":1+"
  # initialize the tsv string with the first sequence
  tsv_string = sequence_tuple_list[0][0] + b37_protection_tag
  temp_size = sequence_tuple_list[0][1]
  for sequence_tuple in sequence_tuple_list[1:]:
      if temp_size + sequence_tuple[1] <= longest_sequence:
          temp_size += sequence_tuple[1]
          tsv_string += "\t" + sequence_tuple[0] + b37_protection_tag
      else:
          tsv_string += "\n" + sequence_tuple[0] + b37_protection_tag
          temp_size = sequence_tuple[1]
  # add the unmapped sequences as a separate line to ensure that they are recalibrated as well
  with open("sequence_grouping.txt","w") as tsv_file:
    tsv_file.write(tsv_string)
    tsv_file.close()

  tsv_string += '\n' + "unmapped"

  with open("sequence_grouping_with_unmapped.txt","w") as tsv_file_with_unmapped:
    tsv_file_with_unmapped.write(tsv_string)
    tsv_file_with_unmapped.close()
  CODE
>>>

command <<<
set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m ${java_opt}" \
    HaplotypeCaller \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -L ${interval_list} \
    -O ${output_filename} \
    -contamination ${default=0 contamination} \
    -ERC GVCF
>>>

command <<<
set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m"  \
    MergeVcfs \
    --INPUT ${sep=' --INPUT ' input_vcfs} \
    --OUTPUT ${output_filename}
>>>