Fastq to Joint-Called Cohort VCF Pipeline with GATK4 and Cromwell on SLURM
SLURM HPC Cromwell implementation of GATK4 germline variant calling pipeline
See the GATK website for more information on this toolset.
Assumptions
- Using the hg38 human reference genome build
- Running under HPC/SLURM scheduling. This repo was specifically tested on the Pawsey Zeus machine, primarily running in the `/scratch` partition.
- Starting from short-read Illumina paired-end fastq files as input
Dependencies
The following versions have been tested and are known to work, but GATK and Cromwell are updated regularly, so consider whether you would like to use newer versions of these tools.
- BWA/0.7.15
- GATK v4.0.6.0
- SAMtools/1.5
- picard/2.9
- Python/2.7
- Cromwell v61
Quick start guide
Installing and preparing environment for GATK4 with Cromwell
- Clone the repository:
  ```
  git clone https://github.com/SarahBeecroft/slurmCromwellGATK4.git
  cd slurmCromwellGATK4
  chmod +x *.sh
  ```
- Install Miniconda if you haven't already. This is best placed in your `/group` directory to avoid filling your small `/home` directory, or having it purged if placed in the `/scratch` directory.
- Create the Conda environment using the supplied conda environment file:
  ```
  conda env create --file gatk4_pipeline.yml
  ```
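  To use it, activate the environment before launching the pipeline. A minimal sketch, assuming the environment is named `gatk4_pipeline` (check the `name:` field in `gatk4_pipeline.yml`) and that Miniconda lives in your `/group` directory:
  ```
  # Environment name assumed from the supplied yml file; adjust the path to your Miniconda install
  source /group/projectID/userID/miniconda/bin/activate gatk4_pipeline
  ```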
- Download the necessary .jar files (a hypothetical download sketch follows this list):
  - The Cromwell workflow orchestration engine can be downloaded from https://github.com/broadinstitute/cromwell/releases/
  - GATK can be downloaded from https://github.com/broadinstitute/gatk/releases. Unzip the downloaded file with `unzip`.
  - Picard can be downloaded from https://github.com/broadinstitute/picard/releases/
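  For example, the tested versions could be fetched as below. This is a sketch only: the asset names are assumptions based on the usual GitHub release layout, so verify them on the release pages.
  ```
  # Asset names are assumed from the standard release naming; check the release pages
  wget https://github.com/broadinstitute/cromwell/releases/download/61/cromwell-61.jar
  wget https://github.com/broadinstitute/gatk/releases/download/4.0.6.0/gatk-4.0.6.0.zip
  unzip gatk-4.0.6.0.zip
  wget https://github.com/broadinstitute/picard/releases/download/2.9.0/picard.jar
  ```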
- If you do not have the resource bundle files already, these need to be downloaded (in future they will be cached on Pawsey systems). The bundle data should be downloaded from the Google Cloud bucket and not from the FTP site, which is missing various files. Refer to this handy blog post on how to download the resource files using the Google Cloud SDK. A SLURM script (`download_bundle.slurm`) is provided that downloads all hg38 files from the Google Cloud bucket. The files were downloaded to `/scratch/pawsey0001/sbeecroft/hg38/v0`, which needs to be moved before the data is purged after 30 days. Note that `Homo_sapiens_assembly38.dbsnp138.vcf.gz` was taken from the FTP bundle, as this file could not be downloaded using the Conda version of the Google Cloud SDK.
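  If you want to fetch the bundle manually instead, a minimal `gsutil` sketch (the bucket path is an assumption based on the publicly documented Broad hg38 bundle; point the destination at your own project area):
  ```
  # -m parallelises the transfer; the bundle is large
  gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0 /scratch/projectID/userID/hg38/
  ```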
- Set up the config files. Files that you need to edit with the correct paths to your data/jar files, or other specific configurations, are:
  - `Multisample_Fastq_to_Gvcf_GATK4_inputs_hg38.json`
  - `Multisample_jointgt_GATK4_inputs_hg38.json`
    - Both json files will need the correct paths to your reference file locations and to the file specifying your inputs, i.e. `samples.txt` or `gvcfs.txt`:
      - These are the sample input files (tab separated).
      - The format for `samples.txt` is: sampleID, sampleID_readgroup, path_to_fastq_R1_file, path_to_fastq_R2_file
      - The format for `gvcfs.txt` is: sampleID, gvcf, gvcf .tbi index file
      - Examples are included in this repo; a hypothetical sketch is also shown below.
      - NOTE: Having tabs, not spaces, is vital for parsing the file. Visual Studio Code tends to introduce spaces, so if you are having issues, check the file with another text editor such as Sublime Text.
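      For illustration, a hypothetical `samples.txt` line (sample names and paths are placeholders; fields are separated by literal tabs):
      ```
      NA001	NA001_L001	/path/to/NA001_R1.fastq.gz	/path/to/NA001_R2.fastq.gz
      ```
      and a hypothetical `gvcfs.txt` line:
      ```
      NA001	/path/to/NA001.g.vcf.gz	/path/to/NA001.g.vcf.gz.tbi
      ```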
  - `launch_cromwell.sh`
  - `launch_jointgt.sh`
    - These are the scripts which launch the pipeline.
    - `launch_cromwell.sh` launches the fastq to gvcf stage.
    - `launch_jointgt.sh` launches the gvcf joint genotyping to cohort vcf step. This is performed once you have run all samples through the fastq to gvcf stage.
    - Check that the paths and parameters make sense for your machine.
  - `slurm.conf`
    - The main options here relate to the job scheduler. If you are running on Zeus at Pawsey, you should not need to alter these parameters.
  - `cromwell.options`
    - `cromwell.options` requires editing to provide the directory where you would like the final workflow outputs to be written; a hypothetical sketch is shown below.
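    For example, a minimal sketch of `cromwell.options` (the output directory is a placeholder; `final_workflow_outputs_dir` is the standard Cromwell workflow option for copying final outputs):
    ```
    {
        "final_workflow_outputs_dir": "/scratch/projectID/userID/final_outputs"
    }
    ```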
  - `Multisample_Fastq_to_Gvcf_GATK4.wdl`
  - `ruddle_fastq_to_gvcf_single_sample_gatk4.wdl`
    - The paths to your jar files will need to be updated.
    - The path to your conda `activate` binary will need to be updated (e.g. `/group/projectID/userID/miniconda/bin/activate`).
- Launch the job using `sbatch launch_cromwell.sh`. When that has completed successfully, you can launch the second stage of the pipeline (joint calling) with `sbatch launch_jointgt.sh`.
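  For example, a typical submit-and-monitor session (standard SLURM commands; the log file name follows the default `sbatch` pattern unless your launch script sets `--output`):
  ```
  sbatch launch_cromwell.sh   # submit the fastq-to-gvcf stage
  squeue -u $USER             # watch the Cromwell head job and the task jobs it spawns
  tail -f slurm-<jobid>.out   # follow the Cromwell log; <jobid> is the ID printed by sbatch
  ```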
Code Snippets
The following excerpts are from the WDL workflows in this repository.
```
call make_uniq_samples_file {
  input:
    inputSamplesFile = inputSamplesFile,
}
```
```
call make_fastq_file {
  input:
    inputSamplesFile = inputSamplesFile,
    sample_name = sample
}
```
```
call single_wf.Fastq_to_Gvcf_GATK4 {
  input:
    sample_fastq_file = make_fastq_file.sample_fastq_file,
    sample_name = sample,
    unmapped_bam_suffix = unmapped_bam_suffix,
    ref_name = ref_name,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    ref_amb = ref_amb,
    ref_ann = ref_ann,
    ref_bwt = ref_bwt,
    ref_pac = ref_pac,
    ref_sa = ref_sa,
    ref_alt = ref_alt,
    bwa_commandline = bwa_commandline,
    compression_level = compression_level,
    dbSNP_vcf = dbSNP_vcf,
    dbSNP_vcf_index = dbSNP_vcf_index,
    known_indels_sites_VCFs = known_indels_sites_VCFs,
    known_indels_sites_indices = known_indels_sites_indices,
    scattered_calling_intervals_list = scattered_calling_intervals_list
}
```
```
call FastqToUbam {
  input:
    sample_name = fastq[0],
    readgroup = readgroup,
    fastq1 = fastq[2],
    fastq2 = fastq[3],
    outpref = sample_name + "." + readgroup
}
```
```
call SamToFastqAndBwaMem {
  input:
    input_bam = FastqToUbam.unmapped_bam,
    bwa_commandline = bwa_commandline,
    output_bam_basename = sample_name + "." + readgroup + ".unmerged",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    ref_amb = ref_amb,
    ref_ann = ref_ann,
    ref_bwt = ref_bwt,
    ref_pac = ref_pac,
    ref_sa = ref_sa,
    ref_alt = ref_alt,
    compression_level = compression_level
}
```
```
call MergeBamAlignment {
  input:
    unmapped_bam = FastqToUbam.unmapped_bam,
    bwa_commandline = bwa_commandline,
    bwa_version = GetBwaVersion.version,
    aligned_bam = SamToFastqAndBwaMem.output_bam,
    output_bam_basename = sample_name + "." + readgroup + ".aligned.unsorted",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    compression_level = compression_level
}
```
```
call MarkDuplicates {
  input:
    input_bams = MergeBamAlignment.output_bam,
    output_bam_basename = base_file_name + ".aligned.unsorted.duplicates_marked",
    metrics_filename = base_file_name + ".duplicate_metrics",
    compression_level = compression_level
}
```
```
call SortAndFixTags {
  input:
    input_bam = MarkDuplicates.output_bam,
    output_bam_basename = base_file_name + ".aligned.duplicate_marked.sorted",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    compression_level = compression_level
}
```
```
call CreateSequenceGroupingTSV {
  input:
    ref_dict = ref_dict,
}
```
```
call BaseRecalibrator {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    recalibration_report_filename = base_file_name + ".recal_data.csv",
    sequence_group_interval = subgroup,
    dbSNP_vcf = dbSNP_vcf,
    dbSNP_vcf_index = dbSNP_vcf_index,
    known_indels_sites_VCFs = known_indels_sites_VCFs,
    known_indels_sites_indices = known_indels_sites_indices,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}
```
```
call GatherBqsrReports {
  input:
    input_bqsr_reports = BaseRecalibrator.recalibration_report,
    output_report_filename = base_file_name + ".recal_data.csv",
}
```
```
call ApplyBQSR {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    output_bam_basename = base_file_name + ".aligned.duplicates_marked.recalibrated",
    recalibration_report = GatherBqsrReports.output_bqsr_report,
    sequence_group_interval = subgroup,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}
```
```
call GatherBamFiles {
  input:
    input_bams = ApplyBQSR.recalibrated_bam,
    output_bam_basename = base_file_name,
    compression_level = compression_level
}
```
```
call HaplotypeCaller {
  input:
    input_bam = GatherBamFiles.output_bam,
    input_bam_index = GatherBamFiles.output_bam_index,
    interval_list = interval_file,
    output_filename = base_file_name + ".g.vcf.gz",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index
}
```
```
call MergeGVCFs {
  input:
    input_vcfs = HaplotypeCaller.output_vcf,
    input_vcfs_indexes = HaplotypeCaller.output_vcf_index,
    output_filename = base_file_name + ".g.vcf.gz"
}
```
```
command <<<
  set -o pipefail
  set -e

  # set the bash variable needed for the command-line
  bash_ref_fasta=${ref_fasta}

  java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /scratch/pawsey0001/sbeecroft/tool/picard.jar \
    SamToFastq \
    INPUT=${input_bam} \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true \
  | \
  ${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) \
  | \
  samtools view -1 - > ${output_bam_basename}.bam
>>>
```
```
command <<<
  python <<CODE
with open("${ref_dict}", "r") as ref_dict_file:
    sequence_tuple_list = []
    longest_sequence = 0
    for line in ref_dict_file:
        if line.startswith("@SQ"):
            line_split = line.split("\t")
            # (Sequence_Name, Sequence_Length)
            sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
    longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]

# We are adding this to the intervals because b37 has contigs named with embedded colons (:) and a bug in
# some versions of GATK strips off the last element after a colon, so we add this as a sacrificial element.
b37_protection_tag = ":1+"

# initialize the tsv string with the first sequence
tsv_string = sequence_tuple_list[0][0] + b37_protection_tag
temp_size = sequence_tuple_list[0][1]

for sequence_tuple in sequence_tuple_list[1:]:
    if temp_size + sequence_tuple[1] <= longest_sequence:
        temp_size += sequence_tuple[1]
        tsv_string += "\t" + sequence_tuple[0] + b37_protection_tag
    else:
        tsv_string += "\n" + sequence_tuple[0] + b37_protection_tag
        temp_size = sequence_tuple[1]

# add the unmapped sequences as a separate line to ensure that they are recalibrated as well
with open("sequence_grouping.txt", "w") as tsv_file:
    tsv_file.write(tsv_string)
    tsv_file.close()

tsv_string += '\n' + "unmapped"

with open("sequence_grouping_with_unmapped.txt", "w") as tsv_file_with_unmapped:
    tsv_file_with_unmapped.write(tsv_string)
    tsv_file_with_unmapped.close()
CODE
>>>
```
```
command <<<
  set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m ${java_opt}" \
    HaplotypeCaller \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -L ${interval_list} \
    -O ${output_filename} \
    -contamination ${default=0 contamination} \
    -ERC GVCF
>>>
```
```
command <<<
  set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m" \
    MergeVcfs \
    --INPUT ${sep=' --INPUT ' input_vcfs} \
    --OUTPUT ${output_filename}
>>>
```
Support
- Future updates