Fastq to Joint-Called Cohort VCF Pipeline with GATK4 and Cromwell on SLURM
SLURM HPC Cromwell implementation of GATK4 germline variant calling pipeline
See the GATK website for more information on this toolset.
Assumptions
- Using the hg38 human reference genome build
- Running under HPC/SLURM scheduling. This repo was specifically tested on the Pawsey Zeus machine, primarily running in the `/scratch` partition.
- Starting from short-read Illumina paired-end fastq files as input
Dependencies
The following versions have been tested and are known to work, but GATK and Cromwell are updated regularly, so consider whether you would like to use newer versions of these tools.
- BWA/0.7.15
- GATK v4.0.6.0
- SAMtools/1.5
- picard/2.9
- Python/2.7
- Cromwell v61
Quick start guide
Installing and preparing environment for GATK4 with Cromwell
- Clone the repository:
  ```
  git clone https://github.com/SarahBeecroft/slurmCromwellGATK4.git
  cd slurmCromwellGATK4
  chmod +x *.sh
  ```
- Install Miniconda if you haven't already. This is best placed in your `/group` directory to avoid filling your small `/home` directory, or having it purged if placed in the `/scratch` directory.
- Create the Conda environment using the supplied conda environment file:
  ```
  conda env create --file gatk4_pipeline.yml
  ```
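  To use it, activate the environment before launching the pipeline. A minimal sketch, assuming the environment is named `gatk4_pipeline` (check the `name:` field in `gatk4_pipeline.yml`) and that Miniconda lives in your `/group` directory:
  ```
  # Environment name assumed from the supplied yml file; adjust the path to your Miniconda install
  source /group/projectID/userID/miniconda/bin/activate gatk4_pipeline
  ```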
- Download the necessary .jar files (a hypothetical download sketch follows this list):
  - The Cromwell workflow orchestration engine can be downloaded from https://github.com/broadinstitute/cromwell/releases/
  - GATK can be downloaded from https://github.com/broadinstitute/gatk/releases. Unzip the downloaded file with `unzip`.
  - Picard can be downloaded from https://github.com/broadinstitute/picard/releases/
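  For example, the tested versions could be fetched as below. This is a sketch only: the asset names are assumptions based on the usual GitHub release layout, so verify them on the release pages.
  ```
  # Asset names are assumed from the standard release naming; check the release pages
  wget https://github.com/broadinstitute/cromwell/releases/download/61/cromwell-61.jar
  wget https://github.com/broadinstitute/gatk/releases/download/4.0.6.0/gatk-4.0.6.0.zip
  unzip gatk-4.0.6.0.zip
  wget https://github.com/broadinstitute/picard/releases/download/2.9.0/picard.jar
  ```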
- If you do not have the resource bundle files already, these need to be downloaded (in future they will be cached on Pawsey systems). The bundle data should be downloaded from the Google Cloud bucket and not from the FTP site, which is missing various files. Refer to this handy blog post on how to download the resource files using the Google Cloud SDK. A SLURM script (`download_bundle.slurm`) is provided that downloads all hg38 files from the Google Cloud bucket. The files were downloaded to `/scratch/pawsey0001/sbeecroft/hg38/v0`, which needs to be moved before the data is purged after 30 days. Note that `Homo_sapiens_assembly38.dbsnp138.vcf.gz` was taken from the FTP bundle, as this file could not be downloaded using the Conda version of the Google Cloud SDK.
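  If you want to fetch the bundle manually instead, a minimal `gsutil` sketch (the bucket path is an assumption based on the publicly documented Broad hg38 bundle; point the destination at your own project area):
  ```
  # -m parallelises the transfer; the bundle is large
  gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0 /scratch/projectID/userID/hg38/
  ```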
- Set up the config files. Files that you need to edit with the correct paths to your data/jar files, or other specific configurations, are:
  - `Multisample_Fastq_to_Gvcf_GATK4_inputs_hg38.json`
  - `Multisample_jointgt_GATK4_inputs_hg38.json`
    - Both json files will need the correct paths to your reference file locations and to the file specifying your inputs, i.e. `samples.txt` or `gvcfs.txt`:
      - These are the sample input files (tab separated).
      - The format for `samples.txt` is: sampleID, sampleID_readgroup, path_to_fastq_R1_file, path_to_fastq_R2_file
      - The format for `gvcfs.txt` is: sampleID, gvcf, gvcf .tbi index file
      - Examples are included in this repo; a hypothetical sketch is also shown below.
      - NOTE: Having tabs, not spaces, is vital for parsing the file. Visual Studio Code tends to introduce spaces, so if you are having issues, check the file with another text editor such as Sublime Text.
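      For illustration, a hypothetical `samples.txt` line (sample names and paths are placeholders; fields are separated by literal tabs):
      ```
      NA001	NA001_L001	/path/to/NA001_R1.fastq.gz	/path/to/NA001_R2.fastq.gz
      ```
      and a hypothetical `gvcfs.txt` line:
      ```
      NA001	/path/to/NA001.g.vcf.gz	/path/to/NA001.g.vcf.gz.tbi
      ```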
  - `launch_cromwell.sh`
  - `launch_jointgt.sh`
    - These are the scripts which launch the pipeline.
    - `launch_cromwell.sh` launches the fastq to gvcf stage.
    - `launch_jointgt.sh` launches the gvcf joint genotyping to cohort vcf step. This is performed once you have run all samples through the fastq to gvcf stage.
    - Check that the paths and parameters make sense for your machine.
  - `slurm.conf`
    - The main options here relate to the job scheduler. If you are running on Zeus at Pawsey, you should not need to alter these parameters.
  - `cromwell.options`
    - `cromwell.options` requires editing to provide the directory where you would like the final workflow outputs to be written; a hypothetical sketch is shown below.
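    For example, a minimal sketch of `cromwell.options` (the output directory is a placeholder; `final_workflow_outputs_dir` is the standard Cromwell workflow option for copying final outputs):
    ```
    {
        "final_workflow_outputs_dir": "/scratch/projectID/userID/final_outputs"
    }
    ```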
  - `Multisample_Fastq_to_Gvcf_GATK4.wdl`
  - `ruddle_fastq_to_gvcf_single_sample_gatk4.wdl`
    - The paths to your jar files will need to be updated.
    - The path to your conda `activate` binary will need to be updated (e.g. `/group/projectID/userID/miniconda/bin/activate`).
- Launch the job using `sbatch launch_cromwell.sh`. When that has completed successfully, you can launch the second stage of the pipeline (joint calling) with `sbatch launch_jointgt.sh`.
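  For example, a typical submit-and-monitor session (standard SLURM commands; the log file name follows the default `sbatch` pattern unless your launch script sets `--output`):
  ```
  sbatch launch_cromwell.sh   # submit the fastq-to-gvcf stage
  squeue -u $USER             # watch the Cromwell head job and the task jobs it spawns
  tail -f slurm-<jobid>.out   # follow the Cromwell log; <jobid> is the ID printed by sbatch
  ```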
Code Snippets
The following excerpts are from the WDL workflows in this repository.
```
call make_uniq_samples_file {
  input:
    inputSamplesFile = inputSamplesFile,
}
```
```
call make_fastq_file {
  input:
    inputSamplesFile = inputSamplesFile,
    sample_name = sample
}
```
```
call single_wf.Fastq_to_Gvcf_GATK4 {
  input:
    sample_fastq_file = make_fastq_file.sample_fastq_file,
    sample_name = sample,
    unmapped_bam_suffix = unmapped_bam_suffix,
    ref_name = ref_name,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    ref_amb = ref_amb,
    ref_ann = ref_ann,
    ref_bwt = ref_bwt,
    ref_pac = ref_pac,
    ref_sa = ref_sa,
    ref_alt = ref_alt,
    bwa_commandline = bwa_commandline,
    compression_level = compression_level,
    dbSNP_vcf = dbSNP_vcf,
    dbSNP_vcf_index = dbSNP_vcf_index,
    known_indels_sites_VCFs = known_indels_sites_VCFs,
    known_indels_sites_indices = known_indels_sites_indices,
    scattered_calling_intervals_list = scattered_calling_intervals_list
}
```
```
call FastqToUbam {
  input:
    sample_name = fastq[0],
    readgroup = readgroup,
    fastq1 = fastq[2],
    fastq2 = fastq[3],
    outpref = sample_name + "." + readgroup
}
```
```
call SamToFastqAndBwaMem {
  input:
    input_bam = FastqToUbam.unmapped_bam,
    bwa_commandline = bwa_commandline,
    output_bam_basename = sample_name + "." + readgroup + ".unmerged",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    ref_amb = ref_amb,
    ref_ann = ref_ann,
    ref_bwt = ref_bwt,
    ref_pac = ref_pac,
    ref_sa = ref_sa,
    ref_alt = ref_alt,
    compression_level = compression_level
}
```
```
call MergeBamAlignment {
  input:
    unmapped_bam = FastqToUbam.unmapped_bam,
    bwa_commandline = bwa_commandline,
    bwa_version = GetBwaVersion.version,
    aligned_bam = SamToFastqAndBwaMem.output_bam,
    output_bam_basename = sample_name + "." + readgroup + ".aligned.unsorted",
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    ref_dict = ref_dict,
    compression_level = compression_level
}
```
```
call MarkDuplicates {
  input:
    input_bams = MergeBamAlignment.output_bam,
    output_bam_basename = base_file_name + ".aligned.unsorted.duplicates_marked",
    metrics_filename = base_file_name + ".duplicate_metrics",
    compression_level = compression_level
}
```
```
call SortAndFixTags {
  input:
    input_bam = MarkDuplicates.output_bam,
    output_bam_basename = base_file_name + ".aligned.duplicate_marked.sorted",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
    compression_level = compression_level
}
```
```
call CreateSequenceGroupingTSV {
  input:
    ref_dict = ref_dict,
}
```
```
call BaseRecalibrator {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    recalibration_report_filename = base_file_name + ".recal_data.csv",
    sequence_group_interval = subgroup,
    dbSNP_vcf = dbSNP_vcf,
    dbSNP_vcf_index = dbSNP_vcf_index,
    known_indels_sites_VCFs = known_indels_sites_VCFs,
    known_indels_sites_indices = known_indels_sites_indices,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}
```
```
call GatherBqsrReports {
  input:
    input_bqsr_reports = BaseRecalibrator.recalibration_report,
    output_report_filename = base_file_name + ".recal_data.csv",
}
```
```
call ApplyBQSR {
  input:
    input_bam = SortAndFixTags.output_bam,
    input_bam_index = SortAndFixTags.output_bam_index,
    output_bam_basename = base_file_name + ".aligned.duplicates_marked.recalibrated",
    recalibration_report = GatherBqsrReports.output_bqsr_report,
    sequence_group_interval = subgroup,
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index,
}
```
```
call GatherBamFiles {
  input:
    input_bams = ApplyBQSR.recalibrated_bam,
    output_bam_basename = base_file_name,
    compression_level = compression_level
}
```
```
call HaplotypeCaller {
  input:
    input_bam = GatherBamFiles.output_bam,
    input_bam_index = GatherBamFiles.output_bam_index,
    interval_list = interval_file,
    output_filename = base_file_name + ".g.vcf.gz",
    ref_dict = ref_dict,
    ref_fasta = ref_fasta,
    ref_fasta_index = ref_fasta_index
}
```
```
call MergeGVCFs {
  input:
    input_vcfs = HaplotypeCaller.output_vcf,
    input_vcfs_indexes = HaplotypeCaller.output_vcf_index,
    output_filename = base_file_name + ".g.vcf.gz"
}
```
```
command <<<
  set -o pipefail
  set -e

  # set the bash variable needed for the command-line
  bash_ref_fasta=${ref_fasta}

  java -Dsamjdk.compression_level=${compression_level} -Xms3000m -jar /scratch/pawsey0001/sbeecroft/tool/picard.jar \
    SamToFastq \
    INPUT=${input_bam} \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true \
  | \
  ${bwa_commandline} /dev/stdin - 2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) \
  | \
  samtools view -1 - > ${output_bam_basename}.bam
>>>
```
```
command <<<
  python <<CODE
with open("${ref_dict}", "r") as ref_dict_file:
    sequence_tuple_list = []
    longest_sequence = 0
    for line in ref_dict_file:
        if line.startswith("@SQ"):
            line_split = line.split("\t")
            # (Sequence_Name, Sequence_Length)
            sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
    longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]

# We are adding this to the intervals because b37 has contigs named with embedded colons (:) and a bug in
# some versions of GATK strips off the last element after a colon, so we add this as a sacrificial element.
b37_protection_tag = ":1+"

# initialize the tsv string with the first sequence
tsv_string = sequence_tuple_list[0][0] + b37_protection_tag
temp_size = sequence_tuple_list[0][1]

for sequence_tuple in sequence_tuple_list[1:]:
    if temp_size + sequence_tuple[1] <= longest_sequence:
        temp_size += sequence_tuple[1]
        tsv_string += "\t" + sequence_tuple[0] + b37_protection_tag
    else:
        tsv_string += "\n" + sequence_tuple[0] + b37_protection_tag
        temp_size = sequence_tuple[1]

# add the unmapped sequences as a separate line to ensure that they are recalibrated as well
with open("sequence_grouping.txt", "w") as tsv_file:
    tsv_file.write(tsv_string)
    tsv_file.close()

tsv_string += '\n' + "unmapped"

with open("sequence_grouping_with_unmapped.txt", "w") as tsv_file_with_unmapped:
    tsv_file_with_unmapped.write(tsv_string)
    tsv_file_with_unmapped.close()
CODE
>>>
```
```
command <<<
  set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m ${java_opt}" \
    HaplotypeCaller \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -L ${interval_list} \
    -O ${output_filename} \
    -contamination ${default=0 contamination} \
    -ERC GVCF
>>>
```
```
command <<<
  set -e

  /scratch/pawsey0001/sbeecroft/tool/gatk-4.2.0.0/gatk --java-options "-Xmx15000m" \
    MergeVcfs \
    --INPUT ${sep=' --INPUT ' input_vcfs} \
    --OUTPUT ${output_filename}
>>>
```
Support
- Future updates