Genomic variants - SNPs and INDELs detection using GATK4 Spark-based tools.

Author: AMBARISH KUMAR [email protected] & [email protected]

This is a proposed standard operating procedure for genomic variant detection using GATK4.

It is intended to be an effective way to identify SARS-CoV-2 genome variants.

It takes Illumina RNA-seq reads and a reference genome sequence as input.

Code Snippets

baseCommand:
  - bowtie2

arguments:
  - valueFrom: |
      ${
        if (inputs.filelist && inputs.filelist_mates){
          return "-1";
        } else if (inputs.filelist){
          return "-U";
        } else {
          return null;
        }
      }
    position: 82
  - valueFrom: |
      ${
        if (inputs.filelist && inputs.filelist_mates){
          return "-2";
        } else if (inputs.filelist_mates){
          return "-U";
        } else {
          return null;
        }
      }
    position: 84
  - valueFrom: |
      ${
        if (inputs.output_filename == ""){
          return ' 2> ' + default_output_filename().split('.').slice(0,-1).join('.') + '.log';
        } else {
          return ' 2> ' + inputs.output_filename.split('.').slice(0,-1).join('.') + '.log';
        }
      }
    position: 100000
    shellQuote: false
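The two flag expressions and the log-name expression in the bowtie2 snippet above can be restated outside CWL. The following is a minimal Python sketch of that logic, not part of the workflow. The function names `read_flags` and `log_name` are hypothetical, and the fallback to `default_output_filename()` for an empty output name is left out because that helper is not shown in the snippet.

```python
from typing import Optional, Tuple


def read_flags(filelist: Optional[str],
               filelist_mates: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return the flags emitted before the first and second read inputs.

    Mirrors the two `valueFrom` expressions: both inputs present means
    paired-end (-1/-2); a single input is passed as unpaired (-U);
    a missing input produces no flag (None).
    """
    if filelist and filelist_mates:
        return ("-1", "-2")
    first = "-U" if filelist else None
    second = "-U" if filelist_mates else None
    return (first, second)


def log_name(output_filename: str) -> str:
    """Drop the last dot-separated extension and append '.log'.

    Equivalent to split('.').slice(0,-1).join('.') + '.log' in the snippet.
    """
    return ".".join(output_filename.split(".")[:-1]) + ".log"


if __name__ == "__main__":
    print(read_flags("reads_1.fq", "reads_2.fq"))
    print(log_name("sample.sam"))
```

One consequence of the log-name expression is that an output name without any dot, such as `sample`, produces `.log`, because nothing precedes the last dot-separated part.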
baseCommand:
  - bowtie2-build

arguments:
  - valueFrom: $('2> ' + inputs.bt2_index_base + '.log')
    position: 100000
    shellQuote: false
baseCommand:
- gatk
- HaplotypeCallerSpark
baseCommand:
- gatk
- SelectVariants
baseCommand:
- gatk
- SplitNCigarReads
baseCommand:
- gatk
- VariantFiltration
baseCommand: []
arguments:
- valueFrom: |-
    gatk MarkDuplicatesSpark -I $(inputs.inputBAM.path) -O $(inputs.sampleName).markdup.bam -M output.metrics
  shellQuote: false
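The MarkDuplicatesSpark snippet above builds its whole command line as one unquoted string. As a rough illustration of what that string looks like once the inputs are substituted, here is a small Python sketch. The function name `markdup_command` and the example paths are hypothetical, and unlike the CWL snippet (which sets `shellQuote: false`) this sketch quotes each token with `shlex.quote`, which is an added safety assumption rather than the workflow's behavior.

```python
import shlex


def markdup_command(input_bam: str, sample_name: str) -> str:
    """Assemble the MarkDuplicatesSpark command shown in the snippet.

    Only the flags present in the snippet (-I, -O, -M) are used; the
    output BAM is named <sampleName>.markdup.bam and the metrics file
    is the fixed name output.metrics.
    """
    output_bam = f"{sample_name}.markdup.bam"
    parts = [
        "gatk", "MarkDuplicatesSpark",
        "-I", input_bam,
        "-O", output_bam,
        "-M", "output.metrics",
    ]
    # Quote every token so paths containing spaces survive the shell.
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    print(markdup_command("/data/s1.bam", "s1"))
```

Because the metrics file name is fixed, two samples processed in the same working directory would write to the same `output.metrics` path.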
baseCommand:
- picard
- AddOrReplaceReadGroups

doc: |-
  Assigns all the reads in a file to a single new read-group.

   <h3>Summary</h3>
   Many tools (Picard and GATK for example) require or assume the presence of at least one <code>RG</code> tag, defining a "read-group"
   to which each read can be assigned (as specified in the <code>RG</code> tag in the SAM record).
   This tool enables the user to assign all the reads in the INPUT to a single new read-group.
   For more information about read-groups, see the <a href='https://www.broadinstitute.org/gatk/guide/article?id=6472'>
   GATK Dictionary entry.</a>
   <br />
   This tool accepts as INPUT BAM and SAM files or URLs from the
   <a href="http://ga4gh.org/#/documentation">Global Alliance for Genomics and Health (GA4GH)</a>.
   <h3>Caveats</h3>
   The value of the tags must adhere (according to the <a href="https://samtools.github.io/hts-specs/SAMv1.pdf">SAM-spec</a>)
   with the regex <pre>#READGROUP_ID_REGEX</pre> (one or more characters from the ASCII range 32 through 126). In
   particular <code>&lt;Space&gt;</code> is the only non-printing character allowed.
   <br/>
   The program enables only the wholesale assignment of all the reads in the INPUT to a single read-group. If your file
   already has reads assigned to multiple read-groups, the original <code>RG</code> value will be lost.
  Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups

requirements:
  ShellCommandRequirement: {}
  InlineJavascriptRequirement:
    expressionLib:
    - |
      function generateGATK4BooleanValue(){
          /**
           * Boolean types in GATK 4 are expressed on the command line as --<PREFIX> "true"/"false",
           * so patch here
           */
          if(self === true || self === false){
              return self.toString()
          }

          return self;
      }
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/picard:2.22.2--0
inputs:
- doc: Input file (BAM or SAM or a GA4GH url). [synonymous with -I]
  id: INPUT
  type: File
  inputBinding:
    prefix: INPUT=
    separate: false
- doc: Read-Group library [synonymous with -LB]
  id: RGLB
  type: string
  inputBinding:
    prefix: RGLB=
    separate: false
- doc: Read-Group platform (e.g. ILLUMINA, SOLID) [synonymous with -PL]
  id: RGPL
  type: string
  inputBinding:
    prefix: RGPL=
    separate: false
- doc: Read-Group platform unit (eg. run barcode) [synonymous with -PU]
  id: RGPU
  type: string
  inputBinding:
    prefix: RGPU=
    separate: false
- doc: Read-Group sample name [synonymous with -SM]
  id: RGSM
  type: string
  inputBinding:
    prefix: RGSM=
    separate: false
- doc: Output filename (BAM or SAM)
  id: OUTPUT
  type: string
  inputBinding:
    prefix: OUTPUT=
    separate: false
- doc: Reference sequence file. [synonymous with -R]
  id: REFERENCE_SEQUENCE
  type: File?
  inputBinding:
    prefix: REFERENCE_SEQUENCE=
    separate: false
- doc: Optional sort order to output in. If not supplied OUTPUT is in the same order
    as INPUT. [synonymous with -SO]
  id: SORT_ORDER
  type:
  - 'null'
  - type: enum
    symbols:
    - unsorted
    - queryname
    - coordinate
    - duplicate
    - unknown
  inputBinding:
    prefix: SORT_ORDER=
    separate: false
- doc: Read-Group sequencing center name [synonymous with -CN]
  id: RGCN
  type: string?
  inputBinding:
    prefix: RGCN=
    separate: false
- doc: Read-Group description [synonymous with -DS]
  id: RGDS
  type: string?
  inputBinding:
    prefix: RGDS=
    separate: false
- doc: Read-Group run date in Iso8601Date format [synonymous with -DT]
  id: RGDT
  type: string?
  inputBinding:
    prefix: RGDT=
    separate: false
- doc: Read-Group flow order [synonymous with -FO]
  id: RGFO
  type: string?
  inputBinding:
    prefix: RGFO=
    separate: false
- doc: Read-Group ID [synonymous with -ID]
  id: RGID
  type: string?
  inputBinding:
    prefix: RGID=
    separate: false
- doc: Read-Group key sequence [synonymous with -KS]
  id: RGKS
  type: string?
  inputBinding:
    prefix: RGKS=
    separate: false
- doc: Read-Group program group [synonymous with -PG]
  id: RGPG
  type: string?
  inputBinding:
    prefix: RGPG=
    separate: false
- doc: Read-Group predicted insert size [synonymous with -PI]
  id: RGPI
  type: int?
  inputBinding:
    prefix: RGPI=
    separate: false
- doc: Read-Group platform model [synonymous with -PM]
  id: RGPM
  type: string?
  inputBinding:
    prefix: RGPM=
    separate: false
- doc: Control verbosity of logging.
  id: VERBOSITY
  type:
  - 'null'
  - type: enum
    symbols:
    - ERROR
    - WARNING
    - INFO
    - DEBUG
  inputBinding:
    prefix: VERBOSITY=
    separate: false
- doc: Whether to suppress job-summary info on System.err.
  id: QUIET
  type: boolean?
  inputBinding:
    prefix: QUIET=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: Validation stringency for all SAM files read by this program.  Setting stringency
    to SILENT can improve performance when processing a BAM file in which variable-length
    data (read, qualities, tags) do not otherwise need to be decoded.
  id: VALIDATION_STRINGENCY
  type:
  - 'null'
  - type: enum
    symbols:
    - STRICT
    - LENIENT
    - SILENT
  inputBinding:
    prefix: VALIDATION_STRINGENCY=
    separate: false
- doc: Compression level for all compressed files created (e.g. BAM and VCF).
  id: COMPRESSION_LEVEL
  type: int?
  inputBinding:
    prefix: COMPRESSION_LEVEL=
    separate: false
- doc: When writing files that need to be sorted, this will specify the number of
    records stored in RAM before spilling to disk. Increasing this number reduces
    the number of file handles needed to sort the file, and increases the amount of
    RAM needed.
  id: MAX_RECORDS_IN_RAM
  type: int?
  inputBinding:
    prefix: MAX_RECORDS_IN_RAM=
    separate: false
- doc: Use the JDK Deflater instead of the Intel Deflater for writing compressed output
    [synonymous with -use_jdk_deflater]
  id: USE_JDK_DEFLATER
  type: boolean?
  inputBinding:
    prefix: USE_JDK_DEFLATER=
    separate: false
    valueFrom: $(generateGATK4BooleanValue())
- doc: Use the JDK Inflater instead of the Intel Inflater for reading compressed input
    [synonymous with -use_jdk_inflater]
  id: USE_JDK_INFLATER
  type: boolean?
  inputBinding:
    prefix: USE_JDK_INFLATER=
    separate: false
    valueFrom: $(generateGATK4BooleanValue())
- doc: Whether to create a BAM index when writing a coordinate-sorted BAM file.
  id: CREATE_INDEX
  type: boolean?
  inputBinding:
    prefix: CREATE_INDEX=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: 'Whether to create an MD5 digest for any BAM or FASTQ files created.  '
  id: CREATE_MD5_FILE
  type: boolean?
  inputBinding:
    prefix: CREATE_MD5_FILE=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: Google Genomics API client_secrets.json file path.
  id: GA4GH_CLIENT_SECRETS
  type: File?
  inputBinding:
    prefix: GA4GH_CLIENT_SECRETS=
    separate: false

arguments:
 - TMP_DIR=$(runtime.tmpdir)
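Each input in the AddOrReplaceReadGroups snippet above is bound with a `KEY=` prefix and `separate: false`, so it reaches Picard as a single `KEY=value` token, and booleans are turned into the strings `true` or `false` by `generateGATK4BooleanValue`. The Python sketch below imitates that rendering under simplifying assumptions: the function name `picard_args` is hypothetical, inputs are given as a plain dictionary, every boolean is treated as if it had the `valueFrom` conversion, and tokens are emitted in dictionary order rather than in the order a CWL runner would choose.

```python
from typing import Any, Dict, List


def picard_args(inputs: Dict[str, Any]) -> List[str]:
    """Render inputs as legacy Picard KEY=value tokens."""
    args = []
    for key, value in inputs.items():
        if value is None:
            # Optional input that was not supplied: emit nothing.
            continue
        if isinstance(value, bool):
            # Same effect as generateGATK4BooleanValue: booleans are
            # written out as the strings "true" / "false".
            value = "true" if value else "false"
        # prefix "KEY=" with separate: false gives one joined token.
        args.append(f"{key}={value}")
    return args


if __name__ == "__main__":
    example = {
        "INPUT": "aligned.bam",
        "OUTPUT": "aligned.rg.bam",
        "RGLB": "lib1",
        "RGPL": "ILLUMINA",
        "RGPU": "unit1",
        "RGSM": "sample1",
        "CREATE_INDEX": True,
        "RGCN": None,
    }
    print(["picard", "AddOrReplaceReadGroups"] + picard_args(example))
```

The example values (`lib1`, `unit1`, `sample1`, and the file names) are placeholders chosen for illustration only.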
baseCommand:
- picard
- CreateSequenceDictionary

doc: |-
  Create a SAM/BAM file from a fasta containing reference sequence. The output SAM file contains a header but no
   SAMRecords, and the header contains only sequence records.

requirements:
  ShellCommandRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - $(inputs.REFERENCE)
  InlineJavascriptRequirement:
    expressionLib:
    - |
      function generateGATK4BooleanValue(){
          /**
           * Boolean types in GATK 4 are expressed on the command line as --<PREFIX> "true"/"false",
           * so patch here
           */
          if(self === true || self === false){
              return self.toString()
          }

          return self;
      }
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/picard:2.22.2--0
inputs:
- doc: Input reference fasta or fasta.gz [synonymous with -R]
  id: REFERENCE
  type: File

  inputBinding:
    valueFrom: REFERENCE=$(self.basename)
- doc: Put into AS field of sequence dictionary entry if supplied [synonymous with
    -AS]
  id: GENOME_ASSEMBLY
  type: string?
  inputBinding:
    prefix: GENOME_ASSEMBLY=
    separate: false
- doc: Put into UR field of sequence dictionary entry.  If not supplied, input reference
    file is used [synonymous with -UR]
  id: URI
  type: string?
  inputBinding:
    prefix: URI=
    separate: false
- doc: Put into SP field of sequence dictionary entry [synonymous with -SP]
  id: SPECIES
  type: string?
  inputBinding:
    prefix: SPECIES=
    separate: false
- doc: Make sequence name the first word from the > line in the fasta file.  By default
    the entire contents of the > line is used, excluding leading and trailing whitespace.
  id: TRUNCATE_NAMES_AT_WHITESPACE
  type: boolean?
  inputBinding:
    prefix: TRUNCATE_NAMES_AT_WHITESPACE=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: Stop after writing this many sequences.  For testing.
  id: NUM_SEQUENCES
  type: int?
  inputBinding:
    prefix: NUM_SEQUENCES=
    separate: false
- doc: "Optional file containing the alternative names for the contigs. Tools may\
    \ use this information to consider different contig notations as identical (e.g:\
    \ 'chr1' and '1'). The alternative names will be put into the appropriate @AN\
    \ annotation for each contig. No header. First column is the original name, the\
    \ second column is an alternative name. One contig may have more than one alternative\
    \ name. [synonymous with -AN]"
  id: ALT_NAMES
  type: File?
  inputBinding:
    prefix: ALT_NAMES=
    separate: false
- doc: Control verbosity of logging.
  id: VERBOSITY
  type:
  - 'null'
  - type: enum
    symbols:
    - ERROR
    - WARNING
    - INFO
    - DEBUG
  inputBinding:
    prefix: VERBOSITY=
    separate: false
- doc: Whether to suppress job-summary info on System.err.
  id: QUIET
  type: boolean?
  inputBinding:
    prefix: QUIET=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: Validation stringency for all SAM files read by this program.  Setting stringency
    to SILENT can improve performance when processing a BAM file in which variable-length
    data (read, qualities, tags) do not otherwise need to be decoded.
  id: VALIDATION_STRINGENCY
  type:
  - 'null'
  - type: enum
    symbols:
    - STRICT
    - LENIENT
    - SILENT
  inputBinding:
    prefix: VALIDATION_STRINGENCY=
    separate: false
- doc: Compression level for all compressed files created (e.g. BAM and VCF).
  id: COMPRESSION_LEVEL
  type: int?
  inputBinding:
    prefix: COMPRESSION_LEVEL=
    separate: false
- doc: When writing files that need to be sorted, this will specify the number of
    records stored in RAM before spilling to disk. Increasing this number reduces
    the number of file handles needed to sort the file, and increases the amount of
    RAM needed.
  id: MAX_RECORDS_IN_RAM
  type: int?
  inputBinding:
    prefix: MAX_RECORDS_IN_RAM=
    separate: false
- doc: Use the JDK Deflater instead of the Intel Deflater for writing compressed output
    [synonymous with -use_jdk_deflater]
  id: USE_JDK_DEFLATER
  type: boolean?
  inputBinding:
    prefix: USE_JDK_DEFLATER=
    separate: false
    valueFrom: $(generateGATK4BooleanValue())
- doc: Use the JDK Inflater instead of the Intel Inflater for reading compressed input
    [synonymous with -use_jdk_inflater]
  id: USE_JDK_INFLATER
  type: boolean?
  inputBinding:
    prefix: USE_JDK_INFLATER=
    separate: false
    valueFrom: $(generateGATK4BooleanValue())
- doc: Whether to create a BAM index when writing a coordinate-sorted BAM file.
  id: CREATE_INDEX
  type: boolean?
  inputBinding:
    prefix: CREATE_INDEX=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: 'Whether to create an MD5 digest for any BAM or FASTQ files created.  '
  id: CREATE_MD5_FILE
  type: boolean?
  inputBinding:
    prefix: CREATE_MD5_FILE=
    valueFrom: $(generateGATK4BooleanValue())
    separate: false
- doc: Google Genomics API client_secrets.json file path.
  id: GA4GH_CLIENT_SECRETS
  type: File?
  inputBinding:
    prefix: GA4GH_CLIENT_SECRETS=
    separate: false

arguments:
 - TMP_DIR=$(runtime.tmpdir)
 - OUTPUT=$(inputs.REFERENCE.nameroot).dict
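The `OUTPUT=$(inputs.REFERENCE.nameroot).dict` argument above derives the dictionary name from the reference file name. A minimal Python sketch of that derivation follows, assuming that CWL's `nameroot` (the basename with its final extension removed) behaves like `os.path.splitext` for ordinary file names. The function name `dict_name` and the example paths are hypothetical.

```python
import os


def dict_name(reference_path: str) -> str:
    """Return the sequence dictionary name for a reference FASTA.

    Takes the basename, strips only the final extension, and
    appends '.dict'.
    """
    basename = os.path.basename(reference_path)
    nameroot, _ext = os.path.splitext(basename)
    return nameroot + ".dict"


if __name__ == "__main__":
    print(dict_name("/refs/genome.fasta"))
    print(dict_name("/refs/genome.fa.gz"))
```

Under this assumption a compressed reference such as `genome.fa.gz` would produce `genome.fa.dict`, since only the last extension is dropped.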
baseCommand: [ samtools, faidx ]

inputs:
  sequences:
    type: File
    doc: Input FASTA file


arguments:
   - $(inputs.sequences.basename)
baseCommand: ["samtools", "index"]
arguments:
  - valueFrom: -b  # specifies that index is created in bai format
    position: 1

inputs:
  bam_sorted:
    doc: sorted bam input file
    type: File
    inputBinding:
      position: 2
baseCommand: []
arguments:
- valueFrom: |-
    gatk SortSamSpark -I $(inputs.inputBAM.path) -O $(inputs.sampleName).bam
  shellQuote: false
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/ambarishK/bio-cwl-tools/blob/release/gatk4W-spark.cwl
Name: genomic-variants-snps-and-indels-detection-using-1
Version: Version 1
Copyright: Public Domain
License: Boost Software License 1.0