MetaGOflow: An EOSC-Life project workflow for marine Genomic Observatories data analysis

An EOSC-Life project

The workflows developed in the framework of this project are based on pipeline-v5 of the MGnify resource.

This branch is a child of the pipeline_5.1 branch that contains all CWL descriptions of the MGnify pipeline version 5.1.

Dependencies

To run metaGOflow, first make sure the following are set up on your computing environment:

Storage while running

Disk requirements vary with the analysis you are about to run. For an indication of the computing resources used in various cases, see the metaGOflow publication.

Code Snippets

baseCommand: [emapper.py]

inputs:
  fasta_file:
    format: edam:format_1929  # FASTA
    type: File?
    inputBinding:
      separate: true
      prefix: -i
    label: Input FASTA file containing query sequences

  db:
    type: [string?, File?]  # data/eggnog.db
    inputBinding:
      prefix: --database
    label: specify the target database for sequence searches (euk,bact,arch, host:port, local hmmpressed database)

  db_diamond:
    type: [string?, File?]  # data/eggnog_proteins.dmnd
    inputBinding:
      prefix: --dmnd_db
    label: Path to DIAMOND-compatible database

  data_dir:
    type: [string?, Directory?]  # data/
    inputBinding:
      prefix: --data_dir
    label: Directory to use for DATA_PATH

  mode:
    type: string?
    inputBinding:
      prefix: -m
    label: hmmer or diamond

  no_annot:
    type: boolean?
    inputBinding:
      prefix: --no_annot
    label: Skip functional annotation, reporting only hits

  no_file_comments:
    type: boolean?
    inputBinding:
      prefix: --no_file_comments
    label: No header lines nor stats are included in the output files

  cpu:
    type: int?
    inputBinding:
      prefix: --cpu
    default: 8

  annotate_hits_table:
    type: File?
    inputBinding:
      prefix: --annotate_hits_table
    label: Annotate TSV formatted table of query->hits

  dbmem:
    type: boolean?
    inputBinding:
      prefix: --dbmem
    label: Store the whole eggNOG sqlite DB into memory before retrieving the annotations. This requires ~44 GB of RAM memory available.

  output:
    type: string?
    inputBinding:
      prefix: -o
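To make the serialization rule concrete, here is a minimal Python sketch of how the inputBindings above assemble into an `emapper.py` command line: valued inputs become `prefix value` pairs and booleans become bare flags. The input names and prefixes mirror the tool description; the example values are hypothetical.

```python
# Sketch: serialize the eggNOG-mapper inputBindings above into argv.
# Valued inputs emit "prefix value"; true booleans emit the bare prefix.
def build_emapper_argv(inputs):
    valued = [
        ("fasta_file", "-i"), ("db", "--database"), ("db_diamond", "--dmnd_db"),
        ("data_dir", "--data_dir"), ("mode", "-m"), ("cpu", "--cpu"),
        ("output", "-o"),
    ]
    flags = [
        ("no_annot", "--no_annot"),
        ("no_file_comments", "--no_file_comments"),
        ("dbmem", "--dbmem"),
    ]
    argv = ["emapper.py"]
    for name, prefix in valued:
        value = inputs.get(name)
        if value is not None:
            argv += [prefix, str(value)]
    for name, prefix in flags:
        if inputs.get(name):
            argv.append(prefix)
    return argv
```

For example, `{"fasta_file": "q.fasta", "mode": "diamond", "cpu": 8, "no_annot": True, "output": "res"}` yields `emapper.py -i q.fasta -m diamond --cpu 8 -o res --no_annot`.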
baseCommand: [ split_to_chunks.py ]

inputs:
  seqs:
    # format: edam:format_1929  # collision with concatenate.cwl
    type: File
    inputBinding:
      prefix: -i
  chunk_size:
    type: int
    inputBinding:
      prefix: -s
  file_format:
    type: string?
    inputBinding:
      prefix: -f
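The intent of the `-s` (chunk_size) option can be sketched as follows: walk a FASTA file and start a new chunk every `chunk_size` records. This is an assumption about what `split_to_chunks.py` does, not its actual implementation.

```python
# Sketch (assumed behavior of split_to_chunks.py): group FASTA lines into
# chunks of at most `chunk_size` sequences, keeping records intact.
def split_fasta_records(lines, chunk_size):
    chunks, current, n_seqs = [], [], 0
    for line in lines:
        if line.startswith(">"):          # a new record begins
            if n_seqs == chunk_size:      # current chunk is full
                chunks.append(current)
                current, n_seqs = [], 0
            n_seqs += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```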
baseCommand: [ run_FGS.sh ]

# arguments:
# ./FragGeneScan -s SRR1620013_MERGED_FASTQ.fasta -o fgs -w 0 -t illumina_10

inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
  output:
    type: string
    inputBinding:
      separate: true
      prefix: "-o"
  seq_type:
    type: string
    inputBinding:
      separate: true
      prefix: "-s"
  train:
    type: string?
    inputBinding:
      separate: true
      prefix: "-t"
    default: "illumina_10"


# stdout: stdout.txt
# stderr: stderr.txt
baseCommand: [ unite_protein_predictions.py ]

inputs:
  masking_file:
    type: File?
    inputBinding:
      prefix: "--mask"
  predicted_proteins_prodigal_out:
    type: File?
    inputBinding:
      prefix: "--prodigal-out"
  predicted_proteins_prodigal_ffn:
    type: File?
    inputBinding:
      prefix: "--prodigal-ffn"
  predicted_proteins_prodigal_faa:
    type: File?
    inputBinding:
      prefix: "--prodigal-faa"
  predicted_proteins_fgs_out:
    type: File
    inputBinding:
      prefix: "--fgs-out"
  predicted_proteins_fgs_ffn:
    type: File
    inputBinding:
      prefix: "--fgs-ffn"
  predicted_proteins_fgs_faa:
    inputBinding:
      prefix: "--fgs-faa"
    type: File
  basename:
    inputBinding:
      prefix: "--name"
    type: string
  genecaller_order:
    inputBinding:
      prefix: "--caller-priority"
    type: string?
baseCommand: [ fastp ]

arguments: [
        $(inputs.detect_adapter_for_pe),
        $(inputs.overrepresentation_analysis),
        $(inputs.merge),
        $(inputs.merged_out),
        $(inputs.cut_right), 
        $(inputs.base_correction),
        $(inputs.overlap_len_require),
        $(inputs.force_polyg_tail_trimming),
        $(inputs.min_length_required),

        --thread=$(inputs.threads),
        --html, "fastp.html", 
        --json, "fastp.json",
        -i, $(inputs.forward_reads),
        -I, $(inputs.reverse_reads),
        -o, $(inputs.forward_reads.nameroot).trimmed.fastq,
        -O, $(inputs.reverse_reads.nameroot).trimmed.fastq
]

inputs:

  detect_adapter_for_pe:
    type: boolean
    default: false
    inputBinding: 
      valueFrom:
        ${
          if (inputs.detect_adapter_for_pe == true){
            return '--detect_adapter_for_pe';
          } else {
            return '';
          }
        }

  overrepresentation_analysis:
    type: boolean
    default: false
    inputBinding: 
      valueFrom:
        ${
          if (inputs.overrepresentation_analysis == true){
            return '--overrepresentation_analysis';
          } else {
            return '';
          }
        }

  merge: 
    type: boolean
    default: true
    inputBinding: 
      valueFrom: 
        ${
          if (inputs.merge != false){
            return '--merge';
          } else {
            return '';
          }
        }

  merged_out: 
    type: boolean?
    default: true
    inputBinding: 
      prefix: --merged_out
      valueFrom: 
        ${
          if (inputs.merge != false){
            return inputs.forward_reads.nameroot.split(/_(.*)/s)[0] + '.merged.fastq';
          } else {
            return '';
          }
        }

  forward_reads:
    type: File
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA

  reverse_reads:
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA
    type: File?

  threads:
    type: int?
    default: 1

  qualified_phred_quality:
    type: int?
    default: 0
    inputBinding: 
      valueFrom: 
        ${
          if (inputs.qualified_phred_quality > 0) {
            return '--qualified_quality_phred=' + inputs.qualified_phred_quality
          } else {
            return ''
          }
        }

  unqualified_percent_limit:
    type: int?
    default: 0
    inputBinding: 
      valueFrom: 
        ${
          if (inputs.unqualified_percent_limit > 0) {
            return '--unqualified_percent_limit=' + inputs.unqualified_percent_limit
          } else {
            return ''
          }
        }

  min_length_required:
    type: int?
    default: 0
    inputBinding: 
      valueFrom: 
        ${
          if (inputs.min_length_required > 0) {
            return '--length_required=' + inputs.min_length_required
          } else {
            return ''
          }
        }

  force_polyg_tail_trimming:
    type: boolean?
    default: false
    inputBinding:
      valueFrom: 
        ${
          if (inputs.force_polyg_tail_trimming != false){
            return '--trim_poly_g';
          } else {
            return '';
          }
        }

  disable_trim_poly_g:
    type: boolean?
    default: false
    inputBinding:
      valueFrom: 
        ${
          if (inputs.disable_trim_poly_g == true){
            return '--disable_trim_poly_g';
          } else {
            return '';
          }
        }

  base_correction:
    type: boolean?
    default: false
    inputBinding:
      valueFrom: 
        ${
          if (inputs.merge == true && inputs.base_correction == true){
            return '--correction';
          } else {
            return '';
          }
        }

  overlap_len_require: 
    type: int
    default: 0
    inputBinding:
      valueFrom:
        ${
          if (inputs.merge == true){
            return '--overlap_len_require='+inputs.overlap_len_require;
          } else {
            return '';
          }
        }

  cut_right: 
    type: boolean
    default: true
    inputBinding:
      valueFrom: 
        ${
          if (inputs.cut_right == true){
            return '--cut_right'
          } else {
            return ''
          }
        }


#  overlap_diff_limit (default 5) and overlap_diff_limit_percent (default 20%). 
#  Please note that the reads should meet these three conditions simultaneously.
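The `merged_out` filename expression above keeps everything before the first underscore of the forward read's nameroot. A Python translation of that JS expression (the read names here are hypothetical examples):

```python
import re

# Mirror of the CWL expression
#   inputs.forward_reads.nameroot.split(/_(.*)/s)[0] + '.merged.fastq'
# i.e. keep the part of the nameroot before the first underscore.
def merged_name(forward_nameroot):
    return re.split(r"_(.*)", forward_nameroot, flags=re.S)[0] + ".merged.fastq"
```

So a forward read `SRR1620013_1.fastq` (nameroot `SRR1620013_1`) produces `SRR1620013.merged.fastq`, while the per-mate trimmed outputs keep the full nameroot (`SRR1620013_1.trimmed.fastq`).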
baseCommand: [ "go_summary_pipeline-1.0.py" ]

inputs:
  InterProScan_results:
    type: File
    format: edam:format_3475
    inputBinding:
      prefix: --input-file

  config:
    type: [string?, File?]
    inputBinding:
      prefix: --config
    default: "go_summary-config.json"

  output_name:
    type: string

arguments:
  - "--output-file"
  - $(inputs.output_name)
baseCommand: [ hmmscan_tab.py ]  # old was with sed

arguments:
  - valueFrom: $(inputs.input_table.nameroot).tsv
    prefix: -o
baseCommand: ["hmmsearch"]

inputs:

  omit_alignment:
    type: boolean?
    inputBinding:
      position: 1
      prefix: "--noali"

  gathering_bit_score:
    type: boolean?
    inputBinding:
      position: 4
      prefix: "--cut_ga"

  database:
    type: string
    doc: |
      "Database name or path, depending on how your using it."

  database_directory:
    type: [string, Directory?]
    doc: |
      "Database path"

  seqfile:
    format: edam:format_1929  # FASTA
    type: File
    inputBinding:
      position: 6
      separate: true

arguments:
  - valueFrom: |
      ${
        if (inputs.database_directory && inputs.database_directory !== "") {
          var path = inputs.database_directory.path || inputs.database_directory; 
          return path + "/" + inputs.database;
        } else {
          return inputs.database;
        }
      }
    position: 5
  - prefix: --domtblout
    valueFrom: $(inputs.seqfile.nameroot)_hmmsearch.tbl
    position: 2
  - prefix: --cpu
    valueFrom: '4'
  # hmmer is too verbose
  # discard all the std output and error
  - prefix: -o
    valueFrom: '/dev/null'
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
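The database-path expression above resolves the positional argument from two inputs: if `database_directory` is set (either a CWL Directory object with a `.path`, or a plain string), it is prepended to `database`; otherwise `database` is used as-is. A small Python sketch of the same rule (the paths are hypothetical):

```python
# Mirror of the CWL valueFrom expression: prepend the database directory
# (Directory object with .path, or a plain string) when one is provided.
def resolve_db(database, database_directory=None):
    if database_directory:
        path = getattr(database_directory, "path", database_directory)
        return f"{path}/{database}"
    return database
```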
baseCommand: [ interproscan.sh ]

inputs:

  inputFile:
    type: File
    format: edam:format_1929
    inputBinding:
      position: 8
      prefix: '--input'
    label: Input file path
    doc: >-
      Optional, path to fasta file that should be loaded on Master startup.
      Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.

  applications:
    type: string[]?
    inputBinding:
      position: 9
      itemSeparator: ','
      prefix: '--applications'
    label: Analysis
    doc: >-
      Optional, comma separated list of analyses. If this option is not set, ALL
      analyses will be run.

  databases:
    type: [string?, Directory]

  cpu:
    type: int
    default: 8
    inputBinding:
      position: 2
      prefix: '--cpu'
    label: Number of CPUs
    doc: >-
      Optional, number of CPUs to use. If not set, the number of CPUs available
      on the machine will be used.

  disableResidueAnnotation:
    type: boolean?
    inputBinding:
      position: 11
      prefix: '--disable-residue-annot'
    label: Disables residue annotation
    doc: 'Optional, excludes sites from the XML, JSON output.'


arguments:
  - position: 0
    valueFrom: '--disable-precalc'
  - position: 1
    valueFrom: '--goterms'
  - position: 2
    valueFrom: '--pathways'
  - position: 3
    prefix: '--tempdir'
    valueFrom: $(runtime.tmpdir)
  - position: 7
    valueFrom: 'TSV'
    prefix: '-f'
  - position: 8
    valueFrom: $(runtime.outdir)/$(inputs.inputFile.nameroot).IPS.tsv
    prefix: '-o'
baseCommand: [ "run_quality_filtering.py" ]

inputs:
  seq_file:
    type: File
    # format: edam:format_1929  # FASTA
    inputBinding:
      position: 1
    label: 'Trimmed sequence file'
    doc: >
      Trimmed sequence file, converted from FASTQ to FASTA.
  submitted_seq_count:
    type: int
    label: 'Number of submitted sequences'
    doc: >
      Number of originally submitted sequences as in the user
      submitted FASTQ file - single end FASTQ or pair end merged FASTQ file.
  # stats_file_name:
  #   type: string
  #   default: stats_summary
  #   label: 'Post QC stats output file name'
  #   doc: >
  #     Give a name for the file which will hold the stats after QC.
  min_length:
    type: int
    default: 100 # For assemblies we need to set this in the input YAML to 500
    label: 'Minimum read or contig length'
    doc: >
      Specify the minimum read or contig length for sequences to pass QC filtering.
  input_file_format: string


outputs:
  filtered_file:
    label: Filtered output file
    format: edam:format_1929  # FASTA
    type: File
    outputBinding:
      glob: $(inputs.seq_file.nameroot).fasta
  stats_summary_file:
    label: Stats summary output file
    type: File
    outputBinding:
      glob: $(inputs.seq_file.nameroot).qc_summary

arguments:
   - position: 2
     valueFrom: $(inputs.seq_file.nameroot).fasta
   - position: 3
     valueFrom: $(inputs.seq_file.nameroot).qc_summary
   - position: 4
     valueFrom: $(inputs.submitted_seq_count)
   - position: 5
     prefix: '--min_length'
     valueFrom: $(inputs.min_length)
   - position: 6
     prefix: '--extension'
     valueFrom: $(inputs.input_file_format)
baseCommand: ["MGRAST_base.py" ]

inputs:
  QCed_reads:
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      prefix: -i
  length_sum:
    label: Prefix for the files associated with sequence length distribution
    type: string
    default: seq-length.out
  gc_sum:
    label: Prefix for the files associated with GC distribution
    type: string
    default: GC-distribution.out
  nucleotide_distribution:
    label: Prefix for the files associated with nucleotide distribution
    type: string
    default: nucleotide-distribution.out
  summary:
    label: File names for summary of sequences, e.g. number, min/max length etc.
    type: string
    default: summary.out
  max_seq:
    label: Maximum number of sequences to sub-sample 
    type: int?
    default: 2000000
  out_dir_name:
    label: Specifies output subdirectory
    type: string
    default: qc-statistics
  sequence_count:
    label: Specifies the number of sequences in the input read file (FASTA formatted)
    type: int


outputs:
  output_dir:
    label: Contains all stats output files
    type: Directory
    outputBinding:
      glob: $(inputs.out_dir_name)
  summary_out:
    label: Contains the summary statistics for the input sequence file
    type: File
    format: iana:text/plain
    outputBinding:
      glob: $(inputs.out_dir_name)/$(inputs.summary)

arguments:
   - position: 1
     prefix: '-o'
     valueFrom: $(inputs.out_dir_name)/$(inputs.summary)
   - position: 2
     prefix: '-d'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.nucleotide_distribution, suffix);
       }
   - position: 3
     prefix: '-g'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.gc_sum, suffix);
       }
   - position: 4
     prefix: '-l'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.length_sum, suffix);
       }
   - position: 5
     valueFrom: ${ if (inputs.sequence_count > inputs.max_seq) { return '-m '.concat(inputs.max_seq)} else { return ''} }
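The three `valueFrom` blocks above share one naming rule: outputs get a `.full` suffix when all sequences are used, and `.sub-set` when the input exceeds `max_seq` and is sub-sampled. A compact Python equivalent of that rule (using the defaults from the inputs above):

```python
# Mirror of the repeated suffix logic: '.sub-set' when the input is larger
# than max_seq (and so gets sub-sampled), '.full' otherwise.
def stats_path(out_dir, prefix, sequence_count, max_seq=2000000):
    suffix = ".sub-set" if sequence_count > max_seq else ".full"
    return f"{out_dir}/{prefix}{suffix}"
```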
baseCommand: [clean_motus_output.sh]
baseCommand: [ motus ]

inputs:
  reads:
    type: File
    inputBinding:
      position: 1
      prefix: -s
    label: merged and QC reads in fastq
    # format: edam:format_1930  # FASTQ

  threads:
    type: int
    inputBinding:
      prefix: -t
    default: 4


arguments: [profile, -c, -q]
baseCommand: [ "biom-convert.sh" ]

inputs:
  biom:
    type: File?
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp

  table_type:
    type: string? #biom-convert-table.yaml#table_type?
    inputBinding:
      prefix: --table-type  # --table-type=    <- worked for cwlexec
      separate: true # false                                  <- worked for cwlexec
      valueFrom: $(inputs.table_type)  # $('"' + inputs.table_type + '"')      <- worked for cwlexec

  json:
    type: boolean?
    label: Output as JSON-formatted table.
    inputBinding:
      prefix: --to-json

  hdf5:
    type: boolean?
    label: Output as HDF5-formatted table.
    inputBinding:
      prefix: --to-hdf5

  tsv:
    type: boolean?
    label: Output as TSV-formatted (classic) table.
    inputBinding:
      prefix: --to-tsv

  header_key:
    type: string?
    doc: |
      The observation metadata to include from the input BIOM table file when
      creating a tsv table file. By default no observation metadata will be
      included.
    inputBinding:
      prefix: --header-key

arguments:
  - valueFrom: |
     ${ var ext = "";
        if (inputs.json) { ext = "_json.biom"; }
        if (inputs.hdf5) { ext = "_hdf5.biom"; }
        if (inputs.tsv) { ext = "_tsv.biom"; }
        var pre = inputs.biom.nameroot.split('.');
        pre.pop()
        return pre.join('.') + ext; }
    prefix: --output-fp
  - valueFrom: "--collapsed-observations"
baseCommand: [ cmsearch-deoverlap.pl ]

inputs:
  - id: clan_information
    type: [string?, File?]
    inputBinding:
      position: 0
      prefix: '--clanin'
    label: clan information on the models provided
    doc: Not all models provided need to be a member of a clan
  - id: cmsearch_matches
    type: File
    format: edam:format_3475
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
baseCommand: [ cmsearch ]

inputs:
  - id: covariance_model_database
    type: [string, File]
    inputBinding:
      position: 1
  - id: cpu
    type: int?
    inputBinding:
      position: 0
      prefix: '--cpu'
    label: Number of parallel CPU workers to use for multithreads
  - default: false
    id: cut_ga
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--cut_ga'
    label: use CM's GA gathering cutoffs as reporting thresholds
  - id: omit_alignment_section
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--noali'
    label: Omit the alignment section from the main output.
    doc: This can greatly reduce the output volume.
  - default: false
    id: only_hmm
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--hmmonly'
    label: 'Only use the filter profile HMM for searches, do not use the CM'
    doc: |
      Only filter stages F1 through F3 will be executed, using strict P-value
      thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3). Additionally
      a bias composition filter is used after the F1 stage (with P=0.02
      survival threshold). Any hit that survives all stages and has an HMM
      E-value or bit score above the reporting threshold will be output.
  - id: query_sequences
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      position: 2
    # streamable: true
  - id: search_space_size
    type: int
    inputBinding:
      position: 0
      prefix: '-Z'
    label: search space size in *Mb* to <x> for E-value calculations

arguments:
  - position: 0
    prefix: '--tblout'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database === "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch_matches.tbl";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch_matches.tbl";
        }
        return name;
      }
  - position: 0
    prefix: '-o'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch.out";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch.out";
        }
        return name;
      }
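Both output names above follow the same pattern: the query's basename, a dot, the covariance model database's stem, and a fixed suffix; the only branch is whether the database arrived as a string path (keep the last path component) or as a File (use its nameroot). A Python sketch of the `--tblout` case, with the File's nameroot modeled as a pre-computed string for simplicity:

```python
# Mirror of the --tblout naming expression: string databases keep only the
# final path component; File databases would use their nameroot (passed in
# here as a plain string).
def tblout_name(query_basename, cm_db, cm_db_is_string=True):
    stem = cm_db.split("/")[-1] if cm_db_is_string else cm_db
    return f"{query_basename}.{stem}.cmsearch_matches.tbl"
```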
baseCommand: [ esl-index.sh ]
baseCommand: [ esl-sfetch ]
baseCommand: get_subunits_coords.py
baseCommand: get_subunits.py
baseCommand: ktImportText

arguments:
  - valueFrom: "krona.html"
    prefix: -o
baseCommand: [ 'mapseq2biom.pl' ]

inputs:
  otu_table:
    type: [string, File]
    doc: |
      the OTU table produced for the taxonomies found in the reference
      databases that was used with MAPseq
    inputBinding:
      prefix: --otuTable 

  query:
    type: File
    label: the output from the MAPseq that assigns a taxonomy to a sequence
    format: iana:text/tab-separated-values
    inputBinding:
      prefix: --query

  label:
    type: string
    label: label to add to the top of the outfile OTU table
    inputBinding:
      prefix: --label

  taxid_flag:
    type: boolean?
    label: output NCBI taxids for all databases bar UNITE
    inputBinding:
        prefix: --taxid

arguments:
  - valueFrom: $(inputs.query.basename).tsv
    prefix: --outfile
  - valueFrom: $(inputs.query.basename).txt
    prefix: --krona
  - valueFrom: $(inputs.query.basename).notaxid.tsv
    prefix: --notaxidfile
baseCommand: mapseq

inputs:

  prefix: File

  sequences:
    type: File
    inputBinding:
      position: 1
    format: edam:format_1929  # FASTA

  database:
    type: File
    inputBinding:
      position: 2
    secondaryFiles: .mscluster
    format: edam:format_1929  

  taxonomy:
    type: [string, File]
    inputBinding:
      position: 4

  threads: 
    type: int?
    default: 8
    inputBinding:
      prefix: "-nthreads"
      position: 5


arguments: ['-tophits', '80', '-topotus', '40', '-outfmt', 'simple']
baseCommand: [pull_ncrnas.sh]
baseCommand: [functional_stats.py]
baseCommand: [write_summaries.py]
baseCommand: [ add_header ]

inputs:
  input_table:
    #format: [edam:format_3475, edam:format_2333]
    type: File
    inputBinding:
      prefix: -i
  header:
    type: string
    inputBinding:
      prefix: -h
baseCommand: [ count_lines.py ]

inputs:
  sequences:
    type: File
    inputBinding:
      prefix: -f
  number:
    type: int
    inputBinding:
      prefix: -n
baseCommand: [ bash ]

arguments:
  - valueFrom: |
      expr \$(cat $(inputs.input_file.path) | wc -l)
    prefix: -c
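The `bash -c` wrapper above just reports the number of lines in the input file, as `cat file | wc -l` would. A plain Python equivalent, useful when testing the step outside the workflow (the streaming read keeps memory flat on large FASTQ/FASTA files):

```python
# Count newline-terminated lines in a file, equivalent to `wc -l`,
# reading in fixed-size chunks so large files don't load into memory.
def count_lines(path):
    with open(path, "rb") as fh:
        return sum(chunk.count(b"\n")
                   for chunk in iter(lambda: fh.read(1 << 16), b""))
```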
arguments:
  - valueFrom: $(inputs.fastq.nameroot).unclean
    prefix: '-o'

baseCommand: [ fastq_to_fasta.py ]
baseCommand: [ generate_checksum.py ]
baseCommand: [ pigz ]
arguments: ["-p", "8", "-c"]
arguments:
    - prefix: -n
      valueFrom: |
        ${
          if (inputs.size_limit) { return inputs.size_limit }
          if (inputs.type_fasta == 'n') {
            return 1980
          }
          if (inputs.type_fasta == 'p') {
            return 1442
          }
        }

baseCommand: [ split_fasta_by_size.sh ]
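The `-n` expression above resolves the chunk limit in priority order: an explicit `size_limit` wins; otherwise nucleotide FASTA (`type_fasta == 'n'`) uses 1980 and protein FASTA (`'p'`) uses 1442. As a Python one-liner of the same rule:

```python
# Mirror of the -n valueFrom expression: explicit size_limit overrides the
# per-sequence-type defaults (1980 for nucleotide, 1442 for protein FASTA).
def chunk_limit(type_fasta, size_limit=None):
    if size_limit:
        return size_limit
    return {"n": 1980, "p": 1442}.get(type_fasta)
```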
baseCommand: [ megahit ]

inputs:

  memory:
    type: float?
    label: Memory to run assembly. When 0 < -m < 1, fraction of all available memory of the machine is used, otherwise it specifies the memory in BYTE.
    default: 0.9
    inputBinding:
      position: 4
      prefix: "--memory"

  min-contig-len:
    type: int?
    default: 500
    inputBinding:
      position: 3
      prefix: "--min-contig-len"

  forward_reads:
    type:
      - File?
      - type: array
        items: File
    inputBinding:
      position: 1
      prefix: "-1"

  reverse_reads:
    type:
      - File?
      - type: array
        items: File
    inputBinding:
      position: 2
      prefix: "-2"

  threads: 
    type: int
    default: 1
    inputBinding: 
      position: 5
      prefix: "--num-cpu-threads"


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/emo-bon/MetaGOflow.git
Name: a-workflow-for-marine-genomic-observatories-data-a
Version: eosc-life-gos @ deb5427


Downloaded: 0
Copyright: Public Domain
License: Boost Software License 1.0
