MGnify - raw-reads analysis pipeline


MGnify (http://www.ebi.ac.uk/metagenomics) provides a free-to-use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. Over the past 2 years, MGnify (formerly EBI Metagenomics) has more than doubled the number of publicly available analysed datasets held within the resource. Recently, an updated approach to data analysis has been unveiled (version 5.0), replacing the previous single pipeline with multiple analysis pipelines that are tailored according to the input data, and that are formally described using the Common Workflow Language, enabling greater provenance, reusability, and reproducibility. MGnify's new analysis pipelines offer additional approaches for taxonomic assertions based on ribosomal internal transcribed spacer regions (ITS1/2) and expanded protein functional annotations. Biochemical pathways and systems predictions have also been added for assembled contigs. MGnify's growing focus on the assembly of metagenomic data has also seen the number of datasets it has assembled and analysed increase six-fold. The non-redundant protein database constructed from the proteins encoded by these assemblies now exceeds 1 billion sequences. Meanwhile, a newly developed contig viewer provides fine-grained visualisation of the assembled contigs and their enriched annotations.

Documentation: https://docs.mgnify.org/en/latest/analysis.html#raw-reads-analysis-pipeline

Code Snippets

baseCommand: [ run_antismash_short.sh ]

baseCommand: [ change_antismash_output.py ]

baseCommand: [ change_geneclusters_ctg.py ]

baseCommand: [antismash_to_gff.py]

inputs:
  antismash_geneclus:
    type: File
    inputBinding:
      prefix: -g
  antismash_embl:
    type: File
    inputBinding:
      prefix: -e
  output_name:
    type: string
    inputBinding:
      prefix: -o

baseCommand: [reformat_antismash.py]

inputs:
  glossary:
    type: string
    inputBinding:
      position: 1
      prefix: -g
  geneclusters:
    type: File
    inputBinding:
        position: 2
        prefix: -a

baseCommand: [ antismash_rename_contigs.py ]

baseCommand: [move_antismash_summary.py]

baseCommand:
  - diamond
  - blastp
inputs:
  - id: blockSize
    type: float?
    inputBinding:
      position: 0
      prefix: '--block-size'
    label: sequence block size in billions of letters (default=2.0)
  - id: databaseFile
    type: string
    inputBinding:
      position: 0
      prefix: '--db'
    label: DIAMOND database input file
    doc: Path to the DIAMOND database file.
  - id: outputFormat
    type: string?  # Diamond-output_formats.yaml#output_formats?
    inputBinding:
      position: 0
      prefix: '--outfmt'
    label: Format of the output file
    doc: |-
      0   = BLAST pairwise
      5   = BLAST XML
      6   = BLAST tabular
      100 = DIAMOND alignment archive (DAA)
      101 = SAM

      Value 6 may be followed by a space-separated list of these keywords
  - id: minOrf
    type: int?
    inputBinding:
      position: 0
      prefix: '--min-orf'
    label: Minimum ORF length filter for translated query sequences
    doc: >
      Ignore translated sequences that do not contain an open reading frame of
      at least this length.

      By default this feature is disabled for sequences of length below 30, set
      to 20 for sequences of length below 100, and set to 40 otherwise. Setting
      this option to 1 will disable this feature.
  - id: queryInputFile
    format: edam:format_1929
    type: File
    inputBinding:
      position: 0
      prefix: '--query'
    label: Query input file in FASTA
    doc: >
      Path to the query input file in FASTA or FASTQ format (may be gzip
      compressed). If this parameter is omitted, the input will be read from
      stdin
  - id: strand
    type: string?  # Diamond-strand_values.yaml#strand?
    inputBinding:
      position: -3
      prefix: '--strand'
    label: Set strand of query to align for translated searches
    doc: >-
      Set strand of query to align for translated searches. By default both
      strands are searched. Valid values are {both, plus, minus}
  - id: taxonList
    type: 'int[]?'
    inputBinding:
      position: 0
      prefix: '--taxonlist'
    label: NCBI taxonomic IDs to filter the database by
    doc: >
      Comma-separated list of NCBI taxonomic IDs to filter the database by. Any
      taxonomic rank can be used, and only reference sequences matching one of
      the specified taxon ids will be searched against. Using this option
      requires setting the --taxonmap and --taxonnodes parameters for makedb.
  - id: threads
    type: int?
    inputBinding:
      position: 0
      prefix: '--threads'
    label: Number of CPU threads
    doc: >
      Number of CPU threads. By default, the program will auto-detect and use
      all available virtual cores on the machine.
  - id: maxTargetSeqs
    type: int?
    inputBinding:
      position: 0
      prefix: '--max-target-seqs'
    label: Max number of target sequences per query
    doc: >
      The maximum number of target sequences per query to report alignments for (default=25).
      Setting this to 0 will report all alignments that were found.
  - id: top
    type: int?
    inputBinding:
      position: 0
      prefix: '--top'
    label: Percentage range of the top alignment score
    doc: >
      Report alignments within the given percentage range of the top alignment score for a query
      (overrides --max-target-seqs option). For example, setting this to 10 will report all
      alignments whose score is at most 10% lower than the best alignment score for a query.



outputs:
  - id: matches
    type: File
    outputBinding:
      glob: $(inputs.queryInputFile.basename).diamond_matches
    format: edam:format_2333
doc: |
  DIAMOND is a sequence aligner for protein and translated DNA searches,
  designed for high performance analysis of big sequence data.

  The key features are:
        + Pairwise alignment of proteins and translated DNA at 500x-20,000x speed of BLAST.
        + Frameshift alignments for long read analysis.
        + Low resource requirements and suitable for running on standard desktops or laptops.
        + Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

  Please visit https://github.com/bbuchfink/diamond for full documentation.

  Releases can be downloaded from https://github.com/bbuchfink/diamond/releases
label: Aligns DNA query sequences against a protein reference database

arguments:
  - position: 0
    prefix: '--out'
    valueFrom: $(inputs.queryInputFile.basename).diamond_matches
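The tool derives its --out file name from the query's basename, so the matches file tracks the input automatically. As a minimal illustration (not part of the pipeline; the database path, query file name, and .cwl file name below are assumed), a cwltool job file for this description could look like:

databaseFile: /refs/uniref90.dmnd        # assumed path to a prebuilt DIAMOND database
queryInputFile:
  class: File
  path: predicted_proteins.faa           # assumed query FASTA
outputFormat: '6'                        # BLAST tabular, per the doc above
threads: 4
maxTargetSeqs: 25                        # the documented default

Run with, e.g., cwltool Diamond.blastp.cwl diamond-job.yml; the matches output would then be globbed as predicted_proteins.faa.diamond_matches.
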
baseCommand: [diamond_post_run_join.sh]

inputs:
  input_diamond:
    format: edam:format_2333
    type: File
    inputBinding:
      separate: true
      prefix: -i
  input_db:
    type: string
    inputBinding:
      separate: true
      prefix: -d
  filename: string

baseCommand: [emapper_wrapper.sh]

inputs:
  fasta_file:
    format: edam:format_1929  # FASTA
    type: File?
    inputBinding:
      separate: true
      prefix: -i
    label: Input FASTA file containing query sequences

  db:
    type: string?  # data/eggnog.db
    inputBinding:
      prefix: --database
    label: specify the target database for sequence searches (euk, bact, arch, host:port, or a local hmmpressed database)

  db_diamond:
    type: string?  # data/eggnog_proteins.dmnd
    inputBinding:
      prefix: --dmnd_db
    label: Path to DIAMOND-compatible database

  data_dir:
    type: string?  # data/
    inputBinding:
      prefix: --data_dir
    label: Directory to use for DATA_PATH

  mode:
    type: string?
    inputBinding:
      prefix: -m
    label: hmmer or diamond

  no_annot:
    type: boolean?
    inputBinding:
      prefix: --no_annot
    label: Skip functional annotation, reporting only hits

  no_file_comments:
    type: boolean?
    inputBinding:
      prefix: --no_file_comments
    label: Do not include header lines or stats in the output files

  cpu:
    type: int?
    inputBinding:
      prefix: --cpu

  annotate_hits_table:
    type: File?
    inputBinding:
      prefix: --annotate_hits_table
    label: Annotate a TSV-formatted table of query->hits

  output:
    type: string?
    inputBinding:
      prefix: -o
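A sketch of a diamond-mode job for this wrapper (all paths and the output prefix below are placeholders, not pipeline values):

fasta_file:
  class: File
  path: predicted_proteins.faa           # assumed
mode: diamond
db_diamond: /refs/eggnog_proteins.dmnd   # assumed, cf. the data/ hints above
data_dir: /refs/eggnog-data              # assumed
no_file_comments: true
cpu: 8
output: sample.emapper                   # assumed output prefix
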
baseCommand: [assign_genome_properties.pl]    # without docker

arguments:
  - position: 1
    valueFrom: "-all"
  - position: 2
    valueFrom: "table"
    prefix: "-outfiles"
  - position: 3
    valueFrom: "web_json"
    prefix: "-outfiles"
  - position: 4
    valueFrom: "summary"
    prefix: "-outfiles"

inputs:
  input_tsv_file:
    type: File
    format: edam:format_3475
    inputBinding:
      separate: true
      prefix: "-matches"

  flatfiles_path:
    type: string
    inputBinding:
      prefix: "-gpdir"
  GP_txt:
    type: string
    inputBinding:
      prefix: "-gpff"

  out_dir:
    type: string?
    inputBinding:
      prefix: "-outdir"
  name:
    type: string?
    inputBinding:
      prefix: "-name"
baseCommand: [ build_assembly_gff.py ]

inputs:
  ips_results:
    type: File
    format: edam:format_3475
    inputBinding:
      prefix: -i
  eggnog_results:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: -e
  input_faa:
    format: edam:format_1929
    type: File
    inputBinding:
      prefix: -f
  output_name:
    type: string
    inputBinding:
      prefix: -o

arguments: ["-n", $(inputs.fasta.basename)]

baseCommand: [ "run_samtools.sh" ]
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
baseCommand: [give_pathways.py]

inputs:
  input_table:
    format: edam:format_3475  # TSV
    type: File
    inputBinding:
      separate: true
      prefix: -i
  graphs:
    type: string
    inputBinding:
      prefix: -g
  pathways_names:
    type: string
    inputBinding:
      prefix: -n
  pathways_classes:
    type: string
    inputBinding:
      prefix: -c
  outputname:
    type: string
    inputBinding:
      prefix: -o

baseCommand: ['parsing_hmmscan.py']

inputs:
  table:
    format: edam:format_3475
    type: File
    inputBinding:
      separate: true
      prefix: -i
  fasta:
    type: File
    inputBinding:
      separate: true
      prefix: -f

baseCommand: [ esl-ssplit.sh ]

arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11


baseCommand: [ split_to_chunks.py ]

baseCommand: [ run_FGS.sh ]

inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
  output:
    type: string
    inputBinding:
      separate: true
      prefix: "-o"
  seq_type:
    type: string
    inputBinding:
      separate: true
      prefix: "-s"
  train:
    type: string
    inputBinding:
      separate: true
      prefix: "-t"
    default: "illumina_5"
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
baseCommand: [ unite_protein_predictions.py ]

inputs:
  masking_file:
    type: File
    inputBinding:
      prefix: "--mask"
  predicted_proteins_prodigal_out:
    type: File?
    inputBinding:
      prefix: "--prodigal-out"
  predicted_proteins_prodigal_ffn:
    type: File?
    inputBinding:
      prefix: "--prodigal-ffn"
  predicted_proteins_prodigal_faa:
    type: File?
    inputBinding:
      prefix: "--prodigal-faa"
  predicted_proteins_fgs_out:
    type: File
    inputBinding:
      prefix: "--fgs-out"
  predicted_proteins_fgs_ffn:
    type: File
    inputBinding:
      prefix: "--fgs-ffn"
  predicted_proteins_fgs_faa:
    inputBinding:
      prefix: "--fgs-faa"
    type: File
  basename:
    inputBinding:
      prefix: "--name"
    type: string
  genecaller_order:
    inputBinding:
      prefix: "--caller-priority"
    type: string

baseCommand: [ prodigal ]

arguments:
  - valueFrom: "sco"
    prefix: "-f"
  - valueFrom: "meta"
    prefix: "-p"
  - valueFrom: $(inputs.input_fasta.basename).prodigal
    prefix: "-o"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.ffn
    prefix: "-d"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.faa
    prefix: "-a"

inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
39
baseCommand: ["go_summary_pipeline-1.0.py"]
23
24
25
26
27
baseCommand: [ hmmscan_tab.py ]  # old was with sed

arguments:
  - valueFrom: $(inputs.input_table.nameroot).tsv
    prefix: -o

baseCommand: ["hmmsearch"]

arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
  - prefix: --domtblout
    valueFrom: $(inputs.seqfile.nameroot)_hmmsearch.tbl
    position: 2
  - prefix: --cpu
    valueFrom: '4'
  - prefix: -o
    valueFrom: '/dev/null'

inputs:

  omit_alignment:
    type: boolean?
    inputBinding:
      position: 1
      prefix: "--noali"

  gathering_bit_score:
    type: boolean?
    inputBinding:
      position: 4
      prefix: "--cut_ga"

  path_database:
    type: string
    inputBinding:
      position: 5

  seqfile:
    format: edam:format_1929  # FASTA
    type: File
    inputBinding:
      position: 6
      separate: true
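A job sketch for this tool (the HMM database path and FASTA name are assumed placeholders):

omit_alignment: true
gathering_bit_score: true
path_database: /refs/Pfam-A.hmm          # assumed HMM database
seqfile:
  class: File
  path: predicted_proteins.faa           # assumed

With the arguments above, the per-domain hits land in predicted_proteins_hmmsearch.tbl while stdout and stderr are discarded to /dev/null.
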
baseCommand: interproscan.sh
inputs:
  - id: inputFile
    type: File
    format: edam:format_1929
    inputBinding:
      position: 8
      prefix: '--input'
    label: Input file path
    doc: >-
      Optional, path to fasta file that should be loaded on Master startup.
      Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
  - id: applications
    type: string[]?
    inputBinding:
      position: 9
      itemSeparator: ','
      prefix: '--applications'
    label: Analysis
    doc: >-
      Optional, comma separated list of analyses. If this option is not set, ALL
      analyses will be run.
  - id: outputFormat
    type: string[]
    inputBinding:
      position: 10
      itemSeparator: ','
      prefix: '--formats'
    label: output format
    doc: >-
      Optional, case-insensitive, comma separated list of output formats.
      Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for
      protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3
      and XML.
  - id: databases
    type: string? #Directory?
  - id: disableResidueAnnotation
    type: boolean?
    inputBinding:
      position: 11
      prefix: '--disable-residue-annot'
    label: Disables residue annotation
    doc: 'Optional, excludes sites from the XML, JSON output.'
  - id: seqtype
    type:
      - 'null'
      - type: enum
        symbols:
          - p
          - n
        name: seqtype
    inputBinding:
      position: 12
      prefix: '--seqtype'
    label: Sequence type
    doc: >-
      Optional, the type of the input sequences (dna/rna (n) or protein (p)).
      The default sequence type is protein.
outputs:
  - id: i5Annotations
    format: edam:format_3475
    type: File
    outputBinding:
      glob: $(inputs.inputFile.nameroot).f*.tsv
doc: >-
  InterProScan is the software package that allows sequences (protein and
  nucleic) to be scanned against InterPro's signatures. Signatures are
  predictive models, provided by the several different databases that make up
  the InterPro consortium.
  This tool description uses a Docker container tagged as version v5.30-69.0.
  Documentation on how to run InterProScan 5 can be found here:
  https://github.com/ebi-pf-team/interproscan/wiki/HowToRun
label: 'InterProScan: protein sequence classifier'
arguments:
  - position: 0
    valueFrom: '--disable-precalc'
  - position: 1
    valueFrom: '--goterms'
  - position: 2
    valueFrom: '--pathways'
  - position: 3
    prefix: '--tempdir'
    valueFrom: $(runtime.tmpdir)
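A job sketch for this description (the input file name is assumed; the analysis listed is one example of a valid InterProScan application):

inputFile:
  class: File
  path: predicted_proteins.faa   # assumed
applications:
  - Pfam
outputFormat:
  - TSV

Given the output glob above, the annotations would be collected as predicted_proteins.f*.tsv, with --goterms, --pathways, and --disable-precalc always applied.
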
baseCommand: [bedtools, maskfasta]

arguments:
  - valueFrom: ITS_masked.fasta
    prefix: -fo

baseCommand: [format_bedfile]

# reverse start and end where start < end (i.e. neg strand)

baseCommand: [ its-length-new.py ]

baseCommand: ["run_quality_filtering.py"]

inputs:
  seq_file:
    type: File
    # format: edam:format_1929  # FASTA
    inputBinding:
      position: 1
    label: 'Trimmed sequence file'
    doc: >
      Trimmed and FASTQ to FASTA converted sequences file.
  submitted_seq_count:
    type: int
    label: 'Number of submitted sequences'
    doc: >
      Number of originally submitted sequences as in the user
      submitted FASTQ file - single end FASTQ or pair end merged FASTQ file.
  stats_file_name:
    type: string
    default: stats_summary
    label: 'Post QC stats output file name'
    doc: >
      Give a name for the file which will hold the stats after QC.
  min_length:
    type: int
    default: 100 # For assemblies we need to set this in the input YAML to 500
    label: 'Minimum read or contig length'
    doc: >
      Specify the minimum read or contig length for sequences to pass QC filtering.
  input_file_format: string


outputs:
  filtered_file:
    label: Filtered output file
    format: edam:format_1929  # FASTA
    type: File
    outputBinding:
      glob: $(inputs.seq_file.nameroot).fasta
  stats_summary_file:
    label: Stats summary output file
    type: File
    outputBinding:
      glob: $(inputs.stats_file_name)

arguments:
   - position: 2
     valueFrom: $(inputs.seq_file.nameroot).fasta
   - position: 3
     valueFrom: $(inputs.stats_file_name)
   - position: 4
     valueFrom: $(inputs.submitted_seq_count)
   - position: 5
     prefix: '--min_length'
     valueFrom: $(inputs.min_length)
   - position: 6
     prefix: '--extension'
     valueFrom: $(inputs.input_file_format)
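For illustration, a job file for this step might look like the sketch below (file name and count are assumed; min_length stays at 100 for reads and is raised to 500 for assemblies, per the comment above):

seq_file:
  class: File
  path: sample.fasta          # assumed: trimmed, FASTQ-to-FASTA converted reads
submitted_seq_count: 100000   # assumed
stats_file_name: stats_summary
min_length: 100
input_file_format: fasta      # assumed extension value

The filtered reads are then globbed as sample.fasta (i.e. $(seq_file.nameroot).fasta) next to the stats_summary file.
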
baseCommand: ["MGRAST_base.py" ]

inputs:
  QCed_reads:
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      prefix: -i
  length_sum:
    label: Prefix for the files associated with sequence length distribution
    type: string
    default: seq-length.out
  gc_sum:
    label: Prefix for the files associated with GC distribution
    type: string
    default: GC-distribution.out
  nucleotide_distribution:
    label: Prefix for the files associated with nucleotide distribution
    type: string
    default: nucleotide-distribution.out
  summary:
    label: File name for the summary of sequences, e.g. number, min/max length, etc.
    type: string
    default: summary.out
  max_seq:
    label: Maximum number of sequences to sub-sample 
    type: int?
    default: 2000000
  out_dir_name:
    label: Specifies output subdirectory
    type: string
    default: qc-statistics
  sequence_count:
    label: Specifies the number of sequences in the input read file (FASTA formatted)
    type: int


outputs:
  output_dir:
    label: Contains all stats output files
    type: Directory
    outputBinding:
      glob: $(inputs.out_dir_name)
  summary_out:
    label: Contains the summary statistics for the input sequence file
    type: File
    format: iana:text/plain
    outputBinding:
      glob: $(inputs.out_dir_name)/$(inputs.summary)

arguments:
   - position: 1
     prefix: '-o'
     valueFrom: $(inputs.out_dir_name)/$(inputs.summary)
   - position: 2
     prefix: '-d'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.nucleotide_distribution, suffix);
       }
   - position: 3
     prefix: '-g'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.gc_sum, suffix);
       }
   - position: 4
     prefix: '-l'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.length_sum, suffix);
       }
   - position: 5
     valueFrom: ${ if (inputs.sequence_count > inputs.max_seq) { return '-m '.concat(inputs.max_seq)} else { return ''} }
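To make the suffix logic concrete: with the default max_seq of 2000000, a run whose sequence_count is 3000000 writes its distributions with the '.sub-set' suffix (e.g. qc-statistics/GC-distribution.out.sub-set) and appends '-m 2000000', while a run at or below the threshold gets the '.full' suffix and no -m flag.
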
baseCommand: [clean_motus_output.sh]

baseCommand: [motus]

arguments: [profile, -c, -q]

baseCommand: [ "biom-convert.sh" ]

inputs:
  biom:
    type: File?
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp

  table_type:
    type: string? #biom-convert-table.yaml#table_type?
    inputBinding:
      prefix: --table-type  # --table-type=    <- worked for cwlexec
      separate: true # false                                  <- worked for cwlexec
      valueFrom: $(inputs.table_type)  # $('"' + inputs.table_type + '"')      <- worked for cwlexec

  json:
    type: boolean?
    label: Output as JSON-formatted table.
    inputBinding:
      prefix: --to-json

  hdf5:
    type: boolean?
    label: Output as HDF5-formatted table.
    inputBinding:
      prefix: --to-hdf5

  tsv:
    type: boolean?
    label: Output as TSV-formatted (classic) table.
    inputBinding:
      prefix: --to-tsv

  header_key:
    type: string?
    doc: |
      The observation metadata to include from the input BIOM table file when
      creating a tsv table file. By default no observation metadata will be
      included.
    inputBinding:
      prefix: --header-key

arguments:
  - valueFrom: |
     ${ var ext = "";
        if (inputs.json) { ext = "_json.biom"; }
        if (inputs.hdf5) { ext = "_hdf5.biom"; }
        if (inputs.tsv) { ext = "_tsv.biom"; }
        var pre = inputs.biom.nameroot.split('.');
        pre.pop()
        return pre.join('.') + ext; }
    prefix: --output-fp
  - valueFrom: "--collapsed-observations"
baseCommand: [ cmsearch-deoverlap.pl ]

inputs:
  - id: clan_information
    type: string?
    inputBinding:
      position: 0
      prefix: '--clanin'
    label: clan information on the models provided
    doc: Not all models provided need to be a member of a clan
  - id: cmsearch_matches
    type: File
    format: edam:format_3475
    inputBinding:
      position: 1
      valueFrom: $(self.basename)

baseCommand:
  - cmsearch
inputs:
  - id: covariance_model_database
    type: [ string, File ]
    inputBinding:
      position: 1
  - id: cpu
    type: int?
    inputBinding:
      position: 0
      prefix: '--cpu'
    label: Number of parallel CPU workers to use for multithreads
  - default: false
    id: cut_ga
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--cut_ga'
    label: use CM's GA gathering cutoffs as reporting thresholds
  - id: omit_alignment_section
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--noali'
    label: Omit the alignment section from the main output.
    doc: This can greatly reduce the output volume.
  - default: false
    id: only_hmm
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--hmmonly'
    label: 'Only use the filter profile HMM for searches, do not use the CM'
    doc: |
      Only filter stages F1 through F3 will be executed, using strict P-value
      thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3). Additionally
      a bias composition filter is used after the F1 stage (with P=0.02
      survival threshold). Any hit that survives all stages and has an HMM
      E-value or bit score above the reporting threshold will be output.
  - id: query_sequences
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      position: 2
    # streamable: true
  - id: search_space_size
    type: int
    inputBinding:
      position: 0
      prefix: '-Z'
    label: search space size in *Mb* to <x> for E-value calculations

outputs:
  - id: matches
    doc: 'http://eddylab.org/infernal/Userguide.pdf#page=60'
    label: 'target hits table, format 2'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch_matches.tbl";
          } else {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch_matches.tbl";
          }
          return name;
        }
  - id: programOutput
    label: 'direct output to file, not stdout'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch.out";
          } else {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch.out";
          }
          return name;
        }

doc: >
  Infernal ("INFERence of RNA ALignment") is for searching DNA sequence
  databases for RNA structure and sequence similarities. It is an implementation
  of a special case of profile stochastic context-free grammars called
  covariance models (CMs). A CM is like a sequence profile, but it scores a
  combination of sequence consensus and RNA secondary structure consensus,
  so in many cases, it is more capable of identifying RNA homologs that
  conserve their secondary structure more than their primary sequence.

  Please visit http://eddylab.org/infernal/ for full documentation.

  Version 1.1.2 can be downloaded from
  http://eddylab.org/infernal/infernal-1.1.2.tar.gz
label: Search sequence(s) against a covariance model database

arguments:
  - position: 0
    prefix: '--tblout'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch_matches.tbl";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch_matches.tbl";
        }
        return name;
      }
  - position: 0
    prefix: '-o'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch.out";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch.out";
        }
        return name;
      }
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11

hints:
  - class: SoftwareRequirement
    packages:
      infernal:
        specs:
          - 'https://identifiers.org/rrid/RRID:SCR_011809'
        version:
          - 1.1.2
  - class: DockerRequirement
    dockerPull: 'quay.io/biocontainers/infernal:1.1.2--h470a237_1'
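To make the naming expressions concrete: if covariance_model_database is the string /refs/rfam/ribo.cm (a placeholder path) and query_sequences is sample.fasta, the string branch keeps only ribo.cm, so the hits table is globbed as sample.fasta.ribo.cm.cmsearch_matches.tbl and the full program output as sample.fasta.ribo.cm.cmsearch.out; a File-typed database uses its nameroot instead.
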
baseCommand: [ esl-index.sh ]

baseCommand: [ esl-sfetch ]

baseCommand: awk_tool

baseCommand: get_subunits_coords.py

baseCommand: get_subunits.py

baseCommand: ktImportText

arguments:
  - valueFrom: "krona.html"
    prefix: -o

baseCommand: ['mapseq2biom.pl']

arguments:
  - valueFrom: $(inputs.query.basename).tsv
    prefix: --outfile
  - valueFrom: $(inputs.query.basename).txt
    prefix: --krona
  - valueFrom: $(inputs.query.basename).notaxid.tsv
    prefix: --notaxidfile

baseCommand: mapseq
arguments: ['-nthreads', '8', '-tophits', '80', '-topotus', '40', '-outfmt', 'simple']

baseCommand: [pull_ncrnas.sh]

baseCommand: SeqPrep

arguments:
 - "-1"
 - forward_unmerged.fastq.gz
 - "-2"
 - reverse_unmerged.fastq.gz
 - valueFrom: |
     ${ return inputs.namefile.nameroot.split('_')[0] + '_MERGED.fastq.gz' }
   prefix: "-s"
 # - "-3"
 # - forward_discarded.fastq.gz
 # - "-4"
 # - reverse_discarded.fastq.gz
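As an example of the merged-read naming: for a namefile of SRR1234_1.fastq (an assumed name), nameroot.split('_')[0] yields SRR1234, so the merged reads are written to SRR1234_MERGED.fastq.gz alongside the unmerged forward/reverse files.
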
baseCommand: [functional_stats.py]

baseCommand: [write_summaries.py]

baseCommand: [ trimmomatic.sh ]

inputs:
  phred:
    type: string?  #trimmomatic-phred.yaml#phred?
    inputBinding:
      prefix: -phred
      separate: false
      position: 4
    label: 'quality score format'
    doc: >
      Either PHRED "33" or "64" specifies the base quality encoding. Default: 64

  tophred64:
    type: boolean?
    inputBinding:
      position: 12
      prefix: TOPHRED64
      separate: false
    label: 'quality score conversion to phred64'
    doc: >
      This (re)encodes the quality part of the FASTQ file to base 64.

  headcrop:
    type: int?
    inputBinding:
      position: 13
      prefix: 'HEADCROP:'
      separate: false
    label: 'read head trimming'
    doc: >
      Removes the specified number of bases, regardless of quality, from the
      beginning of the read.

  tophred33:
    type: boolean?
    inputBinding:
      position: 12
      prefix: TOPHRED33
      separate: false
    label: 'quality score conversion to phred33'
    doc: >
      This (re)encodes the quality part of the FASTQ file to base 33.

  minlen:
    type: int?
    inputBinding:
      position: 100
      prefix: 'MINLEN:'
      separate: false
    label: 'minimum length read filter'
    doc: >
      This module removes reads that fall below the specified minimal length.
      If required, it should normally be after all other processing steps.
      Reads removed by this step will be counted and included in the "dropped
      reads" count presented in the trimmomatic summary.

  java_opts:
    type: string?
    inputBinding:
      position: 1
      shellQuote: false
    doc: >
      JVM arguments should be a quoted, space separated list
      (e.g. "-Xms128m -Xmx512m")

  leading:
    type: int?
    inputBinding:
      position: 14
      prefix: 'LEADING:'
      separate: false
    label: 'read head quality trimming'
    doc: >
      Remove low quality bases from the beginning. As long as a base has a
      value below this threshold the base is removed and the next base will be
      investigated.

  slidingwindow:
    type: string?  #trimmomatic-sliding_window.yaml#slidingWindow?
    inputBinding:
      position: 15
      prefix: 'SLIDINGWINDOW:'
      separate: false
    label: 'read filtering sliding window'
    doc: >
      Perform a sliding window trimming, cutting once the average quality
      within the window falls below a threshold. By considering multiple
      bases, a single poor quality base will not cause the removal of high
      quality data later in the read.
      <windowSize> specifies the number of bases to average across
      <requiredQuality> specifies the average quality required

  illuminaClip:
    type:  File? #trimmomatic-illumina_clipping.yaml#illuminaClipping?
    inputBinding:
      valueFrom: |
        ${ if ( self ) {
             return "ILLUMINACLIP:" + inputs.illuminaClip.adapters.path + ":"
               + self.seedMismatches + ":" + self.palindromeClipThreshold + ":"
               + self.simpleClipThreshold + ":" + self.minAdapterLength + ":"
               + self.keepBothReads;
           } else {
             return self;
           }
         }
      position: 11
    label: 'sequencing adapter removal'
    doc: >
      Cut adapter and other illumina-specific sequences from the read.

  crop:
    type: int?
    inputBinding:
      position: 13
      prefix: 'CROP:'
      separate: false
    label: 'read cropping'
    doc: >
      Removes bases regardless of quality from the end of the read, so that the
      read has maximally the specified length after this step has been
      performed. Steps performed after CROP might of course further shorten the
      read. The value is the number of bases to keep, from the start of the read.

  reads2:
    type: File?
    inputBinding:
      position: 6
    label: 'FASTQ read file 2'
    doc: >
      FASTQ file of R2 reads in Paired End mode

  reads1:
    type: File
    inputBinding:
      position: 5
    label: 'FASTQ read file 1'
    doc: >
      FASTQ file of reads (R1 reads in Paired End mode)

  avgqual:
    type: int?
    inputBinding:
      position: 101
      prefix: 'AVGQUAL:'
      separate: false
    label: 'minimum average quality required'
    doc: >
      Drop the read if the average quality is below the specified level

  trailing:
    type: int?
    inputBinding:
      position: 14
      prefix: 'TRAILING:'
      separate: false
    label: 'read tail quality filtering'
    doc: >
      Remove low quality bases from the end. As long as a base has a value
      below this threshold the base is removed and the next base (which, as
      Trimmomatic works from the 3' end, is the base preceding the one just
      removed) will be investigated. This approach can be used to remove the
      special Illumina "low quality segment" regions (which are marked with a
      quality score of 2), but we recommend Sliding Window or MaxInfo instead.

  maxinfo:
    type: int?  #trimmomatic-max_info.yaml#maxinfo?
    inputBinding:
      position: 15
      valueFrom: |
        ${ if ( self ) {
             return "MAXINFO:" + self.targetLength + ":" + self.strictness;
           } else {
             return self;
           }
         }
    label: 'maxinfo: read score quality filtering'
    doc: >
      Performs an adaptive quality trim, balancing the benefits of retaining
      longer reads against the costs of retaining bases with errors.
      <targetLength>: This specifies the read length which is likely to allow
      the location of the read within the target sequence to be determined.
      <strictness>: This value, which should be set between 0 and 1, specifies
      the balance between preserving as much read length as possible vs.
      removal of incorrect bases. A low value of this parameter (<0.2) favours
      longer reads, while a high value (>0.8) favours read correctness.

  end_mode:
    type: string  #trimmomatic-end_mode.yaml#end_mode
    inputBinding:
      position: 3
    label: 'read end mode'
    doc: >
      Single End (SE) or Paired End (PE) mode

outputs:
  reads1_trimmed:
    type: File
    format: edam:format_1930  # fastq
    outputBinding:
      glob: $(inputs.reads1.nameroot).trimmed

  log_file:
    type: File
    outputBinding:
      glob: 'trim.log'
    label: Log file
    doc: |
      log of all read trimmings, indicating the following details:
        the read name
        the surviving sequence length
        the location of the first surviving base, aka. the amount trimmed from the start
        the location of the last surviving base in the original read
        the amount trimmed from the end

  reads1_trimmed_unpaired:
    type: File?
    format: edam:format_1930  # fastq
    outputBinding:
      glob: $(inputs.reads1.basename).trimmed.unpaired.fastq

arguments:
- valueFrom: trim.log
  prefix: -trimlog
  position: 4
- valueFrom: $(runtime.cores)
  position: 4
  prefix: -threads
- valueFrom: $(inputs.reads1.nameroot).trimmed
  position: 7
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads1.nameroot + '.trimmed.unpaired.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 8
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads2.nameroot + '.trimmed.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 9
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads2.nameroot + '.trimmed.unpaired.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 10
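A single-end job sketch for this description (the read file name and threshold values are illustrative):

end_mode: SE
reads1:
  class: File
  path: sample.fastq    # assumed
phred: '33'
leading: 3
trailing: 3
slidingwindow: '4:15'   # windowSize:requiredQuality
minlen: 100

The trimmed reads are then globbed as sample.trimmed (from $(inputs.reads1.nameroot).trimmed), with per-read trimming details in trim.log.
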
baseCommand: [ add_header ]

inputs:
  input_table:
    #format: [edam:format_3475, edam:format_2333]
    type: File
    inputBinding:
      prefix: -i
  header:
    type: string
    inputBinding:
      prefix: -h

baseCommand: [ count_lines.py ]

inputs:
  sequences:
    type: File
    inputBinding:
      prefix: -f
  number:
    type: int
    inputBinding:
      prefix: -n

baseCommand: [ bash ]

arguments:
  - valueFrom: |
      expr \$(cat $(inputs.input_file.path) | wc -l)
    prefix: -c

baseCommand: [ fastp ]

inputs:
  fastq1:
    type: File
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA
    inputBinding:
      prefix: -i
  fastq2:
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA
    type: File?
    inputBinding:
      prefix: -I
  threads:
    type: int?
    default: 1
    inputBinding:
      prefix: --thread
  qualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --qualified_quality_phred
  unqualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --unqualified_percent_limit
  min_length_required:
    type: int?
    default: 50
    inputBinding:
      prefix: --length_required
  force_polyg_tail_trimming:
    type: boolean?
    inputBinding:
      prefix: --trim_poly_g
  disable_trim_poly_g:
    type: boolean?
    default: true
    inputBinding:
      prefix: --disable_trim_poly_g
  base_correction:
    type: boolean?
    default: true
    inputBinding:
      prefix: --correction

arguments:
  - prefix: -o
    valueFrom: $(inputs.fastq1.nameroot).fastp.fastq
  - |
    ${
      if (inputs.fastq2){
        return '-O';
      } else {
        return '';
      }
    }
  - |
    ${
      if (inputs.fastq2){
        return inputs.fastq2.nameroot + ".fastp.fastq";
      } else {
        return '';
      }
    }
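A paired-end job sketch (file names assumed):

fastq1:
  class: File
  path: sample_1.fastq   # assumed
fastq2:
  class: File
  path: sample_2.fastq   # assumed
threads: 4

With the expressions above, the outputs would be sample_1.fastp.fastq and, because fastq2 is set, -O sample_2.fastp.fastq.
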
arguments:
  - valueFrom: $(inputs.fastq.nameroot).unclean
    prefix: '-o'

baseCommand: [ fastq_to_fasta.py ]

baseCommand: [ generate_checksum.py ]

baseCommand: [ make_csv.py ]

inputs:
  tab_sep_table:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: '-i'
  output_name:
    type: string
    inputBinding:
      prefix: '-o'

baseCommand: [ gunzip, -c ]

baseCommand: [ pigz ]
arguments: ["-p", "16", "-c"]

baseCommand: [run_result_file_chunker.py]

Maintainers: public
URL: https://github.com/EBI-Metagenomics/pipeline-v5.git
Name: mgnify-raw-reads-analysis-pipeline
Version: 1
Copyright: Public Domain
License: None