MGnify - raw-reads analysis pipeline


MGnify (http://www.ebi.ac.uk/metagenomics) provides a free-to-use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. Over the past 2 years, MGnify (formerly EBI Metagenomics) has more than doubled the number of publicly available analysed datasets held within the resource. Recently, an updated approach to data analysis has been unveiled (version 5.0), replacing the previous single pipeline with multiple analysis pipelines that are tailored according to the input data, and that are formally described using the Common Workflow Language, enabling greater provenance, reusability, and reproducibility. MGnify's new analysis pipelines offer additional approaches for taxonomic assertions based on ribosomal internal transcribed spacer regions (ITS1/2) and expanded protein functional annotations. Biochemical pathways and systems predictions have also been added for assembled contigs. MGnify's growing focus on the assembly of metagenomic data has also seen the number of datasets it has assembled and analysed increase six-fold. The non-redundant protein database constructed from the proteins encoded by these assemblies now exceeds 1 billion sequences. Meanwhile, a newly developed contig viewer provides fine-grained visualisation of the assembled contigs and their enriched annotations.

Documentation: https://docs.mgnify.org/en/latest/analysis.html#raw-reads-analysis-pipeline

Code Snippets

baseCommand: [ run_antismash_short.sh ]

baseCommand: [ change_antismash_output.py ]

baseCommand: [ change_geneclusters_ctg.py ]

baseCommand: [antismash_to_gff.py]

inputs:
  antismash_geneclus:
    type: File
    inputBinding:
      prefix: -g
  antismash_embl:
    type: File
    inputBinding:
      prefix: -e
  output_name:
    type: string
    inputBinding:
      prefix: -o

baseCommand: [reformat_antismash.py]

inputs:
  glossary:
    type: string
    inputBinding:
      position: 1
      prefix: -g
  geneclusters:
    type: File
    inputBinding:
        position: 2
        prefix: -a

baseCommand: [ antismash_rename_contigs.py ]

baseCommand: [move_antismash_summary.py]

baseCommand:
  - diamond
  - blastp
inputs:
  - id: blockSize
    type: float?
    inputBinding:
      position: 0
      prefix: '--block-size'
    label: sequence block size in billions of letters (default=2.0)
  - id: databaseFile
    type: string
    inputBinding:
      position: 0
      prefix: '--db'
    label: DIAMOND database input file
    doc: Path to the DIAMOND database file.
  - id: outputFormat
    type: string?  # Diamond-output_formats.yaml#output_formats?
    inputBinding:
      position: 0
      prefix: '--outfmt'
    label: Format of the output file
    doc: |-
      0   = BLAST pairwise
      5   = BLAST XML
      6   = BLAST tabular
      100 = DIAMOND alignment archive (DAA)
      101 = SAM

      Value 6 may be followed by a space-separated list of these keywords
  - id: minOrf
    type: int?
    inputBinding:
      position: 0
      prefix: '--min-orf'
    label: Minimum ORF length filter for translated query sequences
    doc: >
      Ignore translated sequences that do not contain an open reading frame of
      at least this length.

      By default this feature is disabled for sequences of length below 30, set
      to 20 for sequences of length below 100, and set to 40 otherwise. Setting
      this option to 1 will disable this feature.
  - id: queryInputFile
    format: edam:format_1929
    type: File
    inputBinding:
      position: 0
      prefix: '--query'
    label: Query input file in FASTA
    doc: >
      Path to the query input file in FASTA or FASTQ format (may be gzip
      compressed). If this parameter is omitted, the input will be read from
      stdin
  - id: strand
    type: string?  # Diamond-strand_values.yaml#strand?
    inputBinding:
      position: -3
      prefix: '--strand'
    label: Set strand of query to align for translated searches
    doc: >-
      Set strand of query to align for translated searches. By default both
      strands are searched. Valid values are {both, plus, minus}
  - id: taxonList
    type: 'int[]?'
    inputBinding:
      position: 0
      prefix: '--taxonlist'
    label: NCBI taxonomic IDs to filter the database by
    doc: >
      Comma-separated list of NCBI taxonomic IDs to filter the database by. Any
      taxonomic rank can be used, and only reference sequences matching one of
      the specified taxon ids will be searched against. Using this option
      requires setting the --taxonmap and --taxonnodes parameters for makedb.
  - id: threads
    type: int?
    inputBinding:
      position: 0
      prefix: '--threads'
    label: Number of CPU threads
    doc: >
      Number of CPU threads. By default, the program will auto-detect and use
      all available virtual cores on the machine.
  - id: maxTargetSeqs
    type: int?
    inputBinding:
      position: 0
      prefix: '--max-target-seqs'
    label: Max number of target sequences per query
    doc: >
      The maximum number of target sequences per query to report alignments for (default=25).
      Setting this to 0 will report all alignments that were found.
  - id: top
    type: int?
    inputBinding:
      position: 0
      prefix: '--top'
    label: Percentage range of the top alignment score
    doc: >
      Report alignments within the given percentage range of the top alignment score for a query
      (overrides --max-target-seqs option). For example, setting this to 10 will report all
      alignments whose score is at most 10% lower than the best alignment score for a query.



outputs:
  - id: matches
    type: File
    outputBinding:
      glob: $(inputs.queryInputFile.basename).diamond_matches
    format: edam:format_2333
doc: |
  DIAMOND is a sequence aligner for protein and translated DNA searches,
  designed for high performance analysis of big sequence data.

  The key features are:
        + Pairwise alignment of proteins and translated DNA at 500x-20,000x speed of BLAST.
        + Frameshift alignments for long read analysis.
        + Low resource requirements and suitable for running on standard desktops or laptops.
        + Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

  Please visit https://github.com/bbuchfink/diamond for full documentation.

  Releases can be downloaded from https://github.com/bbuchfink/diamond/releases
label: Aligns DNA query sequences against a protein reference database

arguments:
  - position: 0
    prefix: '--out'
    valueFrom: $(inputs.queryInputFile.basename).diamond_matches
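The tool derives its --out file name from the query's basename, so the matches file tracks the input automatically. As a minimal illustration (not part of the pipeline; the database path, query file name, and .cwl file name below are assumed), a cwltool job file for this description could look like:

databaseFile: /refs/uniref90.dmnd        # assumed path to a prebuilt DIAMOND database
queryInputFile:
  class: File
  path: predicted_proteins.faa           # assumed query FASTA
outputFormat: '6'                        # BLAST tabular, per the doc above
threads: 4
maxTargetSeqs: 25                        # the documented default

Run with, e.g., cwltool Diamond.blastp.cwl diamond-job.yml; the matches output would then be globbed as predicted_proteins.faa.diamond_matches.
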
baseCommand: [diamond_post_run_join.sh]

inputs:
  input_diamond:
    format: edam:format_2333
    type: File
    inputBinding:
      separate: true
      prefix: -i
  input_db:
    type: string
    inputBinding:
      separate: true
      prefix: -d
  filename: string

baseCommand: [emapper_wrapper.sh]

inputs:
  fasta_file:
    format: edam:format_1929  # FASTA
    type: File?
    inputBinding:
      separate: true
      prefix: -i
    label: Input FASTA file containing query sequences

  db:
    type: string?  # data/eggnog.db
    inputBinding:
      prefix: --database
    label: specify the target database for sequence searches (euk, bact, arch, host:port, or a local hmmpressed database)

  db_diamond:
    type: string?  # data/eggnog_proteins.dmnd
    inputBinding:
      prefix: --dmnd_db
    label: Path to DIAMOND-compatible database

  data_dir:
    type: string?  # data/
    inputBinding:
      prefix: --data_dir
    label: Directory to use for DATA_PATH

  mode:
    type: string?
    inputBinding:
      prefix: -m
    label: hmmer or diamond

  no_annot:
    type: boolean?
    inputBinding:
      prefix: --no_annot
    label: Skip functional annotation, reporting only hits

  no_file_comments:
    type: boolean?
    inputBinding:
      prefix: --no_file_comments
    label: Do not include header lines or stats in the output files

  cpu:
    type: int?
    inputBinding:
      prefix: --cpu

  annotate_hits_table:
    type: File?
    inputBinding:
      prefix: --annotate_hits_table
    label: Annotate a TSV-formatted table of query->hits

  output:
    type: string?
    inputBinding:
      prefix: -o
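A sketch of a diamond-mode job for this wrapper (all paths and the output prefix below are placeholders, not pipeline values):

fasta_file:
  class: File
  path: predicted_proteins.faa           # assumed
mode: diamond
db_diamond: /refs/eggnog_proteins.dmnd   # assumed, cf. the data/ hints above
data_dir: /refs/eggnog-data              # assumed
no_file_comments: true
cpu: 8
output: sample.emapper                   # assumed output prefix
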
baseCommand: [assign_genome_properties.pl]    # without docker

arguments:
  - position: 1
    valueFrom: "-all"
  - position: 2
    valueFrom: "table"
    prefix: "-outfiles"
  - position: 3
    valueFrom: "web_json"
    prefix: "-outfiles"
  - position: 4
    valueFrom: "summary"
    prefix: "-outfiles"

inputs:
  input_tsv_file:
    type: File
    format: edam:format_3475
    inputBinding:
      separate: true
      prefix: "-matches"

  flatfiles_path:
    type: string
    inputBinding:
      prefix: "-gpdir"
  GP_txt:
    type: string
    inputBinding:
      prefix: "-gpff"

  out_dir:
    type: string?
    inputBinding:
      prefix: "-outdir"
  name:
    type: string?
    inputBinding:
      prefix: "-name"
baseCommand: [ build_assembly_gff.py ]

inputs:
  ips_results:
    type: File
    format: edam:format_3475
    inputBinding:
      prefix: -i
  eggnog_results:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: -e
  input_faa:
    format: edam:format_1929
    type: File
    inputBinding:
      prefix: -f
  output_name:
    type: string
    inputBinding:
      prefix: -o

arguments: ["-n", $(inputs.fasta.basename)]

baseCommand: [ "run_samtools.sh" ]
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
baseCommand: [give_pathways.py]

inputs:
  input_table:
    format: edam:format_3475  # TSV
    type: File
    inputBinding:
      separate: true
      prefix: -i
  graphs:
    type: string
    inputBinding:
      prefix: -g
  pathways_names:
    type: string
    inputBinding:
      prefix: -n
  pathways_classes:
    type: string
    inputBinding:
      prefix: -c
  outputname:
    type: string
    inputBinding:
      prefix: -o

baseCommand: ['parsing_hmmscan.py']

inputs:
  table:
    format: edam:format_3475
    type: File
    inputBinding:
      separate: true
      prefix: -i
  fasta:
    type: File
    inputBinding:
      separate: true
      prefix: -f

baseCommand: [ esl-ssplit.sh ]

arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11


baseCommand: [ split_to_chunks.py ]

baseCommand: [ run_FGS.sh ]

inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
  output:
    type: string
    inputBinding:
      separate: true
      prefix: "-o"
  seq_type:
    type: string
    inputBinding:
      separate: true
      prefix: "-s"
  train:
    type: string
    inputBinding:
      separate: true
      prefix: "-t"
    default: "illumina_5"
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
baseCommand: [ unite_protein_predictions.py ]

inputs:
  masking_file:
    type: File
    inputBinding:
      prefix: "--mask"
  predicted_proteins_prodigal_out:
    type: File?
    inputBinding:
      prefix: "--prodigal-out"
  predicted_proteins_prodigal_ffn:
    type: File?
    inputBinding:
      prefix: "--prodigal-ffn"
  predicted_proteins_prodigal_faa:
    type: File?
    inputBinding:
      prefix: "--prodigal-faa"
  predicted_proteins_fgs_out:
    type: File
    inputBinding:
      prefix: "--fgs-out"
  predicted_proteins_fgs_ffn:
    type: File
    inputBinding:
      prefix: "--fgs-ffn"
  predicted_proteins_fgs_faa:
    inputBinding:
      prefix: "--fgs-faa"
    type: File
  basename:
    inputBinding:
      prefix: "--name"
    type: string
  genecaller_order:
    inputBinding:
      prefix: "--caller-priority"
    type: string

baseCommand: [ prodigal ]

arguments:
  - valueFrom: "sco"
    prefix: "-f"
  - valueFrom: "meta"
    prefix: "-p"
  - valueFrom: $(inputs.input_fasta.basename).prodigal
    prefix: "-o"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.ffn
    prefix: "-d"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.faa
    prefix: "-a"

inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
39
baseCommand: ["go_summary_pipeline-1.0.py"]
23
24
25
26
27
baseCommand: [ hmmscan_tab.py ]  # old was with sed

arguments:
  - valueFrom: $(inputs.input_table.nameroot).tsv
    prefix: -o

baseCommand: ["hmmsearch"]

arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
  - prefix: --domtblout
    valueFrom: $(inputs.seqfile.nameroot)_hmmsearch.tbl
    position: 2
  - prefix: --cpu
    valueFrom: '4'
  - prefix: -o
    valueFrom: '/dev/null'

inputs:

  omit_alignment:
    type: boolean?
    inputBinding:
      position: 1
      prefix: "--noali"

  gathering_bit_score:
    type: boolean?
    inputBinding:
      position: 4
      prefix: "--cut_ga"

  path_database:
    type: string
    inputBinding:
      position: 5

  seqfile:
    format: edam:format_1929  # FASTA
    type: File
    inputBinding:
      position: 6
      separate: true
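A job sketch for this tool (the HMM database path and FASTA name are assumed placeholders):

omit_alignment: true
gathering_bit_score: true
path_database: /refs/Pfam-A.hmm          # assumed HMM database
seqfile:
  class: File
  path: predicted_proteins.faa           # assumed

With the arguments above, the per-domain hits land in predicted_proteins_hmmsearch.tbl while stdout and stderr are discarded to /dev/null.
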
baseCommand: interproscan.sh
inputs:
  - id: inputFile
    type: File
    format: edam:format_1929
    inputBinding:
      position: 8
      prefix: '--input'
    label: Input file path
    doc: >-
      Optional, path to fasta file that should be loaded on Master startup.
      Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
  - id: applications
    type: string[]?
    inputBinding:
      position: 9
      itemSeparator: ','
      prefix: '--applications'
    label: Analysis
    doc: >-
      Optional, comma separated list of analyses. If this option is not set, ALL
      analyses will be run.
  - id: outputFormat
    type: string[]
    inputBinding:
      position: 10
      itemSeparator: ','
      prefix: '--formats'
    label: output format
    doc: >-
      Optional, case-insensitive, comma separated list of output formats.
      Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for
      protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3
      and XML.
  - id: databases
    type: string? #Directory?
  - id: disableResidueAnnotation
    type: boolean?
    inputBinding:
      position: 11
      prefix: '--disable-residue-annot'
    label: Disables residue annotation
    doc: 'Optional, excludes sites from the XML, JSON output.'
  - id: seqtype
    type:
      - 'null'
      - type: enum
        symbols:
          - p
          - n
        name: seqtype
    inputBinding:
      position: 12
      prefix: '--seqtype'
    label: Sequence type
    doc: >-
      Optional, the type of the input sequences (dna/rna (n) or protein (p)).
      The default sequence type is protein.
outputs:
  - id: i5Annotations
    format: edam:format_3475
    type: File
    outputBinding:
      glob: $(inputs.inputFile.nameroot).f*.tsv
doc: >-
  InterProScan is the software package that allows sequences (protein and
  nucleic) to be scanned against InterPro's signatures. Signatures are
  predictive models, provided by the several different databases that make up
  the InterPro consortium.
  This tool description uses a Docker container tagged as version v5.30-69.0.
  Documentation on how to run InterProScan 5 can be found here:
  https://github.com/ebi-pf-team/interproscan/wiki/HowToRun
label: 'InterProScan: protein sequence classifier'
arguments:
  - position: 0
    valueFrom: '--disable-precalc'
  - position: 1
    valueFrom: '--goterms'
  - position: 2
    valueFrom: '--pathways'
  - position: 3
    prefix: '--tempdir'
    valueFrom: $(runtime.tmpdir)
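A job sketch for this description (the input file name is assumed; the analysis listed is one example of a valid InterProScan application):

inputFile:
  class: File
  path: predicted_proteins.faa   # assumed
applications:
  - Pfam
outputFormat:
  - TSV

Given the output glob above, the annotations would be collected as predicted_proteins.f*.tsv, with --goterms, --pathways, and --disable-precalc always applied.
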
baseCommand: [bedtools, maskfasta]

arguments:
  - valueFrom: ITS_masked.fasta
    prefix: -fo

baseCommand: [format_bedfile]

# reverse start and end where start < end (i.e. neg strand)

baseCommand: [ its-length-new.py ]

baseCommand: ["run_quality_filtering.py"]

inputs:
  seq_file:
    type: File
    # format: edam:format_1929  # FASTA
    inputBinding:
      position: 1
    label: 'Trimmed sequence file'
    doc: >
      Trimmed and FASTQ to FASTA converted sequences file.
  submitted_seq_count:
    type: int
    label: 'Number of submitted sequences'
    doc: >
      Number of originally submitted sequences as in the user
      submitted FASTQ file - single end FASTQ or pair end merged FASTQ file.
  stats_file_name:
    type: string
    default: stats_summary
    label: 'Post QC stats output file name'
    doc: >
      Give a name for the file which will hold the stats after QC.
  min_length:
    type: int
    default: 100 # For assemblies we need to set this in the input YAML to 500
    label: 'Minimum read or contig length'
    doc: >
      Specify the minimum read or contig length for sequences to pass QC filtering.
  input_file_format: string


outputs:
  filtered_file:
    label: Filtered output file
    format: edam:format_1929  # FASTA
    type: File
    outputBinding:
      glob: $(inputs.seq_file.nameroot).fasta
  stats_summary_file:
    label: Stats summary output file
    type: File
    outputBinding:
      glob: $(inputs.stats_file_name)

arguments:
   - position: 2
     valueFrom: $(inputs.seq_file.nameroot).fasta
   - position: 3
     valueFrom: $(inputs.stats_file_name)
   - position: 4
     valueFrom: $(inputs.submitted_seq_count)
   - position: 5
     prefix: '--min_length'
     valueFrom: $(inputs.min_length)
   - position: 6
     prefix: '--extension'
     valueFrom: $(inputs.input_file_format)
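For illustration, a job file for this step might look like the sketch below (file name and count are assumed; min_length stays at 100 for reads and is raised to 500 for assemblies, per the comment above):

seq_file:
  class: File
  path: sample.fasta          # assumed: trimmed, FASTQ-to-FASTA converted reads
submitted_seq_count: 100000   # assumed
stats_file_name: stats_summary
min_length: 100
input_file_format: fasta      # assumed extension value

The filtered reads are then globbed as sample.fasta (i.e. $(seq_file.nameroot).fasta) next to the stats_summary file.
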
baseCommand: ["MGRAST_base.py" ]

inputs:
  QCed_reads:
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      prefix: -i
  length_sum:
    label: Prefix for the files associated with sequence length distribution
    type: string
    default: seq-length.out
  gc_sum:
    label: Prefix for the files associated with GC distribution
    type: string
    default: GC-distribution.out
  nucleotide_distribution:
    label: Prefix for the files associated with nucleotide distribution
    type: string
    default: nucleotide-distribution.out
  summary:
    label: File name for the summary of sequences, e.g. number, min/max length, etc.
    type: string
    default: summary.out
  max_seq:
    label: Maximum number of sequences to sub-sample 
    type: int?
    default: 2000000
  out_dir_name:
    label: Specifies output subdirectory
    type: string
    default: qc-statistics
  sequence_count:
    label: Specifies the number of sequences in the input read file (FASTA formatted)
    type: int


outputs:
  output_dir:
    label: Contains all stats output files
    type: Directory
    outputBinding:
      glob: $(inputs.out_dir_name)
  summary_out:
    label: Contains the summary statistics for the input sequence file
    type: File
    format: iana:text/plain
    outputBinding:
      glob: $(inputs.out_dir_name)/$(inputs.summary)

arguments:
   - position: 1
     prefix: '-o'
     valueFrom: $(inputs.out_dir_name)/$(inputs.summary)
   - position: 2
     prefix: '-d'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.nucleotide_distribution, suffix);
       }
   - position: 3
     prefix: '-g'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.gc_sum, suffix);
       }
   - position: 4
     prefix: '-l'
     valueFrom: |
       ${ var suffix = '.full';
          if (inputs.sequence_count > inputs.max_seq) {
            suffix = '.sub-set';
          }
          return "".concat(inputs.out_dir_name, '/', inputs.length_sum, suffix);
       }
   - position: 5
     valueFrom: ${ if (inputs.sequence_count > inputs.max_seq) { return '-m '.concat(inputs.max_seq)} else { return ''} }
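To make the suffix logic concrete: with the default max_seq of 2000000, a run whose sequence_count is 3000000 writes its distributions with the '.sub-set' suffix (e.g. qc-statistics/GC-distribution.out.sub-set) and appends '-m 2000000', while a run at or below the threshold gets the '.full' suffix and no -m flag.
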
baseCommand: [clean_motus_output.sh]

baseCommand: [motus]

arguments: [profile, -c, -q]

baseCommand: [ "biom-convert.sh" ]

inputs:
  biom:
    type: File?
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp

  table_type:
    type: string? #biom-convert-table.yaml#table_type?
    inputBinding:
      prefix: --table-type  # --table-type=    <- worked for cwlexec
      separate: true # false                                  <- worked for cwlexec
      valueFrom: $(inputs.table_type)  # $('"' + inputs.table_type + '"')      <- worked for cwlexec

  json:
    type: boolean?
    label: Output as JSON-formatted table.
    inputBinding:
      prefix: --to-json

  hdf5:
    type: boolean?
    label: Output as HDF5-formatted table.
    inputBinding:
      prefix: --to-hdf5

  tsv:
    type: boolean?
    label: Output as TSV-formatted (classic) table.
    inputBinding:
      prefix: --to-tsv

  header_key:
    type: string?
    doc: |
      The observation metadata to include from the input BIOM table file when
      creating a tsv table file. By default no observation metadata will be
      included.
    inputBinding:
      prefix: --header-key

arguments:
  - valueFrom: |
     ${ var ext = "";
        if (inputs.json) { ext = "_json.biom"; }
        if (inputs.hdf5) { ext = "_hdf5.biom"; }
        if (inputs.tsv) { ext = "_tsv.biom"; }
        var pre = inputs.biom.nameroot.split('.');
        pre.pop()
        return pre.join('.') + ext; }
    prefix: --output-fp
  - valueFrom: "--collapsed-observations"
baseCommand: [ cmsearch-deoverlap.pl ]

inputs:
  - id: clan_information
    type: string?
    inputBinding:
      position: 0
      prefix: '--clanin'
    label: clan information on the models provided
    doc: Not all models provided need to be a member of a clan
  - id: cmsearch_matches
    type: File
    format: edam:format_3475
    inputBinding:
      position: 1
      valueFrom: $(self.basename)

baseCommand:
  - cmsearch
inputs:
  - id: covariance_model_database
    type: [ string, File ]
    inputBinding:
      position: 1
  - id: cpu
    type: int?
    inputBinding:
      position: 0
      prefix: '--cpu'
    label: Number of parallel CPU workers to use for multithreads
  - default: false
    id: cut_ga
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--cut_ga'
    label: use CM's GA gathering cutoffs as reporting thresholds
  - id: omit_alignment_section
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--noali'
    label: Omit the alignment section from the main output.
    doc: This can greatly reduce the output volume.
  - default: false
    id: only_hmm
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--hmmonly'
    label: 'Only use the filter profile HMM for searches, do not use the CM'
    doc: |
      Only filter stages F1 through F3 will be executed, using strict P-value
      thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3). Additionally
      a bias composition filter is used after the F1 stage (with P=0.02
      survival threshold). Any hit that survives all stages and has an HMM
      E-value or bit score above the reporting threshold will be output.
  - id: query_sequences
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      position: 2
    # streamable: true
  - id: search_space_size
    type: int
    inputBinding:
      position: 0
      prefix: '-Z'
    label: search space size in *Mb* to <x> for E-value calculations

outputs:
  - id: matches
    doc: 'http://eddylab.org/infernal/Userguide.pdf#page=60'
    label: 'target hits table, format 2'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch_matches.tbl";
          } else {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch_matches.tbl";
          }
          return name;
        }
  - id: programOutput
    label: 'direct output to file, not stdout'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch.out";
          } else {
            name =
              inputs.query_sequences.basename +
              "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch.out";
          }
          return name;
        }

doc: >
  Infernal ("INFERence of RNA ALignment") is for searching DNA sequence
  databases for RNA structure and sequence similarities. It is an implementation
  of a special case of profile stochastic context-free grammars called
  covariance models (CMs). A CM is like a sequence profile, but it scores a
  combination of sequence consensus and RNA secondary structure consensus,
  so in many cases, it is more capable of identifying RNA homologs that
  conserve their secondary structure more than their primary sequence.

  Please visit http://eddylab.org/infernal/ for full documentation.

  Version 1.1.2 can be downloaded from
  http://eddylab.org/infernal/infernal-1.1.2.tar.gz
label: Search sequence(s) against a covariance model database

arguments:
  - position: 0
    prefix: '--tblout'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch_matches.tbl";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch_matches.tbl";
        }
        return name;
      }
  - position: 0
    prefix: '-o'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch.out";
        } else {
          name =
            inputs.query_sequences.basename +
            "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch.out";
        }
        return name;
      }
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11

hints:
  - class: SoftwareRequirement
    packages:
      infernal:
        specs:
          - 'https://identifiers.org/rrid/RRID:SCR_011809'
        version:
          - 1.1.2
  - class: DockerRequirement
    dockerPull: 'quay.io/biocontainers/infernal:1.1.2--h470a237_1'
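To make the naming expressions concrete: if covariance_model_database is the string /refs/rfam/ribo.cm (a placeholder path) and query_sequences is sample.fasta, the string branch keeps only ribo.cm, so the hits table is globbed as sample.fasta.ribo.cm.cmsearch_matches.tbl and the full program output as sample.fasta.ribo.cm.cmsearch.out; a File-typed database uses its nameroot instead.
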
baseCommand: [ esl-index.sh ]

baseCommand: [ esl-sfetch ]

baseCommand: awk_tool

baseCommand: get_subunits_coords.py

baseCommand: get_subunits.py

baseCommand: ktImportText

arguments:
  - valueFrom: "krona.html"
    prefix: -o

baseCommand: ['mapseq2biom.pl']

arguments:
  - valueFrom: $(inputs.query.basename).tsv
    prefix: --outfile
  - valueFrom: $(inputs.query.basename).txt
    prefix: --krona
  - valueFrom: $(inputs.query.basename).notaxid.tsv
    prefix: --notaxidfile

baseCommand: mapseq
arguments: ['-nthreads', '8', '-tophits', '80', '-topotus', '40', '-outfmt', 'simple']

baseCommand: [pull_ncrnas.sh]

baseCommand: SeqPrep

arguments:
 - "-1"
 - forward_unmerged.fastq.gz
 - "-2"
 - reverse_unmerged.fastq.gz
 - valueFrom: |
     ${ return inputs.namefile.nameroot.split('_')[0] + '_MERGED.fastq.gz' }
   prefix: "-s"
 # - "-3"
 # - forward_discarded.fastq.gz
 # - "-4"
 # - reverse_discarded.fastq.gz
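As an example of the merged-read naming: for a namefile of SRR1234_1.fastq (an assumed name), nameroot.split('_')[0] yields SRR1234, so the merged reads are written to SRR1234_MERGED.fastq.gz alongside the unmerged forward/reverse files.
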
baseCommand: [functional_stats.py]

baseCommand: [write_summaries.py]

baseCommand: [ trimmomatic.sh ]

inputs:
  phred:
    type: string?  #trimmomatic-phred.yaml#phred?
    inputBinding:
      prefix: -phred
      separate: false
      position: 4
    label: 'quality score format'
    doc: >
      Either PHRED "33" or "64" specifies the base quality encoding. Default: 64

  tophred64:
    type: boolean?
    inputBinding:
      position: 12
      prefix: TOPHRED64
      separate: false
    label: 'quality score conversion to phred64'
    doc: >
      This (re)encodes the quality part of the FASTQ file to base 64.

  headcrop:
    type: int?
    inputBinding:
      position: 13
      prefix: 'HEADCROP:'
      separate: false
    label: 'read head trimming'
    doc: >
      Removes the specified number of bases, regardless of quality, from the
      beginning of the read.

  tophred33:
    type: boolean?
    inputBinding:
      position: 12
      prefix: TOPHRED33
      separate: false
    label: 'quality score conversion to phred33'
    doc: >
      This (re)encodes the quality part of the FASTQ file to base 33.

  minlen:
    type: int?
    inputBinding:
      position: 100
      prefix: 'MINLEN:'
      separate: false
    label: 'minimum length read filter'
    doc: >
      This module removes reads that fall below the specified minimal length.
      If required, it should normally be after all other processing steps.
      Reads removed by this step will be counted and included in the "dropped
      reads" count presented in the trimmomatic summary.

  java_opts:
    type: string?
    inputBinding:
      position: 1
      shellQuote: false
    doc: >
      JVM arguments should be a quoted, space separated list
      (e.g. "-Xms128m -Xmx512m")

  leading:
    type: int?
    inputBinding:
      position: 14
      prefix: 'LEADING:'
      separate: false
    label: 'read head quality trimming'
    doc: >
      Remove low quality bases from the beginning. As long as a base has a
      value below this threshold the base is removed and the next base will be
      investigated.

  slidingwindow:
    type: string?  #trimmomatic-sliding_window.yaml#slidingWindow?
    inputBinding:
      position: 15
      prefix: 'SLIDINGWINDOW:'
      separate: false
    label: 'read filtering sliding window'
    doc: >
      Perform a sliding window trimming, cutting once the average quality
      within the window falls below a threshold. By considering multiple
      bases, a single poor quality base will not cause the removal of high
      quality data later in the read.
      <windowSize> specifies the number of bases to average across
      <requiredQuality> specifies the average quality required

  illuminaClip:
    type:  File? #trimmomatic-illumina_clipping.yaml#illuminaClipping?
    inputBinding:
      valueFrom: |
        ${ if ( self ) {
             return "ILLUMINACLIP:" + inputs.illuminaClip.adapters.path + ":"
               + self.seedMismatches + ":" + self.palindromeClipThreshold + ":"
               + self.simpleClipThreshold + ":" + self.minAdapterLength + ":"
               + self.keepBothReads;
           } else {
             return self;
           }
         }
      position: 11
    label: 'sequencing adapter removal'
    doc: >
      Cut adapter and other illumina-specific sequences from the read.

  crop:
    type: int?
    inputBinding:
      position: 13
      prefix: 'CROP:'
      separate: false
    label: 'read cropping'
    doc: >
      Removes bases regardless of quality from the end of the read, so that the
      read has maximally the specified length after this step has been
      performed. Steps performed after CROP might of course further shorten the
      read. The value is the number of bases to keep, from the start of the read.

  reads2:
    type: File?
    inputBinding:
      position: 6
    label: 'FASTQ read file 2'
    doc: >
      FASTQ file of R2 reads in Paired End mode

  reads1:
    type: File
    inputBinding:
      position: 5
    label: 'FASTQ read file 1'
    doc: >
      FASTQ file of reads (R1 reads in Paired End mode)

  avgqual:
    type: int?
    inputBinding:
      position: 101
      prefix: 'AVGQUAL:'
      separate: false
    label: 'minimum average quality required'
    doc: >
      Drop the read if the average quality is below the specified level

  trailing:
    type: int?
    inputBinding:
      position: 14
      prefix: 'TRAILING:'
      separate: false
    label: 'read tail quality filtering'
    doc: >
      Remove low quality bases from the end. As long as a base has a value
      below this threshold the base is removed and the next base (which, as
      Trimmomatic works from the 3' end, is the base preceding the one just
      removed) will be investigated. This approach can be used to remove the
      special Illumina "low quality segment" regions (which are marked with a
      quality score of 2), but we recommend Sliding Window or MaxInfo instead.

  maxinfo:
    type: int?  #trimmomatic-max_info.yaml#maxinfo?
    inputBinding:
      position: 15
      valueFrom: |
        ${ if ( self ) {
             return "MAXINFO:" + self.targetLength + ":" + self.strictness;
           } else {
             return self;
           }
         }
    label: 'maxinfo: read score quality filtering'
    doc: >
      Performs an adaptive quality trim, balancing the benefits of retaining
      longer reads against the costs of retaining bases with errors.
      <targetLength>: This specifies the read length which is likely to allow
      the location of the read within the target sequence to be determined.
      <strictness>: This value, which should be set between 0 and 1, specifies
      the balance between preserving as much read length as possible vs.
      removal of incorrect bases. A low value of this parameter (<0.2) favours
      longer reads, while a high value (>0.8) favours read correctness.

  end_mode:
    type: string  #trimmomatic-end_mode.yaml#end_mode
    inputBinding:
      position: 3
    label: 'read end mode'
    doc: >
      Single End (SE) or Paired End (PE) mode

outputs:
  reads1_trimmed:
    type: File
    format: edam:format_1930  # fastq
    outputBinding:
      glob: $(inputs.reads1.nameroot).trimmed

  log_file:
    type: File
    outputBinding:
      glob: 'trim.log'
    label: Log file
    doc: |
      log of all read trimmings, indicating the following details:
        the read name
        the surviving sequence length
        the location of the first surviving base, aka. the amount trimmed from the start
        the location of the last surviving base in the original read
        the amount trimmed from the end

  reads1_trimmed_unpaired:
    type: File?
    format: edam:format_1930  # fastq
    outputBinding:
      glob: $(inputs.reads1.basename).trimmed.unpaired.fastq

arguments:
- valueFrom: trim.log
  prefix: -trimlog
  position: 4
- valueFrom: $(runtime.cores)
  position: 4
  prefix: -threads
- valueFrom: $(inputs.reads1.nameroot).trimmed
  position: 7
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads1.nameroot + '.trimmed.unpaired.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 8
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads2.nameroot + '.trimmed.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 9
#- valueFrom: |
#    ${
#      if (inputs.end_mode == "PE" && inputs.reads2) {
#        return inputs.reads2.nameroot + '.trimmed.unpaired.fastq';
#      } else {
#        return null;
#      }
#    }
#  position: 10
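A single-end job sketch for this description (the read file name and threshold values are illustrative):

end_mode: SE
reads1:
  class: File
  path: sample.fastq    # assumed
phred: '33'
leading: 3
trailing: 3
slidingwindow: '4:15'   # windowSize:requiredQuality
minlen: 100

The trimmed reads are then globbed as sample.trimmed (from $(inputs.reads1.nameroot).trimmed), with per-read trimming details in trim.log.
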
baseCommand: [ add_header ]

inputs:
  input_table:
    #format: [edam:format_3475, edam:format_2333]
    type: File
    inputBinding:
      prefix: -i
  header:
    type: string
    inputBinding:
      prefix: -h

baseCommand: [ count_lines.py ]

inputs:
  sequences:
    type: File
    inputBinding:
      prefix: -f
  number:
    type: int
    inputBinding:
      prefix: -n

baseCommand: [ bash ]

arguments:
  - valueFrom: |
      expr \$(cat $(inputs.input_file.path) | wc -l)
    prefix: -c

baseCommand: [ fastp ]

inputs:
  fastq1:
    type: File
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA
    inputBinding:
      prefix: -i
  fastq2:
    format:
      - edam:format_1930 # FASTQ
      - edam:format_1929 # FASTA
    type: File?
    inputBinding:
      prefix: -I
  threads:
    type: int?
    default: 1
    inputBinding:
      prefix: --thread
  qualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --qualified_quality_phred
  unqualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --unqualified_percent_limit
  min_length_required:
    type: int?
    default: 50
    inputBinding:
      prefix: --length_required
  force_polyg_tail_trimming:
    type: boolean?
    inputBinding:
      prefix: --trim_poly_g
  disable_trim_poly_g:
    type: boolean?
    default: true
    inputBinding:
      prefix: --disable_trim_poly_g
  base_correction:
    type: boolean?
    default: true
    inputBinding:
      prefix: --correction

arguments:
  - prefix: -o
    valueFrom: $(inputs.fastq1.nameroot).fastp.fastq
  - |
    ${
      if (inputs.fastq2){
        return '-O';
      } else {
        return '';
      }
    }
  - |
    ${
      if (inputs.fastq2){
        return inputs.fastq2.nameroot + ".fastp.fastq";
      } else {
        return '';
      }
    }
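A paired-end job sketch (file names assumed):

fastq1:
  class: File
  path: sample_1.fastq   # assumed
fastq2:
  class: File
  path: sample_2.fastq   # assumed
threads: 4

With the expressions above, the outputs would be sample_1.fastp.fastq and, because fastq2 is set, -O sample_2.fastp.fastq.
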
arguments:
  - valueFrom: $(inputs.fastq.nameroot).unclean
    prefix: '-o'

baseCommand: [ fastq_to_fasta.py ]

baseCommand: [ generate_checksum.py ]

baseCommand: [ make_csv.py ]

inputs:
  tab_sep_table:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: '-i'
  output_name:
    type: string
    inputBinding:
      prefix: '-o'

baseCommand: [ gunzip, -c ]

baseCommand: [ pigz ]
arguments: ["-p", "16", "-c"]

baseCommand: [run_result_file_chunker.py]

Maintainers: public
URL: https://github.com/EBI-Metagenomics/pipeline-v5.git
Name: mgnify-raw-reads-analysis-pipeline
Version: 1
Copyright: Public Domain
License: None