Trinity RNA Assembly pipeline


Trinity assembly pipeline for BioCommons / USydney Informatics Hub

The pipeline requires Nextflow to run. DSL2 syntax is used, so Nextflow version 20.07.1 or higher is required.

NOTE: this project was run in the context of workflow automation, reproducibility and scalability. The scope was to port an existing bash pipeline to Nextflow, and in doing so to investigate a few points, namely:

  • packing of multiple serial analyses into a single process;

  • option to leverage node-local disks;

  • option to leverage overlayFS in Singularity;

  • ease of adding configuration files for more computing clusters (Gadi was tested in this case).

The first three items tackle scalability, in that they allow processing large input datasets; the fourth is about portability.

Pipeline and requirements

This pipeline is based on SIH-Raijin-Trinity, with scheduler parameters updated following Gadi-Trinity:

Jellyfish -> Inchworm -> Chrysalis -> Butterfly mini-assemblies -> Aggregate

There are two software requirements:

  • Trinity, the main bioinformatics package; tests have been run with Trinity version 2.8.6 (official container);

  • GNU Parallel, to orchestrate mini-assemblies within each compute node; version 20191022 has been tested.

Basic usage

nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -profile zeus \
  --slurm_account='<Your Pawsey Project>'

The flag --reads is required and specifies the names of the pair of input read files.
Note some syntax requirements:

  • enclose the file name specification in single quotes;

  • within a file pair, use names that differ only by a single character that distinguishes the two files, in this case 1 or 2;

  • use curly brackets to specify the wildcard characters within the file pair, e.g. {1,2};

  • the prefix to the wildcard serves as the sample ID, e.g. reads_.
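As a quick illustration of the naming convention, the pattern can be exercised in a shell (hypothetical file names; note that bash brace expansion is not identical to Nextflow's glob matching, but it shows which names the pattern covers):

```shell
# Hypothetical demo: file names matched by --reads='reads_{1,2}.fq.gz'
mkdir -p reads_demo
touch reads_demo/reads_1.fq.gz reads_demo/reads_2.fq.gz
ls reads_demo/reads_{1,2}.fq.gz
```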

The flag -profile (note the single dash) selects the appropriate profile for the machine in use, Zeus in this case. On Zeus, use the flag --slurm_account to set your Pawsey account; on Gadi (NCI), use the flag --pbs_account instead.

The pipeline will output two files prefixed by the sample ID, in this case reads_Trinity.fasta and reads_Trinity.fasta.gene_trans_map. By default, they are saved in the same directory as the input read files.
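For instance, a quick sanity check on the FASTA output might count the assembled transcripts (a demo file is created here for illustration; real headers use the TRINITY_DN prefix seen in the aggregation step below):

```shell
# Sketch: count sequences in a Trinity-style FASTA (demo data, not real output)
printf '>TRINITY_DN0_c0_g1_i1\nACGT\n>TRINITY_DN1_c0_g1_i1\nTTGG\n' > reads_Trinity.fasta
grep -c '^>' reads_Trinity.fasta    # prints 2
```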

Multiple inputs at once

The pipeline can process multiple datasets at once. You can use input file name patterns to this end:

  1. for multiple input read pairs in the same directory, e.g. sample1_R{1,2}.fq, sample2_R{1,2}.fq and so on, use --reads='sample*{1,2}.fq';

  2. for multiple read pairs in distinct directories, e.g. sample1/R{1,2}.fq, sample2/R{1,2}.fq and so on, use --reads='sample*/R{1,2}.fq'.
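The second layout can be sketched in a shell as follows (hypothetical directory and file names, for illustration only):

```shell
# Hypothetical layout matched by --reads='sample*/R{1,2}.fq'
mkdir -p sample1 sample2
touch sample1/R1.fq sample1/R2.fq sample2/R1.fq sample2/R2.fq
ls sample*/R{1,2}.fq
```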

Major options

The pipeline can be used with the additional profile localdisk, for instance -profile zeus,localdisk, to execute I/O-intensive processes on node-local disks; a configuration parameter defines the naming convention for the corresponding node-local scratch directories.

Alternatively, the pipeline can be used with the additional profile overlay, as in -profile zeus,overlay, to run inside an overlayFS (a virtual filesystem in a file) and mitigate I/O-intensive analyses. This option requires Singularity. A configuration parameter defines the size of the overlay files (one file per concurrent task).

On Gadi at NCI, you can use -profile gadi,localdisk to execute I/O-intensive processes on node-local disks (JOBFS). The default Gadi profile uses environment modules to provide the required packages; to use a Singularity container instead, add the flag -profile gadi,singularity.

Usage on different systems

The main pipeline file, main.nf, contains the pipeline logic and is almost completely machine-independent.
All system-specific information is contained in configuration files under the config directory, which are included from nextflow.config.

Examples are provided for Zeus and Nimbus at Pawsey, and Gadi at NCI; you can use them as templates for other systems.
Typical information to specify includes scheduler configuration (including project name), software provisioning (containers, conda, modules, ..), and possibly other specifics such as the location of the runtime work directory, filesystem options (e.g. setting cache mode to lenient on parallel filesystems), and pipeline configuration (e.g. local directory naming for localdisk, size of overlay files).
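As an illustration, a minimal cluster profile covering the points above might look like the following (hypothetical values and profile name; the actual templates ship under the config directory):

```groovy
// Hypothetical Nextflow profile sketch, NOT one of the shipped config files
profiles {
  mycluster {
    process {
      executor = 'slurm'
      queue    = 'work'        // hypothetical queue name
      cache    = 'lenient'     // recommended on parallel filesystems
    }
    params.localdir = 'trinity_localdir'   // naming for node-local scratch (localdisk profile)
  }
}
```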

NOTE on Gadi: it is assumed that all scripts and data required at runtime reside in directories belonging to the PBS project specified in gadi.config and in the pipeline submission script (the two must match).

Additional resources

The extra directory contains an example Slurm script, job_zeus.sh, to run on Zeus, and an example PBS script, job_gadi.sh, to run on Gadi at NCI. There is also a sample script log.sh that takes a run name as input and displays formatted runtime information.
This directory also contains scripts that can be used to install a patched version of Nextflow on Gadi at NCI, required to comply with its PBS configuration: hack-nextflow-pbs-gadi.sh, which in turn requires patch.PbsProExecutor.groovy.

The test directory contains a small input dataset and launching scripts for quick testing of the pipeline (both for Zeus and Gadi), with a total runtime of a few minutes.

Code Snippets

main.nf, lines 31-39:
"""
singularity exec docker://ubuntu:18.04 bash -c ' \
out_file=\"${params.overfileprefix}one\" && \
mkdir -p overlay_tmp/upper overlay_tmp/work && \
dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_one} bs=1M && \
mkfs.ext3 -d overlay_tmp \${out_file} && \
rm -rf overlay_tmp \
'
"""
main.nf, lines 57-65:
"""
singularity exec docker://ubuntu:18.04 bash -c ' \
out_file=\"${params.overfileprefix}${reads_fa.toString().minus('.tgz')}\" && \
mkdir -p overlay_tmp/upper overlay_tmp/work && \
dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_many} bs=1M && \
mkfs.ext3 -d overlay_tmp \${out_file} && \
rm -rf overlay_tmp \
'
"""
main.nf, lines 80-96:
"""
mem='${task.memory}'   # e.g. "54 GB"
mem=\${mem%B}          # strip trailing "B"
mem=\${mem// /}        # remove spaces -> "54G", as --max_memory expects

Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_run_inchworm
"""
main.nf, lines 110-127:
"""
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --inchworm_cpu ${task.cpus} \
  --no_run_chrysalis
"""
main.nf, lines 141-178:
"""
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cp -r \$( readlink $read1 ) ${params.localdir}/
  cp -r \$( readlink $read2 ) ${params.localdir}/
  cp -r \$( readlink ${params.taskoutdir} ) ${params.localdir}/
  cd ${params.localdir}
fi

mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_distributed_trinity_exec

if [ "${params.localdisk}" == "true" ] ; then
  find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" >output_list
  split -l ${params.bf_collate} -a 4 output_list chunk
  for f in chunk* ; do
    tar -cz -h -f \${f}.tgz -T \${f}
  done
  cd \$here
  cp ${params.localdir}/chunk*.tgz .
  rm -r ${params.localdir}
fi
"""
main.nf, lines 193-233:
  """
  if [ "${params.localdisk}" == "true" ] ; then
    here=\$PWD
    rm -rf ${params.localdir}
    mkdir ${params.localdir}
    cp -r \$( readlink $reads_fa ) ${params.localdir}/
    cd ${params.localdir}
  fi

  mem='${params.bf_mem}'
  mem=\${mem%B}
  export mem=\${mem// /}

  cat << "EOF" >trinity.sh
Trinity \
  --single \${1} \
  --run_as_paired \
  --seqType fa \
  --verbose \
  --no_version_check \
  --workdir trinity_workdir \
  --output \${1}.out \
  --max_memory \${mem} \
  --CPU ${params.bf_cpus} \
  --trinity_complete \
  --full_cleanup \
  --no_distributed_trinity_exec
EOF
  chmod +x trinity.sh

  if [ "${params.localdisk}" == "true" ] ; then
    tar xzf ${reads_fa}
    find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" | parallel -j ${task.cpus} ./trinity.sh {}
    find ${params.taskoutdir}/read_partitions -name "*inity.fasta" | tar -cz -h -f out_${reads_fa} -T -
    cd \$here
    cp ${params.localdir}/out_chunk*.tgz .
    rm -r ${params.localdir}
  else
    ls *inity.reads.fa | parallel -j ${task.cpus} ./trinity.sh {}
  fi
  """
main.nf, lines 248-277:
"""
my_trinity=\$(which Trinity)
my_trinity=\$(dirname \$my_trinity)

if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cd ${params.localdir}
  for f in ${reads_fasta} ; do
    cp \$( readlink \$here/\$f ) .
    tar xzf \${f}
  done
  find ${params.taskoutdir}/read_partitions -name "*inity.fasta" >input_list
else
  ls *inity.fasta >input_list
fi

cat input_list | \${my_trinity}/util/support_scripts/partitioned_trinity_aggregator.pl \
  --token_prefix TRINITY_DN --output_prefix Trinity.tmp
mv Trinity.tmp.fasta Trinity.fasta

\${my_trinity}/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta > Trinity.fasta.gene_trans_map

if [ "${params.localdisk}" == "true" ] ; then
  cd \$here
  cp ${params.localdir}/Trinity.fasta* .
  rm -r ${params.localdir}
fi
"""
Nextflow, from line 248 of master/main.nf.




Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/marcodelapierre/trinity-nf
Name: trinity-rna-assembly
Version: 2
Copyright: Public Domain
License: GNU General Public License v3.0
