Trinity De Novo Transcriptome Assembly on NCI-Gadi HPC


Gadi-Trinity

Description

This repository contains a staged Trinity workflow that can be run on the National Computational Infrastructure’s (NCI) Gadi supercomputer. Trinity performs de novo transcriptome assembly of RNA-seq data by combining three independent software modules, Inchworm, Chrysalis and Butterfly, to process RNA-seq reads. The algorithm can detect isoforms and handle paired-end reads, multiple insert sizes and strandedness. For more information, see the Trinity user guide.

The Gadi-Trinity workflow leverages multiple nodes on NCI Gadi to run a number of Butterfly processes in parallel. This workflow is suitable for single-sample and global assemblies of genomes < 2 Gb.



Set up

This repository contains all scripts and software required to run Gadi-Trinity. Before running this workflow, you will need to do the following:

  1. Clone the Gadi-Trinity repository from GitHub (see ‘Installation’ below)

  2. Prepare the module archive by running create-apps.sh from the resources directory (see ‘Software requirements’ below)

  3. Copy the template submission script Scripts/template.sh into your project directory, give it a meaningful name, and edit it for your project

  4. Make a list of fastq files to be submitted (see ‘Input’ below)

  5. Edit the key input variables in template.sh (see ‘Input’ below); a filled-in example follows this list:

    • project= your NCI project code (e.g. er00)

    • list= the path to your fastq list

    • seqtype= fq or fa

    • tissue= make sure this pulls the correct field from your fastq file names

    • storage= the string to pass to the PBS storage directive, e.g. scratch/<project>+gdata/<project>

    • version= choose Trinity version 2.9.1 or 2.12.0
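For example, the ‘set variables’ block in your renamed copy of template.sh might end up looking like the following sketch (the project code, paths and storage string here are placeholders for illustration, not values from this repository):

project=er00
list=/scratch/er00/myassembly/fastq.list
seqtype=fq
version=2.12.0
storage=scratch/er00+gdata/er00

The tissue name is not set here; it is derived from each fastq file name inside the submission loop of template.sh (shown under ‘Code Snippets’ below).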

Installation

Clone the Gadi-Trinity repository to your project’s scratch directory:

module load git 
git clone https://github.com/Sydney-Informatics-Hub/Gadi-Trinity.git 

Software requirements

Trinity requires the following software, loaded as modules from apps already installed on Gadi. A module archive of this software is created by running create-apps.sh in the resources directory (see the example after this list).

trinity/2.9.1 or trinity/2.12.0
bowtie2/2.3.5.1
samtools/1.10
salmon/1.1.0
python2/2.7.17 or python3/3.7.4
jellyfish/2.3.0
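
For example, assuming the repository was cloned as described under ‘Installation’, the archive can be built with something like:

cd Gadi-Trinity/resources
sh create-apps.sh

This should produce the apps.tar archive that each PBS script unpacks to /tmp on the compute node (see the module-loading blocks under ‘Code Snippets’ below).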

Input

A plain text file containing a list of input fastq files is required as input. Each row corresponds to one sample and has three columns: column 1, an incremental number (for the job array); column 2, the read 1 file; column 3, the read 2 file. This file can be created by running the following from the directory containing your fastq files:

readlink -f *.fastq.gz | sort -V | xargs -n 2 | cat -n > fastq.list
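
For two hypothetical samples named liver and root, the resulting fastq.list would look something like this (paths and file names are illustrative only):

     1  /scratch/er00/fastq/liver_R1.fastq.gz /scratch/er00/fastq/liver_R2.fastq.gz
     2  /scratch/er00/fastq/root_R1.fastq.gz /scratch/er00/fastq/root_R2.fastq.gz

With the default tissue= line in template.sh, the sample name is taken as the part of the read 1 file name before the first underscore (liver and root here).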

You will also need to edit the key input variables in the ‘set variables’ section of template.sh that are required to run Trinity:

  • project= (er00)

  • list= (fastq.list)

  • seqtype= (fq)

Usage

Overview

To manage the data-intensive computation of Trinity, each job utilises the node-local /jobfs file system, requiring data to be copied between file systems on Gadi.

Once you have made the fastq.list and set the variables in template.sh, run the workflow with:

sh template.sh

template.sh runs Trinity in three phases (trinity_1_fb.pbs to trinity_3_fb.pbs), each launched as an independent PBS script (see the monitoring example after this list):

  • trinity_1_fb.pbs : clusters Inchworm contigs with Chrysalis and maps reads; stops before the parallel assembly of clustered reads

  • trinity_2_fb.pbs : assembles clusters of reads using Inchworm, Chrysalis and Butterfly. Chrysalis and Butterfly can be executed in parallel, each having independent input and output. This is the distributed part of the workflow.

  • trinity_3_fb.pbs : final assembly. Harvests all assembled transcripts into a single multi-fasta file.
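
Job progress can be followed with standard PBS commands and from the per-job log files written to the Logs directory created by template.sh, for example:

qstat -u $USER               # list your queued and running jobs
ls Logs/                     # one .o and .e log file per job, named <tissue>_job*
tail Logs/<tissue>_job1.o    # check on the first phase for a given sample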

HPC usage report scripts are provided in the SIH repository for users to evaluate the KSU (service unit), walltime, resource consumption and efficiency of their job submissions. These scripts gather job request metrics from Gadi log files. To use them, run the scripts from within the directories containing the log files to be read.
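As a quick check without those scripts, the resource usage summary that Gadi’s PBS appends to the end of each .o file (service units, walltime, memory and jobfs used) can be viewed directly; this assumes the standard Gadi job summary footer:

tail -n 25 Logs/<tissue>_job1.o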

Resource usage

The Trinity pipeline consists of a series of executables launched with a single command. Each stage has different compute resource requirements: the initial stages of the workflow (Inchworm and Chrysalis) are data-intensive and require high memory per core, while the latter stages are scalable, embarrassingly parallel, single-core jobs. Trinity’s general recommendation is ~1 GB of RAM per ~1 M pairs of Illumina sequence reads.
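As a rough sketch of that rule of thumb, the number of read pairs (and hence an approximate memory estimate) can be taken from the R1 file of a sample; the file name below is a placeholder:

# Count read pairs in a gzipped R1 fastq (4 lines per read)
pairs=$(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))
# ~1 GB of RAM per ~1 M read pairs
echo "Approximately $(( pairs / 1000000 )) GB RAM suggested for ${pairs} read pairs"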

The distributed part of the workflow is unlikely to require significant jobfs or memory resources. However, the initial phase of the workflow may need to run on the hugemem nodes; if so, edit the qsub definition at the bottom of the template.sh script. As there are some serial bottlenecks in the first part of the workflow, reducing the requested resources may improve the 'efficiency' of the calculation. For instance, half of a hugemem node (24 cores, 750 GB memory, 700 GB jobfs) may be sufficient for a larger assembly. Memory and jobfs requirements to process samples are generally sufficiently serviced by NCI Gadi’s normal nodes (48 CPUs, 400 GB of /jobfs disk space).
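For example, to run the first phase on half a hugemem node as suggested above, the resource and queue lines in the trinity_1 qsub call at the bottom of template.sh could be changed along these lines (a sketch; check the current Gadi hugemem queue limits before submitting):

    -l wd,ncpus=24,mem=750GB,walltime=48:00:00,jobfs=700GB \
    -q hugemem \

leaving the rest of the qsub options unchanged.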

Benchmarking metrics

The following benchmarking metrics were obtained using stem rust (Puccinia graminis) datasets with a genome size of ~170 Mb. Each of these was run on Gadi’s normal nodes (48 CPUs, 400 GB of /jobfs disk space).

Wheat stem rust

Job                 CPUs  Mem       CPUtime    Walltime_used  JobFS_used  Efficiency  Service_units
trinity_1.pbs       48    182.49GB  68:27:15   2:59:16        193.35GB    0.48        286.83
trinity_2_fb_0.pbs  48    80.33GB   115:52:10  2:33:03        19.89GB     0.95        244.88
trinity_2_fb_1.pbs  48    17.42GB   18:51:03   0:26:00        243.04MB    0.91        41.6
trinity_3.pbs       48    5.14GB    0:00:12    0:01:26        267.1MB     0           2.29
Total                                          5:33:45                                576

Rye rust

Job                 CPUs  Mem       CPUtime   Walltime_used  JobFS_used  Efficiency  Service_units
trinity_1.pbs       48    182.26GB  23:37:52  2:51:09        182.89GB    0.52        273.84
trinity_2_fb_0.pbs  48    66.32GB   21:48:17  2:05:58        19.73GB     0.93        201.55
trinity_2_fb_1.pbs  48    25.09GB   1:51:32   0:02:51        61.87MB     0.82        4.56
trinity_3.pbs       48    4.39GB    0:00:08   0:00:16        192.7MB     0.01        0.43
Total                                         4:57:23                                480

Scabrum rust

Job                 CPUs  Mem       CPUtime   Walltime_used  JobFS_used  Efficiency  Service_units
trinity_1.pbs       48    141.51GB  37:21:15  1:46:16        111.15GB    0.44        170.03
trinity_2_fb_0.pbs  48    53.1GB    99:12:17  2:12:50        11.39GB     0.93        212.53
trinity_2_fb_1.pbs  48    20.24GB   11:01:19  0:15:38        185.78MB    0.88        25.01
trinity_3.pbs       48    4.54GB    0:00:08   0:00:13        233.19MB    0.01        0.35
Total                                         3:59:19                                408

Additional notes

  • Trinity’s running time is exponentially related to the number of de Bruijn graph branches created. Given walltime limitations on Gadi, the Gadi-Trinity workflow is not recommended for use on genomes >2 Gb. For larger single-sample and global assemblies, we recommend the Flashlite-Trinity workflow, which runs Trinity on the University of Queensland’s HPC, FlashLite.

  • All work is performed local to the node in /jobfs or in /dev/shm.

  • At the end of trinity_1_fb.pbs, a single tar file containing the full Trinity output directory is copied back to network storage. This will be >100 GB.

  • Each task running trinity_2_fb.pbs works on a single file bin representing ~100,000 tasks. Only the recursive_trinity.cmds file and the relevant data from read_partitions are copied to the node. The full read_partitions directory is archived and pushed back to network storage at the end of processing. This will be up to 10 GB.

  • In trinity_3_fb.pbs, only the fasta files from the distributed step are copied to the node. Only the full assembly is copied back.

  • The sort-recursive.py script is run by trinity_2_fb.pbs. It sorts the recursive commands by the size of their input files (largest to smallest), so that long-running, single-CPU jobs do not hold up a whole node. This improves overall job efficiency.

  • The scripts were designed to use a single project for KSU debiting and storage.

Acknowledgements

Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.

Authors

  • Tracy Chew (Sydney Informatics Hub, University of Sydney)

  • Georgina Samaha (Sydney Informatics Hub, University of Sydney)

  • Cali Willet (Sydney Informatics Hub, University of Sydney)

  • Rosemarie Sadsad (Sydney Informatics Hub, University of Sydney)

  • Rika Kobayashi (National Computational Infrastructure)

  • Matthew Downton (National Computational Infrastructure)

  • Ben Menadue (National Computational Infrastructure)

Suggested acknowledgement:

The authors acknowledge the support provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government, and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia funding.

Cite us to support us!

Chew, T., Samaha, G., Downton, M., Willet, C., Menadue, B. J., Kobayashi, R., & Sadsad, R. (2021). Gadi-Trinity (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.145.1

References

Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644-652. Published 2011 May 15. doi:10.1038/nbt.1883

Haas BJ, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8(8):1494-1512. doi:10.1038/nprot.2013.084

Code Snippets

Scripts/sort-recursive.py (excerpt):

import os
import sys

# Expect exactly one argument: the recursive_trinity.cmds file to sort.
if len(sys.argv) != 2:
    print("Call with name of recursive_trinity.cmds file")
    sys.exit(1)

fname = sys.argv[1]

# Return the size of the input read file referenced by a Trinity command line.
# Only the last five path components are kept, so this script is picky about
# where it is called: run it from within the Trinity output directory.
def commandToSize(command):
    return os.stat("/".join(
        command.split()[2].lstrip("\"").rstrip("\"").split("/")[-5:])).st_size

with open(fname) as f:
    commands = f.readlines()

# Sort the commands by input file size, largest first, so the longest-running
# assemblies start earliest, then write them to <fname>.sorted.
lofc = [{"size": commandToSize(x), "command": x} for x in commands]
sortedlofc = sorted(lofc, key=lambda x: x['size'], reverse=True)
with open(fname + '.sorted', 'w') as f:
    for c in sortedlofc:
        f.write(c['command'])

Scripts/template.sh (excerpt):

project=<project>
list=<fastq.list>
seqtype=<seqtype>
# Set version of trinity: 2.9.1 or 2.12.0
version=
# Storage string for the PBS storage directive, e.g. scratch/<project>
storage=

io=$PWD
script=${io}/Scripts
resources=${io}/resources
logs=${io}/Logs

# Resource requests for steps 2 and 3. If the initial step (which will
# run through to the end of chrysalis) is likely to be very large (for
# instance using the hugemem queue), edit the qsub submission below.
cpu_per_node=48
mem_per_node=190
jobfs_per_node=400

echo "CPUs per node: ${cpu_per_node}, mem per node: ${mem_per_node}"
echo "JobFS per node: ${jobfs_per_node}"

num_pairs=$(grep -c -v '^$' ${list})

mkdir -p ${logs}

# Loop through each line of fastq.list, and submit all trinity jobs. Trinity 2-3 will only start
# when the previous part has run successfully
for i in $(seq 1 ${num_pairs}); do
	# Extracts "tissue" name from filename - change to suit your samples
	tissue=$(basename -- "$(awk -v taskID=$i '$1==taskID {print $2}' ${list})" | cut -d _ -f 1 | cut -d . -f 1)
	out=${io}/Trinity/${tissue}

	echo `date` ": STARTING TRINITY FOR ${tissue}"
	# trinity_1.pbs
	echo `date` ": Launching Trinity Part 1"
	qsub \
	    -v input="${i}",seqtype="${seqtype}",out="${out}",list="${list}",tissue="${tissue}",resources="${resources}",cpu_per_node="${cpu_per_node}",jobfs_per_node="${jobfs_per_node}",mem_per_node="${mem_per_node}",project="${project}",script="${script}",logs="${logs}",io="${io}",storage="${storage}",version="${version}" \
	    -N ${tissue}_1 \
	    -P ${project} \
	    -l wd,ncpus=48,mem=190GB,walltime=48:00:00,jobfs=400GB \
	    -W umask=022 \
	    -l storage=${storage} \
	    -q normal \
	    -o ${logs}/${tissue}_job1.o \
	    -e ${logs}/${tissue}_job1.e \
	    ${script}/trinity_1_fb.pbs

done

Scripts/trinity_1_fb.pbs (excerpt):

tar xf ${resources}/apps.tar -C /tmp/
export APPS_DIR=/tmp/
module use ${APPS_DIR}/apps/Modules/modulefiles
module load bowtie2/2.3.5.1
module load samtools/1.10
module load salmon/1.1.0
if [[ $version == '2.9.1' ]]
then
    echo "Loading python2/2.7.17 and Trinity/2.9.1"
    module load python2/2.7.17
    module load trinity/2.9.1
elif [[ $version == '2.12.0' ]]
then
    echo "Loading python3/3.7.4 and Trinity/2.12.0"
    module load python3/3.7.4
    module load trinity/2.12.0
fi
module load jellyfish/2.3.0

# Set trap
# EXIT runs on any exit, signalled or not.
finish(){
	echo "$(date) : Archiving trinity outdir and copying to ${out}"
	cd ${PBS_JOBFS}
	tar cf ${out}/trinity_outdir_1.tar trinity_outdir
	echo "$(date) : Finished archiving trinity_1.pbs"

    # Submit the follow up recursive jobs
    cd trinity_outdir/${tissue}_trinity_${version}/read_partitions
    for fb in Fb_*
    do
        jobids=${jobids}:$(qsub \
                               -v out="${out}",tissue="${tissue}",fb="${fb}",resources="${resources}",io="${io}",version="${version}" \
                               -N ${tissue}_${fb} \
                               -P ${project} \
                               -l wd,ncpus=${cpu_per_node},mem=${mem_per_node}GB,walltime=48:00:00,jobfs=${jobfs_per_node}GB \
                               -q normal \
                               -W umask=022 \
                               -l storage=${storage} \
                               -o ${logs}/${tissue}_job_2_${fb}.o \
                               -e ${logs}/${tissue}_job_2_${fb}.e \
                               ${script}/trinity_2_fb.pbs)
    done

    jobids=$(echo $jobids | sed -e 's/^://' | sed -e 's/.gadi-pbs//g')
    echo "Final assembly will commence after jobs: ${jobids}"

    # Submit the final assembly
    qsub \
        -W depend=afterok:${jobids} \
        -v resources="${resources}",tissue="${tissue}",out="${out}",io="${io}",version="${version}" \
        -P ${project} \
        -l wd,ncpus=${cpu_per_node},mem=${mem_per_node}GB,walltime=48:00:00,jobfs=${jobfs_per_node}GB \
        -q normal \
        -W umask=022 \
        -l storage=${storage} \
        -o ${logs}/${tissue}_job_3.o \
        -e ${logs}/${tissue}_job_3.e \
        ${script}/trinity_3_fb.pbs

}
trap finish EXIT

# Set variables
first=$(awk -v taskID=${input} '$1==taskID {print $2}' ${list})
second=$(awk -v taskID=${input} '$1==taskID {print $3}' ${list})

mkdir -p ${out}

echo "$(date) : Beginning trinity_1_fb.pbs: Run to end of Chrysalis"

export TRINITY_WORKDIR=/dev/shm/trinity_workdir

export TRINITY_OUTDIR=${PBS_JOBFS}/trinity_outdir
mkdir -p ${TRINITY_OUTDIR}
cd ${TRINITY_OUTDIR}

# Run trinity, stop before the distributed tasks
# Set the memory and cpu count based on the PBS variables.
${TRINITY_HOME}/Trinity \
               --seqType ${seqtype} \
	           --max_memory $(($PBS_VMEM/1024/1024/1024))G \
	           --no_version_check \
	           --left ${first} \
	           --right ${second} \
	           --no_normalize_reads \
	           --CPU ${PBS_NCPUS} \
	           --workdir ${TRINITY_WORKDIR} \
	           --output ${tissue}_trinity_${version} \
	           --verbose \
	           --no_distributed_trinity_exec

echo "$(date) : Finished trinity_1_fb.pbs"

Scripts/trinity_2_fb.pbs (excerpt):

cd ${io}

echo $resources
tar xf ${resources}/apps.tar -C /tmp/
export APPS_DIR=/tmp/
module use ${APPS_DIR}/apps/Modules/modulefiles
module load bowtie2/2.3.5.1
module load samtools/1.10
module load salmon/1.1.0
if [[ $version == '2.9.1' ]]
then
    echo "Loading python2/2.7.17 and Trinity/2.9.1"
    module load python2/2.7.17
    module load trinity/2.9.1
elif [[ $version == '2.12.0' ]]
then
    echo "Loading python3/3.7.4 and Trinity/2.12.0"
    module load python3/3.7.4
    module load trinity/2.12.0
fi
module load jellyfish/2.3.0


finish(){
    echo "$(date): Copying data from jobFS to ${out}..."

    cd ${TRINITY_OUTDIR}/${tissue}_trinity_${version}
    tar cf ${out}/trinity_outdir_2_${fb}.tar read_partitions

    echo "$(date) : Finished trinity_2_fb.pbs for ${fb}"
}
trap finish EXIT

echo "$(date): Beginning recursive_trinity step on ${fb} directory: Assemble clusters of reads in parallel"

echo "Total number of CPUs: ${PBS_NCPUS}"

# Only untar the relevant fb directory
tar xf ${out}/trinity_outdir_1.tar -C ${PBS_JOBFS} */read_partitions/${fb}/ */recursive_trinity.cmds

export TRINITY_OUTDIR=${PBS_JOBFS}/trinity_outdir
cd ${TRINITY_OUTDIR}

echo "$(date): Currently in ${TRINITY_OUTDIR}"

# Re-write "partitioned_reads.files.list" so it has the correct paths
echo "$(date): Updating paths for partitioned_reads.files.list"
find ${PWD}/ -iname '*trinity.reads.fa' > ${tissue}_trinity_${version}/partitioned_reads.files.list
find ${PWD}/ -iname '*trinity.reads.fa'
head ${tissue}_trinity_${version}/partitioned_reads.files.list

# Re-write "recursive_trinity.cmds" so that it has correct paths
echo "$(date): Updating paths for recursive_trinity.cmds"
echo "before:" 
head ${tissue}_trinity_${version}/recursive_trinity.cmds
sed -i -e 's|\/jobfs\/[0-9]\+\.gadi-pbs|'${PBS_JOBFS}'|g' ${tissue}_trinity_${version}/recursive_trinity.cmds
echo "after:"
head ${tissue}_trinity_${version}/recursive_trinity.cmds

# Select relevant commands
grep ${fb}/ ${tissue}_trinity_${version}/recursive_trinity.cmds > ${tissue}_trinity_${version}/recursive_trinity.${fb}.cmds

${io}/Scripts/sort-recursive.py ${tissue}_trinity_${version}/recursive_trinity.${fb}.cmds

export OMP_PROC_BIND=TRUE
${TRINITY_HOME}/trinity-plugins/ParaFly-0.1.0/bin/ParaFly -c ${tissue}_trinity_${version}/recursive_trinity.${fb}.cmds.sorted -CPU ${PBS_NCPUS}

Scripts/trinity_3_fb.pbs (excerpt):

cd ${io}

tar xf ${resources}/apps.tar -C /tmp/
export APPS_DIR=/tmp/
module use ${APPS_DIR}/apps/Modules/modulefiles
module load trinity/${version}

echo "$(date): Beginning trinity_3_fb.pbs: Harvest reads into a final assembly"

export TRINITY_OUTDIR=${PBS_JOBFS}/${tissue}_trinity_${version}
mkdir -p ${TRINITY_OUTDIR}

cd ${TRINITY_OUTDIR}

echo "$(date): Currently in ${TRINITY_OUTDIR}"

for t in ${out}/trinity_outdir_2_Fb_*.tar
do
	tar xf $t *.fasta
done

echo "$(date): ** Harvesting all assembled transcripts into a single multi-fasta file with "${TRINITY_HOME}"/util/support_scripts/partitioned_trinity_aggregator.pl..."
find read_partitions/ -name '*inity.fasta'  | \
    ${TRINITY_HOME}/util/support_scripts/partitioned_trinity_aggregator.pl \
                     --token_prefix TRINITY_DN \
                     --output_prefix Trinity.tmp

mv Trinity.tmp.fasta ${tissue}.trinity_${version}.fasta

echo "$(date): ** Creating genes_trans_map file with "${TRINITY_HOME}"/util/support_scripts/get_Trinity_gene_to_trans_map.pl..."

${TRINITY_HOME}/util/support_scripts/get_Trinity_gene_to_trans_map.pl ${tissue}.trinity_${version}.fasta > ${tissue}.trinity_${version}.fasta.gene_trans_map

echo "$(date): Moving files back from node"

mv ${tissue}.trinity_${version}.fasta ${out}/${tissue}_trinity_${version}.fasta
mv ${tissue}.trinity_${version}.fasta.gene_trans_map ${out}/${tissue}_trinity_${version}.gene_trans_map

echo "$(date): Done"