HiFi de novo genome assembly workflow used to analyse PacBio CCS reads
HiFi de novo genome assembly workflow
HiFi-assembly-workflow is a bioinformatics pipeline for de novo genome assembly from PacBio Circular Consensus Sequencing (CCS) reads. The workflow is implemented in Nextflow and has three major sections.
Please refer to the following documentation for detailed description of each workflow section:
HiFi assembly workflow flowchart
Quick Usage:
The pipeline has been tested on NCI Gadi and the AGRF balder cluster. To run it on the AGRF cluster, please contact us at [email protected] . Running it on NCI Gadi requires an NCI account. Please refer to the Gadi guidelines for account creation and usage at https://opus.nci.org.au/display/Help/Access .
Here is an example that can be used to run a phased assembly on Gadi:
module load nextflow/21.04.3
nextflow run Hifi_assembly.nf --bam_folder <full_path_to_bam_folder> -profile gadi
The workflow accepts 2 mandatory arguments:
--bam_folder -- Full Path to the CCS bam files
-profile -- gadi/balder/local
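As a minimal sketch of how the two mandatory arguments fit together, the Python helper below assembles the command line shown above and rejects an unknown profile or a relative path. The function name `build_command`, the checks, and the example path are illustrative assumptions and are not part of the workflow itself.

```python
from pathlib import Path

# Profiles listed in this README.
KNOWN_PROFILES = {"gadi", "balder", "local"}


def build_command(bam_folder, profile, script="Hifi_assembly.nf"):
    """Return the nextflow invocation as an argument list (hypothetical helper)."""
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    folder = Path(bam_folder)
    if not folder.is_absolute():
        # The README asks for the full path to the CCS bam files.
        raise ValueError("bam_folder must be an absolute path")
    return ["nextflow", "run", script,
            "--bam_folder", str(folder),
            "-profile", profile]


if __name__ == "__main__":
    # The path below is made up for illustration.
    print(" ".join(build_command("/scratch/project/ccs_bams", "gadi")))
```

The returned list can be passed to `subprocess.run` once the `nextflow` module has been loaded.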
General recommendations for using the HiFi de novo genome assembly workflow
exeReport
This folder contains a computation resource usage summary in various charts and a text file.
report.html provides a comprehensive summary.
Results
The Results folder contains three sub-directories: preQC, assembly and postqc. As the names suggest, outputs from the respective workflow sections are placed in each of these folders.
preQC
The following table lists the files and folders in the preQC results:

| Output folder/file | File description |
|---|---|
| .fa | BAM files converted to FASTA format |
| kmer_analysis | Folder containing k-mer analysis outputs |
| .jf | k-mer counts from each sample |
| .histo | Histogram of k-mer occurrence |
| genome_profiling | GenomeScope profiling outputs |
| summary.txt | Summary metrics of the GenomeScope outputs |
| linear_plot.png | Plot showing the number of times a k-mer is observed against the number of k-mers with that coverage |
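To make the k-mer histogram more concrete, here is a rough Python sketch that reads a `.histo` file and reports the multiplicity of the main coverage peak. It assumes the file has the two-column "multiplicity count" layout written by `jellyfish histo` and that it is not empty; the function names are hypothetical. This is only a quick sanity check, not a substitute for the model fit that GenomeScope performs.

```python
def parse_histo(text):
    """Parse 'multiplicity count' pairs, one pair per line."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def main_peak(pairs):
    """Return the multiplicity of the highest peak after the first valley.

    Low-multiplicity k-mers are mostly sequencing errors, so the scan
    starts at the first local minimum rather than at multiplicity 1.
    """
    counts = [count for _, count in pairs]
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1
    best = max(range(valley, len(counts)), key=lambda i: counts[i])
    return pairs[best][0]


if __name__ == "__main__":
    # Synthetic histogram for illustration only.
    demo = "1 1000\n2 300\n3 50\n4 80\n5 200\n6 400\n7 350\n8 100\n"
    print(main_peak(parse_histo(demo)))
```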
Assembly
This folder contains the final assembly results in FASTA format.
- _primary.fa - Fasta file containing primary contigs
- _associate.fa - Fasta file containing associated contigs
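For a quick look at either FASTA file, contig count, total length and N50 can be computed with a few lines of Python. The sketch below is an illustration with hypothetical function names and is independent of the workflow's own post-assembly QC.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # In-memory records for illustration; pass an open file handle instead.
    demo = [">a", "ACGTACGTAC", ">b", "ACGT", "AC", ">c", "ACG"]
    lengths = read_fasta_lengths(demo)
    print(len(lengths), sum(lengths), n50(lengths))
```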
postqc
The postqc folder contains two sub-folders:
- assembly_completeness
- assembly_evaluation
assembly_completeness
This folder contains BUSCO evaluation results for the primary and associated contigs.
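BUSCO short summaries include a one-line score such as `C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255`. The Python sketch below extracts those numbers so that primary and associated contigs can be compared side by side. The function name is hypothetical, the example values are invented, and the exact line format may differ between BUSCO versions.

```python
import re

# Matches the one-line BUSCO score, for example:
#   C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255
SCORE_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_score(text):
    """Return the BUSCO percentages and gene count found in `text`, or None."""
    match = SCORE_RE.search(text)
    if match is None:
        return None
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


if __name__ == "__main__":
    print(parse_busco_score("C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255"))
```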
assembly_evaluation
The assembly evaluation folder contains outputs in several file formats. Here is a brief description of each:
| File | Description |
|---|---|
| report.txt | Assessment summary in plain text format |
| report.tsv | Tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc) |
| report.tex | LaTeX version of the summary |
| icarus.html | Icarus main menu with links to interactive viewers |
| report.html | HTML version of the report with interactive plots inside |

Infrastructure usage and recommendations
NCI facility access
You need an NCI user account to access the Gadi high-performance computing facility. Setting up an NCI account is described in detail at the following URL: https://opus.nci.org.au/display/Help/Setting+up+your+NCI+Account
Code Snippets
    """
    check_samplesheet.py \\
        $complete_samplesheet \\
        complete_samplesheet.valid.csv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        python: \$(python --version | sed 's/Python //g')
    END_VERSIONS
    """
    """
    # Nextflow changes the container --entrypoint to /bin/bash (container default entrypoint: /usr/local/env-execute)
    # Check for container variable initialisation script and source it.
    if [ -f "/usr/local/env-activate.sh" ]; then
        set +u  # Otherwise, errors out because of various unbound variables
        . "/usr/local/env-activate.sh"
        set -u
    fi

    # If the augustus config directory is not writable, then copy to writeable area
    if [ ! -w "\${AUGUSTUS_CONFIG_PATH}" ]; then
        # Create writable tmp directory for augustus
        AUG_CONF_DIR=\$( mktemp -d -p \$PWD )
        cp -r \$AUGUSTUS_CONFIG_PATH/* \$AUG_CONF_DIR
        export AUGUSTUS_CONFIG_PATH=\$AUG_CONF_DIR
        echo "New AUGUSTUS_CONFIG_PATH=\${AUGUSTUS_CONFIG_PATH}"
    fi

    # Ensure the input is uncompressed
    INPUT_SEQS=input_seqs
    mkdir "\$INPUT_SEQS"
    cd "\$INPUT_SEQS"
    for FASTA in ../tmp_input/*; do
        if [ "\${FASTA##*.}" == 'gz' ]; then
            gzip -cdf "\$FASTA" > \$( basename "\$FASTA" .gz )
        else
            ln -s "\$FASTA" .
        fi
    done
    cd ..

    busco \\
        --cpu $task.cpus \\
        --in "\$INPUT_SEQS" \\
        --out ${prefix}-busco \\
        $busco_lineage \\
        $busco_lineage_dir \\
        $busco_config \\
        $args

    # clean up
    rm -rf "\$INPUT_SEQS"

    # Move files to avoid staging/publishing issues
    mv ${prefix}-busco/batch_summary.txt ${prefix}-busco.batch_summary.txt
    mv ${prefix}-busco/*/short_summary.*.{json,txt} . || echo "Short summaries were not available: No genes were found."

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        busco: \$( busco --version 2>&1 | sed 's/^BUSCO //' )
    END_VERSIONS
    """
    """
    quast.py \\
        --output-dir QUAST \\
        *.fasta.gz \\
        --threads $task.cpus \\
        $args

    ln -s QUAST/report.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        quast: \$(quast.py --version 2>&1 | sed 's/^.*QUAST v//; s/ .*\$//')
    END_VERSIONS
    """
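The `report.tsv` file that this step links into the working directory is the tab-separated summary described under assembly_evaluation. As a small sketch, assuming the usual QUAST layout (a header row starting with `Assembly` followed by one column per assembly, then one metric per row), it can be loaded into a dictionary like this. The function name and the sample values are made up for illustration.

```python
def parse_quast_tsv(text):
    """Map assembly name -> {metric: value} from a QUAST-style report.tsv."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    assemblies = header[1:]
    report = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            # Values are kept as strings; convert them where needed.
            report[name][metric] = value
    return report


if __name__ == "__main__":
    demo = (
        "Assembly\tsample_primary\tsample_associate\n"
        "# contigs\t12\t340\n"
        "N50\t5000000\t80000\n"
    )
    print(parse_quast_tsv(demo)["sample_primary"]["N50"])
```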
    """
    flye \\
        $mode \\
        $reads \\
        --out-dir . \\
        --threads \\
        $task.cpus \\
        $args

    gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
    gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
    gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
    mv assembly_info.txt ${prefix}.assembly_info.txt
    mv flye.log ${prefix}.flye.log
    mv params.json ${prefix}.params.json

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        flye: \$( flye --version )
    END_VERSIONS
    """
    """
    echo stub > assembly.fasta && gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
    echo stub > assembly_graph.gfa && gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
    echo stub > assembly_graph.gv && gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
    echo contig_1 > ${prefix}.assembly_info.txt
    echo stub > ${prefix}.flye.log
    echo stub > ${prefix}.params.json

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        flye: \$( flye --version )
    END_VERSIONS
    """
    """
    multiqc \\
        --force \\
        $args \\
        $config \\
        $extra_config \\
        .

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
    END_VERSIONS
    """
    """
    touch multiqc_data
    touch multiqc_plots
    touch multiqc_report.html

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
    END_VERSIONS
    """
    """
    samtools \\
        stats \\
        --threads ${task.cpus} \\
        ${reference} \\
        ${input} \\
        > ${prefix}.stats

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    END_VERSIONS
    """
    """
    touch ${prefix}.stats

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    END_VERSIONS
    """
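The `.stats` file written by the `samtools stats` step is a tab-separated text report whose summary lines start with `SN`. A minimal Python sketch for pulling those summary numbers into a dictionary follows. The function name is hypothetical and the sample text is invented for illustration.

```python
def parse_summary_numbers(text):
    """Collect the SN (summary numbers) lines of `samtools stats` output."""
    summary = {}
    for line in text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        # Keys end with a colon in the report; a trailing comment column is ignored.
        key = fields[1].rstrip(":")
        summary[key] = fields[2]
    return summary


if __name__ == "__main__":
    demo = (
        "# This file was produced by samtools stats\n"
        "SN\traw total sequences:\t1000\t# example comment\n"
        "SN\taverage length:\t15000\n"
        "FFQ\t1\t0\n"
    )
    print(parse_summary_numbers(demo))
```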
Support
- Future updates