HiFi de novo genome assembly workflow used to analyse PacBio CCS reads

Version: 1

HiFi de novo genome assembly workflow

HiFi-assembly-workflow is a bioinformatics pipeline for de novo genome assembly from PacBio Circular Consensus Sequencing (CCS) reads. The workflow is implemented in Nextflow and has three major sections: preQC, assembly and postqc.

Please refer to the following documentation for a detailed description of each workflow section:

HiFi assembly workflow flowchart

Quick Usage:

The pipeline has been tested on NCI Gadi and the AGRF balder cluster. To run it on the AGRF cluster, please contact us at [email protected] . Running it on NCI Gadi requires an NCI account with access to Gadi; the Gadi guidelines for account creation and usage can be found at https://opus.nci.org.au/display/Help/Access .

Here is an example that can be used to run a phased assembly on Gadi:

    module load nextflow/21.04.3
    nextflow run Hifi_assembly.nf --bam_folder <path_to_bam_folder> -profile gadi

The workflow accepts 2 mandatory arguments:

    --bam_folder -- Full path to the CCS bam files
    -profile -- gadi/balder/local
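Before launching, it can help to confirm that the folder given to --bam_folder is a full path and holds at least one BAM file. The following is a minimal sketch, not part of the workflow: the function name `check_bam_folder` is hypothetical, and it assumes the CCS files sit directly in the folder with a `.bam` extension.

```shell
# Hypothetical pre-flight helper; not shipped with the workflow.
# Prints the number of BAM files and returns 0 when the folder looks usable.
check_bam_folder() {
    local folder="$1"
    # --bam_folder expects a full (absolute) path.
    case "$folder" in
        /*) ;;
        *) echo "error: '$folder' is not a full path" >&2; return 1 ;;
    esac
    if [ ! -d "$folder" ]; then
        echo "error: '$folder' is not a directory" >&2
        return 1
    fi
    # Count regular files or symlinks ending in .bam (assumed naming).
    local n
    n=$(find "$folder" -maxdepth 1 \( -type f -o -type l \) -name '*.bam' | wc -l)
    n=$((n))   # drop the padding that some wc implementations add
    if [ "$n" -eq 0 ]; then
        echo "error: no .bam files found in '$folder'" >&2
        return 1
    fi
    echo "$n"
}
```

The check can be chained in front of the run command, for example `check_bam_folder "$BAMS" && nextflow run Hifi_assembly.nf --bam_folder "$BAMS" -profile gadi`, where `BAMS` is a placeholder variable.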

General recommendations for using the HiFi de novo genome assembly workflow

exeReport

This folder contains a summary of computational resource usage as charts and a text file. report.html provides a comprehensive summary.

Results

The Results folder contains three sub-directories: preQC, assembly and postqc. As the names suggest, outputs from the respective workflow sections are placed in each of these folders.

preQC

The following table lists the files and folders in the preQC results:

Output folder/file    File Description
.fa                   Bam files converted to fasta format
kmer_analysis         Folder containing k-mer analysis outputs
.jf                   k-mer counts from each sample
.histo                Histogram of k-mer occurrence
genome_profiling      GenomeScope profiling outputs
summary.txt           Summary metrics of GenomeScope outputs
linear_plot.png       Plot showing no. of times a k-mer observed by no. of k-mers with that coverage
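As a quick sanity check on the k-mer histogram, the coverage peak can be read directly from a .histo file. The sketch below assumes the file has the two-column layout written by `jellyfish histo` (k-mer multiplicity, then number of distinct k-mers with that multiplicity). The function name `kmer_peak` and the default cut-off of 5, used to skip the low-multiplicity error peak, are illustrative assumptions rather than workflow settings.

```shell
# Hypothetical helper: print the multiplicity with the most distinct k-mers,
# ignoring multiplicities below a cut-off (default 5, an arbitrary choice).
kmer_peak() {
    local histo="$1" min_mult="${2:-5}"
    awk -v min="$min_mult" '
        # Column 1 = multiplicity, column 2 = number of distinct k-mers.
        $1 >= min && $2 > best { best = $2; peak = $1 }
        END { if (peak == "") exit 1; print peak }
    ' "$histo"
}
```

For a sample with a single clear peak, the printed value approximates the k-mer coverage that the profiling step reports in more detail.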

Assembly

This folder contains the final assembly results in FASTA format.

_primary.fa - Fasta file containing primary contigs
_associate.fa - Fasta file containing associated contigs
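A first look at either FASTA file usually means counting contigs and total bases. The sketch below does this with awk; the function name `fasta_summary` is hypothetical, and the input is assumed to be an uncompressed FASTA file.

```shell
# Hypothetical helper: print "<number of contigs><TAB><total bases>" for a FASTA file.
fasta_summary() {
    awk '
        /^>/ { contigs++; next }                      # header line starts a new contig
        { gsub(/[ \t\r]/, ""); bases += length($0) }  # sequence line, whitespace removed
        END { printf "%d\t%.0f\n", contigs, bases }   # %.0f keeps large totals exact
    ' "$1"
}
```

Running it on the primary and associated contig files side by side gives a quick comparison of the two assemblies before opening the postqc reports.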

postqc

The postqc folder contains two sub-folders:

assembly_completeness
assembly_evaluation

assembly_completeness

This folder contains BUSCO evaluation results for the primary and associated contigs.
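The headline completeness figure can be pulled from a BUSCO short summary text file. This is a minimal sketch, assuming the file contains BUSCO's one-line notation of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function name `busco_complete_pct` is hypothetical.

```shell
# Hypothetical helper: print the "complete" percentage (the C: value)
# from a BUSCO short summary text file.
busco_complete_pct() {
    # Capture the number between "C:" and "%[" on the one-line summary.
    sed -n 's/.*C:\([0-9][0-9.]*\)%\[.*/\1/p' "$1" | head -n 1
}
```

The function prints nothing when no summary line is found, which is worth checking for when BUSCO reports that no genes were found.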

assembly_evaluation

The assembly_evaluation folder contains outputs in several file formats. Here is a brief description of each:

File           Description
report.txt     Assessment summary in plain text format
report.tsv     Tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc)
report.tex     LaTeX version of the summary
icarus.html    Icarus main menu with links to interactive viewers
report.html    HTML version of the report with interactive plots inside

Infrastructure usage and recommendations

NCI facility access

A user account with NCI is required to access the Gadi high-performance computing facility. Setting up an NCI account is described in detail at the following URL: https://opus.nci.org.au/display/Help/Setting+up+your+NCI+Account

Code Snippets

"""
check_samplesheet.py \\
    $complete_samplesheet \\
    complete_samplesheet.valid.csv
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version | sed 's/Python //g')
END_VERSIONS
"""

NextFlow From line 21 of local/check_samplesheet.nf

"""
# Nextflow changes the container --entrypoint to /bin/bash (container default entrypoint: /usr/local/env-execute)
# Check for container variable initialisation script and source it.
if [ -f "/usr/local/env-activate.sh" ]; then
    set +u  # Otherwise, errors out because of various unbound variables
    . "/usr/local/env-activate.sh"
    set -u
fi

# If the augustus config directory is not writable, then copy to writeable area
if [ ! -w "\${AUGUSTUS_CONFIG_PATH}" ]; then
    # Create writable tmp directory for augustus
    AUG_CONF_DIR=\$( mktemp -d -p \$PWD )
    cp -r \$AUGUSTUS_CONFIG_PATH/* \$AUG_CONF_DIR
    export AUGUSTUS_CONFIG_PATH=\$AUG_CONF_DIR
    echo "New AUGUSTUS_CONFIG_PATH=\${AUGUSTUS_CONFIG_PATH}"
fi

# Ensure the input is uncompressed
INPUT_SEQS=input_seqs
mkdir "\$INPUT_SEQS"
cd "\$INPUT_SEQS"
for FASTA in ../tmp_input/*; do
    if [ "\${FASTA##*.}" == 'gz' ]; then
        gzip -cdf "\$FASTA" > \$( basename "\$FASTA" .gz )
    else
        ln -s "\$FASTA" .
    fi
done
cd ..

busco \\
    --cpu $task.cpus \\
    --in "\$INPUT_SEQS" \\
    --out ${prefix}-busco \\
    $busco_lineage \\
    $busco_lineage_dir \\
    $busco_config \\
    $args

# clean up
rm -rf "\$INPUT_SEQS"

# Move files to avoid staging/publishing issues
mv ${prefix}-busco/batch_summary.txt ${prefix}-busco.batch_summary.txt
mv ${prefix}-busco/*/short_summary.*.{json,txt} . || echo "Short summaries were not available: No genes were found."

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    busco: \$( busco --version 2>&1 | sed 's/^BUSCO //' )
END_VERSIONS
"""

NextFlow augustus From line 32 of busco/main.nf

"""
quast.py \\
    --output-dir QUAST \\
    *.fasta.gz \\
    --threads $task.cpus \\
    $args

ln -s QUAST/report.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    quast: \$(quast.py --version 2>&1 | sed 's/^.*QUAST v//; s/ .*\$//')
END_VERSIONS
"""

NextFlow QUAST From line 25 of quast/main.nf
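The QUAST step above links report.tsv into the working directory. A single metric can be read from that file as sketched below, assuming the usual QUAST layout of one metric per row with the metric name in the first tab-separated column. The function name `quast_metric` is hypothetical.

```shell
# Hypothetical helper: print the value of one metric from a QUAST report.tsv.
# Returns non-zero when the metric name is not present.
quast_metric() {
    local report="$1" metric="$2"
    awk -F '\t' -v m="$metric" '
        $1 == m { print $2; found = 1; exit }   # first column holds the metric name
        END { if (!found) exit 1 }
    ' "$report"
}
```

For example, `quast_metric report.tsv N50` prints the N50 of the first assembly column; further assemblies in the same report would be in columns 3 onward.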

"""
flye \\
    $mode \\
    $reads \\
    --out-dir . \\
    --threads $task.cpus \\
    $args

gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
mv assembly_info.txt ${prefix}.assembly_info.txt
mv flye.log ${prefix}.flye.log
mv params.json ${prefix}.params.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    flye: \$( flye --version )
END_VERSIONS
"""

NextFlow Flye From line 31 of flye/main.nf

"""
# Write each stub file first, then compress it (a pipe here would race with gzip reading the file)
echo stub > assembly.fasta && gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
echo stub > assembly_graph.gfa && gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
echo stub > assembly_graph.gv && gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
echo contig_1 > ${prefix}.assembly_info.txt
echo stub > ${prefix}.flye.log
echo stub > ${prefix}.params.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    flye: \$( flye --version )
END_VERSIONS
"""

NextFlow Flye From line 55 of flye/main.nf

"""
multiqc \\
    --force \\
    $args \\
    $config \\
    $extra_config \\
    .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""

NextFlow MultiQC From line 28 of multiqc/main.nf

"""
touch multiqc_data
touch multiqc_plots
touch multiqc_report.html

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""

NextFlow MultiQC From line 43 of multiqc/main.nf

"""
samtools \\
    stats \\
    --threads ${task.cpus} \\
    ${reference} \\
    ${input} \\
    > ${prefix}.stats

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""

NextFlow SAMtools From line 25 of stats/main.nf
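Each snippet ends by writing the tool version to versions.yml. The samtools variant first flattens the multi-line version banner with an unquoted `echo $(...)` and then trims it with sed. The sketch below reproduces that step on standard input so it can be tried without samtools installed. The function name is hypothetical, and the final `s/ *$//` expression is an addition here that removes the trailing space the original expression leaves behind.

```shell
# Hypothetical helper: reduce a samtools version banner (read from stdin)
# to the bare version number, following the sed logic in the snippet above.
samtools_version_from_banner() {
    # The unquoted $(cat) lets the shell join the banner lines with single spaces.
    echo $(cat) | sed 's/^.*samtools //; s/Using.*$//; s/ *$//'
}
```

Piping `samtools --version 2>&1` into the function should print only the version number.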

"""
touch ${prefix}.stats

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""