HiFi de novo genome assembly workflow used to analyse PacBio CCS reads

Version: Version 1 (public, created 1yr ago)

HiFi de novo genome assembly workflow

HiFi-assembly-workflow is a bioinformatics pipeline for de novo genome assembly from PacBio Circular Consensus Sequencing (CCS) reads. The workflow is implemented in Nextflow and has 3 major sections: pre-assembly quality control (preQC), assembly, and post-assembly quality control (postqc).

Please refer to the following documentation for a detailed description of each workflow section:

HiFi assembly workflow flowchart

Quick Usage:

The pipeline has been tested on NCI Gadi and the AGRF balder cluster. To run it on the AGRF cluster, please contact us at [email protected] . Running it on NCI Gadi requires an NCI account with access to Gadi; see the Gadi guidelines for account creation and usage: https://opus.nci.org.au/display/Help/Access

Here is an example that can be used to run a phased assembly on Gadi:

    module load nextflow/21.04.3
    nextflow run Hifi_assembly.nf --bam_folder -profile gadi

The workflow accepts 2 mandatory arguments:

  • --bam_folder -- Full path to the CCS bam files
  • -profile -- gadi/balder/local
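The two mandatory arguments can be checked before a run is launched. The following is a minimal sketch and not part of the workflow: the function name build_hifi_cmd and the path /scratch/project/ccs_bams are hypothetical, and the function only prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: validate the two mandatory arguments and print the
# nextflow command line instead of executing it.
build_hifi_cmd() {
    local bam_folder="$1" profile="$2"

    # --bam_folder must be a full (absolute) path
    case "$bam_folder" in
        /*) ;;
        *) echo "error: --bam_folder must be a full path" >&2; return 1 ;;
    esac

    # -profile must be one of the documented profiles
    case "$profile" in
        gadi|balder|local) ;;
        *) echo "error: -profile must be gadi, balder or local" >&2; return 1 ;;
    esac

    printf 'nextflow run Hifi_assembly.nf --bam_folder %s -profile %s\n' \
        "$bam_folder" "$profile"
}

# Example with a made-up folder; prints the command that would be run.
build_hifi_cmd /scratch/project/ccs_bams gadi
```

Printing the command first makes it easy to review on a login node before submitting anything to the cluster.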

General recommendations for using the HiFi de novo genome assembly workflow

exeReport

This folder contains a computation resource usage summary in various charts and a text file. report.html provides a comprehensive summary.

Results

The Results folder contains three sub-directories: preQC, assembly and postqc. As the names suggest, outputs from the respective workflow sections are placed in each of these folders.

preQC

The following table lists the files and folders in the preQC results:

Output folder/file   File Description
.fa                  Bam files converted to fasta format
kmer_analysis        Folder containing k-mer analysis outputs
.jf                  k-mer counts from each sample
.histo               Histogram of k-mer occurrence
genome_profiling     GenomeScope profiling outputs
summary.txt          Summary metrics of GenomeScope outputs
linear_plot.png      Plot showing the number of times a k-mer is observed against the number of k-mers with that coverage
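The .histo file can be inspected directly before opening the GenomeScope plots. The sketch below assumes the two-column layout written by Jellyfish's histo command (k-mer multiplicity, then the number of distinct k-mers with that multiplicity). The function name histo_peak, the default cut-off of 5 used to skip low-multiplicity error k-mers, and the sample numbers are all illustrative assumptions.

```bash
# Report the multiplicity with the highest k-mer count, ignoring bins
# below a minimum multiplicity (assumed default: 5).
histo_peak() {
    # $1: .histo file, $2: smallest multiplicity to consider
    awk -v min="${2:-5}" '
        BEGIN { best = -1 }
        ($1 + 0) >= (min + 0) && ($2 + 0) > best { best = $2 + 0; peak = $1 }
        END { print peak }
    ' "$1"
}

# Made-up histogram: error k-mers at low multiplicity, main peak near 20.
tmp_histo=$(mktemp)
printf '%s\n' '1 90000' '2 12000' '3 800' '10 400' \
              '19 2100' '20 2600' '21 2300' '30 150' > "$tmp_histo"
histo_peak "$tmp_histo"    # prints 20
rm -f "$tmp_histo"
```

The peak multiplicity is only a rough indication of k-mer coverage; the GenomeScope summary remains the reference output.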

Assembly

This folder contains the final assembly results as FASTA files:

  • _primary.fa - Fasta file containing primary contigs
  • _associate.fa - Fasta file containing associated contigs
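A quick sanity check on either FASTA file is to count the contigs and the total number of bases. This is a minimal sketch using awk only; the function name fasta_stats and the two-contig sample file are hypothetical.

```bash
# Count header lines (contigs) and sequence characters (bases) in a FASTA file.
fasta_stats() {
    awk '
        /^>/ { contigs++; next }          # header line: one more contig
        { bases += length($0) }           # sequence line: add its length
        END { printf "%d contigs, %.0f bases\n", contigs, bases }
    ' "$1"
}

# Made-up example file with two short contigs.
tmp_fa=$(mktemp)
printf '>ctg1\nACGTACGT\nACGT\n>ctg2\nGGCC\n' > "$tmp_fa"
fasta_stats "$tmp_fa"    # prints: 2 contigs, 16 bases
rm -f "$tmp_fa"
```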

postqc

The postqc folder contains two sub-folders:

  • assembly_completeness
  • assembly_evaluation

assembly_completeness

This folder contains BUSCO evaluation results for the primary and associate contigs.

assembly_evaluation

The assembly_evaluation folder contains outputs in several file formats. Here is a brief description of each:

File          Description
report.txt    Assessment summary in plain text format
report.tsv    Tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc)
report.tex    LaTeX version of the summary
icarus.html   Icarus main menu with links to interactive viewers
report.html   HTML version of the report with interactive plots inside

Infrastructure usage and recommendations

NCI facility access

A user account with NCI is required to access the Gadi high-performance computing facility. Setting up an NCI account is described in detail at the following URL: https://opus.nci.org.au/display/Help/Setting+up+your+NCI+Account

Code Snippets

"""
check_samplesheet.py \\
    $complete_samplesheet \\
    complete_samplesheet.valid.csv
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version | sed 's/Python //g')
END_VERSIONS
"""
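Each process in these snippets ends by writing a versions.yml file through a here-document. Outside Nextflow the same pattern looks like the sketch below. The function name write_versions, the process name SAMPLESHEET_CHECK and the version string "Python 3.9.5" are stand-ins (assumptions) for ${task.process} and the output of python --version.

```bash
# Emit a versions.yml entry for one process, stripping the "Python " prefix
# from the raw version string in the same way the sed call in the snippet does.
write_versions() {
    # $1: process name, $2: raw output of `python --version`
    local process="$1" raw="$2" version
    version=$(printf '%s\n' "$raw" | sed 's/Python //g')
    cat <<END_VERSIONS
"${process}":
    python: ${version}
END_VERSIONS
}

write_versions SAMPLESHEET_CHECK 'Python 3.9.5'
```

In the Nextflow script the here-document is opened with <<-, which strips leading tab characters so the body can be indented with tabs, and \$ is escaped so that the shell rather than Nextflow expands the command substitution.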
"""
# Nextflow changes the container --entrypoint to /bin/bash (container default entrypoint: /usr/local/env-execute)
# Check for container variable initialisation script and source it.
if [ -f "/usr/local/env-activate.sh" ]; then
    set +u  # Otherwise, errors out because of various unbound variables
    . "/usr/local/env-activate.sh"
    set -u
fi

# If the augustus config directory is not writable, then copy to writeable area
if [ ! -w "\${AUGUSTUS_CONFIG_PATH}" ]; then
    # Create writable tmp directory for augustus
    AUG_CONF_DIR=\$( mktemp -d -p \$PWD )
    cp -r \$AUGUSTUS_CONFIG_PATH/* \$AUG_CONF_DIR
    export AUGUSTUS_CONFIG_PATH=\$AUG_CONF_DIR
    echo "New AUGUSTUS_CONFIG_PATH=\${AUGUSTUS_CONFIG_PATH}"
fi

# Ensure the input is uncompressed
INPUT_SEQS=input_seqs
mkdir "\$INPUT_SEQS"
cd "\$INPUT_SEQS"
for FASTA in ../tmp_input/*; do
    if [ "\${FASTA##*.}" == 'gz' ]; then
        gzip -cdf "\$FASTA" > \$( basename "\$FASTA" .gz )
    else
        ln -s "\$FASTA" .
    fi
done
cd ..

busco \\
    --cpu $task.cpus \\
    --in "\$INPUT_SEQS" \\
    --out ${prefix}-busco \\
    $busco_lineage \\
    $busco_lineage_dir \\
    $busco_config \\
    $args

# clean up
rm -rf "\$INPUT_SEQS"

# Move files to avoid staging/publishing issues
mv ${prefix}-busco/batch_summary.txt ${prefix}-busco.batch_summary.txt
mv ${prefix}-busco/*/short_summary.*.{json,txt} . || echo "Short summaries were not available: No genes were found."

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    busco: \$( busco --version 2>&1 | sed 's/^BUSCO //' )
END_VERSIONS
"""
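The input-staging loop in the BUSCO script decompresses gzipped FASTA files and symlinks everything else, so BUSCO always sees uncompressed input. The sketch below isolates that loop as a function so it can be tried without Nextflow. The function name stage_inputs and the file names are hypothetical, and the \$ escapes are dropped because this is plain shell rather than a Nextflow script string.

```bash
# Populate a directory with uncompressed views of every file in a source
# directory: *.gz files are decompressed, all other files are symlinked.
stage_inputs() {
    # $1: directory with input FASTA files, $2: directory to populate
    local src dest fasta
    src=$(cd "$1" && pwd) || return 1
    mkdir -p "$2" || return 1
    dest=$(cd "$2" && pwd) || return 1
    for fasta in "$src"/*; do
        [ -e "$fasta" ] || continue      # empty directory: nothing to stage
        if [ "${fasta##*.}" = 'gz' ]; then
            gzip -cd "$fasta" > "$dest/$(basename "$fasta" .gz)"
        else
            ln -s "$fasta" "$dest/"
        fi
    done
}

# Made-up inputs: one plain and one gzipped FASTA file.
work=$(mktemp -d)
mkdir "$work/tmp_input"
printf '>ctg1\nACGT\n' > "$work/tmp_input/a.fa"
printf '>ctg2\nGGCC\n' | gzip -c > "$work/tmp_input/b.fa.gz"
stage_inputs "$work/tmp_input" "$work/input_seqs"
ls "$work/input_seqs"
rm -rf "$work"
```

The expansion ${fasta##*.} removes everything up to the last dot, which is how the loop decides whether a file carries a gz extension.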
"""
quast.py \\
    --output-dir QUAST \\
    *.fasta.gz \\
    --threads $task.cpus \\
    $args

ln -s QUAST/report.tsv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    quast: \$(quast.py --version 2>&1 | sed 's/^.*QUAST v//; s/ .*\$//')
END_VERSIONS
"""
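The symlinked report.tsv is the easiest QUAST output to query from a script. The sketch below assumes the usual layout of that file: one metric per row, with the metric name in the first tab-separated column and one column per assembly after it. The function name quast_metric and the sample values are made up for illustration.

```bash
# Print the value of one metric for the first assembly in a QUAST report.tsv.
# Returns a non-zero status if the metric is not present.
quast_metric() {
    # $1: report.tsv, $2: metric name exactly as written in the first column
    awk -F '\t' -v metric="$2" '
        $1 == metric { print $2; found = 1 }
        END { exit !found }
    ' "$1"
}

# Made-up report with a single assembly column.
tmp_report=$(mktemp)
printf 'Assembly\tsample1.assembly\n# contigs\t42\nN50\t1500000\n' > "$tmp_report"
quast_metric "$tmp_report" N50    # prints 1500000
rm -f "$tmp_report"
```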
"""
flye \\
    $mode \\
    $reads \\
    --out-dir . \\
    --threads $task.cpus \\
    $args

gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
mv assembly_info.txt ${prefix}.assembly_info.txt
mv flye.log ${prefix}.flye.log
mv params.json ${prefix}.params.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    flye: \$( flye --version )
END_VERSIONS
"""
"""
echo stub > assembly.fasta && gzip -c assembly.fasta > ${prefix}.assembly.fasta.gz
echo stub > assembly_graph.gfa && gzip -c assembly_graph.gfa > ${prefix}.assembly_graph.gfa.gz
echo stub > assembly_graph.gv && gzip -c assembly_graph.gv > ${prefix}.assembly_graph.gv.gz
echo contig_1 > ${prefix}.assembly_info.txt
echo stub > ${prefix}.flye.log
echo stub > ${prefix}.params.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    flye: \$( flye --version )
END_VERSIONS
"""
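Stub outputs that are written and then compressed should be produced sequentially. In a pipe such as echo stub > f | gzip -c f, both commands start at the same time and nothing passes through the pipe, so gzip can open the file before it has been written. The sketch below shows the sequential form; the function name make_stub_gz and the prefix sample1 (standing in for ${prefix}) are hypothetical.

```bash
# Write a placeholder file, then compress a copy of it once the write is done.
make_stub_gz() {
    # $1: file to create, $2: compressed copy to create
    echo stub > "$1" && gzip -c "$1" > "$2"
}

workdir=$(mktemp -d)
prefix=sample1    # hypothetical stand-in for ${prefix}
make_stub_gz "$workdir/assembly.fasta" "$workdir/$prefix.assembly.fasta.gz"
gzip -cd "$workdir/$prefix.assembly.fasta.gz"    # prints: stub
rm -rf "$workdir"
```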
"""
multiqc \\
    --force \\
    $args \\
    $config \\
    $extra_config \\
    .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""
"""
touch multiqc_data
touch multiqc_plots
touch multiqc_report.html

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""
"""
samtools \\
    stats \\
    --threads ${task.cpus} \\
    ${reference} \\
    ${input} \\
    > ${prefix}.stats

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
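The ${prefix}.stats file written by this process is plain text, and its summary section can be read with standard tools. The sketch below assumes the SN (summary numbers) line layout of samtools stats: the tag SN, a label ending in a colon, and a value, separated by tabs. The function name stats_value and the sample numbers are hypothetical.

```bash
# Print the value of one summary-number (SN) entry from samtools stats output.
stats_value() {
    # $1: samtools stats output file, $2: label without the trailing colon
    awk -F '\t' -v label="$2:" '$1 == "SN" && $2 == label { print $3 }' "$1"
}

# Made-up stats excerpt with two SN lines.
tmp_stats=$(mktemp)
printf 'SN\traw total sequences:\t1200\t# excluding supplementary and secondary reads\nSN\taverage length:\t15000\n' > "$tmp_stats"
stats_value "$tmp_stats" 'average length'    # prints 15000
rm -f "$tmp_stats"
```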
"""
touch ${prefix}.stats

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
NextFlow From line 41 of stats/main.nf

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/Arcadia-Science/hifi2genome
Name: hifi-de-novo-genome-assembly-workflow
Version: Version 1
Copyright: Public Domain
License: None