A mini snakemake pipeline to batch construct consensus sequences from long nanopore amplicons with medaka for a set of samples
A mini snakemake pipeline for batch constructing consensus sequences from long nanopore amplicons with medaka for a set of samples.
The pipeline includes the following steps:

1. usearch -cluster_fast: clusters the read sequences to find the centroid with the largest cluster size (i.e. the draft sequence)
2. mini_align: aligns the reads to the draft/intermediate sequence
3. medaka consensus: runs the consensus algorithm across assembly regions
4. medaka stitch: collates the results from step 3 to create the consensus sequence
Steps 2-4 (wrapped up in medaka_polish.sh) are the medaka polishing steps, and are iterated for four rounds as advised by the medaka documentation. The snakemake workflow automatically installs the medaka package (defined in envs/medaka.yaml) in your working directory without requiring admin privileges. The final consensus from the 4th polishing round is renamed to *.final_consensus.fa.

At the end of the pipeline, the final consensus sequences of all samples are merged into a single FASTA file (all_samples.final_consensus.fa) in the output directory.
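Concretely, each polishing round inside medaka_polish.sh boils down to three commands, sketched below with illustrative file names, thread count, and model (see the full wrapper script under Code Snippets):

# One polishing round (of four), as performed inside medaka_polish.sh; paths, threads and model are illustrative.
mini_align -i reads.fastq -r previous_round.fasta -m -p previous_round -t 8
medaka consensus previous_round.bam round_i.hdf --model r941_min_sup_g507 --chunk_len 6000 --threads 8
medaka stitch round_i.hdf previous_round.fasta round_i.fasta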
Clone this repository
git clone https://github.com/ritatam/snakemake_ont_consensus_build.git
cd snakemake_ont_consensus_build
Dependencies
Snakemake and conda are required to run the workflow. The medaka package does not need to be pre-installed: it is created automatically from envs/medaka.yaml when running with --use-conda, and a USEARCH binary is bundled in the repository's external_program folder.
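The exact pinned contents of envs/medaka.yaml are not reproduced on this page; a minimal conda environment file for this purpose, assuming medaka is pulled from the bioconda channel, would look roughly like:

# hypothetical sketch; the version pinned in the repo's envs/medaka.yaml may differ
channels:
  - conda-forge
  - bioconda
dependencies:
  - medaka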
Configuration
Paths and parameters can be edited in config.yaml.

- input_reads_dir: Input directory containing all the *.fastq amplicon read files (e.g. s_cerevisiae.fastq, a_flavus.fastq, ...) to be processed simultaneously (via snakemake).
- output_dir: Output directory storing the draft and polished consensus sequences per sample (as subdirectories). All intermediate files are stored under <output_dir>/<sample>/polish/intermediate_files.
- usearch_path: Path to the USEARCH binary. Note: this is already configured to use the USEARCH binary in the repo's external_program folder, so users do not need to download it separately.
- usearch_id: Minimum sequence identity for USEARCH read clustering, ranging between 0.0 and 1.0.
- medaka_model: Medaka model for consensus polishing, based on the basecaller used. See the medaka documentation for details.
- threads: Number of threads to use in medaka polishing.
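For illustration, a config.yaml with these keys might look like the following; all values are placeholders to adjust to your own data, and the usearch_path shown assumes a binary name that may differ in the repository:

input_reads_dir: "data/reads"              # directory of per-sample *.fastq files
output_dir: "output"                       # per-sample results written here
usearch_path: "external_program/usearch"   # bundled USEARCH binary (name may differ)
usearch_id: 0.75                           # minimum identity for read clustering
medaka_model: "r941_min_sup_g507"          # match to the basecaller used
threads: 8                                 # threads for medaka polishing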
Executing the pipeline
Install the required conda environment without running the pipeline. Subsequent runs with the flag --use-conda will use the local environment stored in your working directory (.snakemake/conda) without requiring an internet connection.

snakemake --use-conda --conda-create-envs-only -c1

Dry-run the pipeline to print the commands and the corresponding input/output files, without running anything.

snakemake --use-conda -np

Run the pipeline and specify the number of cores to use (e.g. 8 cores):

snakemake --use-conda -c8
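As a rough guide to what to expect, the output directory should end up looking something like the sketch below; sample names are illustrative and the exact set of files may differ slightly:

output/
├── s_cerevisiae/
│   ├── s_cerevisiae.draft.fa             # draft from the largest USEARCH cluster
│   └── polish/
│       ├── s_cerevisiae.final_consensus.fa
│       └── intermediate_files/           # per-round alignments, .hdf and .fasta files
├── a_flavus/
│   └── ...
└── all_samples.final_consensus.fa        # merged consensus of all samples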
Code Snippets
medaka_polish.sh (the medaka polishing wrapper, steps 2-4):

set -eo pipefail

### FUNCTIONS ###

# prints help
function usage {
    echo -e "\nThis is a wrapper script that automates four rounds of medaka polishing workflow."
    echo -e "\nFlags:"
    echo -e "-r \t Input fastq file containing the nanopore reads."
    echo -e "-d \t Input fasta file containing the draft sequence to be polished. Required name format: <sample_name>.draft.fa"
    echo -e "-o \t Output path."
    echo -e "-t \t Number of threads."
    echo -e "-m \t Medaka model to use. See Medaka documentation for details."
    echo -e "\nExample usage: \n ./medaka_polish.sh -r reads.fastq -d draft.fa -o results/ -t 2 -m r941_min_sup_g507"
    exit 1
}

# parses command line flags
function getoptsfunction {
    local OPTIND
    while getopts ":r:d:o:t:m:h" flag; do
        case $flag in
            r) READS=$OPTARG ;;
            d) DRAFT=$OPTARG
               echo ""; echo "Draft sequence used: ${DRAFT}" ;;
            o) OUTPUT=$OPTARG ;;
            t) THREADS=$OPTARG ;;
            m) MODEL=$OPTARG ;;
            h) usage ;;
            :) echo "Argument missing from -${OPTARG} option" >&2; exit 1 ;;
            \?) echo "Invalid option: -${OPTARG}" >&2; exit 1 ;;
        esac
    done
    shift $(( OPTIND-1 ))   # tells getopts to move on to the next argument

    if [[ -z "$READS" ]]   ; then echo "Missing input read file (-r)." >&2; usage; fi
    if [[ -z "$DRAFT" ]]   ; then echo "Missing input draft sequence file (-d)." >&2; usage; fi
    if [[ -z "$OUTPUT" ]]  ; then echo "Missing output dir path (-o)." >&2; usage; fi
    if [[ -z "$THREADS" ]] ; then echo "Missing number of threads to use (-t)." >&2; usage; fi
    if [[ -z "$MODEL" ]]   ; then echo "Missing medaka model to use (-m)." >&2; usage; fi
}

getoptsfunction "$@"

# prints polishing round report
function polishing_round_report() {
    echo ""
    echo "Polishing round $1/4..."
    echo ""
}

### CODES ###

mkdir -p ${OUTPUT}

echo ""; echo "###### Starting medaka polishing ######"

# extract sample name from draft filename
SAMPLE=$( basename ${DRAFT} )
SAMPLE=${SAMPLE%.draft.fa}

# run four polishing rounds
for i in $( seq 1 4 ); do
    polishing_round_report $i

    # filename handling
    if [[ $i = 1 ]]   # round 1 uses the draft
    then
        mini_align_input=${DRAFT}
        prev_prefix=${OUTPUT}/${SAMPLE}.draft
    else              # rounds starting from 2 use the previous round's output
        prev=$( expr $i - 1 )
        mini_align_input=${OUTPUT}/${SAMPLE}.polish.round${prev}.fasta
        prev_prefix=${OUTPUT}/${SAMPLE}.polish.round${prev}
    fi
    PREFIX=${OUTPUT}/${SAMPLE}.polish.round$i

    # run polishing pipeline
    mini_align -i ${READS} -r ${mini_align_input} -m -p ${prev_prefix} -t ${THREADS}
    medaka consensus ${prev_prefix}.bam ${PREFIX}.hdf --model ${MODEL} --chunk_len 6000 --threads ${THREADS}
    medaka stitch ${PREFIX}.hdf ${mini_align_input} ${PREFIX}.fasta
done

# produce final consensus and store intermediate files
POLISH_ROUND4_FASTA=${OUTPUT}/${SAMPLE}.polish.round4.fasta
FINAL_CONSENSUS_FASTA=${OUTPUT}/${SAMPLE}.final_consensus.fa

if [[ -f $POLISH_ROUND4_FASTA ]]
then
    cp $POLISH_ROUND4_FASTA $FINAL_CONSENSUS_FASTA
    # change header from read id (centroid) to sample name
    sed -i "1s/^.*/>${SAMPLE}/" $FINAL_CONSENSUS_FASTA
    mkdir -p ${OUTPUT}/intermediate_files
    for file in ${OUTPUT}/*; do
        if ! { [[ "$file" == *intermediate_files ]] || [[ "$file" == *final* ]]; }
        then
            mv $file ${OUTPUT}/intermediate_files/.
        fi
    done
    echo ""
    echo "The final polished consensus sequence is written to: $FINAL_CONSENSUS_FASTA"
    echo ""
    echo "Medaka polishing done!"
else
    echo "Polishing round4 output does not exist. Please check ${SAMPLE}.medaka_polish.log in logs directory for errors."
fi
Snakefile shell command for read clustering and draft extraction (step 1):

shell:
    "export OMP_NUM_THREADS=10; "
    " {config[usearch_path]} -cluster_fast {input.reads} -id {config[usearch_id]} -strand both -centroids {output.centroids} -sizeout 2>&1 | tee {log} "
    ' && awk "/^>/ {{n++}} n>1 {{exit}} {{print}}" {output.centroids} > {output.draft} '
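The awk one-liner keeps only the first FASTA record of the centroids output (per the description above, the centroid of the largest cluster), which becomes the draft sequence. For context, this shell block would sit inside a rule roughly like the following; the rule name, wildcard handling, and file paths are hypothetical reconstructions rather than copies of the actual Snakefile:

# Hypothetical reconstruction of the clustering rule (step 1); names and paths are illustrative.
rule cluster_reads:
    input:
        reads=config["input_reads_dir"] + "/{sample}.fastq"
    output:
        centroids=config["output_dir"] + "/{sample}/{sample}.centroids.fa",
        draft=config["output_dir"] + "/{sample}/{sample}.draft.fa"
    log:
        "logs/{sample}.usearch.log"
    shell:
        "export OMP_NUM_THREADS=10; "
        " {config[usearch_path]} -cluster_fast {input.reads} -id {config[usearch_id]} -strand both -centroids {output.centroids} -sizeout 2>&1 | tee {log} "
        ' && awk "/^>/ {{n++}} n>1 {{exit}} {{print}}" {output.centroids} > {output.draft} '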
Snakefile shell command for medaka polishing (steps 2-4, calling medaka_polish.sh):

shell:
    "./medaka_polish.sh -r {input.reads} -d {input.draft} -o {output.polish_outdir} -t {threads} -m {config[medaka_model]} 2>&1 | tee {log} "
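A plausible rule around this command, again as a hypothetical reconstruction with illustrative names and paths; the output is the per-sample polish directory, and the log name matches the one the wrapper script points users to:

# Hypothetical reconstruction of the polishing rule (steps 2-4); names and paths are illustrative.
rule medaka_polish:
    input:
        reads=config["input_reads_dir"] + "/{sample}.fastq",
        draft=config["output_dir"] + "/{sample}/{sample}.draft.fa"
    output:
        polish_outdir=directory(config["output_dir"] + "/{sample}/polish")
    log:
        "logs/{sample}.medaka_polish.log"
    threads: config["threads"]
    conda:
        "envs/medaka.yaml"
    shell:
        "./medaka_polish.sh -r {input.reads} -d {input.draft} -o {output.polish_outdir} -t {threads} -m {config[medaka_model]} 2>&1 | tee {log} "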
Snakefile shell command for merging the per-sample final consensus sequences:

shell:
    "cat {params} > {output.merged}"
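Here {params} presumably expands to the space-separated list of per-sample final consensus files, so the rule amounts to concatenating them into all_samples.final_consensus.fa. A hypothetical reconstruction, with the sample discovery and paths purely illustrative:

# Hypothetical reconstruction of the merging rule; SAMPLES and paths are illustrative.
SAMPLES, = glob_wildcards(config["input_reads_dir"] + "/{sample}.fastq")

rule merge_consensus:
    input:
        expand(config["output_dir"] + "/{sample}/polish/{sample}.final_consensus.fa", sample=SAMPLES)
    output:
        merged=config["output_dir"] + "/all_samples.final_consensus.fa"
    params:
        lambda wildcards, input: " ".join(input)   # space-separated list of the input files
    shell:
        "cat {params} > {output.merged}"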