A mini snakemake pipeline to batch construct consensus sequences from long nanopore amplicons w/ medaka for a set of samples

public public 1yr ago 0 bookmarks

A mini snakemake pipeline for batch constructing consensus sequences from long nanopore amplicons with medaka for a set of samples.

The pipeline includes the following steps:

  1. usearch -cluster_fast for clustering read sequences to find centroid with the largest cluster size (i.e. draft sequence)

  2. mini_align for aligning the reads to a draft/intermediate sequence

  3. medaka consensus for running consensus algorithm across assembly regions

  4. medaka stitch for collating results from step 3 to create consensus sequence

Steps 2-4 (wrapped up in medaka_polish.sh ) are the medaka polishing steps, and will be iterated for four rounds as advised by medaka documentation. This snakemake script will automatically install medaka package (defined in envs/medaka.yaml ) in your working directory without requiring admin priviledges. The final consensus from the 4th polishing round will be renamed as *.final_consensus.fa.

At the end of pipeline, the final consensus sequences of all samples will be merged into a single FASTA file ( all_samples.final_consensus.fa ) in the output directory.

Clone this repository

git clone https://github.com/ritatam/snakemake_ont_consensus_build.git
cd snakemake_ont_consensus_build

Dependencies

Configuration

Paths and parameters can be edited in config.yaml .

  1. input_reads_dir : Input directory containing all the *.fastq amplicon read files (e.g. s_cerevisiae.fastq, a_flavus.fastq, ...) to be processed simultaneously (via snakemake)

  2. output_dir : Output directory to store draft and polished consensus sequences per sample (as subdirectories). All the intermediate files will be stored in the output sample/polish/intermediate_files directory.

  3. usearch_path : Path to the USEARCH binary. Note: This has been already configured to use the USEARCH binary in the repo's external_program folder, so user will not need to download it independently.

  4. usearch_id : Minimum sequence identity for USEARCH read clustering, ranging between 0.0 and 1.0.

  5. medaka_model : Medaka model for consensus polishing, based on the basecaller used. See medaka documentation for details.

  6. threads : Number of threads to use in medaka polishing.

Executing the pipeline

Install the required conda environemnt without running the pipeline. Subsequent runs with flag --use-conda will utilise the local environment stored in your working directory ( /.snakemake/conda ) without requiring internet connection.

snakemake --use-conda --conda-create-envs-only -c1

Dry-run the pipeline to print the commands and the corresponding i/o files, without running the pipeline.

snakemake --use-conda -np

Run the pipeline and specify number of cores to use (e.g. 8 cores)

snakemake --use-conda -c8

Code Snippets

  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
set -eo pipefail

### FUNCTIONS ###

# prints help function
function usage {
	echo -e "\nThis is a wrapper script that automates four rounds of medaka polishing workflow." 
	echo -e "\nFlags:" 
	echo -e "-r \t Input fastq file containing the nanopore reads."
	echo -e "-d \t Input fasta file containing the draft sequence to be polished. Required name format: <sample_name>.draft.fa"
	echo -e "-o \t Output path."
	echo -e "-t \t Number of threads."
	echo -e "-m \t Medaka model to use. See Medaka documentation for details."
	echo -e "\nExample usage: \n ./medaka_polish.sh -r reads.fastq -d draft.fa -o results/ -t 2 -m r941_min_sup_g507"
	exit 1
}

# parses command line flags
function getoptsfunction {
	local OPTIND
	while getopts ":r:d:o:t:m:h" flag; do
		case $flag in
			r) READS=$OPTARG;;
			d) DRAFT=$OPTARG 
				echo ""; echo "Draft sequence used: ${DRAFT}" ;;
			o) OUTPUT=$OPTARG ;;
			t) THREADS=$OPTARG ;;
			m) MODEL=$OPTARG ;;
			h) usage ;;
			:) echo "Argument missing from -${OPTARG} option" >&2; exit 1 ;;
			\?) echo "Invalid option: -${OPTARG}" >&2; exit 1 ;;
		esac
	done
	shift $(( OPTIND-1 )) # Tells getopts to move onto the next argument

	if [[ -z "$READS" ]] ; then echo "Missing input read file (-r)."; usage; >&2; exit 1; fi
	if [[ -z "$DRAFT" ]] ; then echo "Missing input draft sequence file (-d)."; usage; >&2; exit 1; fi
	if [[ -z "$OUTPUT" ]] ; then echo "Missing output dir path (-o)."; usage; >&2; exit 1; fi
	if [[ -z "$THREADS" ]] ; then echo "Missing number of threads to use (-t)."; usage; >&2; exit 1; fi
	if [[ -z "$MODEL" ]] ; then echo "Missing medaka model to use (-m)."; usage; >&2; exit 1; fi

}

getoptsfunction "$@"

# prints polishing round report
function polishing_round_report() {
	echo ""
	echo "Polishing round $1/4..."
	echo ""
} 


### CODES ###

mkdir -p ${OUTPUT}
echo "" ; echo "###### Starting medaka polishing ######"

# extract sample name from draft filename
SAMPLE=$( basename ${DRAFT} )
SAMPLE=${SAMPLE%.draft.fa}

# run four polishing rounds
for i in $( seq 1 4 ); do

	polishing_round_report $i

	# filenames handling
	if [[ $i = 1 ]] # round 1 uses draft
	then
		mini_align_input=${DRAFT}
		prev_prefix=${OUTPUT}/${SAMPLE}.draft
	else # rounds starting from 2 use previous round
		prev=$( expr $i - 1 )
		mini_align_input=${OUTPUT}/${SAMPLE}.polish.round${prev}.fasta
		prev_prefix=${OUTPUT}/${SAMPLE}.polish.round${prev}
	fi
	PREFIX=${OUTPUT}/${SAMPLE}.polish.round$i

	# run polishing pipeline
	mini_align -i ${READS} -r ${mini_align_input} -m -p ${prev_prefix} -t ${THREADS}
	medaka consensus ${prev_prefix}.bam ${PREFIX}.hdf --model ${MODEL} --chunk_len 6000 --threads ${THREADS}
	medaka stitch ${PREFIX}.hdf ${mini_align_input} ${PREFIX}.fasta

done


# produce final consensus and store intermediate files
POLISH_ROUND4_FASTA=${OUTPUT}/${SAMPLE}.polish.round4.fasta
FINAL_CONSENSUS_FASTA=${OUTPUT}/${SAMPLE}.final_consensus.fa


if [[ -f $POLISH_ROUND4_FASTA ]]

then
	cp $POLISH_ROUND4_FASTA $FINAL_CONSENSUS_FASTA

	# change header from read id (centroid) to sample name
	sed -i "1s/^.*/>${SAMPLE}/" $FINAL_CONSENSUS_FASTA

	mkdir -p ${OUTPUT}/intermediate_files
	for file in ${OUTPUT}/*; do
		if ! { [[ "$file" == *intermediate_files ]] || [[ "$file" == *final* ]]; }
		then
			mv $file ${OUTPUT}/intermediate_files/.
		fi
	done

	echo ""
	echo "The final polished consensus sequence is written to: $FINAL_CONSENSUS_FASTA"
	echo ""
	echo "Medaka polishing done!"

else
	echo "Polishing round4 output does not exist. Please check ${SAMPLE}.medaka_polish.log in logs directory for errors."

fi
40
41
42
43
shell:
    "export OMP_NUM_THREADS=10; "
    " {config[usearch_path]} -cluster_fast {input.reads} -id {config[usearch_id]} -strand both -centroids {output.centroids} -sizeout 2>&1 | tee {log} "
    ' && awk "/^>/ {{n++}} n>1 {{exit}} {{print}}" {output.centroids} > {output.draft} '
SnakeMake From line 40 of main/Snakefile
60
61
shell:
    "./medaka_polish.sh -r {input.reads} -d {input.draft} -o {output.polish_outdir} -t {threads} -m {config[medaka_model]} 2>&1 | tee {log} "
SnakeMake From line 60 of main/Snakefile
72
73
shell:
    "cat {params} > {output.merged}"
SnakeMake From line 72 of main/Snakefile
ShowHide 3 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/ritatam/snakemake_ont_consensus_build
Name: snakemake_ont_consensus_build
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...