A mini snakemake pipeline to batch construct consensus sequences from long nanopore amplicons with medaka for a set of samples
A mini snakemake pipeline for batch constructing consensus sequences from long nanopore amplicons with medaka for a set of samples.
The pipeline includes the following steps:

1. usearch -cluster_fast: clusters the read sequences to find the centroid with the largest cluster size (i.e. the draft sequence)
2. mini_align: aligns the reads to the draft/intermediate sequence
3. medaka consensus: runs the consensus algorithm across assembly regions
4. medaka stitch: collates the results from step 3 to create the consensus sequence
Steps 2-4 (wrapped up in medaka_polish.sh) are the medaka polishing steps, and are iterated for four rounds as advised by the medaka documentation. The snakemake workflow automatically installs the medaka package (defined in envs/medaka.yaml) in your working directory without requiring admin privileges. The final consensus from the 4th polishing round is renamed to *.final_consensus.fa.

At the end of the pipeline, the final consensus sequences of all samples are merged into a single FASTA file (all_samples.final_consensus.fa) in the output directory.
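Concretely, each polishing round inside medaka_polish.sh boils down to three commands, sketched below with illustrative file names, thread count, and model (see the full wrapper script under Code Snippets):

# One polishing round (of four), as performed inside medaka_polish.sh; paths, threads and model are illustrative.
mini_align -i reads.fastq -r previous_round.fasta -m -p previous_round -t 8
medaka consensus previous_round.bam round_i.hdf --model r941_min_sup_g507 --chunk_len 6000 --threads 8
medaka stitch round_i.hdf previous_round.fasta round_i.fasta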
Clone this repository
git clone https://github.com/ritatam/snakemake_ont_consensus_build.git
cd snakemake_ont_consensus_build
Dependencies
Snakemake and conda are required to run the workflow. The medaka package does not need to be pre-installed: it is created automatically from envs/medaka.yaml when running with --use-conda, and a USEARCH binary is bundled in the repository's external_program folder.
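The exact pinned contents of envs/medaka.yaml are not reproduced on this page; a minimal conda environment file for this purpose, assuming medaka is pulled from the bioconda channel, would look roughly like:

# hypothetical sketch; the version pinned in the repo's envs/medaka.yaml may differ
channels:
  - conda-forge
  - bioconda
dependencies:
  - medaka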
Configuration
Paths and parameters can be edited in config.yaml.

- input_reads_dir: Input directory containing all the *.fastq amplicon read files (e.g. s_cerevisiae.fastq, a_flavus.fastq, ...) to be processed simultaneously (via snakemake).
- output_dir: Output directory storing the draft and polished consensus sequences per sample (as subdirectories). All intermediate files are stored under <output_dir>/<sample>/polish/intermediate_files.
- usearch_path: Path to the USEARCH binary. Note: this is already configured to use the USEARCH binary in the repo's external_program folder, so users do not need to download it separately.
- usearch_id: Minimum sequence identity for USEARCH read clustering, ranging between 0.0 and 1.0.
- medaka_model: Medaka model for consensus polishing, based on the basecaller used. See the medaka documentation for details.
- threads: Number of threads to use in medaka polishing.
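For illustration, a config.yaml with these keys might look like the following; all values are placeholders to adjust to your own data, and the usearch_path shown assumes a binary name that may differ in the repository:

input_reads_dir: "data/reads"              # directory of per-sample *.fastq files
output_dir: "output"                       # per-sample results written here
usearch_path: "external_program/usearch"   # bundled USEARCH binary (name may differ)
usearch_id: 0.75                           # minimum identity for read clustering
medaka_model: "r941_min_sup_g507"          # match to the basecaller used
threads: 8                                 # threads for medaka polishing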
Executing the pipeline
Install the required conda environment without running the pipeline. Subsequent runs with the flag --use-conda will use the local environment stored in your working directory (.snakemake/conda) without requiring an internet connection.

snakemake --use-conda --conda-create-envs-only -c1

Dry-run the pipeline to print the commands and the corresponding input/output files, without running anything.

snakemake --use-conda -np

Run the pipeline and specify the number of cores to use (e.g. 8 cores):

snakemake --use-conda -c8
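As a rough guide to what to expect, the output directory should end up looking something like the sketch below; sample names are illustrative and the exact set of files may differ slightly:

output/
├── s_cerevisiae/
│   ├── s_cerevisiae.draft.fa             # draft from the largest USEARCH cluster
│   └── polish/
│       ├── s_cerevisiae.final_consensus.fa
│       └── intermediate_files/           # per-round alignments, .hdf and .fasta files
├── a_flavus/
│   └── ...
└── all_samples.final_consensus.fa        # merged consensus of all samples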
Code Snippets
medaka_polish.sh (the medaka polishing wrapper, steps 2-4):

set -eo pipefail

### FUNCTIONS ###

# prints help
function usage {
    echo -e "\nThis is a wrapper script that automates four rounds of medaka polishing workflow."
    echo -e "\nFlags:"
    echo -e "-r \t Input fastq file containing the nanopore reads."
    echo -e "-d \t Input fasta file containing the draft sequence to be polished. Required name format: <sample_name>.draft.fa"
    echo -e "-o \t Output path."
    echo -e "-t \t Number of threads."
    echo -e "-m \t Medaka model to use. See Medaka documentation for details."
    echo -e "\nExample usage: \n ./medaka_polish.sh -r reads.fastq -d draft.fa -o results/ -t 2 -m r941_min_sup_g507"
    exit 1
}

# parses command line flags
function getoptsfunction {
    local OPTIND
    while getopts ":r:d:o:t:m:h" flag; do
        case $flag in
            r) READS=$OPTARG ;;
            d) DRAFT=$OPTARG
               echo ""; echo "Draft sequence used: ${DRAFT}" ;;
            o) OUTPUT=$OPTARG ;;
            t) THREADS=$OPTARG ;;
            m) MODEL=$OPTARG ;;
            h) usage ;;
            :) echo "Argument missing from -${OPTARG} option" >&2; exit 1 ;;
            \?) echo "Invalid option: -${OPTARG}" >&2; exit 1 ;;
        esac
    done
    shift $(( OPTIND-1 ))   # tells getopts to move on to the next argument

    if [[ -z "$READS" ]]   ; then echo "Missing input read file (-r)." >&2; usage; fi
    if [[ -z "$DRAFT" ]]   ; then echo "Missing input draft sequence file (-d)." >&2; usage; fi
    if [[ -z "$OUTPUT" ]]  ; then echo "Missing output dir path (-o)." >&2; usage; fi
    if [[ -z "$THREADS" ]] ; then echo "Missing number of threads to use (-t)." >&2; usage; fi
    if [[ -z "$MODEL" ]]   ; then echo "Missing medaka model to use (-m)." >&2; usage; fi
}

getoptsfunction "$@"

# prints polishing round report
function polishing_round_report() {
    echo ""
    echo "Polishing round $1/4..."
    echo ""
}

### CODES ###

mkdir -p ${OUTPUT}

echo ""; echo "###### Starting medaka polishing ######"

# extract sample name from draft filename
SAMPLE=$( basename ${DRAFT} )
SAMPLE=${SAMPLE%.draft.fa}

# run four polishing rounds
for i in $( seq 1 4 ); do
    polishing_round_report $i

    # filename handling
    if [[ $i = 1 ]]   # round 1 uses the draft
    then
        mini_align_input=${DRAFT}
        prev_prefix=${OUTPUT}/${SAMPLE}.draft
    else              # rounds starting from 2 use the previous round's output
        prev=$( expr $i - 1 )
        mini_align_input=${OUTPUT}/${SAMPLE}.polish.round${prev}.fasta
        prev_prefix=${OUTPUT}/${SAMPLE}.polish.round${prev}
    fi
    PREFIX=${OUTPUT}/${SAMPLE}.polish.round$i

    # run polishing pipeline
    mini_align -i ${READS} -r ${mini_align_input} -m -p ${prev_prefix} -t ${THREADS}
    medaka consensus ${prev_prefix}.bam ${PREFIX}.hdf --model ${MODEL} --chunk_len 6000 --threads ${THREADS}
    medaka stitch ${PREFIX}.hdf ${mini_align_input} ${PREFIX}.fasta
done

# produce final consensus and store intermediate files
POLISH_ROUND4_FASTA=${OUTPUT}/${SAMPLE}.polish.round4.fasta
FINAL_CONSENSUS_FASTA=${OUTPUT}/${SAMPLE}.final_consensus.fa

if [[ -f $POLISH_ROUND4_FASTA ]]
then
    cp $POLISH_ROUND4_FASTA $FINAL_CONSENSUS_FASTA
    # change header from read id (centroid) to sample name
    sed -i "1s/^.*/>${SAMPLE}/" $FINAL_CONSENSUS_FASTA
    mkdir -p ${OUTPUT}/intermediate_files
    for file in ${OUTPUT}/*; do
        if ! { [[ "$file" == *intermediate_files ]] || [[ "$file" == *final* ]]; }
        then
            mv $file ${OUTPUT}/intermediate_files/.
        fi
    done
    echo ""
    echo "The final polished consensus sequence is written to: $FINAL_CONSENSUS_FASTA"
    echo ""
    echo "Medaka polishing done!"
else
    echo "Polishing round4 output does not exist. Please check ${SAMPLE}.medaka_polish.log in logs directory for errors."
fi
Snakefile shell command for read clustering and draft extraction (step 1):

shell:
    "export OMP_NUM_THREADS=10; "
    " {config[usearch_path]} -cluster_fast {input.reads} -id {config[usearch_id]} -strand both -centroids {output.centroids} -sizeout 2>&1 | tee {log} "
    ' && awk "/^>/ {{n++}} n>1 {{exit}} {{print}}" {output.centroids} > {output.draft} '
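The awk one-liner keeps only the first FASTA record of the centroids output (per the description above, the centroid of the largest cluster), which becomes the draft sequence. For context, this shell block would sit inside a rule roughly like the following; the rule name, wildcard handling, and file paths are hypothetical reconstructions rather than copies of the actual Snakefile:

# Hypothetical reconstruction of the clustering rule (step 1); names and paths are illustrative.
rule cluster_reads:
    input:
        reads=config["input_reads_dir"] + "/{sample}.fastq"
    output:
        centroids=config["output_dir"] + "/{sample}/{sample}.centroids.fa",
        draft=config["output_dir"] + "/{sample}/{sample}.draft.fa"
    log:
        "logs/{sample}.usearch.log"
    shell:
        "export OMP_NUM_THREADS=10; "
        " {config[usearch_path]} -cluster_fast {input.reads} -id {config[usearch_id]} -strand both -centroids {output.centroids} -sizeout 2>&1 | tee {log} "
        ' && awk "/^>/ {{n++}} n>1 {{exit}} {{print}}" {output.centroids} > {output.draft} '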
Snakefile shell command for medaka polishing (steps 2-4, calling medaka_polish.sh):

shell:
    "./medaka_polish.sh -r {input.reads} -d {input.draft} -o {output.polish_outdir} -t {threads} -m {config[medaka_model]} 2>&1 | tee {log} "
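A plausible rule around this command, again as a hypothetical reconstruction with illustrative names and paths; the output is the per-sample polish directory, and the log name matches the one the wrapper script points users to:

# Hypothetical reconstruction of the polishing rule (steps 2-4); names and paths are illustrative.
rule medaka_polish:
    input:
        reads=config["input_reads_dir"] + "/{sample}.fastq",
        draft=config["output_dir"] + "/{sample}/{sample}.draft.fa"
    output:
        polish_outdir=directory(config["output_dir"] + "/{sample}/polish")
    log:
        "logs/{sample}.medaka_polish.log"
    threads: config["threads"]
    conda:
        "envs/medaka.yaml"
    shell:
        "./medaka_polish.sh -r {input.reads} -d {input.draft} -o {output.polish_outdir} -t {threads} -m {config[medaka_model]} 2>&1 | tee {log} "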
Snakefile shell command for merging the per-sample final consensus sequences:

shell:
    "cat {params} > {output.merged}"
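Here {params} presumably expands to the space-separated list of per-sample final consensus files, so the rule amounts to concatenating them into all_samples.final_consensus.fa. A hypothetical reconstruction, with the sample discovery and paths purely illustrative:

# Hypothetical reconstruction of the merging rule; SAMPLES and paths are illustrative.
SAMPLES, = glob_wildcards(config["input_reads_dir"] + "/{sample}.fastq")

rule merge_consensus:
    input:
        expand(config["output_dir"] + "/{sample}/polish/{sample}.final_consensus.fa", sample=SAMPLES)
    output:
        merged=config["output_dir"] + "/all_samples.final_consensus.fa"
    params:
        lambda wildcards, input: " ".join(input)   # space-separated list of the input files
    shell:
        "cat {params} > {output.merged}"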