Snakemake Pipeline for Automating the Use of the Bioinformatics Tool RVHaplo

public 1yr ago Version: 2 0 bookmarks


This Snakemake pipeline automates the use of the bioinformatics tool RVHaplo (https://github.com/dhcai21/RVHaplo.git).

Code Snippets

from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read the alignment file

alignment = AlignIO.read(snakemake.input[0], "fasta")
print(alignment)

# Calculate the distance matrix

calculator = DistanceCalculator('identity')
distance_Matrix = calculator.get_distance(alignment)
print(distance_Matrix)

# Create a DistanceTreeConstructor object

constructor = DistanceTreeConstructor()

# Construct the phylogenetic tree using the neighbor-joining (NJ) algorithm

NJ_tree = constructor.nj(distance_Matrix)

# Draw the phylogenetic tree as ASCII art in the terminal

Phylo.draw_ascii(NJ_tree)

# Write tree in new file
Phylo.write(NJ_tree, snakemake.output[0], "newick")

Python Biopython From line 1 of python_script/tree.py
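
The 'identity' model used in the script above scores a pair of aligned sequences by the fraction of alignment columns in which they differ. A minimal pure-Python sketch of that idea follows, assuming equal-length aligned strings and counting any column that contains a gap as a mismatch. Biopython's own gap handling may differ in detail, and the function name `identity_distance` is made up for illustration.

```python
def identity_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of alignment columns where two aligned sequences differ.

    Hypothetical helper for illustration; columns containing a gap
    are counted as mismatches (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        return 1.0  # no columns to compare: treat as maximally distant
    matches = sum(a == b and a != gap for a, b in zip(seq_a, seq_b))
    return 1.0 - matches / len(seq_a)


if __name__ == "__main__":
    # One mismatch in four columns
    print(identity_distance("ACGT", "ACGA"))
```

A matrix of such pairwise values is the kind of input a neighbor-joining constructor consumes.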

file_sam=""
file_ref=""
### optional arguments
file_path='./result'
prefix="rvhaplo"
mq=0
thread=8
error_rate=0.1
signi_level=0.05
cond_pro=0.65
fre_snv=0.8
num_read_1=10
num_read_2=5
gap=15
smallest_snv=20
only_snv=0
ovlap_read=5
weight_read=0.85
mcl_inflation=2
lar_cluster=50
ovlap_cluster=10
depth=5
weight_cluster=0.8
abundance=0.005
s_pos=1
e_pos=10000000000


function help_info() {
	echo "Usage: $0 -i alignment.sam -r ref_genome.fasta [options]"
	echo ""
	echo "RVHaplo: Reconstructing viral haplotypes using long reads"
	echo ""
	echo "Author: Dehan CAI"
	echo "Date:   May 2022"
	echo "Version 2: Support multi-thread processing; Use a C package of MCL; Cost less memory"
	echo ""
	echo "    -i | --input:                     alignment file (sam)"
	echo "    -r | --ref_genome:                reference genome (fasta)"
	echo ""
	echo "    Options:"
	echo "    -o  | --out:                      Path where to output the results. (default:./result)"
	echo "    -p  | --prefix STR:               Prefix of output file. (default: rvhaplo)"
	echo "    -t  | --thread INT:               Number of CPU cores for multiprocessing. (default:8)"
	echo "    -e  | --error_rate FLOAT:         Sequencing error rate. (default: 0.1)"
	echo "    -mq | --map_qual INT:             Smallest mapping quality for reads . (default:0)"
	echo "    -s  | --signi_level FLOAT:        Significance level for binomial tests. (default: 0.05)"
	echo "    -c  | --cond_pro FLOAT:           A threshold of the maximum conditional probability for SNV sites. (default: 0.65)"
	echo "    -f  | --fre_snv FLOAT:            The most dominant base's frequency at a to-be-verified site should be >= fre_snv. (default: 0.80)"
	echo "    -n1 | --num_read_1 INT:           Minimum # of reads for calculating the conditional probability given one conditional site. (default:10)"
	echo "    -n2 | --num_read_2 INT:           Minimum # of reads for calculating the conditional probability given more than one conditional sites. (default: 5)"
	echo "    -g  | --gap INT:                  Minimum length of gap between SNV sites for calculating the conditional probability. (default:15)"
	echo "    -ss | --smallest_snv INT:         Minimum # of SNV sites for haplotype construction. (default:20)"
	echo "    -os | --only_snv (0 or 1) :       Only output the SNV sites without running the haplotype reconstruction part. (default: 0)"
	echo "    -or | --ovlap_read INT:           Minimum length of overlap for creating edges between two reads in the read graph. (default: 5)"
	echo "    -wr | --weight_read FLOAT:        Minimum weights of edges in the read graph. (default: 0.85)"
	echo "    -m  | --mcl_inflaction FLOAT:     Inflation of MCL algorithm. (default: 2)"
	echo "    -l  | --lar_cluster INT:          A threshold for separating clusters into two groups based on sizes of clusters. (default: 50)"
	echo "    -oc | --ovlap_cluster INT:        A parameter related to the minimum overlap between consensus sequences. (default: 10)"
	echo "    -d  | --depth INT:                Depth limitation for consensus sequences generated from clusters. (default:5) "
	echo "    -wc | --weight_cluster FLOAT:     Minimum weights between clusters in the hierarchical clustering. (default: 0.8)"
	echo "    -sp | --start_pos INT:            Starting position for generating consensus sequences (default: 1)"
	echo "    -ep | --end_pos INT:              Ending position for generating consensus sequences. (default: 1e10)"
	echo "    -a  | --abundance FLOAT:          A threshold for filtering low-abundance haplotypes. (default: 0.005)"
	echo "    -h  | --help :                    Print help message."
	echo ""
	echo "    For further details of above arguments, please refer to https://github.com/dhcai21/RVHaplo   "
	echo ""
	exit 1
}

if [[ "$1" == "" ]];then
	help_info
	exit 1
fi

while [[ "$1" != "" ]]; do
	case "$1" in
		-h | --help ) ## print help message
		help_info
		exit 1
		;;
		-i | --input ) ### input sam file
		case "$2" in 
		"" )
			echo "Error: no sam file as input"
			exit 1
			;;
		*)
			file_sam="$2"
			if [[ "${file_sam:0:1}" == "-" ]]
			then
				echo "Error: no sam file as input"
				exit 1
			fi
			shift 2
			;;
		esac
		;;
		-r | --ref_genome) ### input reference genome
		case "$2" in 
		"")
			echo "Error: no fasta file as input"
			exit 1
			;;
		*)
			file_ref="$2"
			if [[ "${file_ref:0:1}" == "-" ]]
			then
				echo "Error: no fasta file as input"
				exit 1
			fi
			shift 2
			;;
		esac
		;;
		-o | --out )  ### output path
		case "$2" in 
		"" )
			echo "Error: no output path"
			exit 1
			;;
		*)
			file_path="$2"
			if [[ "${file_path:0:1}" == "-" ]]
			then
				echo "Error: no output path"
				exit 1
			fi
			shift 2
			;;
		esac
		;;
		-p | --prefix )  ### prefix
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			prefix="$2"
			shift 2
			;;
		esac
		;;
		-mq | --map_qual )  ### mapping quality
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			mq="$2"
			shift 2
			;;
		esac
		;;
		-t | --thread )  ### threads
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			thread="$2"
			shift 2
			;;
		esac
		;;
		-e | --error_rate )  ### error_rate
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			error_rate="$2"
			shift 2
			;;
		esac
		;;
		-s | --signi_level )  ### significance_level
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			signi_level="$2"
			shift 2
			;;
		esac
		;;
		-c | --cond_pro )  ### conditional_probability
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			cond_pro="$2"
			shift 2
			;;
		esac
		;;
		-f | --fre_snv )  ### determine the set of to-be-verified SNV sites
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			fre_snv="$2"
			shift 2
			;;
		esac
		;;
		-n1 | --num_read_1 )  ### number of reads for p(ai|aj)
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			num_read_1="$2"
			shift 2
			;;
		esac
		;;
		-n2 | --num_read_2 )  ### number of reads for p(ai|aj1,aj2,...,ajp)
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			num_read_2="$2"
			shift 2
			;;
		esac
		;;
		-g | --gap )  ### Minimum distance between SNV sites
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			gap="$2"
			shift 2
			;;
		esac
		;;
		-ss | --smallest_snv )  ### Minimum number of SNV sites for haplotype reconstruction
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			smallest_snv="$2"
			shift 2
			;;
		esac
		;;
		-os | --only_snv )  ### Only output the SNV sites without running the haplotype reconstruction part.
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			only_snv="$2"
			shift 2
			;;
		esac
		;;
		-or | --ovlap_read )  ### overlap_read
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			ovlap_read="$2"
			shift 2
			;;
		esac
		;;
		-wr | --weight_read )  ### weight_read
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			weight_read="$2"
			shift 2
			;;
		esac
		;;
		-m | --mcl_inflaction )  ### inflaction of MCL
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			mcl_inflation="$2"
			shift 2
			;;
		esac
		;;
		-oc | --ovlap_cluster )  ### overlap_cluster
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			ovlap_cluster="$2"
			shift 2
			;;
		esac
		;;
		-wc | --weight_cluster )  ### weight_cluster
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			weight_cluster="$2"
			shift 2
			;;
		esac
		;;
		-d | --depth )  ### depth limitation
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			depth="$2"
			shift 2
			;;
		esac
		;;
		-l | --lar_cluster )  ### large cluster size
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			lar_cluster="$2"
			shift 2
			;;
		esac
		;;
		-sp | --start_pos )  ### start_pos
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			s_pos="$2"
			shift 2
			;;
		esac
		;;
		-ep | --end_pos )  ### end_pos
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			e_pos="$2"
			shift 2
			;;
		esac
		;;
		-a | --abundance )  ### smallest abundance
		case "$2" in 
		"" )
			echo "Error: no input for $1"
			exit 1
			;;
		*)
			abundance="$2"
			shift 2
			;;
		esac
		;;
		*)
			echo "Error: unknown parameter $1"
			exit 1
	esac
done

if [[ "$file_sam" == "" ]];then
	echo "Error: no sam file input"
	exit 1
fi

if [[ "$file_ref" == "" ]];then
	echo "Error: no reference genome input"
	exit 1
fi

if [[ ${file_path:0-1:1} == "/" ]];then
	path_len=$(( ${#file_path} - 1 ))
	file_prefix=$file_path$prefix
	file_path=${file_path:0:path_len}
else
	file_prefix=$file_path"/"$prefix
fi

##########  count nucleotide occurrence  ##########
echo "count nucleotide occurrence"
if [[ "$file_path" != "." ]];then
	rm -rf "$file_path"
	mkdir -p "$file_path"
fi
rm -rf $file_path"/alignment"
mkdir $file_path"/alignment"
file_len=`expr ${#file_sam}-4`
unique_sam=$file_path"/alignment/"$prefix".sam"
samtools view -h -F 0x900 -q $mq $file_sam > $unique_sam
file_bam=$file_path"/alignment/"$prefix".bam"
samtools view -b -S $unique_sam > $file_bam
rm $unique_sam
file_bam_sorted=$file_path"/alignment/"$prefix"_sorted.bam"
samtools sort $file_bam -o $file_bam_sorted
samtools index $file_bam_sorted
file_acgt=$file_prefix"_acgt.txt"
python ./src/count_frequency.py $file_bam_sorted $file_acgt

########## two binomial tests  ##########
echo "SNV detection"
file_snv=$file_prefix"_snv.txt"
python ./src/two_binomial.py $error_rate $signi_level $file_acgt $file_snv $thread $s_pos $e_pos

## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
if [[ $size != "0" ]];then
	exit 1
fi

## maximum conditional probability and construct reads graph
python ./src/mcp_read_graph.py $file_bam_sorted $file_snv $cond_pro $smallest_snv $num_read_1 $num_read_2 $gap \
	$weight_read $ovlap_read $file_prefix $fre_snv $thread $only_snv

## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
if [[ $size != "0" ]];then
	exit 1
fi

if [[ $only_snv != 0 ]];then
	exit 1
fi

# MCL clustering
echo "MCL clustering"
mcxload -abc $file_prefix"_reads_graph.txt" --stream-mirror --write-binary -o $file_prefix"_reads_graph.mci" -write-tab $file_prefix"_reads_graph.tab"
rm $file_prefix"_reads_graph.txt"
mcl $file_prefix"_reads_graph.mci" -te $thread -I $mcl_inflation -l 1 -L 100 -o $file_prefix"_mcl_result.icl"
rm $file_prefix"_reads_graph.mci"
mcxdump -icl $file_prefix"_mcl_result.icl" -o $file_prefix"_reads_cluster.txt" -tabr $file_prefix"_reads_graph.tab"
rm $file_prefix"_mcl_result.icl"
rm $file_prefix"_reads_graph.tab"

## hierarchical clustering
echo "hierarchical clustering"
python ./src/hierarchical_cluster.py $file_prefix"_matrix.pickle" $lar_cluster $depth \
	$ovlap_cluster $weight_cluster $abundance $file_prefix

## reconstruct haplotypes
rm -rf $file_path"/clusters"
mkdir $file_path"/clusters"

echo "haplotypes reconstruction"

python ./src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt $file_ref \
	$file_prefix"_consensus.fasta" $s_pos $e_pos

echo "haplotypes polishment (medaka)"
python ./src/extract_reads.py $file_path $prefix
python ./src/run_medaka.py $file_path $prefix

rm $file_prefix"_matrix.pickle"
rm $file_prefix"_reads_cluster.txt"
rm $file_prefix"_clusters.pickle"
rm -rf $file_path/medaka/fastx
echo "complete reconstructing haplotypes"

exit 0

Shell SAMtools Consensus MCL From line 3 of main/rvhaplo.sh
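
The `samtools view -h -F 0x900 -q $mq` call in the script above drops every record whose FLAG has any bit of the mask 0x900 set. In the SAM specification those bits mark secondary (0x100) and supplementary (0x800) alignments, so at most the primary record of each read is kept. The `-q` option additionally skips records with a mapping quality below the threshold. The sketch below reproduces that test on plain integers; the function name `is_kept` and the example records are hypothetical.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment
EXCLUDE_MASK = SECONDARY | SUPPLEMENTARY  # equals 0x900


def is_kept(flag: int, mapq: int, min_mapq: int = 0) -> bool:
    """Mimic `samtools view -F 0x900 -q min_mapq` for a single record."""
    return (flag & EXCLUDE_MASK) == 0 and mapq >= min_mapq


if __name__ == "__main__":
    # (FLAG, MAPQ): primary forward, primary reverse, secondary, supplementary
    records = [(0, 60), (16, 30), (256, 0), (2048, 60)]
    print([is_kept(flag, mapq) for flag, mapq in records])
```

With the script's default of `mq=0`, the mapping-quality test never rejects a record, so only the flag mask matters.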

shell:
    "pycoQC --summary_file {input} --html_outfile {output}"

SnakeMake pycoQC From line 81 of main/Snakefile

shell:
    "seqtk seq -a {input.fastq} > {output.conv_fasta}"

SnakeMake seqtk From line 91 of main/Snakefile
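
`seqtk seq -a` converts FASTQ to FASTA by keeping each record's header and sequence and discarding the quality string. A rough Python equivalent is sketched below. It assumes strict four-line FASTQ records with no wrapped sequences, which real tools do not require, and `fastq_to_fasta` is an illustrative name rather than part of the pipeline.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) from four-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, plus, _quality = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            yield ">" + header[1:]  # swap the '@' marker for '>'
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


if __name__ == "__main__":
    fastq = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    print("\n".join(fastq_to_fasta(fastq)))
```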

shell:
    "seqkit seq -g -m {params.filter} {input} > {output}"

SnakeMake seqkit From line 103 of main/Snakefile
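
In the `seqkit seq` rule above, `-g` removes gap characters from each sequence and `-m` keeps only sequences at least `{params.filter}` bases long. The sketch below mirrors that on in-memory (name, sequence) pairs. It assumes gaps are `-` or `.` and that length is measured after gap removal; both are assumptions of this sketch, so check the seqkit documentation for the exact behavior. The name `filter_by_min_length` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

GAP_CHARS = "-."  # assumed gap letters for this sketch


def filter_by_min_length(
    records: Iterable[Tuple[str, str]], min_len: int
) -> Iterator[Tuple[str, str]]:
    """Strip gap characters, then keep sequences with length >= min_len."""
    strip_gaps = str.maketrans("", "", GAP_CHARS)
    for name, sequence in records:
        ungapped = sequence.translate(strip_gaps)
        if len(ungapped) >= min_len:
            yield name, ungapped


if __name__ == "__main__":
    reads = [("a", "AC-GT"), ("b", "A--C"), ("c", "ACGTAC")]
    for name, sequence in filter_by_min_length(reads, 4):
        print(f">{name}\n{sequence}")
```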

shell:
    "bwa index {input}"

SnakeMake BWA From line 113 of main/Snakefile

shell:
    "bwa mem -t {threads} {input.reference} {input.fasta} > {output}"

SnakeMake BWA From line 127 of main/Snakefile

shell:
    "./rvhaplo.sh -i {input.sam} -r {input.reference} -o {params.outdir}_sup{filter_reads} -t {threads} || true" if filter_reads != 0
    else "./rvhaplo.sh -i {input.sam} -r {input.reference} -o {params.outdir}_allreads -t {threads} || true"

SnakeMake From line 143 of main/Snakefile
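
The rule above picks the RVHaplo output directory from `filter_reads`: a non-zero value gives `{params.outdir}_sup{filter_reads}`, and zero gives `{params.outdir}_allreads`. The same choice can be written as a plain Python helper that builds the argument list. The name `rvhaplo_command` and the example paths are hypothetical; the flags `-i`, `-r`, `-o` and `-t` are those accepted by rvhaplo.sh.

```python
from typing import List


def rvhaplo_command(
    sam: str, reference: str, outdir: str, threads: int, filter_reads: int
) -> List[str]:
    """Build the rvhaplo.sh argument list for one sample."""
    # Non-zero length filter -> "_sup<N>" suffix, otherwise "_allreads"
    suffix = f"_sup{filter_reads}" if filter_reads != 0 else "_allreads"
    return [
        "./rvhaplo.sh",
        "-i", sam,
        "-r", reference,
        "-o", outdir + suffix,
        "-t", str(threads),
    ]


if __name__ == "__main__":
    cmd = rvhaplo_command("reads.sam", "ref.fasta", "results/sample", 8, 1000)
    print(" ".join(cmd))
```

Passing such a list to `subprocess.run(cmd, check=False)` would ignore a non-zero exit status, much as the `|| true` in the rule does.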

run:
    with open(input.all_refs_file) as file:
        refs = file.read()
    with open(input.all_CP_refs) as file:
        refs_CP = file.read()

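    # Read the haplotypes once: a second file.read() on the same handle
    # would return an empty string.
    haplo_files = input.haplotypes
    with open(haplo_files) as file:
        haplotypes = file.read()
    haplo1 = haplotypes + refs
    haplo2 = haplotypes + refs_CP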

    merge_file_1 = output.haplo_refs
    with open(merge_file_1, "w") as file:
        file.write(haplo1)

    merge_file_2 = output.haplo_refs_CP
    with open(merge_file_2, "w") as file:
        file.write(haplo2)

SnakeMake From line 158 of main/Snakefile
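
The `run:` block above concatenates the reconstructed haplotypes with each of two reference sets before alignment. A standalone sketch of that merge as a reusable function, outside Snakemake, is given below. The name `merge_fasta` and the newline guard between files are additions of this sketch, not part of the original rule.

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def merge_fasta(sources: Iterable[PathLike], destination: PathLike) -> None:
    """Concatenate FASTA files, ensuring each one ends with a newline."""
    with open(destination, "w") as out:
        for source in sources:
            text = Path(source).read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                # Avoid gluing the next header onto the previous sequence
                out.write("\n")
```

Calling it once per reference set, for example `merge_fasta([haplotypes, all_refs], haplo_refs)`, would produce the two merged files the alignment step expects; those argument names are placeholders.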

shell:
    """
    muscle3.8.31_i86linux64 -in {input.haplo_refs} -out {output.aln}
    muscle3.8.31_i86linux64 -in {input.haplo_refs_CP} -out {output.aln_CP}
    """