This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve. It is listed in the Snakemake Workflow Catalog where usage of standardized Snakemake workflows is described.
Local installation
- Clone this repo
git clone https://github.com/tlenfers/multispecies_mitochondrial_variant_analysis.git
cd multispecies_mitochondrial_variant_analysis
- Install dependencies
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# setup channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create & activate new env with installed deps
conda env create -n wf -f environment.yaml
source activate wf
Configuration
Config files :
-
config.yaml
- analysis-specific settings -
environment.yaml
- software dependencies and versions -
samples.tsv
- list of (paired) samples
Samples:
-
Put all sample names in a single column in
samples.tsv
. -
Add the data folder to
config.yaml
where all files to be analysed are located-
standard path is set to
data/
-
standard path is set to
-
assumed naming convention:
-
sampleName_R1.fastq.gz
-
sampleName_R2.fastq.gz
-
Reference:
To analyse dog, mouse or human samples the corresponding reference will be downloaded.
Define the to be analysed species in
config.yaml
under reference.
If you want to analyse a different species or use your own reference, enter the name of the file and it's path in
config.yaml
.
Execute the workflow
cd workflow
# 'dry' run only checks I/O files
snakemake -np
# To run mutlipecies variant analysis
snakemake -j n all --use-conda --use-singularity
# where n is the numer of cores to use
# To run human variant analysis with mutserve
snakemake -j n all_human --use-conda --use-singularity
# where n is the numer of cores to use
Output
All output files are put in
/results
and in their own subfolder regarding the used reference and caller.
The results are in a sub-folder corresponding to the name of the reference file used.
-
/results/calls_bcftools
contains all called variants using bcftools. The variants are normalized and saved assample_name.vcf.gz
. In addition, the filemergerd.vcf
is created in which all variants are merged together.- Example file without header:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/mapped/human/sample_name.bam
chrM 73 . A G 225.417 . DP=253;VDB=3.59147e-17;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,240,5;MQ=60 GT:PL 1:255,0
chrM 146 . T C 225.422 . DP=242;VDB=0.795672;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,165,52;MQ=60 GT:PL 1:255,0
-
/results/calls_mutserve
contains all called variants using mutserve. -
/results/mapped
contains all aligned reads assample_name.bam
ad their index filesample_name.bam.bai
. -
/results/plots
contains the created heatmap plots for the bctools caller. Example plots:-
The name of the samples is on the X-axis, the variants on the Y-axis
-
The values of the heatmap refer to the Phred-scaled likelihood for homomorphic reference allele (scale 0-255; 255: reference is very unlikely -> alternative more likely).
-
The plots of
alt_heatmap
are containing the Phred-scaled likelihood for homomorphic alternative allele, i.e. that the variant is present at this position (scale 0-255; 0: variant is present).
-
/results/sequences
contains the created consensus sequences for each sample in regard to the used reference in fasta format assample_name.fa
.
Snakedeploy usage
The usage of this workflow is described in the Snakemake Workflow Catalog .
# To run human variant analysis with mutserve
snakemake --cores all all_human --use-conda --use-singularity
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository.
Planned features
This pipeline is work in progress. The following features are planned:
-
mutserve caller:
-
extraction of common variants
-
plotting variant heatmaps based on variants called by mutserve
-
building the consensus sequence
-
-
additional quality control strep:
- remove samples with low coverage from further analysis
-
dynamic sizing of the heatmap based on samples and variants
Code Snippets
16 17 | shell: "bwa mem {input.ref} {input.fastq_r1} {input.fastq_r2} | samtools sort| samtools view -b > {output}" |
32 33 | shell: "samtools index {input}" |
46 47 | shell: "samtools faidx {input}" |
16 17 | shell: "bcftools merge -m none -O v {input} > {output}" |
30 31 | script: "../scripts/common_variants.R" |
47 48 | script: "../scripts/plot_variant_heatmap.R" |
14 15 | shell: "vcftools --gzvcf {input.vcf} --remove-indels --minQ 200.0 --recode --recode-INFO-all --stdout |bgzip > {output.vcf}; tabix -p vcf {output.vcf}" |
30 31 | shell: "bcftools consensus {input.vcf} < {input.ref} > {output.fa} " |
13 14 | shell: "bgzip -f {input}; tabix -f -p vcf {output}" |
7 8 9 10 11 12 13 14 15 16 17 18 19 | run: if config["reference"] == "mouse": shell( "wget -O- http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.MT.fa.gz | zcat > {output}" ) if config["reference"] == "dog": shell( "wget -O- 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=U96639.2&db=nuccore&report=fasta' > {output}" ) if config["reference"] == "human": shell( "wget -O- http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz |zcat > {output} && sed -i '1c\>chrM' {output}" ) |
38 39 | shell: "bwa index {input}" |
19 20 | shell: "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv --ploidy 1 -> {output}" |
34 35 | shell: "bcftools norm --fasta-ref {input.ref} --check-ref -m {input.vcf} | bcftools view -Ov -o {output}" |
23 24 | shell: "mutserve call --reference {input.ref} --output {output} {input.bam}" |
2 3 4 5 6 7 8 9 10 11 12 | x <- readLines(snakemake@input[['in_file']]) chr <- 0 for (i in 1:length(x)){ if (startsWith(x[i],"#CHROM")){ chr <- i } } x <- x[chr:length(x)] fileConn <- file(snakemake@output[['out']], "w") writeLines(x,fileConn) close(fileConn) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | library(pheatmap) library(RColorBrewer) in_file <- snakemake@input[[1]] out_file_1 <- snakemake@output[[1]] out_file_2 <- snakemake@output[[2]] out_file_3 <- snakemake@output[[3]] out_file_4 <- snakemake@output[[4]] variants <- read.table(in_file) first_line <- readLines(in_file) first_line <- strsplit(first_line,"\t")[[1]] colnames(variants) <- first_line variants$chr_pos_ref_alt <- paste(variants$`#CHROM`,variants$POS,variants$REF,variants$ALT,sep="_") variants[,c("INFO","ID","QUAL","FILTER","FORMAT","#CHROM","POS","REF","ALT")] <- NULL rownames(variants) <- variants$chr_pos_ref_alt variants$chr_pos_ref_alt <- NULL variants$N_CHR<- NULL variants[variants==".:."]<-510 alt_variants <- variants for(i in 1:ncol(variants)){ variants[,i] <-gsub("1:","",variants[,i]) variants[,i] <- gsub("\\d*,","",variants[,i]) variants[,i] <- as.numeric(variants[,i]) } for(i in 1:ncol(alt_variants)){ alt_variants[,i] <-gsub("1:","",alt_variants[,i]) alt_variants[,i] <- gsub(",\\d*","",alt_variants[,i]) alt_variants[,i] <- as.numeric(alt_variants[,i]) } names_variants <- colnames(variants) for(i in 1:length(names_variants)){ names_variants[i] <- gsub("\\w*\\/","",names_variants[i]) names_variants[i] <- gsub(".bam","",names_variants[i]) } colnames(variants) <- names_variants colnames(alt_variants) <- names_variants var_names <- colnames(variants) variants <- as.matrix(variants) my.breaks <- c(seq(0,255, by=25)) my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks))) plt_size <- 7 # heatmap with dendograms only for samples pdf(out_file_1, # File name width = 8, height = 7, # Width and height in inches bg = "white", # Background color paper = "A4") print(head(variants)) pheatmap(variants, cluster_rows = FALSE, legend=TRUE, cellwidth = plt_size, cellheight = plt_size, fontsize_row = plt_size, fontsize_col = plt_size, fontsize = plt_size, color=my.colors, breaks=my.breaks, ) dev.off() # heatmap with dendogramm for variants and dendogramm for samples pdf(out_file_2, # File name width = 8, height = 7, # Width and height in inches bg = "white", # Background color paper = "A4") pheatmap(variants, cluster_rows =TRUE, legend=TRUE, cellwidth = plt_size, cellheight = plt_size, fontsize_row = plt_size, fontsize_col = plt_size, fontsize = plt_size, color=my.colors, breaks=my.breaks ) dev.off() ############################################################################################ # heatmap for alternative variants ############################################################################################ var_names <- colnames(alt_variants) variants <- as.matrix(alt_variants) my.breaks <- c(seq(0,255, by=25)) my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks))) # heatmap with dendograms only for samples pdf(out_file_3, # File name width = 8, height = 7, # Width and height in inches bg = "white", # Background color paper = "A4") pheatmap(alt_variants, cluster_rows = FALSE, legend=TRUE, cellwidth = plt_size, cellheight = plt_size, fontsize_row = plt_size, fontsize_col = plt_size, fontsize = plt_size, color=my.colors, breaks=my.breaks, ) dev.off() # heatmap with dendogramm for variants and dendogramm for samples pdf(out_file_4, # File name width = 8, height = 7, # Width and height in inches bg = "white", # Background color paper = "A4") print(head(alt_variants)) pheatmap(alt_variants, cluster_rows =TRUE, legend=TRUE, cellwidth = plt_size, cellheight = plt_size, fontsize_row = plt_size, fontsize_col = plt_size, fontsize = plt_size, color=my.colors, breaks=my.breaks ) dev.off() |
Support
- Future updates