This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve.

public 10mo ago Version: v1.0.1 0 bookmarks

View Workflow

multispecies_mitochondrial_variant_analysis — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve. It is listed in the Snakemake Workflow Catalog where usage of standardized Snakemake workflows is described.

mt_analyses

Local installation

Clone this repo

git clone https://github.com/tlenfers/multispecies_mitochondrial_variant_analysis.git
cd multispecies_mitochondrial_variant_analysis

Install dependencies

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# setup channels 
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create & activate new env with installed deps
conda env create -n wf -f environment.yaml
source activate wf

Configuration

Config files :

config.yaml - analysis-specific settings
environment.yaml - software dependencies and versions
samples.tsv - list of (paired) samples

Samples:

Put all sample names in a single column in samples.tsv .
Add the data folder to config.yaml where all files to be analysed are located
- standard path is set to data/
assumed naming convention:
- sampleName_R1.fastq.gz
- sampleName_R2.fastq.gz

Reference:

To analyse dog, mouse or human samples the corresponding reference will be downloaded. Define the to be analysed species in config.yaml under reference.

If you want to analyse a different species or use your own reference, enter the name of the file and it's path in config.yaml .

Execute the workflow

cd workflow
# 'dry' run only checks I/O files
snakemake -np
# To run mutlipecies variant analysis
snakemake -j n all --use-conda --use-singularity
# where n is the numer of cores to use
# To run human variant analysis with mutserve
snakemake -j n all_human --use-conda --use-singularity
# where n is the numer of cores to use

Output

All output files are put in /results and in their own subfolder regarding the used reference and caller.
The results are in a sub-folder corresponding to the name of the reference file used.

/results/calls_bcftools contains all called variants using bcftools. The variants are normalized and saved as sample_name.vcf.gz . In addition, the file mergerd.vcf is created in which all variants are merged together.
- Example file without header:

 #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	results/mapped/human/sample_name.bam
 chrM	73	.	A	G	225.417	.	DP=253;VDB=3.59147e-17;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,240,5;MQ=60	GT:PL	1:255,0
 chrM	146	.	T	C	225.422	.	DP=242;VDB=0.795672;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,165,52;MQ=60	GT:PL	1:255,0

/results/calls_mutserve contains all called variants using mutserve.
/results/mapped contains all aligned reads as sample_name.bam ad their index file sample_name.bam.bai .
/results/plots contains the created heatmap plots for the bctools caller. Example plots:
- ref_heatmap.pdf
- ref_heatmap_clusterrow.pdf
- The name of the samples is on the X-axis, the variants on the Y-axis
- The values of the heatmap refer to the Phred-scaled likelihood for homomorphic reference allele (scale 0-255; 255: reference is very unlikely -> alternative more likely).
- The plots of alt_heatmap are containing the Phred-scaled likelihood for homomorphic alternative allele, i.e. that the variant is present at this position (scale 0-255; 0: variant is present).
/results/sequences contains the created consensus sequences for each sample in regard to the used reference in fasta format as sample_name.fa .

Snakedeploy usage

The usage of this workflow is described in the Snakemake Workflow Catalog .

# To run human variant analysis with mutserve
snakemake --cores all all_human --use-conda --use-singularity

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository.

Planned features

This pipeline is work in progress. The following features are planned:

mutserve caller:
- extraction of common variants
- plotting variant heatmaps based on variants called by mutserve
- building the consensus sequence
additional quality control strep:
- remove samples with low coverage from further analysis
dynamic sizing of the heatmap based on samples and variants

Code Snippets

shell:
    "bwa mem {input.ref} {input.fastq_r1} {input.fastq_r2} | samtools sort| samtools view -b > {output}"

SnakeMake SAMtools BWA From line 16 of rules/align.smk

shell:
    "samtools index {input}"

SnakeMake SAMtools From line 32 of rules/align.smk

shell:
    "samtools faidx {input}"

SnakeMake SAMtools From line 46 of rules/align.smk

shell:
    "bcftools merge -m none -O v {input} > {output}"

SnakeMake BCFtools From line 16 of rules/analysis.smk

script:
    "../scripts/common_variants.R"

SnakeMake From line 30 of rules/analysis.smk

script:
    "../scripts/plot_variant_heatmap.R"

SnakeMake From line 47 of rules/analysis.smk

shell:
    "vcftools --gzvcf {input.vcf} --remove-indels --minQ 200.0 --recode --recode-INFO-all --stdout |bgzip > {output.vcf}; tabix -p vcf {output.vcf}"

SnakeMake tabix VCFtools From line 14 of rules/consensus_sequence.smk

shell:
    "bcftools consensus {input.vcf} < {input.ref} > {output.fa} "

SnakeMake BCFtools Consensus From line 30 of rules/consensus_sequence.smk

shell:
    "bgzip -f {input}; tabix -f -p vcf {output}"

SnakeMake tabix From line 13 of rules/helper.smk

run:
    if config["reference"] == "mouse":
        shell(
            "wget -O- http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.MT.fa.gz | zcat > {output}"
        )
    if config["reference"] == "dog":
        shell(
            "wget -O- 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=U96639.2&db=nuccore&report=fasta' > {output}"
        )
    if config["reference"] == "human":
        shell(
            "wget -O- http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz |zcat > {output} && sed -i '1c\>chrM' {output}"
        )

SnakeMake From line 7 of rules/reference.smk

shell:
    "bwa index {input}"

SnakeMake BWA From line 38 of rules/reference.smk

shell:
    "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv --ploidy 1 -> {output}"

SnakeMake BCFtools From line 19 of rules/variants_bcftools.smk

shell:
    "bcftools norm  --fasta-ref {input.ref} --check-ref -m {input.vcf} | bcftools view -Ov -o {output}"

SnakeMake BCFtools From line 34 of rules/variants_bcftools.smk

shell:
    "mutserve call --reference {input.ref} --output {output} {input.bam}"

SnakeMake From line 23 of rules/variants_mutserve.smk

x <- readLines(snakemake@input[['in_file']])
chr <- 0
for (i in 1:length(x)){
  if (startsWith(x[i],"#CHROM")){
    chr <- i
  }
}
x <- x[chr:length(x)]
fileConn <- file(snakemake@output[['out']], "w")
writeLines(x,fileConn)
close(fileConn)

R From line 2 of scripts/common_variants.R

library(pheatmap)
library(RColorBrewer)

in_file <- snakemake@input[[1]]
out_file_1 <- snakemake@output[[1]]
out_file_2 <- snakemake@output[[2]]
out_file_3 <- snakemake@output[[3]]
out_file_4 <- snakemake@output[[4]]


variants <- read.table(in_file)
first_line <- readLines(in_file)
first_line <- strsplit(first_line,"\t")[[1]]
colnames(variants) <- first_line

variants$chr_pos_ref_alt <- paste(variants$`#CHROM`,variants$POS,variants$REF,variants$ALT,sep="_")
variants[,c("INFO","ID","QUAL","FILTER","FORMAT","#CHROM","POS","REF","ALT")] <- NULL


rownames(variants) <- variants$chr_pos_ref_alt
variants$chr_pos_ref_alt <- NULL
variants$N_CHR<- NULL

variants[variants==".:."]<-510

alt_variants <- variants

for(i in 1:ncol(variants)){
  variants[,i] <-gsub("1:","",variants[,i])
  variants[,i] <- gsub("\\d*,","",variants[,i])
  variants[,i] <- as.numeric(variants[,i])
}

for(i in 1:ncol(alt_variants)){
  alt_variants[,i] <-gsub("1:","",alt_variants[,i])
  alt_variants[,i] <- gsub(",\\d*","",alt_variants[,i])
  alt_variants[,i] <- as.numeric(alt_variants[,i])
}

names_variants <- colnames(variants)
for(i in 1:length(names_variants)){
  names_variants[i] <- gsub("\\w*\\/","",names_variants[i])
  names_variants[i] <- gsub(".bam","",names_variants[i])
}
colnames(variants) <- names_variants
colnames(alt_variants) <- names_variants



var_names <- colnames(variants)
variants <- as.matrix(variants)

my.breaks <- c(seq(0,255, by=25)) 
my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks)))

plt_size <- 7

# heatmap with dendograms only for samples
pdf(out_file_1,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")     
print(head(variants))
pheatmap(variants,
         cluster_rows = FALSE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks,
         )
dev.off()

# heatmap with dendogramm for variants and dendogramm for samples

pdf(out_file_2,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")     
pheatmap(variants,
         cluster_rows =TRUE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks
         )
dev.off()

############################################################################################
# heatmap for alternative variants
############################################################################################

var_names <- colnames(alt_variants)
variants <- as.matrix(alt_variants)

my.breaks <- c(seq(0,255, by=25))
my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks)))

# heatmap with dendograms only for samples
pdf(out_file_3,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")
pheatmap(alt_variants,         cluster_rows = FALSE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks,
)
dev.off()

# heatmap with dendogramm for variants and dendogramm for samples

pdf(out_file_4,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")
print(head(alt_variants))
pheatmap(alt_variants,
                cluster_rows =TRUE,
                legend=TRUE,
                cellwidth = plt_size,
                cellheight = plt_size,
                fontsize_row = plt_size,
                fontsize_col = plt_size,
                fontsize = plt_size,
                color=my.colors,
                breaks=my.breaks
)
dev.off()