This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve.

public public 7mo ago Version: v1.0.1 0 bookmarks

This workflow performs a variant analysis on mitochondrial genomes using the bcftools variant caller. For human samples, the workflow also performs a variant analysis using mutserve. It is listed in the Snakemake Workflow Catalog where usage of standardized Snakemake workflows is described.

mt_analyses

Local installation

  1. Clone this repo
git clone https://github.com/tlenfers/multispecies_mitochondrial_variant_analysis.git
cd multispecies_mitochondrial_variant_analysis
  1. Install dependencies
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# setup channels 
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create & activate new env with installed deps
conda env create -n wf -f environment.yaml
source activate wf

Configuration

Config files :

Samples:

  • Put all sample names in a single column in samples.tsv .

  • Add the data folder to config.yaml where all files to be analysed are located

    • standard path is set to data/
  • assumed naming convention:

    • sampleName_R1.fastq.gz

    • sampleName_R2.fastq.gz

Reference:

To analyse dog, mouse or human samples the corresponding reference will be downloaded. Define the to be analysed species in config.yaml under reference.

If you want to analyse a different species or use your own reference, enter the name of the file and it's path in config.yaml .

Execute the workflow

cd workflow
# 'dry' run only checks I/O files
snakemake -np
# To run mutlipecies variant analysis
snakemake -j n all --use-conda --use-singularity
# where n is the numer of cores to use
# To run human variant analysis with mutserve
snakemake -j n all_human --use-conda --use-singularity
# where n is the numer of cores to use

Output

All output files are put in /results and in their own subfolder regarding the used reference and caller.
The results are in a sub-folder corresponding to the name of the reference file used.

  • /results/calls_bcftools contains all called variants using bcftools. The variants are normalized and saved as sample_name.vcf.gz . In addition, the file mergerd.vcf is created in which all variants are merged together.

    • Example file without header:
 #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	results/mapped/human/sample_name.bam
 chrM	73	.	A	G	225.417	.	DP=253;VDB=3.59147e-17;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,240,5;MQ=60	GT:PL	1:255,0
 chrM	146	.	T	C	225.422	.	DP=242;VDB=0.795672;SGB=-0.693147;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,165,52;MQ=60	GT:PL	1:255,0
  • /results/calls_mutserve contains all called variants using mutserve.

  • /results/mapped contains all aligned reads as sample_name.bam ad their index file sample_name.bam.bai .

  • /results/plots contains the created heatmap plots for the bctools caller. Example plots:

    • ref_heatmap.pdf

    • ref_heatmap_clusterrow.pdf

    • The name of the samples is on the X-axis, the variants on the Y-axis

    • The values of the heatmap refer to the Phred-scaled likelihood for homomorphic reference allele (scale 0-255; 255: reference is very unlikely -> alternative more likely).

    • The plots of alt_heatmap are containing the Phred-scaled likelihood for homomorphic alternative allele, i.e. that the variant is present at this position (scale 0-255; 0: variant is present).

  • /results/sequences contains the created consensus sequences for each sample in regard to the used reference in fasta format as sample_name.fa .

Snakedeploy usage

The usage of this workflow is described in the Snakemake Workflow Catalog .

# To run human variant analysis with mutserve
snakemake --cores all all_human --use-conda --use-singularity 

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository.

Planned features

This pipeline is work in progress. The following features are planned:

  • mutserve caller:

    • extraction of common variants

    • plotting variant heatmaps based on variants called by mutserve

    • building the consensus sequence

  • additional quality control strep:

    • remove samples with low coverage from further analysis
  • dynamic sizing of the heatmap based on samples and variants

Code Snippets

16
17
shell:
    "bwa mem {input.ref} {input.fastq_r1} {input.fastq_r2} | samtools sort| samtools view -b > {output}"
32
33
shell:
    "samtools index {input}"
46
47
shell:
    "samtools faidx {input}"
16
17
shell:
    "bcftools merge -m none -O v {input} > {output}"
30
31
script:
    "../scripts/common_variants.R"
47
48
script:
    "../scripts/plot_variant_heatmap.R"
14
15
shell:
    "vcftools --gzvcf {input.vcf} --remove-indels --minQ 200.0 --recode --recode-INFO-all --stdout |bgzip > {output.vcf}; tabix -p vcf {output.vcf}"
30
31
shell:
    "bcftools consensus {input.vcf} < {input.ref} > {output.fa} "
13
14
shell:
    "bgzip -f {input}; tabix -f -p vcf {output}"
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
run:
    if config["reference"] == "mouse":
        shell(
            "wget -O- http://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.MT.fa.gz | zcat > {output}"
        )
    if config["reference"] == "dog":
        shell(
            "wget -O- 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=U96639.2&db=nuccore&report=fasta' > {output}"
        )
    if config["reference"] == "human":
        shell(
            "wget -O- http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz |zcat > {output} && sed -i '1c\>chrM' {output}"
        )
38
39
shell:
    "bwa index {input}"
19
20
shell:
    "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv --ploidy 1 -> {output}"
34
35
shell:
    "bcftools norm  --fasta-ref {input.ref} --check-ref -m {input.vcf} | bcftools view -Ov -o {output}"
23
24
shell:
    "mutserve call --reference {input.ref} --output {output} {input.bam}"
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
x <- readLines(snakemake@input[['in_file']])
chr <- 0
for (i in 1:length(x)){
  if (startsWith(x[i],"#CHROM")){
    chr <- i
  }
}
x <- x[chr:length(x)]
fileConn <- file(snakemake@output[['out']], "w")
writeLines(x,fileConn)
close(fileConn)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
library(pheatmap)
library(RColorBrewer)

in_file <- snakemake@input[[1]]
out_file_1 <- snakemake@output[[1]]
out_file_2 <- snakemake@output[[2]]
out_file_3 <- snakemake@output[[3]]
out_file_4 <- snakemake@output[[4]]


variants <- read.table(in_file)
first_line <- readLines(in_file)
first_line <- strsplit(first_line,"\t")[[1]]
colnames(variants) <- first_line

variants$chr_pos_ref_alt <- paste(variants$`#CHROM`,variants$POS,variants$REF,variants$ALT,sep="_")
variants[,c("INFO","ID","QUAL","FILTER","FORMAT","#CHROM","POS","REF","ALT")] <- NULL


rownames(variants) <- variants$chr_pos_ref_alt
variants$chr_pos_ref_alt <- NULL
variants$N_CHR<- NULL

variants[variants==".:."]<-510

alt_variants <- variants

for(i in 1:ncol(variants)){
  variants[,i] <-gsub("1:","",variants[,i])
  variants[,i] <- gsub("\\d*,","",variants[,i])
  variants[,i] <- as.numeric(variants[,i])
}

for(i in 1:ncol(alt_variants)){
  alt_variants[,i] <-gsub("1:","",alt_variants[,i])
  alt_variants[,i] <- gsub(",\\d*","",alt_variants[,i])
  alt_variants[,i] <- as.numeric(alt_variants[,i])
}

names_variants <- colnames(variants)
for(i in 1:length(names_variants)){
  names_variants[i] <- gsub("\\w*\\/","",names_variants[i])
  names_variants[i] <- gsub(".bam","",names_variants[i])
}
colnames(variants) <- names_variants
colnames(alt_variants) <- names_variants



var_names <- colnames(variants)
variants <- as.matrix(variants)

my.breaks <- c(seq(0,255, by=25)) 
my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks)))

plt_size <- 7

# heatmap with dendograms only for samples
pdf(out_file_1,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")     
print(head(variants))
pheatmap(variants,
         cluster_rows = FALSE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks,
         )
dev.off()

# heatmap with dendogramm for variants and dendogramm for samples

pdf(out_file_2,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")     
pheatmap(variants,
         cluster_rows =TRUE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks
         )
dev.off()

############################################################################################
# heatmap for alternative variants
############################################################################################

var_names <- colnames(alt_variants)
variants <- as.matrix(alt_variants)

my.breaks <- c(seq(0,255, by=25))
my.colors <- c(colorRampPalette(colors = c("red","grey","white"))(length(my.breaks)))

# heatmap with dendograms only for samples
pdf(out_file_3,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")
pheatmap(alt_variants,         cluster_rows = FALSE,
         legend=TRUE,
         cellwidth = plt_size,
         cellheight = plt_size,
         fontsize_row = plt_size,
         fontsize_col = plt_size,
         fontsize = plt_size,
         color=my.colors,
         breaks=my.breaks,
)
dev.off()

# heatmap with dendogramm for variants and dendogramm for samples

pdf(out_file_4,         # File name
    width = 8, height = 7, # Width and height in inches
    bg = "white",          # Background color
    paper = "A4")
print(head(alt_variants))
pheatmap(alt_variants,
                cluster_rows =TRUE,
                legend=TRUE,
                cellwidth = plt_size,
                cellheight = plt_size,
                fontsize_row = plt_size,
                fontsize_col = plt_size,
                fontsize = plt_size,
                color=my.colors,
                breaks=my.breaks
)
dev.off()
ShowHide 11 more snippets with no or duplicated tags.

Free

Created: 7mo ago
Updated: 7mo ago
Maitainers: public
URL: https://github.com/TLenfers/multispecies_mitochondrial_variant_analysis
Name: multispecies_mitochondrial_variant_analysis
Version: v1.0.1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...