Variant Annotation Workflow for Pearl Millet Using Snakemake and SnpEff

public public 1yr ago 0 bookmarks

annotate-millet-variants-snakeflow

This is a little Snakemake workflow to annotate some variants from pearl millet. This task came up during a bioinformatics course, BZ582A, at CSU in the spring of 2022. SnpEff is pretty easy to use when there is a precompiled data base for your organism. It is a little clunkier when you need to build a data base from a GFF. This shows one approach to that. (Note that it would probably be good to check the results by comparing them to some coding sequences, which we have not done here, yet).

The purpose of this little workflow is to demonstrate how a Snakefile can be put together for this task, and also to show how to use a SLURM profile derived from jdblischak’s smk-simple-slurm profile. The profile included in the repository at /hpcc-profiles/slurm-summit is tailored to the SUMMIT supercomputer.

Here is a DAG of this workflow, condensed using SnakemakeDagR :

To try this out on SUMMIT:

  1. From scompile clone this repository from GitHub into your scratch directory.
git clone https://github.com/eriqande/annotate-millet-variants-snakeflow.git
  1. Then, on the scompile node, first do a dry run:
cd annotate-millet-variants-snakeflow
conda activate snakemake
snakemake -np 

The output of that should include a jobs table that looks like this:

Job stats:
job count min threads max threads
------------------------ ------- ------------- -------------
add_to_snpeff_config 1 1 1
all 1 1 1
annotate_each_chromosome 7 1 1
build_snpeff_database 1 1 1
catenate_anno_chroms 1 1 1
convert_to_vcf 7 1 1
download_genome 1 1 1
download_genotype_readme 1 1 1
download_genotypes 7 1 1
download_gff 1 1 1
headerize_vcfs 7 1 1
make_fai 1 1 1
total 36 1 1
  1. Use this line to install the conda environments:
snakemake --use-conda --conda-create-envs-only --cores 1
  1. Now, we will just have Snakemake launch each job on SLURM using sbatch , by way of the slurm profile (except for the rules marked as localrules at the top of the snakefile—those rules, which are simple text manipulations or downloads, will be run locally on scompile .) Make sure you do this next step in a tmux session on scompile , because this has to keep running, even after you log off.
snakemake --profile hpcc-profiles/slurm-summit

While these things are running or are in the SLURM queue you can see them, by opening up another shell (with tmux, for example) and doing this::

squeue -u $(whoami) -o "%.12i %.9P %.50j %.10u %.2t %.15M %.6D %.18R %.5C %.12m"

If you need more space for the job names, change the 50 above to a larger number.

In the end, the annotated VCF file is in the directory results/vcf , and all the other intermediate (VCF) files have been deleted.

Additionally, the snpEff reports (html files and TSV files of genes) are all in the directory snpeff_reports .

A few things could be done better here. We don’t need to compress after reheadering, because it just gets decompressed to concat it. Also, it might be better to create the whole VCF before running it through snpEff, because then you get just single summary report. But, that would take a little longer, since you lose the parallelizability over chromosomes.

At any rate. This shows a simple Snakefile that has made this analysis easily reproducible.

Code Snippets

 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
GENOME_VERSION_NAME=$1
SPECIES_NAME_STRING=$2
OUTPUT=$3

if [ $# -ne 3 ]; then
	echo "expected 3 arguments" > /dev/stderr
	exit 1;
fi


# get the directory that the alias lives in
FIRSTDIR=$(dirname $(dirname $(which snpEff))) &&
# get the directory of the value of the alias
TMP=$(dirname $(readlink -s `which snpEff`)) &&
# combine those to get the directory where the config file is
SNPEFFDIR=${FIRSTDIR}${TMP/../} &&



mkdir -p resources/SnpEff  &&

# copy the config file
cp $SNPEFFDIR/snpEff.config  $OUTPUT  &&

# add an entry for pearl millet
echo "

# added by snakemake rule add_to_snpeff_config
$GENOME_VERSION_NAME.genome : $SPECIES_NAME_STRING

" >> $OUTPUT
23
24
shell:
	"wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
SnakeMake From line 23 of main/Snakefile
35
36
shell:
	"wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
SnakeMake From line 35 of main/Snakefile
47
48
49
shell:
	" (wget --no-check-certificate -O {output}.gz {params.url}; "
	" gunzip {output}.gz) > {log} 2>&1;"
SnakeMake From line 47 of main/Snakefile
60
61
62
shell:
	"(wget --no-check-certificate -O {output}.gz {params.url}; "
	" gunzip {output}.gz) > {log} 2>&1; "
SnakeMake From line 60 of main/Snakefile
76
77
78
79
80
81
82
83
	shell:
		"""
		(
  			awk 'NR>=5 {{printf("SAMPLE %s\\n", $2);}}' {input.readme}; 
  			echo xxxxxxxxxxxxxxxxx; 
  			zcat {input.geno}
		) | awk -f script/genotypes2vcf.awk > {output} 2>{log}
		"""
SnakeMake From line 76 of main/Snakefile
96
97
shell:
	" samtools faidx {input.fasta} 2> {log} "
111
112
113
shell:
	" bcftools reheader -f {input.fai} {input.vcf} | "
	" bcftools view -Oz -  > {output} 2> {log};"
128
129
shell:
	"./script/add_to_config.sh {params.genome_version} {params.species_name_string} {output} 2> {log}"
SnakeMake From line 128 of main/Snakefile
146
147
148
149
150
151
shell:
	"( mkdir -p {output} && "
	" cp resources/genome.fasta {output}/sequences.fa && "
	" cp resources/genome.gff {output}/genes.gff && "
	" snpEff build -Xmx4g  -noCheckCds -noCheckProtein -gff3 "
	"    -c resources/SnpEff/snpEff.config  -v {params.genome_version} ) > {log} 2>&1"
171
172
shell:
	"snpEff ann -s {output.html} -c {input.config}  {params.genome_version} {input.vcf} > {output.vcf} 2> {log} "
186
187
188
shell:
	" (bcftools concat {input} | bcftools view -Oz > {output} && "
	" bcftools index -t {output} ) 2> {log} "
ShowHide 8 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/eriqande/annotate-millet-variants-snakeflow
Name: annotate-millet-variants-snakeflow
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...