M. tuberculosis variant identification (deprecated 2023). Use: https://github.com/ksw9/mtb-vars


Snakemake pipeline for M. tuberculosis variant identification from short-read data: mapping with bwa, variant calling with GATK.

Usage

Clone repository from GitHub.

git clone https://github.com/ksw9/mtb-call.git

Navigate to your project root directory.

Create a conda environment with snakemake (installation instructions are in the Snakemake documentation).

module add anaconda/3_2022.05 # On Stanford's SCG.
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

Activate the snakemake environment.

source activate snakemake
snakemake --help

Download SnpEff for gene annotation.

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# Unzip file
unzip snpEff_latest_core.zip

Update the config file (config/config.yml) with the correct SnpEff path.

Download the updated M. tuberculosis annotations.

# Run from inside the unzipped snpEff directory.
java -jar snpEff.jar download Mycobacterium_tuberculosis_h37rv

Update the config file (config/config.yml) so that Snakemake uses the correct sample list as input. The test sample list is config/test_data.tsv.
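The exact columns of the sample list aren't spelled out here; the sketch below is a guess at a plausible layout (the column names and file name are assumptions — check config/test_data.tsv for the real format).

```shell
# Hypothetical sample sheet: one tab-separated row per sample.
# Columns (assumed): sequencing batch, sample name, paths to paired fastqs.
printf 'batch\tsample\tfq1\tfq2\n' > my_samples.tsv
printf 'test_data\ttest\ttest_data/test_R1.fastq.gz\ttest_data/test_R2.fastq.gz\n' >> my_samples.tsv
cat my_samples.tsv
```

Whatever the real columns are, config/config.yml is what tells snakemake which sample list to read.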

Create a directory to store cluster run outputs for troubleshooting.

mkdir logs

Do a dry run: list the snakemake jobs that have not yet completed, without running them.

snakemake -np

Running snakemake locally will stall at the Kraken step due to its memory requirements, so submit jobs to the cluster instead. The --use-conda flag tells snakemake to use the conda environments specified for each step.

nohup snakemake -j 5 -k --cluster-config config/cluster_config.yml --use-conda --rerun-triggers mtime --rerun-incomplete --cluster \
"sbatch -A {cluster.account} --mem={cluster.memory} -t {cluster.time} --cpus-per-task {threads} --error {cluster.error} --output {cluster.output} " \
> logs/snakemake_test_data.out &

Monitor jobs running on the cluster.

sq # site-specific alias; squeue -u $USER works on any Slurm cluster

Look at entire output once jobs are completed.

cat logs/snakemake_test_data.out

Each step also produces a log, for troubleshooting a specific step.

cat results/test_data/test/bams/test_bwa_H37Rv_map.log

If snakemake runs into an error or if a run is prematurely terminated, the directory will be locked.

snakemake --unlock

Example data

Example paired-end fastq files are in the test_data directory. The corresponding input sample list (.tsv) is at config/test_data.tsv.

Directory structure

Results will be organized in the results directory, in sub-directories organized by sequencing batch and then sample name.

├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   ├── envs
│   │   ├── bwa.yml
│   │   ├── gatk.yml
│   │   ├── kraken2.yml
│   │   ├── picard.yml
│   │   ├── quanttb.yml
│   │   ├── samtools.yml
│   │   ├── TBprofilerv4.3.0.yml
│   │   ├── trim-galore.yml
│   │   └── conda-meta
│   ├── scripts
│   │   ├── process_stanford_tb.sh
│   │   ├── run_kraken.sh
│   │   ├── run_quanttb.sh
│   │   ├── reads_output.sh
│   │   ├── map_reads.sh
│   │   ├── cov_stats.sh
│   │   ├── mykrobe_predict.sh
│   │   ├── call_varsgatk.sh
│   │   └── vcf2fasta.sh
│   ├── refs
│   │   ├── H37Rv.fa
│   │   ├── H37Rv_ppe.bed.gz
│   │   ├── ppe_hdr.txt
│   │   └── snpEff
│   └── Snakefile
├── config
│   ├── config.yml (run/user-specific parameters)
│   └── cluster_config.yml (cluster-specific parameters)
└── results
    └── test_data/test (example organized by sequencing batch, then sample)
        ├── trim
        ├── kraken
        ├── bams
        ├── vars
        ├── fasta
        └── stats

Code Snippets

shell: 
  '''
  # Add test if environment exists already, create if not. 
  existing_env=$(conda env list | awk '{{print $1}}' | grep {params.enviro})
  [ -z "$existing_env" ] && mamba env create -f {input}
  echo {params.enviro} > {output} 
  '''
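The create-if-missing idiom in this rule can be exercised on its own. In the sketch below (a toy, not the rule itself), `conda env list` is replaced by a hard-coded listing so it runs anywhere, and the grep is anchored to avoid substring matches.

```shell
# Simulated `conda env list` output: env name in column 1.
env_listing=$'base /opt/conda\nsnakemake /opt/conda/envs/snakemake'
# Same test as the rule: an empty result means the env is missing.
# `|| true` keeps a non-match from aborting under `set -e`.
existing_env=$(printf '%s\n' "$env_listing" | awk '{print $1}' | grep '^snakemake$' || true)
[ -z "$existing_env" ] && echo "would create env" || echo "env exists"
```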
shell:
  'trim_galore --basename {params.sample} --nextseq 20 --gzip --output_dir {params.output_dir} --paired {input.p1} {input.p2}'
shell:
  '{config[scripts_dir]}run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_report} &>> {log}'
shell:
  '{config[scripts_dir]}run_quanttb.sh {input.kr1} {input.kr2} {output.quanttb_out} &>> {log}'
shell:
  '''
  {config[scripts_dir]}map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}
  sambamba markdup -r -t {threads} --tmpdir={params.tmp_dir} {output.bam} {output.combined_bam}
  '''
shell:    
  '''
  {config[scripts_dir]}reads_output.sh {input.p1} {input.trim1} {input.kr1} {output.reads_stats} &>> {log}
  ''' 
shell:    
  '''
  {config[scripts_dir]}cov_stats.sh {input.ref_path} {input.bam} {output.cov_stats} &>> {log}

  #paste {input.kraken_stats} <(sed -n '7,8'p {output.cov_stats} ) > {output.combined_stats}

  # Combine with reads output
  #paste  -d ',' <( cat {output.combined_stats} | tr "\\t" "," ) {input.reads_stats} > {output.all_stats}

  # Combine reads output and coverage 
  paste  -d ',' {input.reads_stats}  <(sed -n '7,8'p {output.cov_stats} |  tr "\\t" "," ) > {output.combined_stats}

  paste -d ',' {output.combined_stats} {input.quanttb_stats}  <( cat {input.kraken_stats} | tr "\\t" ",") > {output.all_stats}

  # Mosdepth coverage along genome (for plotting)
  mosdepth --by 2000 -Q 30 {params.prefix} {input.bam}
  '''
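The paste/sed combination that assembles the combined stats can be illustrated on toy files (all file names and column values below are made up):

```shell
# Toy read-count CSV: a header line plus one row of values.
printf 'reads_in,reads_trimmed\n100,90\n' > reads_stats.csv
# Toy coverage table: 8 tab-separated rows, of which rows 7-8 matter.
for i in 1 2 3 4 5 6 7 8; do printf 'row%s\t%s\n' "$i" "$i"; done > cov_stats.tsv
# Pull lines 7-8 of the coverage table, convert tabs to commas,
# and paste them after the read stats, comma-delimited.
paste -d ',' reads_stats.csv <(sed -n '7,8p' cov_stats.tsv | tr '\t' ',') > combined_stats.csv
cat combined_stats.csv
```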
shell:
  "cat {input.combined_stats} > {output}"
shell: 
  """
  # Make sure tb-profiler reference chromosome names match the name used in your reference FASTA.
  tb-profiler update_tbdb --match_ref {params.ref_path}

  tb-profiler profile --bam {input.combined_bam} --prefix {params.samp} --dir {params.outdir} --csv --spoligotype &>> {log}
  mv {params.tmp_file} {output.profile}    

  """
shell: 
  "{config[scripts_dir]}mykrobe_predict.sh {input.combined_bam} {output.amr_out} &>> {log}"
shell: 
  "{config[scripts_dir]}call_vars_gatk.sh {input.ref_path} {input.combined_bam} {params.ploidy} {output.vcf} &>> {log}"
shell:
  '''
  {config[scripts_dir]}vcf2fasta.sh {input.ref_path} {params.sample_name} {input.unfilt_vcf} {params.bed} {output.fasta} {params.depth} {params.qual} &>> {log}
  '''
shell:
  '''
  # Rename Chromosome to be consistent with snpEff/Ensembl genomes.
  zcat {input.vcf}| sed 's/NC_000962.3/Chromosome/g' | bgzip > {output.rename_vcf}
  tabix {output.rename_vcf}

  # Run snpEff and then rename Chromosome.
  java -jar -Xmx8g {config[snpeff]} eff {config[snpeff_db]} {output.rename_vcf} -config {config[snpeff_configpath]} -dataDir {config[snpeff_datapath]} -noStats -no-downstream -no-upstream -canon | sed 's/Chromosome/NC_000962.3/g' > {output.tmp_vcf}

  # Also use bed file to annotate vcf, zip.
  bcftools annotate -a {params.bed} -h {params.vcf_header} -c CHROM,FROM,TO,FORMAT/PPE {output.tmp_vcf} | bgzip  > {output.ann_vcf}

  '''
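The chromosome rename round-trip in this rule can be sanity-checked on a single toy VCF data line (bgzip/tabix and snpEff itself are omitted from the illustration):

```shell
# One toy VCF data line using the RefSeq chromosome name.
printf 'NC_000962.3\t100\t.\tA\tG\t.\t.\t.\n' > toy.vcf
# Rename to the snpEff/Ensembl name, as done before annotation...
sed 's/NC_000962.3/Chromosome/g' toy.vcf > renamed.vcf
# ...then rename back, as done after annotation.
sed 's/Chromosome/NC_000962.3/g' renamed.vcf > restored.vcf
cat restored.vcf
```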

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/ksw9/mtb-call
Name: mtb-call
Version: v1
Copyright: Public Domain
License: None
