M. tuberculosis variant identification (deprecated 2023). Use: https://github.com/ksw9/mtb-vars


Snakemake pipeline for M. tuberculosis variant identification from short-read data: mapping with bwa, variant calling with GATK.

Usage

Clone repository from GitHub.

git clone https://github.com/ksw9/mtb-call.git

Navigate to your project root directory.

Create a conda environment with snakemake (installation instructions are in the Snakemake documentation).

module add anaconda/3_2022.05 # On Stanford's SCG.
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

Activate the snakemake environment.

source activate snakemake
snakemake --help

Download SnpEff for gene annotation.

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# Unzip file
unzip snpEff_latest_core.zip

Update the config file (config/config.yml) with the correct SnpEff path.

Download the updated M. tuberculosis annotations.

# Run from inside the unzipped snpEff directory.
java -jar snpEff.jar download Mycobacterium_tuberculosis_h37rv

Update the config file (config/config.yml) so that Snakemake uses the correct sample list as input. The test sample list is config/test_data.tsv.
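The exact columns of the sample list aren't spelled out here; the sketch below is a guess at a plausible layout (the column names and file name are assumptions — check config/test_data.tsv for the real format).

```shell
# Hypothetical sample sheet: one tab-separated row per sample.
# Columns (assumed): sequencing batch, sample name, paths to paired fastqs.
printf 'batch\tsample\tfq1\tfq2\n' > my_samples.tsv
printf 'test_data\ttest\ttest_data/test_R1.fastq.gz\ttest_data/test_R2.fastq.gz\n' >> my_samples.tsv
cat my_samples.tsv
```

Whatever the real columns are, config/config.yml is what tells snakemake which sample list to read.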

Create a directory to store cluster run outputs for troubleshooting.

mkdir logs

Do a dry run: list the snakemake jobs that have not yet completed, without running them.

snakemake -np

Running snakemake locally will stall at the Kraken step due to its memory requirements, so submit jobs to the cluster instead. The --use-conda flag tells snakemake to use the conda environments specified for each step.

nohup snakemake -j 5 -k --cluster-config config/cluster_config.yml --use-conda --rerun-triggers mtime --rerun-incomplete --cluster \
"sbatch -A {cluster.account} --mem={cluster.memory} -t {cluster.time} --cpus-per-task {threads} --error {cluster.error} --output {cluster.output} " \
> logs/snakemake_test_data.out &

Monitor jobs running on the cluster.

sq # site-specific alias; squeue -u $USER works on any Slurm cluster

Look at entire output once jobs are completed.

cat logs/snakemake_test_data.out

Each step also produces a log, for troubleshooting a specific step.

cat results/test_data/test/bams/test_bwa_H37Rv_map.log

If snakemake runs into an error or if a run is prematurely terminated, the directory will be locked.

snakemake --unlock

Example data

Example paired-end fastq files are in the test_data directory. The corresponding input sample list (.tsv) is at config/test_data.tsv.

Directory structure

Results will be organized in the results directory, in sub-directories organized by sequencing batch and then sample name.

├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   ├── envs
│   │   ├── bwa.yml
│   │   ├── gatk.yml
│   │   ├── kraken2.yml
│   │   ├── picard.yml
│   │   ├── quanttb.yml
│   │   ├── samtools.yml
│   │   ├── TBprofilerv4.3.0.yml
│   │   ├── trim-galore.yml
│   │   └── conda-meta
│   ├── scripts
│   │   ├── process_stanford_tb.sh
│   │   ├── run_kraken.sh
│   │   ├── run_quanttb.sh
│   │   ├── reads_output.sh
│   │   ├── map_reads.sh
│   │   ├── cov_stats.sh
│   │   ├── mykrobe_predict.sh
│   │   ├── call_varsgatk.sh
│   │   └── vcf2fasta.sh
│   ├── refs
│   │   ├── H37Rv.fa
│   │   ├── H37Rv_ppe.bed.gz
│   │   ├── ppe_hdr.txt
│   │   └── snpEff
│   └── Snakefile
├── config
│   ├── config.yml (run/user-specific parameters)
│   └── cluster_config.yml (cluster-specific parameters)
└── results
    └── test_data/test (example organized by sequencing batch, then sample)
        ├── trim
        ├── kraken
        ├── bams
        ├── vars
        ├── fasta
        └── stats

Code Snippets

shell: 
  '''
  # Add test if environment exists already, create if not. 
  existing_env=$(conda env list | awk '{{print $1}}' | grep {params.enviro})
  [ -z "$existing_env" ] && mamba env create -f {input}
  echo {params.enviro} > {output} 
  '''
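The create-if-missing idiom in this rule can be exercised on its own. In the sketch below (a toy, not the rule itself), `conda env list` is replaced by a hard-coded listing so it runs anywhere, and the grep is anchored to avoid substring matches.

```shell
# Simulated `conda env list` output: env name in column 1.
env_listing=$'base /opt/conda\nsnakemake /opt/conda/envs/snakemake'
# Same test as the rule: an empty result means the env is missing.
# `|| true` keeps a non-match from aborting under `set -e`.
existing_env=$(printf '%s\n' "$env_listing" | awk '{print $1}' | grep '^snakemake$' || true)
[ -z "$existing_env" ] && echo "would create env" || echo "env exists"
```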
shell:
  'trim_galore --basename {params.sample} --nextseq 20 --gzip --output_dir {params.output_dir} --paired {input.p1} {input.p2}'
shell:
  '{config[scripts_dir]}run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_report} &>> {log}'
shell:
  '{config[scripts_dir]}run_quanttb.sh {input.kr1} {input.kr2} {output.quanttb_out} &>> {log}'
shell:
  '''
  {config[scripts_dir]}map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}
  sambamba markdup -r -t {threads} --tmpdir={params.tmp_dir} {output.bam} {output.combined_bam}
  '''
shell:    
  '''
  {config[scripts_dir]}reads_output.sh {input.p1} {input.trim1} {input.kr1} {output.reads_stats} &>> {log}
  ''' 
shell:    
  '''
  {config[scripts_dir]}cov_stats.sh {input.ref_path} {input.bam} {output.cov_stats} &>> {log}

  #paste {input.kraken_stats} <(sed -n '7,8'p {output.cov_stats} ) > {output.combined_stats}

  # Combine with reads output
  #paste  -d ',' <( cat {output.combined_stats} | tr "\\t" "," ) {input.reads_stats} > {output.all_stats}

  # Combine reads output and coverage 
  paste  -d ',' {input.reads_stats}  <(sed -n '7,8'p {output.cov_stats} |  tr "\\t" "," ) > {output.combined_stats}

  paste -d ',' {output.combined_stats} {input.quanttb_stats}  <( cat {input.kraken_stats} | tr "\\t" ",") > {output.all_stats}

  # Mosdepth coverage along genome (for plotting)
  mosdepth --by 2000 -Q 30 {params.prefix} {input.bam}
  '''
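The paste/sed combination that assembles the combined stats can be illustrated on toy files (all file names and column values below are made up):

```shell
# Toy read-count CSV: a header line plus one row of values.
printf 'reads_in,reads_trimmed\n100,90\n' > reads_stats.csv
# Toy coverage table: 8 tab-separated rows, of which rows 7-8 matter.
for i in 1 2 3 4 5 6 7 8; do printf 'row%s\t%s\n' "$i" "$i"; done > cov_stats.tsv
# Pull lines 7-8 of the coverage table, convert tabs to commas,
# and paste them after the read stats, comma-delimited.
paste -d ',' reads_stats.csv <(sed -n '7,8p' cov_stats.tsv | tr '\t' ',') > combined_stats.csv
cat combined_stats.csv
```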
shell:
  "cat {input.combined_stats} > {output}"
shell: 
  """
  # Make sure tb-profiler reference chromosome names match the name used in your reference FASTA.
  tb-profiler update_tbdb --match_ref {params.ref_path}

  tb-profiler profile --bam {input.combined_bam} --prefix {params.samp} --dir {params.outdir} --csv --spoligotype &>> {log}
  mv {params.tmp_file} {output.profile}    

  """
shell: 
  "{config[scripts_dir]}mykrobe_predict.sh {input.combined_bam} {output.amr_out} &>> {log}"
shell: 
  "{config[scripts_dir]}call_vars_gatk.sh {input.ref_path} {input.combined_bam} {params.ploidy} {output.vcf} &>> {log}"
shell:
  '''
  {config[scripts_dir]}vcf2fasta.sh {input.ref_path} {params.sample_name} {input.unfilt_vcf} {params.bed} {output.fasta} {params.depth} {params.qual} &>> {log}
  '''
shell:
  '''
  # Rename Chromosome to be consistent with snpEff/Ensembl genomes.
  zcat {input.vcf}| sed 's/NC_000962.3/Chromosome/g' | bgzip > {output.rename_vcf}
  tabix {output.rename_vcf}

  # Run snpEff and then rename Chromosome.
  java -jar -Xmx8g {config[snpeff]} eff {config[snpeff_db]} {output.rename_vcf} -config {config[snpeff_configpath]} -dataDir {config[snpeff_datapath]} -noStats -no-downstream -no-upstream -canon | sed 's/Chromosome/NC_000962.3/g' > {output.tmp_vcf}

  # Also use bed file to annotate vcf, zip.
  bcftools annotate -a {params.bed} -h {params.vcf_header} -c CHROM,FROM,TO,FORMAT/PPE {output.tmp_vcf} | bgzip  > {output.ann_vcf}

  '''
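The chromosome rename round-trip in this rule can be sanity-checked on a single toy VCF data line (bgzip/tabix and snpEff itself are omitted from the illustration):

```shell
# One toy VCF data line using the RefSeq chromosome name.
printf 'NC_000962.3\t100\t.\tA\tG\t.\t.\t.\n' > toy.vcf
# Rename to the snpEff/Ensembl name, as done before annotation...
sed 's/NC_000962.3/Chromosome/g' toy.vcf > renamed.vcf
# ...then rename back, as done after annotation.
sed 's/Chromosome/NC_000962.3/g' renamed.vcf > restored.vcf
cat restored.vcf
```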

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/ksw9/mtb-call
Name: mtb-call
Version: v1
Copyright: Public Domain
License: None
