This is an human RNAseq variant calling workflow, following the GATK pipeline. Also includes ADAR-site elimination.

public public 1yr ago 0 bookmarks

Download

Clone the git repository with the command:

git clone https://github.com/durrantmm/rnaseq_variant_calling_workflow.git

Installation

This workflow uses a conda environment to satisfy all the necessary dependencies.

Make sure you have anaconda installed. Download it here .

You should be able to install the workflow simply by entering:

bash install.sh

In your console. Now it's time to move on to configuration.

Configuration

This is a very important step in running the workflow properly.

Open the provided config.yaml file to get started

Set GATK and Picard execution paths

The first two parameters of the config.yaml file are

gatk_path: "java -jar /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar /path/to/Picard.jar"

These are two strings that allow the workflow to execute GATK and Picard. You can download these from https://broadinstitute.github.io/picard/ and https://software.broadinstitute.org/gatk/download/

GATK requires you to make an account to download.

Once they are both installed, you can enter their execution paths in the config.yaml file.

You may want to add an addition flag, -Xmx50G , to make sure that Java allocates enough memory. The final config.yaml file would look something like this:

gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"

Set working directory

The next parameter of the config.yaml file is wd: test . This is the default working directory, and it contains some test data to try out. You can give this a try by typing

snakemake --cores 12

It should take a few minutes to run, but once it runs correctly you know you are all set.

Setting up a workflow

It is essential that you set up your workflow properly before you run it. Follow these directions exactly.

IMPORTANT RULES AND TIPS TO REMEMBER:

  • When naming files, periods must only be used where specified. They cannot be used as a general name delimiter.

    • myfile.R1.fq.gz - CORRECT - this is the right way to name a fastq file

    • myfile.Run1.Replicate2.R1.fq.gz - INCORRECT - this is the WRONG way to name a fastq file.

    • myfile_Run1_Replicate2.R1.fq.gz - CORRECT - Replacing the improper periods with underscores is appropriate.

  • Files can be be symbolic links to save disk space.

Overview

Navigate to your working directory. Make sure you have chosen a file system that has lots of free disk space available.

From within your working directory, create 5 directories as follows:

mkdir 0.fastq;
mkdir 0.reference_genome_fasta;
mkdir 0.gencode;
mkdir 0.dbsnp;
mkdir 0.adar_sites;

Or it may be easier to download a template working directory by executing

wget https://s3-us-west-1.amazonaws.com/mdurrant/biodb/bundles/rnaseq_variant_calling_workflow/wd_hg19.tar.gz;
tar -zxvf wd_hg19.tar.gz;

And then setting your wd parameter in the config file to this directory.

Now we'll go through each of the directories and describe how to populate them.

0.fastq

This is the directory that contains all of the paired-end FASTQ files to be analyzed. Single-end analysis is possible, but hasn't been built into this workflow automatically. You can figure that one out on your own.

These fastq files must follow this precise format.

<sample1>.R1.fq.gz
<sample1>.R2.fq.gz
<sample2>.R1.fq.gz
<sample2>.R2.fq.gz
...

This folder contains as many samples as you want. Make sure that the only periods are found in the .R#.fq.gz file extension.

0.reference_genome_fasta

This is the reference genome fasta file to be used in this analysis. You can have as many reference genomes as you want. Each genome must also have matching files in 0.gencode , 0.dbsnp , and 0.adar_sites . The identifying name must be the same between these directories.

They must be named as follows:

<genome>.fa

They cannot be gzipped.

0.gencode

This folder contains the gencode GTF file to be used by the STAR aligner.

It must follow the naming convention:

<genome>.gtf

It must have the same <genome> name as the corresponding reference genome in

0.reference_genome_fasta

A version corresponding to hg19 can be downloaded from http://www.gencodegenes.org/releases/19.html

0.dbsnp

This folder contains the dbsnp VCF file.

It must follow the naming convention

<genome>.vcf

It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta and 0.gencode .

0.adar_sites

This folder contains known ADAR A-to-I RNA editing sites. They are excluded outright from the resulting RNA-seq variant calls.

It must follow the naming convention

<genome>.txt

It must have the same <genome> name as the corresponding reference genome in 0.reference_genome_fasta , 0.gencode , and 0.dbsnp .

Run Snakemake

You should now be able to run the snakemake workflow.

You can run this with the command

snakemake

From the rnaseq_variant_calling_workflow directory.

You can set the number of cores used with

snakemake --cores 12

Cluster Job Submission

If you are using the SGE computing cluster system, you can submit a job using the provided script

qsub submission.sh

You may need to play with the parameters in cluster.json to make sure that each step has enough memory and wall time allocated to it.

Code Snippets

42
43
run:
    print("RNAVCW FINISHED WITH NO EXCEPTIONS!")
51
52
53
54
55
56
run:
    command = "{picard} CreateSequenceDictionary R={input} O={output}".format(picard=config['picard_path'],
              input=input, output=output)

    print(command)
    shell(command)
64
65
shell:
    "samtools faidx {input}"
73
74
shell:
    "bgzip {input}"
82
83
shell:
    "tabix -p vcf {input}"
91
92
93
shell:
    "grep -P \"^[^\t]+\t[^\t]+\texon\" {input} | bedtools sort -i stdin | "
    "bedtools merge -i stdin > {output}"
105
106
107
108
shell:
    "mkdir -p {output.dir}; "
    "STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.dir} "
    "--genomeFastaFiles {input.refgen} --sjdbGTFfile {input.gencode} --sjdbOverhang {params.overhang}"
123
124
125
126
127
128
129
130
131
132
shell:
    "mkdir -p {output.dir}; "
    "STAR --readFilesIn {input.f1} {input.f2} --outFileNamePrefix {params.out_prefix} "
    "--genomeDir {input.genome_dir} --readFilesCommand zcat --runThreadN {threads} "
    "--genomeLoad NoSharedMemory --outFilterType BySJout "
    "--outSAMunmapped Within "
    "--outSAMattributes NH HI AS NM MD NM "
    "--twopassMode Basic "
    "--sjdbOverhang {params.overhang} "
    "--sjdbGTFfile {input.gencode}"
140
141
142
143
144
145
146
run:
    command = "{picard} AddOrReplaceReadGroups I={input} O={output} SO=coordinate RGID=id " \
              "RGLB=library RGPL=ILLUMINA RGPU=machine RGSM=sample; ".format(picard=config['picard_path'],
              input=input, output=output)

    print(command)
    shell(command)
SnakeMake From line 140 of master/Snakefile
155
156
157
158
159
160
161
run:
    command = "{picard} MarkDuplicates I={input} O={output_bam} CREATE_INDEX=true " \
              "VALIDATION_STRINGENCY=SILENT M={output_metrics}".format(picard=config['picard_path'],
              input=input, output_bam=output.bam, output_metrics=output.metrics)

    print(command)
    shell(command)
SnakeMake From line 155 of master/Snakefile
172
173
174
175
176
177
178
179
run:
    command = "{gatk} -T SplitNCigarReads -I {input} -o {output} -R {refgen} " \
              "-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 " \
              "-U ALLOW_N_CIGAR_READS".format(gatk=config['gatk_path'],
              input=input.bam, output=output, refgen=input.refgen)

    print(command)
    shell(command)
SnakeMake From line 172 of master/Snakefile
191
192
193
194
195
196
197
198
run:
    command = "{gatk} -T BaseRecalibrator -I {input} -o {output} -R {refgen} " \
              "-knownSites {dbsnp} -L {exons_bed}".format(gatk=config['gatk_path'],
              input=input.bam, output=output, refgen=input.refgen, dbsnp=input.dbsnp,
              exons_bed=input.exons_bed)

    print(command)
    shell(command)
SnakeMake From line 191 of master/Snakefile
208
209
210
211
212
213
214
run:
    command = "{gatk} -T PrintReads -I {input} -o {output} -R {refgen} " \
    "-BQSR {recal}".format(gatk=config['gatk_path'],
    input=input.bam, output=output, refgen=input.refgen, recal=input.recal)

    print(command)
    shell(command)
SnakeMake From line 208 of master/Snakefile
223
224
225
226
227
228
229
run:
    command = "{gatk} -T HaplotypeCaller -I {input} -o {output} -R {refgen} " \
    "-dontUseSoftClippedBases -stand_call_conf 20.0".format(gatk=config['gatk_path'],
    input=input.bam, output=output, refgen=input.refgen)

    print(command)
    shell(command)
SnakeMake From line 223 of master/Snakefile
238
239
240
241
242
243
244
245
run:
    command = "{gatk} -T VariantFiltration -V {input} -o {output} -R {refgen} " \
    "-window 35 -cluster 3 -filterName FS -filter \"FS > 30.0\" -filterName QD " \
    "-filter \"QD < 2.0\"".format(gatk=config['gatk_path'],
    input=input.vcf, output=output, refgen=input.refgen)

    print(command)
    shell(command)
SnakeMake From line 238 of master/Snakefile
253
254
255
256
257
258
259
260
261
262
run:
    tmp_bed = output.adar_bed+'.tmp'
    with open(input.adar_txt) as infile:
        infile.readline()
        with open(tmp_bed, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                chrom, start, end = line[0], int(line[1])-1, line[1]
                outfile.write('\t'.join([chrom, str(start), end])+'\n')
    shell('bedtools sort -i {bed} > {output}; rm {bed}'.format(bed=tmp_bed, output=output.adar_bed))
271
272
shell:
    'bedtools intersect -header -v -a {input.vcf} -b {input.adar_bed} > {output}'
280
281
282
283
284
285
286
run:
    command = "vcftools --vcf {invcf} --min-alleles 2 --max-alleles 2 " \
    "--remove-indels --recode --stdout | grep -Ev '1/1:|0/0:' | grep -v '1|1:' | grep -v '0|0:' " \
    "> {outvcf}; ".format(invcf=input, outvcf=output)

    print(command)
    shell(command)
ShowHide 13 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/durrantmm/rnaseq_variant_calling_workflow
Name: rnaseq_variant_calling_workflow
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...