PPCG RNA working group

This repository contains the pipeline used by the Pan Prostate Cancer Group.

Requirements

All tools used in this pipeline are installed at runtime through a Singularity container. The only requirements are those needed to run the Snakemake workflow itself.

  • mamba >= 1.2 (optional, for faster installation)

  • Snakemake >= 6.2

  • Singularity >= 3.7 (should work on earlier versions, although this is untested)

  • Cookiecutter >= 1.7.2

With the exception of Singularity, these can be easily installed using conda. Please do not install Singularity through conda.
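
For example, a minimal conda/mamba setup covering the Snakemake-side requirements could look like the sketch below (the environment name is arbitrary, and the channels are the usual sources for these tools):

# create and activate an environment with the workflow's host-side dependencies
mamba create -n ppcg-rna -c conda-forge -c bioconda "snakemake>=6.2" "cookiecutter>=1.7.2"
conda activate ppcg-rna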

Singularity must be owned by root in order to run this workflow. If you do not have access to a Singularity installation owned by root, try requesting one from your system administrator.
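
To check which Singularity installation is on your PATH and which version it provides (standard commands; paths will differ per site):

which singularity
singularity --version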

Before you begin

When executing Snakemake, try to avoid canceling it with Ctrl+C. If an error occurs, Snakemake will clean up the files from the failed step before shutting down; forcing it to shut down instead leaves corrupted files behind, potentially causing trouble later. On the same note, avoid moving, creating, or removing files under the results/ folder. Snakemake relies on file timestamps to coordinate the order of execution of each step. If you must remove files, only remove those produced by the last step that ran; otherwise Snakemake will reprocess every step that depends on them.
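
If a run was killed abruptly anyway, standard Snakemake options can help recover; a sketch, to be combined with the command line you normally use:

# release a stale working-directory lock left behind by a killed run
snakemake --unlock
# re-run any jobs whose output files may be incomplete
snakemake --rerun-incomplete --use-conda --use-singularity --profile slurm_profile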

How to run

  1. Download the Snakemake workflow.
wget https://github.com/panprostate/RNA/releases/download/v1.2.4/PPCG_RNA_v1.2.4.tar.gz
tar -xzvf PPCG_RNA_v1.2.4.tar.gz
  2. This workflow requires a sample metadata table (tab separated) containing the following fields:
  • sample_name = The sample name, by default anything preceding the "_L00" pattern.

  • unit_name = The unit/lane name, by default it matches the following pattern ".+_(L00.).+".

  • fq1 = Absolute path to the R1 file of the fastq pair.

  • fq2 = Absolute path to the R2 file of the fastq pair. For single-ended libraries this field should contain NA.

Create this table under the config/ folder. As a reference, a helper script createTable.sh is included under config/ to create this table. Make sure the table format is correct before proceeding.
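
As an illustration, a paired-end table for a sample S1 split across two lanes might look like this (columns are tab-separated; spaces are shown here for alignment, and the fastq paths are hypothetical):

sample_name  unit_name  fq1                              fq2
S1           L001       /data/fastq/S1_L001_R1.fastq.gz  /data/fastq/S1_L001_R2.fastq.gz
S1           L002       /data/fastq/S1_L002_R1.fastq.gz  /data/fastq/S1_L002_R2.fastq.gz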

  3. If your computing environment is managed by Slurm, a cluster profile is included at slurm_profile/. There are two files that can be adjusted:
  • slurm_profile/config.yaml

Most likely you will want to adjust only two settings in this file:

# Number of cores used by Snakemake to manage the workflow - if increasing, be careful to not have your job killed if not running in an interactive session.
local-cores: 1
# Number of maximum simultaneous jobs that can be run.
jobs: 5
  • slurm_profile/settings.json

This file can be used to include center-specific parameters for your submissions. For example, to specify a partition named panda for all job submissions:

"SBATCH_DEFAULTS": "partition=panda job-name={rule}_{wildcards} output=job_logs/slurm-%j.out"
  4. Edit the amount of resources requested and other general settings in config/config.yaml. Some of the options are listed below; a sketch of the full file follows the list.
  • library_type (NEW): process single-ended libraries. Note that a mixed sample table is not supported at the moment; make sure all samples in your samples table are either all SE or all PE.

  • samples: path to the metadata table described in step 2.

  • buildIndex: True or False; whether to build the indexes from the reference files or download pre-built indexes.

  • adapters: adapter sequences used by cutadapt.

  • mem_*: Amount of RAM requested for the job, in MB.

  • ncpus_*: Number of cores requested for the job.

  • rt_*: Maximum time a job is allowed to run before it is killed.

  • compression: Compression level (1 to 9) for intermediate steps. Final outputs are always compressed at the default level.
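
Putting it together, a minimal config/config.yaml could look like the sketch below. The per-rule resource keys follow the mem_*/ncpus_*/rt_* pattern described above, but the exact key names and values here are illustrative rather than shipped defaults:

samples: config/samples.tsv
library_type: PE
buildIndex: False
adapters: "-a AGATCGGAAGAGC -A AGATCGGAAGAGC"
compression: 6
# per-rule resources, e.g. for the STAR alignment step (key names illustrative)
mem_star: 40000
ncpus_star: 8
rt_star: 24:00:00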

  5. From the base directory, execute Snakemake:
snakemake --use-conda --use-singularity --profile slurm_profile
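
A dry run beforehand is a cheap sanity check that the sample table and config are parsed as expected (-n is a standard Snakemake flag, not specific to this pipeline):

snakemake -n --profile slurm_profile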

The workflow will automatically pull and build a Singularity container from docker://condaforge/mambaforge:4.10.1-0 and create conda environments with the necessary packages for each module.

  6. If debugging is necessary, the logs/ folder contains the error log generated by the software run in each step. In some cases the error is not directly caused by the software itself, but by the execution of a script or by the cluster resource manager; in those cases the error will be written to job_logs/ with the name slurm-<job-number>.out.
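
For example, to inspect the tail of a given Slurm log (the job number here is hypothetical):

tail -n 50 job_logs/slurm-1234567.out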

Troubleshooting

Depending on the number of samples and read length, you might end up with an excessive number of SJs detected. In this case you will encounter the error below:

Fatal LIMIT error: the number of junctions to be inserted on the fly =XXXXXXX is larger than the limitSjdbInsertNsj=1000000

While you can increase this limit, doing so might lead you to another error:

EXITING because of FATAL ERROR: cannot insert junctions on the fly because of strand GstrandBit problem

The solution is to decrease the number of SJs being inserted on the fly. You can do this by turning on the SJ filter in the config.yaml file, which by default will remove SJs supported by only 1 read (uniquely mapped):

SJ_filter: True
SJ_minCounts: 1

Workflow overview

Assuming a sample named S1 which has been split into two files by lane (L001 and L002).
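
With the default naming patterns described in step 2, the input files for this sample would look something like the following (illustrative Illumina-style names):

S1_L001_R1.fastq.gz  S1_L001_R2.fastq.gz
S1_L002_R1.fastq.gz  S1_L002_R2.fastq.gz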

[workflow overview diagram]

Code Snippets

shell:
    """
    set -e
    echo "Downloading resources..."
    [[ -f {output.dbSNP} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf {output.dbSNP}
    [[ -f {output.dbSNPidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf.idx {output.dbSNPidx}
    [[ -f {output.knowIndels} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf {output.knowIndels}
    [[ -f {output.knowIndelsidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx {output.knowIndelsidx}
    [[ -f {output.reference} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta {output.reference}
    [[ -f {output.faidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta.fai {output.faidx}
    [[ -f {output.dictn} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dict {output.dictn}
    [[ -f {output.paSites} ]] || wget -O resources/pasites_hg19.bed.gz https://www.dropbox.com/s/i55pzcuh6113vvq/pasites_hg19.bed.gz
    [[ -f {output.tx2gene} ]] || wget -O resources/tx2gene.tsv.gz https://www.dropbox.com/s/ays94bytt7r02ex/tx2gene.tsv.gz
    [[ -f {output.fc} ]] || wget -O resources/FANTOM_CAT.lv3_robust.gtf.gz https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz
    [[ -f {output.gtf} ]] || wget -O resources/gencode.v38lift37.annotation.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.annotation.gtf.gz
    [[ -f {output.transcripts} ]] || wget -O resources/gencode.v38lift37.transcripts.fa.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.transcripts.fa.gz
    [[ -f {output.blacklist} ]] || wget -O resources/arriba_v2.1.0.tar.gz https://github.com/suhrig/arriba/releases/download/v2.1.0/arriba_v2.1.0.tar.gz
    tar -xvf resources/arriba_v2.1.0.tar.gz --directory resources/
    mv resources/arriba_v2.1.0/database/blacklist_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/known_fusions_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.1.0.gff3 resources/
    rm -rf resources/arriba_v2.1.0/ resources/arriba_v2.1.0.tar.gz
    echo "Unzipping some files..."
    gunzip resources/gencode.v38lift37.annotation.gtf.gz resources/gencode.v38lift37.transcripts.fa.gz resources/FANTOM_CAT.lv3_robust.gtf.gz resources/pasites_hg19.bed.gz
    sed -i 's/^chr//g' resources/gencode.v38lift37.annotation.gtf
    sed -i 's/^chr//g' resources/FANTOM_CAT.lv3_robust.gtf
    echo "Done!"
    """
shell:
    """
    set -e
    wget -O resources/STARindex_hg19.tar https://www.dropbox.com/s/7zf1ad0hokpizfi/STARindex_hg19.tar
    tar -xvf resources/STARindex_hg19.tar --directory resources/
    """
shell:
    """
    set -e
    wget -O resources/salmon_hg19.tar https://www.dropbox.com/s/mybii6z71v9pt57/salmon_hg19.tar
    tar -xvf resources/salmon_hg19.tar --directory resources/
    """
shell:
    """
    wget -O resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz
    tar -xzvf resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz --directory resources/
    """
script:
    "../../scripts/gtf2bed.py"
(from line 14 of PE/QC.smk)
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output} 
    """
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
(from line 53 of PE/QC.smk)
shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"
(from line 74 of PE/QC.smk)
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
(from line 91 of PE/QC.smk)
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 108 of PE/QC.smk)
shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"
(from line 127 of PE/QC.smk)
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 144 of PE/QC.smk)
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 162 of PE/QC.smk)
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 180 of PE/QC.smk)
shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """
shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -p {output.fastq2}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"
shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"
shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """
(from line 157 of PE/RNASeq.smk)
shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """
shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        # loop over the remaining per-lane files; $files alone would expand to the first element only
        for file in "${{files[@]}}"; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 |head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """
shell:
    "salmon quant -p {threads} -i {params.idx} -l A -1 {input.f1} -2 {input.f2} --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"
shell:
    """
    featureCounts -p -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -p -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -p -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -p -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """
script:
    "../../scripts/txImport.R"
(from line 427 of PE/RNASeq.smk)
shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """
shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """
(from line 486 of PE/RNASeq.smk)
shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
(from line 508 of PE/RNASeq.smk)
script:
    "../../scripts/createInterval.R"
shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -F2 {input.f2} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """
shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """
shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """
shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """
shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """
shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """
script:
    "../../scripts/gtf2bed.py"
(from line 14 of SE/QC.smk)
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output} 
    """
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
(from line 53 of SE/QC.smk)
shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"
(from line 74 of SE/QC.smk)
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
(from line 91 of SE/QC.smk)
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 108 of SE/QC.smk)
shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"
(from line 127 of SE/QC.smk)
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 144 of SE/QC.smk)
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 162 of SE/QC.smk)
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 180 of SE/QC.smk)
shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """
shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"
shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"
shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """
(from line 154 of SE/RNASeq.smk)
shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """
shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        # loop over the remaining per-lane files; $files alone would expand to the first element only
        for file in "${{files[@]}}"; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 |head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """
shell:
    "salmon quant -p {threads} -i {params.idx} -l A -r {input.f1} --fldMean 275 --fldSD 50 --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"
shell:
    """
    featureCounts -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """
script:
    "../../scripts/txImport.R"
(from line 423 of SE/RNASeq.smk)
shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """
shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """
(from line 482 of SE/RNASeq.smk)
shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
(from line 504 of SE/RNASeq.smk)
script:
    "../../scripts/createInterval.R"
shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """
shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """
shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """
shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """
shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """
shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """    
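# createInterval.R: build a BED of exon coordinates from the GTF (0-based starts,
# chrM dropped) and convert it into a GATK interval list for variant calling.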
options(scipen=999)
gtf <- read.table(snakemake@input[["gtf"]], sep="\t")
gtf <- subset(gtf, V3 == "exon")
gtf <- data.frame(chrom=gtf[,'V1'], start=gtf[,'V4']-1, end=gtf[,'V5'])
gtf$chrom <- gsub("^chr", "", gtf$chrom)
gtf <- gtf[gtf$chrom != "M",]
write.table(gtf, file=snakemake@output[["tmp"]], quote = F, sep="\t", col.names = F, row.names = F)

system(paste0("gatk BedToIntervalList -I ", snakemake@output[["tmp"]], " -O ", snakemake@output[["intervals"]], " -SD ", snakemake@input[["refDict"]]))
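# gtf2bed.py: build a gffutils database from the GTF annotation and emit one
# BED12 line per transcript, as used by the QC rules.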
import gffutils

db = gffutils.create_db(
    snakemake.input[0],
    dbfn=snakemake.output.db,
    force=True,
    keep_order=True,
    merge_strategy="merge",
    sort_attribute_values=True,
    disable_infer_genes=True,
    disable_infer_transcripts=True,
)

with open(snakemake.output.bed, "w") as outfileobj:
    for tx in db.features_of_type("transcript", order_by="start"):
        bed = [s.strip() for s in db.bed12(tx).split("\t")]
        bed[3] = tx.id
        outfileobj.write("{}\n".format("\t".join(bed)))
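# txImport.R: write per-transcript raw-count and TPM tables from Salmon's quant.sf,
# then summarize to gene level with tximport (raw and length-scaled counts).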
options(scipen=999)
require(tximport)
require(GenomicFeatures)
transcripts <- read.table(snakemake@input[["quant"]], sep ="\t", header = T)
# write.csv always writes the header row; passing col.names would only raise a warning
write.csv(transcripts[,c("Name", "NumReads")], file=snakemake@output[["transcript_raw"]], row.names = F)
write.csv(transcripts[,c("Name", "TPM")], file=snakemake@output[["transcript_TPM"]], row.names = F)

tx2gene <- read.table(snakemake@input[["tx2gene"]])

### Read counts and summarize to gene level
txi <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene)
txiScaled <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene,countsFromAbundance = "lengthScaledTPM")
### Rename columns
nms <- basename(gsub("\\/quant.sf", "", snakemake@input[["quant"]]))
mat <- txi$counts
matScaled <- txiScaled$counts
colnames(mat) <- nms
colnames(matScaled) <- nms

write.table(mat, file=snakemake@output[["gene_raw"]], row.names = T, col.names = T, sep = "\t")
write.table(matScaled, file=snakemake@output[["gene_scaled"]], row.names = T, col.names = T, sep="\t")
shell:
    "touch {output.complete}"
shell:
    "tar -czvf {output.tarfile} {input}"
shell:
    """
    set -e
    mv {params.map_gene_gc_stat} {output.map_gene_gc_stat}
    mv {params.map_gene_fc_stat} {output.map_gene_fc_stat}
    mv {params.map_exon_gc_stat} {output.map_exon_gc_stat}
    mv {params.map_exon_fc_stat} {output.map_exon_fc_stat}
    tar -czvf {output.tarfile} {input} {output.map_gene_gc_stat} {output.map_gene_fc_stat} {output.map_exon_gc_stat} {output.map_exon_fc_stat}
    """