PPCG RNA working group

This repository contains the pipeline used by the Pan Prostate Cancer Group.

Requirements

All tools used in this pipeline are installed at runtime through a Singularity container. The only requirements are those needed to run the Snakemake workflow itself.

  • mamba >= 1.2 (optional, for faster installation)

  • Snakemake >= 6.2

  • Singularity >= 3.7 (should work on earlier versions, although this is untested)

  • Cookiecutter >= 1.7.2

With the exception of Singularity, these can be easily installed using conda. Please do not install Singularity through conda.
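
For example, a minimal conda/mamba setup covering the Snakemake-side requirements could look like the sketch below (the environment name is arbitrary, and the channels are the usual sources for these tools):

# create and activate an environment with the workflow's host-side dependencies
mamba create -n ppcg-rna -c conda-forge -c bioconda "snakemake>=6.2" "cookiecutter>=1.7.2"
conda activate ppcg-rna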

Singularity must be owned by root in order to run this workflow. If you do not have access to a Singularity installation owned by root, try requesting one from your system administrator.
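
To check which Singularity installation is on your PATH and which version it provides (standard commands; paths will differ per site):

which singularity
singularity --version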

Before you begin

When executing Snakemake, try to avoid canceling it with Ctrl+C. If an error occurs, Snakemake will clean up the files from the failed step before shutting down; forcing it to shut down instead leaves corrupted files behind, potentially causing trouble later. On the same note, avoid moving, creating, or removing files under the results/ folder. Snakemake relies on file timestamps to coordinate the order of execution of each step. If you must remove files, only remove those produced by the last step that ran; otherwise Snakemake will reprocess every step that depends on them.
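
If a run was killed abruptly anyway, standard Snakemake options can help recover; a sketch, to be combined with the command line you normally use:

# release a stale working-directory lock left behind by a killed run
snakemake --unlock
# re-run any jobs whose output files may be incomplete
snakemake --rerun-incomplete --use-conda --use-singularity --profile slurm_profile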

How to run

  1. Download the Snakemake workflow.
wget https://github.com/panprostate/RNA/releases/download/v1.2.4/PPCG_RNA_v1.2.4.tar.gz
tar -xzvf PPCG_RNA_v1.2.4.tar.gz
  2. This workflow requires a sample metadata table (tab separated) containing the following fields:
  • sample_name = The sample name, by default anything preceding the "_L00" pattern.

  • unit_name = The unit/lane name, by default it matches the following pattern ".+_(L00.).+".

  • fq1 = Absolute path to the R1 file of the fastq pair.

  • fq2 = Absolute path to the R2 file of the fastq pair. For single-ended libraries this field should contain NA.

Create this table under the config/ folder. As a reference, a helper script createTable.sh is included under config/ to create this table. Make sure the table format is correct before proceeding.
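
As an illustration, a paired-end table for a sample S1 split across two lanes might look like this (columns are tab-separated; spaces are shown here for alignment, and the fastq paths are hypothetical):

sample_name  unit_name  fq1                              fq2
S1           L001       /data/fastq/S1_L001_R1.fastq.gz  /data/fastq/S1_L001_R2.fastq.gz
S1           L002       /data/fastq/S1_L002_R1.fastq.gz  /data/fastq/S1_L002_R2.fastq.gz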

  3. If your computing environment is managed by Slurm, a cluster profile is included at slurm_profile/. There are two files that can be adjusted:
  • slurm_profile/config.yaml

Most likely you will want to adjust only two settings in this file:

# Number of cores used by Snakemake to manage the workflow - if increasing, be careful to not have your job killed if not running in an interactive session.
local-cores: 1
# Number of maximum simultaneous jobs that can be run.
jobs: 5
  • slurm_profile/settings.json

This file can be used to include center-specific parameters for your submissions. For example, to specify a partition named panda for all job submissions:

"SBATCH_DEFAULTS": "partition=panda job-name={rule}_{wildcards} output=job_logs/slurm-%j.out"
  4. Edit the amount of resources requested and other general settings in config/config.yaml. Some of the options are listed below; a sketch of the full file follows the list.
  • library_type (NEW): process single-ended libraries. Note that a mixed sample table is not supported at the moment; make sure all samples in your samples table are either all SE or all PE.

  • samples: path to the metadata table described in step 2.

  • buildIndex: True or False; whether to build the indexes from the reference files or download pre-built indexes.

  • adapters: adapter sequences used by cutadapt.

  • mem_*: Amount of RAM requested for the job, in MB.

  • ncpus_*: Number of cores requested for the job.

  • rt_*: Maximum time a job is allowed to run before it is killed.

  • compression: Compression level (1 to 9) for intermediate steps. Final outputs are always compressed at the default level.
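
Putting it together, a minimal config/config.yaml could look like the sketch below. The per-rule resource keys follow the mem_*/ncpus_*/rt_* pattern described above, but the exact key names and values here are illustrative rather than shipped defaults:

samples: config/samples.tsv
library_type: PE
buildIndex: False
adapters: "-a AGATCGGAAGAGC -A AGATCGGAAGAGC"
compression: 6
# per-rule resources, e.g. for the STAR alignment step (key names illustrative)
mem_star: 40000
ncpus_star: 8
rt_star: 24:00:00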

  5. From the base directory, execute Snakemake:
snakemake --use-conda --use-singularity --profile slurm_profile
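
A dry run beforehand is a cheap sanity check that the sample table and config are parsed as expected (-n is a standard Snakemake flag, not specific to this pipeline):

snakemake -n --profile slurm_profile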

The workflow will automatically pull and build a Singularity container from docker://condaforge/mambaforge:4.10.1-0 and create conda environments with the necessary packages for each module.

  6. If debugging is necessary, the logs/ folder contains the error log generated by the software run in each step. In some cases the error is not directly caused by the software itself, but by the execution of a script or by the cluster resource manager; in those cases the error will be written to job_logs/ with the name slurm-<job-number>.out.
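
For example, to inspect the tail of a given Slurm log (the job number here is hypothetical):

tail -n 50 job_logs/slurm-1234567.out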

Troubleshooting

Depending on the number of samples and read length, you might end up with an excessive number of SJs detected. In this case you will encounter the error below:

Fatal LIMIT error: the number of junctions to be inserted on the fly =XXXXXXX is larger than the limitSjdbInsertNsj=1000000

While you can increase this limit, doing so might lead you to another error:

EXITING because of FATAL ERROR: cannot insert junctions on the fly because of strand GstrandBit problem

The solution is to decrease the number of SJs being inserted on the fly. You can do this by turning on the SJ filter in the config.yaml file, which by default will remove SJs supported by only 1 read (uniquely mapped):

SJ_filter: True
SJ_minCounts: 1

Workflow overview

Assuming a sample named S1 which has been split into two files by lane (L001 and L002).
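
With the default naming patterns described in step 2, the input files for this sample would look something like the following (illustrative Illumina-style names):

S1_L001_R1.fastq.gz  S1_L001_R2.fastq.gz
S1_L002_R1.fastq.gz  S1_L002_R2.fastq.gz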

[workflow overview diagram]

Code Snippets

shell:
    """
    set -e
    echo "Downloading resources..."
    [[ -f {output.dbSNP} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf {output.dbSNP}
    [[ -f {output.dbSNPidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf.idx {output.dbSNPidx}
    [[ -f {output.knowIndels} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf {output.knowIndels}
    [[ -f {output.knowIndelsidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx {output.knowIndelsidx}
    [[ -f {output.reference} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta {output.reference}
    [[ -f {output.faidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta.fai {output.faidx}
    [[ -f {output.dictn} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dict {output.dictn}
    [[ -f {output.paSites} ]] || wget -O resources/pasites_hg19.bed.gz https://www.dropbox.com/s/i55pzcuh6113vvq/pasites_hg19.bed.gz
    [[ -f {output.tx2gene} ]] || wget -O resources/tx2gene.tsv.gz https://www.dropbox.com/s/ays94bytt7r02ex/tx2gene.tsv.gz
    [[ -f {output.fc} ]] || wget -O resources/FANTOM_CAT.lv3_robust.gtf.gz https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz
    [[ -f {output.gtf} ]] || wget -O resources/gencode.v38lift37.annotation.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.annotation.gtf.gz
    [[ -f {output.transcripts} ]] || wget -O resources/gencode.v38lift37.transcripts.fa.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.transcripts.fa.gz
    [[ -f {output.blacklist} ]] || wget -O resources/arriba_v2.1.0.tar.gz https://github.com/suhrig/arriba/releases/download/v2.1.0/arriba_v2.1.0.tar.gz
    tar -xvf resources/arriba_v2.1.0.tar.gz --directory resources/
    mv resources/arriba_v2.1.0/database/blacklist_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/known_fusions_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.1.0.gff3 resources/
    rm -rf resources/arriba_v2.1.0/ resources/arriba_v2.1.0.tar.gz
    echo "Unzipping some files..."
    gunzip resources/gencode.v38lift37.annotation.gtf.gz resources/gencode.v38lift37.transcripts.fa.gz resources/FANTOM_CAT.lv3_robust.gtf.gz resources/pasites_hg19.bed.gz
    sed -i 's/^chr//g' resources/gencode.v38lift37.annotation.gtf
    sed -i 's/^chr//g' resources/FANTOM_CAT.lv3_robust.gtf
    echo "Done!"
    """
shell:
    """
    set -e
    wget -O resources/STARindex_hg19.tar https://www.dropbox.com/s/7zf1ad0hokpizfi/STARindex_hg19.tar
    tar -xvf resources/STARindex_hg19.tar --directory resources/
    """
shell:
    """
    set -e
    wget -O resources/salmon_hg19.tar https://www.dropbox.com/s/mybii6z71v9pt57/salmon_hg19.tar
    tar -xvf resources/salmon_hg19.tar --directory resources/
    """
shell:
    """
    wget -O resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz
    tar -xzvf resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz --directory resources/
    """
script:
    "../../scripts/gtf2bed.py"
(from line 14 of PE/QC.smk)
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output} 
    """
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
(from line 53 of PE/QC.smk)
shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"
(from line 74 of PE/QC.smk)
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
(from line 91 of PE/QC.smk)
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 108 of PE/QC.smk)
shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"
(from line 127 of PE/QC.smk)
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 144 of PE/QC.smk)
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 162 of PE/QC.smk)
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 180 of PE/QC.smk)
shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """
shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -p {output.fastq2}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"
shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"
shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """
(from line 157 of PE/RNASeq.smk)
shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """
shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        # loop over the remaining per-lane files; $files alone would expand to the first element only
        for file in "${{files[@]}}"; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 |head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """
shell:
    "salmon quant -p {threads} -i {params.idx} -l A -1 {input.f1} -2 {input.f2} --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"
shell:
    """
    featureCounts -p -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -p -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -p -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -p -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """
script:
    "../../scripts/txImport.R"
(from line 427 of PE/RNASeq.smk)
shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """
shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """
(from line 486 of PE/RNASeq.smk)
shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
(from line 508 of PE/RNASeq.smk)
script:
    "../../scripts/createInterval.R"
shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -F2 {input.f2} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """
shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """
shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """
shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """
shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """
shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """
script:
    "../../scripts/gtf2bed.py"
(from line 14 of SE/QC.smk)
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output} 
    """
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
(from line 53 of SE/QC.smk)
shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"
(from line 74 of SE/QC.smk)
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
(from line 91 of SE/QC.smk)
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 108 of SE/QC.smk)
shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"
(from line 127 of SE/QC.smk)
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
(from line 144 of SE/QC.smk)
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 162 of SE/QC.smk)
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
(from line 180 of SE/QC.smk)
shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """
shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"
shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"
shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """
(from line 154 of SE/RNASeq.smk)
shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """
shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """
shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        # loop over the remaining per-lane files; $files alone would expand to the first element only
        for file in "${{files[@]}}"; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 |head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """
shell:
    "salmon quant -p {threads} -i {params.idx} -l A -r {input.f1} --fldMean 275 --fldSD 50 --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"
shell:
    """
    featureCounts -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """
script:
    "../../scripts/txImport.R"
(from line 423 of SE/RNASeq.smk)
shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """
shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """
(from line 482 of SE/RNASeq.smk)
shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
(from line 504 of SE/RNASeq.smk)
script:
    "../../scripts/createInterval.R"
shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """
shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """
shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """
shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """
shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """
shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """
shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """    
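# createInterval.R: build a BED of exon coordinates from the GTF (0-based starts,
# chrM dropped) and convert it into a GATK interval list for variant calling.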
options(scipen=999)
gtf <- read.table(snakemake@input[["gtf"]], sep="\t")
gtf <- subset(gtf, V3 == "exon")
gtf <- data.frame(chrom=gtf[,'V1'], start=gtf[,'V4']-1, end=gtf[,'V5'])
gtf$chrom <- gsub("^chr", "", gtf$chrom)
gtf <- gtf[gtf$chrom != "M",]
write.table(gtf, file=snakemake@output[["tmp"]], quote = F, sep="\t", col.names = F, row.names = F)

system(paste0("gatk BedToIntervalList -I ", snakemake@output[["tmp"]], " -O ", snakemake@output[["intervals"]], " -SD ", snakemake@input[["refDict"]]))
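# gtf2bed.py: build a gffutils database from the GTF annotation and emit one
# BED12 line per transcript, as used by the QC rules.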
import gffutils

db = gffutils.create_db(
    snakemake.input[0],
    dbfn=snakemake.output.db,
    force=True,
    keep_order=True,
    merge_strategy="merge",
    sort_attribute_values=True,
    disable_infer_genes=True,
    disable_infer_transcripts=True,
)

with open(snakemake.output.bed, "w") as outfileobj:
    for tx in db.features_of_type("transcript", order_by="start"):
        bed = [s.strip() for s in db.bed12(tx).split("\t")]
        bed[3] = tx.id
        outfileobj.write("{}\n".format("\t".join(bed)))
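# txImport.R: write per-transcript raw-count and TPM tables from Salmon's quant.sf,
# then summarize to gene level with tximport (raw and length-scaled counts).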
options(scipen=999)
require(tximport)
require(GenomicFeatures)
transcripts <- read.table(snakemake@input[["quant"]], sep ="\t", header = T)
# write.csv always writes the header row; passing col.names would only raise a warning
write.csv(transcripts[,c("Name", "NumReads")], file=snakemake@output[["transcript_raw"]], row.names = F)
write.csv(transcripts[,c("Name", "TPM")], file=snakemake@output[["transcript_TPM"]], row.names = F)

tx2gene <- read.table(snakemake@input[["tx2gene"]])

### Read counts and summarize to gene level
txi <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene)
txiScaled <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene,countsFromAbundance = "lengthScaledTPM")
### Rename columns
nms <- basename(gsub("\\/quant.sf", "", snakemake@input[["quant"]]))
mat <- txi$counts
matScaled <- txiScaled$counts
colnames(mat) <- nms
colnames(matScaled) <- nms

write.table(mat, file=snakemake@output[["gene_raw"]], row.names = T, col.names = T, sep = "\t")
write.table(matScaled, file=snakemake@output[["gene_scaled"]], row.names = T, col.names = T, sep="\t")
shell:
    "touch {output.complete}"
shell:
    "tar -czvf {output.tarfile} {input}"
shell:
    """
    set -e
    mv {params.map_gene_gc_stat} {output.map_gene_gc_stat}
    mv {params.map_gene_fc_stat} {output.map_gene_fc_stat}
    mv {params.map_exon_gc_stat} {output.map_exon_gc_stat}
    mv {params.map_exon_fc_stat} {output.map_exon_fc_stat}
    tar -czvf {output.tarfile} {input} {output.map_gene_gc_stat} {output.map_gene_fc_stat} {output.map_exon_gc_stat} {output.map_exon_fc_stat}
    """