Analysis Pipeline for Ribosome Profiling-Based Identification of Open Reading Frames in Bacteria

public public 1yr ago Version: 2021 0 bookmarks

This repository contains all code to recreate the contents of [RiboReport - Benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab549/6509045). Following are descriptions of the required steps to reproduce our analysis.

Processing of High Throughput Sequencing Data

In the publication, the data required for the generation of our result figures was created using the snakemake workflow described below. After the creation of the input data, the figures from the publication were generated by using the evaluation scripts provided in the evaluation folder. The generation and usage of the final tables and figures is described below.

Generation of the Dataset

Dependencies

Input data

For running the workflow, several input files are required:

  • genome.fa

  • annotation.gff

  • samples.tsv

  • config.yaml

  • fastq files

The fastq files for our newly published dataset are available via NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131514). The data will be published together with the publication. Please contact us if you require the access token.

The rest of the required files can be found in the data folder.

Workflow

Note

Even though snakemake workflows are executable locally, we do not advise this due to high memory usage and runtime of some of the processing steps. We ran the workflow on a TORQUE cluster system (powered by bwHPC) and later on a de.NBI cloud instance.

For more information how to run the workflow using your prefered cluster setup, please refer to the snakemake documentation .

To run the provided snakemake workflow, follow the example for pseudomonas_aeroginosa below:

1. Setup the workflow folder and download the workflow:

mkdir pseudomonas_aeroginosa; cd pseudomonas_aeroginosa;
wget https://github.com/RickGelhausen/RiboReport/archive/2021.tar.gz;
tar -xzf 2021.tar.gz; mv RiboReport-2021 RiboReport; rm 2021.tar.gz;

2. Required tools

All required tools are downloaded automatically via conda and docker.

3. Fetch the annotation and genome files:

cp RiboReport/data/pseudomonas_aeroginosa/annotation.gff . ;
cp RiboReport/data/pseudomonas_aeroginosa/genome.fa . ;

4. Fetch the config.yaml and samples.tsv files:

cp RiboReport/data/pseudomonas_aeroginosa/config.yaml RiboReport/config.yaml ;
cp RiboReport/data/pseudomonas_aeroginosa/samples.tsv RiboReport/samples.tsv ;

5. Retrieve the sequencing data:

In order to download the fastq files, we used the SRA Toolkit .

We run fasterq-dump executable to download the according .fastq files, by providing the appropriate SRR ID .

If you do not have the SRA Toolkit , we suggest creating a conda environment:

conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz
conda activate sra-tools

(use source activate, if conda is not set-up for your bash)

For simplicity, we already collected the required fasterq-dump calls in a file.

cp RiboReport/data/pseudomonas_aeroginosa/download.sh .
bash download.sh

This will download all required fastq files into a fastq folder.

6. Run the snakemake workflow:

In order to run snakemake , the creation of a conda environment is required.

Once miniconda3 is installed. Create a snakemake environment:

conda create -n snakemake -c conda-forge -c bioconda snakemake singularity
conda activate snakemake

Next, we will show how we executed the RiboReport pipeline in both our bwHPC cluster system and on our de.NBI Cloud.

On the cloud or locally (tested on ubuntu)

If you have set up everything correctly, simply run:

snakemake -p -k --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 4 --latency-wait 60

You can also run it using nohup, as it might take some time to finish:

nohup snakemake -p -k --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 4 --latency-wait 60 &

Depending on your available resources you can also remove --greediness 0 and increase -j, to your preferences in order to boost performance.

On the cluster

Then you can copy and complete one of the provided submission scripts, or create your own.

cp RiboReport/torque.sh .

Example for torque.sh :

#!/bin/bash
#PBS -N benchmark
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <file path>/benchmark
#PBS -l nodes=1:ppn=1
#PBS -o <file path>/benchmark
#PBS -j oe
cd <file path>/benchmark
export PATH="<file path>/miniconda3/bin/:$PATH"
source activate snakemake
snakemake --latency-wait 600 --use-conda -s RiboReport/Snakefile --configfile RiboReport/config.yaml --directory ${PWD} -j 20 --cluster-config RiboReport/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <file path>/benchmark -l {cluster.resources} -o {cluster.logoutputdir} -j oe"

All file path statements have to be replaced by the path to your benchmark folder.

Please note that these scripts might need some extra changes depending on your cluster system. If your cluster system does not run SGE or TORQUE, these files will most likely not work at all. In the case they run SGE or TORQUE, there might be slightly different definitions for the resource statements (here #PBS -l nodes=1:ppn=1 ). This is then also the case for the configuration files sge.yaml and torque.yaml . Please check the snakemake documentation .

Excluding a prediction tool:

Currently, to remove a tool from the analysis simply remove the line from the input and shell part of the mergeConditions rule. You can find it in under rules/postprocessing.smk .

To remove IRSOM change:

rule mergeConditions:
 input:
 ribotish="tracks/{condition}.ribotish.gff",
 reparation="tracks/{condition}.reparation.gff",
 deepribo="tracks/{condition}.deepribo.gff",
 irsom="tracks/{condition}.irsom.gff",
 spectre="tracks/{condition}.spectre.gff",
 smorfer="tracks/{condition}.smorfer.gff",
 ribotricer="tracks/{condition}.ribotricer.gff",
 price="tracks/{condition}.price.gff"
 output:
 "tracks/{condition, [a-zA-Z]+}.merged.gff"
 conda:
 "../envs/bedtools.yaml"
 threads: 1
 shell:
 """
 mkdir -p tracks;
 cat {input.spectre} > {output}.unsorted;
 cat {input.ribotish} >> {output}.unsorted;
 cat {input.reparation} >> {output}.unsorted;
 cat {input.deepribo} >> {output}.unsorted;
 cat {input.irsom} >> {output}.unsorted;
 cat {input.smorfer} >> {output}.unsorted;
 cat {input.ribotricer} >> {output}.unsorted;
 cat {input.price} >> {output}.unsorted;
 bedtools sort -i {output}.unsorted > {output};
 """

to

rule mergeConditions:
 input:
 ribotish="tracks/{condition}.ribotish.gff",
 reparation="tracks/{condition}.reparation.gff",
 deepribo="tracks/{condition}.deepribo.gff",
 spectre="tracks/{condition}.spectre.gff",
 smorfer="tracks/{condition}.smorfer.gff",
 ribotricer="tracks/{condition}.ribotricer.gff",
 price="tracks/{condition}.price.gff"
 output:
 "tracks/{condition, [a-zA-Z]+}.merged.gff"
 conda:
 "../envs/bedtools.yaml"
 threads: 1
 shell:
 """
 mkdir -p tracks;
 cat {input.spectre} > {output}.unsorted;
 cat {input.ribotish} >> {output}.unsorted;
 cat {input.reparation} >> {output}.unsorted;
 cat {input.deepribo} >> {output}.unsorted;
 cat {input.smorfer} >> {output}.unsorted;
 cat {input.ribotricer} >> {output}.unsorted;
 cat {input.price} >> {output}.unsorted;
 bedtools sort -i {output}.unsorted > {output};
 """

smORFer output

As smORFer is modular and has different optional steps it is hard to add all combinations to the snakemake pipeline.

Some of these steps even require the creation of a calibrated bam file, which involves manual checking of offsets.

Rules for all the steps involving Ribo-seq libraries are included in rules/smorfer.smk . To use the steps involving manual work, simply add the calibrated bam file to a directory called calibrated_bam/method-condition-replicate_calibrated.bam .

By default, the filtering steps involving manual work are skipped. To add them the input/output of the rules in rules/smorfer.smk have to be updated accordingly.

Extract operon regions from the annotation:

This describes how we extracted the operons from the available annotation:

conda create -n operons -c conda-forge -c bioconda interlap
conda activate operons
./RiboReport/scripts/operons.py --in_gff_filepath RiboReport/data/pseudomonas_aeroginosa/annotation.gff > RiboReport/data/pseudomonas_aeroginosa/operons.gff ;

Visualisation and plotting of the Results

All data can be visualised using the following scripts inside the evaluation folder. Further, an evaluation_call.sh script, containing all calls for the plotting pipeline, is provided. This bash script only works if all python scripts are positioned in the same folder and the input gtf-files from the data folder are stored in a "data"-folder in the same location.

Dependencies

  • numpy =1.16.3

  • matplotlib =3.0.3

  • seaborn =0.9.0

  • pandas =0.24.2

  • simple_venn =0.1.0

  • bedtools =v2.28.0

statistics_pos_neg.py

Parameters:

  • reference_pos_data (-p) path to the .gtf file containing all genes of the investigated genome or the investigated subset of genes

  • reference_neg_data (-n) path to the .gtf file containing all genes of the investigated genome or the investigated subset of genes

  • tool_data (-t) path to gtf file containing all predictions for the tools: deepribo, ribotish, reparation and irsom

  • save_path (-o) path to a directory, where the result tables will be stored

  • overlap_cutoff (-c) allowed sequence overlap cutoff between gene and prediction

  • flag_subopt (-s) allowed allow suboptimals in statistical calculation. Set to 1 to enable flag. If not 0

  • flag_no_gene (-g) add predictions which do not overlap with any genen for the given cutoff. Set to 1 to enable flag. If not 0.

Output:

  • df_stat.csv -> main result table storing all computed statistical measurements. Needed for the bar plots.

  • SetX_labels_pos_predictions_overlap_fp.gtf -> all prdictions which do not overlap with the positive or the negative gene set. Therefore, predictions outside of the annotation borders for a given overlap (temporary result).

  • SetX_labels_pos_predictions_overlap_neg.gtf -> all genes of the negative set overlapping with a given percent with any prediction (temporary result).

  • SetX_labels_pos_predictions_overlap_pos.gtf -> all genes of the positive set overlapping with a given percent with any prediction (temporary result).

  • SetX_labels_pos_predictions_overlap_temp.gtf -> all prdictions which do not overlap with the positive gene set (temporary result).

  • df_venn_genes.csv -> table containing a column for each tool listing the number of genes counted as true positive (TP). Needed for the overlap plots.

  • toolX_score_list -> For each prediction, correct genen prediction and not predicted gene a tuple of: (score, overlap, label) is saved as a python object serialization (pickle.dump). Needed to compute the ROC and PRC.

This script generates several statistical measurements for a reference and tool prediction .gtf file. First true positives (TP), false positives (FP) and false negatives (FN) are predicted. One prediction will be associated with one gene and counted as one true positve. The association selection is based on the lowest p-value (0.05). All genes fulfilling the overlap cutoff will be counted as suboptimals and not as false positives. False positives are all predictions, where no gene fulfilling the overlap cutoff could be found and the false negatives vice versa. Based on this computations the recall, FNR, precision, FDR and F1 measure are calculated.

plot_barplots.py

Parameters:

  • input1_df (-i1) df_stat.csv of the statistics.py script for the first overlap cutoff.

  • input2_df (-i2) df_stat.csv of the statistics.py script for the second overlap cutoff.

  • save_path (-o) path to directory where the plots will be stored.

Output:

  • bar_FNR.pdf -> barplot of FNR measures for different tools and two cutoff conditions.

  • bar_recall.pdf -> barplot of recall measures for different tools and two cutoff conditions.

  • bar_precision.pdf -> barplot of precision measures for different tools and two cutoff conditions.

  • bar_FDR.pdf -> barplot of FDR measures for different tools and two cutoff conditions.

  • bar_F1.pdf -> barplot of F1 measures for different tools and two cutoff conditions.

This script generates barplots of statistical measures for the different tools and two overlap cutoff conditions. The output barplots consist of FNR, recall, precision, FDR and F1 measures. It is based on the statistics.py statistical output table.

venn_diagram.py

Parameters:

  • input_df (-i) .cvs containing the list of true positive genes for the used tools.

  • save_path (-o) path where the Venn diagramm will be saved.

  • name_folder (-n) name of the result folder of the investigated set. It will be used in the file name to make it unique.

  • coverage_percent (-c) percentage of overlap cutoff that was used to determine the true positves. It will be used in the file name to make it unique.

Output:

  • venn_diagram.pdf -> a 4 Venn diagram highlighting overlapping predictions of the tool.

This script will generate a 4 Venn diagram for the overlap of true positive predicted genes of the 4 investigated tools.

prc_plotting.py

Parameters:

  • input_path: path list of triplets (score, overlap, label) for each tool.

  • species_label: name of the bacterial datasets.

  • save_path: path to save data to.

  • percent_overlap: how much of the prediction should overlap with the gene and vice versa.

  • experiment: name of the subset.

Output:

  • datesetX_prc_overlapScore_subset_lables_.pdf

This scripts generates Precision-Recall-Curves (PRC) for all different tools and overlaps.

roc_plotting.py

Parameters:

  • input_path: path list of triplets (score, overlap, label) for each tool.

  • save_path: path to save data to.

  • percent_overlap: how much of the prediction should overlap with the gene and vice versa.

  • experiment: name of the subset.

Output:

  • datesetX_roc_overlapScore_.pdf

This scripts generates Receiver-Operator-Curves (ROC) for all different tools and overlaps.

Code Snippets

10
11
shell:
    "mkdir -p ribotish; RiboReport/scripts/createRiboTISHannotation.py -a {input.annotation} --genome_sizes {input.sizes} --annotation_output {output}"
21
22
shell:
    "mkdir -p ribotish; gff3ToGenePred {input} {output}"
32
33
shell:
    "mkdir -p ribotish; genePredToBed {input} {output}"
43
44
shell:
    "mkdir -p qc/all; RiboReport/scripts/annotation_featurecount.py -a {input.annotation} -o {output};"
54
55
shell:
    "mkdir -p auxiliary; RiboReport/scripts/enrich_annotation.py -a {input.annotation} -o {output}"
65
66
67
68
69
shell:
    """
    mkdir -p auxiliary;
    awk -F'\\t' '/^[^#]/ {{printf "%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s\\tID=uid%s;\\n", $1, $2, $3, $4, $5, $6, $7, $8, NR-1}}' {input} > {output}
    """
83
84
shell:
    "mkdir -p transcripts; cufflinks {input.bam} -p {threads} -o ./transcripts/{wildcards.condition}-{wildcards.replicate}/ -g {input.annotation}"
18
19
run:
    shell("mkdir -p deepribo; mv {input} deepribo/DeepRibo_model_v1.pt")
31
32
shell:
    "mkdir -p coverage_deepribo; RiboReport/scripts/coverage_deepribo.py --alignment_file {input.bam} --output_file_prefix coverage_deepribo/{wildcards.condition}-{wildcards.replicate}"
44
45
46
47
48
49
shell:
    """
    mkdir -p coverage_deepribo
    bedtools genomecov -bg -ibam {input.bam} -strand + > {output.covfwd}
    bedtools genomecov -bg -ibam {input.bam} -strand - > {output.covrev}
    """
64
65
66
67
68
69
shell:
    """
    mkdir -p deepribo/{wildcards.condition}-{wildcards.replicate}/0/;
    mkdir -p deepribo/{wildcards.condition}-{wildcards.replicate}/1/;
    DataParser.py {input.covS} {input.covAS} {input.asiteS} {input.asiteAS} {input.genome} deepribo/{wildcards.condition}-{wildcards.replicate} -g {input.annotation}
    """
79
80
shell:
    "mkdir -p deepribo; Rscript RiboReport/scripts/parameter_estimation.R -f {input} -o {output}"
95
96
97
98
99
shell:
    """
    mkdir -p deepribo;
    DeepRibo.py predict deepribo/ --pred_data {wildcards.condition}-{wildcards.replicate}/ -r {params.rpkm} -c {params.cov} --model {input.model} --dest {output} --num_workers {threads}
    """
109
110
shell:
    "mkdir -p tracks; RiboReport/scripts/deepriboGFF.py -c {wildcards.condition}  -i {input} -o {output}"
120
121
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
 9
10
shell:
    "samtools faidx {rules.retrieveGenome.output}"
21
22
shell:
    "mkdir -p genomes; cut -f1,2 {input[0]} > genomes/sizes.genome"
35
36
shell:
    "samtools index -@ {threads} maplink/{params.prefix}"
10
11
shell:
    "mkdir -p transcripts; RiboReport/scripts/transcriptsToBed.py -i {input} -o {output}"
22
23
shell:
    "mkdir -p transcripts; bedtools getfasta -fi {input.genome} -name -bed {input.transcripts} -fo {output}"
38
39
40
41
42
43
shell:
    """
    mkdir -p irsom/model/Escherichia_coli/
    mv forge.ibisc.univ-evry.fr/lplaton/IRSOM/raw/master/model/Escherichia_coli/SLSOM irsom/model/Escherichia_coli/
    mv forge.ibisc.univ-evry.fr/lplaton/IRSOM/raw/master/model/Escherichia_coli/SOM irsom/model/Escherichia_coli/
    """
50
51
shell:
    "mkdir -p irsom; mv {input} irsom/Featurer; chmod +x irsom/Featurer;"
67
68
69
70
71
shell:
    """
    mkdir -p irsom;
    predict.py --featurer={input.featurer} --file={input.transcripts} --model=irsom/model/Escherichia_coli/ --output=irsom/{wildcards.condition}-{wildcards.replicate}/
    """
81
82
shell:
    "mkdir -p tracks; RiboReport/scripts/irsomGFF.py -c {wildcards.condition}  -i {input} -o {output}"
92
93
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
10
11
shell:
    "mkdir -p maplink; ln -s {params.inlink} {params.outlink}"
22
23
shell:
    "mkdir -p maplink/RIBO/; ln -s {params.inlink} {params.outlink}"
34
35
shell:
    "mkdir -p maplink/RIBO/; ln -s {params.inlink} {params.outlink}"
46
47
shell:
    "mkdir -p maplink/RNA/; ln -s {params.inlink} {params.outlink}"
58
59
shell:
    "mkdir -p maplink/RNA/; ln -s {params.inlink} {params.outlink}"
10
11
shell:
    "mkdir -p maplink/TIS/; ln -s {params.inlink} {params.outlink}"
22
23
shell:
    "mkdir -p maplink/TIS/; ln -s {params.inlink} {params.outlink}"
34
35
shell:
    "mkdir -p maplink/RNATIS/; ln -s {params.inlink} {params.outlink}"
46
47
shell:
    "mkdir -p maplink/RNATIS/; ln -s {params.inlink} {params.outlink}"
11
12
shell:
    "mkdir -p genomeSegemehlIndex; echo \"Computing Segemehl index\"; segemehl.x --threads {threads} -x {output.index} -d {input.genome} 2> {log};"
28
29
shell:
    "mkdir -p sammulti; segemehl.x -e -d {input.genome} -i {input.genomeSegemehlIndex} -q {input.fastq} --threads {threads} -o {output.sammulti} 2> {log}"
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
shell:
    """
    set +e
    mkdir -p sam
    awk '$2 == "4"' {input.sammulti} > {input.sammulti}.unmapped
    gawk -i inplace '$2 != "4"' {input.sammulti}
    samtools view -H <(cat {input.sammulti}) | grep '@HD' > {output.sam}
    samtools view -H <(cat {input.sammulti}) | grep '@SQ' | sort -t$'\t' -k1,1 -k2,2V >> {output.sam}
    samtools view -H <(cat {input.sammulti}) | grep '@RG' >> {output.sam}
    samtools view -H <(cat {input.sammulti}) | grep '@PG' >> {output.sam}
    cat {input.sammulti} |grep -v '^@' | grep -w 'NH:i:1' >> {output.sam}
    exitcode=$?
    if [ $exitcode -eq 1 ]
    then
        exit 1
    else
        exit 0
    fi
    """
68
shell: "if [ \"{params.method}\" == \"NOTSET\" ]; then RiboReport/scripts/samstrandinverter.py --sam_in_filepath={input.sam} --sam_out_filepath={output.sam}; else cp {input.sam} {output.sam}; fi"
78
79
shell:
    "mkdir -p bammulti; samtools view -@ {threads} -bh {input.sam} | samtools sort -@ {threads} -o {output} -O bam"
89
90
shell:
    "mkdir -p rRNAbam; samtools view -@ {threads} -bh {input.sam} | samtools sort -@ {threads} -o {output} -O bam"
17
18
19
20
21
22
23
24
25
26
27
28
29
shell:
    """
    mkdir -p tracks;
    cat {input.spectre} > {output}.unsorted;
    cat {input.ribotish} >> {output}.unsorted;
    cat {input.reparation} >> {output}.unsorted;
    cat {input.deepribo} >> {output}.unsorted;
    cat {input.irsom} >> {output}.unsorted;
    cat {input.smorfer} >> {output}.unsorted;
    cat {input.ribotricer} >> {output}.unsorted;
    cat {input.price} >> {output}.unsorted;
    bedtools sort -i {output}.unsorted > {output};
    """
39
40
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input.mergedGff} -o {output}"
50
51
shell:
    "mkdir -p tracks; RiboReport/scripts/allToUniqueGTF.py -i {input} -o {output}"
7
8
shell:
    "mkdir -p genomes; mv genome.fa genomes/"
16
17
shell:
    "mkdir -p annotation; mv annotation.gff annotation/"
25
26
shell:
    "mkdir -p annotation; RiboReport/scripts/gtf2gff3.py -a {input} -o {output}"
34
35
shell:
    "mkdir -p annotation; RiboReport/scripts/gff2gtf.py -i {input} -o {output}"
10
11
12
13
14
shell:
    """
    mkdir -p price;
    gedi -e IndexGenome -s {input.genome} -a {input.annotation} -nobowtie -nokallisto -nostar -f price/ -o {output} -n price_genome
    """
27
28
29
30
31
shell:
    """
    mkdir -p price;
    gedi -e Price -reads {input.bam} -genomic {input.genomic} -prefix price/{wildcards.condition}-{wildcards.replicate}/results
    """
41
42
43
44
45
shell:
    """
    mkdir -p price;
    gedi -e ViewCIT -m bed -name 'd.getType()' {input.cit} > {output.bed}
    """
57
58
shell:
    "mkdir -p tracks; RiboReport/scripts/priceGFF.py -c {wildcards.condition}  -i {input.filtered} -u {input.unfiltered} -o {output.unfiltered} -f {output.filtered}"
70
71
72
73
74
75
shell:
    """
    mkdir -p tracks;
    RiboReport/scripts/concatGFF.py {input.unfiltered} -o {output.unfiltered}
    RiboReport/scripts/concatGFF.py {input.filtered} -o {output.filtered}
    """
13
14
shell:
    "mkdir -p qc/4unique; fastqc -o qc/4unique -t {threads} -f sam_mapped {input.sam}; mv qc/4unique/{params.prefix}_fastqc.html {output.html}; mv qc/4unique/{params.prefix}_fastqc.zip {output.zip}"
28
29
shell:
    "mkdir -p qc/3mapped; fastqc -o qc/3mapped -t {threads} -f sam_mapped {input.sam}; mv qc/3mapped/{params.prefix}_fastqc.html {output.html}; mv qc/3mapped/{params.prefix}_fastqc.zip {output.zip}"
43
44
shell:
    "mkdir -p qc/1raw; fastqc -o qc/1raw -t {threads} {input.fastq}; mv qc/1raw/{params.prefix}_fastqc.html {output.html}; mv qc/1raw/{params.prefix}_fastqc.zip {output.zip}"
58
59
shell:
    "mkdir -p qc/2trimmed; fastqc -o qc/2trimmed -t {threads} {input}; mv qc/2trimmed/{params.prefix}_fastqc.html {output.html}; mv qc/2trimmed/{params.prefix}_fastqc.zip {output.zip}"
73
74
shell:
    "mkdir -p qc/5removedrRNA; fastqc -o qc/5removedrRNA -t {threads} {input}; mv qc/5removedrRNA/{params.prefix}_fastqc.html {output.html}; mv qc/5removedrRNA/{params.prefix}_fastqc.zip {output.zip}"
87
88
89
90
91
92
93
94
95
96
97
shell:
    """
    mkdir -p qc/all;
    column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq)
    if [[ " ${{column3[@]}} " =~ "gene" ]];
    then
        featureCounts -T {threads} -t gene -g ID -a {input.annotation} -o {output.txt} {input.bam};
    else
        touch {output.txt};
    fi
    """
108
109
110
111
112
113
114
115
116
117
118
shell:
    """
    mkdir -p qc/trnainall;
    column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq)
    if [[ " ${{column3[@]}} " =~ "tRNA" ]];
    then
        featureCounts -T {threads} -t tRNA -g ID -a {input.annotation} -o {output.txt} {input.bam};
    else
        touch {output.txt};
    fi
    """
129
130
131
132
133
134
135
136
137
138
139
shell:
    """
    mkdir -p qc/rrnainall;
    column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq)
    if [[ " ${{column3[@]}} " =~ "rRNA" ]];
    then
        featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam};
    else
        touch {output.txt};
    fi
    """
150
151
152
153
154
155
156
157
158
159
160
shell:
    """
    mkdir -p qc/rrnainallaligned;
    column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq)
    if [[ " ${{column3[@]}} " =~ "rRNA" ]];
    then
        featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam};
    else
        touch {output.txt};
    fi
    """
171
172
173
174
175
176
177
178
179
180
181
shell:
    """
    mkdir -p qc/rrnainuniquelyaligned;
    column3=$(cut -f3 auxiliary/unambigous_annotation.gff | sort | uniq)
    if [[ " ${{column3[@]}} " =~ "rRNA" ]];
    then
        featureCounts -T {threads} -t rRNA -g ID -a {input.annotation} -o {output.txt} {input.bam};
    else
        touch {output.txt};
    fi
    """
191
192
shell:
    "mkdir -p coverage; bedtools genomecov -ibam {input} -bg > {output}"
215
216
shell:
    "export LC_ALL=en_US.utf8; export LANG=en_US.utf8; multiqc -f -d --exclude picard --exclude gatk -z -o {params.dir} qc/3mapped qc/1raw qc/2trimmed qc/5removedrRNA qc/4unique qc/all qc/trnainall qc/rrnainallaligned qc/rrnainuniquelyaligned qc/rrnainall trimmed  2> {log}"
11
12
13
14
15
16
17
18
19
20
21
22
23
24
shell:
    """
    mkdir -p auxiliary
    UNIQUE="$(cut -f3 {input.annotation} | sort | uniq)"
    IDENTIFIER="ID"
    LINE="$(sed '3q;d' {input.annotation})"
    if [[ $LINE == *"gene_id="* ]]; then IDENTIFIER="gene_id"; fi;
    for f in ${{UNIQUE}}
    do
        featureCounts -F GTF -s 1 -g $IDENTIFIER -O -t $f -M --fraction -a {input.annotation} {input.bam} -T {threads} -o auxiliary/annotation_total_reads.raw.tmp
        cat auxiliary/annotation_total_reads.raw.tmp | sed 1,2d | awk -v var=$f -FS'\\t' '{{print $0"\\t"var}}' >> {output}
        rm auxiliary/annotation_total_reads.raw.tmp
    done
    """
35
36
37
38
shell:
    """
    mkdir -p auxiliary; RiboReport/scripts/map_reads_to_annotation.py -i {input.reads} -a {input.annotation} -o {output}
    """
50
51
shell:
    "mkdir -p auxiliary; RiboReport/scripts/total_mapped_reads.py -b {input.bam} -m {output.mapped} -l {output.length}"
65
66
shell:
    "mkdir -p auxiliary; RiboReport/scripts/generate_output_annotation.py -t {input.total} -r {input.reads} -g {input.genome} -o {output.xlsx} --simple {output.annotation} --complete {output.complete}"
10
11
12
run:
    outputName = os.path.basename(input[0])
    shell("mkdir -p uniprotDB; mv {input} uniprotDB/{outputName}; gunzip uniprotDB/{outputName}")
34
35
shell:
    "mkdir -p reparation; if [ uniprotDB/uniprot_sprot.fasta.bak does not exist ]; then cp -p uniprotDB/uniprot_sprot.fasta uniprotDB/uniprot_sprot.fasta.bak; fi; mkdir -p {params.prefix}/tmp; reparation.pl -bam {input.bam} -g {input.genome} -gtf {input.gtf} -db {input.db} -out {params.prefix} -threads {threads}; if [ uniprotDB/uniprot_sprot.fasta does not exist ]; then cp -p uniprotDB/uniprot_sprot.fasta.bak uniprotDB/uniprot_sprot.fasta; fi;"
45
46
shell:
    "mkdir -p tracks; RiboReport/scripts/reparationGFF.py -c {wildcards.condition}  -r {wildcards.replicate} -i {input} -o {output}"
56
57
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
19
20
shell:
    "mkdir -p ribotish; ribotish quality -v -p {threads} -b {input.fp} -g {input.annotation} -o {params.reporttxt} -f {params.reportpdf} 2> {log} || true; if grep -q \"offdict = {{'m0': {{}}}}\" {params.offsetparameters}; then mv {params.offsetparameters} {params.offsetparameters}.unused; fi; touch {output.offsetdone}"
43
44
shell:
    "mkdir -p ribotish; ribotish predict -v {params.codons} -p {threads} -b {params.fplist} -g {input.annotation} -f {input.genome} -o {output.filtered} 2> {log}"
54
55
shell:
    "mkdir -p tracks; RiboReport/scripts/ribotishGFF.py -c {wildcards.condition} -i {input} -o {output}"
16
17
18
19
20
shell:
    """
    mkdir -p ribotricer;
    ribotricer prepare-orfs --gtf {input.annotation} --fasta {input.genome} --prefix ribotricer/ribotricer --min_orf_length 6 --start_codons ATG,GTG,TTG
    """
33
34
35
36
37
shell:
    """
    mkdir -p ribotricer/{wildcards.condition}-{wildcards.replicate};
    ribotricer learn-cutoff --ribo_bams {input.bam_ribo} --rna_bams {input.bam_rna}  --prefix ribotricer/ribotricer --ribotricer_index {input.index} | tail -n 1 | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'> {output}
    """
50
51
52
53
54
shell:
    """
    mkdir -p ribotricer/{wildcards.condition}-{wildcards.replicate}
    ribotricer detect-orfs --bam {input.bam_ribo} --ribotricer_index {input.index} --prefix ribotricer/{wildcards.condition}-{wildcards.replicate}/ribotricer --phase_score_cutoff {params.cutoff}
    """
64
65
shell:
    "mkdir -p tracks; RiboReport/scripts/ribotricerGFF.py -c {wildcards.condition}  -i {input} -o {output}"
75
76
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
 9
10
11
12
shell:
    """
    mkdir -p annotation; awk -F'\\t' '$3 == "rRNA" || $3 == "tRNA"' {input.annotation} | awk -F'\\t' '{{print $1 FS $4 FS $5 FS "." FS "." FS $7}}' > {output.annotation}
    """
23
24
shell:
    "bedtools intersect -v -a {input.mapuniq} -b {input.annotation} > {output.bam}"
20
21
22
23
24
shell:
    """
    mkdir -p smorfer;
    smorfer.sh -s putative_orfs.sh -g {input.genome} -c {params.genome_name} -n smorfer_pORFs -i 9-150 -o smorfer/
    """
35
36
37
38
39
shell:
    """
    mkdir -p smorfer;
    smorfer.sh -s FT_GCcontent.sh -g {input.genome} -o smorfer/ -b {input.bed}
    """
50
51
52
53
54
shell:
    """
    mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate};
    smorfer.sh -s count_RPF.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/
    """
65
66
67
68
69
shell:
    """
    mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate};
    smorfer.sh -s FT_RPF.sh high_expressed_ORFs.bed calibrated_RiboSeq.bam -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/
    """
80
81
82
83
84
shell:
    """
    mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate};
    smorfer.sh -s find_best_start.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}/
    """
 96
 97
 98
 99
100
shell:
    """
    mkdir -p smorfer/{wildcards.condition}-{wildcards.replicate}_noFT;
    smorfer.sh -s count_RPF.sh -b {input.bed} -a {input.bam} -o smorfer/{wildcards.condition}-{wildcards.replicate}_noFT/
    """
111
112
shell:
    "mkdir -p tracks; RiboReport/scripts/smorferGFF.py -c {wildcards.condition}  -i {input} -o {output}"
122
123
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
14
15
16
17
18
shell:
    """
    mkdir -p spectre;
    SPECtre.py --input {input.bam} --output {output.results} --log {output.log} --fpkm {input.fpkm} --gtf {input.gtf} --nt {threads}
    """
28
29
shell:
    "mkdir -p tracks; RiboReport/scripts/spectreGFF.py -c {wildcards.condition}  -i {input} -o {output}"
39
40
shell:
    "mkdir -p tracks; RiboReport/scripts/concatGFF.py {input} -o {output}"
16
17
shell:
    "mkdir -p trimlink; ln -s {params.inlink} {params.outlink};"
31
32
shell:
    "mkdir -p trimmed; cutadapt -j {threads} {params.adapter} {params.quality} {params.filtering} -o {output.fastq} {input.fastq}"
11
12
shell:
    "mkdir -p genomes; RiboReport/scripts/reverseComplement.py --input_fasta_filepath genomes/genome.fa --output_fasta_filepath genomes/genome.rev.fa"
23
24
shell:
    "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string ATG --output_gff3_filepath {output}"
35
36
shell:
    "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string GTG,TTG,CTG --output_gff3_filepath {output}"
48
49
shell:
    "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string TAG,TGA,TAA --output_gff3_filepath {output}"
60
61
shell:
    "mkdir -p tracks; RiboReport/scripts/motif2GFF3.py --input_genome_fasta_filepath {input.fwd} --input_reverse_genome_fasta_filepath {input.rev} --motif_string AAGG --output_gff3_filepath {output}"
82
83
shell:
   "mkdir -p globaltracks; mkdir -p globaltracks/raw; mkdir -p globaltracks/mil; mkdir -p globaltracks/min; RiboReport/scripts/mapping.py --mapping_style global --bam_path {input.bam} --wiggle_file_path globaltracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};"
94
95
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
106
107
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
118
119
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
130
131
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
142
143
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
154
155
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
176
177
shell:
     "mkdir -p centeredtracks; mkdir -p centeredtracks/raw; mkdir -p centeredtracks/mil; mkdir -p centeredtracks/min; RiboReport/scripts/mapping.py --mapping_style centered --bam_path {input.bam} --wiggle_file_path centeredtracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};"
188
189
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
200
201
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
212
213
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
224
225
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
236
237
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
248
249
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
270
271
shell:
    "mkdir -p fiveprimetracks; mkdir -p fiveprimetracks/raw; mkdir -p fiveprimetracks/mil; mkdir -p fiveprimetracks/min; RiboReport/scripts/mapping.py --mapping_style first_base_only --bam_path {input.bam} --wiggle_file_path fiveprimetracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};"
282
283
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
294
295
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
306
307
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
318
319
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
330
331
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
342
343
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
364
365
shell:
    "mkdir -p threeprimetracks; mkdir -p threeprimetracks/raw; mkdir -p threeprimetracks/mil; mkdir -p threeprimetracks/min; RiboReport/scripts/mapping.py --mapping_style last_base_only --bam_path {input.bam} --wiggle_file_path threeprimetracks/ --no_of_aligned_reads_file_path {input.stats} --library_name {params.prefix};"
376
377
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
388
389
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
400
401
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
412
413
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
424
425
shell:
    "wigToBigWig {input.fwd} {input.genomeSize} {output.fwd}"
436
437
shell:
    "wigToBigWig {input.rev} {input.genomeSize} {output.rev}"
447
448
shell:
    "mkdir -p tracks; multiBamSummary bins --smartLabels --bamfiles maplink/*.bam -o {output} -p {threads};"
458
459
shell:
    "mkdir -p figures; plotCorrelation -in {input.npz} --corMethod spearman --skipZeros --plotTitle \"Spearman Correlation of Read Counts\" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o {output.correlation} --outFileCorMatrix SpearmanCorr_readCounts.tab"
469
470
shell:
    "mkdir -p tracks; cat {input[0]} | grep -v '\tgene\t' > tracks/annotation-woGenes.gtf; gtf2bed < tracks/annotation-woGenes.gtf > tracks/annotation.bed"
481
482
shell:
    "mkdir -p tracks; cut -f1-6 {input[0]} > tracks/annotationNScore.bed6;  awk '{{$5=1 ; print ;}}' tracks/annotation.bed6 > tracks/annotation.bed6; bedToBigBed -type=bed6 -tab tracks/annotation.bed6 {input[1]} tracks/annotation.bb"
ShowHide 122 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/RickGelhausen/RiboReport
Name: riboreport
Version: 2021
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...