riboCleaner: rRNA Contamination Detection and Quantification Workflow for RNA-Seq Data

riboCleaner is an open-source workflow for detecting and quantifying rRNA contamination in RNA-Seq data. It has two main workflows: identify and quantify.

The identify workflow finds putative rDNA genes in a genome of interest. First, it uses barrnap to locate rDNA elements in the genome (FASTA) and bundles these elements into non-overlapping operons. It then compares these regions to the predicted gene models (GFF) and flags any gene whose coordinates overlap a predicted rDNA operon as a putative rDNA contaminant. Finally, the remaining genes are compared to the flagged rDNA genes with blastn, and any that pass a sequence identity and coverage threshold are also flagged as rDNA.
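
As a rough illustration of the homology step, here is a minimal Python sketch of the flagging criterion. The threshold values and the flag_rdna helper are hypothetical (the workflow reads its thresholds from the config), and this sketch is simplified: the actual implementation aggregates multiple HSPs per query/subject pair with a weighted percent identity, as shown in the Code Snippets section below.

PIDENT_MIN = 90.0  # hypothetical percent-identity threshold
QCOVS_MIN = 80.0   # hypothetical query-coverage threshold

def flag_rdna(blast_hits):
    """Flag queries whose hit to an rDNA model passes both thresholds.

    blast_hits: iterable of (query, pident, qcovs) tuples, e.g. parsed
    from tabular (-outfmt 6) blastn output against the rDNA models.
    """
    flagged = set()
    for query, pident, qcovs in blast_hits:
        if pident >= PIDENT_MIN and qcovs >= QCOVS_MIN:
            flagged.add(query)
    return flagged

# flag_rdna([("geneA", 98.2, 95.0), ("geneB", 75.0, 40.0)]) -> {"geneA"}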

The quantify workflow quantifies the abundance of rRNA transcripts in RNA-Seq samples. For this step, riboCleaner uses salmon to quantify transcript abundance in each sample (including the putative rDNA gene models). This lets reads derived from rRNA map to the rRNA transcripts and avoids biasing the counts of mRNA transcripts that share homology with rRNA reads. riboCleaner then quantifies the abundance of the rRNA transcripts to estimate rRNA prevalence in the RNA-Seq data and visualizes the results. As a final step, riboCleaner removes the rRNA transcripts from the salmon counts table to yield a table of mRNA abundances that can be normalized and used for downstream analysis.
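
To make the final renormalization concrete, here is a small worked example with made-up numbers showing how removing rDNA transcripts and rescaling recovers TPM values as if the rDNA had never been there (the same arithmetic as in the quantify snippets below):

import pandas as pd

# hypothetical TPM values for one sample; "rdna1" is a flagged rDNA model
counts = pd.Series({"geneA": 200000.0, "geneB": 300000.0, "rdna1": 500000.0})
rdna = {"rdna1"}

total = counts.sum()                      # TPM always sums to 1,000,000
counts = counts.drop(labels=list(rdna))   # remove the rDNA transcripts
counts = counts * (total / counts.sum())  # rescale: geneA -> 400000, geneB -> 600000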

This workflow diagram shows how the different parts of the riboCleaner standard workflow fit together. You can also read a more detailed description of the methods or see the Snakemake DAG.

[workflow diagram]

Quick Start

If you want to see the output riboCleaner produces, we have provided the report from running riboCleaner on a test dataset.

To jump right in and run riboCleaner as fast as possible:

# clone this repo
git clone https://github.com/basf/riboCleaner.git
cd riboCleaner/test_data

Then follow the instructions in the test data README to make sure everything is set up properly.

Once the workflow runs on the test data, replace the test data inputs with your own, delete the contents of the results directory, and rerun.

Otherwise, see our detailed documentation for installing and running riboCleaner.

Code Snippets

Snippet (lines 13-16): predict rDNA features in the reference genome with barrnap.
shell:
    """
    barrnap --kingdom {params.kingdom} --threads {threads} {input.reference} > {output.gff}
    """

Snippet (lines 32-35): merge nearby rDNA features into rDNA-rich regions with bedtools merge.
shell:
    """
    bedtools merge -c 1 -o count -d {params.nor_distance} -i {input.gff} | grep -v ^unmapped | awk -v c={params.nor_counts} '{{if ($4>c) print}}' > {output.nor}
    """

Snippet (lines 53-68): intersect gene models with the rDNA features and regions, keeping the configured GFF feature type.
shell:
    """
    bedtools intersect -a {input.gene_models} -b {input.barrnap} {input.nor} > {output.tmp}

    # grep will exit with a 1 if it finds no matches
    set +e
    grep {params.gff_feature} {output.tmp} > {output.rdna_models}
    exitcode=$?
    if [ $exitcode -eq 1 ]
    # no matches were found
    then
        exit 0
    else
        exit $exitcode
    fi
    """

Snippet (lines 82-110): convert the flagged gene models from GFF to BED.
run:
    # parse through the gff and extract the fields required for the BED
    # [chromosome, start, end, name]
    with open(input.gff, 'r') as IN, open(output.bed, 'w') as OUT:
        for line in IN:
            if line.startswith("#"):
                continue

            columns = line.strip().split("\t")

            chrom = columns[0]
            feature_type = columns[2]
            # GFF coordinates are 1-based inclusive; BED is 0-based half-open
            start = str(int(columns[3]) - 1)
            end = columns[4]

            # only grab the features we want
            if feature_type == params.feature:
                # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
                attrs = columns[8]

                # tokenize attrs
                tokens = {attr.split("=")[0]: attr.split("=")[1] for attr in attrs.split(";")}
                try:
                    name = tokens[params.name]
                except KeyError:
                    raise ValueError("'{name}' missing in '{line}'".format(name=params.name, line=line))

                # write the line in the BED file
                OUT.write("\t".join((chrom, start, end, name)) + "\n")

Snippet (lines 120-121): extract the flagged model sequences with bedtools getfasta.
shell:
    "bedtools getfasta -name -fi {input.reference} -bed {input.models} -s > {output.fasta}"

Snippet (lines 136-137): collapse near-identical sequences with BBMap's dedupe.sh.
shell:
    "dedupe.sh minidentity={params.identity} in={input.fasta} out={output.deduped}"

Snippet (lines 149-194): extract gene names from the GFF/GTF attributes.
run:
    import re

    # gff/gtf have different ways to specify attributes
    gff_regex = re.compile(r'(?P<key>[^=]+) *= *(?P<value>[^;]+);? *')
    gtf_regex = re.compile(r'(?P<key>[^ ]+) *"(?P<value>[^;"]+)"?;? *')

    filetype = None
    names = set()
    with open(input.models, 'r') as IN:
        # read the first line to determine if we are using gff/gtf
        first_line = IN.readline()
        if first_line:
            attrs = first_line.split("\t")[8]
            if gff_regex.match(attrs):
                attr_regex = gff_regex
                name_key = "ID"
            elif gtf_regex.match(attrs):
                attr_regex = gtf_regex
                name_key = "gene_id"
            else:
                raise ValueError("Couldn't determine the filetype as either GFF or GTF")

        # reset to the beginning of the file
        IN.seek(0)

        for line in IN:
            # get a string like "ID=Glyma.13G020400.1.Wm82.a2.v1;Name=Glyma.13G020400.1;pacid=30501300;longest=1;Parent=Glyma.13G020400.Wm82.a2.v1"
            # or 'gene_id "FBgn0267517"; gene_symbol "5.8SrRNA-Psi:CR45857"';
            # see: http://uswest.ensembl.org/info/website/upload/gff.html
            attrs = line.split("\t")[8]

            for match in attr_regex.finditer(attrs):
                if match.group("key") == name_key:
                    names.add(match.group("value"))
                    break
            else:
                raise ValueError("Couldn't detect gene name in '{line}'".format(line=line))

    # write the names 1/line to output
    with open(output.names, 'w') as OUT:
        for n in names:
            OUT.write(n + "\n")

Snippet (lines 210-214): extract full transcript sequences with gffread.
shell:
    """ 
    # -w gets the full transcript (+ UTR, etc.)
    gffread -w {output.features} -g {input.fasta} {input.gff}
    """

Snippet (lines 227-228): BLAST the transcripts against the putative rDNA models.
shell:
    "blastn -query {input.query} -subject {input.subject} -outfmt '6 qseqid sseqid qstart qend pident qcovs' -out {output.hits} || touch {output.hits}"

Snippet (lines 243-282): flag additional genes whose BLAST hits pass the coverage and weighted percent-identity thresholds.
run:
    rdna = set()
    with open(input.blast, 'r') as IN:
        current_query = ""
        current_subject = ""
        current_weight_id = 0
        current_weight_n = 0
        current_cov = 0
        for line in IN:
            query, subject, start, end, pident, qcovs = line.strip().split("\t")

            # gather the HSPs to each subject for each query and check them against our thresholds
            if query != current_query or subject != current_subject:
                # check the coverage and weighted perc id of the hits
                if current_query != "" and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
                    rdna.add(current_query)

                # reset
                current_query = query
                current_subject = subject
                current_weight_id = 0
                current_weight_n = 0
                current_cov = qcovs

            # add this hit, keeping a running weighted % identity across the HSPs in this query/subject pairing
            current_weight_id += float(pident) * abs(int(end) - int(start))
            current_weight_n += abs(int(end) - int(start))

        # evaluate the final query/subject pair, which the loop above never re-checks
        if current_weight_n and float(current_cov) >= params.qcovs and current_weight_id / current_weight_n >= params.pident:
            rdna.add(current_query)

    # subtract the known names
    barrnap_names = set()
    with open(input.barrnap_names, 'r') as IN:
        for line in IN:
            barrnap_names.add(line.strip())

    with open(output.names, 'w') as OUT:
        for n in rdna.difference(barrnap_names):
            OUT.write(n + "\n")

Snippet (lines 291-292): combine the barrnap-derived and homology-derived rDNA name lists.
shell:
    "cat {input.barrnap} {input.homology} > {output.names}"

Snippet (lines 301-302): extract the rDNA transcript sequences with BBMap's filterbyname.sh.
shell:
    "/opt/bbmap/filterbyname.sh include=t substring=name in={input.features} names={input.names} out={output.fasta}"        

Snippet (lines 316-377): summarize the identify results and emit warnings.
run:
    from collections import defaultdict
    # count non-partial rDNA predicted by barrnap
    barrnap_results = defaultdict(lambda: defaultdict(int))
    with open(input.barrnap_predictions, 'r') as IN:
        for line in IN:
            if not line.startswith("#"):
                # detect if this is partial
                partial = "partial" in line

                # quantify each rDNA type
                if "5S_rRNA" in line:
                    barrnap_results["5S"][partial] += 1
                elif "18S_rRNA" in line:
                    barrnap_results["18S"][partial] += 1
                elif "28S_rRNA" in line:
                    barrnap_results["28S"][partial] += 1
                elif "5_8S_rRNA" in line:
                    barrnap_results["5.8S"][partial] += 1
                # needs to be expanded for prokaryotes 
                else:
                    barrnap_results["other"][partial] += 1

    # count the rDNA regions
    rDNA_regions = 0
    with open(input.rdna_regions, 'r') as IN:
        for line in IN:
            rDNA_regions += 1

    # count the models added via BLAST homology
    homology_models = 0
    with open(input.homology_models, 'r') as IN:
        for line in IN:
            homology_models += 1

    # count the rDNA genes
    rDNA_gene_models = 0
    with open(input.gene_models, 'r') as IN:
        for line in IN:
            rDNA_gene_models += 1

    # output the summary
    with open(output.summary, 'w') as OUT:
        OUT.write("riboCleaner ran barrnap to detect the following counts of rDNA features:\n")

        for elem in ["18S", "5.8S", "28S", "5S", "other"]:
            OUT.write("    {elem}: {complete} complete, {partial} partial\n".format(elem=elem, complete=barrnap_results[elem][False], partial=barrnap_results[elem][True]))

        OUT.write("\n")
        OUT.write("These rDNA features were merged into {} rDNA-rich regions.\n".format(rDNA_regions))
        OUT.write("\n")
        OUT.write("The rDNA regions were found to overlap with {overlap} features with the `{tag}` tag in the GFF file.\n".format(overlap=rDNA_gene_models - homology_models, tag=params.gff_feature))
        OUT.write("Additionally, {homology} models were identified as having significant homology to these features for a total of {total} models flagged as putative rDNA.\n".format(homology=homology_models, total=rDNA_gene_models))

        OUT.write("\n")

        # begin the warning section
        if not rDNA_regions:
            OUT.write("WARNING: There were no rDNA regions detected. This may be because there are no rDNA sequences in your reference genome (detectable by barrnap) or because they do not form regular rDNA repeats. riboCleaner works best when it is able to detect rDNA regions.\n")

        if not rDNA_gene_models:
            OUT.write("WARNING: There were no rDNA gene models detected. This is often good news but means riboCleaner will detect no rDNA for this genome. However, since there are no false gene models originating from rDNA, your counts table resulting from standard transcriptome analysis won't be biased by false rDNA gene models!\n")

Snippet (lines 11-14): build a decoy-aware salmon index.
shell:
    """
    salmon index -p {threads} -t <(cat {input.features} {input.genome}) -d <(grep "^>" {input.genome} | cut -d " " -f 1 | sed 's/>//g') -i {output}
    """

Snippet (lines 30-35): quantify paired-end samples with salmon.
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -1 {input.r1} -2 {input.r2} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """

Snippet (lines 48-53): quantify single-end samples with salmon.
shell:
    """
    salmon quant --threads {threads} -i {input.index} --libType A -r {input.r1} -o {output.tmp} 2>&1 | tee {log}
    cp -v {output.tmp}/quant.sf {output.sf}
    cp -v {output.tmp}/aux_info/meta_info.json {output.meta_info}
    """

Snippet (lines 68-114): summarize per-sample rDNA abundance.
run:
    import pandas as pd
    import json

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        names = [line.strip() for line in IN]

    sample_summary = {}

    # gather the counts for each file
    all_counts = []
    for sf, meta_info, sample in zip(input.sf, input.meta_info, params.samples):
        with open(meta_info) as IN:
            info = json.load(IN)

        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
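        # use TPM so abundances are comparable across samples regardless of depth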
        counts = d["TPM"]

        counts.name = sample
        all_counts.append(counts)

        # get the sample summary info
        sample_summary[sample] = {
            'percent_unmapped': 100 - float(info['percent_mapped']),
            # scale percent rDNA to the mapped portion
            'percent_rDNA': float(info['percent_mapped']) * counts.filter(items=names).sum() / counts.sum()
            }

        # the usable percent is whatever is left over
        sample_summary[sample]['percent_mRNA'] = 100 - sample_summary[sample]['percent_unmapped'] - sample_summary[sample]['percent_rDNA']

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    # get just the rdna
    all_counts_rdna = all_counts.filter(items=names, axis=0)
    all_counts_rdna.to_csv(output.rdna_abundance, sep="\t", index_label="sample")

    # done as is above to write out more details in the future, for now, just the TPM
    summary = pd.DataFrame.from_dict(sample_summary, orient="index").round(2)

    # write sorted by sample name
    summary.sort_index().to_csv(output.summary, sep="\t", index_label="sample", float_format="%.2f")

Snippet (lines 128-162): remove rDNA transcripts and rescale the TPM table.
run:
    import pandas as pd
    import json

    # get the rDNA names
    with open(input.rdna_names, 'r') as IN:
        rdna = {line.strip() for line in IN}  # use a set for O(1) membership tests below

    sample_summary = {}

    # gather the counts for each file
    all_counts = []
    for sf, sample in zip(input.sf, params.samples):
        d = pd.read_csv(sf, sep="\t", header=0, index_col=0)
        counts = d["TPM"]

        # get the total (including rDNA)
        total = counts.sum()

        # remove rDNA
        counts = counts.filter(items=[n for n in counts.index if n not in rdna])

        # rescale counts by the rDNA ratio (to get the TPM as if the rDNA was never there)
        counts = counts * (total / counts.sum())

        counts.name = sample
        all_counts.append(counts)

    # and combine into one df
    all_counts = pd.DataFrame(all_counts).fillna(0).transpose()

    # make the index be just the gene names (not the position)
    all_counts.index = [g.split(":")[0] for g in all_counts.index]

    all_counts.to_csv(output.clean_tpm, sep="\t", index_label="gene")

Snippet (lines 13-40): plot experiment- and sample-level read composition.
run:
    import pandas as pd
    from matplotlib import pyplot as plt

    summary = pd.read_csv(input.summary, sep="\t", index_col=0)[["percent_mRNA", "percent_rDNA", "percent_unmapped"]]
    summary.columns = ["mRNA", "rDNA", "unmapped"]

    # make the experiment level plot
    # adapted from: https://www.dataforeverybody.com/matplotlib-seaborn-pie-charts/
    overall_summary = summary.mean(axis=0)
    pie, ax = plt.subplots(figsize=[10,6])
    labels = overall_summary.keys()
    plt.pie(x=overall_summary, autopct="%.2f%%", explode=[0.05]*3, labels=labels, pctdistance=0.5)
    plt.title("Experiment Overview", fontsize=14)
    pie.savefig(output.experiment_summary)
    pie.savefig(output.experiment_summary_svg)

    # make the sample level plot
    # limited to 5 samples per inch of plot (and 3 inches for overhead)
    width = 3 + int(len(summary) / 5)
    summary.plot(kind="bar", stacked=True, figsize=(width, 8))
    plt.legend(loc='lower left', bbox_to_anchor=(1.0, 0.5))
    plt.xticks(rotation=45, ha="right")
    plt.title("Sample Composition")
    plt.ylabel("% of Reads")
    plt.tight_layout()
    plt.gcf().savefig(output.sample_summary)
    plt.gcf().savefig(output.sample_summary_svg)

Snippet (lines 96-109): print the workflow docstring and rule documentation.
run:
    # print the workflow docstring
    print("Snakefile: {p}".format(p=workflow.snakefile))
    if workflow.globals["__doc__"]:
        print(workflow.globals["__doc__"])

    # print each (public) rule
    print("\n##### RULES #####\n")
    for rule in workflow.rules:
        if not rule.name.startswith("_"):
            if rule.docstring:
                print(rule.name)
                print(rule.docstring)
                print("----------\n")

Snippet (lines 143-146): downsample paired-end reads with bbduk.sh.
shell:
    """
    bbduk.sh in={input.r1} in2={input.r2} out={output.r1} out2={output.r2} reads={params.reads} threads={threads}
    """

Snippet (lines 157-160): downsample single-end reads with bbduk.sh.
shell:
    """
    bbduk.sh in={input.r1} out={output.r1} reads={params.reads} threads={threads}
    """